Ok let’s Trie T9 in Python

Ok let’s Trie T9 in Python

Photo by Tai Bui on Unsplash

Today we are going to implement a Trie optimized for T9 predictive text in Python. Let’s get right into it. The repository with the result can be found here.

What is T9?

Well, well, well, tell me who knows what is T9? You can always look into Wikipedia and find out, though I’m still going to explain it a bit. It’s an input method widely used on older phones (yep these ancient ones, with buttons). There were 9 buttons, and each button had its own set of letters assigned to it. One click — one symbol. Ok, to make things easier to understand let’s go to an example, have a look at the layout:

So if you want to enter the word “cat” you need to click:

2 2 8

Smart (not yet rebellious) machine, uses magic algorithms and understands:

  1. mmm ok, the first letter is a, b, or c.

  2. the second letter is a, b, or c as well.

  3. the third is t, u, or v.

How to solve the task if you’re a regular guy, who never reached task 200 on LeetCode? The answer is obvious, what nonsense? Let’s iterate over all possible combinations and check which ones are in our dictionary. Ok, I’m kidding, this approach is ultra-slow and mega-bad. Since you and I are SMART developers we are going to use a Trie (Prefix Tree).

Trie?

Wait, we are SMART, but what is a Trie?

It’s a data structure that previously looked to me like something undescribably advanced, and generally speaking, Google uses it in their search bar completion — I will never figure out how it works. In fact, it’s simple and ultimately trivial, and I’m not going to teach you. OK, I will, though there are cool articles online, I need some info for the context.

Let’s make a tree and write nothing to its root (null, '\0', empty, blank, my NFT collection growth potential, in general, something that doesn’t have any meaning, but is at the beginning of any word).

All children of the root — are letters that stand on the first position of the word. Their children — second position and so on. That is each layer of the tree represents a position of a word.

It didn’t get easier, so let’s insert the word “cat” into our Trie, here is the structure:

.
└── c               <-- first letter
    └── a           <-- second letter
        └── t       <-- third letter

Yep, you are right, I’ve used an online service that generates directory hierarchy, but I hope you got the point. Our tree, growing down, doesn’t have any branches, and honestly looks more like a cucumber shoot running down the stairs. Let’s keep on with our gardening. We’re inserting the word “car”.

.
└── c
    └── a
        ├── t
        └── r

Well you know, our tree still doesn’t look amazing, besides my graphics cannot convey all the beauty of nature even close. Also, let's not ignore the fact the cat is too close to the car, hopefully, that won’t lead to trouble!

Having that said, I believe it's enough puns. The cool thing is we can already use the Trie. The regular workflow looks so:

  • Hi Trie, could you tell me all words starting from “ca”?

  • Sure, software developer, it’s “cat” and “car”, here you are.

  • Thanks, hey Trie what about words starting from “ta”?

  • Such questions will get you into trouble in our community.

In fact, to make everything work we have to add one more thing. Namely, we must keep information if the node is an end of a word or not. Why is it important? Because we might want to insert the word “carrot” into our tree, despite being a root vegetable it can grow on our tree (let’s put up with the fact that cats and cars grow on our tree as well). Keeping that in mind, let’s insert the word “carrot”.

.
└── c
    └── a
        ├── t
        └── r
            └── r
                └── o
                    └── t
  • Hi Trie, could you tell me the words starting from “ca”?

  • Hello, brother, I know only “cat” and “carrot”.

Keeping the number of times we’ve inserted a word ending at the node, would be my proposal. In that case, we could get frequently used words first, it will help us a lot in the context of T9 input implementation. The final structure might look so (let’s imagine we’ve inserted words a different number of times):

.
└── c
    └── a
        ├── t (inserts: 5)
        └── r (inserts: 2)
            └── r
                └── o
                    └── t (inserts: 1)

In total, we have a cat party in the parking (Midjourney didn’t want to generate a meaningful picture for that).

Summarizing the algorithm, on how to get all words with specified prefixes (”ca” in our case):

  1. Set the current node as root.

  2. Look for a child with the letter “c”.

  3. Found? Let’s set it as the current node.

  4. Look for a child with the letter “a”.

  5. Found? We’re good so far, now let’s start DFS (Depth-first search) from here and get all nodes that are ends of words.

  6. Return sorted by descending usage frequency (or in any order, as you wish).

T9, I’ve run out of jokes

Having spent half an hour learning what is a Trie, we didn’t yet get to know how to make a smart T9 input. Or any T9, maybe not even smart, but our own, our relative, self-made. Let’s make it right, I’m going to apply some primitive wooden magic (it’s still a tree right). But we need a bigger Trie, let’s add a “ball”, “bat” and a “battery”:

.
├── c
│   └── a
│       ├── t (inserts: 5)
│       └── r (inserts: 2)
│           └── r
│               └── o
│                   └── t (inserts: 1)
└── b
    └── a
        ├── t (inserts: 1)
        │   └── t
        │       └── e
        │           └── r
        │               └── y (inserts: 2)
        └── l
            └── l (inserts: 1)

If we follow the layout I’ve attached at the very beginning we have to press 2 2 8 to get a “cat”.

Our input might look like this:

inputs = ['abc', 'abc', 'tuv']
  1. We go to the root.

  2. Choose child nodes with values a, b, and c (first string in the list).

  3. Let’s memorize these nodes, in our case we have c and b.

  4. Then from each of these nodes we look among their children we look for the letter a, b, and c (second string in the list).

  5. In our case we have ca and ba.

  6. And then, from these prefixes, we look for letters from the last group (t, u, or v).

  7. We get cat and bat.

  8. Then from these nodes we start DFS, and find all the words starting from these prefixes

  9. We get cat, bat, and battery.

Awesome? Awesome.

Python Implementation

What about inserting the code of the class right here and calling it a day? I think it’s a promising idea. Here is a TrieNode implementation I’ve come up with (of course you’ll find the complete implementation on my GitHub, how else would I be getting green squares in my stats):

class TrieNode:
    def __init__(self, value: str, parent: "TrieNode | None" = None):
        self._children: dict[str, TrieNode] = {}
        self._insertion_count = 0
        self._value = value
        self._parent = parent

    def __repr__(self):
        return f"TrieNode<{self._value}>"  # pragma: no cover

    @cached_property
    def word(self) -> str:
        return "".join(n._value for n in self._bottom_up_traversal())[::-1]

    def _bottom_up_traversal(self) -> Iterable["TrieNode"]:
        current: "TrieNode | None" = self
        while current is not None:
            yield current
            current = current._parent

    @property
    def word_nodes(self) -> Iterable["TrieNode"]:
        result = []
        dfs = [self]
        while dfs:
            node = dfs.pop()
            if node._insertion_count:
                result.append(node)

            for child in node._children.values():
                dfs.append(child)

        return result

Let me explain a little.

I use dict to store children because I want to waste all our memory (but seriously, there is a more memory-optimized approach) and support all possible letters/symbols as values of the nodes.

_insertion_count is how many times a word with an end in this node was inserted (0 means this node is not an end of any word).

I also store _parent for each node, so as to get the whole word from a node, when needed.

The property word_nodes returns a list of nodes (it’s important, we don’t return words here, only nodes) that can be retrieved from these nodes. I’ve implemented DFS using a loop approach instead of recursion.

Ok, we’re done with the node, let’s get right into my Trie implementation.

First of all, let’s make a root empty node (I mentioned it at the beginning of the article).

class Trie:
    def __init__(self):
        self._root = TrieNode("")

Cool, now we can insert a word:

def insert(self, word: str, multiplier: int = 1) -> None:
    if not isinstance(word, str):  # pragma: no cover
        raise ValueError("Can insert only single word")

    if not word:  # pragma: no cover
        return

    current = self._root
    for letter in word:
        if letter not in current._children:
            current._children[letter] = TrieNode(letter, current)

        current = current._children[letter]

    current._insertion_count += multiplier

multiplier is used to avoid inserting a word hundreds of times, if it’s used so frequently. We can just specify how often the word is used.

Now to the interesting part, let’s start simple, count how many times the word was inserted:

def count(self, word: str) -> int:
    current = self._root
    for letter in word:
        if letter not in current._children:
            return 0

        current = current._children[letter]

    return current._insertion_count
  1. We start from the root.

  2. Look for the child with the required letter.

  3. Didn’t find — we don’t have that word.

  4. We did — cool, let’s go to this node.

  5. The word has ended — we can return the insertion count.

Now, my almost favorite method -- the simple completion of a prefix. As an argument we want to pass a prefix, and obtain all the words starting from this prefix in descending usage frequency order:

def complete_simple(self, prefix: str) -> Iterable[str]:
    current = self._root
    for letter in prefix:
        if letter not in current._children:
            return []

        current = current._children[letter]

    return self._flatten_nodes([node for node in current.word_nodes])

def _flatten_nodes(self, nodes: Iterable["TrieNode"]) -> Iterable[str]:
    return (
        node.word for node in sorted(nodes, key=lambda node: -node._insertion_count)
    )

First, we get all nodes from the found prefix-node, and then we sort them and get the words from the nodes.

And the most interesting, how to make T9. Actually, this implementation is what I have in my repo:

def complete(self, prefix_groups: "list[str] | str") -> Iterable[str]:
    candidates = [self._root]
    for group in prefix_groups:
        new_candidates = []
        for each in candidates:
            for letter in group:
                if letter in each._children:
                    new_candidates.append(each._children[letter])

        candidates = new_candidates

    return self._flatten_nodes([node for c in candidates for node in c.word_nodes])

In the repo you’ll find complete implementation, tests, benchmarks, and even examples of usage with a word vocabulary.

T9 is not T9 (outro)

I’ve started to explore prefix tree not because I was an average old phone fan (I identify myself as a smartphone enjoyer) and not because I couldn’t remember more than 9 buttons with their locations. It’s just marvelous how seemingly complex tasks become trivial with a proper toolkit. This input approach can be used in various ways, e.g. we can 3 or 20 buttons instead of 9. We can use various numbers of letters assigned to different buttons, letters can repeat, and so on. I believe there is potential, and if there is not we still have learned something new. But what about this layout?

Or this?

Thanks for the reading. And in the next series, we will make it a little faster, so stay tuned!

And the final cat for the balance:

New York Family Spends Thanksgiving Trying to Rescue Beloved Cat (and  Human) from a Tree | Daily Paws