I know it's all just an accident of language, but the fact that you can rearrange the letters in "the eyes" into "they see" feels… magical.

Ever since I was a kid, I've enjoyed anagrams. So, I thought I'd try to build an anagram generator in Python. It turned into a fun little weekend project with a dash of tree traversals and recursion.

✉️

Here's what I came up with.

How To Check If Two Strings Are Anagrams

string1 is an anagram of string2 if the characters in string1 are a rearrangement of the characters in string2.

The order of the characters doesn't really matter, though. What matters is that the count of each character in string1 is the same as the count of each character in string2. If each character appears the same number of times in each string, then the two strings are anagrams of each other.

Python's collections.Counter can calculate this:

from collections import Counter

string1 = "the eyes"
string2 = "they see"

Counter(string1)
# Counter({
#     'e': 3,
#     't': 1,
#     'h': 1,
#     ' ': 1,
#     'y': 1,
#     's': 1
# })

Counter(string1) == Counter(string2)
# True ✅

See that empty string ' ' in the Counter object? That's a problem. "Theater" and "the tear" are anagrams, even though one has a space and the other doesn't. Anagrams are case-insensitive. They may also have different punctuation, as in "vitalise" and "IT'S ALIVE!"

A robust solution needs to take all this into account, so collections.Counter won't cut it without some preprocessing:

string1 = "vitalise"
string2 = "IT'S ALIVE!"

Counter(string1) == Counter(string2)
# False ❌

Like most problems, there are many ways to solve this one.

I wanted a solution that is easy to understand and easy to change in case it doesn't handle every case it should. I went with the string .translate() method, which replaces each character in a string according to a translation dictionary. Characters mapped to None are removed from the string. Plus, I can easily create translation dictionaries with str.maketrans().

This gives me fine-grained control over how the strings are processed before being passed to Counter():

from collections import Counter
from string import punctuation, whitespace
from typing import Callable

def process(string_: str) -> str:
chars_to_remove = punctuation + whitespace
translation = str.maketrans("", "", chars_to_remove)
return string_.translate(translation).lower()

def are_anagrams(
string1: str,
string2: str,
process: Callable = process
) -> bool:
processed1 = process(string1)
processed2 = process(string2)
return Counter(processed1) == Counter(processed2)

It might be better to use .casefold() instead of .lower() in the process() function. But overall, this makes for a robust solution. It's easy to understand, provided you know about Counter, .translate(), and str.maketrans(). And it's easy to swap out processing functions if needed.

But what about finding anagrams for a particular string?

How To Generate Anagrams

It's harder than checking if two strings are anagrams of each other.

Simply generating all possible rearrangements of a string isn't enough. You have to insert spaces and punctuation in places that make sense. The resulting words in the string need to be actual dictionary words.

We need a way to efficiently generate words that use letters from a string.

Solving It On Paper

Set aside the problem of obtaining a list of words and start simple:

eyes
the
they
see
sea

The process goes like this:

1. Write down the phrase you want to generate anagrams for, such as "the eyes."
2. Pick a letter from the phrase (say, "t") and scan the list for words that start with that letter (words = "the" and "they"). Cross off one instance of the letter in the phrase, and write down the letter as the first letter of an anagram (anagram = "t").
3. Pick a letter from the phrase that hasn't been crossed off yet (say, "h"), and filter the words selected in the last step to those whose second letter is the new letter you picked (words = "the" and "they"). Cross off the letter in the phrase and add it to your anagram (anagram = "th").
4. Continue this process of selecting unused letters and filtering the list of words. When you reach the end of a word, check if the phrase you've generated so far is an anagram of the original phrase. If it is, you're done. If not, start over with the complete list of words but only use letters that haven't been crossed off.
5. Repeat the whole process using different initial letters from the phrase.

If you follow this algorithm step-by-step on the small list of words above, you'll end up with four anagrams:

• the eyes
• eyes the
• they see
• see they

Of course, "the eyes" is the phrase you started with. That's fine. Just discard it.

Wouldn't it be great if, instead of scanning the entire list for all words starting with the letter T, we could jump straight to them? Or jump to all the words that start with TH? Or begin with THE?

Sounds like we need a hash table!

What we really want is a hash table of hash tables.

A hash table of hash tables is one way to represent a tree. And, if you stop and think about it, the method I've described for generating anagrams feels a lot like a depth-first search. That's a big clue for how to implement the algorithm.

And Python has all of the pieces we need.

📚
Need a refresher on algorithms?

My go-to book is Grokking Algorithms by Aditya Bhargava. It covers hash tables, trees, recursion, depth-first search, and so much more, in a friendly and approachable way.

Get instant access from Manning or pick up a copy from Amazon.

Building a Tree Of Dictionaries

The tree we're going to build will look something like this:

Python dictionaries are hash tables, so they're the natural choice for the basis of our tree. The keys for each dictionary are strings containing a single letter, and the values are more dictionaries. We'll need to know when we're at the end of a word, so we'll mark those places with the dictionary {" ": {}}.

The tree of dictionaries for the words eyes, the, they, see, and sea looks like this:

{'e': {'y': {'e': {'s': {' ': {}}}}},
's': {'e': {'a': {' ': {}}, 'e': {' ': {}}}},
't': {'h': {'e': {' ': {}, 'y': {' ': {}}}}}}

But how do you construct a dictionary like this?

The key observation is that you can add words to the dictionary recursively. Start with an empty dictionary. For each word, pop off the first letter — call it char. If char is a key in the dictionary, grab its value. Otherwise, set the key char in the dictionary with an empty dictionary for its value. Then repeat the process, using the dictionary mapped to char and the word without its first character.

Assume word has no whitespace and tack the string " " on to the end so that we get the terminal dictionaries right.

Here's the code:

from typing import Iterable

def add_word(tree: dict, word: str) -> None:
if word != "":
char = word[0]
branch = tree.get(char, {})
tree[char] = branch

def build_word_tree(words: Iterable[str]) -> dict:
word_tree = {}
for word in words:
return word_tree

Recursion is nice because it can compactly describe a repetitive process. But it can be difficult to understand what a recursive algorithm does without reasoning through the steps by hand. Documentation with some good examples can help a reader (including your future self!) understand a recursive function.

📚
Want to get better at recursion?

Al Sweigart, of Automate The Boring Stuff fame, has a new book all about recursion with examples in Python and JavaScript.

Get The Recursive Book of Recursion from No Starch Press or Amazon, or read it for free on Al's website.

Now that we've got a tree to work with, we can start generating anagrams.

Traversing The Tree To Make Anagrams

You can implement the anagram algorithm as a walk over the nodes in a word tree.

We pick an initial node — one of the characters in the phrase we want anagrams for — and mark that node as visited in an array. Then move to the branch in the tree rooted at that node. Keep repeating this until you visit all of the characters in the phrase or reach the end of a word. More recursion!

It's essentially a depth-first search algorithm, except that the nodes available to walk to are limited by the nodes in the tree and the characters in the phrase that haven't been visited yet.

This is captured in the following Python code:

from typing import Generator

def walks(
tree: dict,
filter_: Callable,
_visited: tuple = ()
) -> Generator:
if tree == {}:
yield _visited
else:
for node in filter_(tree, _visited):
yield from walks(
tree[node], filter_, (*_visited, node)
)

The filter_() function should return a tuple of nodes available to walk to. There's some nuance to how exactly this works for finding anagrams. We'll get to that in a bit and hold off on defining a filter function for now.

We're still not done implementing the walk for the anagram algorithm.

walks() yields tuples of nodes visited by a walk — characters from a phrase in this case — but walks end as soon as they reach the end of a word.

To produce multi-word anagrams you need to circle around and continue walking the tree from the root. I call this "walking around" the tree. You keep walking around and around, collecting words until the walk is done.

Here's the code:

def walk_around(
tree: dict,
done: Callable,
filter_: Callable,
_visited: tuple = (),
) -> Generator:
for walk in walks(tree, filter_, _visited):
if done(walk):
yield walk
else:
yield from walk_around(
tree, done, filter_, walk
)

We need to implement done and filter functions to pass to walk_around(), both of which depend on the phrase we're generating anagrams for.

A walk is done when the string of characters visited in the walk is an anagram of the phrase. We already know how to do that! We can use the are_anagrams() function:

def done(phrase: str, walk: tuple) -> bool:
return are_anagrams(phrase, "".join(walk))

Now we need to define filter_().

I said there was some nuance to this.

At each step in the walk, we only need to move to a character in the phrase that we haven't visited yet. Or a punctuation character, since we want to include words like contractions. Oh, and we can also move to a " " character, since that is how we know we're at the end of a word.

We can express these rules cleanly using sets:

def filter_(
phrase: str,
tree: dict,
_visited: tuple
) -> tuple:
remaining = set(Counter(phrase) - Counter(_visited))
_punctuation = set(punctuation)
end_of_word = {" "}
allowed = remaining | _punctuation | end_of_word
nodes = set(tree.keys())
anagram_nodes = nodes.intersection(allowed)
return tuple(anagram_nodes)

Counter makes it easy to calculate the characters in the phrase that haven't been visited yet.

Subtracting a Counter from another returns a Counter with the counts subtracted. Characters whose count is zero are removed. For instance, Counter("tea") - Counter("e") returns Counter({'t': 1, 'a': 1}).

We have all the pieces in place to write an anagram generator:

from functools import partial

def anagrams(phrase: str, words: Iterable) -> Generator:
tree = build_word_tree(words)
_done = partial(done, phrase)
_filter = partial(filter_, phrase)
for walk in walk_around(tree, _done, _filter):
anagram = "".join(walk).rstrip()
if anagram != phrase:
yield anagram

functools.partial() is used here to "fill in" the phrase parameters of the done() and filter_() functions. We have to do this because the done and filter_ parameters of walk_around() expect functions without a phrase parameter. Whitespace in the anagram is removed with .rstrip() since walks produced by walk_around() end with a space.

Let's take anagrams() for a spin on a small list of words:

words = ["eyes", "the", "they", "see", "sea"]
phrase = "the eyes"

for anagram in anagrams(phrase, words):
print(anagram.rstrip())

# eyes the
# they see
# see they

Hey, it works!

Testing On a Large Set Of Words

If your computer has a Unix-like OS, you have a large list of words available right now.

On most machines, there's a words file located at /usr/share/dict/words or /usr/dict/words. It's a newline delimited list of dictionary words. If you have a Windows machine or don't have a words file, you can find a tarball for the file in your language on the GNU website.

Let's try it out:

from pathlib import Path

words_path = Path("/usr/share/dict/words")
# The words file ends with a blank line so we
# need to remove trailing whitespace
words = words.lower().strip().split("\n")
phrase = "tea"

for anagram in anagrams(phrase, words):
print(anagram)

# a t e
# a te
# a e t
# at e
# ate
# ae t
# t a e
# ...

It's working great, but the words file includes single letters and two-letter words like "te." That adds a bunch of noise to our anagrams. We should probably clean it up.

Let's filter out some words we know that we don't want:

from string import ascii_lowercase

SINGLE_LETTERS = set(ascii_lowercase) - {"a", "i"}
TWO_LETTER_WORDS = {"te", "ae", "ea"}
FORBIDDEN = SINGLE_LETTERS | TWO_LETTER_WORDS

words_path = Path("/usr/share/dict/words")
words = words.lower().strip().split("\n")
words = (word for word in words if word not in FORBIDDEN)

phrase = "tea"

for anagram in anagrams(phrase, words):
print(anagram)

# eat
# eta
# ate
# tae

This definitely gives us better results. For a better anagram generator, you'll need to spend some time curating words. And the words file isn't necessarily the best set of words to start with, since it doesn't contain any contractions.

You can generate more interesting word lists using aspell's SCOWL tool.

Happy anagram hunting!

Do you ever look back at some of the challenging problems you solved when you were a beginner, now that you have more experience? I did this exercise with the game Rock Paper Scissors, and now I'm sold on doing this more regularly:

Dig Deeper

Learn about several important and foundational algorithms in Aditya Bhargava's friendly and accessible book Grokking Algorithms. It's full of fun (and useful!) illustrations, and easy-to-digest explanations with example Python implementations.

Get instant access from Manning, or buy a print version from Amazon.