# Exercise: Find the anagrams for all words in a list

* You are given an English dictionary containing M words (“the dictionary”), and a separate list of N words (“the input”, saved in the file `words_to_search.txt`)
* For each word in the input, find all the anagrams in the dictionary (e.g., for input 'acme' the anagrams are `['acme', 'came', 'mace']`)

How to proceed?
1. Write an algorithm to find all anagrams for one input word first
2. What is the Big-O class of this algorithm when executed on the full N-word input?
3. Is there a way to pre-process the dictionary to improve the Big-O performance?

# 1. Load the system dictionary and the input words

In [7]:
# Load the system dictionary
with open('/usr/share/dict/words', 'r') as f:
    dict_words = [w.strip() for w in f.readlines()]

In [8]:
# Print the start and end of the dictionary
dict_words[:5] + ['...'] + dict_words[-5:]

['A',
 'a',
 'aa',
 'aal',
 'aalii',
 '...',
 'zythem',
 'Zythia',
 'zythum',
 'Zyzomys',
 'Zyzzogeton']

In [13]:
# Load the input words
with open('words_to_search.txt', 'r') as f:
    words = [w.strip() for w in f.readlines()]

In [14]:
# Print the start and end of the input list
words[:5] + ['...'] + words[-5:]

['acer',
 'acers',
 'aces',
 'aches',
 'acme',
 '...',
 'yap',
 'yaw',
 'yea',
 'zendo',
 'zoned']

# 2. Look for the anagrams of one input word, e.g. "organ"

* There are several anagrams, including "groan" and "argon".

* What is the Big-O performance of your algorithm, in terms of M, the number of words in the dictionary, and K, the number of letters in a word?

In [15]:
word = 'organ'

In [16]:
anagrams = []
for dict_word in dict_words:                # O(M)
    if sorted(word) == sorted(dict_word):   # 2 * O(K log K) ~ O(K log K)
        anagrams.append(dict_word)          # O(1)

In [17]:
anagrams

['angor',
 'argon',
 'goran',
 'grano',
 'groan',
 'nagor',
 'orang',
 'organ',
 'rogan']

The performance of this implementation is O(M * K log K).

Note that instead of sorting, we could use a dictionary mapping letters to letter counts. This would make the performance even better, O(M * K)! However, in practice K is small, so it would make little difference and would make the code more complicated. We'll leave it like this.
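For reference, the letter-count alternative could look roughly like the sketch below. It uses `collections.Counter` from the standard library; the helper name `is_anagram` and the tiny `sample_dict` are made up for illustration and are not part of the exercise.

```python
from collections import Counter

def is_anagram(word, candidate):
    """Return True if both words use the same letters the same number of times."""
    # Cheap length check first; building each Counter is O(K).
    return len(word) == len(candidate) and Counter(word) == Counter(candidate)

# Hypothetical stand-in for the system dictionary.
sample_dict = ['argon', 'groan', 'organ', 'orange', 'grain']

matches = [w for w in sample_dict if is_anagram('organ', w)]
print(matches)  # prints ['argon', 'groan', 'organ']
```

Comparing two `Counter` objects replaces the two sorts, so each comparison costs O(K) instead of O(K log K). For typical word lengths the difference is unlikely to be noticeable.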

# 3. Look for the anagrams of the words in the input list

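* How does the Big-O performance of your one-word implementation scale to the full input list? (Note: in this section M denotes the number of input words and N the number of dictionary words, the reverse of the convention used above.)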
* Is there a way to pre-process the dictionary words in a data structure that is better suited for this task?

In [18]:
# Naive implementation: it takes a long time

def find_anagrams(word, dict_words):
    anagrams = []
    for dict_word in dict_words:                # O(N)
        if sorted(word) == sorted(dict_word):   # O(K log K)
            anagrams.append(dict_word)          # O(1)
    return anagrams

words_anagrams = []
for word in words:                              # O(M)
    anagrams = find_anagrams(word, dict_words)  # O(N * K log K)
    words_anagrams.append((word, anagrams))     # O(1)

KeyboardInterrupt: 

The complexity of this algorithm is O(M * N * K log K), where M is the number of words to search, N is the number of words in the dictionary, and K is the length of a word.
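As an aside, the naive version re-sorts the input word for every dictionary word. A possible variant, sketched below, sorts it once per call. This roughly halves the sorting work but does not change the Big-O class, because every dictionary word is still sorted for every input word. The name `find_anagrams_presorted` and the small `sample_dict` are hypothetical.

```python
def find_anagrams_presorted(word, dict_words):
    """Like the naive search, but the input word is sorted only once."""
    target = sorted(word)              # O(K log K), once per call
    return [w for w in dict_words      # one pass over the dictionary
            if sorted(w) == target]    # O(K log K) per dictionary word

# Hypothetical stand-in for the system dictionary.
sample_dict = ['acme', 'came', 'mace', 'cake', 'act', 'cat']

print(find_anagrams_presorted('acme', sample_dict))  # prints ['acme', 'came', 'mace']
```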

# What if we pre-process the dictionary in a data structure that is better suited for this task?

In [19]:
anagrams_map = {}
for dict_word in dict_words:                  # O(N)
    letters = tuple(sorted(dict_word))        # O(K log K)
    if letters not in anagrams_map:           # O(1)
        anagrams_map[letters] = []            # O(1)
    anagrams_map[letters].append(dict_word)   # O(1)

In [20]:
words_anagrams = []
final_words = []
for word in words:                      # O(M)
    letters = tuple(sorted(word))       # O(K log K)
    if letters not in anagrams_map:     # O(1)
        # No dictionary word shares this word's letters
        print('No anagrams of this word in the system dictionary -- skipping', word)
        continue
    else:
        final_words.append(word)
        words_anagrams.append((word, anagrams_map[letters]))  # O(1)

In [21]:
words_anagrams

[('acer', ['acre', 'care', 'crea', 'race']),
 ('acers', ['carse', 'caser', 'ceras', 'scare', 'scrae']),
 ('aces', ['case', 'esca']),
 ('aches', ['chase']),
 ('acme', ['acme', 'came', 'mace']),
 ('acned', ['dance', 'decan']),
 ('acre', ['acre', 'care', 'crea', 'race']),
 ('acres', ['carse', 'caser', 'ceras', 'scare', 'scrae']),
 ('act', ['act', 'cat']),
 ('acts', ['cast', 'scat']),
 ('acyl', ['acyl', 'clay', 'lacy']),
 ('add', ['add', 'dad']),
 ('adverb', ['adverb']),
 ('aesc', ['case', 'esca']),
 ('aether', ['heater', 'hereat', 'reheat']),
 ('aethers', ['thereas']),
 ('afield', ['afield', 'defial']),
 ('aft', ['aft', 'fat']),
 ('agree', ['agree', 'eager', 'eagre']),
 ('agrees', ['grease']),
 ('ags', ['gas', 'sag']),
 ('ah', ['ah', 'ha']),
 ('ahs', ['ash', 'sah', 'sha']),
 ('aide', ['aide', 'idea']),
 ('aides', ['aside']),
 ('airings', ['raising']),
 ('airmen', ['marine', 'remain']),
 ('alloy', ['alloy', 'loyal']),
 ('almes', ['amsel', 'mesal', 'samel']),
 ('alp', ['alp', 'lap', 'pal'])

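The complexity of this algorithm is O((M + N) * K log K), where M is the number of words to search and N is the number of words in the dictionary: O(N * K log K) to build the map once, plus O(M * K log K) to look up all the input words. Compared with the O(M * N * K log K) of the naive version, this is much, much faster!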
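The same pre-processing idea can be written a little more compactly with `collections.defaultdict`, which removes the explicit "key not in map" check. The sketch below is one possible formulation rather than the notebook's own code; the function names and the small `sample_dict` are assumptions made for illustration.

```python
from collections import defaultdict

def build_anagram_map(dict_words):
    """Group dictionary words by their sorted-letter signature."""
    anagram_map = defaultdict(list)
    for dict_word in dict_words:
        anagram_map[tuple(sorted(dict_word))].append(dict_word)
    return anagram_map

def lookup_anagrams(word, anagram_map):
    """Return the dictionary anagrams of word, or an empty list if there are none."""
    # .get avoids creating an empty entry in the defaultdict on a miss.
    return anagram_map.get(tuple(sorted(word)), [])

# Hypothetical stand-in for the system dictionary.
sample_dict = ['acme', 'came', 'mace', 'act', 'cat', 'adverb']
anagram_map = build_anagram_map(sample_dict)

for word in ['acme', 'act', 'zzz']:
    print(word, lookup_anagrams(word, anagram_map))
```

The map is built once and then reused for every lookup, which is where the saving over the naive nested loop comes from.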