| ||||||||
| Sometimes I see a quiz that gives some unsorted letters, and asks for the Italian words composed with those letters. It's an anagram problem. I have found a very good list of all Italian words ("lemmario", list of Italian entries) (6.2 MB of just different words, separated by newline), including all forms of all verbs, so I have tried to create a simple program to find the anagrams. The basic starting point is to use word signatures, that is an invariant of anagramming, that is the sequence of the sorted letters of the word: rinverdita ==> adeiinrrtvUsing Python you can find such word with just: "".join(sorted(word))So the program is very simple, you can find the signature of the single input string (sign), then for each line of the lemmario, you can compute its signature, and print the line if the signatures are the same (the following code is untested, I use Python as pseudocode): signature = lambda s: "".join(sorted( s.strip().lower() ))
sign = signature(input_word)
for line in file("lemmario"):
if sign == signature(line)
print line.strip()
This works, but it's very slow, it may take ~20+ seconds on my very old PC. A solution is to relax the test, to make it faster. There are many ways to do it, this is one of the faster ways, we can create a less precise signature function, the faster one is just to use the set of the letters of the word. I have added a newline to the input_word to compute raw_sign because lines contain a newline too.
raw_sign = set(input_word + "\n")
sign = signature(input_word)
for line in file("lemmario"):
if raw_sign == set(line) and sign == signature(line):
print line.strip()
Python uses a shortcut, it doesn't evaluate the signature(line) if the first part of the and is False. So the first part removes most of the lines in a faster way. The resulting code takes ~6 seconds to run. 6 seconds are acceptable, but a quicker answer is appreciated when I play with friends with words. In this situation to speed up the code I don't think you need a faster language, but a smarter algorithm. So I have created a list of 200 defaultdicts (with Python 2.5), one for each possible word length. I have computed the signature for each word of the original file. Each dictionary has the signature as key and as values the list of the words with that signature. Then I have saved each defaultdictionary in a different file, so the successive loads are fast:
dicts = [defaultdict(list) for i in xrange(200)]
infile = file(words_filename)
for line in infile:
word = line.strip()
signature = "".join(sorted(word.lower()))
dicts[len(word)][signature].append(word)
for i in xrange(200):
if dicts[i]:
out_filename = "anags" + str(i).zfill(2) + ".pik"
binarysave(out_filename, dicts[i]) #using cPickle
This is a pre-processing stage done once only. After that, given a word you can load just the pickle that contains the words of that length, and search for the signature as key.Such pickle load takes most of the running time, about 1.2-1.7 seconds. Can we go faster? The processing now is very little, most of the time is used loading a very big dict, and it's done using cPickle, so probably it's not easy to go faster (there is the marshal module too, but probably that can't reduce loading time much). A dict is a complex data structure that requires some time to be decoded, so maybe using a simpler format we can read the data in a faster way. The simpler format is a text file. Maybe we can trade the dict loading, followed by an O(1) access, but a faster load time, followed by an O(n) sequential search inside the text. The sequential search is very fast because memory is contiguous (less cache misses), and the Python 2.5 string search algorithm is really optimised (it uses some very smart tricks). So it's easy to create many text files like this (each file is relative to words of the same length, as before): abdiin#bandii abdiir#adibir#ibrida abdiln#blandi#blinda abdino#badino#bidona#bionda abdinr#bandir#bardin#brinda abdins#sbandi abdior#bordaiInside each line the first part is the signature, the successive ones are the anagrams of that signature. Each different group is separated by "\r" (here shown as newline). I have chosen "\r" and "#" characters because they aren't present inside the lemmario. The resulting code is very simple: SEPARATOR = "#" GROUP_SEPARATOR = "\n" word = input_word.strip() txt_filename = compute_filename(len(word)) if not os.path.exists(txt_filename): return NOT_FOUND anags = file(txt_filename, "rb").read() word_sign = GROUP_SEPARATOR + signature(word) pos = anags.find(word_sign) if pos < 0: return NOT_FOUND else: next_pos = anags.find(GROUP_SEPARATOR, pos+1) if next_pos < 0: next_pos = None # for last line print anags[pos : next_pos].split(SEPARATOR)[1:] The anags.find(word_sign) comes out really fast. The whole running time is around 0.3-0.4 seconds. The signatures are sorted, so maybe there is a way to use a binary search to speed up this code even more. Python has the bisect function inside the bisect standard module: bisect.bisect()Given a point into the string you can choose the right half looking just a bit forward, taking the first group of letters between \r------#, and using that as comparing string. I have not tried this optimisation because 0.3 seconds is fast enough for my purposes, and with some timings I have seen that the normal running time (with an already loaded file in cache) is about 0.02 seconds, so most of those 0.3 seconds are the interpreter start-up time. To test that I have tried to remove the compilation time of the Python program with: python -OO -c "from py_compile import compile; compile('fast_anagrams.py')" This produces a 'fast_anagrams.pyo' bytecode executable for the Python VM, that takes about 0.02 seconds less than before to run, so the total running time is about 0.21-0.3 seconds still. | ||||||||
| comments: Leave a comment |