| I have improved the nhuffman code: - Now the function that generates the Huffman tree takes the Node class as parameter, so it can be used a different Node, this changes this code to "templated", because different ordering functions are possibile. (The node sorting is based on their frequency still, so it's not a fully templated Huffman yet, but it's not difficult to modify it). - The function that generates the Huffman tree was rather slow if the input contain many different input symbols (like 8000 instead of the usual possibile 256 of an ASCII text), so I have improved its speed some, and then I have used a heap to take the tree roots with less frequency (instead of sorting all tree roots every time), so now that function is about 15-20 times faster. Using the right data structure is important in Python too.
http://www.fantascienza.net/leonardo/so/index.html#nhuffman | comments: Leave a comment  |
| To compress the Chakasa Tail Signed Language (and because it's an interesting algorithm) I've written a n-way Huffman encoder/decoder with Python V.2.5. It optionally uses Psyco too. The code is slow compared to a C Huffman implementation, but it's flexible enough, and it can be modified to manage different kinds of symbol ordering, etc (Templated Huffman).
http://www.fantascienza.net/leonardo/so/index.html#nhuffman
At the moment it accepts strings, unicode strings, and lists of hashable objects:
Input data: 'abbcccdddd' freqs: {'a': 1, 'c': 3, 'b': 2, 'd': 4} nways: 2 Huffman tree: 'd' 0 'c' 11 'a' 100 'b' 101 huff_dict: {'a': '100', 'c': '11', 'b': '101', 'd': '0'} len(encoded) (Huffman tree not included): 19 bit(s) ~= 3 byte(s). entropy(text) * len(text): 19 bits. encoded: 1001011011111110000
Input data: 'abbcccdddd' freqs: {'a': 1, 'c': 3, 'b': 2, 'd': 4} nways: 3 Huffman tree: 'd' 0 'a' 10 'b' 11 'c' 12 huff_dict: {'a': '10', 'c': '12', 'b': '11', 'd': '0'} len(encoded) (Huffman tree not included): 16 symbol(s) (base 3) entropy(text) * len(text): 19 bits. encoded: 1011111212120000
Input data: ['hello', 'how', 'are', 'how', 'you', 'hello', 'are', 'how', 'you', 'do', 'hello', 'how'] nways: 2 Huffman tree: 'are' 00 'hello' 10 'how' 11 'do' 010 'you' 011 huff_dict: {'are': '00', 'how': '11', 'you': '011', 'hello': '10', 'do': '010'} len(encoded) (Huffman tree not included): 27 bit(s) ~= 4 byte(s). entropy(seq) * len(seq): 27 bits. encoded: 101100110111000110110101011
The code optionally uses a tiny compiled module, my first Pyrex module named "mixed". It contains a charfreq() function that given a string, returns a dict of {char:char_freq}. On long strings it is many times faster than my faster Python version. | comments: Leave a comment  |
|