| My software updates:
A simple order-0 arithmetic file compressor for stationary sources (independent bytes with a uniform distribution throughout the file). Original C++ GPL code (http://cs.fit.edu/%7Emmahoney/compression/fpaq0.cpp ) by Matt Mahoney, translated to D by me. Using the current DMD compiler (v.1.025) it's almost 2 time slower than the C++ code compiled with MinGW. http://www.fantascienza.net/leonardo/so/fpaq0.zip
I have updated my D libs: http://www.fantascienza.net/leonardo/so/libs_d.zip - I have added the "select" module to my D libs, it's useful to quickly find the the n-th rank ordered item. There's a function to find the median too. - Added xfilter, like filter, but lazy, predicate is optional. - Added xfilterAA, like filterAA, but lazy (there is no version without predicate). - More smaller improvements.
I'd like to add second order functions like itemgetter() and attrgetter() of Python, (http://docs.python.org/lib/module-operator.html#l2h-1184 ), but it seems a too much difficult job. Despite its qualities, and its full closures, D isn't much fit for functional-style programming (yet).
----------------
Few interesting links:
A visual dictionary, 80 million tiny images project: >We present a visualization of all the nouns in the English language arranged by semantic meaning. Each of the tiles in the mosaic is an arithmetic average of images relating to one of 53,464 nouns. The images for each word were obtained using Google's Image Search and other engines. A total of 7,527,697 images were used, each tile being the average of 140 images. The average reveals the dominant visual characteristics of each word.< http://people.csail.mit.edu/torralba/tinyimages/
Functional language for computing with geometry: http://www.plasm.net/gallery/ | comments: Leave a comment  |
| | Tags: | data compression | | Subject: | Large file collection compression scatterplot | | Time: | 12:04 am |
|
| In the Maximum Compression site the autor (Werner Bergmans) performs two groups of tests on many data compression programs. The "Large file collection" is newer and uses much more data (510 files, 301 MB in total). He shows pages with compressors ranked by compressed file size, compression time and "efficiency", this the list ranked according to the compressed file size: http://www.maximumcompression.com/data/summary_mf.php
But I think it can be quite interesting to perform a 2D scatterplot too of such table. To reduce the clutter I have filtered the data, keeping only the Pareto frontier: http://en.wikipedia.org/wiki/Pareto_efficiency#Pareto_frontier (It means the compressors that don't have other compressors better in both compression time and compressed file size). It's easy to perform such filtering with few lines of easy Python:
lines = [line.rstrip().split("\t") for line in file("summary5.txt")]
lines = [(txt, int(m), int(t)) for txt,m,t in lines]
result = []
for txt1, m1, t1 in lines:
ok = True
for txt2, m2, t2 in lines:
if txt1 != txt2:
if m2 < m1 and t2 < t1:
ok = False
if ok:
print "%s\t%s\t%s" % (txt1, m1, t1)
This is the result:
Program and parameters Compr. size C.time
PAQ8O6 -7 64087180 43267
WinRK 3.0.3 Maximum 64217766 30603
WinRK 3.0.3 High 68947835 6317
DURILCA 0.5 -m800 -t1 72175363 1997
DURILCA 0.5 -m800 -t1 -o32 72246808 1838
CCMx 1.25 6 77399214 392
CCM 1.25 6 78261470 315
CCM 1.25 (none) 78923905 298
FreeARC 0.40 (none) 81657471 215
SBC 0.970 rev2 -b9 85740264 138
SBC 0.970 rev2 (none) 86245287 129
FreeARC 0.40 -m3 86537094 103
StuffIt 11.0 Method=5 M26 opt/bm=on 97657644 84
UHARC 0.6b -mz -md32768 -mm+ 99251157 72
WinTurtle 1.30 (none) 106465497 67
LZTurbo 0.1 -44 112046732 46
THOR 0.96 e5 112275112 24
PKZIP 2.50 (none) 114501263 22
Tornado 0.1 (none) 122285619 21
THOR 0.96 e3 124588656 6
SLUG 1.1b (none) 140327799 5
THOR 0.96 e1 151936954 3
 In the scatterplot the X axis is logarithmic, and it shows the compression time, the Y axis is the compressed file size. The original data size is 316,355,757. | comments: 2 comments or Leave a comment  |
| I have improved the nhuffman code: - Now the function that generates the Huffman tree takes the Node class as parameter, so it can be used a different Node, this changes this code to "templated", because different ordering functions are possibile. (The node sorting is based on their frequency still, so it's not a fully templated Huffman yet, but it's not difficult to modify it). - The function that generates the Huffman tree was rather slow if the input contain many different input symbols (like 8000 instead of the usual possibile 256 of an ASCII text), so I have improved its speed some, and then I have used a heap to take the tree roots with less frequency (instead of sorting all tree roots every time), so now that function is about 15-20 times faster. Using the right data structure is important in Python too.
http://www.fantascienza.net/leonardo/so/index.html#nhuffman | comments: Leave a comment  |
| To compress the Chakasa Tail Signed Language (and because it's an interesting algorithm) I've written a n-way Huffman encoder/decoder with Python V.2.5. It optionally uses Psyco too. The code is slow compared to a C Huffman implementation, but it's flexible enough, and it can be modified to manage different kinds of symbol ordering, etc (Templated Huffman).
http://www.fantascienza.net/leonardo/so/index.html#nhuffman
At the moment it accepts strings, unicode strings, and lists of hashable objects:
Input data: 'abbcccdddd' freqs: {'a': 1, 'c': 3, 'b': 2, 'd': 4} nways: 2 Huffman tree: 'd' 0 'c' 11 'a' 100 'b' 101 huff_dict: {'a': '100', 'c': '11', 'b': '101', 'd': '0'} len(encoded) (Huffman tree not included): 19 bit(s) ~= 3 byte(s). entropy(text) * len(text): 19 bits. encoded: 1001011011111110000
Input data: 'abbcccdddd' freqs: {'a': 1, 'c': 3, 'b': 2, 'd': 4} nways: 3 Huffman tree: 'd' 0 'a' 10 'b' 11 'c' 12 huff_dict: {'a': '10', 'c': '12', 'b': '11', 'd': '0'} len(encoded) (Huffman tree not included): 16 symbol(s) (base 3) entropy(text) * len(text): 19 bits. encoded: 1011111212120000
Input data: ['hello', 'how', 'are', 'how', 'you', 'hello', 'are', 'how', 'you', 'do', 'hello', 'how'] nways: 2 Huffman tree: 'are' 00 'hello' 10 'how' 11 'do' 010 'you' 011 huff_dict: {'are': '00', 'how': '11', 'you': '011', 'hello': '10', 'do': '010'} len(encoded) (Huffman tree not included): 27 bit(s) ~= 4 byte(s). entropy(seq) * len(seq): 27 bits. encoded: 101100110111000110110101011
The code optionally uses a tiny compiled module, my first Pyrex module named "mixed". It contains a charfreq() function that given a string, returns a dict of {char:char_freq}. On long strings it is many times faster than my faster Python version. | comments: Leave a comment  |
|