leonardo
View:Recent Entries.
View:Archive.
View:Friends.
View:User Info.
View:Website (My Website).
You're looking at the latest 5 entries.

Tags:, ,
Subject:Updates and links
Time:03:29 pm
My software updates:

A simple order-0 arithmetic file compressor for stationary sources (independent bytes with a uniform distribution throughout the file). Original C++ GPL code (http://cs.fit.edu/%7Emmahoney/compression/fpaq0.cpp ) by Matt Mahoney, translated to D by me. Using the current DMD compiler (v.1.025) it's almost 2 time slower than the C++ code compiled with MinGW.
http://www.fantascienza.net/leonardo/so/fpaq0.zip


I have updated my D libs:
http://www.fantascienza.net/leonardo/so/libs_d.zip
- I have added the "select" module to my D libs, it's useful to quickly find the the n-th rank ordered item. There's a function to find the median too.
- Added xfilter, like filter, but lazy, predicate is optional.
- Added xfilterAA, like filterAA, but lazy (there is no version without predicate).
- More smaller improvements.

I'd like to add second order functions like itemgetter() and attrgetter() of Python, (http://docs.python.org/lib/module-operator.html#l2h-1184 ), but it seems a too much difficult job. Despite its qualities, and its full closures, D isn't much fit for functional-style programming (yet).

----------------

Few interesting links:

A visual dictionary, 80 million tiny images project:
>We present a visualization of all the nouns in the English language arranged by semantic meaning. Each of the tiles in the mosaic is an arithmetic average of images relating to one of 53,464 nouns. The images for each word were obtained using Google's Image Search and other engines. A total of 7,527,697 images were used, each tile being the average of 140 images. The average reveals the dominant visual characteristics of each word.<
http://people.csail.mit.edu/torralba/tinyimages/

Functional language for computing with geometry:
http://www.plasm.net/gallery/
comments: Leave a comment Add to Memories Tell a Friend

Tags:
Subject:Large file collection compression scatterplot
Time:12:04 am
In the Maximum Compression site the autor (Werner Bergmans) performs two groups of tests on many data compression programs. The "Large file collection" is newer and uses much more data (510 files, 301 MB in total). He shows pages with compressors ranked by compressed file size, compression time and "efficiency", this the list ranked according to the compressed file size:
http://www.maximumcompression.com/data/summary_mf.php

But I think it can be quite interesting to perform a 2D scatterplot too of such table. To reduce the clutter I have filtered the data, keeping only the Pareto frontier:
http://en.wikipedia.org/wiki/Pareto_efficiency#Pareto_frontier
(It means the compressors that don't have other compressors better in both compression time and compressed file size). It's easy to perform such filtering with few lines of easy Python:
lines = [line.rstrip().split("\t") for line in file("summary5.txt")]
lines = [(txt, int(m), int(t)) for txt,m,t in lines]
result = []
for txt1, m1, t1 in lines:
    ok = True
    for txt2, m2, t2 in lines:
        if txt1 != txt2:
            if m2 < m1 and t2 < t1:
                ok = False
    if ok:
        print "%s\t%s\t%s" % (txt1, m1, t1)
This is the result:
Program and parameters             Compr. size C.time

PAQ8O6 -7                             64087180  43267
WinRK 3.0.3 Maximum                   64217766  30603
WinRK 3.0.3 High                      68947835   6317
DURILCA 0.5 -m800 -t1                 72175363   1997
DURILCA 0.5 -m800 -t1 -o32            72246808   1838
CCMx 1.25 6                           77399214    392
CCM 1.25 6                            78261470    315
CCM 1.25 (none)                       78923905    298
FreeARC 0.40 (none)                   81657471    215
SBC 0.970 rev2 -b9                    85740264    138
SBC 0.970 rev2 (none)                 86245287    129
FreeARC 0.40 -m3                      86537094    103
StuffIt 11.0 Method=5 M26 opt/bm=on   97657644     84
UHARC 0.6b -mz -md32768 -mm+          99251157     72
WinTurtle 1.30 (none)                106465497     67
LZTurbo 0.1 -44                      112046732     46
THOR 0.96 e5                         112275112     24
PKZIP 2.50 (none)                    114501263     22
Tornado 0.1 (none)                   122285619     21
THOR 0.96 e3                         124588656      6
SLUG 1.1b (none)                     140327799      5
THOR 0.96 e1                         151936954      3

In the scatterplot the X axis is logarithmic, and it shows the compression time, the Y axis is the compressed file size. The original data size is 316,355,757.
comments: 2 comments or Leave a comment Add to Memories Tell a Friend

Tags:, ,
Subject:n-way templated Huffman
Time:02:36 pm
I have improved the nhuffman code:
- Now the function that generates the Huffman tree takes the Node class as parameter, so it can be used a different Node, this changes this code to "templated", because different ordering functions are possibile. (The node sorting is based on their frequency still, so it's not a fully templated Huffman yet, but it's not difficult to modify it).
- The function that generates the Huffman tree was rather slow if the input contain many different input symbols (like 8000 instead of the usual possibile 256 of an ASCII text), so I have improved its speed some, and then I have used a heap to take the tree roots with less frequency (instead of sorting all tree roots every time), so now that function is about 15-20 times faster. Using the right data structure is important in Python too.

http://www.fantascienza.net/leonardo/so/index.html#nhuffman
comments: Leave a comment Add to Memories Tell a Friend

Tags:, ,
Subject:n-way Huffman done in Python
Time:12:32 pm
To compress the Chakasa Tail Signed Language (and because it's an interesting algorithm) I've written a n-way Huffman encoder/decoder with Python V.2.5. It optionally uses Psyco too. The code is slow compared to a C Huffman implementation, but it's flexible enough, and it can be modified to manage different kinds of symbol ordering, etc (Templated Huffman).

http://www.fantascienza.net/leonardo/so/index.html#nhuffman

At the moment it accepts strings, unicode strings, and lists of hashable objects:

Input data: 'abbcccdddd'
freqs: {'a': 1, 'c': 3, 'b': 2, 'd': 4}
nways: 2
Huffman tree:
  'd'    0
  'c'    11
  'a'    100
  'b'    101
huff_dict: {'a': '100', 'c': '11', 'b': '101', 'd': '0'}
len(encoded) (Huffman tree not included): 19 bit(s) ~= 3 byte(s).
entropy(text) * len(text): 19 bits.
encoded: 1001011011111110000


Input data: 'abbcccdddd'
freqs: {'a': 1, 'c': 3, 'b': 2, 'd': 4}
nways: 3
Huffman tree:
  'd'    0
  'a'    10
  'b'    11
  'c'    12
huff_dict: {'a': '10', 'c': '12', 'b': '11', 'd': '0'}
len(encoded) (Huffman tree not included): 16 symbol(s) (base 3)
entropy(text) * len(text): 19 bits.
encoded: 1011111212120000


Input data: ['hello', 'how', 'are', 'how', 'you', 'hello', 'are', 'how', 'you', 'do', 'hello', 'how']
nways: 2
Huffman tree:
  'are'    00
  'hello'  10
  'how'    11
  'do'     010
  'you'    011
huff_dict: {'are': '00', 'how': '11', 'you': '011', 'hello': '10', 'do': '010'}
len(encoded) (Huffman tree not included): 27 bit(s) ~= 4 byte(s).
entropy(seq) * len(seq): 27 bits.
encoded: 101100110111000110110101011


The code optionally uses a tiny compiled module, my first Pyrex module named "mixed". It contains a charfreq() function that given a string, returns a dict of {char:char_freq}. On long strings it is many times faster than my faster Python version.
comments: Leave a comment Add to Memories Tell a Friend

Tags:,
Subject:A small article, on data repetition statistics
Time:05:02 pm
Some explorations of string repetition statistics:
http://www.fantascienza.net/leonardo/ar/index.html#strrep
comments: Leave a comment Add to Memories Tell a Friend

leonardo
View:Recent Entries.
View:Archive.
View:Friends.
View:User Info.
View:Website (My Website).
You're looking at the latest 5 entries.