leonardo
View:Recent Entries.
View:Archive.
View:Friends.
View:User Info.
View:Website (My Website).
You're looking at the latest 4 entries.

Tags:, , , ,
Subject:Pythagorean triples
Time:06:05 pm
A little puzzle:
Given an array of n integers, find all Pythagorean triples in the array, that is, three elements such that a^2 + b^2 = c^2. Do this in O(n^2) time.

A simple solution is to test the equation with a hash of the squares on every unordered (a^2,b^2) pair (see the first Python version below).

First I have written a Python solution that uses a set, then I have translated it to D using my set module, then using D associative arrays, then I have translated the code to C adapting a simple hash code to integer keys. Then I have simplified the C code, removing the values from the hash.

That's the faster general solution, but the code can be improved if the problem is made less general. So I have tried to count the number of distinct Pythagorean triples among the integers [1,10_000].
So I have tried to simplify the hash some more. I have tried a really simple hash, and while counting the collision frequency I have found a very simple hash configuration where every number is present at its place in the array, or it's absent. Then as usual I have backported that code to Psyco and D.

Click to see the source code
Click here )

Using MinGW 4.2.1 I have compiled that C code with:
gcc -O3 -s -march=pentiumpro -fprofile-generate triples.c -o triples
Followed by a run, followed by:
gcc -O3 -s -march=pentiumpro -fprofile-use triples.c -o triples

The general D version works with any group of input numbers (not tested below).

The D1 version is really close to the C version. The D2 version is faster and higher-level, it uses my d libs:
www.fantascienza.net/leonardo/so/libs_d.zip

The timing results (Pentium3 500 MHz):
Timings triples, N = 10_000:
  D general: 13.38 s
  Psyco:      6.89 s   (4.12 X)   (1.87 X)
  D1:         4.02 s   (2.40 X)   (1.09 X)
  D2:         3.68 s   (2.20 X)   (1    X)
  C:          1.67 s   (1    X)   (0.45 X)

Triples in [1, 10000): 12467
Triples in [1, 20000): 27171
Triples in [1, 40000): 58728
Triples in [1, 80000): 394138

The timings are quite good for Psyco but quite bad for D: the C-like D version is 2.4 times slower than the C version, and the Psyco version is just 1.7 times slower than the C-like D version (and the Psyco version is 1.9 times slower than the faster D version). This means that with this tiny program there's not that much purpose in using D instead of Psyco.
comments: 1 comment or Leave a comment Add to Memories Tell a Friend

Tags:, ,
Subject:Bugs in C and D
Time:07:11 pm
An article of mine that shows how safer is the D language compared to the C, but it also shows dome of its downsides:

http://www.fantascienza.net/leonardo/ar/c_d_bugs.html
comments: Leave a comment Add to Memories Tell a Friend

Tags:, ,
Subject:Another little speed test
Time:02:49 am
This is a simple test of the relative speed of D (DMD) and C (MinGW), it's just the sum of few complex numbers.
You can find the sources here:
http://www.fantascienza.net/leonardo/js/creal_tests.zip
Minimum running time on many runs,
on a PIII @ 500 MHz, N = 1_024, M = 200_000:

  1.77 s   C
  3.98 s   D 1
  3.61 s   D 2a
  3.52 s   D 2b (with new)
  3.51 s   D 3a
  3.61 s   D 3b (with new)
  4.37 s   D 4a
  4.60 s   D 4b (with new)
 89.6  s   Python V. 2.5.1

C code compiled with MinGW 3.4.2 with:
  -O2 -s -fomit-frame-pointer

D code compiled with DMD v1.023 with:
  -O -release -inline

Source file sizes (bytes):
  694 testC.c
  537 testD1.d
  551 testD2a.d
  562 testD2b.d
  390 testD3a.d
  401 testD3b.d
  257 testD4a.d
  268 testD4b.d
  134 testP.py

Note that with Python you can find the correct answer with just:
print sum(n-n*1J for n in xrange(1024))
comments: Leave a comment Add to Memories Tell a Friend

Tags:, ,
Subject:Notes on the binary trees benchmark
Time:07:39 pm
I belive the Computer Shootout site (recently renames "The Computer Language Benchmarks Game") is quite interesting:

http://shootout.alioth.debian.org/

From such comparisons you can learn lot of computer science (not just the languages themselves), because the benchmarks are well designed, and because those benchmark sources are processed by many different kind of compilers and interpreters, and inside them there is a lot of computer science.

Here you can find some sources related to one of such tests, the "binary trees"
http://www.fantascienza.net/leonardo/js/bintrees.zip

At the bottom of this page you can find some explanations about this benchmark:
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=all#about

You don't need to understand all the details of this benchmark, it essentially allocates lot of complete binary trees, scans them, and destroyes them, to build some more. Bigger and begger trees are built, but less and less of them.

To take a global look at this benchmark you can look at bintreesP1.py (from the bintrees.zip archive). In that program a tree node is made with tuple of len 3, like:

(item, make_tree(item2 - 1, depth), make_tree(item2, depth))

bintreesP2.py is a similar program, to represent a tree node it uses a class with 3 attributes, but it's slow than the tuple-based version, even with Psyco(but this version can be translated by ShedSkin from Python to C++).

The simples D language version (bintreesD02.d) is really close to the bintreesP2.py version (it leaves the whole memory management to the D Garbage Collector), and it's also quite similar to the Java version, that contains some "private static" too. But the bintreesD2.d version seem much slower than the Java version (compiled with java 6), probably because for this benchmark the java garbage collector manages memory much better than the D GC. This Java program requires much more RAM, so I think the GC keeps a large number of objects (tree nodes) and uses their memory again when they are freed and used again. The funny thing is that bintreesD02.d is also quite slower than the Python+Psyco version bintreesP1.py, this probably means that the D GC is quite primitive still.

bintreesD01.d uses a manual memory allocation and deallocation of nodes (structs) and it's quite faster than the bintreesD02.d, but not faster than the java version (java GC is optimized for such object management. Probably the GC of dotnet is able to act in a similar way).

The C version bintreesC1.c is quite similar to bintreesD01.d (it allocates and deallocates memory manually), but it's a bit faster. It uses free function instead of struct methods.

To speed up the D code I have used a standard way to manage memory when you have to frequently allocate and deallocate lot of the same objects, a free list. bintreesD03.d adds a static list header pointer, plus a link to the successive object inside each tree node, to be used when the object is "deleted" and added to the free list (that is a simply linked list of unused tree nodes, their contents are rewritten when such nodes are taken from the free list). This version is much faster than the bintreesD2.d, and faster than the manually managed bintreesD01.d too.

The C code bintreesC2.c too uses a free list, but it uses a trick, the left pointer of each node is used as pointer for the free list too, saving memory. It's quite fast, probably the faster version. I have tried to create a D version on the base of bintreesC2.c (called bintreesD04.d), and it's faster than bintreesD03.d (it uses free functions like bintreesC2.c instead of methods).

Update Nov 14 2007: I have added five new D versions:

bintreesD05.d is like bintreesD04.d, but it disables the garbage collector at the beginning (and the code is more clean).

bintreesD06.d performs all the tree node allocations up front, avoiding ckeeks later and helping the GC allocate memory faster.

bintreesD07.d does the same as bintreesD06.d, but all the tree nodes are allocated in a single block, as an array (then linked internally). This strategy can be used in C too, of course. A more realistic version of this is to allocate the structs in chunks, so the prgram doesn't need to know how many tree nodes to allocate at the beginning of the code (see below).

bintreesD08.d is similar to bintreesD07.d, but uses indexes to array elements, this speeds up the creation of the free list, but all the other accesses are slower, so the overall timing is worse.

Update Nov 17 2007: I have added seven new D versions.

bintreesD09.d: is derived from bintreesD07.d, but instead of allocating all the nodes up-front (that requires to know how many nodes it will need) it allocates a *block* of them when a new node is needed and the freelist is empty. The block size is 5_000 nodes, but such value can be tuned (see below). The tests show that 5000 is good enough, and it equals to an allocation of about 64 KB. (So in successive version I have used this value, with the sizeof of the node, to compute the right number of structs to allocate). This version is slower than bintreesD07.d for large n (16).

bintreesD10.d: this is similar to bintreesD10.d, but instead of allocating blocks of fixed size, it allocates them geometrically bigger and bigger (growth factor = 2). This version is slower than bintreesD07.d and bintreesD09.d probably because it allocates blocks of the wrong size. Allocating lots of tiny structs is slow, but allocating a very large chunk of memory is slow too. So probably in the middle there is an optimal chunk memory size that gives the best speed. To test this I have created the version bintreesD11.d.

bintreesD11.d: is like the bintreesD07.d, but the nodes are allocated in some chunks, instead in a single one. This version is quite fast.

bintreesD12.d: this is like bintreesD11.d, but the memory chunks are allocated with std.gd.malloc, and it's a tiny bit faster.

bintreesD13.d: like bintreesD12.d, but it uses std.c.stdlib.malloc instead of the std.gd.malloc one (and this isn't initialized to zero. I don't know if the memory given by std.gd.malloc is cleared). This is the faster version I have invented so far.

bintreesD14.d: derived from bintreesD09.d, but blocks are allocated with std.c.stdlib.malloc, so it's quite faster than bintreesD09.d. The block size is computed so it is contained in 64 K.

bintreesD15.d: like bintreesD14.d, but the blocks are allocated with an array, so the GC can delete them when the free list ins't needed anymore. It's a bit slower than bintreesD14.d.

Tests performed on a PentiumIII at 500MHz.
Output piped to file.
Best of 3-4 runs.

Timings, n=15, seconds:
  C1:      18.72 s
  C2:       2.97 s
  ------
  D01:      8.18 s
  D02:     40.7  s
  D03:      4.34 s
  D04:      2.90 s
  D05:      2.82 s
  D06:      2.62 s
  D07:      2.41 s
  D08:      3.01 s
  D09:      2.55 s (toAlloc = 500)
  D09:      2.48 s (toAlloc = 1_000)
  D09:      2.50 s (toAlloc = 5_000)
  D09:      2.61 s (toAlloc = 50_000)
  D10:      2.55 s (starting toAlloc = 1)
  D10:      2.59 s (starting toAlloc = 10)
  D11:      2.22 s
  D12:      2.20 s
 *D13:      2.16 s
  D14:      2.44 s
  D15:      2.49 s
  ------
  P01:     85    s (without Psyco)
  P01:     29.4  s (with Psyco)
  P02:    311    s (without Psyco)
  P02:     71.6  s (with Psyco)
  SS:       9.75 s (P02 compiled with ShedSkin 0.0.25)
  ------
  Java 1:   4.39 s
  Java 2:   4.34 s 


Timings, n=16, seconds:
  D01:     20.32 s
  D03:     12.6  s
  D04:      8.04 s
  D06:      7.48 s
  D07:      6.74 s
  D09:      6.70 s (toAlloc = 5_000)
  D10:      6.94 s (starting toAlloc = 1)
  D11:      6.07 s
  D12:      6.04 s
 *D13:      5.94 s
  D14:      6.58 s
  D15:      6.67 s


Memory used, n=16, KB:
  D01:     7140
  D02:    17884
  D03:     9408   
  D04:     5016
  D10:     3944
  D11:     3956
  D12:     3956
  D13:     4156
  D14:     4228
  D15:     3940
  ------
  P01:    17200+ (with Psyco)
  SS:      9976  (P02 compiled with ShedSkin 0.0.25)
  ------
  Java 1: 48000+
  Java 2: 37000+
  Java 3: 23000


C versions compiled with:
  gcc -s -O3 -fomit-frame-pointer bintreesC1.c -o bintreesC1

D version compiled with:
  dmd -O -inline -release bintreesD1.d

Java (1) executed with:
  java -Xms64m -Xbatch binarytrees 16

Java (2) executed with:
  java -server -Xms64m -Xbatch binarytrees 16

Java (3) executed with:
  java binarytrees 16


In pratical usage, where you don't know how many objects to allocate, and you want the freelist be able to be deleted by the GC, the D15 version is probably the most useful. If speed is all you want, the D13 version is better, if you can use it.

Regarding the useless free list in Java with HotSpot:
http://java.sun.com/docs/hotspot/HotSpotFAQ.html#gc_pause
>Many systems have less efficient memory management than in HotSpot. To work around this, some programs keep an "object pool", saving previously allocated objects in some freelist-like data structure and reusing them instead of allocating new ones. But... Don't use object pools! Using object pools will fool the collector into thinking objects are live when they really aren't. This may have worked before exact garbage collection became popular, but this is just not a good idea for any modern Java Virtual Machines.
comments: Leave a comment Add to Memories Tell a Friend

leonardo
View:Recent Entries.
View:Archive.
View:Friends.
View:User Info.
View:Website (My Website).
You're looking at the latest 4 entries.