leonardo
View:Recent Entries.
View:Archive.
View:Friends.
View:User Info.
View:Website (My Website).
You're looking at the latest 10 entries.
Missed some entries? Then simply jump back 10 entries

Tags:, , , , , , , ,
Subject:Olden "em3d" benchmark in D language
Time:12:38 am
All the code shown in this article:
http://www.fantascienza.net/leonardo/js/dolden_em3d.zip


This is a partial port of the Olden benchmarks to the D language. I'll add more Olden benchmarks in future as time allows me to.

Info on the Olden benchmarks and their Java translation (JOlden):
http://www-ali.cs.umass.edu/DaCapo/benchmarks.html
http://www.sable.mcgill.ca/~bdufou1/ashes2/


General comments:

One of my purposes is to test the efficiency of the D1 code compiled with LDC. LDC is usually able to produce efficient binaries when I translate C code to C-like D code. So now I want to see how D is efficient if I start from higher level code, like Java code. It's usually easy to translate such Java code to D, D looks like a super set of C and Java, with some C++ mixed in.

Olden benchmarks are designed to stress first of all the memory allocations. The D Garbage Collector (GC) is quite less efficient than the Java HotSpot GC, so literal translations of the Olden benchmarks from Java to D are usually lower or quite lower performance (up to 4-16 times slower). So I have to work some to regain performance. Usually D allows me to produce final programs that are 2-6 times faster than the original Java versions, but such optimization requires time, experience, and sometimes it's a little bug-prone.

Beside having a more efficient GC, Java performs other optimizations not done by LDC, like inlining of virtual methods, and other things. In future the LLVM back-end of LDC will hopefully perform some of such optimizations (obsoleting some of my manual optimizations).

In all JOlden benchmarks I have packed the Java code of a benchmark in a single source file, that later I have first translated to D, and then optimized for speed and memory usage. Sometimes I have cleaned up some the D code.

Below I explain the various versions I've created.


See the Python2 code near the bottom to see the algorithm used by this "em3d" benchmark. This benchmark models the propagation of electromagnetic waves through objects in 3 dimensions. It is a simple computation on an irregular bipartite graph containing nodes representing electric and magnetic field values.

Update 1.2, Oct 26 2006:

One of the main purposes of this article is to show show and teach how some manual optimizations are done.

Timings of the various versions (i is the number of iterations):
          i=50   i=200  i=2000
Java1:    4.37    7.13
Java2     3.62    6.87
D01:      6.55   11.40
D02:      4.98    9.36
D03:      3.53    8.01
D04:      3.32    6.44
D05:      2.94    5.95
D06:      2.14    5.23
D07:      1.99    4.72
D08:      1.82    4.04
D09:      1.82    4.03
D10:      1.20    3.06
D11:      1.27    2.96   24.62 (TyIndex=ushort)
D11:      1.52    3.60         (TyIndex=uint)
D11b:     1.27    2.97   23.25 (TyIndex=ushort, TyDelta=ubyte)
D11b:     1.31    3.00         (TyIndex=uint, TyDelta=ushort)
D12:      1.61    3.62
Java3:    3.14    6.24
D01b:     6.33   10.60
Python1: 64.3    ---


------------------------------------

Java1

The original JOlden Java code packed in a single source file.

I have fixed a bug present in the Java code of JOlden Em3d but absent in the original C code:
At line 131 of the Em3d1.java the line:
if (otherNode == toNodes[filled]) break;
Has to be fixed as:
if (otherNode == toNodes[k]) break;

Such bug was not found probably because this benchmark suite has no unit-tests that test well each class and each method of each class.

Timings (best of 6, seconds):

...$ time java -server Em3d1 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Propagating field values for 50 iteration(s)...
EM3D build time 2.725
EM3D compute time 1.277
EM3D total time 4.01
Done!
real 0m4.370s
user 0m3.376s
sys 0m0.380s


...$ time java -server Em3d1 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Propagating field values for 200 iteration(s)...
EM3D build time 2.458
EM3D compute time 4.35
EM3D total time 6.812
Done!
real 0m7.128s
user 0m6.568s
sys 0m0.440s

Used:
java version "1.6.0_16"
Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
Java HotSpot(TM) Client VM (build 14.2-b01, mixed mode, sharing)

Code running on Ubuntu, running on VirtualBox, running on Windows Vista, on CPU Celeron 2.13 GHz.

------------------------------------

Java2

It's like the Java1 version, but it use a very simple portable pseudo-random generator. So when I translate this program to D I can see if it gives the same results.

The Random.nextInt is a little different, it accepts a max value (min is 0).

Timings:

...$ time java -server Em3d2 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Propagating field values for 50 iteration(s)...
EM3D build time 2.209
EM3D compute time 1.109
EM3D total time 3.318
Done!
real 0m3.622s
user 0m3.100s
sys 0m0.416s

...$ time java -server Em3d2 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Propagating field values for 200 iteration(s)...
EM3D build time 2.197
EM3D compute time 4.354
EM3D total time 6.551
Done!
real 0m6.875s
user 0m6.292s
sys 0m0.448s

------------------------------------

D01

This is the most direct translation of the Java2 code to D, with the portable pseudo-random generator. I have used printf for maximum portability across different D standard libraries. I have replaced the enumerated with the D opApply.
As you can see D1 allows to program in a style almost equal to Java.

Code compiled with:
...$ ldc -O5 -release -inline em3d01.d

With:
LDC compiler, based on DMD v1.045 and llvm 2.6 (Thu Sep 10 23:50:27 2009)

Timings:

...$ time ./em3d01 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Propagating field values for 50 iteration(s)...
EM3D build time 4.96
EM3D compute time 1.38
EM3D total time 6.34
Done!
real 0m6.547s
user 0m5.856s
sys 0m0.576s

...$ time ./em3d01 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Propagating field values for 200 iteration(s)...
EM3D build time 4.88
EM3D compute time 5.97
EM3D total time 10.86
Done!
real 0m11.398s
user 0m10.801s
sys 0m0.492s


The first D version, despite being compiled with LDC, is slower than the Java version. As you can see the build time is much bigger, more than two times, that's mostly because of the D GC.

To test that no bugs are introduced I have created a small Python script that acts like the "diff" command, but ignored small differences in floating point values. The programs of this benchmark are able to print the results, so I can test them:

...$ java Em3d2 -n 10 -d 4 -i 5 -m -p > outj
...$ ./em3d01 -n 10 -d 4 -i 5 -m -p > outd01

With it I can test that the results are the same:

...$ python approx_diff.py outj2 outd01
3) line=26: ['EM3D', 'build', 'time', '0.0040'] ['EM3D', 'build', 'time', '0.00'] 0.004 0.0
3) line=28: ['EM3D', 'total', 'time', '0.0040'] ['EM3D', 'total', 'time', '0.00'] 0.004 0.0

This is the asm of inner loop of computeNewValue() compiled with LDC:

.LBB11_2:
movl 32(%eax), %esi
movl (%esi,%edx,4), %esi
movl 40(%eax), %edi
movsd (%edi,%edx,8), %xmm1
mulsd 8(%esi), %xmm1
subsd %xmm1, %xmm0
movsd %xmm0, 8(%eax)
incl %edx
cmpl %ecx, %edx
jl .LBB11_2

------------------------------------

D02

Olden benchmarks are quite memory-based, so the first way to optimize them is to reduce the number of memory allocations. Also, in D a good basic optimization for this kind of programs is to convert classes that are instantiated many times, into structs.

A quick scan of the code shows that it contains a single instance of Em3d and BiGraph, and many instances of Node. So this version translated Node into a struct, allocated on the heap and managed by pointer. The code is almost the same. I have also reformatted code to 4 spaces.

Timings:

...$ time ./em3d02 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Propagating field values for 50 iteration(s)...
EM3D build time 3.39
EM3D compute time 1.40
EM3D total time 4.79
Done!
real 0m4.982s
user 0m4.416s
sys 0m0.472s

...$ time ./em3d02 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Propagating field values for 200 iteration(s)...
EM3D build time 3.35
EM3D compute time 5.84
EM3D total time 9.19
Done!
real 0m9.365s
user 0m8.821s
sys 0m0.460s

The code is faster, but there's a lot to do still. The code is not good yet, there are several inefficiencies.

------------------------------------

D03

Timings:

I have converted all the classes (Random, BiGraph and Em3d) to structs with minimal performance changes. So the problem is (as expected) elsewhere.
I have reformatted the Java comments to produce a more compact code that (despite losing being fit for ddoc) allows me to understand the code better.

I have renamed some Node fields to better names, because better names allow to understand code better:
toNodes => outbound
fromNodes => inbound
fromCount => n_inbound

Now the main() is a independent function.

Then I have disabled the GC just before the graph creation, and enabled again after that.

I have also added extra time printings inside the graph creation method, to study better where the running time goes.

Then I have added an exit(0) at the end of the main, it kills the program quickly, saving the time needed to call destructors and to free the GC memory. In general this is not a safe thing to do, because destructors may be necessary, but in this program this is OK.

Timings:

...$ time ./em3d03 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
0.14 0.94 0.48 0.43
Propagating field values for 50 iteration(s)...
EM3D build time 1.99
EM3D compute time 1.36
EM3D total time 3.35
Done!
real 0m3.530s
user 0m2.920s
sys 0m0.524s

...$ time ./em3d03 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
0.15 0.93 0.48 0.43
Propagating field values for 200 iteration(s)...
EM3D build time 1.99
EM3D compute time 5.80
EM3D total time 7.79
Done!
real 0m8.012s
user 0m7.348s
sys 0m0.540s

As expected the compute time is unchanged, but the build time is decreased significantly.
Now the build time is lower than the Java2 code! So the first benchmark (with just 50 loops) is faster than the Java2 version.

This is the asm of inner loop of computeNewValue() compiled with LDC:

.LBB10_2:
movl 24(%eax), %esi
movl (%esi,%edx,4), %esi
movl 32(%eax), %edi
movsd (%edi,%edx,8), %xmm1
mulsd (%esi), %xmm1
subsd %xmm1, %xmm0
movsd %xmm0, (%eax)
incl %edx
cmpl %ecx, %edx
jl .LBB10_2

------------------------------------

D04

I have renamed a method:
BiGraph.compute() => graph.computeStep();

And I have added a test for the sanity of the input numDegree:
if (numDegree > numNodes) {...

The program allocates a linked list of nodes, and then doesn't change that topology any more, just iterates on them. Both allocating many single nodes, and iterating on a linked list, today are slow operations (the original C code was old). So today it's much more efficient to allocated many or all Nodes in an array (this program contains a bipartite graph, so there are two arrays).

So I can also remove the "next" pointer field of Node, that was used for the links of the list.

In this D version the arcs are kept as pointers. Now the opApply isn't needed (this increases speed because I now loop on arrays, but loses some encapsulation).

The fillTable() method becomes the createTable() because it allocates an array of the nodes.

Timings:

...$ time ./em3d04 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
0.13 0.94 0.48 0.40
Propagating field values for 50 iteration(s)...
EM3D build time 1.95
EM3D compute time 1.18
EM3D total time 3.13
Done!
real 0m3.326s
user 0m2.696s
sys 0m0.540s

...$ time ./em3d04 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
0.13 0.92 0.46 0.40
Propagating field values for 200 iteration(s)...
EM3D build time 1.91
EM3D compute time 4.35
EM3D total time 6.26
Done!
real 0m6.443s
user 0m5.908s
sys 0m0.432s

The build time is not changed, I don't know why, probably disabling the GC allows for an efficient enough memory allocation. But now the compute time is decreased, because the program can scan the arrays more efficiently.

If a program like this gets used in practice, the build time is not important, because the number of iterations is probably large, so the compute time has to be as small as possible.

Now both timings are better than the Java2 timings.

This is the asm of inner loop of computeNewValue() compiled with LDC:

.LBB9_2:
movl 12(%eax), %esi
movl (%esi,%edx,4), %esi
movl 20(%eax), %edi
movsd (%edi,%edx,8), %xmm1
mulsd (%esi), %xmm1
subsd %xmm1, %xmm0
movsd %xmm0, (%eax)
incl %edx
cmpl %ecx, %edx
jb .LBB9_2

------------------------------------

D05

The CPU is able to use the pointers among Nodes in a quick way, but the number of nodes is probably limited. The program can increase its speed if there is less data traffic through the CPU cache. So I have replaced the pointers by 16 bit (ushort) indexes. This allows up to 2^16 nodes. If you need more, you can replace:
alias ushort TyIndex;
With:
alias uint TyIndex;
But this may produce code slower than D04.

Timings:

...$ time ./em3d05 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
0.06 0.94 0.38 0.42
Propagating field values for 50 iteration(s)...
EM3D build time 1.80
EM3D compute time 1.00
EM3D total time 2.80
Done!
real 0m2.937s
user 0m2.540s
sys 0m0.324s

...$ time ./em3d05 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
0.07 0.95 0.37 0.39
Propagating field values for 200 iteration(s)...
EM3D build time 1.78
EM3D compute time 4.02
EM3D total time 5.80
Done!
real 0m5.955s
user 0m5.508s
sys 0m0.360s

This is the asm of inner loop of computeNewValue() compiled with LDC:

.LBB9_2:
movl 20(%eax), %edi
movsd (%edi,%esi,8), %xmm1
movl 12(%eax), %edi
movzwl (%edi,%esi,2), %edi
imull $40, %edi, %edi
mulsd (%edx,%edi), %xmm1
subsd %xmm1, %xmm0
movsd %xmm0, (%eax)
incl %esi
cmpl %ecx, %esi
jb .LBB9_2

------------------------------------

D06

Now I look for possible algorithmic improvements.
makeUniqueNeighbors() contains stupid quadratic code to avoid duplicating Neighbors. I have tried to use a D associative array, and then a set implemented as a bit vector (using bt/bts intrinsics), but the most efficient seems a set of booleans implemented with just an array of ubytes.

I have renamed:
coeffs => inCoeffs

Timings:

...$ time ./em3d06 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
0.07 0.11 0.40 0.39
Propagating field values for 50 iteration(s)...
EM3D build time 0.97
EM3D compute time 1.03
EM3D total time 2.00
Done!
real 0m2.145s
user 0m1.748s
sys 0m0.324s

...$ time ./em3d06 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
0.07 0.11 0.37 0.40
Propagating field values for 200 iteration(s)...
EM3D build time 0.95
EM3D compute time 4.11
EM3D total time 5.06
Done!
real 0m5.231s
user 0m4.804s
sys 0m0.352s

The build time for 5000 nodes with 300 arcs is half as before.

This is the asm of inner loop of computeNewValue() compiled with LDC:

.LBB9_2:
movl 20(%eax), %edi
movsd (%edi,%esi,8), %xmm1
movl 12(%eax), %edi
movzwl (%edi,%esi,2), %edi
imull $40, %edi, %edi
mulsd (%edx,%edi), %xmm1
subsd %xmm1, %xmm0
movsd %xmm0, (%eax)
incl %esi
cmpl %ecx, %esi
jb .LBB9_2

------------------------------------

D07

Now the most common optimizations are done, and the code looks good enough. To improve the code some more we have to think more about the CPU cache.

The computations (of "compute time") are done on an array of large structs Node. But only few fields are actually necessary for such computation: value, inbound and inCoeffs (total 6 words), so if I remove the useless data, the iterations on the array will be (hopefully) faster.

I don't know how to remove items from the struct. The simpler way to solve this problem is to define a second Node struct, a Node2, with just the essential fields. Creating the two arrays of Node2 is fast, because all arrays of inbound and inCoeffs are just copied by reference.

So I add the eNodes2 and hNodes2 dynamic arrays of Node2 to BiGraph (now encapsulation looks gone, but in practice I don't think the original design was good. The only data structure that has to know about a collection of Nodes2 has to be BiGraph, and not Node2 itself).

There's some redundancy too here in a Node2: the inbound and inCoeffs arrays have the same length, so the Node2 can use just 5 words. There are few ways to remove this field, but the first field (the value) is a double, and for max performance it must be aligned to 8 bytes. So I have done few experiments, but they are failed, the program was slower, with a small memory usage reduction. So I have kept all the 6 words.

Timings:

...$ time ./em3d07 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.08 0.10 0.37 0.39 0.00
Propagating field values for 50 iteration(s)...
EM3D build time 0.94
EM3D compute time 0.92
EM3D total time 1.86
Done!
real 0m1.993s
user 0m1.600s
sys 0m0.316s

...$ time ./em3d07 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.07 0.10 0.39 0.40 0.00
Propagating field values for 200 iteration(s)...
EM3D build time 0.96
EM3D compute time 3.62
EM3D total time 4.58
Done!
real 0m4.718s
user 0m4.292s
sys 0m0.360s

The build time of the graph is the same, but the computing time is decreased.
The time needed to build the arrays of Node2 is very small.

This is the asm of inner loop of computeNewValue() compiled with LDC:

.LBB9_2:
movl 20(%eax), %edi
movsd (%edi,%esi,8), %xmm1
movl 12(%eax), %edi
movzwl (%edi,%esi,2), %edi
imull $24, %edi, %edi
mulsd (%edx,%edi), %xmm1
subsd %xmm1, %xmm0
movsd %xmm0, (%eax)
incl %esi
cmpl %ecx, %esi
jb .LBB9_2

-----------------------------------

D08

We can improve the code more taking a look at how hValues are accessed, a sequential access is faster. But I've seen that the inbound[i] are already sorted.

So I have to take a better look at the very small loop that performs one of the two halves of a computation step:
void computeNewValue(Node2[] otherTable) {
  for (int i; i < inbound.length; i++)
    value -= inCoeffs[i] * otherTable[inbound[i]].value;
}

There's not much that can be improved here. I have drawn on paper the simple data structures involved here, and with that I've seen that inbound[i] performs (forward but) large jumps in an array of largish structs (6 words each) to find the "value" fields. This forces the CPU cache to a lot of traffic. If we reduce this traffic the iterations will get faster.

The simple way to do this is to pull the values out of the nodes2 and put them in a uniform array, that's used in parallel to the Node2 array.
// Node2 of D07:
struct Node2 {
    double value;
    Node.TyIndex[] inbound;
    double[] inCoeffs;
}

// Node2 of D08:
struct Node2 {
    Node.TyIndex[] inbound;
    double[] inCoeffs;
    static double[] eValues, hValues;
}

Now the jumps are performed on arrays (eValues and hValues) with just 2 words/item.

To perform a better computation I have had to split computeNewValue() into eValuesStep() and hValuesStep(). I have pulled the outer loop like foreach(j, ref n; hNodes2) inside them, this may help the DMD compiler too, because it doesn't inline functions/methods that have a loop inside.

I also have had to add a "aux_val" auxiliary variable inside the eValuesStep/hValuesStep functions, because it seems LDC/LLVM was not able to pull the access to Node2.hValues[j] out of the loop. I don't know why [see below, the explanation after D01b].

Timings:

...$ time ./em3d08 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.07 0.10 0.37 0.40 0.00
Propagating field values for 50 iteration(s)...
EM3D build time 0.94
EM3D compute time 0.74
EM3D total time 1.68
Done!
real 0m1.817s
user 0m1.416s
sys 0m0.332s

...$ time ./em3d08 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.07 0.10 0.37 0.39 0.00
Propagating field values for 200 iteration(s)...
EM3D build time 0.93
EM3D compute time 2.96
EM3D total time 3.89
Done!
real 0m4.038s
user 0m3.628s
sys 0m0.336s

The compute time is decreased enough.

This is the asm of inner loop of eValuesStep() compiled with LDC:

.LBB9_4:
movzwl (%esi,%edi,2), %ebp
movsd (%edx,%edi,8), %xmm1
mulsd (%ebx,%ebp,8), %xmm1
subsd %xmm1, %xmm0
incl %edi
cmpl %ecx, %edi
jb .LBB9_4

------------------------------------

D09

There's one last possible optimization of the memory usage that I can see. I can remove again the redundancy from Node2, because now it doesn't contain a double, and now probably I can use 3 32-bit words.

The new Node2 is:
struct Node2 {
  Node.TyIndex* inbound;
  double* inCoeffs;
  int inbound_len;
  static double[] eValues, hValues;
}

Timings:

...$ time ./em3d09 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.08 0.10 0.37 0.39 0.00
Propagating field values for 50 iteration(s)...
EM3D build time 0.94
EM3D compute time 0.74
EM3D total time 1.68
Done!
real 0m1.821s
user 0m1.444s
sys 0m0.296s

...$ time ./em3d09 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.07 0.11 0.37 0.39 0.00
Propagating field values for 200 iteration(s)...
EM3D build time 0.94
EM3D compute time 2.95
EM3D total time 3.89
Done!
real 0m4.030s
user 0m3.632s
sys 0m0.332s

The computing time is about the same. The memory saving compared to D08 is very small, something like 40 KB.

It may be possible to vectorize the computing loop, but I stop here.

Compared to the Java2 version the speed of this last D09 version is not much higher, because the computing loop is very tight and the JavaVM is able to optimize it very well, quite better than LDC, see the compute time of Java2 compared to the compute time of D01: 4.35 compared to 5.97. A shame for D/LDC/LLVM :-)

Other Olden benchmarks show a bigger gap between the performance of the Java and optimized D code.


This is the asm of inner loop of eValuesStep() compiled with LDC:

.align 16
.LBB9_4:
movzwl (%esi,%edi,2), %ebp
movsd (%edx,%edi,8), %xmm1
mulsd (%ebx,%ebp,8), %xmm1
subsd %xmm1, %xmm0
incl %edi
cmpl %ecx, %edi
jl .LBB9_4

------------------------------------

D10

Update 1.2, Oct 27 2006:

There is another simple performance optimization that can be done. The memory for the Nodes and their arcs can be allocated from the C heap (and in some cases such memory can even be allocated with malloc instead of calloc, but in this programs this has no significant performance difference).

The memory coming from the C heap is usually aligned to 4 bytes, so it's not fit for storing values and coefficients that in this program are doubles, that are used much more efficiently when aligned to 8 bytes. But in this program the actual computations performed on double FP numbers are done on arrays allocated in Node2, that are allocated from the D GC heap that returns memory aligned to 16 bytes. The only problem may come from Node.inCoeffs that's copied as is to Node2 inCoeffs. So this allocation needs a little of extra care (see makeFromNodes()).

I can even allocate all "outbound" arrays at once, because they are all of the same length, but this change will not change performance significantly, because that time is part of the first timing of the "Detailed creation timings:", that's about 0.03-0.04 seconds for 5000 nodes.

I have moved the disable() of the GC below, when the computing loops are done, because there's no point in keeping the GC active during those loops, that produce no garbage.

Timings:

...$ time ./em3d10 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.04 0.10 0.14 0.19 0.00
Propagating field values for 50 iteration(s)...
EM3D build time 0.47
EM3D compute time 0.60
EM3D total time 1.07
Done!
real 0m1.203s
user 0m0.968s
sys 0m0.148s

...$ time ./em3d10 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.03 0.10 0.16 0.18 0.00
Propagating field values for 200 iteration(s)...
EM3D build time 0.47
EM3D compute time 2.46
EM3D total time 2.93
Done!
real 0m3.061s
user 0m2.844s
sys 0m0.124s

Not just (as expected) the build times are lower, but the compute ones too are lower, I don't know why, maybe the memory from the C heap has more coherence (more contiguous, reducing cache misses. The GC adds some spaces, because it returns memory blocks aligned to 16 bytes).

------------------------------------

D11

Update 1.2, Oct 27 2006:

To reduce the time used by makeFromNodes(), and hopefully to increase the computation loops (with a better cache coherence) we can cut the inbound and inCoeffs arrays of Node (later copied into Node2) from a large memory block allocated all at once (again, doubles are better alighed to 8 bytes) to do this I have created two memory arenas, with the DoubleArena and IndexArena structs.

Timings:

...$ time ./em3d11 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.04 0.10 0.00 0.42 0.00
Propagating field values for 50 iteration(s)...
EM3D build time 0.56
EM3D compute time 0.58
EM3D total time 1.14
Done!
real 0m1.271s
user 0m1.028s
sys 0m0.164s

...$ time ./em3d11 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.04 0.09 0.00 0.43 0.01
Propagating field values for 200 iteration(s)...
EM3D build time 0.57
EM3D compute time 2.23
EM3D total time 2.80
Done!
real 0m2.961s
user 0m2.652s
sys 0m0.184s

...$ time ./em3d11 -n 5000 -d 300 -i 2000 -m
Initializing em3d random graph...
Detailed creation timings: 0.03 0.11 0.00 0.43 0.00
Propagating field values for 2000 iteration(s)...
EM3D build time 0.57
EM3D compute time 23.39
EM3D total time 23.96
Done!
real 0m24.621s
user 0m23.869s
sys 0m0.144s


Now the build time is a little higher (I don't know why), while the computing loops are faster (I don't know why), so when the number of iterations is 200 the total time is lower.

In the detailed creation timings you can also see that makeFromNodes() is now very fast, while updateFromNodes() is quite slower (I don't know why. Here DMD is twice faster).

This version also needs a little less RAM compared to D10, about 39 MB with -n 5000 -d 300. Java3 needs about 197 MB for the same graph.

Now the D11 program is fast enough, and uses a low enough memory, that the limit of 65000 nodes given by the ushort indexes can be felt. Using an uint as TyIndex the memory used by D11 with -n 5000 -d 300 is 51 MB.

Timings of D11 with TyIndex=uint:

...$ time ./em3d11 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.09 0.11 0.00 0.48 0.00
Propagating field values for 50 iteration(s)...
EM3D build time 0.68
EM3D compute time 0.68
EM3D total time 1.36
Done!
real 0m1.525s
user 0m1.220s
sys 0m0.200s

...$ time ./em3d11 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.06 0.12 0.00 0.50 0.00
Propagating field values for 200 iteration(s)...
EM3D build time 0.68
EM3D compute time 2.71
EM3D total time 3.39
Done!
real 0m3.597s
user 0m3.252s
sys 0m0.184s

Using such larger indexes the running time is not much higher, but in this case using just pointers is probably faster.

This is the asm of inner loop of eValuesStep() of D11 compiled with LDC, when TyIndex=uint:

.LBB14_4:
movl (%esi,%edi,4), %ebp
movsd (%edx,%edi,8), %xmm1
mulsd (%ebx,%ebp,8), %xmm1
subsd %xmm1, %xmm0
incl %edi
cmpl %ecx, %edi
jl .LBB14_4

------------------------------------

D11b
(lateral branch)

Update 1.2, Oct 27 2006:

When the number of nodes is high, instead of storing ushort indexes, I can store ushort deltas. And because the indexes are ordered such deltas are never negative. If links are well spread, then such deltas are usually small, and they can be used to index far more than 2^16 nodes. If such links are randomly distriuited, you can probably manage millions of nodes in a safe enough way, and it's easy to add a runtime test when they are created to be sure they don't overflow.

I have done a test, using the usual -n 5000 -d 300 arguments with the same seed=783, in such case the maximum delta is 240, so just one ubyte suffices! ubyte indexes reduce traffic through the cache, so they may speed up the program a little more. The D11b uses such TyDelta in Node2 but I have kept usign TyIndex in Node to keep the program simpler.

Timings of D11b with TyIndex=ushort, TyDelta=ubyte:

...$ time ./em3d11b -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.04 0.10 0.00 0.42 0.02
Propagating field values for 50 iteration(s)...
EM3D build time 0.59
EM3D compute time 0.56
EM3D total time 1.15
Done!
real 0m1.266s
user 0m1.032s
sys 0m0.156s

...$ time ./em3d11b -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.04 0.11 0.00 0.42 0.02
Propagating field values for 200 iteration(s)...
EM3D build time 0.59
EM3D compute time 2.20
EM3D total time 2.79
Done!
real 0m2.968s
user 0m2.676s
sys 0m0.160s

...$ time ./em3d11b -n 5000 -d 300 -i 2000 -m
Initializing em3d random graph...
Detailed creation timings: 0.03 0.12 0.00 0.48 0.02
Propagating field values for 2000 iteration(s)...
EM3D build time 0.65
EM3D compute time 21.98
EM3D total time 22.63
Done!
real 0m23.252s
user 0m22.477s
sys 0m0.204s

The iteration is a bit faster, but the build time of Nodes2 is a little slower (by about 0.02 seconds). So for 200 iterations the running time is about the same, while for 2000 you can see some difference. For TyDelta==ubyte this D11b program is not very useful (unless you have few nodes), but for larger graphs such delta encoding can be useful to keep using ushorts even when there are more than 2^16 nodes.


Timings of D11b with TyIndex=uint, TyDelta=ushort:

...$ time ./em3d11b -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.04 0.11 0.00 0.44 0.02
Propagating field values for 50 iteration(s)...
EM3D build time 0.61
EM3D compute time 0.56
EM3D total time 1.17
Done!
real 0m1.309s
user 0m1.032s
sys 0m0.180s

...$ time ./em3d11b -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.05 0.10 0.00 0.43 0.02
Propagating field values for 200 iteration(s)...
EM3D build time 0.60
EM3D compute time 2.20
EM3D total time 2.80
Done!
real 0m2.997s
user 0m2.684s
sys 0m0.176s

D11b with TyDelta=ushort is just a little slower than D11b with TyDelta=ubyte, so in most situations it can be enough for a number of nodes >> 2^16 if the arcs are spread in an uniform random way.

This em3d program is just a benchmark, but in a real program you want to add class invariants, unittests, several safeties for the memory, and ways to deallocate memory when not used any more, etc.

------------------------------------

D12
(this ignores the D11b branch)

Update 1.3, Oct 27 2006:

The items of inbound and inCoeffs are accessed in parallel, so it can be useful for them to be close to each other. Ideally the items of inbound and inCoeffs are better kept in the same cache line, that's a block of 64 bytes (and I think it has to be aligned too). But I can't put them into a struct because it wastes a lot of space (such struct has to be 16 bytes long to keep the double it contains aligned to 8 bytes), and wasting space increases cache misses.

If I want to pack pairs of double + ushort in a struct that inside an array is aligned to 8 bytes, I need to pack 4 doubles and 4 ushorts, this needs 8*4 + 2*4 = 32 + 8 = 40 bytes. But this straddles a cache line, and I think this is negative. So to pack as much as possible in a cache line I can use 6 pairs, that need 8*6 + 2*6 = 48 + 12 = 60 bytes, that needs 4 bytes of padding, this wastes 6.25% of space, I think it's acceptable. I can define it like this:
struct PairsPack {
  static if (TyIndex.sizeof == 2)
    const int NPAIRS = 6; // 8*6 + 2*6 + 4 = 48 + 12 + 4 = 64
  else static if (TyIndex.sizeof == 4)
    const int NPAIRS = 5; // 8*5 + 4*5 + 4 = 40 + 20 + 4 = 64
  else
      static assert(0);

  double[NPAIRS] weights;
  TyIndex[NPAIRS] inbounds;
  uint _padding;
}

Just to be sure I'll allocate the array of PairsPack aligned to 64 bytes.

But generally the number of inbound arcs in a node isn't a multiple of 6, so I need to waste some of those pairs, or use part of an already used PairsPack. Both solutions have advantages and disadvantages. I choose the option to not waste memory, this requires a little more complex indexing of the sub-part of a PairsPack, but allows me to allocate the array of PairsPack in a simpler way.

This D12 was the hardest version to create, the code is quite hairy, and hard to maintain and understand. And it's slow (the Range!() that I have used is not necessary. LLVM was able to unroll that loop inside eValuesStep/hValuesStep by itself).

Timings:

...$ time ./em3d12o -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.04 0.11 0.00 0.43 0.19
Propagating field values for 50 iteration(s)...
EM3D build time 0.78
EM3D compute time 0.68
EM3D total time 1.46
Done!
real 0m1.614s
user 0m1.240s
sys 0m0.284s

...$ time ./em3d12o -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.04 0.11 0.00 0.41 0.20
Propagating field values for 200 iteration(s)...
EM3D build time 0.76
EM3D compute time 2.65
EM3D total time 3.41
Done!
real 0m3.623s
user 0m3.176s
sys 0m0.308s


The asm of inner loop of eValuesStep() of D11 compiled with LDC:

.LBB14_4:
movzwl 50(%eax), %ebp
movsd 8(%eax), %xmm1
mulsd (%esi,%ebp,8), %xmm1
movzwl 48(%eax), %ebp
movsd (%eax), %xmm2
mulsd (%esi,%ebp,8), %xmm2
subsd %xmm2, %xmm0
subsd %xmm1, %xmm0
movzwl 52(%eax), %ebp
movsd 16(%eax), %xmm1
mulsd (%esi,%ebp,8), %xmm1
subsd %xmm1, %xmm0
movzwl 54(%eax), %ebp
movsd 24(%eax), %xmm1
mulsd (%esi,%ebp,8), %xmm1
subsd %xmm1, %xmm0
movzwl 56(%eax), %ebp
movsd 32(%eax), %xmm1
mulsd (%esi,%ebp,8), %xmm1
subsd %xmm1, %xmm0
movzwl 58(%eax), %ebp
movsd 40(%eax), %xmm1
mulsd (%esi,%ebp,8), %xmm1
subsd %xmm1, %xmm0
addl $64, %eax
decl %ebx
jne .LBB14_4

------------------------------------

Java3

Now I can think porting back some of the optimizations I have implemented in D to Java. Java don't uses structs, so several things are not possible.

In this Java3 I have used the uniqueNeighborsSet set, implemented as the D code.

Timings:

...$ time java -server Em3d3 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Propagating field values for 50 iteration(s)...
EM3D build time 1.681
EM3D compute time 1.14
EM3D total time 2.821
Done!
real 0m3.138s
user 0m2.600s
sys 0m0.416s

...$ time java -server Em3d3 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Propagating field values for 200 iteration(s)...
EM3D build time 1.649
EM3D compute time 4.29
EM3D total time 5.943
Done!
real 0m6.239s
user 0m5.656s
sys 0m0.452s

More optimizations may be ported from D to Java.

------------------------------------

D01b
(lateral branch)

Looking at the asm of the inner loop of computeNewValue for D01 and D09, I think the D01 can be improved with an auxiliary variable:
void computeNewValue() {
  auto aux = this.value;
  for (int i = 0; i < fromCount; i++)
    aux -= coeffs[i] * fromNodes[i].value;
  this.value = aux;
}

Timings:

...$ time ./em3d01b -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Propagating field values for 50 iteration(s)...
EM3D build time 4.89
EM3D compute time 1.28
EM3D total time 6.17
Done!
real 0m6.335s
user 0m5.804s
sys 0m0.464s

...$ time ./em3d01b -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Propagating field values for 200 iteration(s)...
EM3D build time 4.92
EM3D compute time 5.23
EM3D total time 10.15
Done!
real 0m10.601s
user 0m10.105s
sys 0m0.440s

LDC/LLVM must be able to perform that optimization (pulling this.value out of the loop) by itself.

This is the asm of inner loop of computeNewValue() compiled with LDC:

.LBB11_2:
movl (%edx,%edi,4), %ebx
movsd (%esi,%edi,8), %xmm1
mulsd 8(%ebx), %xmm1
subsd %xmm1, %xmm0
incl %edi
cmpl %ecx, %edi
jl .LBB11_2

This asm is very short, but it's quite slower than D09 anyway. I don't know what kind of optimizations are done by the JavaVM on the Java1 code (there is a way to look at the asm produced by the JVM, but it's not handy).


Update 1.1, Oct 26 2006: now I think I know why LDC isn't able to perform that optimization in the computeNewValue() loop: the compiler doesn't know that "value" on the right is always distinct from "value" on the left, because this is a bipartite graph, where no self-loops are allowed. So it can't pull the "value" of the left out of the loop.

To perform the optimization the compiled need more semantics. Future programming languages may allow the programmer to give such semantics to the compiler.

I have done a similar optimization (pulling "value" out of the loop) in the Java3 version, with no change in performance, so probably the JavaVM is somehow able to perform that optimization.

------------------------------------

Python1

I have created a Python version too, that uses Psyco. It's similar to the Java3 version.
I have seen that the faster unique_neighbors_set is using just a Python set.

I have added __slots__ to both Random and Node, this doubles the performance.

Timings:

...$ time python em3d.py -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.14 3.86 0.16 4.32
Propagating field values for 50 iteration(s)...
EM3D build time 8.48
EM3D compute time 54.70
EM3D total time 63.19
Done!
real 1m4.320s
user 0m59.628s
sys 0m3.876s

I have used:
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
Psyco 1.6.0 final

------------------------------------

Python2

While improving the D code, and especially when I have created the Python1 version, I have felt the code as too much complex for the simple operations it performs. So I have reduced the Python code, removing useless parts. The Python2 version is almost the shortest Python code that produces the same output (there are ways to shorten it a little more, but I am not doing code golf here, my purposes are different).

The Python2 code is 67 lines long, while the Java3 is 481 lines including the comments.

------------------------------------

Python3

Then I have replaced the random() and sample() with ones from the Python standard library, I have removed the printing code, and I have reorganized the code a little more.

The result is short enough (21 nonempty lines) to be shown here too:
from random import random, sample

class Node:
    def __init__(self, value):
        self.value = value
        self.in_arcs = []

n_nodes, in_degree, n_steps = 15, 5, 5

h_nodes = [Node(random()) for i in xrange(n_nodes)]
e_nodes = [Node(random()) for i in xrange(n_nodes)]

def make_in_arcs(nodes, other_nodes, in_degree):
    for n1 in nodes:
        for n2 in sample(other_nodes, in_degree):
            n2.in_arcs.append((n1, random()))

make_in_arcs(h_nodes, e_nodes, in_degree)
make_in_arcs(e_nodes, h_nodes, in_degree)

def half_step(nodes):
    for n1 in nodes:
        for n2, weight in n1.in_arcs:
            n1.value -= weight * n2.value

for i in xrange(n_steps):
    half_step(e_nodes)
    half_step(h_nodes)

This code shows the essential, and for me it's much simpler to follow and understand. This code may be conveted to C/D again, if necessary. The original C code was harder to understand, and much more complex. Removing the noise and useless parts allows me to understand the algorithm better, this is a simple algorithm. Even if you don't understand this code immediately, the complexity left is necessary, it's not spurious. Generally good programs aren't the clever and complex ones, but ones that look "obvious". This Python code shows that lot of the complexity of the original Java code was spurious.

------------------------------------

Python4

This is similar to the Python3 version, I have just encoded a node in a simpler way, using a single list. The first item is the "value" and all the following ones are pairs of node-weight. This code is just 17 nonempty lines long. For me Python3 version is more readable.

------------------------------------
comments: Leave a comment Add to Memories Tell a Friend

Tags:, , , , , , ,
Subject:Updates and links
Time:11:25 pm
"Boostbench", a benchmark, in C, Java and D, the zip contains all the code, timings and information:
http://www.fantascienza.net/leonardo/js/boostbench.zip
Original code by Rene Grothmann:
http://mathsrv.ku-eichstaett.de/MGF/homes/grothmann/java/bench/

With the LDC compiler the performance is poor without Link-Time Optimization plus Interning, I don't know why. (If someone is able to tell me I'll be interested to know.)

-------------------

Theo Jansen demonstrates the amazingly lifelike kinetic sculptures he builds from plastic tubes and lemonade bottles, and then lets walk and live on a beach. They have even a brain made of bottles. Windosaurs, etc:
http://www.youtube.com/watch?v=b694exl_oZo
comments: Leave a comment Add to Memories Tell a Friend

Tags:, , , , , , ,
Subject:Jarvis March in D
Time:03:17 pm
Robert C. Martin in his blog has written an implementation of the Jarvis March algorithm in Clojure, and was looking for ways to speed up the code:
http://blog.objectmentor.com/articles/2009/08/11/jarvis-march-in-clojure

See also:
http://www.butunclebob.com/ArticleS.UncleBob.ConvexHullTiming

So I have translated the Java code to D (and later C), and then I have done some timing tests.

The Java results are quite good compared to the D ones.

Number of points in the hull in all tests is just 14.
Timings, n = 2_000_000:

On Windows XP, best of 6, seconds:
  D 1:    1.88 (DMD compiler)
  Java 1: 1.83
  C 2:    1.55 (GCC compiler)
  Java 2: 1.55
  Java 1: 1.47 (-server)
  Java 2: 1.47 (-server -Xms40M)
  C 2:    1.46 (LLVM-GCC compiler)
  Java 2: 1.45 (-server)
  D 2:    1.32 (DMD compiler)

On Pubuntu, best of 6, seconds:
  Java 1: 2.71
  C 2:    2.59 (GCC compiler)
  Java 1: 2.42 (-server)
  Java 2: 2.39
  Java 2: 2.37 (-server)
  D 1:    2.06 (LDC compiler)
  D 1:    2.06 (LDC compiler, LTO+I)
  D 2:    2.02 (LDC compiler, LTO+I)  
  D 2:    2.00 (LDC compiler)
Notes:
- The code of the D 1 version is similar to the Java 1 version. The code of the D 2 version is similar to the Java 2 version and C 2 version.
- On Pubuntu all timings are increased because it has a slower access to memory.
- I have used a (low quality) portable rnd generator to assure equal test cases on all compilers and operating systems. (And because while the D std library Tango has a gaussian generator, the Phobos std lib of Dv.1 doesn't have it).
- Currently the seed of the random generator can't be changed.
- All the programs give the same output, but the results aren't equal up to the last floating point digits anyway because different FP instructions produce sligtly different results. FP numbers are approximations.
- The C code is slower, I don't know why.
- The D 2 code is much faster than D 1 with DMD because DMD has limited inlining capabilities (this is why I have created the D 2 version).
- In the D code the abs function is not taken from Tango because LDC has a performance bug, and otherwise it's not able to inline it.
- For the LTO+I see below.

More info and all the code can be found here:
http://www.fantascienza.net/leonardo/js/jarvism.zip
comments: Leave a comment Add to Memories Tell a Friend

Tags:, , , , , ,
Time:08:10 pm
I have improved the code of the Sphereflake raytracer and thanks to Tomas Lindquist Olsen I have fixed my LDC installation, so there I have redone the benchmarks on Pubuntu:
Timings on Pubuntu, w=h=1024, lvl=6, best of 3, seconds:
(66_430 spheres, WITH_SHADOWS=true, FASTER_LDC=true)
  D:    4.33  (claiming 4.6 MB) (first fast version + LTO+I)
  D:    4.51  (claiming 2.5 MB) (second fast version + LTO+I)
  D:    4.55  (claiming 4.6 MB) (basic version + LTO+I)
  D:    4.70  (claiming 2.5 MB) (second fast version)
  D:    4.72  (claiming 4.6 MB) (first fast version)
  D:    4.97  (claiming 4.6 MB) (basic version)
  C++:  5.04  (claiming 4.6 MB) (+4 bytes padding)
  C++:  5.75  (claiming 4.3 MB) (original version)

Timings on Pubuntu, w=h=1024, lvl=7, best of 3, seconds:
(597_871 spheres, WITH_SHADOWS=true, FASTER_LDC=true)
  D:    5.51  (claiming 23 MB) (second fast version + LTO+I)
  D:    5.74  (claiming 41 MB) (first fast version + LTO+I)

Timings on Pubuntu, w=h=1024, lvl=8, best of 3, seconds:
  D:   10.84  (claiming 205 MB) (second fast version + LTO+I)
  D:   11.21  (claiming 205 MB) (second fast version)
  D:   23.34  (claiming 369 MB) (first fast version + LTO+I)
  D:   23.84  (claiming 369 MB) (basic version + LTO+I)
  D:   24.06  (claiming 369 MB) (first fast version)
  D:   24.55  (claiming 369 MB) (basic version)
  C++: 24.61  (claiming 369 MB) (+4 bytes padding)

For more information, and for the updated Sphereflake code:
http://www.fantascienza.net/leonardo/js/index.html#flake

----------------

I have added a new benchmark, a sparse matrix multiplication, this time the performance of D code (with LDC) isn't much good. Info and timings inside the zip:
http://www.fantascienza.net/leonardo/js/index.html#jaspa
comments: Leave a comment Add to Memories Tell a Friend

Tags:, , ,
Subject:Linpack benchmark in D
Time:01:32 pm
I have translated a C Linpack benchmark to D, and I have timed it compared to the C and a similar Java version.

Some of the D Link-Time optimized (+ interning) versions segfault (binaries produced by DMD here never segfault). The C code compiled with GCC with -msse3 -march=core2 compilation arguments too segfaults (that's why I have used llvm-gcc that has never produced a segfault), I don't know why, it may be a pointer aliasing problem.

For bigger matrices Java shows to be about as fast as gcc-C and LDC-D. Both float and real versions are very slow. Static arrays are a bit faster with LDC but not much, I don't know why.

See inside the zip for the full timings:
http://www.fantascienza.net/leonardo/js/linpack.zip
comments: Leave a comment Add to Memories Tell a Friend

Tags:, , , , ,
Subject:Richards benchmark
Time:11:57 pm
This old post of mine is about virtual methods and devirtualizations:
http://leonardo-m.livejournal.com/76547.html

I have performed more benchmarks about virtual methods and its costs with LDC. I have used a little well known benchmark, the Richards one. I have used the code from here (on the Web Archive because it seems to be not online anymore):
http://web.archive.org/web/20060715074131/lissett.port5.com/ben/bench1.htm
http://web.archive.org/web/20060715074131/http://lissett.port5.com/ben/bench3.htm

All the code of the following benchmarks:
http://www.fantascienza.net/leonardo/js/richards.zip
Timing results:

Windows, n = 10_000_000:
  C:       1.26
  D ~C:    2.01
  Java:    2.04 (final classes, -server)
  D2 ~C#:  2.36 (final classes)
  D3 ~C#:  2.36 (final classes, no getters/setters)  
  Java:    2.73 (final classes)
  D4 ~C#:  3.07 (no getters/setters)  
  C#:      3.98
  D1 ~C#:  4.23
  
Windows, n = 100_000_000:
  C:      12.16
  Java:   18.73 (final classes, -server)
  D ~C:   18.86
  D2 ~C#: 23.11 (final classes)
  D3 ~C#: 23.12 (final classes, no getters/setters)  
  Java:   25.40 (final classes)
  D4 ~C#: 30.16 (no getters/setters)    
  C#:     38.39
  D1 ~C#: 41.64


Pubuntu, n = 10_000_000:
  D ~C:    1.35
  C:       1.39
  D2 ~C#:  1.98 (final classes)  
  D3 ~C#:  2.00 (final classes, no getters/setters) 
  Java:    2.73 (final classes)  
  D4 ~C#:  2.94 (no getters/setters)      
  C#:      -  
  D1 ~C#:  4.03
  
Pubuntu, n = 100_000_000:
  D ~C:   13.24
  C:      13.77
  D2 ~C#: 19.64 (final classes)
  D3 ~C#: 19.92 (final classes, no getters/setters) 
  Java:   25.16 (final classes)    
  D4 ~C#: 29.16 (no getters/setters)       
  C#:      -
  D1 ~C#:  40.17

Key:
  D ~C# means D code that comes from the C# version.
  D ~C means D code that comes and looks from the C version.

Note that the classes in C# code aren't final. As usual Java shows very good performance.


On WindowsXp:
DMD Digital Mars D Compiler v1.042
gcc version 4.3.3-dw2-tdm-1 (GCC)

dmd used with:
dmd -O -release -inline

ldc used with:
ldc -O5 -release -inline

gcc used with:
gcc -Wall -O3 -s -fomit-frame-pointer -msse3 -march=core2


On Pubuntu:
gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4)
ldc based on DMD v1.045 and llvm 2.6svn (Thu Jul 2 23:07:48 2009)

ldc used with:
ldc -O5 -release -inline

gcc used with:
gcc -Wall -O3 -s -fomit-frame-pointer -msse3 -march=native
comments: 8 comments or Leave a comment Add to Memories Tell a Friend

Tags:, , , , , ,
Subject:SciMark 2.0 in C and D
Time:11:08 pm
SciMark 2.0 is a Java benchmark for scientific and numerical computing, developed by Roldan Pozo and Bruce R Miller:
http://math.nist.gov/scimark2/index.html

On my PC the applet (run locally) scores 388, I have a Core2 at 2 Ghz. (The site shows that some people have scored 7300, I don't know how).

The site also offers C code, that I have run and then ported to D, for benchmarks (it can also be possible to port the Java code to D, but probably the C code leads to higher performance). (Update1: I have converted the Java code too to D).

Here you can find the C and D code (for Phobos and Tango standard libraries), plus all the detailed results of the benchmarks:
http://www.fantascienza.net/leonardo/js/scimark2.zip

Thanks to the latest LLVM backend the results of LDC are quite good, better than C compiled with GCC 4.2.1.

I have tested LDC on the same PC using Pubuntu, that is a "Portable Ubuntu", that leads to a performance a bit decreased compared to a native Ubuntu running on the same CPU.
Timings on WinXp 32 bit, 2.00 seconds min time:
               small   large
  D DMD:         299    207
  D DMD:         401    268 (OOP version)
  C GCC:         578    314
  C LLVM-GCC:    609    351

Timings on Pubuntu 32 bit, 2.00 seconds min time:
               small  large
  D LDC:        462     272 (OOP version)
  C GCC:        568     315
  D LDC:        575     327

CPU: Core2 at 2 Ghz, 2 GB RAM.

Compilers used on WindowsXp:
  gcc version 4.3.3-dw2-tdm-1 (GCC)
  gcc version 4.2.1 (Based on Apple Inc. build 5636) (LLVM build)

Compilation argumenents on WindowsXp:
  dmd -O -release -inline scimark2_d_phobos.d
  gcc -O3 -s -fomit-frame-pointer -msse3 -march=core2 scimark2_c.c -o scimark2_c_gcc
  llvm-gcc -O3 -s -fomit-frame-pointer -msse3 -march=core2 scimark2_c.c -o scimark2_c_llvm


Compilers uses on Pubuntu:
  gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4)
  LLVM D Compiler rev. based on DMD v1.045 and llvm 2.6svn (Mon Jun  1 22:54:33 2009)

Compilation argumenents on Pubuntu:
  ldc -O5 -release -inline scimark2_d.d
  gcc -O3 -s -fomit-frame-pointer -msse3 -march=native -lm scimark2_c.c -o scimark2_c

-----------------

Update 1, Jun 5 2009: added OOP D version translated from the Java code (the code is a bit raw, it's not an example of good code), plus its timings. The class-based version is slower with LDC but faster with DMD, I don't know why.

-----------------

Update Jul 17 2009:

I have used an updated LDC, based on DMD v1.045 and llvm 2.6svn (Mon Jul 13 06:47:53 2009).

Scores, without LTO:
Timings on Pubuntu 32 bit, 2.00 seconds min time:
  C GCC:        570     317
  D LDC:        570     326 (OOP version)
  D LDC:        595     317


Then I have tried to use Link-Time optimizations (not currently done by LDC yet), to activate them I have used:
ldc -O5 -release -inline -output-bc scimark2_d.d

opt -std-compile-opts scimark2_d.bc > sci.bc

llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread sci.bc


Scores, with LTO:
Timings on Pubuntu 32 bit, 2.00 seconds min time:
  C GCC:        570     317
  D LDC:        582     325 (OOP version)
  D LDC:        626     329


With LTO:
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:          625.55
FFT             Mflops:   514.53    (N=1024)
SOR             Mflops:   573.50    (100 x 100)
MonteCarlo:     Mflops:   305.04
Sparse matmult  Mflops:   672.16    (N=1000, nz=5000)
LU              Mflops:  1062.52    (M=100, N=100)


====================================================

Another update Jul 17 2009:

I have found a way to internalize the main correctly, this allows for more improvements during the LTO Phase, new score is 680!

Compile the code with:
ldc -O5 -release -inline -output-bc scimark2_d.d

opt -std-compile-opts scimark2_d.bc > scimark2_d_opt.bc

llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=scimark2_d scimark2_d_opt.bc
Scores, with LTO + internalize:
Timings on Pubuntu 32 bit, 2.00 seconds min time:
  D LDC:        612     348 (OOP version)
  D LDC:        680     369

**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:          680.44
FFT             Mflops:   520.67    (N=1024)
SOR             Mflops:   573.50    (100 x 100)
MonteCarlo:     Mflops:   568.12
Sparse matmult  Mflops:   677.37    (N=1000, nz=5000)
LU              Mflops:  1062.52    (M=100, N=100)
comments: 4 comments or Leave a comment Add to Memories Tell a Friend

Tags:, , , , ,
Current Location:programming
Subject:List deletions problem
Time:06:15 pm
Once in a while in my programs I need to keep an ordered sequence of items, process them all iteratively, and during each iteration remove few of them that aren't good anymore. So in each iteration only a small number of items is removed, and such removals happen in random positions.

I have written a small paper about how to solve this common programming problem in an efficient way, in Python+Psyco, D language and C:
http://www.fantascienza.net/leonardo/ar/list_deletions.html

The code of the paper:
http://www.fantascienza.net/leonardo/js/list_deletions.zip
comments: 2 comments or Leave a comment Add to Memories Tell a Friend

Tags:, , , , , , ,
Subject:Two towers problem and links
Time:07:05 pm
A simple paper of mine about a small programming problem, "Two towers":

http://www.fantascienza.net/leonardo/ar/two_towers/two_towers.html

The problem is: "Suppose that we are given n uniquely sized cubic blocks and that each block has a face area between 1 and n. Build two towers by stacking these blocks. How close can we get the heights of the two towers?"

I have written Python, Psyco, D, and C code, solving it for up to 41 blocks, using an optimized implementation of the brute-force enumerating algorithm. If you know a better algorithm or a better implementation I am interested to know aboit it.

-----------------

Few links:

April 2009 Molecule of the Month by David Goodsell, Oct and Sox Transcription Factors:
http://www.rcsb.org/pdb/static.do?p=education_discussion/molecule_of_the_month/pdb112_1.html


The book "Machine Learning: An Algorithmic Perspective" has lot of Python code examples that you can find here for free, among them there some interesting code (like a Kalman filter):
http://seat.massey.ac.nz/personal/s.r.marsland/MLBook.html


The paper about the Coherence language, it looks interesting:
http://coherence-lang.org/Onward09.pdf


"The Algebra of Data, and the Calculus of Mutation", shows algebra of types in a quite simple way:
http://blog.lab49.com/archives/3011
comments: Leave a comment Add to Memories Tell a Friend

Tags:, , , , , ,
Subject:Sum of file
Time:03:55 pm
The following is related to a small benchmark:
http://ikanobori.jp/weblog/2009/04/11/language-performance-on-the-sum-of-a-file/

I have added a small Python generator program that generates the testing data, "data.txt" of 17_643_996 bytes. The correct total of its numbers is 17_737_165_333.
Code: http://www.fantascienza.net/leonardo/js/sumfile.zip
Timings (faster program for each language):
  C:      0.15  ( 1   X) (GCC)
  D:      0.21  ( 1.4 X) (DMD)
  Java:   0.30  ( 2   X)
  Psyco:  0.45  ( 3   X)
  Python: 4.20  (28   X)
Notes:
- Python/Psyco programs work with numbers bigger than 2^64 too.
- The D program can be as fast as the C one, using a better compiler. DMD has a very old backend... Probably GDC or LDC give better timings.
- I have adapted the Psyco version from the C one, and it's surprisingly fast. Only three times slower than C for such benchmark is very good.
- The Psyco version uses an hardcoded file name (or the file name can be given in stdin). If you want it to take data from the stdin, then you have to use "python -u" (on Windows. I don't know on Linux).
- Currently ShedSkin can't be used for this benchmark, because it uses 32 bit numbers by default.
comments: Leave a comment Add to Memories Tell a Friend

Advertisement

leonardo
View:Recent Entries.
View:Archive.
View:Friends.
View:User Info.
View:Website (My Website).
You're looking at the latest 10 entries.
Missed some entries? Then simply jump back 10 entries