leonardo
View:Recent Entries.
View:Archive.
View:Friends.
View:User Info.
View:Website (My Website).
You're looking at the latest 10 entries.
Missed some entries? Then simply jump back 10 entries

Tags:, ,
Subject:Updates
Time:07:44 pm
Lot of time from the last update. I have spent lot of time improving the Scimark2 benchmark, see below.

In my page of junk software I have added two new D benchmarks, dolden_tsp and dolden_bisort that are ports of the famous memory-based Olden benchmarks. I have included several alternative versions that show improvements. When I can I plan to add two more Olden benchmarks.


In the junk software page I have also addes multibase_happy, various solutions in Python and D of a Google Code Jam 2009 problem. You can find info inside the zip.


And then I have done many improvements to the Scimark2 benchmark, I have added some faster D not-OOP versions, to reach the performance of the Java version. See the zip for a lot more information. It turns out that Java HotSpot is able to partially unroll a loop with a number of iterations known only at runtime. LLVM is not currently able to perform this optimization, so the code was slower. I have unrolled the inner loop of the SOR and LU parts using a static foreach, now the performance is good.
comments: 1 comment or Leave a comment Add to Memories Tell a Friend

Tags:, , , , , , , ,
Subject:Olden "em3d" benchmark in D language
Time:12:38 am
All the code shown in this article:
http://www.fantascienza.net/leonardo/js/dolden_em3d.zip


This is a partial port of the Olden benchmarks to the D language. I'll add more Olden benchmarks in future as time allows me to.

Info on the Olden benchmarks and their Java translation (JOlden):
http://www-ali.cs.umass.edu/DaCapo/benchmarks.html
http://www.sable.mcgill.ca/~bdufou1/ashes2/


General comments:

One of my purposes is to test the efficiency of the D1 code compiled with LDC. LDC is usually able to produce efficient binaries when I translate C code to C-like D code. So now I want to see how D is efficient if I start from higher level code, like Java code. It's usually easy to translate such Java code to D, D looks like a super set of C and Java, with some C++ mixed in.

Olden benchmarks are designed to stress first of all the memory allocations. The D Garbage Collector (GC) is quite less efficient than the Java HotSpot GC, so literal translations of the Olden benchmarks from Java to D are usually lower or quite lower performance (up to 4-16 times slower). So I have to work some to regain performance. Usually D allows me to produce final programs that are 2-6 times faster than the original Java versions, but such optimization requires time, experience, and sometimes it's a little bug-prone.

Beside having a more efficient GC, Java performs other optimizations not done by LDC, like inlining of virtual methods, and other things. In future the LLVM back-end of LDC will hopefully perform some of such optimizations (obsoleting some of my manual optimizations).

In all JOlden benchmarks I have packed the Java code of a benchmark in a single source file, that later I have first translated to D, and then optimized for speed and memory usage. Sometimes I have cleaned up some the D code.

Below I explain the various versions I've created.


See the Python2 code near the bottom to see the algorithm used by this "em3d" benchmark. This benchmark models the propagation of electromagnetic waves through objects in 3 dimensions. It is a simple computation on an irregular bipartite graph containing nodes representing electric and magnetic field values.

Update 1.2, Oct 26 2006:

One of the main purposes of this article is to show show and teach how some manual optimizations are done.

Timings of the various versions (i is the number of iterations):
          i=50   i=200  i=2000
Java1:    4.37    7.13
Java2     3.62    6.87
D01:      6.55   11.40
D02:      4.98    9.36
D03:      3.53    8.01
D04:      3.32    6.44
D05:      2.94    5.95
D06:      2.14    5.23
D07:      1.99    4.72
D08:      1.82    4.04
D09:      1.82    4.03
D10:      1.20    3.06
D11:      1.27    2.96   24.62 (TyIndex=ushort)
D11:      1.52    3.60         (TyIndex=uint)
D11b:     1.27    2.97   23.25 (TyIndex=ushort, TyDelta=ubyte)
D11b:     1.31    3.00         (TyIndex=uint, TyDelta=ushort)
D12:      1.61    3.62
Java3:    3.14    6.24
D01b:     6.33   10.60
Python1: 64.3    ---


------------------------------------

Java1

The original JOlden Java code packed in a single source file.

I have fixed a bug present in the Java code of JOlden Em3d but absent in the original C code:
At line 131 of the Em3d1.java the line:
if (otherNode == toNodes[filled]) break;
Has to be fixed as:
if (otherNode == toNodes[k]) break;

Such bug was not found probably because this benchmark suite has no unit-tests that test well each class and each method of each class.

Timings (best of 6, seconds):

...$ time java -server Em3d1 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Propagating field values for 50 iteration(s)...
EM3D build time 2.725
EM3D compute time 1.277
EM3D total time 4.01
Done!
real 0m4.370s
user 0m3.376s
sys 0m0.380s


...$ time java -server Em3d1 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Propagating field values for 200 iteration(s)...
EM3D build time 2.458
EM3D compute time 4.35
EM3D total time 6.812
Done!
real 0m7.128s
user 0m6.568s
sys 0m0.440s

Used:
java version "1.6.0_16"
Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
Java HotSpot(TM) Client VM (build 14.2-b01, mixed mode, sharing)

Code running on Ubuntu, running on VirtualBox, running on Windows Vista, on CPU Celeron 2.13 GHz.

------------------------------------

Java2

It's like the Java1 version, but it use a very simple portable pseudo-random generator. So when I translate this program to D I can see if it gives the same results.

The Random.nextInt is a little different, it accepts a max value (min is 0).

Timings:

...$ time java -server Em3d2 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Propagating field values for 50 iteration(s)...
EM3D build time 2.209
EM3D compute time 1.109
EM3D total time 3.318
Done!
real 0m3.622s
user 0m3.100s
sys 0m0.416s

...$ time java -server Em3d2 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Propagating field values for 200 iteration(s)...
EM3D build time 2.197
EM3D compute time 4.354
EM3D total time 6.551
Done!
real 0m6.875s
user 0m6.292s
sys 0m0.448s

------------------------------------

D01

This is the most direct translation of the Java2 code to D, with the portable pseudo-random generator. I have used printf for maximum portability across different D standard libraries. I have replaced the enumerated with the D opApply.
As you can see D1 allows to program in a style almost equal to Java.

Code compiled with:
...$ ldc -O5 -release -inline em3d01.d

With:
LDC compiler, based on DMD v1.045 and llvm 2.6 (Thu Sep 10 23:50:27 2009)

Timings:

...$ time ./em3d01 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Propagating field values for 50 iteration(s)...
EM3D build time 4.96
EM3D compute time 1.38
EM3D total time 6.34
Done!
real 0m6.547s
user 0m5.856s
sys 0m0.576s

...$ time ./em3d01 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Propagating field values for 200 iteration(s)...
EM3D build time 4.88
EM3D compute time 5.97
EM3D total time 10.86
Done!
real 0m11.398s
user 0m10.801s
sys 0m0.492s


The first D version, despite being compiled with LDC, is slower than the Java version. As you can see the build time is much bigger, more than two times, that's mostly because of the D GC.

To test that no bugs are introduced I have created a small Python script that acts like the "diff" command, but ignored small differences in floating point values. The programs of this benchmark are able to print the results, so I can test them:

...$ java Em3d2 -n 10 -d 4 -i 5 -m -p > outj
...$ ./em3d01 -n 10 -d 4 -i 5 -m -p > outd01

With it I can test that the results are the same:

...$ python approx_diff.py outj2 outd01
3) line=26: ['EM3D', 'build', 'time', '0.0040'] ['EM3D', 'build', 'time', '0.00'] 0.004 0.0
3) line=28: ['EM3D', 'total', 'time', '0.0040'] ['EM3D', 'total', 'time', '0.00'] 0.004 0.0

This is the asm of inner loop of computeNewValue() compiled with LDC:

.LBB11_2:
movl 32(%eax), %esi
movl (%esi,%edx,4), %esi
movl 40(%eax), %edi
movsd (%edi,%edx,8), %xmm1
mulsd 8(%esi), %xmm1
subsd %xmm1, %xmm0
movsd %xmm0, 8(%eax)
incl %edx
cmpl %ecx, %edx
jl .LBB11_2

------------------------------------

D02

Olden benchmarks are quite memory-based, so the first way to optimize them is to reduce the number of memory allocations. Also, in D a good basic optimization for this kind of programs is to convert classes that are instantiated many times, into structs.

A quick scan of the code shows that it contains a single instance of Em3d and BiGraph, and many instances of Node. So this version translated Node into a struct, allocated on the heap and managed by pointer. The code is almost the same. I have also reformatted code to 4 spaces.

Timings:

...$ time ./em3d02 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Propagating field values for 50 iteration(s)...
EM3D build time 3.39
EM3D compute time 1.40
EM3D total time 4.79
Done!
real 0m4.982s
user 0m4.416s
sys 0m0.472s

...$ time ./em3d02 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Propagating field values for 200 iteration(s)...
EM3D build time 3.35
EM3D compute time 5.84
EM3D total time 9.19
Done!
real 0m9.365s
user 0m8.821s
sys 0m0.460s

The code is faster, but there's a lot to do still. The code is not good yet, there are several inefficiencies.

------------------------------------

D03

Timings:

I have converted all the classes (Random, BiGraph and Em3d) to structs with minimal performance changes. So the problem is (as expected) elsewhere.
I have reformatted the Java comments to produce a more compact code that (despite losing being fit for ddoc) allows me to understand the code better.

I have renamed some Node fields to better names, because better names allow to understand code better:
toNodes => outbound
fromNodes => inbound
fromCount => n_inbound

Now the main() is a independent function.

Then I have disabled the GC just before the graph creation, and enabled again after that.

I have also added extra time printings inside the graph creation method, to study better where the running time goes.

Then I have added an exit(0) at the end of the main, it kills the program quickly, saving the time needed to call destructors and to free the GC memory. In general this is not a safe thing to do, because destructors may be necessary, but in this program this is OK.

Timings:

...$ time ./em3d03 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
0.14 0.94 0.48 0.43
Propagating field values for 50 iteration(s)...
EM3D build time 1.99
EM3D compute time 1.36
EM3D total time 3.35
Done!
real 0m3.530s
user 0m2.920s
sys 0m0.524s

...$ time ./em3d03 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
0.15 0.93 0.48 0.43
Propagating field values for 200 iteration(s)...
EM3D build time 1.99
EM3D compute time 5.80
EM3D total time 7.79
Done!
real 0m8.012s
user 0m7.348s
sys 0m0.540s

As expected the compute time is unchanged, but the build time is decreased significantly.
Now the build time is lower than the Java2 code! So the first benchmark (with just 50 loops) is faster than the Java2 version.

This is the asm of inner loop of computeNewValue() compiled with LDC:

.LBB10_2:
movl 24(%eax), %esi
movl (%esi,%edx,4), %esi
movl 32(%eax), %edi
movsd (%edi,%edx,8), %xmm1
mulsd (%esi), %xmm1
subsd %xmm1, %xmm0
movsd %xmm0, (%eax)
incl %edx
cmpl %ecx, %edx
jl .LBB10_2

------------------------------------

D04

I have renamed a method:
BiGraph.compute() => graph.computeStep();

And I have added a test for the sanity of the input numDegree:
if (numDegree > numNodes) {...

The program allocates a linked list of nodes, and then doesn't change that topology any more, just iterates on them. Both allocating many single nodes, and iterating on a linked list, today are slow operations (the original C code was old). So today it's much more efficient to allocated many or all Nodes in an array (this program contains a bipartite graph, so there are two arrays).

So I can also remove the "next" pointer field of Node, that was used for the links of the list.

In this D version the arcs are kept as pointers. Now the opApply isn't needed (this increases speed because I now loop on arrays, but loses some encapsulation).

The fillTable() method becomes the createTable() because it allocates an array of the nodes.

Timings:

...$ time ./em3d04 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
0.13 0.94 0.48 0.40
Propagating field values for 50 iteration(s)...
EM3D build time 1.95
EM3D compute time 1.18
EM3D total time 3.13
Done!
real 0m3.326s
user 0m2.696s
sys 0m0.540s

...$ time ./em3d04 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
0.13 0.92 0.46 0.40
Propagating field values for 200 iteration(s)...
EM3D build time 1.91
EM3D compute time 4.35
EM3D total time 6.26
Done!
real 0m6.443s
user 0m5.908s
sys 0m0.432s

The build time is not changed, I don't know why, probably disabling the GC allows for an efficient enough memory allocation. But now the compute time is decreased, because the program can scan the arrays more efficiently.

If a program like this gets used in practice, the build time is not important, because the number of iterations is probably large, so the compute time has to be as small as possible.

Now both timings are better than the Java2 timings.

This is the asm of inner loop of computeNewValue() compiled with LDC:

.LBB9_2:
movl 12(%eax), %esi
movl (%esi,%edx,4), %esi
movl 20(%eax), %edi
movsd (%edi,%edx,8), %xmm1
mulsd (%esi), %xmm1
subsd %xmm1, %xmm0
movsd %xmm0, (%eax)
incl %edx
cmpl %ecx, %edx
jb .LBB9_2

------------------------------------

D05

The CPU is able to use the pointers among Nodes in a quick way, but the number of nodes is probably limited. The program can increase its speed if there is less data traffic through the CPU cache. So I have replaced the pointers by 16 bit (ushort) indexes. This allows up to 2^16 nodes. If you need more, you can replace:
alias ushort TyIndex;
With:
alias uint TyIndex;
But this may produce code slower than D04.

Timings:

...$ time ./em3d05 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
0.06 0.94 0.38 0.42
Propagating field values for 50 iteration(s)...
EM3D build time 1.80
EM3D compute time 1.00
EM3D total time 2.80
Done!
real 0m2.937s
user 0m2.540s
sys 0m0.324s

...$ time ./em3d05 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
0.07 0.95 0.37 0.39
Propagating field values for 200 iteration(s)...
EM3D build time 1.78
EM3D compute time 4.02
EM3D total time 5.80
Done!
real 0m5.955s
user 0m5.508s
sys 0m0.360s

This is the asm of inner loop of computeNewValue() compiled with LDC:

.LBB9_2:
movl 20(%eax), %edi
movsd (%edi,%esi,8), %xmm1
movl 12(%eax), %edi
movzwl (%edi,%esi,2), %edi
imull $40, %edi, %edi
mulsd (%edx,%edi), %xmm1
subsd %xmm1, %xmm0
movsd %xmm0, (%eax)
incl %esi
cmpl %ecx, %esi
jb .LBB9_2

------------------------------------

D06

Now I look for possible algorithmic improvements.
makeUniqueNeighbors() contains stupid quadratic code to avoid duplicating Neighbors. I have tried to use a D associative array, and then a set implemented as a bit vector (using bt/bts intrinsics), but the most efficient seems a set of booleans implemented with just an array of ubytes.

I have renamed:
coeffs => inCoeffs

Timings:

...$ time ./em3d06 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
0.07 0.11 0.40 0.39
Propagating field values for 50 iteration(s)...
EM3D build time 0.97
EM3D compute time 1.03
EM3D total time 2.00
Done!
real 0m2.145s
user 0m1.748s
sys 0m0.324s

...$ time ./em3d06 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
0.07 0.11 0.37 0.40
Propagating field values for 200 iteration(s)...
EM3D build time 0.95
EM3D compute time 4.11
EM3D total time 5.06
Done!
real 0m5.231s
user 0m4.804s
sys 0m0.352s

The build time for 5000 nodes with 300 arcs is half as before.

This is the asm of inner loop of computeNewValue() compiled with LDC:

.LBB9_2:
movl 20(%eax), %edi
movsd (%edi,%esi,8), %xmm1
movl 12(%eax), %edi
movzwl (%edi,%esi,2), %edi
imull $40, %edi, %edi
mulsd (%edx,%edi), %xmm1
subsd %xmm1, %xmm0
movsd %xmm0, (%eax)
incl %esi
cmpl %ecx, %esi
jb .LBB9_2

------------------------------------

D07

Now the most common optimizations are done, and the code looks good enough. To improve the code some more we have to think more about the CPU cache.

The computations (of "compute time") are done on an array of large structs Node. But only few fields are actually necessary for such computation: value, inbound and inCoeffs (total 6 words), so if I remove the useless data, the iterations on the array will be (hopefully) faster.

I don't know how to remove items from the struct. The simpler way to solve this problem is to define a second Node struct, a Node2, with just the essential fields. Creating the two arrays of Node2 is fast, because all arrays of inbound and inCoeffs are just copied by reference.

So I add the eNodes2 and hNodes2 dynamic arrays of Node2 to BiGraph (now encapsulation looks gone, but in practice I don't think the original design was good. The only data structure that has to know about a collection of Nodes2 has to be BiGraph, and not Node2 itself).

There's some redundancy too here in a Node2: the inbound and inCoeffs arrays have the same length, so the Node2 can use just 5 words. There are few ways to remove this field, but the first field (the value) is a double, and for max performance it must be aligned to 8 bytes. So I have done few experiments, but they are failed, the program was slower, with a small memory usage reduction. So I have kept all the 6 words.

Timings:

...$ time ./em3d07 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.08 0.10 0.37 0.39 0.00
Propagating field values for 50 iteration(s)...
EM3D build time 0.94
EM3D compute time 0.92
EM3D total time 1.86
Done!
real 0m1.993s
user 0m1.600s
sys 0m0.316s

...$ time ./em3d07 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.07 0.10 0.39 0.40 0.00
Propagating field values for 200 iteration(s)...
EM3D build time 0.96
EM3D compute time 3.62
EM3D total time 4.58
Done!
real 0m4.718s
user 0m4.292s
sys 0m0.360s

The build time of the graph is the same, but the computing time is decreased.
The time needed to build the arrays of Node2 is very small.

This is the asm of inner loop of computeNewValue() compiled with LDC:

.LBB9_2:
movl 20(%eax), %edi
movsd (%edi,%esi,8), %xmm1
movl 12(%eax), %edi
movzwl (%edi,%esi,2), %edi
imull $24, %edi, %edi
mulsd (%edx,%edi), %xmm1
subsd %xmm1, %xmm0
movsd %xmm0, (%eax)
incl %esi
cmpl %ecx, %esi
jb .LBB9_2

-----------------------------------

D08

We can improve the code more taking a look at how hValues are accessed, a sequential access is faster. But I've seen that the inbound[i] are already sorted.

So I have to take a better look at the very small loop that performs one of the two halves of a computation step:
void computeNewValue(Node2[] otherTable) {
  for (int i; i < inbound.length; i++)
    value -= inCoeffs[i] * otherTable[inbound[i]].value;
}

There's not much that can be improved here. I have drawn on paper the simple data structures involved here, and with that I've seen that inbound[i] performs (forward but) large jumps in an array of largish structs (6 words each) to find the "value" fields. This forces the CPU cache to a lot of traffic. If we reduce this traffic the iterations will get faster.

The simple way to do this is to pull the values out of the nodes2 and put them in a uniform array, that's used in parallel to the Node2 array.
// Node2 of D07:
struct Node2 {
    double value;
    Node.TyIndex[] inbound;
    double[] inCoeffs;
}

// Node2 of D08:
struct Node2 {
    Node.TyIndex[] inbound;
    double[] inCoeffs;
    static double[] eValues, hValues;
}

Now the jumps are performed on arrays (eValues and hValues) with just 2 words/item.

To perform a better computation I have had to split computeNewValue() into eValuesStep() and hValuesStep(). I have pulled the outer loop like foreach(j, ref n; hNodes2) inside them, this may help the DMD compiler too, because it doesn't inline functions/methods that have a loop inside.

I also have had to add a "aux_val" auxiliary variable inside the eValuesStep/hValuesStep functions, because it seems LDC/LLVM was not able to pull the access to Node2.hValues[j] out of the loop. I don't know why [see below, the explanation after D01b].

Timings:

...$ time ./em3d08 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.07 0.10 0.37 0.40 0.00
Propagating field values for 50 iteration(s)...
EM3D build time 0.94
EM3D compute time 0.74
EM3D total time 1.68
Done!
real 0m1.817s
user 0m1.416s
sys 0m0.332s

...$ time ./em3d08 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.07 0.10 0.37 0.39 0.00
Propagating field values for 200 iteration(s)...
EM3D build time 0.93
EM3D compute time 2.96
EM3D total time 3.89
Done!
real 0m4.038s
user 0m3.628s
sys 0m0.336s

The compute time is decreased enough.

This is the asm of inner loop of eValuesStep() compiled with LDC:

.LBB9_4:
movzwl (%esi,%edi,2), %ebp
movsd (%edx,%edi,8), %xmm1
mulsd (%ebx,%ebp,8), %xmm1
subsd %xmm1, %xmm0
incl %edi
cmpl %ecx, %edi
jb .LBB9_4

------------------------------------

D09

There's one last possible optimization of the memory usage that I can see. I can remove again the redundancy from Node2, because now it doesn't contain a double, and now probably I can use 3 32-bit words.

The new Node2 is:
struct Node2 {
  Node.TyIndex* inbound;
  double* inCoeffs;
  int inbound_len;
  static double[] eValues, hValues;
}

Timings:

...$ time ./em3d09 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.08 0.10 0.37 0.39 0.00
Propagating field values for 50 iteration(s)...
EM3D build time 0.94
EM3D compute time 0.74
EM3D total time 1.68
Done!
real 0m1.821s
user 0m1.444s
sys 0m0.296s

...$ time ./em3d09 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.07 0.11 0.37 0.39 0.00
Propagating field values for 200 iteration(s)...
EM3D build time 0.94
EM3D compute time 2.95
EM3D total time 3.89
Done!
real 0m4.030s
user 0m3.632s
sys 0m0.332s

The computing time is about the same. The memory saving compared to D08 is very small, something like 40 KB.

It may be possible to vectorize the computing loop, but I stop here.

Compared to the Java2 version the speed of this last D09 version is not much higher, because the computing loop is very tight and the JavaVM is able to optimize it very well, quite better than LDC, see the compute time of Java2 compared to the compute time of D01: 4.35 compared to 5.97. A shame for D/LDC/LLVM :-)

Other Olden benchmarks show a bigger gap between the performance of the Java and optimized D code.


This is the asm of inner loop of eValuesStep() compiled with LDC:

.align 16
.LBB9_4:
movzwl (%esi,%edi,2), %ebp
movsd (%edx,%edi,8), %xmm1
mulsd (%ebx,%ebp,8), %xmm1
subsd %xmm1, %xmm0
incl %edi
cmpl %ecx, %edi
jl .LBB9_4

------------------------------------

D10

Update 1.2, Oct 27 2006:

There is another simple performance optimization that can be done. The memory for the Nodes and their arcs can be allocated from the C heap (and in some cases such memory can even be allocated with malloc instead of calloc, but in this programs this has no significant performance difference).

The memory coming from the C heap is usually aligned to 4 bytes, so it's not fit for storing values and coefficients that in this program are doubles, that are used much more efficiently when aligned to 8 bytes. But in this program the actual computations performed on double FP numbers are done on arrays allocated in Node2, that are allocated from the D GC heap that returns memory aligned to 16 bytes. The only problem may come from Node.inCoeffs that's copied as is to Node2 inCoeffs. So this allocation needs a little of extra care (see makeFromNodes()).

I can even allocate all "outbound" arrays at once, because they are all of the same length, but this change will not change performance significantly, because that time is part of the first timing of the "Detailed creation timings:", that's about 0.03-0.04 seconds for 5000 nodes.

I have moved the disable() of the GC below, when the computing loops are done, because there's no point in keeping the GC active during those loops, that produce no garbage.

Timings:

...$ time ./em3d10 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.04 0.10 0.14 0.19 0.00
Propagating field values for 50 iteration(s)...
EM3D build time 0.47
EM3D compute time 0.60
EM3D total time 1.07
Done!
real 0m1.203s
user 0m0.968s
sys 0m0.148s

...$ time ./em3d10 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.03 0.10 0.16 0.18 0.00
Propagating field values for 200 iteration(s)...
EM3D build time 0.47
EM3D compute time 2.46
EM3D total time 2.93
Done!
real 0m3.061s
user 0m2.844s
sys 0m0.124s

Not just (as expected) the build times are lower, but the compute ones too are lower, I don't know why, maybe the memory from the C heap has more coherence (more contiguous, reducing cache misses. The GC adds some spaces, because it returns memory blocks aligned to 16 bytes).

------------------------------------

D11

Update 1.2, Oct 27 2006:

To reduce the time used by makeFromNodes(), and hopefully to increase the computation loops (with a better cache coherence) we can cut the inbound and inCoeffs arrays of Node (later copied into Node2) from a large memory block allocated all at once (again, doubles are better alighed to 8 bytes) to do this I have created two memory arenas, with the DoubleArena and IndexArena structs.

Timings:

...$ time ./em3d11 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.04 0.10 0.00 0.42 0.00
Propagating field values for 50 iteration(s)...
EM3D build time 0.56
EM3D compute time 0.58
EM3D total time 1.14
Done!
real 0m1.271s
user 0m1.028s
sys 0m0.164s

...$ time ./em3d11 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.04 0.09 0.00 0.43 0.01
Propagating field values for 200 iteration(s)...
EM3D build time 0.57
EM3D compute time 2.23
EM3D total time 2.80
Done!
real 0m2.961s
user 0m2.652s
sys 0m0.184s

...$ time ./em3d11 -n 5000 -d 300 -i 2000 -m
Initializing em3d random graph...
Detailed creation timings: 0.03 0.11 0.00 0.43 0.00
Propagating field values for 2000 iteration(s)...
EM3D build time 0.57
EM3D compute time 23.39
EM3D total time 23.96
Done!
real 0m24.621s
user 0m23.869s
sys 0m0.144s


Now the build time is a little higher (I don't know why), while the computing loops are faster (I don't know why), so when the number of iterations is 200 the total time is lower.

In the detailed creation timings you can also see that makeFromNodes() is now very fast, while updateFromNodes() is quite slower (I don't know why. Here DMD is twice faster).

This version also needs a little less RAM compared to D10, about 39 MB with -n 5000 -d 300. Java3 needs about 197 MB for the same graph.

Now the D11 program is fast enough, and uses a low enough memory, that the limit of 65000 nodes given by the ushort indexes can be felt. Using an uint as TyIndex the memory used by D11 with -n 5000 -d 300 is 51 MB.

Timings of D11 with TyIndex=uint:

...$ time ./em3d11 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.09 0.11 0.00 0.48 0.00
Propagating field values for 50 iteration(s)...
EM3D build time 0.68
EM3D compute time 0.68
EM3D total time 1.36
Done!
real 0m1.525s
user 0m1.220s
sys 0m0.200s

...$ time ./em3d11 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.06 0.12 0.00 0.50 0.00
Propagating field values for 200 iteration(s)...
EM3D build time 0.68
EM3D compute time 2.71
EM3D total time 3.39
Done!
real 0m3.597s
user 0m3.252s
sys 0m0.184s

Using such larger indexes the running time is not much higher, but in this case using just pointers is probably faster.

This is the asm of inner loop of eValuesStep() of D11 compiled with LDC, when TyIndex=uint:

.LBB14_4:
movl (%esi,%edi,4), %ebp
movsd (%edx,%edi,8), %xmm1
mulsd (%ebx,%ebp,8), %xmm1
subsd %xmm1, %xmm0
incl %edi
cmpl %ecx, %edi
jl .LBB14_4

------------------------------------

D11b
(lateral branch)

Update 1.2, Oct 27 2006:

When the number of nodes is high, instead of storing ushort indexes, I can store ushort deltas. And because the indexes are ordered such deltas are never negative. If links are well spread, then such deltas are usually small, and they can be used to index far more than 2^16 nodes. If such links are randomly distriuited, you can probably manage millions of nodes in a safe enough way, and it's easy to add a runtime test when they are created to be sure they don't overflow.

I have done a test, using the usual -n 5000 -d 300 arguments with the same seed=783, in such case the maximum delta is 240, so just one ubyte suffices! ubyte indexes reduce traffic through the cache, so they may speed up the program a little more. The D11b uses such TyDelta in Node2 but I have kept usign TyIndex in Node to keep the program simpler.

Timings of D11b with TyIndex=ushort, TyDelta=ubyte:

...$ time ./em3d11b -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.04 0.10 0.00 0.42 0.02
Propagating field values for 50 iteration(s)...
EM3D build time 0.59
EM3D compute time 0.56
EM3D total time 1.15
Done!
real 0m1.266s
user 0m1.032s
sys 0m0.156s

...$ time ./em3d11b -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.04 0.11 0.00 0.42 0.02
Propagating field values for 200 iteration(s)...
EM3D build time 0.59
EM3D compute time 2.20
EM3D total time 2.79
Done!
real 0m2.968s
user 0m2.676s
sys 0m0.160s

...$ time ./em3d11b -n 5000 -d 300 -i 2000 -m
Initializing em3d random graph...
Detailed creation timings: 0.03 0.12 0.00 0.48 0.02
Propagating field values for 2000 iteration(s)...
EM3D build time 0.65
EM3D compute time 21.98
EM3D total time 22.63
Done!
real 0m23.252s
user 0m22.477s
sys 0m0.204s

The iteration is a bit faster, but the build time of Nodes2 is a little slower (by about 0.02 seconds). So for 200 iterations the running time is about the same, while for 2000 you can see some difference. For TyDelta==ubyte this D11b program is not very useful (unless you have few nodes), but for larger graphs such delta encoding can be useful to keep using ushorts even when there are more than 2^16 nodes.


Timings of D11b with TyIndex=uint, TyDelta=ushort:

...$ time ./em3d11b -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.04 0.11 0.00 0.44 0.02
Propagating field values for 50 iteration(s)...
EM3D build time 0.61
EM3D compute time 0.56
EM3D total time 1.17
Done!
real 0m1.309s
user 0m1.032s
sys 0m0.180s

...$ time ./em3d11b -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.05 0.10 0.00 0.43 0.02
Propagating field values for 200 iteration(s)...
EM3D build time 0.60
EM3D compute time 2.20
EM3D total time 2.80
Done!
real 0m2.997s
user 0m2.684s
sys 0m0.176s

D11b with TyDelta=ushort is just a little slower than D11b with TyDelta=ubyte, so in most situations it can be enough for a number of nodes >> 2^16 if the arcs are spread in an uniform random way.

This em3d program is just a benchmark, but in a real program you want to add class invariants, unittests, several safeties for the memory, and ways to deallocate memory when not used any more, etc.

------------------------------------

D12
(this ignores the D11b branch)

Update 1.3, Oct 27 2006:

The items of inbound and inCoeffs are accessed in parallel, so it can be useful for them to be close to each other. Ideally the items of inbound and inCoeffs are better kept in the same cache line, that's a block of 64 bytes (and I think it has to be aligned too). But I can't put them into a struct because it wastes a lot of space (such struct has to be 16 bytes long to keep the double it contains aligned to 8 bytes), and wasting space increases cache misses.

If I want to pack pairs of double + ushort in a struct that inside an array is aligned to 8 bytes, I need to pack 4 doubles and 4 ushorts, this needs 8*4 + 2*4 = 32 + 8 = 40 bytes. But this straddles a cache line, and I think this is negative. So to pack as much as possible in a cache line I can use 6 pairs, that need 8*6 + 2*6 = 48 + 12 = 60 bytes, that needs 4 bytes of padding, this wastes 6.25% of space, I think it's acceptable. I can define it like this:
struct PairsPack {
  static if (TyIndex.sizeof == 2)
    const int NPAIRS = 6; // 8*6 + 2*6 + 4 = 48 + 12 + 4 = 64
  else static if (TyIndex.sizeof == 4)
    const int NPAIRS = 5; // 8*5 + 4*5 + 4 = 40 + 20 + 4 = 64
  else
      static assert(0);

  double[NPAIRS] weights;
  TyIndex[NPAIRS] inbounds;
  uint _padding;
}

Just to be sure I'll allocate the array of PairsPack aligned to 64 bytes.

But generally the number of inbound arcs in a node isn't a multiple of 6, so I need to waste some of those pairs, or use part of an already used PairsPack. Both solutions have advantages and disadvantages. I choose the option to not waste memory, this requires a little more complex indexing of the sub-part of a PairsPack, but allows me to allocate the array of PairsPack in a simpler way.

This D12 was the hardest version to create, the code is quite hairy, and hard to maintain and understand. And it's slow (the Range!() that I have used is not necessary. LLVM was able to unroll that loop inside eValuesStep/hValuesStep by itself).

Timings:

...$ time ./em3d12o -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.04 0.11 0.00 0.43 0.19
Propagating field values for 50 iteration(s)...
EM3D build time 0.78
EM3D compute time 0.68
EM3D total time 1.46
Done!
real 0m1.614s
user 0m1.240s
sys 0m0.284s

...$ time ./em3d12o -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Detailed creation timings: 0.04 0.11 0.00 0.41 0.20
Propagating field values for 200 iteration(s)...
EM3D build time 0.76
EM3D compute time 2.65
EM3D total time 3.41
Done!
real 0m3.623s
user 0m3.176s
sys 0m0.308s


The asm of inner loop of eValuesStep() of D11 compiled with LDC:

.LBB14_4:
movzwl 50(%eax), %ebp
movsd 8(%eax), %xmm1
mulsd (%esi,%ebp,8), %xmm1
movzwl 48(%eax), %ebp
movsd (%eax), %xmm2
mulsd (%esi,%ebp,8), %xmm2
subsd %xmm2, %xmm0
subsd %xmm1, %xmm0
movzwl 52(%eax), %ebp
movsd 16(%eax), %xmm1
mulsd (%esi,%ebp,8), %xmm1
subsd %xmm1, %xmm0
movzwl 54(%eax), %ebp
movsd 24(%eax), %xmm1
mulsd (%esi,%ebp,8), %xmm1
subsd %xmm1, %xmm0
movzwl 56(%eax), %ebp
movsd 32(%eax), %xmm1
mulsd (%esi,%ebp,8), %xmm1
subsd %xmm1, %xmm0
movzwl 58(%eax), %ebp
movsd 40(%eax), %xmm1
mulsd (%esi,%ebp,8), %xmm1
subsd %xmm1, %xmm0
addl $64, %eax
decl %ebx
jne .LBB14_4

------------------------------------

Java3

Now I can think porting back some of the optimizations I have implemented in D to Java. Java don't uses structs, so several things are not possible.

In this Java3 I have used the uniqueNeighborsSet set, implemented as the D code.

Timings:

...$ time java -server Em3d3 -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Propagating field values for 50 iteration(s)...
EM3D build time 1.681
EM3D compute time 1.14
EM3D total time 2.821
Done!
real 0m3.138s
user 0m2.600s
sys 0m0.416s

...$ time java -server Em3d3 -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Propagating field values for 200 iteration(s)...
EM3D build time 1.649
EM3D compute time 4.29
EM3D total time 5.943
Done!
real 0m6.239s
user 0m5.656s
sys 0m0.452s

More optimizations may be ported from D to Java.

------------------------------------

D01b
(lateral branch)

Looking at the asm of the inner loop of computeNewValue for D01 and D09, I think the D01 can be improved with an auxiliary variable:
void computeNewValue() {
  auto aux = this.value;
  for (int i = 0; i < fromCount; i++)
    aux -= coeffs[i] * fromNodes[i].value;
  this.value = aux;
}

Timings:

...$ time ./em3d01b -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Propagating field values for 50 iteration(s)...
EM3D build time 4.89
EM3D compute time 1.28
EM3D total time 6.17
Done!
real 0m6.335s
user 0m5.804s
sys 0m0.464s

...$ time ./em3d01b -n 5000 -d 300 -i 200 -m
Initializing em3d random graph...
Propagating field values for 200 iteration(s)...
EM3D build time 4.92
EM3D compute time 5.23
EM3D total time 10.15
Done!
real 0m10.601s
user 0m10.105s
sys 0m0.440s

LDC/LLVM must be able to perform that optimization (pulling this.value out of the loop) by itself.

This is the asm of inner loop of computeNewValue() compiled with LDC:

.LBB11_2:
movl (%edx,%edi,4), %ebx
movsd (%esi,%edi,8), %xmm1
mulsd 8(%ebx), %xmm1
subsd %xmm1, %xmm0
incl %edi
cmpl %ecx, %edi
jl .LBB11_2

This asm is very short, but it's quite slower than D09 anyway. I don't know what kind of optimizations are done by the JavaVM on the Java1 code (there is a way to look at the asm produced by the JVM, but it's not handy).


Update 1.1, Oct 26 2006: now I think I know why LDC isn't able to perform that optimization in the computeNewValue() loop: the compiler doesn't know that "value" on the right is always distinct from "value" on the left, because this is a bipartite graph, where no self-loops are allowed. So it can't pull the "value" of the left out of the loop.

To perform the optimization the compiled need more semantics. Future programming languages may allow the programmer to give such semantics to the compiler.

I have done a similar optimization (pulling "value" out of the loop) in the Java3 version, with no change in performance, so probably the JavaVM is somehow able to perform that optimization.

------------------------------------

Python1

I have created a Python version too, that uses Psyco. It's similar to the Java3 version.
I have seen that the faster unique_neighbors_set is using just a Python set.

I have added __slots__ to both Random and Node, this doubles the performance.

Timings:

...$ time python em3d.py -n 5000 -d 300 -i 50 -m
Initializing em3d random graph...
Detailed creation timings: 0.14 3.86 0.16 4.32
Propagating field values for 50 iteration(s)...
EM3D build time 8.48
EM3D compute time 54.70
EM3D total time 63.19
Done!
real 1m4.320s
user 0m59.628s
sys 0m3.876s

I have used:
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
Psyco 1.6.0 final

------------------------------------

Python2

While improving the D code, and especially when I have created the Python1 version, I have felt the code as too much complex for the simple operations it performs. So I have reduced the Python code, removing useless parts. The Python2 version is almost the shortest Python code that produces the same output (there are ways to shorten it a little more, but I am not doing code golf here, my purposes are different).

The Python2 code is 67 lines long, while the Java3 is 481 lines including the comments.

------------------------------------

Python3

Then I have replaced the random() and sample() with ones from the Python standard library, I have removed the printing code, and I have reorganized the code a little more.

The result is short enough (21 nonempty lines) to be shown here too:
from random import random, sample

class Node:
    def __init__(self, value):
        self.value = value
        self.in_arcs = []

n_nodes, in_degree, n_steps = 15, 5, 5

h_nodes = [Node(random()) for i in xrange(n_nodes)]
e_nodes = [Node(random()) for i in xrange(n_nodes)]

def make_in_arcs(nodes, other_nodes, in_degree):
    for n1 in nodes:
        for n2 in sample(other_nodes, in_degree):
            n2.in_arcs.append((n1, random()))

make_in_arcs(h_nodes, e_nodes, in_degree)
make_in_arcs(e_nodes, h_nodes, in_degree)

def half_step(nodes):
    for n1 in nodes:
        for n2, weight in n1.in_arcs:
            n1.value -= weight * n2.value

for i in xrange(n_steps):
    half_step(e_nodes)
    half_step(h_nodes)

This code shows the essential, and for me it's much simpler to follow and understand. This code may be conveted to C/D again, if necessary. The original C code was harder to understand, and much more complex. Removing the noise and useless parts allows me to understand the algorithm better, this is a simple algorithm. Even if you don't understand this code immediately, the complexity left is necessary, it's not spurious. Generally good programs aren't the clever and complex ones, but ones that look "obvious". This Python code shows that lot of the complexity of the original Java code was spurious.

------------------------------------

Python4

This is similar to the Python3 version, I have just encoded a node in a simpler way, using a single list. The first item is the "value" and all the following ones are pairs of node-weight. This code is just 17 nonempty lines long. For me Python3 version is more readable.

------------------------------------
comments: Leave a comment Add to Memories Tell a Friend

Tags:, , ,
Subject:Code in Computer Science papers
Time:11:42 am
A problem with Computer Science papers is that they usually don't give enough importance to the programs that expresses the ideas shown in the paper. Sometimes such code can be found on University servers.

For very old CS papers it's often hard or impossible to find such code, this is bad. A possible way to solve this problem is to keep the code listing close to the paper itself, at the end of the paper.

But using tens of pages for code listings (that later may need to be scanned, given to an OCR software, and then OCR mistakes have to be fixed) may be suboptimal both in shelf space and time wasted for reading the code.

So a better encoding can be used. The code can be compressed by a standard version of the ZPAQ compressor (http://cs.fit.edu/~mmahoney/compression/ )  (compiling before it some hundred KB of code written in few programming languages and commented in few natural languages, later this header part has to be removed from the printed data. This is useful to form an entropy dictionary to compress the following code better), and then error-protected with a good Turbo Code (http://en.wikipedia.org/wiki/Turbo_code ), and finally encoded on paper in a standard way with something like Xerox DataGlyphs (http://www.xerox.com/Static_HTML/xsis/dataglph.htm ) not using error protection again.

The idea is that the glyphs on paper use a small space and are reliable, while the encoding speed is not important.

In this way code may be read on non-acid paper even a century after the printing or more, and the space used on paper is probably limited enough.
comments: Leave a comment Add to Memories Tell a Friend

Tags:,
Subject:Updates and links
Time:11:18 am
I've updated many benchmarks on the "Junk software" section of the site, and I've added the "js_ray" benchmark too (in JavaScript, translated to D).

Links:

GCC hacks, nice to remember if you use GCC only:
http://www.ibm.com/developerworks/linux/library/l-gcc-hacks/

Now the "Bug of the month" page allows to run the debugger online on those tiny demo programs:
http://www.gimpel-online.com/bugsLinkPage.html
comments: Leave a comment Add to Memories Tell a Friend

Tags:, , ,
Subject:Slow allocation of D objects
Time:12:10 pm
Allocating objects in D language, usign the good and efficient LDC compiler, may seem slower than doing the same thing in C++, but the situation is a little more complex, so few examples can show what's going on.

This is a syntetic C++ benchmark, it allocates an array of n pointers to Foo object, and then allocates the instances, that contain ten 32 bit integers:
// C++ version 1, 0.46 s
#include "stdio.h"
#include "stdlib.h"

using namespace std;

class Foo {
    public:
        int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
        Foo(int, int, int, int, int, int, int, int, int, int);
};

Foo::Foo(int x1, int x2, int x3, int x4, int x5, int x6, int x7, int x8, int x9, int x10):
  y1(x1), y2(x2), y3(x3), y4(x4), y5(x5), y6(x6), y7(x7), y8(x8), y9(x9), y10(x10) {}

int main(int argc, char *argv[]) {
    int n = (argc == 2) ? atoi(argv[1]) : 5;
    Foo* foos[n];

    for (int i = 0; i < n; i++)
        foos[i] = new Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);

    printf("%d %d %d\n", foos[n-1]->y1, foos[n-1]->y2, foos[n-1]->y10);
    return 0;
}

This is an almost equivalent D code. I use printf and those imports are so hairy because this code is designed to work with both Phobos or Tango standard libraries, and with D1 and D2 languages:
// D version 1, 0.89 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
} else {
    import std.c.stdio: printf;
    version (D_Version2) {
        import std.conv: to;
        alias to!(int, char[]) toInt;
    } else
        import std.conv: toInt;
}

class Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;

    this(int x1, int x2, int x3, int x4, int x5, int x6, int x7, int x8, int x9, int x10) {
        y1 = x1;
        y2 = x2;
        y3 = x3;
        y4 = x4;
        y5 = x5;
        y6 = x6;
        y7 = x7;
        y8 = x8;
        y9 = x9;
        y10 = x10;
    }
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo[] foos = new Foo[n];

    for (int i = 0; i < n; i++)
        foos[i] = new Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);

    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

Replacing in the C++ the class like this doesn't change the running time of the program:
// piece of the C++ version 2, 0.47 s
class Foo {
    public:
        int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
        Foo(int, int, int, int, int, int, int, int, int, int);
};

Foo::Foo(int x1, int x2, int x3, int x4, int x5, int x6, int x7, int x8, int x9, int x10) {
    y1 = x1;
    y2 = x2;
    y3 = x3;
    y4 = x4;
    y5 = x5;
    y6 = x6;
    y7 = x7;
    y8 = x8;
    y9 = x9;
    y10 = x10;
}

The running times on a Ubuntu running on VirtualBox, running on Vista 32 bit, running on a Celeron CPU, using LDC compiler based on DMD v1.045 and llvm 2.6 (Thu Sep 10 23:50:27 2009):
Timings, n = 1_000_000, best of 3, seconds:
  C++ 1: 0.46  |#########                |
  C++ 2: 0.47  |#########                |
  D 1:   0.89  |##################       |
  D 2:   0.82  |################         |
  D 3:   0.87  |#################        |
  D 4:   0.46  |#########                |
  D 5:   0.46  |#########                |
  D 6:   1.22  |#########################|
  D 6b:  0.86  |#################        |
  D 7:   0.63  |############             |
  D 8:   0.71  |##############           |

I have compiled the C++ code with gcc V.4.3.3 with:
g++ -Wall -O3 -s -fomit-frame-pointer -msse3 -march=native objbench1_cpp.cpp -o objbench1_cpp
And the D code with:
ldc -O5 -release -inline objbench1_d.d

As you can see the D version is about twice slower than the C++ code, despite usually the LDC compiler is almost as efficient as GCC (LLVM is not able to perform some optimizations yet, like auto-vectorization, but things are already good and they will improve).

D classes are different from C++ ones, D objects always contain a pointer to the virtual table (even if no virtual methods are present, it's used by the reflection and GC too) and a monitor, so on 32 bit systems they need 8 extra bytes. So we may try to remove such memory (and bookeeping) overhead using a struct (this version can be changed a little, so it works with Phobos of D2 too):
// D version 2, 0.82 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    
    Foo*[] foos = new Foo*[n];

    for (int i = 0; i < n; i++) {
        foos[i] = new Foo;
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }
    
    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

But you can see the performance improves only a little, so the problem is elsewhere. I have had to use that two-line initialization because only in D2 language structs are allowed to have an explicit constructor.

We can try allocating the pointer array on the stack but the performance gets a little worse:
// D version 3, 0.87 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.stdc.stdlib: alloca;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.c.stdlib: alloca;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo** ptr = cast(Foo**)alloca((Foo*).sizeof * n);
    Foo*[] foos = ptr[0 .. n];
    
    for (int i = 0; i < n; i++) {
        foos[i] = new Foo;
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }

    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

A next step is to allocate the structs from the C heap, this time the performance is about the same as the C++ code, showing that the cause of the slowdown is the (not efficient yet) D Garbage Collector:
// D version 4, 0.46 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.stdc.stdlib: alloca, malloc;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.c.stdlib: alloca, malloc;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo** ptr = cast(Foo**)alloca((Foo*).sizeof * n);
    Foo*[] foos = ptr[0 .. n];
    
    for (int i = 0; i < n; i++) {
        foos[i] = cast(Foo*)malloc(Foo.sizeof);
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }

    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

So we can restore the original allocation of the dynamic array of pointers, the performance improves a tiny bit (less than 0.01 s):
// D version 5, 0.46 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.stdc.stdlib: malloc;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.c.stdlib: malloc;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo*[] foos = new Foo*[n];
    
    for (int i = 0; i < n; i++) {
        foos[i] = cast(Foo*)malloc(Foo.sizeof);
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }

    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

We may think that the problem is in the double initalization of the structs, it seems that's not the case, as the following code is the slowest, here I use uninitialized memory from the GC heap:
// D version 6, 1.22 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.core.Memory: GC;
    alias GC.malloc gcmalloc;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.gc: gcmalloc = malloc;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo*[] foos = new Foo*[n];
    
    for (int i = 0; i < n; i++) {
        foos[i] = cast(Foo*)gcmalloc(Foo.sizeof);
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }

    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

I don't fully know why the performance of the version 6 is so low (if you have ideas please tell me), part of that lower performance comes from the GC that scans the memory of those structs. You can see it in the version 6b, where I have disabled the scanning of those memory blocks, there's indeed a significant performance improvement:
// D version 6b, 0.86 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.core.Memory: GC;
    alias GC.malloc gcmalloc;
    void hasNoPointers(void* p) {
        GC.setAttr(p, GC.BlkAttr.NO_SCAN);
    }
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.gc: gcmalloc = malloc, hasNoPointers;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo*[] foos = new Foo*[n];

    for (int i = 0; i < n; i++) {
        foos[i] = cast(Foo*)gcmalloc(Foo.sizeof);
        hasNoPointers(foos[i]);
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);
    }

    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

But even if the GC doesn't scan those memory blocks, it performs several operations anyway, so when you allocate many pieces of memory and you need performance, you may want to disable the GC (and enable it just after the allocation, here I have not added the enable(), but in your code you are supposed to put it), the performance is intermediate:
// D version 7, 0.63 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.core.Memory: GC;
    alias GC.disable disable;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.gc: disable;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    disable();
    
    Foo*[] foos = new Foo*[n];

    for (int i = 0; i < n; i++) {
        foos[i] = new Foo;
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }
    
    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

We can also go back ot the object-based version, with a further little reduction of performance:
// D version 8, 0.71 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.core.Memory: GC;
    alias GC.disable disable;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.gc: disable;
}

class Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;

    this(int x1, int x2, int x3, int x4, int x5, int x6, int x7, int x8, int x9, int x10) {
        y1 = x1;
        y2 = x2;
        y3 = x3;
        y4 = x4;
        y5 = x5;
        y6 = x6;
        y7 = x7;
        y8 = x8;
        y9 = x9;
        y10 = x10;
    }
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    disable();
    Foo[] foos = new Foo[n];

    for (int i = 0; i < n; i++)
        foos[i] = new Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);

    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

Having a garbage collector is handy, and sometimes safer too, but it has a price too (in the size of the binary too). D allows you to not use the GC when you need C++-like performance, but then you have to remember to manage and deallocate memory manually as in C, or you need to implement/use other forms or memory management as in C++ (and D allows the scope(exit) idiom too to deallocate memory when a scope ends). If the D language will get widespread, its GC will improve, restoring part of the lost performance.
comments: Leave a comment Add to Memories Tell a Friend

Tags:,
Subject:Tree visits in D
Time:09:28 pm
On the RosettaCode site there's this Programming Task:
http://rosettacode.org/wiki/Tree_traversal

Problem: Implement a binary tree where each node carries an integer, and implement preoder, inorder, postorder and level-order traversal. Use those traversals to output the following tree:
         1
        / \
       /   \
      /     \
     2       3
    / \     /
   4   5   6
  /       / \
 7       8   9

The correct output should look like this:

preorder:    1 2 4 7 5 3 6 8 9
inorder:     7 4 2 5 1 8 6 9 3
postorder:   7 4 5 2 8 9 6 3 1
level-order: 1 2 3 4 5 6 7 8 9
The following is my D implementation. I have used the D Version 2 language (to compile the code I am using DMD v2.032, with the Phobos std lib, but it uses only the printing of Phobos, so adapting the code to Tango is easy), but a very similar version can be adapted for D V1.

Disclaimer: The following isn't an example of common (or good) D code. It's very generic and it's a bit tricky. In practice you usually write simpler code, because you don't need such genericity, and to keep the code simpler to write and understand. Simpler code is also simpler to debug. You may need to write code so much generic only in library-like routines, that are usually limited in number and size. So there are surely ways to write simpler/shorter D code, but here I show this version because it can be useful to explain some features of the D language, and because it's more fun.

import std.stdio: write, writeln;

class Node(T) {
    T data;
    Node left, right;
    this(T data, Node left=null, Node right=null) {
        this.data = data;
        this.left = left;
        this.right = right;
    }
}

// static templated opCall can't be used in Node
auto node(T)(T data, Node!T left=null, Node!T right=null) {
    return new Node!T(data, left, right);
}

void show(T)(T x) {
    write(x, " ");
}

enum Visit { pre, inv, post }

// visitor can be any kind of callable or it uses a default visitor.
// TNode can be any kind of Node, with data, left and right fields,
// so this is more generic than a member function of Node.
void backtrackingOrder(Visit v, TNode, TyF=void*)(TNode node, TyF visitor=null) {
    static if (is(TyF == void*)) auto truevisitor = &show!(typeof(node.data));
    else                         auto truevisitor = visitor;
    if (node !is null) {
        static if (v == Visit.pre)   truevisitor(node.data);
        backtrackingOrder!v(node.left, visitor);
        static if (v == Visit.inv)  truevisitor(node.data);
        backtrackingOrder!v(node.right, visitor);
        static if (v == Visit.post) truevisitor(node.data);
    }
}

void levelOrder(TNode, TyF=void*)(TNode node, TyF visitor=null, TNode[] more=[]) {
    static if (is(TyF == void*)) auto truevisitor = &show!(typeof(node.data));
    else                         auto truevisitor = visitor;
    if (node !is null) {
        more ~= [node.left, node.right];
        truevisitor(node.data);
    }
    if (more.length)
        levelOrder(more[0], truevisitor, more[1..$]);
}

void main() {
    auto tree = node(1,
                     node(2,
                          node(4,
                               node(7)),
                          node(5)),
                     node(3,
                          node(6,
                               node(8),
                               node(9))));
    write("  preOrder: ");
    backtrackingOrder!(Visit.pre)(tree);
    write("\n   inOrder: ");
    backtrackingOrder!(Visit.inv)(tree);
    write("\n postOrder: ");
    backtrackingOrder!(Visit.post)(tree);
    write("\nlevelorder: ");
    levelOrder(tree);
    writeln();
}
Output:
  preOrder: 1 2 4 7 5 3 6 8 9
   inOrder: 7 4 2 5 1 8 6 9 3
 postOrder: 7 4 5 2 8 9 6 3 1
levelorder: 1 2 3 4 5 6 7 8 9
A possible design for Java is to create a Node class (using generics to allow for different types of data) that has preOrder, InOrder, etc, methods. This simple design can be used in D too. But it conflates data structure and algorithms, so you can't use the same algorithms (the tree visits) for other kinds of nodes.

So if your language is flexible enough (like C++, D, Haskell, or Python) it's better to follow the strategy used in the C++ STL, and implement generic algorithms, that work with more kinds of binary tree nodes and allow for a more flexible management of the node being visited.

Another design that can be natural in modern versions of Python is to create iterators, that yield the current node. This is very handy, and such strategy can be used in D too (with opApply or with the second iteration protocol of D2) but I have not used it in my D code. In Python (and in some other languages, in D too if you use opApply) this strategy leads to O(n^2) tree visits. (There is a Python PEP that if well implemented may solve this problem).

Th flexibility of D language has allowed me to merge the three backtracking visits (pre, in and post - order) in a single templated function with zero overehead.

In the code I have used a class to represent the node to keep code a little simpler, but a struct too can be used with very small changes. Such structs are smaller 8 bytes less on a 32 bit system) and it's simpler to allocate them with a memory pool that can halve the time needed to create and allocate the tree.

I have added a helper node() templated function, it just allocates a new node and returns it. In D something similar can also be done with a static opCall inside Node, but here it's also templated, and templatd methods can't replace opCall of object (I may be wrong on this, but I don't think so). If you have a different Node class/struct you need a different helper function, or you have to use opCall (if the Node isn't generic) or you have to build the tree in a simpler way.

Node!T is the new compact syntax introduced in D2 to instantiate a template when you have a single argument. The standard older syntax is Node!(T), that equals to C++ Node.

show() is a little function that's used as default visitor of the node. It's templated, so you have to instantiate it before taking its address to use it as function pointer.

backtrackingOrder() merges the three tree visits, its code is a little tricky.

The enum Visit is used to represent one of the three possible visits. To specify it you can first partial specify backtrackingOrder() according to visit type known a compile-time, and then you can fully specify it automatically with the given root node and an optional callable.

The node can be any type that has a left, right and node attributes.

visitor can be any callable, a delegate, function pointer, or callable object. If visitor is unspecified you can omit it, and it uses show() instantiated on the type of the data.

The static ifs are done at compile time, and they don't create a scope, so truevisitor (its type is found automatically) is then visible under the static if. The is() syntax is used to test if types are the same (a better simpler design for the D language is just to allow == between them).

In library code you may add few more static asserts inside backtrackingOrder() to test that visitor is a callable, with a line of code like:
static assert(IsCallable!(typeof(truevisitor)), "...");
comments: Leave a comment Add to Memories Tell a Friend

Tags:, , , , ,
Subject:Code performance in D/Java
Time:05:05 pm
The code discussed in this post:
http://www.fantascienza.net/leonardo/js/life_bench.zip


To try to find "performance bugs" in both the LDC D compiler and the LLVM back-end I am exploring the performance signature of many small programs, often translating them to D. To perform such tests I compare the timings of the D code with the original C or Java code.

Java here is useful because its performance profiles are a little different from the usual C ones. Java HotSpot is able to inline many virtual calls, and its GarbageCollector is quite efficient (both things aren't good in all current D implementations).

I've found a small Life (Horton Conway's game) implementation in Java that shows a higher performance compared to D code (the original Java code isn't mine), so I cleaned up the Java code, I have removed the useless stuff from it (see the zip for the Java code), and I have translated it to as much close as possible D code (able to run both on Tango and Phobos). The result is the first D program (life1_d.d).

I have taken care of setting as final the main class in the D code, to allow inlining.

The Java GC (of Sun) is more efficient than the nonmoving D GC, so first of all I have taken a look at the number of memory allocations, bue they weren't the cause of the performance difference. I've profiled both the D and Java code, and I've seen that the calc_new() method was the one taking most time. I've also seen that the amount of inlining with LDC (using -O5 -release -inline) was not enough, so I've compiled the D code with the following, that forces a more aggressive inlining, that improves performance:

ldc -O5 -release -inline -inline-threshold=2000000001 life1_d.d

I have also seen that for this program Link-Time Optimization plus Interning improve the performance of the code:

ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1_d.d

opt -std-compile-opts life1_d.bc > life1_do.bc

llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1_do life1_do.bc

I have done many more experiments, here you can find the ones that have given good results. I have split the Life class in its methods plus global values, I don't know why this increases performance with the LDC compiler (see the timings), see life2_d.d.

Later I have uses simple pointers as function arguments instead of arrays (see see life3_d.d), again I don't know why this increases the performance with the LDC compiler, also because most of sych functions get inlined anyway.

Now the performance of life3_d.d was bad only for large values (the last values of the use_Sizes array). After several more experiments I have by chance found that the cause was in the inner loop of the calc_new() function, where the whole botRow array is reset to zero. I think here the Java compiler recognizes such loop as a clearing, and replaces it with a call to a function like memset(). This makes the Java code slower than the D one for small botRow arrays, but faster for the longer ones (because it seems an inlined loop is faster than a memset when the length is small, about 50-100 or so if array items are 4 byte long).

So in life4_d.d I have replaced the loop with something a little more complex that takes a look at botRow length:
if (botRow.length > 100)
    botRow[] = 0;
else
    for (int c = 0; c < botRow.length; c++)
        botRow[c] = 0;
The timings of the 4th D version are now good enough, but not the best still, I don't know why. (I am now trying to find a faster array reset that uses an asm routine that contains the movntps SSE instuction).

------------------------

Scores on Windows Vista, using DMD compiler for the D code (bigger is better):
java -server Life
Size    average
Adjusting 6744 to 2811246
5       14288
6       14974
8       14125
10      15697
15      14175
25      15369
50      14595
250     6331
1000    2218
2500    880


life1_d.exe
Size    average
5       9248
6       8938
8       10213
10      10469
15      10974
25      9898
50      7189
250     1694
1000    421
2500    170


life2_d.exe
Size    average
5       9424
6       8270
8       8083
10      7876
15      8389
25      7191
50      5288
250     1331
1000    331
2500    147


life3_d.exe
Size    average
5       10156
6       9172
8       9353
10      9043
15      8802
25      8452
50      5865
250     1357
1000    356
2500    151


life4_d.exe
Size    average
Adjusting 917315 to 0
5       10145
6       9275
8       9631
10      10282
15      10061
25      9083
50      6286
250     4747
1000    1596
2500    743

Home Vista Basic with 2 GB RAM, Celeron 2.13 GHz

Compilers used:

Java version "1.7.0-ea"
Java(TM) SE Runtime Environment (build 1.7.0-ea-b66)
Java HotSpot(TM) Client VM (build 16.0-b06, mixed mode, sharing)

DMD Digital Mars D Compiler v1.042

------------------------

Scores on Ubuntu running on VirtualBox running on Vista, using LDC compiler for the D code (bigger is better):
java -server Life
Size	average
Adjusting 28766 to 951466
Adjusting 951466 to 2616313
5	13740
6	16494
8	14922
10	19481
15	20602
25	21249
50	19979
250	7493
1000	2609
2500	1004


ldc -O5 -release -inline life1_d.d
Size	average
5	17810
6	16661
8	15147
10	15201
15	14818
25	12901
50	4384
250	1893
1000	485
2500	198


ldc -O5 -release -inline -inline-threshold=2000000001 life1_d.d
Size	average
5	16016
6	14690
8	13183
10	11823
15	11100
25	10144
50	5203
250	2305
1000	571
2500	256


ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1_d.d
opt -std-compile-opts life1_d.bc > life1_do.bc
llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1_do life1_do.bc
Size	average
5	22628
6	18773
8	14326
10	18820
15	17558
25	15264
50	10155
250	2329
1000	578
2500	251


ldc -O5 -release -inline -inline-threshold=2000000001 life2_d.d
Size	average
5	16557
6	15205
8	13149
10	12037
15	10870
25	10051
50	6470
250	2327
1000	571
2500	247


ldc -O5 -release -inline -inline-threshold=2000000001 life3_d.d
Size	average
5	18205
6	17032
8	15948
10	17798
15	17907
25	15656
50	10191
250	2310
1000	574
2500	248


ldc -O5 -release -inline -inline-threshold=2000000001 life4_d.d
Size	average
5	18169
6	16576
8	16129
10	17616
15	17937
25	15486
50	9667
250	6714
1000	2006
2500	886


ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life4_d.d
opt -std-compile-opts life4_d.bc > life4_do.bc
llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life4_do life4_do.bc
Size	average
5	25522
6	22085
8	21079
10	20964
15	21987
25	19300
50	13160
250	6774
1000	1986
2500	900


This version (not included in the zip) is the first D one, but the cleaning of the botRow array in calc_new() is done as in life4_d.d:
ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1b_d.d
opt -std-compile-opts life1b_d.bc > life1b_do.bc
llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1b_do life1b_do.bc
Size	average
5	22957
6	19394
8	16457
10	17208
15	18177
25	15522
50	5089
250	6638
1000	1971
2500	889


Code running on Ubuntu 9.4 running on VirtualBox 3.0.6 r52128.

Compilers used:

LDC based on DMD v1.045 and llvm 2.6 (Thu Sep 10 23:50:27 2009)

Java version "1.6.0_16"
Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
Java HotSpot(TM) Client VM (build 14.2-b01, mixed mode, sharing)

------------------------
comments: Leave a comment Add to Memories Tell a Friend

Tags:, , , , , , , ,
Subject:"Silly benchmark" in D/C++
Time:09:45 pm
All the code discussed here:
http://www.fantascienza.net/leonardo/js/silly_bench.zip

Slava Pestov, author of the programming language Factor, has created a very simple benchmark. Despite being defined "silly" such benchmark has shown me some interesting things.

This the original posts:
"Performance comparison between Factor and Java on a contrived benchmark"
http://factor-language.blogspot.com/2009/08/performance-comparison-between-factor.html

"Struct arrays benchmark revisited: trig function calls are slow in Java, but without them Factor is still 3x faster":
http://factor-language.blogspot.com/2009/08/struct-arrays-benchmark-revisited.html

What the benchmark does:
- It works on points, which are triplets of single-precision floats, (x,y,z)
- First, the benchmark creates a list of 5000000 points for i=0..4999999, where the ith point is (sin(i),cos(i)*3,sin(i)*sin(i)/2).
-Then, each point is normalized; the x, y, and z components are divided by sqrt(x*x+y*y+z*z).
- Finally, the maximum x, y, and z components are found, for all points, and this is printed out.

The purpose of this benchmark is to compare performance of an array of struct in Factor to an array of class references in Java. It also shows how trigonometric functions are managed.

Slava has given Java code, that I have modified a bit, adding more timings inside, pulling out a subclass, making it static. The overall running time of the Java version is improved just a little.

The Java version is slow for the large amount of memory allocations, and also because it doesn't use the CPU native trigonometric instructions (sin, cos or sincos) when the input is arge, because of cross platform issues and because the Intel CPU instructions return quite inaccurate results in some cases, like when the input numers are large.

So I have created a second Java version, that uses an helper class (probably written by Razzi, http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums〈=java&id=4 ) to reduce the input range of trig functions. The result is much faster. (2.4 seconds instead of 11.4).

Then I have created three D versions:
- The first version is designed to be very similar to the Java version, the array is filled with class references. It's slower on performing the sqrt() and other code, but even in non-release mode it's overall quite faster than the first Java version. Bothe the Java versions and this D version use lot of RAM on each iteration of the program, the GC doesn't free it efficiently.
- The second D version is more optimized. The rray is now composed of struct values. I also delete the array at the end of each loop, so the memory used by the program is about constant. I have also used free functions instead of Silly class, but this has not much effect. D1 structs don't allow constructors (D2 has struct constructors), so I have had to change code a bit there too. The result is much faster, almost two times faster than the second Java version.
- In the third D version I've tried to manually optimize what possible. The DMD compiler doesn't inline functions with ref arguments and some other things. I have avoided an useless inizialization of the struct array and I have manually inlined its filling. I have used for loops insead of foreach, and allocated memory from the GC heap in a raw way. I have optimized the code using only one call to sin() because the DMD compiler isn't able to see the two calls to sin() have the same results. I have hand-optimized the computation of the max struct, even if takes only a small percentage of the whole runtime. The result is a nice speeup compared to the second D version. In D there's no simple way to allocate an array of uninitialized structs (I have to test something with LDC: it may not initialize the array if the struct data is float x=void,y=void,z=void; To be tested). LDC compiler can probably improve such timings.

Later, to have a baseline timing, I have translated the code to C++, and I have compiled it with G++ and LLVM-GCC. The second compiler uses calls to tle libc even using the -ffast-math compilation argument, so the running time is quite slower. Compiled with G++ the program is quite faster than the D code compiled with DMD.

This code in Haskell (or Python, or D with my dlibs) can be lazy, and reduce a lot memory used and probably running time too.

When possible I'll add timings with the LDC D compiler.

(I have not put the Python+Psyco timings here because I have not enough RAM to test it with n=5 millions. With n=1 million the best Python+Psyco time is about 3700 milliseconds. With the first Java program the best timing with n=1 million is 577 milliseconds, about seven times faster).

--------------------
Timings on Windows:

...>java -Xms512m -Xmx512m -server Silly1
Run #0
0.8944272, 1.0, 0.4472136
10810 422 234 0  Time: 11466

...>java -Xms512m -Xmx512m -server Silly2
Run #0
0.8944272, 1.0, 0.4472136
1966 406 124 0  Time: 2496

...>silly_bench_cpp (g++)
Run #0
0.894427, 1.000000, 0.447214
334 255 36 5  Total=632

...>silly_bench_cpp (llvm-g++)
Run #2
0.894427, 1.000000, 0.447214
1304 374 21 5  Total=1705

...>silly_bench1_d (No -release mode):
Run #0
0.894427, 1.000000, 0.447214
6941 444 80 0  total = 7466

...>silly_bench2_d
Run #1
0.894427, 1.000000, 0.447214
785 446 74 0  total = 1305

...>silly_bench3_d
Run #2
0.894427, 1.000000, 0.447214
488 443 28 0  Total = 960
------------------

Compilation arguments for g++ and llvm-g++:
g++ -Wall -O3 -s -fomit-frame-pointer -msse -msse2 -msse3 -march=native -ffast-math silly_bench_cpp.cpp -o silly_bench_cpp

Compilation arguments for DMD:
dmd -O -release -inline:

Intel CPU Celeron 560 at 2.13 GHz, 1GB RAM, Vista Home Basic.

Compilers used:
DMD v1.042
gcc v. 4.3.3-dw2-tdm-1 (GCC)
LLVM-G++ gcc version 4.2.1 (Based on Apple Inc. build 5636) (LLVM build)
Java HotSpot(TM) Client VM (build 16.0-b06, mixed mode, sharing)
comments: 2 comments or Leave a comment Add to Memories Tell a Friend

Tags:, , , , , , ,
Subject:Updates and links
Time:12:36 am
Updated two toy raytracers (Sphereflake and Yopyra) in my two software pages.

I have also added a new benchmark, about the Boggle game:
http://www.fantascienza.net/leonardo/js/boggle.zip

---------------------

A nice and probably useful hierarchical allocator for C programs:
http://swapped.cc/halloc/


Why colors are important in scientific visualizations, and the risks of their bad usage:
http://www.research.ibm.com/people/l/lloydt/color/color.HTM


Theo discusses about what's important for young people to learn, the risks of violent video games, the low quality of educational software:
http://theodoregray.com/BrainRot/index.html
comments: Leave a comment Add to Memories Tell a Friend

Tags:, , , , , , ,
Subject:Updates and links
Time:11:25 pm
"Boostbench", a benchmark, in C, Java and D, the zip contains all the code, timings and information:
http://www.fantascienza.net/leonardo/js/boostbench.zip
Original code by Rene Grothmann:
http://mathsrv.ku-eichstaett.de/MGF/homes/grothmann/java/bench/

With the LDC compiler the performance is poor without Link-Time Optimization plus Interning, I don't know why. (If someone is able to tell me I'll be interested to know.)

-------------------

Theo Jansen demonstrates the amazingly lifelike kinetic sculptures he builds from plastic tubes and lemonade bottles, and then lets walk and live on a beach. They have even a brain made of bottles. Windosaurs, etc:
http://www.youtube.com/watch?v=b694exl_oZo
comments: Leave a comment Add to Memories Tell a Friend

Advertisement

leonardo
View:Recent Entries.
View:Archive.
View:Friends.
View:User Info.
View:Website (My Website).
You're looking at the latest 10 entries.
Missed some entries? Then simply jump back 10 entries