| ||||||||
| All the code shown in this article: http://www.fantascienza.net/leonard This is a partial port of the Olden benchmarks to the D language. I'll add more Olden benchmarks in future as time allows me to. Info on the Olden benchmarks and their Java translation (JOlden): http://www-ali.cs.umass.edu/DaCapo/benc http://www.sable.mcgill.ca/~bdufou1/ash General comments: One of my purposes is to test the efficiency of the D1 code compiled with LDC. LDC is usually able to produce efficient binaries when I translate C code to C-like D code. So now I want to see how D is efficient if I start from higher level code, like Java code. It's usually easy to translate such Java code to D, D looks like a super set of C and Java, with some C++ mixed in. Olden benchmarks are designed to stress first of all the memory allocations. The D Garbage Collector (GC) is quite less efficient than the Java HotSpot GC, so literal translations of the Olden benchmarks from Java to D are usually lower or quite lower performance (up to 4-16 times slower). So I have to work some to regain performance. Usually D allows me to produce final programs that are 2-6 times faster than the original Java versions, but such optimization requires time, experience, and sometimes it's a little bug-prone. Beside having a more efficient GC, Java performs other optimizations not done by LDC, like inlining of virtual methods, and other things. In future the LLVM back-end of LDC will hopefully perform some of such optimizations (obsoleting some of my manual optimizations). In all JOlden benchmarks I have packed the Java code of a benchmark in a single source file, that later I have first translated to D, and then optimized for speed and memory usage. Sometimes I have cleaned up some the D code. Below I explain the various versions I've created. See the Python2 code near the bottom to see the algorithm used by this "em3d" benchmark. This benchmark models the propagation of electromagnetic waves through objects in 3 dimensions. It is a simple computation on an irregular bipartite graph containing nodes representing electric and magnetic field values. Update 1.2, Oct 26 2006: One of the main purposes of this article is to show show and teach how some manual optimizations are done. Timings of the various versions (i is the number of iterations):
i=50 i=200 i=2000
Java1: 4.37 7.13
Java2 3.62 6.87
D01: 6.55 11.40
D02: 4.98 9.36
D03: 3.53 8.01
D04: 3.32 6.44
D05: 2.94 5.95
D06: 2.14 5.23
D07: 1.99 4.72
D08: 1.82 4.04
D09: 1.82 4.03
D10: 1.20 3.06
D11: 1.27 2.96 24.62 (TyIndex=ushort)
D11: 1.52 3.60 (TyIndex=uint)
D11b: 1.27 2.97 23.25 (TyIndex=ushort, TyDelta=ubyte)
D11b: 1.31 3.00 (TyIndex=uint, TyDelta=ushort)
D12: 1.61 3.62
Java3: 3.14 6.24
D01b: 6.33 10.60
Python1: 64.3 ---
------------------------------------ Java1 The original JOlden Java code packed in a single source file. I have fixed a bug present in the Java code of JOlden Em3d but absent in the original C code: At line 131 of the Em3d1.java the line: if (otherNode == toNodes[filled]) break; Has to be fixed as: if (otherNode == toNodes[k]) break; Such bug was not found probably because this benchmark suite has no unit-tests that test well each class and each method of each class. Timings (best of 6, seconds): ...$ time java -server Em3d1 -n 5000 -d 300 -i 50 -m Initializing em3d random graph... Propagating field values for 50 iteration(s)... EM3D build time 2.725 EM3D compute time 1.277 EM3D total time 4.01 Done! real 0m4.370s user 0m3.376s sys 0m0.380s ...$ time java -server Em3d1 -n 5000 -d 300 -i 200 -m Initializing em3d random graph... Propagating field values for 200 iteration(s)... EM3D build time 2.458 EM3D compute time 4.35 EM3D total time 6.812 Done! real 0m7.128s user 0m6.568s sys 0m0.440s Used: java version "1.6.0_16" Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Java HotSpot(TM) Client VM (build 14.2-b01, mixed mode, sharing) Code running on Ubuntu, running on VirtualBox, running on Windows Vista, on CPU Celeron 2.13 GHz. ------------------------------------ Java2 It's like the Java1 version, but it use a very simple portable pseudo-random generator. So when I translate this program to D I can see if it gives the same results. The Random.nextInt is a little different, it accepts a max value (min is 0). Timings: ...$ time java -server Em3d2 -n 5000 -d 300 -i 50 -m Initializing em3d random graph... Propagating field values for 50 iteration(s)... EM3D build time 2.209 EM3D compute time 1.109 EM3D total time 3.318 Done! real 0m3.622s user 0m3.100s sys 0m0.416s ...$ time java -server Em3d2 -n 5000 -d 300 -i 200 -m Initializing em3d random graph... Propagating field values for 200 iteration(s)... EM3D build time 2.197 EM3D compute time 4.354 EM3D total time 6.551 Done! real 0m6.875s user 0m6.292s sys 0m0.448s ------------------------------------ D01 This is the most direct translation of the Java2 code to D, with the portable pseudo-random generator. I have used printf for maximum portability across different D standard libraries. I have replaced the enumerated with the D opApply. As you can see D1 allows to program in a style almost equal to Java. Code compiled with: ...$ ldc -O5 -release -inline em3d01.d With: LDC compiler, based on DMD v1.045 and llvm 2.6 (Thu Sep 10 23:50:27 2009) Timings: ...$ time ./em3d01 -n 5000 -d 300 -i 50 -m Initializing em3d random graph... Propagating field values for 50 iteration(s)... EM3D build time 4.96 EM3D compute time 1.38 EM3D total time 6.34 Done! real 0m6.547s user 0m5.856s sys 0m0.576s ...$ time ./em3d01 -n 5000 -d 300 -i 200 -m Initializing em3d random graph... Propagating field values for 200 iteration(s)... EM3D build time 4.88 EM3D compute time 5.97 EM3D total time 10.86 Done! real 0m11.398s user 0m10.801s sys 0m0.492s The first D version, despite being compiled with LDC, is slower than the Java version. As you can see the build time is much bigger, more than two times, that's mostly because of the D GC. To test that no bugs are introduced I have created a small Python script that acts like the "diff" command, but ignored small differences in floating point values. The programs of this benchmark are able to print the results, so I can test them: ...$ java Em3d2 -n 10 -d 4 -i 5 -m -p > outj ...$ ./em3d01 -n 10 -d 4 -i 5 -m -p > outd01 With it I can test that the results are the same: ...$ python approx_diff.py outj2 outd01 3) line=26: ['EM3D', 'build', 'time', '0.0040'] ['EM3D', 'build', 'time', '0.00'] 0.004 0.0 3) line=28: ['EM3D', 'total', 'time', '0.0040'] ['EM3D', 'total', 'time', '0.00'] 0.004 0.0 This is the asm of inner loop of computeNewValue() compiled with LDC: .LBB11_2: movl 32(%eax), %esi movl (%esi,%edx,4), %esi movl 40(%eax), %edi movsd (%edi,%edx,8), %xmm1 mulsd 8(%esi), %xmm1 subsd %xmm1, %xmm0 movsd %xmm0, 8(%eax) incl %edx cmpl %ecx, %edx jl .LBB11_2 ------------------------------------ D02 Olden benchmarks are quite memory-based, so the first way to optimize them is to reduce the number of memory allocations. Also, in D a good basic optimization for this kind of programs is to convert classes that are instantiated many times, into structs. A quick scan of the code shows that it contains a single instance of Em3d and BiGraph, and many instances of Node. So this version translated Node into a struct, allocated on the heap and managed by pointer. The code is almost the same. I have also reformatted code to 4 spaces. Timings: ...$ time ./em3d02 -n 5000 -d 300 -i 50 -m Initializing em3d random graph... Propagating field values for 50 iteration(s)... EM3D build time 3.39 EM3D compute time 1.40 EM3D total time 4.79 Done! real 0m4.982s user 0m4.416s sys 0m0.472s ...$ time ./em3d02 -n 5000 -d 300 -i 200 -m Initializing em3d random graph... Propagating field values for 200 iteration(s)... EM3D build time 3.35 EM3D compute time 5.84 EM3D total time 9.19 Done! real 0m9.365s user 0m8.821s sys 0m0.460s The code is faster, but there's a lot to do still. The code is not good yet, there are several inefficiencies. ------------------------------------ D03 Timings: I have converted all the classes (Random, BiGraph and Em3d) to structs with minimal performance changes. So the problem is (as expected) elsewhere. I have reformatted the Java comments to produce a more compact code that (despite losing being fit for ddoc) allows me to understand the code better. I have renamed some Node fields to better names, because better names allow to understand code better: toNodes => outbound fromNodes => inbound fromCount => n_inbound Now the main() is a independent function. Then I have disabled the GC just before the graph creation, and enabled again after that. I have also added extra time printings inside the graph creation method, to study better where the running time goes. Then I have added an exit(0) at the end of the main, it kills the program quickly, saving the time needed to call destructors and to free the GC memory. In general this is not a safe thing to do, because destructors may be necessary, but in this program this is OK. Timings: ...$ time ./em3d03 -n 5000 -d 300 -i 50 -m Initializing em3d random graph... 0.14 0.94 0.48 0.43 Propagating field values for 50 iteration(s)... EM3D build time 1.99 EM3D compute time 1.36 EM3D total time 3.35 Done! real 0m3.530s user 0m2.920s sys 0m0.524s ...$ time ./em3d03 -n 5000 -d 300 -i 200 -m Initializing em3d random graph... 0.15 0.93 0.48 0.43 Propagating field values for 200 iteration(s)... EM3D build time 1.99 EM3D compute time 5.80 EM3D total time 7.79 Done! real 0m8.012s user 0m7.348s sys 0m0.540s As expected the compute time is unchanged, but the build time is decreased significantly. Now the build time is lower than the Java2 code! So the first benchmark (with just 50 loops) is faster than the Java2 version. This is the asm of inner loop of computeNewValue() compiled with LDC: .LBB10_2: movl 24(%eax), %esi movl (%esi,%edx,4), %esi movl 32(%eax), %edi movsd (%edi,%edx,8), %xmm1 mulsd (%esi), %xmm1 subsd %xmm1, %xmm0 movsd %xmm0, (%eax) incl %edx cmpl %ecx, %edx jl .LBB10_2 ------------------------------------ D04 I have renamed a method: BiGraph.compute() => graph.computeStep(); And I have added a test for the sanity of the input numDegree: if (numDegree > numNodes) {... The program allocates a linked list of nodes, and then doesn't change that topology any more, just iterates on them. Both allocating many single nodes, and iterating on a linked list, today are slow operations (the original C code was old). So today it's much more efficient to allocated many or all Nodes in an array (this program contains a bipartite graph, so there are two arrays). So I can also remove the "next" pointer field of Node, that was used for the links of the list. In this D version the arcs are kept as pointers. Now the opApply isn't needed (this increases speed because I now loop on arrays, but loses some encapsulation). The fillTable() method becomes the createTable() because it allocates an array of the nodes. Timings: ...$ time ./em3d04 -n 5000 -d 300 -i 50 -m Initializing em3d random graph... 0.13 0.94 0.48 0.40 Propagating field values for 50 iteration(s)... EM3D build time 1.95 EM3D compute time 1.18 EM3D total time 3.13 Done! real 0m3.326s user 0m2.696s sys 0m0.540s ...$ time ./em3d04 -n 5000 -d 300 -i 200 -m Initializing em3d random graph... 0.13 0.92 0.46 0.40 Propagating field values for 200 iteration(s)... EM3D build time 1.91 EM3D compute time 4.35 EM3D total time 6.26 Done! real 0m6.443s user 0m5.908s sys 0m0.432s The build time is not changed, I don't know why, probably disabling the GC allows for an efficient enough memory allocation. But now the compute time is decreased, because the program can scan the arrays more efficiently. If a program like this gets used in practice, the build time is not important, because the number of iterations is probably large, so the compute time has to be as small as possible. Now both timings are better than the Java2 timings. This is the asm of inner loop of computeNewValue() compiled with LDC: .LBB9_2: movl 12(%eax), %esi movl (%esi,%edx,4), %esi movl 20(%eax), %edi movsd (%edi,%edx,8), %xmm1 mulsd (%esi), %xmm1 subsd %xmm1, %xmm0 movsd %xmm0, (%eax) incl %edx cmpl %ecx, %edx jb .LBB9_2 ------------------------------------ D05 The CPU is able to use the pointers among Nodes in a quick way, but the number of nodes is probably limited. The program can increase its speed if there is less data traffic through the CPU cache. So I have replaced the pointers by 16 bit (ushort) indexes. This allows up to 2^16 nodes. If you need more, you can replace: alias ushort TyIndex; With: alias uint TyIndex; But this may produce code slower than D04. Timings: ...$ time ./em3d05 -n 5000 -d 300 -i 50 -m Initializing em3d random graph... 0.06 0.94 0.38 0.42 Propagating field values for 50 iteration(s)... EM3D build time 1.80 EM3D compute time 1.00 EM3D total time 2.80 Done! real 0m2.937s user 0m2.540s sys 0m0.324s ...$ time ./em3d05 -n 5000 -d 300 -i 200 -m Initializing em3d random graph... 0.07 0.95 0.37 0.39 Propagating field values for 200 iteration(s)... EM3D build time 1.78 EM3D compute time 4.02 EM3D total time 5.80 Done! real 0m5.955s user 0m5.508s sys 0m0.360s This is the asm of inner loop of computeNewValue() compiled with LDC: .LBB9_2: movl 20(%eax), %edi movsd (%edi,%esi,8), %xmm1 movl 12(%eax), %edi movzwl (%edi,%esi,2), %edi imull $40, %edi, %edi mulsd (%edx,%edi), %xmm1 subsd %xmm1, %xmm0 movsd %xmm0, (%eax) incl %esi cmpl %ecx, %esi jb .LBB9_2 ------------------------------------ D06 Now I look for possible algorithmic improvements. makeUniqueNeighbors() contains stupid quadratic code to avoid duplicating Neighbors. I have tried to use a D associative array, and then a set implemented as a bit vector (using bt/bts intrinsics), but the most efficient seems a set of booleans implemented with just an array of ubytes. I have renamed: coeffs => inCoeffs Timings: ...$ time ./em3d06 -n 5000 -d 300 -i 50 -m Initializing em3d random graph... 0.07 0.11 0.40 0.39 Propagating field values for 50 iteration(s)... EM3D build time 0.97 EM3D compute time 1.03 EM3D total time 2.00 Done! real 0m2.145s user 0m1.748s sys 0m0.324s ...$ time ./em3d06 -n 5000 -d 300 -i 200 -m Initializing em3d random graph... 0.07 0.11 0.37 0.40 Propagating field values for 200 iteration(s)... EM3D build time 0.95 EM3D compute time 4.11 EM3D total time 5.06 Done! real 0m5.231s user 0m4.804s sys 0m0.352s The build time for 5000 nodes with 300 arcs is half as before. This is the asm of inner loop of computeNewValue() compiled with LDC: .LBB9_2: movl 20(%eax), %edi movsd (%edi,%esi,8), %xmm1 movl 12(%eax), %edi movzwl (%edi,%esi,2), %edi imull $40, %edi, %edi mulsd (%edx,%edi), %xmm1 subsd %xmm1, %xmm0 movsd %xmm0, (%eax) incl %esi cmpl %ecx, %esi jb .LBB9_2 ------------------------------------ D07 Now the most common optimizations are done, and the code looks good enough. To improve the code some more we have to think more about the CPU cache. The computations (of "compute time") are done on an array of large structs Node. But only few fields are actually necessary for such computation: value, inbound and inCoeffs (total 6 words), so if I remove the useless data, the iterations on the array will be (hopefully) faster. I don't know how to remove items from the struct. The simpler way to solve this problem is to define a second Node struct, a Node2, with just the essential fields. Creating the two arrays of Node2 is fast, because all arrays of inbound and inCoeffs are just copied by reference. So I add the eNodes2 and hNodes2 dynamic arrays of Node2 to BiGraph (now encapsulation looks gone, but in practice I don't think the original design was good. The only data structure that has to know about a collection of Nodes2 has to be BiGraph, and not Node2 itself). There's some redundancy too here in a Node2: the inbound and inCoeffs arrays have the same length, so the Node2 can use just 5 words. There are few ways to remove this field, but the first field (the value) is a double, and for max performance it must be aligned to 8 bytes. So I have done few experiments, but they are failed, the program was slower, with a small memory usage reduction. So I have kept all the 6 words. Timings: ...$ time ./em3d07 -n 5000 -d 300 -i 50 -m Initializing em3d random graph... Detailed creation timings: 0.08 0.10 0.37 0.39 0.00 Propagating field values for 50 iteration(s)... EM3D build time 0.94 EM3D compute time 0.92 EM3D total time 1.86 Done! real 0m1.993s user 0m1.600s sys 0m0.316s ...$ time ./em3d07 -n 5000 -d 300 -i 200 -m Initializing em3d random graph... Detailed creation timings: 0.07 0.10 0.39 0.40 0.00 Propagating field values for 200 iteration(s)... EM3D build time 0.96 EM3D compute time 3.62 EM3D total time 4.58 Done! real 0m4.718s user 0m4.292s sys 0m0.360s The build time of the graph is the same, but the computing time is decreased. The time needed to build the arrays of Node2 is very small. This is the asm of inner loop of computeNewValue() compiled with LDC: .LBB9_2: movl 20(%eax), %edi movsd (%edi,%esi,8), %xmm1 movl 12(%eax), %edi movzwl (%edi,%esi,2), %edi imull $24, %edi, %edi mulsd (%edx,%edi), %xmm1 subsd %xmm1, %xmm0 movsd %xmm0, (%eax) incl %esi cmpl %ecx, %esi jb .LBB9_2 ----------------------------------- D08 We can improve the code more taking a look at how hValues are accessed, a sequential access is faster. But I've seen that the inbound[i] are already sorted. So I have to take a better look at the very small loop that performs one of the two halves of a computation step:
void computeNewValue(Node2[] otherTable) {
for (int i; i < inbound.length; i++)
value -= inCoeffs[i] * otherTable[inbound[i]].value;
}
There's not much that can be improved here. I have drawn on paper the simple data structures involved here, and with that I've seen that inbound[i] performs (forward but) large jumps in an array of largish structs (6 words each) to find the "value" fields. This forces the CPU cache to a lot of traffic. If we reduce this traffic the iterations will get faster. The simple way to do this is to pull the values out of the nodes2 and put them in a uniform array, that's used in parallel to the Node2 array.
// Node2 of D07:
struct Node2 {
double value;
Node.TyIndex[] inbound;
double[] inCoeffs;
}
// Node2 of D08:
struct Node2 {
Node.TyIndex[] inbound;
double[] inCoeffs;
static double[] eValues, hValues;
}
Now the jumps are performed on arrays (eValues and hValues) with just 2 words/item. To perform a better computation I have had to split computeNewValue() into eValuesStep() and hValuesStep(). I have pulled the outer loop like foreach(j, ref n; hNodes2) inside them, this may help the DMD compiler too, because it doesn't inline functions/methods that have a loop inside. I also have had to add a "aux_val" auxiliary variable inside the eValuesStep/hValuesStep functions, because it seems LDC/LLVM was not able to pull the access to Node2.hValues[j] out of the loop. I don't know why [see below, the explanation after D01b]. Timings: ...$ time ./em3d08 -n 5000 -d 300 -i 50 -m Initializing em3d random graph... Detailed creation timings: 0.07 0.10 0.37 0.40 0.00 Propagating field values for 50 iteration(s)... EM3D build time 0.94 EM3D compute time 0.74 EM3D total time 1.68 Done! real 0m1.817s user 0m1.416s sys 0m0.332s ...$ time ./em3d08 -n 5000 -d 300 -i 200 -m Initializing em3d random graph... Detailed creation timings: 0.07 0.10 0.37 0.39 0.00 Propagating field values for 200 iteration(s)... EM3D build time 0.93 EM3D compute time 2.96 EM3D total time 3.89 Done! real 0m4.038s user 0m3.628s sys 0m0.336s The compute time is decreased enough. This is the asm of inner loop of eValuesStep() compiled with LDC: .LBB9_4: movzwl (%esi,%edi,2), %ebp movsd (%edx,%edi,8), %xmm1 mulsd (%ebx,%ebp,8), %xmm1 subsd %xmm1, %xmm0 incl %edi cmpl %ecx, %edi jb .LBB9_4 ------------------------------------ D09 There's one last possible optimization of the memory usage that I can see. I can remove again the redundancy from Node2, because now it doesn't contain a double, and now probably I can use 3 32-bit words. The new Node2 is:
struct Node2 {
Node.TyIndex* inbound;
double* inCoeffs;
int inbound_len;
static double[] eValues, hValues;
}
Timings: ...$ time ./em3d09 -n 5000 -d 300 -i 50 -m Initializing em3d random graph... Detailed creation timings: 0.08 0.10 0.37 0.39 0.00 Propagating field values for 50 iteration(s)... EM3D build time 0.94 EM3D compute time 0.74 EM3D total time 1.68 Done! real 0m1.821s user 0m1.444s sys 0m0.296s ...$ time ./em3d09 -n 5000 -d 300 -i 200 -m Initializing em3d random graph... Detailed creation timings: 0.07 0.11 0.37 0.39 0.00 Propagating field values for 200 iteration(s)... EM3D build time 0.94 EM3D compute time 2.95 EM3D total time 3.89 Done! real 0m4.030s user 0m3.632s sys 0m0.332s The computing time is about the same. The memory saving compared to D08 is very small, something like 40 KB. It may be possible to vectorize the computing loop, but I stop here. Compared to the Java2 version the speed of this last D09 version is not much higher, because the computing loop is very tight and the JavaVM is able to optimize it very well, quite better than LDC, see the compute time of Java2 compared to the compute time of D01: 4.35 compared to 5.97. A shame for D/LDC/LLVM :-) Other Olden benchmarks show a bigger gap between the performance of the Java and optimized D code. This is the asm of inner loop of eValuesStep() compiled with LDC: .align 16 .LBB9_4: movzwl (%esi,%edi,2), %ebp movsd (%edx,%edi,8), %xmm1 mulsd (%ebx,%ebp,8), %xmm1 subsd %xmm1, %xmm0 incl %edi cmpl %ecx, %edi jl .LBB9_4 ------------------------------------ D10 Update 1.2, Oct 27 2006: There is another simple performance optimization that can be done. The memory for the Nodes and their arcs can be allocated from the C heap (and in some cases such memory can even be allocated with malloc instead of calloc, but in this programs this has no significant performance difference). The memory coming from the C heap is usually aligned to 4 bytes, so it's not fit for storing values and coefficients that in this program are doubles, that are used much more efficiently when aligned to 8 bytes. But in this program the actual computations performed on double FP numbers are done on arrays allocated in Node2, that are allocated from the D GC heap that returns memory aligned to 16 bytes. The only problem may come from Node.inCoeffs that's copied as is to Node2 inCoeffs. So this allocation needs a little of extra care (see makeFromNodes()). I can even allocate all "outbound" arrays at once, because they are all of the same length, but this change will not change performance significantly, because that time is part of the first timing of the "Detailed creation timings:", that's about 0.03-0.04 seconds for 5000 nodes. I have moved the disable() of the GC below, when the computing loops are done, because there's no point in keeping the GC active during those loops, that produce no garbage. Timings: ...$ time ./em3d10 -n 5000 -d 300 -i 50 -m Initializing em3d random graph... Detailed creation timings: 0.04 0.10 0.14 0.19 0.00 Propagating field values for 50 iteration(s)... EM3D build time 0.47 EM3D compute time 0.60 EM3D total time 1.07 Done! real 0m1.203s user 0m0.968s sys 0m0.148s ...$ time ./em3d10 -n 5000 -d 300 -i 200 -m Initializing em3d random graph... Detailed creation timings: 0.03 0.10 0.16 0.18 0.00 Propagating field values for 200 iteration(s)... EM3D build time 0.47 EM3D compute time 2.46 EM3D total time 2.93 Done! real 0m3.061s user 0m2.844s sys 0m0.124s Not just (as expected) the build times are lower, but the compute ones too are lower, I don't know why, maybe the memory from the C heap has more coherence (more contiguous, reducing cache misses. The GC adds some spaces, because it returns memory blocks aligned to 16 bytes). ------------------------------------ D11 Update 1.2, Oct 27 2006: To reduce the time used by makeFromNodes(), and hopefully to increase the computation loops (with a better cache coherence) we can cut the inbound and inCoeffs arrays of Node (later copied into Node2) from a large memory block allocated all at once (again, doubles are better alighed to 8 bytes) to do this I have created two memory arenas, with the DoubleArena and IndexArena structs. Timings: ...$ time ./em3d11 -n 5000 -d 300 -i 50 -m Initializing em3d random graph... Detailed creation timings: 0.04 0.10 0.00 0.42 0.00 Propagating field values for 50 iteration(s)... EM3D build time 0.56 EM3D compute time 0.58 EM3D total time 1.14 Done! real 0m1.271s user 0m1.028s sys 0m0.164s ...$ time ./em3d11 -n 5000 -d 300 -i 200 -m Initializing em3d random graph... Detailed creation timings: 0.04 0.09 0.00 0.43 0.01 Propagating field values for 200 iteration(s)... EM3D build time 0.57 EM3D compute time 2.23 EM3D total time 2.80 Done! real 0m2.961s user 0m2.652s sys 0m0.184s ...$ time ./em3d11 -n 5000 -d 300 -i 2000 -m Initializing em3d random graph... Detailed creation timings: 0.03 0.11 0.00 0.43 0.00 Propagating field values for 2000 iteration(s)... EM3D build time 0.57 EM3D compute time 23.39 EM3D total time 23.96 Done! real 0m24.621s user 0m23.869s sys 0m0.144s Now the build time is a little higher (I don't know why), while the computing loops are faster (I don't know why), so when the number of iterations is 200 the total time is lower. In the detailed creation timings you can also see that makeFromNodes() is now very fast, while updateFromNodes() is quite slower (I don't know why. Here DMD is twice faster). This version also needs a little less RAM compared to D10, about 39 MB with -n 5000 -d 300. Java3 needs about 197 MB for the same graph. Now the D11 program is fast enough, and uses a low enough memory, that the limit of 65000 nodes given by the ushort indexes can be felt. Using an uint as TyIndex the memory used by D11 with -n 5000 -d 300 is 51 MB. Timings of D11 with TyIndex=uint: ...$ time ./em3d11 -n 5000 -d 300 -i 50 -m Initializing em3d random graph... Detailed creation timings: 0.09 0.11 0.00 0.48 0.00 Propagating field values for 50 iteration(s)... EM3D build time 0.68 EM3D compute time 0.68 EM3D total time 1.36 Done! real 0m1.525s user 0m1.220s sys 0m0.200s ...$ time ./em3d11 -n 5000 -d 300 -i 200 -m Initializing em3d random graph... Detailed creation timings: 0.06 0.12 0.00 0.50 0.00 Propagating field values for 200 iteration(s)... EM3D build time 0.68 EM3D compute time 2.71 EM3D total time 3.39 Done! real 0m3.597s user 0m3.252s sys 0m0.184s Using such larger indexes the running time is not much higher, but in this case using just pointers is probably faster. This is the asm of inner loop of eValuesStep() of D11 compiled with LDC, when TyIndex=uint: .LBB14_4: movl (%esi,%edi,4), %ebp movsd (%edx,%edi,8), %xmm1 mulsd (%ebx,%ebp,8), %xmm1 subsd %xmm1, %xmm0 incl %edi cmpl %ecx, %edi jl .LBB14_4 ------------------------------------ D11b (lateral branch) Update 1.2, Oct 27 2006: When the number of nodes is high, instead of storing ushort indexes, I can store ushort deltas. And because the indexes are ordered such deltas are never negative. If links are well spread, then such deltas are usually small, and they can be used to index far more than 2^16 nodes. If such links are randomly distriuited, you can probably manage millions of nodes in a safe enough way, and it's easy to add a runtime test when they are created to be sure they don't overflow. I have done a test, using the usual -n 5000 -d 300 arguments with the same seed=783, in such case the maximum delta is 240, so just one ubyte suffices! ubyte indexes reduce traffic through the cache, so they may speed up the program a little more. The D11b uses such TyDelta in Node2 but I have kept usign TyIndex in Node to keep the program simpler. Timings of D11b with TyIndex=ushort, TyDelta=ubyte: ...$ time ./em3d11b -n 5000 -d 300 -i 50 -m Initializing em3d random graph... Detailed creation timings: 0.04 0.10 0.00 0.42 0.02 Propagating field values for 50 iteration(s)... EM3D build time 0.59 EM3D compute time 0.56 EM3D total time 1.15 Done! real 0m1.266s user 0m1.032s sys 0m0.156s ...$ time ./em3d11b -n 5000 -d 300 -i 200 -m Initializing em3d random graph... Detailed creation timings: 0.04 0.11 0.00 0.42 0.02 Propagating field values for 200 iteration(s)... EM3D build time 0.59 EM3D compute time 2.20 EM3D total time 2.79 Done! real 0m2.968s user 0m2.676s sys 0m0.160s ...$ time ./em3d11b -n 5000 -d 300 -i 2000 -m Initializing em3d random graph... Detailed creation timings: 0.03 0.12 0.00 0.48 0.02 Propagating field values for 2000 iteration(s)... EM3D build time 0.65 EM3D compute time 21.98 EM3D total time 22.63 Done! real 0m23.252s user 0m22.477s sys 0m0.204s The iteration is a bit faster, but the build time of Nodes2 is a little slower (by about 0.02 seconds). So for 200 iterations the running time is about the same, while for 2000 you can see some difference. For TyDelta==ubyte this D11b program is not very useful (unless you have few nodes), but for larger graphs such delta encoding can be useful to keep using ushorts even when there are more than 2^16 nodes. Timings of D11b with TyIndex=uint, TyDelta=ushort: ...$ time ./em3d11b -n 5000 -d 300 -i 50 -m Initializing em3d random graph... Detailed creation timings: 0.04 0.11 0.00 0.44 0.02 Propagating field values for 50 iteration(s)... EM3D build time 0.61 EM3D compute time 0.56 EM3D total time 1.17 Done! real 0m1.309s user 0m1.032s sys 0m0.180s ...$ time ./em3d11b -n 5000 -d 300 -i 200 -m Initializing em3d random graph... Detailed creation timings: 0.05 0.10 0.00 0.43 0.02 Propagating field values for 200 iteration(s)... EM3D build time 0.60 EM3D compute time 2.20 EM3D total time 2.80 Done! real 0m2.997s user 0m2.684s sys 0m0.176s D11b with TyDelta=ushort is just a little slower than D11b with TyDelta=ubyte, so in most situations it can be enough for a number of nodes >> 2^16 if the arcs are spread in an uniform random way. This em3d program is just a benchmark, but in a real program you want to add class invariants, unittests, several safeties for the memory, and ways to deallocate memory when not used any more, etc. ------------------------------------ D12 (this ignores the D11b branch) Update 1.3, Oct 27 2006: The items of inbound and inCoeffs are accessed in parallel, so it can be useful for them to be close to each other. Ideally the items of inbound and inCoeffs are better kept in the same cache line, that's a block of 64 bytes (and I think it has to be aligned too). But I can't put them into a struct because it wastes a lot of space (such struct has to be 16 bytes long to keep the double it contains aligned to 8 bytes), and wasting space increases cache misses. If I want to pack pairs of double + ushort in a struct that inside an array is aligned to 8 bytes, I need to pack 4 doubles and 4 ushorts, this needs 8*4 + 2*4 = 32 + 8 = 40 bytes. But this straddles a cache line, and I think this is negative. So to pack as much as possible in a cache line I can use 6 pairs, that need 8*6 + 2*6 = 48 + 12 = 60 bytes, that needs 4 bytes of padding, this wastes 6.25% of space, I think it's acceptable. I can define it like this:
struct PairsPack {
static if (TyIndex.sizeof == 2)
const int NPAIRS = 6; // 8*6 + 2*6 + 4 = 48 + 12 + 4 = 64
else static if (TyIndex.sizeof == 4)
const int NPAIRS = 5; // 8*5 + 4*5 + 4 = 40 + 20 + 4 = 64
else
static assert(0);
double[NPAIRS] weights;
TyIndex[NPAIRS] inbounds;
uint _padding;
}
Just to be sure I'll allocate the array of PairsPack aligned to 64 bytes. But generally the number of inbound arcs in a node isn't a multiple of 6, so I need to waste some of those pairs, or use part of an already used PairsPack. Both solutions have advantages and disadvantages. I choose the option to not waste memory, this requires a little more complex indexing of the sub-part of a PairsPack, but allows me to allocate the array of PairsPack in a simpler way. This D12 was the hardest version to create, the code is quite hairy, and hard to maintain and understand. And it's slow (the Range!() that I have used is not necessary. LLVM was able to unroll that loop inside eValuesStep/hValuesStep by itself). Timings: ...$ time ./em3d12o -n 5000 -d 300 -i 50 -m Initializing em3d random graph... Detailed creation timings: 0.04 0.11 0.00 0.43 0.19 Propagating field values for 50 iteration(s)... EM3D build time 0.78 EM3D compute time 0.68 EM3D total time 1.46 Done! real 0m1.614s user 0m1.240s sys 0m0.284s ...$ time ./em3d12o -n 5000 -d 300 -i 200 -m Initializing em3d random graph... Detailed creation timings: 0.04 0.11 0.00 0.41 0.20 Propagating field values for 200 iteration(s)... EM3D build time 0.76 EM3D compute time 2.65 EM3D total time 3.41 Done! real 0m3.623s user 0m3.176s sys 0m0.308s The asm of inner loop of eValuesStep() of D11 compiled with LDC: .LBB14_4: movzwl 50(%eax), %ebp movsd 8(%eax), %xmm1 mulsd (%esi,%ebp,8), %xmm1 movzwl 48(%eax), %ebp movsd (%eax), %xmm2 mulsd (%esi,%ebp,8), %xmm2 subsd %xmm2, %xmm0 subsd %xmm1, %xmm0 movzwl 52(%eax), %ebp movsd 16(%eax), %xmm1 mulsd (%esi,%ebp,8), %xmm1 subsd %xmm1, %xmm0 movzwl 54(%eax), %ebp movsd 24(%eax), %xmm1 mulsd (%esi,%ebp,8), %xmm1 subsd %xmm1, %xmm0 movzwl 56(%eax), %ebp movsd 32(%eax), %xmm1 mulsd (%esi,%ebp,8), %xmm1 subsd %xmm1, %xmm0 movzwl 58(%eax), %ebp movsd 40(%eax), %xmm1 mulsd (%esi,%ebp,8), %xmm1 subsd %xmm1, %xmm0 addl $64, %eax decl %ebx jne .LBB14_4 ------------------------------------ Java3 Now I can think porting back some of the optimizations I have implemented in D to Java. Java don't uses structs, so several things are not possible. In this Java3 I have used the uniqueNeighborsSet set, implemented as the D code. Timings: ...$ time java -server Em3d3 -n 5000 -d 300 -i 50 -m Initializing em3d random graph... Propagating field values for 50 iteration(s)... EM3D build time 1.681 EM3D compute time 1.14 EM3D total time 2.821 Done! real 0m3.138s user 0m2.600s sys 0m0.416s ...$ time java -server Em3d3 -n 5000 -d 300 -i 200 -m Initializing em3d random graph... Propagating field values for 200 iteration(s)... EM3D build time 1.649 EM3D compute time 4.29 EM3D total time 5.943 Done! real 0m6.239s user 0m5.656s sys 0m0.452s More optimizations may be ported from D to Java. ------------------------------------ D01b (lateral branch) Looking at the asm of the inner loop of computeNewValue for D01 and D09, I think the D01 can be improved with an auxiliary variable:
void computeNewValue() {
auto aux = this.value;
for (int i = 0; i < fromCount; i++)
aux -= coeffs[i] * fromNodes[i].value;
this.value = aux;
}
Timings: ...$ time ./em3d01b -n 5000 -d 300 -i 50 -m Initializing em3d random graph... Propagating field values for 50 iteration(s)... EM3D build time 4.89 EM3D compute time 1.28 EM3D total time 6.17 Done! real 0m6.335s user 0m5.804s sys 0m0.464s ...$ time ./em3d01b -n 5000 -d 300 -i 200 -m Initializing em3d random graph... Propagating field values for 200 iteration(s)... EM3D build time 4.92 EM3D compute time 5.23 EM3D total time 10.15 Done! real 0m10.601s user 0m10.105s sys 0m0.440s LDC/LLVM must be able to perform that optimization (pulling this.value out of the loop) by itself. This is the asm of inner loop of computeNewValue() compiled with LDC: .LBB11_2: movl (%edx,%edi,4), %ebx movsd (%esi,%edi,8), %xmm1 mulsd 8(%ebx), %xmm1 subsd %xmm1, %xmm0 incl %edi cmpl %ecx, %edi jl .LBB11_2 This asm is very short, but it's quite slower than D09 anyway. I don't know what kind of optimizations are done by the JavaVM on the Java1 code (there is a way to look at the asm produced by the JVM, but it's not handy). Update 1.1, Oct 26 2006: now I think I know why LDC isn't able to perform that optimization in the computeNewValue() loop: the compiler doesn't know that "value" on the right is always distinct from "value" on the left, because this is a bipartite graph, where no self-loops are allowed. So it can't pull the "value" of the left out of the loop. To perform the optimization the compiled need more semantics. Future programming languages may allow the programmer to give such semantics to the compiler. I have done a similar optimization (pulling "value" out of the loop) in the Java3 version, with no change in performance, so probably the JavaVM is somehow able to perform that optimization. ------------------------------------ Python1 I have created a Python version too, that uses Psyco. It's similar to the Java3 version. I have seen that the faster unique_neighbors_set is using just a Python set. I have added __slots__ to both Random and Node, this doubles the performance. Timings: ...$ time python em3d.py -n 5000 -d 300 -i 50 -m Initializing em3d random graph... Detailed creation timings: 0.14 3.86 0.16 4.32 Propagating field values for 50 iteration(s)... EM3D build time 8.48 EM3D compute time 54.70 EM3D total time 63.19 Done! real 1m4.320s user 0m59.628s sys 0m3.876s I have used: Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) Psyco 1.6.0 final ------------------------------------ Python2 While improving the D code, and especially when I have created the Python1 version, I have felt the code as too much complex for the simple operations it performs. So I have reduced the Python code, removing useless parts. The Python2 version is almost the shortest Python code that produces the same output (there are ways to shorten it a little more, but I am not doing code golf here, my purposes are different). The Python2 code is 67 lines long, while the Java3 is 481 lines including the comments. ------------------------------------ Python3 Then I have replaced the random() and sample() with ones from the Python standard library, I have removed the printing code, and I have reorganized the code a little more. The result is short enough (21 nonempty lines) to be shown here too:
from random import random, sample
class Node:
def __init__(self, value):
self.value = value
self.in_arcs = []
n_nodes, in_degree, n_steps = 15, 5, 5
h_nodes = [Node(random()) for i in xrange(n_nodes)]
e_nodes = [Node(random()) for i in xrange(n_nodes)]
def make_in_arcs(nodes, other_nodes, in_degree):
for n1 in nodes:
for n2 in sample(other_nodes, in_degree):
n2.in_arcs.append((n1, random()))
make_in_arcs(h_nodes, e_nodes, in_degree)
make_in_arcs(e_nodes, h_nodes, in_degree)
def half_step(nodes):
for n1 in nodes:
for n2, weight in n1.in_arcs:
n1.value -= weight * n2.value
for i in xrange(n_steps):
half_step(e_nodes)
half_step(h_nodes)
This code shows the essential, and for me it's much simpler to follow and understand. This code may be conveted to C/D again, if necessary. The original C code was harder to understand, and much more complex. Removing the noise and useless parts allows me to understand the algorithm better, this is a simple algorithm. Even if you don't understand this code immediately, the complexity left is necessary, it's not spurious. Generally good programs aren't the clever and complex ones, but ones that look "obvious". This Python code shows that lot of the complexity of the original Java code was spurious. ------------------------------------ Python4 This is similar to the Python3 version, I have just encoded a node in a simpler way, using a single list. The first item is the "value" and all the following ones are pairs of node-weight. This code is just 17 nonempty lines long. For me Python3 version is more readable. ------------------------------------ | ||||||||
| comments: Leave a comment |