<?xml version='1.0' encoding='utf-8' ?>
<rss version='2.0' xmlns:lj='http://www.livejournal.org/rss/lj/1.0/' xmlns:media='http://search.yahoo.com/mrss/' xmlns:atom10='http://www.w3.org/2005/Atom'>
<channel>
  <title>leonardo</title>
  <link>http://leonardo-m.livejournal.com/</link>
  <description>leonardo - LiveJournal.com</description>
  <lastBuildDate>Fri, 11 Dec 2009 17:10:21 GMT</lastBuildDate>
  <generator>LiveJournal / LiveJournal.com</generator>
  <lj:journal>leonardo_m</lj:journal>
  <lj:journalid>10049645</lj:journalid>
  <lj:journaltype>personal</lj:journaltype>
  <atom10:link rel='hub' href='http://pubsubhubbub.appspot.com/' />
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/90943.html</guid>
  <pubDate>Fri, 11 Dec 2009 17:10:21 GMT</pubDate>
  <title>Benchmark of small memory allocations in D and C</title>
  <link>http://leonardo-m.livejournal.com/90943.html</link>
  <description>All the code of the tests:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/js/mem_bench.zip&quot;&gt;http://www.fantascienza.net/leonardo/js/mem_bench.zip&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;To compare the relative performance of various ways to allocate memory, I have timed several alternative versions of this program in C and D:&lt;br /&gt;&lt;pre&gt;
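// C99 code
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;

// Allocates n ints on the heap, fills them, sums them and frees them.
int compute(int n) {
    int *arr = malloc(n * sizeof(int));
    // malloc(0) may legally return NULL, so NULL is a failure only when n != 0.
    if (arr == NULL &amp;amp;&amp;amp; n != 0) exit(1);
    for (int i = 0; i &amp;lt; n; i++)
        arr[i] = i;
    int tot = 0;
    for (int i = 0; i &amp;lt; n; i++)
        tot += arr[i];
    free(arr);
    return tot;
}

int main(void) {
    int n = 2 * 1000 * 1000;
    // The grand total exceeds INT_MAX; unsigned arithmetic keeps the wraparound well defined.
    unsigned int tot = 0;
    for (int i = 1; i &amp;lt; n; i++)
        tot += compute(i % 500);
    printf(&quot;%u\n&quot;, tot);
    return 0;
}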
&lt;/pre&gt;&lt;br /&gt;In D, deleting the dynamic array manually improves the timings a lot with LDC, and only slightly with DMD.&lt;br /&gt;&lt;pre&gt;
Timings on VirtualBox, n=2_000_000, % 500, best of 6, seconds:
  C 1: 0.98  malloc + free
  C 2: 0.81  C99 dynamic stack-allocated array
  C 3:       alloca
  D 1: 2.83  dynamic array
  D 2: 2.79  scoped dynamic array
  D 3: 1.13  dynamic array + delete at the end
  D 4: 0.88  C heap malloc + free
  D 5: 0.74  alloca
  D 6: 1.13  GC heap malloc + free/realloc
  
C code compiled with:
gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4) 
gcc -Wall -s -std=c99 -O3

D code compiled with (with Tango):
ldc based on DMD v1.051 and llvm 2.6 (Fri Nov 27 12:54:12 2009)
ldc -O5 -release -inline memtest1_d.d

---------------------

Timings on Vista, n=2_000_000, % 500, best of 6, seconds:
  C 1: 1.57  malloc + free
  C 2: 0.69  C99 dynamic stack-allocated array
  C 3:       alloca
  D 1: 2.60  dynamic array
  D 2: 2.62  scoped dynamic array
  D 3: 2.50  dynamic array + delete at the end
  D 4: 1.24  C heap malloc + free
  D 5: 0.74  alloca
  D 6: 1.33  GC heap malloc + free/realloc
  
(VirtualBox not started, so it&apos;s a little faster)

C code compiled with:
gcc version 4.3.3-dw2-tdm-1 (GCC)
gcc -Wall -s -std=c99 -O3

D code compiled with (with Phobos):
Digital Mars D Compiler v1.042
dmd -O -release -inline memtest1_d.d

---------------------

Timings on Vista, n=2_000_000, % 500, best of 6, seconds:
  C 1: 1.47  malloc + free
  C 2: 0.63  C99 dynamic stack-allocated array
  C 3:       alloca
  D 1: 2.62  dynamic array
  D 2: 2.58  scoped dynamic array
  D 3: 2.31  dynamic array + delete at the end
  D 4: 1.27  C heap malloc + free
  D 5: 0.75  alloca
  D 6: 1.40  GC heap malloc + free/realloc
  
(VirtualBox not started, so it&apos;s a little faster)

C code compiled with:
gcc version 4.3.3-dw2-tdm-1 (GCC)
llvm-gcc -Wall -s -Wl,--enable-stdcall-fixup -std=c99 -O3

D code compiled with (with Phobos):
Digital Mars D Compiler v2.037
dmd -O -release -inline memtest1_d.d
&lt;/pre&gt;</description>
  <comments>http://leonardo-m.livejournal.com/90943.html</comments>
  <category>dmd</category>
  <category>programming</category>
  <category>gcc</category>
  <category>ldc</category>
  <category>benchmark</category>
  <category>c</category>
  <category>d language</category>
  <lj:security>public</lj:security>
  <lj:reply-count>2</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/90594.html</guid>
  <pubDate>Wed, 18 Nov 2009 18:44:29 GMT</pubDate>
  <title>Updates</title>
  <link>http://leonardo-m.livejournal.com/90594.html</link>
  <description>A lot of time has passed since the last update. I have spent much of it improving the Scimark2 benchmark, see below.&lt;br /&gt;&lt;br /&gt;On my junk software page I have added two new D benchmarks, dolden_tsp and dolden_bisort, which are ports of the well-known memory-intensive Olden benchmarks. I have included several alternative versions that show improvements. When I can, I plan to add two more Olden benchmarks.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;On the junk software page I have also added multibase_happy, various solutions in Python and D to a Google Code Jam 2009 problem. You can find more information inside the zip.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I have also made many improvements to the Scimark2 benchmark: I have added some faster non-OOP D versions to reach the performance of the Java version. See the zip for much more information. It turns out that Java HotSpot is able to partially unroll a loop whose number of iterations is known only at run time. LLVM is not currently able to perform this optimization, so the code was slower. I have unrolled the inner loop of the SOR and LU parts using a static foreach, and now the performance is good.</description>
  <comments>http://leonardo-m.livejournal.com/90594.html</comments>
  <category>programming</category>
  <category>d language</category>
  <category>java</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/90054.html</guid>
  <pubDate>Sun, 25 Oct 2009 23:38:20 GMT</pubDate>
  <title>Olden &quot;em3d&quot; benchmark in D language</title>
  <link>http://leonardo-m.livejournal.com/90054.html</link>
  <description>All the code shown in this article:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/js/dolden_em3d.zip&quot;&gt;http://www.fantascienza.net/leonardo/js/dolden_em3d.zip&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;This is a partial port of the Olden benchmarks to the D language. I&apos;ll add more Olden benchmarks in the future as time allows.&lt;br /&gt;&lt;br /&gt;Info on the Olden benchmarks and their Java translation (JOlden):&lt;br /&gt;&lt;a href=&quot;http://www-ali.cs.umass.edu/DaCapo/benchmarks.html&quot;&gt;http://www-ali.cs.umass.edu/DaCapo/benchmarks.html&lt;/a&gt;&lt;br /&gt;&lt;a href=&quot;http://www.sable.mcgill.ca/~bdufou1/ashes2/&quot;&gt;http://www.sable.mcgill.ca/~bdufou1/ashes2/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;General comments:&lt;br /&gt;&lt;br /&gt;One of my purposes is to test the efficiency of D1 code compiled with LDC. LDC is usually able to produce efficient binaries when I translate C code to C-like D code. So now I want to see how efficient D is when I start from higher-level code, like Java code. It&apos;s usually easy to translate such Java code to D, because D looks like a superset of C and Java, with some C++ mixed in.&lt;br /&gt;&lt;br /&gt;The Olden benchmarks are designed to stress memory allocation first of all. The D Garbage Collector (GC) is considerably less efficient than the Java HotSpot GC, so literal translations of the Olden benchmarks from Java to D usually perform worse or much worse (up to 4-16 times slower). So I have to do some work to regain performance. Usually D allows me to produce final programs that are 2-6 times faster than the original Java versions, but such optimization requires time and experience, and it is sometimes a little bug-prone.&lt;br /&gt;&lt;br /&gt;Besides having a more efficient GC, Java performs other optimizations not done by LDC, such as inlining of virtual methods. 
In the future the LLVM back-end of LDC will hopefully perform some of these optimizations (making some of my manual optimizations obsolete).&lt;br /&gt;&lt;br /&gt;For every JOlden benchmark I have packed the Java code into a single source file, which I have then translated to D and finally optimized for speed and memory usage. Sometimes I have also cleaned up some of the D code.&lt;br /&gt;&lt;br /&gt;Below I explain the various versions I&apos;ve created.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;See the Python2 code near the bottom to see the algorithm used by this &quot;em3d&quot; benchmark. This benchmark models the propagation of electromagnetic waves through objects in 3 dimensions. It is a simple computation on an irregular bipartite graph containing nodes representing electric and magnetic field values.&lt;br /&gt;&lt;br /&gt;Update 1.2, Oct 26 2009: &lt;br /&gt;&lt;br /&gt;One of the main purposes of this article is to show and teach how some manual optimizations are done.&lt;br /&gt;&lt;br /&gt;Timings of the various versions (i is the number of iterations):&lt;br /&gt;&lt;pre&gt;
          i=50   i=200  i=2000
Java1:    4.37    7.13
Java2:    3.62    6.87
D01:      6.55   11.40
D02:      4.98    9.36
D03:      3.53    8.01
D04:      3.32    6.44
D05:      2.94    5.95
D06:      2.14    5.23
D07:      1.99    4.72
D08:      1.82    4.04
D09:      1.82    4.03
D10:      1.20    3.06
D11:      1.27    2.96   24.62 (TyIndex=ushort)
D11:      1.52    3.60         (TyIndex=uint)
D11b:     1.27    2.97   23.25 (TyIndex=ushort, TyDelta=ubyte)
D11b:     1.31    3.00         (TyIndex=uint, TyDelta=ushort)
D12:      1.61    3.62
Java3:    3.14    6.24
D01b:     6.33   10.60
Python1: 64.3    ---
&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;Java1&lt;br /&gt;&lt;br /&gt;The original JOlden Java code packed in a single source file.&lt;br /&gt;&lt;br /&gt;I have fixed a bug that is present in the Java code of JOlden Em3d but absent from the original C code.&lt;br /&gt;At line 131 of Em3d1.java, the line:&lt;br /&gt;if (otherNode == toNodes[filled]) break;&lt;br /&gt;has to be changed to:&lt;br /&gt;if (otherNode == toNodes[k]) break;&lt;br /&gt;&lt;br /&gt;This bug probably went unnoticed because this benchmark suite has no unit tests that thoroughly exercise each class and each method.&lt;br /&gt;&lt;br /&gt;Timings (best of 6, seconds):&lt;br /&gt;&lt;br /&gt;...$ time java -server Em3d1 -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 2.725&lt;br /&gt;EM3D compute time 1.277&lt;br /&gt;EM3D total time 4.01&lt;br /&gt;Done!&lt;br /&gt;real 0m4.370s&lt;br /&gt;user 0m3.376s&lt;br /&gt;sys 0m0.380s&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;...$ time java -server Em3d1 -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 2.458&lt;br /&gt;EM3D compute time 4.35&lt;br /&gt;EM3D total time 6.812&lt;br /&gt;Done!&lt;br /&gt;real 0m7.128s&lt;br /&gt;user 0m6.568s&lt;br /&gt;sys 0m0.440s&lt;br /&gt;&lt;br /&gt;Used:&lt;br /&gt;java version &quot;1.6.0_16&quot;&lt;br /&gt;Java(TM) SE Runtime Environment (build 1.6.0_16-b01)&lt;br /&gt;Java HotSpot(TM) Client VM (build 14.2-b01, mixed mode, sharing)&lt;br /&gt;&lt;br /&gt;The code runs on Ubuntu inside VirtualBox, hosted on Windows Vista, on a 2.13 GHz Celeron CPU.&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;Java2&lt;br /&gt;&lt;br /&gt;It&apos;s like the Java1 version, but it uses a very simple portable pseudo-random generator. 
So when I translate this program to D I can see if it gives the same results.&lt;br /&gt;&lt;br /&gt;The Random.nextInt is a little different, it accepts a max value (min is 0).&lt;br /&gt;&lt;br /&gt;Timings:&lt;br /&gt;&lt;br /&gt;...$ time java -server Em3d2 -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 2.209&lt;br /&gt;EM3D compute time 1.109&lt;br /&gt;EM3D total time 3.318&lt;br /&gt;Done!&lt;br /&gt;real 0m3.622s&lt;br /&gt;user 0m3.100s&lt;br /&gt;sys 0m0.416s&lt;br /&gt;&lt;br /&gt;...$ time java -server Em3d2 -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 2.197&lt;br /&gt;EM3D compute time 4.354&lt;br /&gt;EM3D total time 6.551&lt;br /&gt;Done!&lt;br /&gt;real 0m6.875s&lt;br /&gt;user 0m6.292s&lt;br /&gt;sys 0m0.448s&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;D01&lt;br /&gt;&lt;br /&gt;This is the most direct translation of the Java2 code to D, with the portable pseudo-random generator. I have used printf for maximum portability across different D standard libraries. 
I have replaced the Java enumerations with D&apos;s opApply.&lt;br /&gt;As you can see, D1 allows programming in a style almost identical to Java.&lt;br /&gt;&lt;br /&gt;Code compiled with:&lt;br /&gt;...$ ldc -O5 -release -inline em3d01.d&lt;br /&gt;&lt;br /&gt;With:&lt;br /&gt;LDC compiler, based on DMD v1.045 and llvm 2.6 (Thu Sep 10 23:50:27 2009)&lt;br /&gt;&lt;br /&gt;Timings:&lt;br /&gt;&lt;br /&gt;...$ time ./em3d01 -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 4.96&lt;br /&gt;EM3D compute time 1.38&lt;br /&gt;EM3D total time 6.34&lt;br /&gt;Done!&lt;br /&gt;real 0m6.547s&lt;br /&gt;user 0m5.856s&lt;br /&gt;sys 0m0.576s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d01 -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 4.88&lt;br /&gt;EM3D compute time 5.97&lt;br /&gt;EM3D total time 10.86&lt;br /&gt;Done!&lt;br /&gt;real 0m11.398s&lt;br /&gt;user 0m10.801s&lt;br /&gt;sys 0m0.492s&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The first D version, despite being compiled with LDC, is slower than the Java version. As you can see, the build time is much longer, more than twice as long, mostly because of the D GC.&lt;br /&gt;&lt;br /&gt;To check that no bugs have been introduced, I have created a small Python script that acts like the &quot;diff&quot; command but ignores small differences in floating point values. 
The programs of this benchmark are able to print the results, so I can test them:&lt;br /&gt;&lt;br /&gt;...$ java Em3d2 -n 10 -d 4 -i 5 -m -p &amp;gt; outj2&lt;br /&gt;...$ ./em3d01 -n 10 -d 4 -i 5 -m -p &amp;gt; outd01&lt;br /&gt;&lt;br /&gt;With the script I can check that the results are the same (only the timing lines differ):&lt;br /&gt;&lt;br /&gt;...$ python approx_diff.py outj2 outd01&lt;br /&gt;3) line=26: [&apos;EM3D&apos;, &apos;build&apos;, &apos;time&apos;, &apos;0.0040&apos;] [&apos;EM3D&apos;, &apos;build&apos;, &apos;time&apos;, &apos;0.00&apos;] 0.004 0.0&lt;br /&gt;3) line=28: [&apos;EM3D&apos;, &apos;total&apos;, &apos;time&apos;, &apos;0.0040&apos;] [&apos;EM3D&apos;, &apos;total&apos;, &apos;time&apos;, &apos;0.00&apos;] 0.004 0.0&lt;br /&gt;&lt;br /&gt;This is the asm of the inner loop of computeNewValue() compiled with LDC:&lt;br /&gt;&lt;br /&gt;.LBB11_2:&lt;br /&gt; movl 32(%eax), %esi&lt;br /&gt; movl (%esi,%edx,4), %esi&lt;br /&gt; movl 40(%eax), %edi&lt;br /&gt; movsd (%edi,%edx,8), %xmm1&lt;br /&gt; mulsd 8(%esi), %xmm1&lt;br /&gt; subsd %xmm1, %xmm0&lt;br /&gt; movsd %xmm0, 8(%eax)&lt;br /&gt; incl %edx&lt;br /&gt; cmpl %ecx, %edx&lt;br /&gt; jl .LBB11_2&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;D02&lt;br /&gt;&lt;br /&gt;Olden benchmarks are quite memory-intensive, so the first way to optimize them is to reduce the number of memory allocations. Also, in D a good basic optimization for this kind of program is to convert classes that are instantiated many times into structs.&lt;br /&gt;&lt;br /&gt;A quick scan of the code shows that it contains a single instance of Em3d and BiGraph, and many instances of Node. So this version turns Node into a struct, allocated on the heap and managed by pointer. The code is almost the same. 
I have also reformatted the code to 4-space indentation.&lt;br /&gt;&lt;br /&gt;Timings:&lt;br /&gt;&lt;br /&gt;...$ time ./em3d02 -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 3.39&lt;br /&gt;EM3D compute time 1.40&lt;br /&gt;EM3D total time 4.79&lt;br /&gt;Done!&lt;br /&gt;real 0m4.982s&lt;br /&gt;user 0m4.416s&lt;br /&gt;sys 0m0.472s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d02 -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 3.35&lt;br /&gt;EM3D compute time 5.84&lt;br /&gt;EM3D total time 9.19&lt;br /&gt;Done!&lt;br /&gt;real 0m9.365s&lt;br /&gt;user 0m8.821s&lt;br /&gt;sys 0m0.460s&lt;br /&gt;&lt;br /&gt;The code is faster, but there&apos;s still a lot to do. The code is not good yet; there are several inefficiencies.&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;D03&lt;br /&gt;&lt;br /&gt;I have converted all the remaining classes (Random, BiGraph and Em3d) to structs, with minimal performance changes. 
So the problem is (as expected) elsewhere.&lt;br /&gt;I have reformatted the Java comments to produce more compact code that (despite no longer being suitable for ddoc) allows me to understand the code better.&lt;br /&gt;&lt;br /&gt;I have given some Node fields better names, because better names make the code easier to understand:&lt;br /&gt;toNodes =&amp;gt; outbound&lt;br /&gt;fromNodes =&amp;gt; inbound&lt;br /&gt;fromCount =&amp;gt; n_inbound&lt;br /&gt;&lt;br /&gt;Now main() is an independent function.&lt;br /&gt;&lt;br /&gt;Then I have disabled the GC just before the graph creation, and enabled it again after that.&lt;br /&gt;&lt;br /&gt;I have also added extra timing printouts inside the graph creation method, to study better where the running time goes.&lt;br /&gt;&lt;br /&gt;Then I have added an exit(0) at the end of main(). It kills the program quickly, saving the time needed to call destructors and to free the GC memory. In general this is not a safe thing to do, because destructors may be necessary, but in this program it is OK.&lt;br /&gt;&lt;br /&gt;Timings:&lt;br /&gt;&lt;br /&gt;...$ time ./em3d03 -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;0.14 0.94 0.48 0.43&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 1.99&lt;br /&gt;EM3D compute time 1.36&lt;br /&gt;EM3D total time 3.35&lt;br /&gt;Done!&lt;br /&gt;real 0m3.530s&lt;br /&gt;user 0m2.920s&lt;br /&gt;sys 0m0.524s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d03 -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;0.15 0.93 0.48 0.43&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 1.99&lt;br /&gt;EM3D compute time 5.80&lt;br /&gt;EM3D total time 7.79&lt;br /&gt;Done!&lt;br /&gt;real 0m8.012s&lt;br /&gt;user 0m7.348s&lt;br /&gt;sys 0m0.540s&lt;br /&gt;&lt;br /&gt;As expected the compute time is unchanged, but the build time has decreased significantly.&lt;br /&gt;Now 
the build time is lower than that of the Java2 code! So the first benchmark (with just 50 iterations) is faster than the Java2 version.&lt;br /&gt;&lt;br /&gt;This is the asm of the inner loop of computeNewValue() compiled with LDC:&lt;br /&gt;&lt;br /&gt;.LBB10_2:&lt;br /&gt; movl 24(%eax), %esi&lt;br /&gt; movl (%esi,%edx,4), %esi&lt;br /&gt; movl 32(%eax), %edi&lt;br /&gt; movsd (%edi,%edx,8), %xmm1&lt;br /&gt; mulsd (%esi), %xmm1&lt;br /&gt; subsd %xmm1, %xmm0&lt;br /&gt; movsd %xmm0, (%eax)&lt;br /&gt; incl %edx&lt;br /&gt; cmpl %ecx, %edx&lt;br /&gt; jl .LBB10_2&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;D04&lt;br /&gt;&lt;br /&gt;I have renamed a method:&lt;br /&gt;BiGraph.compute() =&amp;gt; graph.computeStep();&lt;br /&gt;&lt;br /&gt;And I have added a sanity check on the input numDegree:&lt;br /&gt;if (numDegree &amp;gt; numNodes) {...&lt;br /&gt;&lt;br /&gt;The program allocates a linked list of nodes, and then doesn&apos;t change that topology any more; it just iterates over them. Both allocating many single nodes and iterating over a linked list are slow operations on today&apos;s hardware (the original C code was old). So today it&apos;s much more efficient to allocate many or all Nodes in an array (this program contains a bipartite graph, so there are two arrays).&lt;br /&gt;&lt;br /&gt;So I can also remove the &quot;next&quot; pointer field of Node, which was used for the links of the list.&lt;br /&gt;&lt;br /&gt;In this D version the arcs are kept as pointers. 
Now the opApply isn&apos;t needed (this increases speed because I now loop on arrays, but loses some encapsulation).&lt;br /&gt;&lt;br /&gt;The fillTable() method becomes the createTable() because it allocates an array of the nodes.&lt;br /&gt;&lt;br /&gt;Timings:&lt;br /&gt;&lt;br /&gt;...$ time ./em3d04 -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;0.13 0.94 0.48 0.40&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 1.95&lt;br /&gt;EM3D compute time 1.18&lt;br /&gt;EM3D total time 3.13&lt;br /&gt;Done!&lt;br /&gt;real 0m3.326s&lt;br /&gt;user 0m2.696s&lt;br /&gt;sys 0m0.540s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d04 -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;0.13 0.92 0.46 0.40&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 1.91&lt;br /&gt;EM3D compute time 4.35&lt;br /&gt;EM3D total time 6.26&lt;br /&gt;Done!&lt;br /&gt;real 0m6.443s&lt;br /&gt;user 0m5.908s&lt;br /&gt;sys 0m0.432s&lt;br /&gt;&lt;br /&gt;The build time is not changed, I don&apos;t know why, probably disabling the GC allows for an efficient enough memory allocation. 
But now the compute time is decreased, because the program can scan the arrays more efficiently.&lt;br /&gt;&lt;br /&gt;If a program like this gets used in practice, the build time is not important, because the number of iterations is probably large, so the compute time has to be as small as possible.&lt;br /&gt;&lt;br /&gt;Now both timings are better than the Java2 timings.&lt;br /&gt;&lt;br /&gt;This is the asm of inner loop of computeNewValue() compiled with LDC:&lt;br /&gt;&lt;br /&gt;.LBB9_2:&lt;br /&gt; movl 12(%eax), %esi&lt;br /&gt; movl (%esi,%edx,4), %esi&lt;br /&gt; movl 20(%eax), %edi&lt;br /&gt; movsd (%edi,%edx,8), %xmm1&lt;br /&gt; mulsd (%esi), %xmm1&lt;br /&gt; subsd %xmm1, %xmm0&lt;br /&gt; movsd %xmm0, (%eax)&lt;br /&gt; incl %edx&lt;br /&gt; cmpl %ecx, %edx&lt;br /&gt; jb .LBB9_2&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;D05&lt;br /&gt;&lt;br /&gt;The CPU is able to use the pointers among Nodes in a quick way, but the number of nodes is probably limited. The program can increase its speed if there is less data traffic through the CPU cache. So I have replaced the pointers by 16 bit (ushort) indexes. This allows up to 2^16 nodes. 
If you need more, you can replace:&lt;br /&gt;alias ushort TyIndex;&lt;br /&gt;With:&lt;br /&gt;alias uint TyIndex;&lt;br /&gt;But this may produce code slower than D04.&lt;br /&gt;&lt;br /&gt;Timings:&lt;br /&gt;&lt;br /&gt;...$ time ./em3d05 -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;0.06 0.94 0.38 0.42&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 1.80&lt;br /&gt;EM3D compute time 1.00&lt;br /&gt;EM3D total time 2.80&lt;br /&gt;Done!&lt;br /&gt;real 0m2.937s&lt;br /&gt;user 0m2.540s&lt;br /&gt;sys 0m0.324s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d05 -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;0.07 0.95 0.37 0.39&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 1.78&lt;br /&gt;EM3D compute time 4.02&lt;br /&gt;EM3D total time 5.80&lt;br /&gt;Done!&lt;br /&gt;real 0m5.955s&lt;br /&gt;user 0m5.508s&lt;br /&gt;sys 0m0.360s&lt;br /&gt;&lt;br /&gt;This is the asm of inner loop of computeNewValue() compiled with LDC:&lt;br /&gt;&lt;br /&gt;.LBB9_2:&lt;br /&gt; movl 20(%eax), %edi&lt;br /&gt; movsd (%edi,%esi,8), %xmm1&lt;br /&gt; movl 12(%eax), %edi&lt;br /&gt; movzwl (%edi,%esi,2), %edi&lt;br /&gt; imull $40, %edi, %edi&lt;br /&gt; mulsd (%edx,%edi), %xmm1&lt;br /&gt; subsd %xmm1, %xmm0&lt;br /&gt; movsd %xmm0, (%eax)&lt;br /&gt; incl %esi&lt;br /&gt; cmpl %ecx, %esi&lt;br /&gt; jb .LBB9_2&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;D06&lt;br /&gt;&lt;br /&gt;Now I look for possible algorithmic improvements.&lt;br /&gt;makeUniqueNeighbors() contains stupid quadratic code to avoid duplicating Neighbors. 
I have tried to use a D associative array, and then a set implemented as a bit vector (using bt/bts intrinsics), but the most efficient solution seems to be a set of booleans implemented with just an array of ubytes.&lt;br /&gt;&lt;br /&gt;I have renamed:&lt;br /&gt;coeffs =&amp;gt; inCoeffs&lt;br /&gt;&lt;br /&gt;Timings:&lt;br /&gt;&lt;br /&gt;...$ time ./em3d06 -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;0.07 0.11 0.40 0.39&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 0.97&lt;br /&gt;EM3D compute time 1.03&lt;br /&gt;EM3D total time 2.00&lt;br /&gt;Done!&lt;br /&gt;real 0m2.145s&lt;br /&gt;user 0m1.748s&lt;br /&gt;sys 0m0.324s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d06 -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;0.07 0.11 0.37 0.40&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 0.95&lt;br /&gt;EM3D compute time 4.11&lt;br /&gt;EM3D total time 5.06&lt;br /&gt;Done!&lt;br /&gt;real 0m5.231s&lt;br /&gt;user 0m4.804s&lt;br /&gt;sys 0m0.352s&lt;br /&gt;&lt;br /&gt;The build time for 5000 nodes with 300 arcs is about half of what it was in D05 (0.97 s instead of 1.80 s).&lt;br /&gt;&lt;br /&gt;This is the asm of the inner loop of computeNewValue() compiled with LDC:&lt;br /&gt;&lt;br /&gt;.LBB9_2:&lt;br /&gt; movl 20(%eax), %edi&lt;br /&gt; movsd (%edi,%esi,8), %xmm1&lt;br /&gt; movl 12(%eax), %edi&lt;br /&gt; movzwl (%edi,%esi,2), %edi&lt;br /&gt; imull $40, %edi, %edi&lt;br /&gt; mulsd (%edx,%edi), %xmm1&lt;br /&gt; subsd %xmm1, %xmm0&lt;br /&gt; movsd %xmm0, (%eax)&lt;br /&gt; incl %esi&lt;br /&gt; cmpl %ecx, %esi&lt;br /&gt; jb .LBB9_2&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;D07&lt;br /&gt;&lt;br /&gt;Now the most common optimizations are done, and the code looks good enough. 
To improve the code further we have to think more about the CPU cache.&lt;br /&gt;&lt;br /&gt;The computations (of &quot;compute time&quot;) are done on an array of large Node structs. But only a few fields are actually necessary for this computation: value, inbound and inCoeffs (6 words in total, that is 24 bytes on this 32-bit system), so if I remove the useless data, the iterations over the array will (hopefully) be faster.&lt;br /&gt;&lt;br /&gt;I don&apos;t know how to remove items from the struct. The simplest way to solve this problem is to define a second Node struct, a Node2, with just the essential fields. Creating the two arrays of Node2 is fast, because all the inbound and inCoeffs arrays are just copied by reference.&lt;br /&gt;&lt;br /&gt;So I add the eNodes2 and hNodes2 dynamic arrays of Node2 to BiGraph (now encapsulation looks gone, but in practice I don&apos;t think the original design was good. The only data structure that has to know about a collection of Node2 is BiGraph, and not Node2 itself).&lt;br /&gt;&lt;br /&gt;There&apos;s some redundancy here too in a Node2: the inbound and inCoeffs arrays have the same length, so one of the two length fields is redundant and a Node2 could use just 5 words. There are a few ways to remove this field, but the first field (the value) is a double, and for maximum performance it must be aligned to 8 bytes. So I have done a few experiments, but they failed: the program was slower, with only a small reduction in memory usage. 
So I have kept all the 6 words.&lt;br /&gt;&lt;br /&gt;Timings:&lt;br /&gt;&lt;br /&gt;...$ time ./em3d07 -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.08 0.10 0.37 0.39 0.00&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 0.94&lt;br /&gt;EM3D compute time 0.92&lt;br /&gt;EM3D total time 1.86&lt;br /&gt;Done!&lt;br /&gt;real 0m1.993s&lt;br /&gt;user 0m1.600s&lt;br /&gt;sys 0m0.316s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d07 -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.07 0.10 0.39 0.40 0.00&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 0.96&lt;br /&gt;EM3D compute time 3.62&lt;br /&gt;EM3D total time 4.58&lt;br /&gt;Done!&lt;br /&gt;real 0m4.718s&lt;br /&gt;user 0m4.292s&lt;br /&gt;sys 0m0.360s&lt;br /&gt;&lt;br /&gt;The build time of the graph is the same, but the computing time is decreased.&lt;br /&gt;The time needed to build the arrays of Node2 is very small.&lt;br /&gt;&lt;br /&gt;This is the asm of inner loop of computeNewValue() compiled with LDC:&lt;br /&gt;&lt;br /&gt;.LBB9_2:&lt;br /&gt; movl 20(%eax), %edi&lt;br /&gt; movsd (%edi,%esi,8), %xmm1&lt;br /&gt; movl 12(%eax), %edi&lt;br /&gt; movzwl (%edi,%esi,2), %edi&lt;br /&gt; imull $24, %edi, %edi&lt;br /&gt; mulsd (%edx,%edi), %xmm1&lt;br /&gt; subsd %xmm1, %xmm0&lt;br /&gt; movsd %xmm0, (%eax)&lt;br /&gt; incl %esi&lt;br /&gt; cmpl %ecx, %esi&lt;br /&gt; jb .LBB9_2&lt;br /&gt;&lt;br /&gt;-----------------------------------&lt;br /&gt;&lt;br /&gt;D08&lt;br /&gt;&lt;br /&gt;We can improve the code more taking a look at how hValues are accessed, a sequential access is faster. But I&apos;ve seen that the inbound[i] are already sorted.&lt;br /&gt;&lt;br /&gt;So I have to take a better look at the very small loop that performs one of the two halves of a computation step:&lt;br /&gt;&lt;pre&gt;
void computeNewValue(Node2[] otherTable) {
  for (int i; i &amp;lt; inbound.length; i++)
    value -= inCoeffs[i] * otherTable[inbound[i]].value;
}
&lt;/pre&gt;&lt;br /&gt;There&apos;s not much that can be improved here. I have drawn on paper the simple data structures involved here, and with that I&apos;ve seen that inbound[i] performs (forward but) large jumps in an array of largish structs (6 words each) to find the &quot;value&quot; fields. This forces the CPU cache to a lot of traffic. If we reduce this traffic the iterations will get faster.&lt;br /&gt;&lt;br /&gt;The simple way to do this is to pull the values out of the nodes2 and put them in a uniform array, that&apos;s used in parallel to the Node2 array.&lt;br /&gt;&lt;pre&gt;
// Node2 of D07:
struct Node2 {
    double value;
    Node.TyIndex[] inbound;
    double[] inCoeffs;
}

// Node2 of D08:
struct Node2 {
    Node.TyIndex[] inbound;
    double[] inCoeffs;
    static double[] eValues, hValues;
}
&lt;/pre&gt;&lt;br /&gt;Now the jumps are performed on arrays (eValues and hValues) with just 2 words/item.&lt;br /&gt;&lt;br /&gt;To perform a better computation I have had to split computeNewValue() into eValuesStep() and hValuesStep(). I have pulled the outer loop like foreach(j, ref n; hNodes2) inside them, this may help the DMD compiler too, because it doesn&apos;t inline functions/methods that have a loop inside.&lt;br /&gt;&lt;br /&gt;I also have had to add a &quot;aux_val&quot; auxiliary variable inside the eValuesStep/hValuesStep functions, because it seems LDC/LLVM was not able to pull the access to Node2.hValues[j] out of the loop. I don&apos;t know why [see below, the explanation after D01b].&lt;br /&gt;&lt;br /&gt;Timings:&lt;br /&gt;&lt;br /&gt;...$ time ./em3d08 -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.07 0.10 0.37 0.40 0.00&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 0.94&lt;br /&gt;EM3D compute time 0.74&lt;br /&gt;EM3D total time 1.68&lt;br /&gt;Done!&lt;br /&gt;real 0m1.817s&lt;br /&gt;user 0m1.416s&lt;br /&gt;sys 0m0.332s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d08 -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.07 0.10 0.37 0.39 0.00&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 0.93&lt;br /&gt;EM3D compute time 2.96&lt;br /&gt;EM3D total time 3.89&lt;br /&gt;Done!&lt;br /&gt;real 0m4.038s&lt;br /&gt;user 0m3.628s&lt;br /&gt;sys 0m0.336s&lt;br /&gt;&lt;br /&gt;The compute time is decreased enough.&lt;br /&gt;&lt;br /&gt;This is the asm of inner loop of eValuesStep() compiled with LDC:&lt;br /&gt;&lt;br /&gt;.LBB9_4:&lt;br /&gt; movzwl (%esi,%edi,2), %ebp&lt;br /&gt; movsd (%edx,%edi,8), %xmm1&lt;br /&gt; mulsd (%ebx,%ebp,8), %xmm1&lt;br /&gt; subsd %xmm1, %xmm0&lt;br /&gt; incl %edi&lt;br /&gt; cmpl %ecx, %edi&lt;br /&gt; jb 
.LBB9_4&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;D09&lt;br /&gt;&lt;br /&gt;There&apos;s one last possible optimization of the memory usage that I can see. I can remove again the redundancy from Node2, because now it doesn&apos;t contain a double, and now probably I can use 3 32-bit words.&lt;br /&gt;&lt;br /&gt;The new Node2 is:&lt;br /&gt;&lt;pre&gt;
struct Node2 {
  Node.TyIndex* inbound;
  double* inCoeffs;
  int inbound_len;
  static double[] eValues, hValues;
}
&lt;/pre&gt;&lt;br /&gt;Timings:&lt;br /&gt;&lt;br /&gt;...$ time ./em3d09 -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.08 0.10 0.37 0.39 0.00&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 0.94&lt;br /&gt;EM3D compute time 0.74&lt;br /&gt;EM3D total time 1.68&lt;br /&gt;Done!&lt;br /&gt;real 0m1.821s&lt;br /&gt;user 0m1.444s&lt;br /&gt;sys 0m0.296s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d09 -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.07 0.11 0.37 0.39 0.00&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 0.94&lt;br /&gt;EM3D compute time 2.95&lt;br /&gt;EM3D total time 3.89&lt;br /&gt;Done!&lt;br /&gt;real 0m4.030s&lt;br /&gt;user 0m3.632s&lt;br /&gt;sys 0m0.332s&lt;br /&gt;&lt;br /&gt;The computing time is about the same. The memory saving compared to D08 is very small, something like 40 KB.&lt;br /&gt;&lt;br /&gt;It may be possible to vectorize the computing loop, but I stop here.&lt;br /&gt;&lt;br /&gt;Compared to the Java2 version the speed of this last D09 version is not much higher, because the computing loop is very tight and the JavaVM is able to optimize it very well, quite better than LDC, see the compute time of Java2 compared to the compute time of D01: 4.35 compared to 5.97. 
A shame for D/LDC/LLVM :-)&lt;br /&gt;&lt;br /&gt;Other Olden benchmarks show a bigger gap between the performance of the Java and optimized D code.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;This is the asm of the inner loop of eValuesStep() compiled with LDC:&lt;br /&gt;&lt;br /&gt; .align 16&lt;br /&gt;.LBB9_4:&lt;br /&gt; movzwl (%esi,%edi,2), %ebp&lt;br /&gt; movsd (%edx,%edi,8), %xmm1&lt;br /&gt; mulsd (%ebx,%ebp,8), %xmm1&lt;br /&gt; subsd %xmm1, %xmm0&lt;br /&gt; incl %edi&lt;br /&gt; cmpl %ecx, %edi&lt;br /&gt; jl .LBB9_4&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;D10&lt;br /&gt;&lt;br /&gt;Update 1.2, Oct 27 2006: &lt;br /&gt;&lt;br /&gt;There is another simple performance optimization that can be done. The memory for the Nodes and their arcs can be allocated from the C heap (and in some cases such memory can even be allocated with malloc instead of calloc, but in this program this makes no significant performance difference).&lt;br /&gt;&lt;br /&gt;The memory coming from the C heap is usually aligned to 4 bytes, so it&apos;s not fit for storing values and coefficients, which in this program are doubles and are used much more efficiently when aligned to 8 bytes. But in this program the actual computations performed on double FP numbers are done on arrays allocated in Node2, which come from the D GC heap, which returns memory aligned to 16 bytes. The only problem may come from Node.inCoeffs, which is copied as is to Node2.inCoeffs. 
So this allocation needs a little of extra care (see makeFromNodes()).&lt;br /&gt;&lt;br /&gt;I can even allocate all &quot;outbound&quot; arrays at once, because they are all of the same length, but this change will not change performance significantly, because that time is part of the first timing of the &quot;Detailed creation timings:&quot;, that&apos;s about 0.03-0.04 seconds for 5000 nodes.&lt;br /&gt;&lt;br /&gt;I have moved the disable() of the GC below, when the computing loops are done, because there&apos;s no point in keeping the GC active during those loops, that produce no garbage.&lt;br /&gt;&lt;br /&gt;Timings:&lt;br /&gt;&lt;br /&gt;...$ time ./em3d10 -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.04 0.10 0.14 0.19 0.00&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 0.47&lt;br /&gt;EM3D compute time 0.60&lt;br /&gt;EM3D total time 1.07&lt;br /&gt;Done!&lt;br /&gt;real 0m1.203s&lt;br /&gt;user 0m0.968s&lt;br /&gt;sys 0m0.148s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d10 -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.03 0.10 0.16 0.18 0.00&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 0.47&lt;br /&gt;EM3D compute time 2.46&lt;br /&gt;EM3D total time 2.93&lt;br /&gt;Done!&lt;br /&gt;real 0m3.061s&lt;br /&gt;user 0m2.844s&lt;br /&gt;sys 0m0.124s&lt;br /&gt;&lt;br /&gt;Not just (as expected) the build times are lower, but the compute ones too are lower, I don&apos;t know why, maybe the memory from the C heap has more coherence (more contiguous, reducing cache misses. 
The GC adds some padding, because it returns memory blocks aligned to 16 bytes).&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;D11&lt;br /&gt;&lt;br /&gt;Update 1.2, Oct 27 2006:&lt;br /&gt;&lt;br /&gt;To reduce the time used by makeFromNodes(), and hopefully to speed up the computation loops (with a better cache coherence), we can cut the inbound and inCoeffs arrays of Node (later copied into Node2) from a large memory block allocated all at once (again, doubles are better aligned to 8 bytes). To do this I have created two memory arenas, with the DoubleArena and IndexArena structs.&lt;br /&gt;&lt;br /&gt;Timings:&lt;br /&gt;&lt;br /&gt;...$ time ./em3d11 -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.04 0.10 0.00 0.42 0.00&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 0.56&lt;br /&gt;EM3D compute time 0.58&lt;br /&gt;EM3D total time 1.14&lt;br /&gt;Done!&lt;br /&gt;real 0m1.271s&lt;br /&gt;user 0m1.028s&lt;br /&gt;sys 0m0.164s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d11 -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.04 0.09 0.00 0.43 0.01&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 0.57&lt;br /&gt;EM3D compute time 2.23&lt;br /&gt;EM3D total time 2.80&lt;br /&gt;Done!&lt;br /&gt;real 0m2.961s&lt;br /&gt;user 0m2.652s&lt;br /&gt;sys 0m0.184s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d11 -n 5000 -d 300 -i 2000 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.03 0.11 0.00 0.43 0.00&lt;br /&gt;Propagating field values for 2000 iteration(s)...&lt;br /&gt;EM3D build time 0.57&lt;br /&gt;EM3D compute time 23.39&lt;br /&gt;EM3D total time 23.96&lt;br /&gt;Done!&lt;br /&gt;real 0m24.621s&lt;br /&gt;user 0m23.869s&lt;br /&gt;sys 0m0.144s&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Now the build time is a little 
higher (I don&apos;t know why), while the computing loops are faster (I don&apos;t know why), so when the number of iterations is 200 the total time is lower.&lt;br /&gt;&lt;br /&gt;In the detailed creation timings you can also see that makeFromNodes() is now very fast, while updateFromNodes() is quite slower (I don&apos;t know why. Here DMD is twice faster).&lt;br /&gt;&lt;br /&gt;This version also needs a little less RAM compared to D10, about 39 MB with -n 5000 -d 300. Java3 needs about 197 MB for the same graph.&lt;br /&gt;&lt;br /&gt;Now the D11 program is fast enough, and uses a low enough memory, that the limit of 65000 nodes given by the ushort indexes can be felt. Using an uint as TyIndex the memory used by D11 with -n 5000 -d 300 is 51 MB.&lt;br /&gt;&lt;br /&gt;Timings of D11 with TyIndex=uint:&lt;br /&gt;&lt;br /&gt;...$ time ./em3d11 -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.09 0.11 0.00 0.48 0.00&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 0.68&lt;br /&gt;EM3D compute time 0.68&lt;br /&gt;EM3D total time 1.36&lt;br /&gt;Done!&lt;br /&gt;real 0m1.525s&lt;br /&gt;user 0m1.220s&lt;br /&gt;sys 0m0.200s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d11 -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.06 0.12 0.00 0.50 0.00&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 0.68&lt;br /&gt;EM3D compute time 2.71&lt;br /&gt;EM3D total time 3.39&lt;br /&gt;Done!&lt;br /&gt;real 0m3.597s&lt;br /&gt;user 0m3.252s&lt;br /&gt;sys 0m0.184s&lt;br /&gt;&lt;br /&gt;Using such larger indexes the running time is not much higher, but in this case using just pointers is probably faster.&lt;br /&gt;&lt;br /&gt;This is the asm of inner loop of eValuesStep() of D11 compiled with LDC, when TyIndex=uint:&lt;br /&gt;&lt;br /&gt;.LBB14_4:&lt;br /&gt; movl (%esi,%edi,4), %ebp&lt;br 
/&gt; movsd (%edx,%edi,8), %xmm1&lt;br /&gt; mulsd (%ebx,%ebp,8), %xmm1&lt;br /&gt; subsd %xmm1, %xmm0&lt;br /&gt; incl %edi&lt;br /&gt; cmpl %ecx, %edi&lt;br /&gt; jl .LBB14_4&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;D11b&lt;br /&gt;(lateral branch)&lt;br /&gt;&lt;br /&gt;Update 1.2, Oct 27 2006:&lt;br /&gt;&lt;br /&gt;When the number of nodes is high, instead of storing ushort indexes, I can store ushort deltas. Because the indexes are ordered, such deltas are never negative. If links are well spread, then such deltas are usually small, and they can be used to index far more than 2^16 nodes. If such links are randomly distributed, you can probably manage millions of nodes in a safe enough way, and it&apos;s easy to add a runtime test when they are created to be sure they don&apos;t overflow.&lt;br /&gt;&lt;br /&gt;I have done a test, using the usual -n 5000 -d 300 arguments with the same seed=783; in that case the maximum delta is 240, so just one ubyte suffices! ubyte indexes reduce traffic through the cache, so they may speed up the program a little more. 
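&lt;br /&gt;&lt;br /&gt;A minimal Python sketch of this delta idea follows. It is not code from em3d11b: the names to_deltas, from_deltas and U16_MAX are made up for the illustration, and taking the first delta relative to index 0 is an assumption.&lt;br /&gt;&lt;pre&gt;
```python
# Hypothetical sketch of delta-encoded indexes (names are illustrative only).
U16_MAX = 2**16 - 1  # largest value that fits in a ushort


def to_deltas(sorted_indexes, max_delta=U16_MAX):
    """Replace each index with its distance from the previous one."""
    deltas = []
    previous = 0  # assumption: the first delta is relative to index 0
    for index in sorted_indexes:
        delta = index - previous
        # Runtime test done once, at creation time: the input must be
        # sorted (no negative gap) and each gap must fit in the delta type.
        if delta not in range(max_delta + 1):
            raise OverflowError("delta %d does not fit in the delta type" % delta)
        deltas.append(delta)
        previous = index
    return deltas


def from_deltas(deltas):
    """Rebuild the absolute indexes with a running sum."""
    indexes = []
    current = 0
    for delta in deltas:
        current += delta
        indexes.append(current)
    return indexes


indexes = [3, 40000, 70010, 135000]  # some indexes exceed 2**16
deltas = to_deltas(indexes)
print(deltas)
print(from_deltas(deltas) == indexes)
```
&lt;/pre&gt;&lt;br /&gt;In this sketch the overflow test runs only when the deltas are created, so the decoding loop does nothing more than a running sum.&lt;br /&gt;&lt;br /&gt;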
D11b uses such a TyDelta in Node2, but I have kept using TyIndex in Node to keep the program simpler.&lt;br /&gt;&lt;br /&gt;Timings of D11b with TyIndex=ushort, TyDelta=ubyte:&lt;br /&gt;&lt;br /&gt;...$ time ./em3d11b -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.04 0.10 0.00 0.42 0.02&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 0.59&lt;br /&gt;EM3D compute time 0.56&lt;br /&gt;EM3D total time 1.15&lt;br /&gt;Done!&lt;br /&gt;real 0m1.266s&lt;br /&gt;user 0m1.032s&lt;br /&gt;sys 0m0.156s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d11b -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.04 0.11 0.00 0.42 0.02&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 0.59&lt;br /&gt;EM3D compute time 2.20&lt;br /&gt;EM3D total time 2.79&lt;br /&gt;Done!&lt;br /&gt;real 0m2.968s&lt;br /&gt;user 0m2.676s&lt;br /&gt;sys 0m0.160s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d11b -n 5000 -d 300 -i 2000 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.03 0.12 0.00 0.48 0.02&lt;br /&gt;Propagating field values for 2000 iteration(s)...&lt;br /&gt;EM3D build time 0.65&lt;br /&gt;EM3D compute time 21.98&lt;br /&gt;EM3D total time 22.63&lt;br /&gt;Done!&lt;br /&gt;real 0m23.252s&lt;br /&gt;user 0m22.477s&lt;br /&gt;sys 0m0.204s&lt;br /&gt;&lt;br /&gt;The iteration is a bit faster, but the build time of Nodes2 is a little slower (by about 0.02 seconds). So for 200 iterations the running time is about the same, while for 2000 you can see some difference. 
For TyDelta==ubyte this D11b program is not very useful (unless you have few nodes), but for larger graphs such delta encoding can be useful to keep using ushorts even when there are more than 2^16 nodes.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Timings of D11b with TyIndex=uint, TyDelta=ushort:&lt;br /&gt;&lt;br /&gt;...$ time ./em3d11b -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.04 0.11 0.00 0.44 0.02&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 0.61&lt;br /&gt;EM3D compute time 0.56&lt;br /&gt;EM3D total time 1.17&lt;br /&gt;Done!&lt;br /&gt;real 0m1.309s&lt;br /&gt;user 0m1.032s&lt;br /&gt;sys 0m0.180s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d11b -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.05 0.10 0.00 0.43 0.02&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 0.60&lt;br /&gt;EM3D compute time 2.20&lt;br /&gt;EM3D total time 2.80&lt;br /&gt;Done!&lt;br /&gt;real 0m2.997s&lt;br /&gt;user 0m2.684s&lt;br /&gt;sys 0m0.176s&lt;br /&gt;&lt;br /&gt;D11b with TyDelta=ushort is just a little slower than D11b with TyDelta=ubyte, so in most situations it can be enough for a number of nodes &amp;gt;&amp;gt; 2^16 if the arcs are spread in a uniform random way.&lt;br /&gt;&lt;br /&gt;This em3d program is just a benchmark, but in a real program you want to add class invariants, unittests, several memory safety checks, ways to deallocate memory when it is no longer used, etc.&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;D12&lt;br /&gt;(this ignores the D11b branch)&lt;br /&gt;&lt;br /&gt;Update 1.3, Oct 27 2006:&lt;br /&gt;&lt;br /&gt;The items of inbound and inCoeffs are accessed in parallel, so it can be useful for them to be close to each other. 
Ideally the items of inbound and inCoeffs are better kept in the same cache line, that&apos;s a block of 64 bytes (and I think it has to be aligned too). But I can&apos;t put them into a struct because it wastes a lot of space (such struct has to be 16 bytes long to keep the double it contains aligned to 8 bytes), and wasting space increases cache misses.&lt;br /&gt;&lt;br /&gt;If I want to pack pairs of double + ushort in a struct that inside an array is aligned to 8 bytes, I need to pack 4 doubles and 4 ushorts, this needs 8*4 + 2*4 = 32 + 8 = 40 bytes. But this straddles a cache line, and I think this is negative. So to pack as much as possible in a cache line I can use 6 pairs, that need 8*6 + 2*6 = 48 + 12 = 60 bytes, that needs 4 bytes of padding, this wastes 6.25% of space, I think it&apos;s acceptable. I can define it like this:&lt;br /&gt;&lt;pre&gt;
struct PairsPack {
  static if (TyIndex.sizeof == 2)
    const int NPAIRS = 6; // 8*6 + 2*6 + 4 = 48 + 12 + 4 = 64
  else static if (TyIndex.sizeof == 4)
    const int NPAIRS = 5; // 8*5 + 4*5 + 4 = 40 + 20 + 4 = 64
  else
    static assert(0);

  double[NPAIRS] weights;
  TyIndex[NPAIRS] inbounds;
  uint _padding;
}
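
// A possible compile-time check, added here only as an illustration
// (it is an assumption that the em3d12 source does not already have one):
// with either NPAIRS value the struct is meant to fill exactly one
// 64-byte cache line, as the size comments above compute.
static assert(PairsPack.sizeof == 64);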
&lt;/pre&gt;&lt;br /&gt;Just to be sure I&apos;ll allocate the array of PairsPack aligned to 64 bytes.&lt;br /&gt;&lt;br /&gt;But generally the number of inbound arcs in a node isn&apos;t a multiple of 6, so I need to waste some of those pairs, or use part of an already used PairsPack. Both solutions have advantages and disadvantages. I chose the option that doesn&apos;t waste memory; this requires a slightly more complex indexing of the sub-part of a PairsPack, but allows me to allocate the array of PairsPack in a simpler way.&lt;br /&gt;&lt;br /&gt;This D12 was the hardest version to create; the code is quite hairy, and hard to maintain and understand. And it&apos;s slow (the Range!() that I have used is not necessary. LLVM was able to unroll that loop inside eValuesStep/hValuesStep by itself).&lt;br /&gt;&lt;br /&gt;Timings:&lt;br /&gt;&lt;br /&gt;...$ time ./em3d12o -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.04 0.11 0.00 0.43 0.19&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 0.78&lt;br /&gt;EM3D compute time 0.68&lt;br /&gt;EM3D total time 1.46&lt;br /&gt;Done!&lt;br /&gt;real 0m1.614s&lt;br /&gt;user 0m1.240s&lt;br /&gt;sys 0m0.284s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d12o -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.04 0.11 0.00 0.41 0.20&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 0.76&lt;br /&gt;EM3D compute time 2.65&lt;br /&gt;EM3D total time 3.41&lt;br /&gt;Done!&lt;br /&gt;real 0m3.623s&lt;br /&gt;user 0m3.176s&lt;br /&gt;sys 0m0.308s&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The asm of the inner loop of eValuesStep() of D12 compiled with LDC:&lt;br /&gt;&lt;br /&gt;.LBB14_4:&lt;br /&gt; movzwl 50(%eax), %ebp&lt;br /&gt; movsd 8(%eax), %xmm1&lt;br /&gt; mulsd (%esi,%ebp,8), %xmm1&lt;br /&gt; movzwl 48(%eax), %ebp&lt;br /&gt; movsd (%eax), %xmm2&lt;br /&gt; mulsd 
(%esi,%ebp,8), %xmm2&lt;br /&gt; subsd %xmm2, %xmm0&lt;br /&gt; subsd %xmm1, %xmm0&lt;br /&gt; movzwl 52(%eax), %ebp&lt;br /&gt; movsd 16(%eax), %xmm1&lt;br /&gt; mulsd (%esi,%ebp,8), %xmm1&lt;br /&gt; subsd %xmm1, %xmm0&lt;br /&gt; movzwl 54(%eax), %ebp&lt;br /&gt; movsd 24(%eax), %xmm1&lt;br /&gt; mulsd (%esi,%ebp,8), %xmm1&lt;br /&gt; subsd %xmm1, %xmm0&lt;br /&gt; movzwl 56(%eax), %ebp&lt;br /&gt; movsd 32(%eax), %xmm1&lt;br /&gt; mulsd (%esi,%ebp,8), %xmm1&lt;br /&gt; subsd %xmm1, %xmm0&lt;br /&gt; movzwl 58(%eax), %ebp&lt;br /&gt; movsd 40(%eax), %xmm1&lt;br /&gt; mulsd (%esi,%ebp,8), %xmm1&lt;br /&gt; subsd %xmm1, %xmm0&lt;br /&gt; addl $64, %eax&lt;br /&gt; decl %ebx&lt;br /&gt; jne .LBB14_4&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;Java3&lt;br /&gt;&lt;br /&gt;Now I can think about porting some of the optimizations I have implemented in D back to Java. Java doesn&apos;t have structs, so several things are not possible.&lt;br /&gt;&lt;br /&gt;In this Java3 I have used the uniqueNeighborsSet set, implemented as in the D code.&lt;br /&gt;&lt;br /&gt;Timings:&lt;br /&gt;&lt;br /&gt;...$ time java -server Em3d3 -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 1.681&lt;br /&gt;EM3D compute time 1.14&lt;br /&gt;EM3D total time 2.821&lt;br /&gt;Done!&lt;br /&gt;real 0m3.138s&lt;br /&gt;user 0m2.600s&lt;br /&gt;sys 0m0.416s&lt;br /&gt;&lt;br /&gt;...$ time java -server Em3d3 -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 1.649&lt;br /&gt;EM3D compute time 4.29&lt;br /&gt;EM3D total time 5.943&lt;br /&gt;Done!&lt;br /&gt;real 0m6.239s&lt;br /&gt;user 0m5.656s&lt;br /&gt;sys 0m0.452s&lt;br /&gt;&lt;br /&gt;More optimizations may be ported from D to Java.&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br 
/&gt;&lt;br /&gt;D01b&lt;br /&gt;(lateral branch)&lt;br /&gt;&lt;br /&gt;Looking at the asm of the inner loop of computeNewValue for D01 and D09, I think the D01 can be improved with an auxiliary variable:&lt;br /&gt;&lt;pre&gt;
void computeNewValue() {
  auto aux = this.value;
  for (int i = 0; i &amp;lt; fromCount; i++)
    aux -= coeffs[i] * fromNodes[i].value;
  this.value = aux;
}
&lt;/pre&gt;&lt;br /&gt;Timings:&lt;br /&gt;&lt;br /&gt;...$ time ./em3d01b -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 4.89&lt;br /&gt;EM3D compute time 1.28&lt;br /&gt;EM3D total time 6.17&lt;br /&gt;Done!&lt;br /&gt;real 0m6.335s&lt;br /&gt;user 0m5.804s&lt;br /&gt;sys 0m0.464s&lt;br /&gt;&lt;br /&gt;...$ time ./em3d01b -n 5000 -d 300 -i 200 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Propagating field values for 200 iteration(s)...&lt;br /&gt;EM3D build time 4.92&lt;br /&gt;EM3D compute time 5.23&lt;br /&gt;EM3D total time 10.15&lt;br /&gt;Done!&lt;br /&gt;real 0m10.601s&lt;br /&gt;user 0m10.105s&lt;br /&gt;sys 0m0.440s&lt;br /&gt;&lt;br /&gt;LDC/LLVM should be able to perform that optimization (pulling this.value out of the loop) by itself.&lt;br /&gt;&lt;br /&gt;This is the asm of the inner loop of computeNewValue() compiled with LDC:&lt;br /&gt;&lt;br /&gt;.LBB11_2:&lt;br /&gt; movl (%edx,%edi,4), %ebx&lt;br /&gt; movsd (%esi,%edi,8), %xmm1&lt;br /&gt; mulsd 8(%ebx), %xmm1&lt;br /&gt; subsd %xmm1, %xmm0&lt;br /&gt; incl %edi&lt;br /&gt; cmpl %ecx, %edi&lt;br /&gt; jl .LBB11_2&lt;br /&gt;&lt;br /&gt;This asm is very short, but it&apos;s quite a bit slower than D09 anyway. I don&apos;t know what kind of optimizations are done by the JavaVM on the Java1 code (there is a way to look at the asm produced by the JVM, but it&apos;s not handy).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Update 1.1, Oct 26 2006: now I think I know why LDC isn&apos;t able to perform that optimization in the computeNewValue() loop: the compiler doesn&apos;t know that &quot;value&quot; on the right is always distinct from &quot;value&quot; on the left, which is true only because this is a bipartite graph, where no self-loops are allowed. So it can&apos;t pull the &quot;value&quot; on the left out of the loop.&lt;br /&gt;&lt;br /&gt;To perform the optimization the compiler needs more semantics. 
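&lt;br /&gt;&lt;br /&gt;The aliasing problem can be shown with a small Python sketch. It is only an illustration, not code from the benchmark: the Node class, the two step functions and the make() helper are hypothetical. When a node appears in its own list of source nodes, updating in place and updating through an auxiliary variable give different results, so a compiler that cannot rule out that case has to keep the store inside the loop.&lt;br /&gt;&lt;pre&gt;
```python
# Hypothetical illustration of why hoisting "value" out of the loop is
# valid only when a node never reads its own value.
class Node:
    def __init__(self, value):
        self.value = value
        self.from_nodes = []
        self.coeffs = []


def step_in_place(node):
    # Stores node.value at every iteration, like the original loop.
    for coeff, other in zip(node.coeffs, node.from_nodes):
        node.value -= coeff * other.value


def step_hoisted(node):
    # Keeps the running value in a local, like the "aux" version.
    aux = node.value
    for coeff, other in zip(node.coeffs, node.from_nodes):
        aux -= coeff * other.value
    node.value = aux


def make(self_loop):
    # Node a reads from b, then either from b again or from itself.
    a, b = Node(1.0), Node(2.0)
    a.from_nodes = [b, a if self_loop else b]
    a.coeffs = [0.5, 0.25]
    return a


for self_loop in (False, True):
    n1, n2 = make(self_loop), make(self_loop)
    step_in_place(n1)
    step_hoisted(n2)
    print(self_loop, n1.value, n2.value)
```
&lt;/pre&gt;&lt;br /&gt;Without the self-loop both functions agree; with it they differ, because the in-place version reads a value it has already modified. The guarantee that no node lists itself is the kind of semantics the compiler lacks here.&lt;br /&gt;&lt;br /&gt;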
Future programming languages may allow the programmer to give such semantics to the compiler.&lt;br /&gt;&lt;br /&gt;I have done a similar optimization (pulling &quot;value&quot; out of the loop) in the Java3 version, with no change in performance, so probably the JavaVM is somehow able to perform that optimization.&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;Python1&lt;br /&gt;&lt;br /&gt;I have created a Python version too, that uses Psyco. It&apos;s similar to the Java3 version.&lt;br /&gt;I have seen that the fastest unique_neighbors_set just uses a Python set.&lt;br /&gt;&lt;br /&gt;I have added __slots__ to both Random and Node; this doubles the performance.&lt;br /&gt;&lt;br /&gt;Timings:&lt;br /&gt;&lt;br /&gt;...$ time python em3d.py -n 5000 -d 300 -i 50 -m&lt;br /&gt;Initializing em3d random graph...&lt;br /&gt;Detailed creation timings: 0.14 3.86 0.16 4.32&lt;br /&gt;Propagating field values for 50 iteration(s)...&lt;br /&gt;EM3D build time 8.48&lt;br /&gt;EM3D compute time 54.70&lt;br /&gt;EM3D total time 63.19&lt;br /&gt;Done!&lt;br /&gt;real 1m4.320s&lt;br /&gt;user 0m59.628s&lt;br /&gt;sys 0m3.876s&lt;br /&gt;&lt;br /&gt;I have used:&lt;br /&gt;Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)&lt;br /&gt;Psyco 1.6.0 final&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;Python2&lt;br /&gt;&lt;br /&gt;While improving the D code, and especially when I created the Python1 version, I felt that the code was too complex for the simple operations it performs. So I have reduced the Python code, removing useless parts. 
The Python2 version is almost the shortest Python code that produces the same output (there are ways to shorten it a little more, but I am not doing code golf here, my purposes are different).&lt;br /&gt;&lt;br /&gt;The Python2 code is 67 lines long, while the Java3 is 481 lines including the comments.&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;Python3&lt;br /&gt;&lt;br /&gt;Then I have replaced the random() and sample() with ones from the Python standard library, I have removed the printing code, and I have reorganized the code a little more.&lt;br /&gt;&lt;br /&gt;The result is short enough (21 nonempty lines) to be shown here too:&lt;br /&gt;&lt;pre&gt;
from random import random, sample

class Node:
    def __init__(self, value):
        self.value = value
        self.in_arcs = []

n_nodes, in_degree, n_steps = 15, 5, 5

h_nodes = [Node(random()) for i in xrange(n_nodes)]
e_nodes = [Node(random()) for i in xrange(n_nodes)]

def make_in_arcs(nodes, other_nodes, in_degree):
    for n1 in nodes:
        for n2 in sample(other_nodes, in_degree):
            n2.in_arcs.append((n1, random()))

make_in_arcs(h_nodes, e_nodes, in_degree)
make_in_arcs(e_nodes, h_nodes, in_degree)

def half_step(nodes):
    for n1 in nodes:
        for n2, weight in n1.in_arcs:
            n1.value -= weight * n2.value

for i in xrange(n_steps):
    half_step(e_nodes)
    half_step(h_nodes)
&lt;/pre&gt;&lt;br /&gt;This code shows the essentials, and for me it&apos;s much simpler to follow and understand. This code may be converted to C/D again, if necessary. The original C code was harder to understand, and much more complex. Removing the noise and useless parts allows me to understand the algorithm better; this is a simple algorithm. Even if you don&apos;t understand this code immediately, the complexity left is necessary, it&apos;s not spurious. Generally good programs aren&apos;t the clever and complex ones, but ones that look &quot;obvious&quot;. This Python code shows that a lot of the complexity of the original Java code was spurious.&lt;br /&gt;&lt;br /&gt;------------------------------------&lt;br /&gt;&lt;br /&gt;Python4&lt;br /&gt;&lt;br /&gt;This is similar to the Python3 version; I have just encoded a node in a simpler way, using a single list. The first item is the &quot;value&quot; and all the following ones are node-weight pairs. This code is just 17 nonempty lines long. For me the Python3 version is more readable.&lt;br /&gt;&lt;br /&gt;------------------------------------</description>
  <comments>http://leonardo-m.livejournal.com/90054.html</comments>
  <category>psyco</category>
  <category>programming</category>
  <category>benchmark</category>
  <category>d language</category>
  <category>python</category>
  <category>ldc</category>
  <category>c</category>
  <category>java</category>
  <category>asm</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/89105.html</guid>
  <pubDate>Mon, 19 Oct 2009 10:10:43 GMT</pubDate>
  <title>Slow allocation of D objects</title>
  <link>http://leonardo-m.livejournal.com/89105.html</link>
  <description>Allocating objects in the D language, using the good and efficient LDC compiler, may seem slower than doing the same thing in C++, but the situation is a little more complex, so a few examples can show what&apos;s going on.&lt;br /&gt;&lt;br /&gt;This is a synthetic C++ benchmark: it allocates an array of n pointers to Foo objects, and then allocates the instances, each of which contains ten 32 bit integers:&lt;br /&gt;&lt;pre&gt;// C++ version 1, 0.46 s
#include &quot;stdio.h&quot;
#include &quot;stdlib.h&quot;

using namespace std;

class Foo {
    public:
        int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
        Foo(int, int, int, int, int, int, int, int, int, int);
};

Foo::Foo(int x1, int x2, int x3, int x4, int x5, int x6, int x7, int x8, int x9, int x10):
  y1(x1), y2(x2), y3(x3), y4(x4), y5(x5), y6(x6), y7(x7), y8(x8), y9(x9), y10(x10) {}

int main(int argc, char *argv[]) {
    int n = (argc == 2) ? atoi(argv[1]) : 5;
    Foo* foos[n];

    for (int i = 0; i &amp;lt; n; i++)
        foos[i] = new Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);

    printf(&quot;%d %d %d\n&quot;, foos[n-1]-&amp;gt;y1, foos[n-1]-&amp;gt;y2, foos[n-1]-&amp;gt;y10);
    return 0;
}&lt;/pre&gt;&lt;br /&gt;This is almost equivalent D code. I use printf, and the imports are so hairy, because this code is designed to work with both the Phobos and Tango standard libraries, and with both the D1 and D2 languages:&lt;br /&gt;&lt;pre&gt;// D version 1, 0.89 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
} else {
    import std.c.stdio: printf;
    version (D_Version2) {
        import std.conv: to;
        alias to!(int, char[]) toInt;
    } else
        import std.conv: toInt;
}

class Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;

    this(int x1, int x2, int x3, int x4, int x5, int x6, int x7, int x8, int x9, int x10) {
        y1 = x1;
        y2 = x2;
        y3 = x3;
        y4 = x4;
        y5 = x5;
        y6 = x6;
        y7 = x7;
        y8 = x8;
        y9 = x9;
        y10 = x10;
    }
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo[] foos = new Foo[n];

    for (int i = 0; i &amp;lt; n; i++)
        foos[i] = new Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);

    printf(&quot;%d %d %d\n&quot;, foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}&lt;/pre&gt;&lt;br /&gt;Replacing the class in the C++ code with this one doesn&apos;t change the running time of the program:&lt;br /&gt;&lt;pre&gt;// piece of the C++ version 2, 0.47 s
class Foo {
    public:
        int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
        Foo(int, int, int, int, int, int, int, int, int, int);
};

Foo::Foo(int x1, int x2, int x3, int x4, int x5, int x6, int x7, int x8, int x9, int x10) {
    y1 = x1;
    y2 = x2;
    y3 = x3;
    y4 = x4;
    y5 = x5;
    y6 = x6;
    y7 = x7;
    y8 = x8;
    y9 = x9;
    y10 = x10;
}&lt;/pre&gt;&lt;br /&gt;The running times on a Ubuntu running on VirtualBox, running on Vista 32 bit, running on a Celeron CPU, using LDC compiler based on DMD v1.045 and llvm 2.6 (Thu Sep 10 23:50:27 2009):&lt;br /&gt;&lt;pre&gt;Timings, n = 1_000_000, best of 3, seconds:
  C++ 1: 0.46  |#########                |
  C++ 2: 0.47  |#########                |
  D 1:   0.89  |##################       |
  D 2:   0.82  |################         |
  D 3:   0.87  |#################        |
  D 4:   0.46  |#########                |
  D 5:   0.46  |#########                |
  D 6:   1.22  |#########################|
  D 6b:  0.86  |#################        |
  D 7:   0.63  |############             |
  D 8:   0.71  |##############           |&lt;/pre&gt;&lt;br /&gt;I have compiled the C++ code with gcc V.4.3.3 with:&lt;br /&gt;g++ -Wall -O3 -s -fomit-frame-pointer -msse3 -march=native objbench1_cpp.cpp -o objbench1_cpp&lt;br /&gt;And the D code with:&lt;br /&gt;ldc -O5 -release -inline objbench1_d.d&lt;br /&gt;&lt;br /&gt;As you can see the D version is about twice as slow as the C++ code, even though the LDC compiler is usually almost as efficient as GCC (LLVM is not able to perform some optimizations yet, like auto-vectorization, but things are already good and they will improve).&lt;br /&gt;&lt;br /&gt;D classes are different from C++ ones: D objects always contain a pointer to the virtual table (even if no virtual methods are present, it&apos;s used by the reflection and GC too) and a monitor, so on 32 bit systems they need 8 extra bytes. So we may try to remove such memory (and bookkeeping) overhead using a struct (this version can be changed a little, so it works with Phobos of D2 too):&lt;br /&gt;&lt;pre&gt;// D version 2, 0.82 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    
    Foo*[] foos = new Foo*[n];

    for (int i = 0; i &amp;lt; n; i++) {
        foos[i] = new Foo;
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }
    
    printf(&quot;%d %d %d\n&quot;, foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}&lt;/pre&gt;&lt;br /&gt;But you can see the performance improves only a little, so the problem is elsewhere. I have had to use that two-line initialization because only in D2 language structs are allowed to have an explicit constructor.&lt;br /&gt;&lt;br /&gt;We can try allocating the pointer array on the stack but the performance gets a little worse:&lt;br /&gt;&lt;pre&gt;// D version 3, 0.87 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.stdc.stdlib: alloca;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.c.stdlib: alloca;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo** ptr = cast(Foo**)alloca((Foo*).sizeof * n);
    Foo*[] foos = ptr[0 .. n];
    
    for (int i = 0; i &amp;lt; n; i++) {
        foos[i] = new Foo;
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }

    printf(&quot;%d %d %d\n&quot;, foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}&lt;/pre&gt;&lt;br /&gt;The next step is to allocate the structs from the C heap. This time the performance is about the same as that of the C++ code, showing that the cause of the slowdown is the (not yet efficient) D garbage collector:&lt;br /&gt;&lt;pre&gt;// D version 4, 0.46 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.stdc.stdlib: alloca, malloc;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.c.stdlib: alloca, malloc;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo** ptr = cast(Foo**)alloca((Foo*).sizeof * n);
    Foo*[] foos = ptr[0 .. n];
    
    for (int i = 0; i &amp;lt; n; i++) {
        foos[i] = cast(Foo*)malloc(Foo.sizeof);
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }

    printf(&quot;%d %d %d\n&quot;, foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}&lt;/pre&gt;&lt;br /&gt;So we can restore the original GC allocation of the dynamic array of pointers; the performance improves a tiny bit (by less than 0.01 s):&lt;br /&gt;&lt;pre&gt;// D version 5, 0.46 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.stdc.stdlib: malloc;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.c.stdlib: malloc;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo*[] foos = new Foo*[n];
    
    for (int i = 0; i &amp;lt; n; i++) {
        foos[i] = cast(Foo*)malloc(Foo.sizeof);
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }

    printf(&quot;%d %d %d\n&quot;, foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}&lt;/pre&gt;&lt;br /&gt;We may think that the problem is the double initialization of the structs, but that doesn&apos;t seem to be the case: the following code, which uses uninitialized memory from the GC heap, is the slowest of all:&lt;br /&gt;&lt;pre&gt;// D version 6, 1.22 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.core.Memory: GC;
    alias GC.malloc gcmalloc;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.gc: gcmalloc = malloc;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo*[] foos = new Foo*[n];
    
    for (int i = 0; i &amp;lt; n; i++) {
        foos[i] = cast(Foo*)gcmalloc(Foo.sizeof);
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }

    printf(&quot;%d %d %d\n&quot;, foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}&lt;/pre&gt;&lt;br /&gt;I don&apos;t fully know why the performance of version 6 is so low (if you have ideas, please tell me). Part of the slowdown comes from the GC scanning the memory of those structs. You can see this in version 6b, where I have disabled the scanning of those memory blocks and there is indeed a significant performance improvement:&lt;br /&gt;&lt;pre&gt;// D version 6b, 0.86 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.core.Memory: GC;
    alias GC.malloc gcmalloc;
    void hasNoPointers(void* p) {
        GC.setAttr(p, GC.BlkAttr.NO_SCAN);
    }
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.gc: gcmalloc = malloc, hasNoPointers;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo*[] foos = new Foo*[n];

    for (int i = 0; i &amp;lt; n; i++) {
        foos[i] = cast(Foo*)gcmalloc(Foo.sizeof);
        hasNoPointers(foos[i]);
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);
    }

    printf(&quot;%d %d %d\n&quot;, foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}&lt;/pre&gt;&lt;br /&gt;But even if the GC doesn&apos;t scan those memory blocks, it still performs several operations, so when you allocate many pieces of memory and you need performance, you may want to disable the GC and enable it again just after the allocation (here I have not added the enable() call, but in your code you are supposed to add it). The resulting performance is intermediate:&lt;br /&gt;&lt;pre&gt;// D version 7, 0.63 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.core.Memory: GC;
    alias GC.disable disable;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.gc: disable;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    disable();
    
    Foo*[] foos = new Foo*[n];

    for (int i = 0; i &amp;lt; n; i++) {
        foos[i] = new Foo;
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }
    
    printf(&quot;%d %d %d\n&quot;, foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}&lt;/pre&gt;&lt;br /&gt;We can also go back to the object-based version (still with the GC disabled), with a further small loss of performance:&lt;br /&gt;&lt;pre&gt;// D version 8, 0.71 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.core.Memory: GC;
    alias GC.disable disable;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.gc: disable;
}

class Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;

    this(int x1, int x2, int x3, int x4, int x5, int x6, int x7, int x8, int x9, int x10) {
        y1 = x1;
        y2 = x2;
        y3 = x3;
        y4 = x4;
        y5 = x5;
        y6 = x6;
        y7 = x7;
        y8 = x8;
        y9 = x9;
        y10 = x10;
    }
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    disable();
    Foo[] foos = new Foo[n];

    for (int i = 0; i &amp;lt; n; i++)
        foos[i] = new Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);

    printf(&quot;%d %d %d\n&quot;, foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}&lt;/pre&gt;&lt;br /&gt;Having a garbage collector is handy, and sometimes safer too, but it has a price (also in the size of the binary). D allows you to avoid the GC when you need C++-like performance, but then you have to remember to manage and deallocate memory manually as in C, or you need to implement or use other forms of memory management as in C++ (D also offers the scope(exit) idiom to deallocate memory when a scope ends). If the D language becomes widespread, its GC will improve, recovering part of the lost performance.</description>
  <comments>http://leonardo-m.livejournal.com/89105.html</comments>
  <category>programming</category>
  <category>benchmark</category>
  <category>d language</category>
  <category>c++</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/88868.html</guid>
  <pubDate>Tue, 29 Sep 2009 19:28:12 GMT</pubDate>
  <title>Tree visits in D</title>
  <link>http://leonardo-m.livejournal.com/88868.html</link>
  <description>On the RosettaCode site there&apos;s this programming task:&lt;br /&gt;&lt;a href=&quot;http://rosettacode.org/wiki/Tree_traversal&quot;&gt;http://rosettacode.org/wiki/Tree_traversal&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Problem: Implement a binary tree where each node carries an integer, and implement preorder, inorder, postorder and level-order traversal. Use those traversals to output the following tree:&lt;br /&gt;&lt;pre&gt;         1
        / \
       /   \
      /     \
     2       3
    / \     /
   4   5   6
  /       / \
 7       8   9

The correct output should look like this:

preorder:    1 2 4 7 5 3 6 8 9
inorder:     7 4 2 5 1 8 6 9 3
postorder:   7 4 5 2 8 9 6 3 1
level-order: 1 2 3 4 5 6 7 8 9
&lt;/pre&gt;The following is my D implementation. I used the D version 2 language (I compiled the code with DMD v2.032 and the Phobos standard library, but the code uses Phobos only for printing, so adapting it to Tango is easy); a very similar version can be written for D1.&lt;br /&gt;&lt;br /&gt;Disclaimer: the following isn&apos;t an example of common (or good) D code. It&apos;s very generic and a bit tricky. In practice you usually write simpler code, both because you don&apos;t need such genericity and because simpler code is easier to write, understand, and debug. You need code this generic only in library-like routines, which are usually limited in number and size. So there are surely ways to write simpler/shorter D code, but I show this version here because it&apos;s useful for explaining some features of the D language, and because it&apos;s more fun.&lt;br /&gt;&lt;br /&gt;&lt;div class=&quot;d&quot; style=&quot;border: 1px solid rgb(208, 208, 208); font-family: monospace; color: rgb(0, 0, 102); background-color: rgb(240, 240, 240);&quot;&gt;&lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;import&lt;/span&gt; std.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;stdio&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;:&lt;/span&gt; write&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; writeln&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;class&lt;/span&gt; Node&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;T&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;{&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; T data&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; Node left&lt;span 
style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; right&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;this&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;T data&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; Node left&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;null&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; Node right&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;null&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;{&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;this&lt;/span&gt;.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;data&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt; data&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;this&lt;/span&gt;.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;left&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt; left&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;this&lt;/span&gt;.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;right&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt; right&lt;span 
style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;}&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;}&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(128, 128, 128); font-style: italic;&quot;&gt;// static templated opCall can&apos;t be used in Node&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;auto&lt;/span&gt; node&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;T&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;T data&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; Node&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;!&lt;/span&gt;T left&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;null&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; Node&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;!&lt;/span&gt;T right&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;null&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;{&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(177, 177, 0);&quot;&gt;return&lt;/span&gt; &lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;new&lt;/span&gt; Node&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;!&lt;/span&gt;T&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;data&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; left&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; 
right&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;}&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;void&lt;/span&gt; show&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;T&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;T x&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;{&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; write&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;x&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; &lt;span style=&quot;color: rgb(255, 0, 0);&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;}&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;enum&lt;/span&gt; Visit &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;{&lt;/span&gt; pre&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; inv&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; post &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;}&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(128, 128, 128); font-style: italic;&quot;&gt;// visitor can be any kind of callable or it uses a default visitor.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(128, 128, 128); font-style: italic;&quot;&gt;// TNode can be any kind of Node, with data, left and right fields,&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(128, 128, 128); 
font-style: italic;&quot;&gt;// so this is more generic than a member function of Node.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;void&lt;/span&gt; backtrackingOrder&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;Visit v&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; TNode&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; TyF&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;void&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;TNode node&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; TyF visitor&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;null&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;{&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;static&lt;/span&gt; &lt;span style=&quot;color: rgb(177, 177, 0);&quot;&gt;if&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;is&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;TyF &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;==&lt;/span&gt; &lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;void&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt; &lt;span style=&quot;color: rgb(153, 51, 
51);&quot;&gt;auto&lt;/span&gt; truevisitor &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;&amp;amp;&lt;/span&gt;show&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;!&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;typeof&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;node.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;data&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(177, 177, 0);&quot;&gt;else&lt;/span&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;auto&lt;/span&gt; truevisitor &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt; visitor&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(177, 177, 0);&quot;&gt;if&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;node &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;!&lt;/span&gt;&lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;is&lt;/span&gt; &lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;null&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;{&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;static&lt;/span&gt; &lt;span style=&quot;color: rgb(177, 
177, 0);&quot;&gt;if&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;v &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;==&lt;/span&gt; Visit.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;pre&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt; &amp;nbsp; truevisitor&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;node.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;data&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; backtrackingOrder&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;!&lt;/span&gt;v&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;node.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;left&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; visitor&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;static&lt;/span&gt; &lt;span style=&quot;color: rgb(177, 177, 0);&quot;&gt;if&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;v &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;==&lt;/span&gt; Visit.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;inv&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&amp;nbsp; truevisitor&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;node.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;data&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; 
&amp;nbsp; backtrackingOrder&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;!&lt;/span&gt;v&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;node.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;right&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; visitor&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;static&lt;/span&gt; &lt;span style=&quot;color: rgb(177, 177, 0);&quot;&gt;if&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;v &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;==&lt;/span&gt; Visit.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;post&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt; truevisitor&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;node.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;data&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;}&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;}&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;void&lt;/span&gt; levelOrder&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;TNode&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; TyF&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;void&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 
102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;TNode node&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; TyF visitor&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;null&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; TNode&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;[&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;]&lt;/span&gt; more&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;[&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;]&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;{&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;static&lt;/span&gt; &lt;span style=&quot;color: rgb(177, 177, 0);&quot;&gt;if&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;is&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;TyF &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;==&lt;/span&gt; &lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;void&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt; &lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;auto&lt;/span&gt; truevisitor &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;&amp;amp;&lt;/span&gt;show&lt;span style=&quot;color: rgb(102, 204, 
102);&quot;&gt;!&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;typeof&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;node.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;data&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(177, 177, 0);&quot;&gt;else&lt;/span&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;auto&lt;/span&gt; truevisitor &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt; visitor&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(177, 177, 0);&quot;&gt;if&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;node &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;!&lt;/span&gt;&lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;is&lt;/span&gt; &lt;span style=&quot;color: rgb(0, 0, 0); font-weight: bold;&quot;&gt;null&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;{&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; more &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;~=&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;[&lt;/span&gt;node.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;left&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; node.&lt;span style=&quot;color: rgb(0, 102, 
0);&quot;&gt;right&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;]&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; truevisitor&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;node.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;data&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;}&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(177, 177, 0);&quot;&gt;if&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;more.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;length&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; levelOrder&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;more&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;[&lt;/span&gt;0&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;]&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; truevisitor&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt; more&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;[&lt;/span&gt;1&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;..&lt;/span&gt;$&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;]&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;}&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;void&lt;/span&gt; main&lt;span style=&quot;color: rgb(102, 204, 
102);&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt; &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;{&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &lt;span style=&quot;color: rgb(153, 51, 51);&quot;&gt;auto&lt;/span&gt; tree &lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;=&lt;/span&gt; node&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;1&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp; node&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;2&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; node&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;4&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp; node&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;7&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; node&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;5&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 
102);&quot;&gt;,&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp; node&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;3&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; node&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;6&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp; node&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;8&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;,&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp; node&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;9&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; write&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color: rgb(255, 0, 0);&quot;&gt;&quot; &amp;nbsp;preOrder: &quot;&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br 
/&gt;&amp;nbsp; &amp;nbsp; backtrackingOrder&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;!&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;Visit.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;pre&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;tree&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; write&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color: rgb(255, 0, 0);&quot;&gt;&quot;&lt;span style=&quot;color: rgb(0, 0, 153); font-weight: bold;&quot;&gt;\n&lt;/span&gt; &amp;nbsp; inOrder: &quot;&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; backtrackingOrder&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;!&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;Visit.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;inv&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;tree&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; write&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color: rgb(255, 0, 0);&quot;&gt;&quot;&lt;span style=&quot;color: rgb(0, 0, 153); font-weight: bold;&quot;&gt;\n&lt;/span&gt; postOrder: &quot;&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; 
&amp;nbsp; backtrackingOrder&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;!&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;Visit.&lt;span style=&quot;color: rgb(0, 102, 0);&quot;&gt;post&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;tree&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; write&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color: rgb(255, 0, 0);&quot;&gt;&quot;&lt;span style=&quot;color: rgb(0, 0, 153); font-weight: bold;&quot;&gt;\n&lt;/span&gt;levelorder: &quot;&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; levelOrder&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;tree&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; writeln&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: rgb(102, 204, 102);&quot;&gt;}&lt;/span&gt;&lt;/div&gt;&lt;pre&gt;Output:
  preOrder: 1 2 4 7 5 3 6 8 9
   inOrder: 7 4 2 5 1 8 6 9 3
 postOrder: 7 4 5 2 8 9 6 3 1
levelorder: 1 2 3 4 5 6 7 8 9
&lt;/pre&gt;A possible design for Java is to create a Node class (using generics to allow for different types of data) that has preOrder, inOrder, etc. methods. This simple design can be used in D too. But it conflates data structure and algorithms, so you can&apos;t use the same algorithms (the tree visits) for other kinds of nodes.&lt;br /&gt;&lt;br /&gt;So if your language is flexible enough (like C++, D, Haskell, or Python) it&apos;s better to follow the strategy used in the C++ STL, and implement generic algorithms that work with more kinds of binary tree nodes and allow for a more flexible management of the node being visited.&lt;br /&gt;&lt;br /&gt;Another design that can be natural in modern versions of Python is to create iterators that yield the current node. This is very handy, and such a strategy can be used in D too (with opApply or with the second iteration protocol of D2), but I have not used it in my D code. In Python (and in some other languages, in D too if you use opApply) this strategy leads to O(n^2) tree visits. (There is a Python PEP that, if well implemented, may solve this problem).&lt;br /&gt;&lt;br /&gt;The flexibility of the D language has allowed me to merge the three backtracking visits (pre-, in- and post-order) into a single templated function with zero overhead.&lt;br /&gt;&lt;br /&gt;In the code I have used a class to represent the node to keep the code a little simpler, but a struct can be used too with very small changes. Such structs are smaller (8 bytes less on a 32 bit system) and it&apos;s simpler to allocate them with a memory pool, which can halve the time needed to create and allocate the tree.&lt;br /&gt;&lt;br /&gt;I have added a helper node() templated function, it just allocates a new node and returns it. In D something similar can also be done with a static opCall inside Node, but here it&apos;s also templated, and templated methods can&apos;t replace the opCall of an object (I may be wrong on this, but I don&apos;t think so).
If you have a different Node class/struct you need a different helper function, or you have to use opCall (if the Node isn&apos;t generic), or you have to build the tree in a simpler way.&lt;br /&gt;&lt;br /&gt;Node!T is the new compact syntax introduced in D2 to instantiate a template when you have a single argument. The standard older syntax is Node!(T), which is equivalent to the C++ Node&amp;lt;T&amp;gt;.&lt;br /&gt;&lt;br /&gt;show() is a little function that&apos;s used as the default visitor of the node. It&apos;s templated, so you have to instantiate it before taking its address to use it as a function pointer.&lt;br /&gt;&lt;br /&gt;backtrackingOrder() merges the three tree visits, so its code is a little tricky.&lt;br /&gt;&lt;br /&gt;The enum Visit is used to represent one of the three possible visits. To specify it you first partially specify backtrackingOrder() according to the visit type, which is known at compile time, and then it is fully specified automatically from the given root node and an optional callable.&lt;br /&gt;&lt;br /&gt;The node can be any type that has left, right and data attributes.&lt;br /&gt;&lt;br /&gt;visitor can be any callable: a delegate, a function pointer, or a callable object. It is optional: if you omit it, show() instantiated on the type of the data is used.&lt;br /&gt;&lt;br /&gt;The static ifs are done at compile time, and they don&apos;t create a scope, so truevisitor (its type is found automatically) is then visible below the static if. The is() syntax is used to test if types are the same (a better and simpler design for the D language would be just to allow == between them).&lt;br /&gt;&lt;br /&gt;In library code you may add a few more static asserts inside backtrackingOrder() to test that visitor is a callable, with a line of code like:&lt;br /&gt;static assert(IsCallable!(typeof(truevisitor)), &quot;...&quot;);</description>
  <comments>http://leonardo-m.livejournal.com/88868.html</comments>
  <category>programming</category>
  <category>d language</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/88482.html</guid>
  <pubDate>Sat, 19 Sep 2009 15:05:44 GMT</pubDate>
  <title>Code performance in D/Java</title>
  <link>http://leonardo-m.livejournal.com/88482.html</link>
  <description>The code discussed in this post:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/js/life_bench.zip&quot;&gt;http://www.fantascienza.net/leonardo/js/life_bench.zip&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;To try to find &quot;performance bugs&quot; in both the LDC D compiler and the LLVM back-end I am exploring the performance signature of many small programs, often translating them to D. To perform such tests I compare the timings of the D code with the original C or Java code.&lt;br /&gt;&lt;br /&gt;Java here is useful because its performance profiles are a little different from the usual C ones. Java HotSpot is able to inline many virtual calls, and its garbage collector is quite efficient (both are weak points of all current D implementations).&lt;br /&gt;&lt;br /&gt;I&apos;ve found a small Life (John Horton Conway&apos;s game) implementation in Java that shows a higher performance compared to D code (the original Java code isn&apos;t mine), so I have cleaned up the Java code, removed the useless stuff from it (see the zip for the Java code), and translated it to D code that is as close as possible to it (able to run both on Tango and Phobos). The result is the first D program (life1_d.d).&lt;br /&gt;&lt;br /&gt;I have taken care of setting the main class as final in the D code, to allow inlining.&lt;br /&gt;&lt;br /&gt;The Java GC (of Sun) is more efficient than the nonmoving D GC, so first of all I have taken a look at the number of memory allocations, but they weren&apos;t the cause of the performance difference. I&apos;ve profiled both the D and Java code, and I&apos;ve seen that the calc_new() method was the one taking the most time.
I&apos;ve also seen that the amount of inlining with LDC (using -O5 -release -inline) was not enough, so I&apos;ve compiled the D code with the following command, which forces more aggressive inlining and improves performance:&lt;br /&gt;&lt;br /&gt;ldc -O5 -release -inline -inline-threshold=2000000001 life1_d.d&lt;br /&gt;&lt;br /&gt;I have also seen that for this program Link-Time Optimization plus Interning improves the performance of the code:&lt;br /&gt;&lt;br /&gt;ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1_d.d&lt;br /&gt;&lt;br /&gt;opt -std-compile-opts life1_d.bc &amp;gt; life1_do.bc&lt;br /&gt;&lt;br /&gt;llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1_do life1_do.bc&lt;br /&gt;&lt;br /&gt;I have done many more experiments; here you can find the ones that have given good results. I have split the Life class into its methods plus global values (see life2_d.d). I don&apos;t know why this increases performance with the LDC compiler (see the timings).&lt;br /&gt;&lt;br /&gt;Later I have used simple pointers as function arguments instead of arrays (see life3_d.d). Again I don&apos;t know why this increases the performance with the LDC compiler, especially because most of such functions get inlined anyway.&lt;br /&gt;&lt;br /&gt;Now the performance of life3_d.d was bad only for large values (the last values of the use_Sizes array). After several more experiments I have found by chance that the cause was in the inner loop of the calc_new() function, where the whole botRow array is reset to zero. I think here the Java compiler recognizes such a loop as a clearing, and replaces it with a call to a function like memset().
This makes the Java code slower than the D one for small botRow arrays, but faster for the longer ones (because it seems an inlined loop is faster than a memset when the length is small, below about 50-100 items if array items are 4 bytes long).&lt;br /&gt;&lt;br /&gt;So in life4_d.d I have replaced the loop with something a little more complex that takes a look at the botRow length:&lt;br /&gt;&lt;pre&gt;if (botRow.length &amp;gt; 100)
    botRow[] = 0;
else
    for (int c = 0; c &amp;lt; botRow.length; c++)
        botRow[c] = 0;
&lt;/pre&gt;The timings of the 4th D version are now good enough, but still not the best, and I don&apos;t know why. (I am now trying to find a faster array reset that uses an asm routine that contains the movntps SSE instruction).&lt;br /&gt;&lt;br /&gt;------------------------&lt;br /&gt;&lt;br /&gt;Scores on Windows Vista, using the DMD compiler for the D code (bigger is better):&lt;br /&gt;&lt;pre&gt;
java -server Life
Size    average
Adjusting 6744 to 2811246
5       14288
6       14974
8       14125
10      15697
15      14175
25      15369
50      14595
250     6331
1000    2218
2500    880


life1_d.exe
Size    average
5       9248
6       8938
8       10213
10      10469
15      10974
25      9898
50      7189
250     1694
1000    421
2500    170


life2_d.exe
Size    average
5       9424
6       8270
8       8083
10      7876
15      8389
25      7191
50      5288
250     1331
1000    331
2500    147


life3_d.exe
Size    average
5       10156
6       9172
8       9353
10      9043
15      8802
25      8452
50      5865
250     1357
1000    356
2500    151


life4_d.exe
Size    average
Adjusting 917315 to 0
5       10145
6       9275
8       9631
10      10282
15      10061
25      9083
50      6286
250     4747
1000    1596
2500    743
&lt;/pre&gt;&lt;br /&gt;Home Vista Basic with 2 GB RAM, Celeron 2.13 GHz&lt;br /&gt;&lt;br /&gt;Compilers used:&lt;br /&gt;&lt;br /&gt;Java version &quot;1.7.0-ea&quot;&lt;br /&gt;Java(TM) SE Runtime Environment (build 1.7.0-ea-b66)&lt;br /&gt;Java HotSpot(TM) Client VM (build 16.0-b06, mixed mode, sharing)&lt;br /&gt;&lt;br /&gt;DMD Digital Mars D Compiler v1.042&lt;br /&gt;&lt;br /&gt;------------------------&lt;br /&gt;&lt;br /&gt;Scores on Ubuntu running on VirtualBox running on Vista, using LDC compiler for the D code (bigger is better):&lt;br /&gt;&lt;pre&gt;java -server Life
Size	average
Adjusting 28766 to 951466
Adjusting 951466 to 2616313
5	13740
6	16494
8	14922
10	19481
15	20602
25	21249
50	19979
250	7493
1000	2609
2500	1004


ldc -O5 -release -inline life1_d.d
Size	average
5	17810
6	16661
8	15147
10	15201
15	14818
25	12901
50	4384
250	1893
1000	485
2500	198


ldc -O5 -release -inline -inline-threshold=2000000001 life1_d.d
Size	average
5	16016
6	14690
8	13183
10	11823
15	11100
25	10144
50	5203
250	2305
1000	571
2500	256
&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1_d.d&lt;br /&gt;opt -std-compile-opts life1_d.bc &amp;gt; life1_do.bc&lt;br /&gt;llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1_do life1_do.bc&lt;br /&gt;&lt;pre&gt;Size	average
5	22628
6	18773
8	14326
10	18820
15	17558
25	15264
50	10155
250	2329
1000	578
2500	251


ldc -O5 -release -inline -inline-threshold=2000000001 life2_d.d
Size	average
5	16557
6	15205
8	13149
10	12037
15	10870
25	10051
50	6470
250	2327
1000	571
2500	247


ldc -O5 -release -inline -inline-threshold=2000000001 life3_d.d
Size	average
5	18205
6	17032
8	15948
10	17798
15	17907
25	15656
50	10191
250	2310
1000	574
2500	248


ldc -O5 -release -inline -inline-threshold=2000000001 life4_d.d
Size	average
5	18169
6	16576
8	16129
10	17616
15	17937
25	15486
50	9667
250	6714
1000	2006
2500	886
&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life4_d.d&lt;br /&gt;opt -std-compile-opts life4_d.bc &amp;gt; life4_do.bc&lt;br /&gt;llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life4_do life4_do.bc&lt;br /&gt;&lt;pre&gt;Size	average
5	25522
6	22085
8	21079
10	20964
15	21987
25	19300
50	13160
250	6774
1000	1986
2500	900
&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;This version (not included in the zip) is the first D one, but the cleaning of the botRow array in calc_new() is done as in life4_d.d:&lt;br /&gt;ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1b_d.d&lt;br /&gt;opt -std-compile-opts life1b_d.bc &amp;gt; life1b_do.bc&lt;br /&gt;llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1b_do life1b_do.bc&lt;br /&gt;&lt;pre&gt;Size	average
5	22957
6	19394
8	16457
10	17208
15	18177
25	15522
50	5089
250	6638
1000	1971
2500	889
&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Code running on Ubuntu 9.04 running on VirtualBox 3.0.6 r52128.&lt;br /&gt;&lt;br /&gt;Compilers used:&lt;br /&gt;&lt;br /&gt;LDC based on DMD v1.045 and llvm 2.6 (Thu Sep 10 23:50:27 2009)&lt;br /&gt;&lt;br /&gt;Java version &quot;1.6.0_16&quot;&lt;br /&gt;Java(TM) SE Runtime Environment (build 1.6.0_16-b01)&lt;br /&gt;Java HotSpot(TM) Client VM (build 14.2-b01, mixed mode, sharing)&lt;br /&gt;&lt;br /&gt;------------------------</description>
  <comments>http://leonardo-m.livejournal.com/88482.html</comments>
  <category>dmd</category>
  <category>programming</category>
  <category>ldc</category>
  <category>benchmark</category>
  <category>d language</category>
  <category>java</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/88088.html</guid>
  <pubDate>Sun, 30 Aug 2009 19:45:46 GMT</pubDate>
  <title>&quot;Silly benchmark&quot; in D/C++</title>
  <link>http://leonardo-m.livejournal.com/88088.html</link>
  <description>All the code discussed here:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/js/silly_bench.zip&quot;&gt;http://www.fantascienza.net/leonardo/js/silly_bench.zip&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Slava Pestov, author of the programming language Factor, has created a very simple benchmark. Despite being called &quot;silly&quot;, this benchmark has shown me some interesting things.&lt;br /&gt;&lt;br /&gt;These are the original posts:&lt;br /&gt;&quot;Performance comparison between Factor and Java on a contrived benchmark&quot;:&lt;br /&gt;&lt;a href=&quot;http://factor-language.blogspot.com/2009/08/performance-comparison-between-factor.html&quot;&gt;http://factor-language.blogspot.com/2009/08/performance-comparison-between-factor.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&quot;Struct arrays benchmark revisited: trig function calls are slow in Java, but without them Factor is still 3x faster&quot;:&lt;br /&gt;&lt;a href=&quot;http://factor-language.blogspot.com/2009/08/struct-arrays-benchmark-revisited.html&quot;&gt;http://factor-language.blogspot.com/2009/08/struct-arrays-benchmark-revisited.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;What the benchmark does:&lt;br /&gt;- It works on points, which are triplets of single-precision floats, (x,y,z).&lt;br /&gt;- First, the benchmark creates a list of 5000000 points for i=0..4999999, where the ith point is (sin(i),cos(i)*3,sin(i)*sin(i)/2).&lt;br /&gt;- Then, each point is normalized; the x, y, and z components are divided by sqrt(x*x+y*y+z*z).&lt;br /&gt;- Finally, the maximum x, y, and z components over all points are found and printed out.&lt;br /&gt;&lt;br /&gt;The purpose of this benchmark is to compare the performance of an array of structs in Factor to an array of class references in Java. It also shows how trigonometric functions are managed.&lt;br /&gt;&lt;br /&gt;Slava has given Java code, that I have modified a bit, adding more timings inside, pulling out a subclass, making it static.
The overall running time of the Java version is improved just a little.&lt;br /&gt;&lt;br /&gt;The Java version is slow because of the large amount of memory allocations, and also because it doesn&apos;t use the CPU native trigonometric instructions (sin, cos or sincos) when the input is large, because of cross-platform issues and because the Intel CPU instructions return quite inaccurate results in some cases, like when the input numbers are large.&lt;br /&gt;&lt;br /&gt;So I have created a second Java version, that uses a helper class (probably written by Razzi, &lt;a href=&quot;http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&amp;amp;lang=java&amp;amp;id=4&quot;&gt;http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&amp;amp;lang=java&amp;amp;id=4&lt;/a&gt; ) to reduce the input range of trig functions. The result is much faster (about 2.5 seconds instead of 11.5).&lt;br /&gt;&lt;br /&gt;Then I have created three D versions:&lt;br /&gt;- The first version is designed to be very similar to the Java version, the array is filled with class references. It&apos;s slower at performing the sqrt() and other code, but even in non-release mode it&apos;s overall noticeably faster than the first Java version. Both the Java versions and this D version use a lot of RAM on each iteration of the program, and the GC doesn&apos;t free it efficiently.&lt;br /&gt;- The second D version is more optimized. The array is now composed of struct values. I also delete the array at the end of each loop, so the memory used by the program is about constant. I have also used free functions instead of the Silly class, but this doesn&apos;t have much effect. D1 structs don&apos;t allow constructors (D2 has struct constructors), so I have had to change the code a bit there too. The result is much faster, almost two times faster than the second Java version.&lt;br /&gt;- In the third D version I&apos;ve tried to manually optimize what was possible.
The DMD compiler doesn&apos;t inline functions with ref arguments and some other things. I have avoided a useless initialization of the struct array and I have manually inlined its filling. I have used for loops instead of foreach, and allocated memory from the GC heap in a raw way. I have optimized the code using only one call to sin() because the DMD compiler isn&apos;t able to see that the two calls to sin() have the same result. I have hand-optimized the computation of the max struct, even if it takes only a small percentage of the whole runtime. The result is a nice speedup compared to the second D version. In D there&apos;s no simple way to allocate an array of uninitialized structs (I have to test something with LDC: it may not initialize the array if the struct data is float x=void,y=void,z=void; to be tested). The LDC compiler can probably improve such timings.&lt;br /&gt;&lt;br /&gt;Later, to have a baseline timing, I have translated the code to C++, and I have compiled it with G++ and LLVM-GCC. The second compiler uses calls to the libc even with the -ffast-math compilation argument, so the running time is considerably slower. Compiled with G++ the program is considerably faster than the D code compiled with DMD.&lt;br /&gt;&lt;br /&gt;This code in Haskell (or Python, or D with my dlibs) can be lazy, which greatly reduces the memory used and probably the running time too.&lt;br /&gt;&lt;br /&gt;When possible I&apos;ll add timings with the LDC D compiler.&lt;br /&gt;&lt;br /&gt;(I have not put the Python+Psyco timings here because I don&apos;t have enough RAM to test it with n=5 million. With n=1 million the best Python+Psyco time is about 3700 milliseconds. With the first Java program the best timing with n=1 million is 577 milliseconds, more than six times faster).&lt;br /&gt;&lt;br /&gt;--------------------&lt;br /&gt;&lt;pre&gt;Timings on Windows:

...&amp;gt;java -Xms512m -Xmx512m -server Silly1
Run #0
0.8944272, 1.0, 0.4472136
10810 422 234 0  Time: 11466

...&amp;gt;java -Xms512m -Xmx512m -server Silly2
Run #0
0.8944272, 1.0, 0.4472136
1966 406 124 0  Time: 2496

...&amp;gt;silly_bench_cpp (g++)
Run #0
0.894427, 1.000000, 0.447214
334 255 36 5  Total=632

...&amp;gt;silly_bench_cpp (llvm-g++)
Run #2
0.894427, 1.000000, 0.447214
1304 374 21 5  Total=1705

...&amp;gt;silly_bench1_d (No -release mode):
Run #0
0.894427, 1.000000, 0.447214
6941 444 80 0  total = 7466

...&amp;gt;silly_bench2_d
Run #1
0.894427, 1.000000, 0.447214
785 446 74 0  total = 1305

...&amp;gt;silly_bench3_d
Run #2
0.894427, 1.000000, 0.447214
488 443 28 0  Total = 960
&lt;/pre&gt;------------------&lt;br /&gt;&lt;br /&gt;Compilation arguments for g++ and llvm-g++:&lt;br /&gt;g++ -Wall -O3 -s -fomit-frame-pointer -msse -msse2 -msse3 -march=native -ffast-math silly_bench_cpp.cpp -o silly_bench_cpp&lt;br /&gt;&lt;br /&gt;Compilation arguments for DMD:&lt;br /&gt;dmd -O -release -inline&lt;br /&gt;&lt;br /&gt;Intel CPU Celeron 560 at 2.13 GHz, 1GB RAM, Vista Home Basic.&lt;br /&gt;&lt;br /&gt;Compilers used:&lt;br /&gt;DMD v1.042&lt;br /&gt;gcc v. 4.3.3-dw2-tdm-1 (GCC)&lt;br /&gt;LLVM-G++ gcc version 4.2.1 (Based on Apple Inc. build 5636) (LLVM build)&lt;br /&gt;Java HotSpot(TM) Client VM (build 16.0-b06, mixed mode, sharing)</description>
  <comments>http://leonardo-m.livejournal.com/88088.html</comments>
  <category>llvm-g++</category>
  <category>g++</category>
  <category>programming</category>
  <category>benchmark</category>
  <category>d language</category>
  <category>c++</category>
  <category>python</category>
  <category>dmd</category>
  <category>java</category>
  <lj:security>public</lj:security>
  <lj:reply-count>2</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/87980.html</guid>
  <pubDate>Thu, 27 Aug 2009 22:36:39 GMT</pubDate>
  <title>Updates and links</title>
  <link>http://leonardo-m.livejournal.com/87980.html</link>
  <description>I have updated two toy raytracers (Sphereflake and Yopyra) on my two software pages.&lt;br /&gt;&lt;br /&gt;I have also added a new benchmark, about the Boggle game:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/js/boggle.zip&quot;&gt;http://www.fantascienza.net/leonardo/js/boggle.zip&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;---------------------&lt;br /&gt;&lt;br /&gt;A nice and probably useful hierarchical allocator for C programs:&lt;br /&gt;&lt;a href=&quot;http://swapped.cc/halloc/&quot;&gt;http://swapped.cc/halloc/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Why colors are important in scientific visualizations, and the risks of using them badly:&lt;br /&gt;&lt;a href=&quot;http://www.research.ibm.com/people/l/lloydt/color/color.HTM&quot;&gt;http://www.research.ibm.com/people/l/lloydt/color/color.HTM&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Theo discusses what&apos;s important for young people to learn, the risks of violent video games, and the low quality of educational software:&lt;br /&gt;&lt;a href=&quot;http://theodoregray.com/BrainRot/index.html&quot;&gt;http://theodoregray.com/BrainRot/index.html&lt;/a&gt;</description>
  <comments>http://leonardo-m.livejournal.com/87980.html</comments>
  <category>llvm-g++</category>
  <category>g++</category>
  <category>programming</category>
  <category>benchmark</category>
  <category>d language</category>
  <category>c++</category>
  <category>dmd</category>
  <category>links</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/87801.html</guid>
  <pubDate>Sat, 15 Aug 2009 21:25:16 GMT</pubDate>
  <title>Updates and links</title>
  <link>http://leonardo-m.livejournal.com/87801.html</link>
  <description>&quot;Boostbench&quot;, a benchmark in C, Java and D. The zip contains all the code, timings and information:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/js/boostbench.zip&quot;&gt;http://www.fantascienza.net/leonardo/js/boostbench.zip&lt;/a&gt;&lt;br /&gt;Original code by Rene Grothmann:&lt;br /&gt;&lt;a href=&quot;http://mathsrv.ku-eichstaett.de/MGF/homes/grothmann/java/bench/&quot;&gt;http://mathsrv.ku-eichstaett.de/MGF/homes/grothmann/java/bench/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;With the LDC compiler the performance is poor without Link-Time Optimization plus Interning, I don&apos;t know why. (If someone is able to tell me, I&apos;ll be interested to know.)&lt;br /&gt;&lt;br /&gt;-------------------&lt;br /&gt;&lt;br /&gt;Theo Jansen demonstrates the amazingly lifelike kinetic sculptures he builds from plastic tubes and lemonade bottles, and then lets them walk and live on a beach. They even have a brain made of bottles. Windosaurs, etc:&lt;br /&gt;&lt;a href=&quot;http://www.youtube.com/watch?v=b694exl_oZo&quot;&gt;http://www.youtube.com/watch?v=b694exl_oZo&lt;/a&gt;</description>
  <comments>http://leonardo-m.livejournal.com/87801.html</comments>
  <category>llvm-gcc</category>
  <category>programming</category>
  <category>benchmark</category>
  <category>d language</category>
  <category>gcc</category>
  <category>c</category>
  <category>ldc</category>
  <category>java</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/87388.html</guid>
  <pubDate>Fri, 14 Aug 2009 13:17:35 GMT</pubDate>
  <title>Jarvis March in D</title>
  <link>http://leonardo-m.livejournal.com/87388.html</link>
  <description>Robert C. Martin on his blog has written an implementation of the Jarvis March algorithm in Clojure, and was looking for ways to speed up the code:&lt;br /&gt;&lt;a href=&quot;http://blog.objectmentor.com/articles/2009/08/11/jarvis-march-in-clojure&quot;&gt;http://blog.objectmentor.com/articles/2009/08/11/jarvis-march-in-clojure&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;See also:&lt;br /&gt;&lt;a href=&quot;http://www.butunclebob.com/ArticleS.UncleBob.ConvexHullTiming&quot;&gt;http://www.butunclebob.com/ArticleS.UncleBob.ConvexHullTiming&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;So I have translated the Java code to D (and later C), and then I have done some timing tests.&lt;br /&gt;&lt;br /&gt;The Java results are quite good compared to the D ones.&lt;br /&gt;&lt;br /&gt;The number of points in the hull is just 14 in all tests.&lt;br /&gt;&lt;pre&gt;Timings, n = 2_000_000:

On Windows XP, best of 6, seconds:
  D 1:    1.88 (DMD compiler)
  Java 1: 1.83
  C 2:    1.55 (GCC compiler)
  Java 2: 1.55
  Java 1: 1.47 (-server)
  Java 2: 1.47 (-server -Xms40M)
  C 2:    1.46 (LLVM-GCC compiler)
  Java 2: 1.45 (-server)
  D 2:    1.32 (DMD compiler)

On Pubuntu, best of 6, seconds:
  Java 1: 2.71
  C 2:    2.59 (GCC compiler)
  Java 1: 2.42 (-server)
  Java 2: 2.39
  Java 2: 2.37 (-server)
  D 1:    2.06 (LDC compiler)
  D 1:    2.06 (LDC compiler, LTO+I)
  D 2:    2.02 (LDC compiler, LTO+I)  
  D 2:    2.00 (LDC compiler)
&lt;/pre&gt;Notes:&lt;br /&gt;- The code of the D 1 version is similar to the Java 1 version. The code of the D 2 version is similar to the Java 2 version and C 2 version.&lt;br /&gt;- On Pubuntu all timings are higher because it has slower access to memory.&lt;br /&gt;- I have used a (low quality) portable random number generator to ensure equal test cases on all compilers and operating systems. (And because while the D std library Tango has a Gaussian generator, the Phobos std lib of D v1 doesn&apos;t have it).&lt;br /&gt;- Currently the seed of the random generator can&apos;t be changed.&lt;br /&gt;- All the programs give the same output, but the results aren&apos;t equal up to the last floating-point digits anyway because different FP instructions produce slightly different results. FP numbers are approximations.&lt;br /&gt;- The C code is slower, I don&apos;t know why.&lt;br /&gt;- The D 2 code is much faster than D 1 with DMD because DMD has limited inlining capabilities (this is why I have created the D 2 version).&lt;br /&gt;- In the D code the abs function is not taken from Tango because LDC has a performance bug, and otherwise it&apos;s not able to inline it.&lt;br /&gt;- For the LTO+I see below.&lt;br /&gt;&lt;br /&gt;More info and all the code can be found here:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/js/jarvism.zip&quot;&gt;http://www.fantascienza.net/leonardo/js/jarvism.zip&lt;/a&gt;</description>
  <comments>http://leonardo-m.livejournal.com/87388.html</comments>
  <category>llvm-gcc</category>
  <category>gcc</category>
  <category>geometry</category>
  <category>ldc</category>
  <category>benchmark</category>
  <category>c</category>
  <category>d language</category>
  <category>java</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/86620.html</guid>
  <pubDate>Tue, 11 Aug 2009 18:10:41 GMT</pubDate>
  <link>http://leonardo-m.livejournal.com/86620.html</link>
  <description>I have improved the code of the Sphereflake raytracer and thanks to Tomas Lindquist Olsen I have fixed my LDC installation, so I have redone the benchmarks there, on Pubuntu:&lt;br /&gt;&lt;pre&gt;
Timings on Pubuntu, w=h=1024, lvl=6, best of 3, seconds:
(66_430 spheres, WITH_SHADOWS=true, FASTER_LDC=true)
  D:    4.33  (claiming 4.6 MB) (first fast version + LTO+I)
  D:    4.51  (claiming 2.5 MB) (second fast version + LTO+I)
  D:    4.55  (claiming 4.6 MB) (basic version + LTO+I)
  D:    4.70  (claiming 2.5 MB) (second fast version)
  D:    4.72  (claiming 4.6 MB) (first fast version)
  D:    4.97  (claiming 4.6 MB) (basic version)
  C++:  5.04  (claiming 4.6 MB) (+4 bytes padding)
  C++:  5.75  (claiming 4.3 MB) (original version)

Timings on Pubuntu, w=h=1024, lvl=7, best of 3, seconds:
(597_871 spheres, WITH_SHADOWS=true, FASTER_LDC=true)
  D:    5.51  (claiming 23 MB) (second fast version + LTO+I)
  D:    5.74  (claiming 41 MB) (first fast version + LTO+I)

Timings on Pubuntu, w=h=1024, lvl=8, best of 3, seconds:
  D:   10.84  (claiming 205 MB) (second fast version + LTO+I)
  D:   11.21  (claiming 205 MB) (second fast version)
  D:   23.34  (claiming 369 MB) (first fast version + LTO+I)
  D:   23.84  (claiming 369 MB) (basic version + LTO+I)
  D:   24.06  (claiming 369 MB) (first fast version)
  D:   24.55  (claiming 369 MB) (basic version)
  C++: 24.61  (claiming 369 MB) (+4 bytes padding)
&lt;/pre&gt;&lt;br /&gt;For more information, and for the updated Sphereflake code:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/js/index.html#flake&quot;&gt;http://www.fantascienza.net/leonardo/js/index.html#flake&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;----------------&lt;br /&gt;&lt;br /&gt;I have added a new benchmark, a sparse matrix multiplication. This time the performance of the D code (with LDC) isn&apos;t very good. Info and timings are inside the zip:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/js/index.html#jaspa&quot;&gt;http://www.fantascienza.net/leonardo/js/index.html#jaspa&lt;/a&gt;</description>
  <comments>http://leonardo-m.livejournal.com/86620.html</comments>
  <category>raytracing</category>
  <category>ldc</category>
  <category>benchmark</category>
  <category>c</category>
  <category>d language</category>
  <category>java</category>
  <category>c++</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/86431.html</guid>
  <pubDate>Wed, 05 Aug 2009 15:33:32 GMT</pubDate>
  <title>Sphereflake ray-tracing benchmark in C++ and D</title>
  <link>http://leonardo-m.livejournal.com/86431.html</link>
  <description>Timings of a small ray-tracing &quot;Sphereflake&quot; benchmark (you can find a copy of this HTML page in the references directory of the zip too):&lt;br /&gt;&lt;a href=&quot;http://ompf.org/ray/sphereflake/&quot;&gt;http://ompf.org/ray/sphereflake/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;center&gt;&lt;img src=&quot;http://www.fantascienza.net/leonardo/js/sphereflake5_ico.png&quot; border=&quot;1&quot;&gt;&lt;/center&gt;&lt;br /&gt;&lt;br /&gt;The D-LDC results of this benchmark are good. I have also created a faster D version where the output is now in P5 pgm format (bytes are represented as single chars, this speeds up the output) and where the transcendental values are precomputed (G++ 4.3.3 is able to pre-compute them, while the current LDC isn&apos;t able to).&lt;br /&gt;&lt;br /&gt;A possible way to further speed up the code: when lvl=8 it creates 5_380_840 visible spheres and node_t, but it needs only 597_871 bounding spheres (the leaves don&apos;t need such bounding spheres and most of the nodes of a tree are leaves).&lt;br /&gt;&lt;br /&gt;So it may be useful to split the array of node_t into two arrays, one much larger that just contains the visible spheres and the skip pointer, and one smaller that contains the bounding spheres (and maybe another skip pointer). In the end the CPU cache has to manage two arrays, so the program may end up slower.&lt;br /&gt;&lt;br /&gt;With that idea, at lvl=8 the memory used becomes 205 MB (or 228 MB if doubles are aligned to 8 bytes) instead of the current 369 MB, which lowers the CPU cache traffic.&lt;br /&gt;&lt;br /&gt;I have 2 GB RAM on my PC, so I can&apos;t run it with lvl=9. But after removing just the radii of the visible spheres it would need less than 2 GB of RAM (in MB):&lt;br /&gt;( 48427561 * (8 + 3 * 8) + 5380840 * (8 + 4 * 8) ) / (1024*1024) = 1683.15&lt;br /&gt;&lt;pre&gt;
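&lt;/pre&gt;The lvl=9 estimate above can be re-derived with a small Python sketch (not part of the benchmark code). The closed form for the sphere count reproduces the counts quoted in this post; the function names and the per-node layout noted in the comments are assumptions:&lt;br /&gt;&lt;pre&gt;
# Sketch only: sphere_count() and lvl9_megabytes() are illustrative names.

def sphere_count(lvl):
    # Reproduces the counts quoted in this post:
    # 66_430 (lvl=6), 597_871 (lvl=7), 5_380_840 (lvl=8).
    return (9 ** lvl - 1) // 8

def lvl9_megabytes():
    visible = sphere_count(9)       # 48_427_561 visible spheres
    bounding = sphere_count(8)      # 5_380_840 bounding spheres
    visible_bytes = 8 + 3 * 8       # assumed: skip pointer + centre only
    bounding_bytes = 8 + 4 * 8      # assumed: skip pointer + centre + radius
    total = visible * visible_bytes + bounding * bounding_bytes
    return total / (1024.0 * 1024.0)

print(round(lvl9_megabytes(), 2))   # 1683.15
&lt;/pre&gt;&lt;pre&gt;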
Timings on Windows, w=h=1024, lvl=6, best of 3, seconds:
  C++:  4.60  (claiming 4.6 MB) (+4 bytes padding)
  C++:  4.60  (claiming 4.6 MB)
  C++:  4.63  (claiming 4.6 MB) (+4 bytes padding, LLVM-G++)
  C++:  4.64  (claiming 4.6 MB) (LLVM-G++)
  C++:  4.68  (claiming 4.6 MB) (+4 bytes padding, PGO)
  D:   11.78  (claiming 4.6 MB) (DMD v1.043, with &quot;ref&quot;, fast version)
  D:   17.34  (claiming 4.6 MB) (DMD v1.043, with &quot;ref&quot;)
  D:   17.34  (claiming 4.6 MB) (DMD v1.043, with &quot;ref&quot;, no GC)
  D:   28.24  (claiming 4.6 MB) (DMD v2.031, with no &quot;ref&quot;)
  D:   28.27  (claiming 4.6 MB) (DMD v2.031, with no &quot;ref&quot;, gs)
  D:   29.78  (claiming 4.6 MB) (DMD v1.043, with no &quot;ref&quot;)
  D:   29.79  (claiming 4.6 MB) (DMD v1.046, with no &quot;ref&quot;)
(These DMD-Windows timings are not reliable: I have timed it as low as 12.5 seconds)


Timings on Pubuntu, w=h=1024, lvl=6, best of 3, seconds:
(66_430 spheres, WITH_SHADOWS=true, FASTER_LDC=true)
  D:    4.81  (claiming 4.6 MB) (with fast output)
  C++:  4.86  (claiming 4.6 MB) (+4 bytes padding)
  D:    4.91  (claiming 4.6 MB) (fast version)
  D:    4.93  (claiming 4.6 MB)
  C++:  5.75  (claiming 4.3 MB)

Timings on Pubuntu, w=h=1024, lvl=8, best of 3, seconds:
(5_380_840 spheres)
  C++: 25.00  (claiming 369 MB) (+4 bytes padding)
  D:   25.27  (claiming 369 MB) (fast version)
  D:   25.62  (claiming 369 MB) (with &quot;ref&quot;)
&lt;/pre&gt;&lt;br /&gt;Key:&lt;br /&gt;  no GC = garbage collector disabled for the whole run&lt;br /&gt;  +4 bytes padding = 4 bytes of padding added to the Node struct&lt;br /&gt;  PGO = profile-guided optimization&lt;br /&gt;  gs = all global variables are annotated with __gshared&lt;br /&gt;&lt;br /&gt;Args:&lt;br /&gt;  g++ -O3 -s -fomit-frame-pointer -msse3 -march=native -ffast-math&lt;br /&gt;  llvm-g++ -O3 -s -fomit-frame-pointer -msse3 -march=native -ffast-math&lt;br /&gt;  dmd -O -release -inline&lt;br /&gt;  ldc -O5 -release -inline&lt;br /&gt;  &lt;br /&gt;&lt;br /&gt;All the tested code:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/js/sphereflake.zip&quot;&gt;http://www.fantascienza.net/leonardo/js/sphereflake.zip&lt;/a&gt;</description>
  <comments>http://leonardo-m.livejournal.com/86431.html</comments>
  <category>llvm-g++</category>
  <category>g++</category>
  <category>programming</category>
  <category>ldc</category>
  <category>benchmark</category>
  <category>d language</category>
  <category>c++</category>
  <lj:security>public</lj:security>
  <lj:reply-count>4</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/86110.html</guid>
  <pubDate>Wed, 05 Aug 2009 14:38:44 GMT</pubDate>
  <title>A small raytracing benchmark</title>
  <link>http://leonardo-m.livejournal.com/86110.html</link>
  <description>To test the LDC D compiler and its LLVM backend, and to look for more possible optimizations, I have translated to D a small ray tracing benchmark present on Jon Harrop&apos;s site:&lt;br /&gt;&lt;a href=&quot;http://www.ffconsultancy.com/languages/ray_tracer/benchmark.html&quot;&gt;http://www.ffconsultancy.com/languages/ray_tracer/benchmark.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;It contains translations in several languages of progressively more optimized versions of a toy ray tracer program. As reference I have used the first C++ implementation, the slowest one, hoping it can show me where LDC misses some possible optimizations (if I translate to D1 the faster C++ version I may offer fewer chances to the LDC and G++ compilers to show their optimization capabilities).&lt;br /&gt;&lt;br /&gt;To translate the code to D I have had to change a few things, so to have a fairer comparison I have back-ported those changes to C++.&lt;br /&gt;&lt;br /&gt;The two main changes I have made to the first C++ version are:&lt;br /&gt;1) vector is used instead of list. This speeds up the code a little. I have done this because D has no built-in list type (there are lists in Phobos2 and Tango that I have not used).&lt;br /&gt;&lt;br /&gt;2) Inside the Group class &quot;bound&quot; is a Sphere* instead of a Sphere, because in D classes are used by reference only. This has slowed down the code. So I have created a &quot;ray1_scoped_cpp&quot; C++ version that keeps &quot;bound&quot; as a value. It shows where D misses a possible optimization. Some people have asked for a &apos;scope&apos; among class attributes, but the discussion has gone nowhere (I think Walter has never answered). But in some situations LDC has recently become able to &apos;scope&apos; classes by itself, as an optimization. So can LDC perform such optimization anyway, with no change to the language, as a way to optimize? 
It&apos;s not very quick to do: you have to check that no references to the class &quot;f&quot; in this class escape. It may be doable.&lt;br /&gt;&lt;pre&gt;Timings on Windows:
  ray1_scoped_cpp  6.52
  ray1_cpp         6.77
  ray1_d          16.01 (DMD v1.043)

Timings on Pubuntu:
  ray1_scoped_cpp  7.99
  ray1_d           8.51 (LDC Aug 3 2009)
  ray1_cpp         8.66
&lt;/pre&gt;CPU used: Core2 at 2 GHz.&lt;br /&gt;&lt;br /&gt;Compilers:&lt;br /&gt;LDC LLVM D Compiler based on DMD v1.045 and llvm 2.6svn (Mon Aug  3 22:09:36 2009)&lt;br /&gt;On Pubuntu G++ version 4.2.4 (Ubuntu 4.2.4-1ubuntu4)&lt;br /&gt;DMD v1.042&lt;br /&gt;On Windows gcc version 4.3.3-dw2-tdm-1 (GCC)&lt;br /&gt;&lt;br /&gt;Compilation arguments:&lt;br /&gt;Windows:&lt;br /&gt;llvm-g++ -Wall -O3 -s -fomit-frame-pointer -msse3 -march=native ray1d_cpp.cpp -o ray1d_cpp&lt;br /&gt;&lt;br /&gt;Ubuntu:&lt;br /&gt;g++ -Wall -O3 -s -fomit-frame-pointer -msse3 -march=native ray1d_cpp.cpp -o ray1d_cpp&lt;br /&gt;&lt;br /&gt;All the D and C++ code tested here:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/js/ray.zip&quot;&gt;http://www.fantascienza.net/leonardo/js/ray.zip&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;In future I may add a D translation of the 5th (the fastest) C++ version too, as reference point.</description>
  <comments>http://leonardo-m.livejournal.com/86110.html</comments>
  <category>llvm-g++</category>
  <category>g++</category>
  <category>programming</category>
  <category>ldc</category>
  <category>benchmark</category>
  <category>d language</category>
  <category>c++</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/85528.html</guid>
  <pubDate>Wed, 29 Jul 2009 11:00:26 GMT</pubDate>
  <title>Update and links</title>
  <link>http://leonardo-m.livejournal.com/85528.html</link>
  <description>A small benchmark, computing digits of Pi, comparing D, C++ and Java:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/js/pi_bench.zip&quot;&gt;http://www.fantascienza.net/leonardo/js/pi_bench.zip&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;------------------------&lt;br /&gt;&lt;br /&gt;Case study: Improving the performance of matrix multiplication by 296,260x&lt;br /&gt;&lt;a href=&quot;http://stellar.mit.edu/S/course/6/fa08/6.197/courseMaterial/topics/topic2/lectureNotes/Intro_and_MxM/Intro_and_MxM.pdf&quot;&gt;http://stellar.mit.edu/S/course/6/fa08/6.197/courseMaterial/topics/topic2/lectureNotes/Intro_and_MxM/Intro_and_MxM.pdf&lt;/a&gt;&lt;br /&gt;&lt;a href=&quot;http://stellar.mit.edu/S/course/6/fa08/6.197/&quot;&gt;http://stellar.mit.edu/S/course/6/fa08/6.197/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;What&apos;s the fastest programming language? It may be Python, because there are CorePy and PyCuda that allow you to write very fast programs:&lt;br /&gt;&lt;a href=&quot;http://www.corepy.org/&quot;&gt;http://www.corepy.org/&lt;/a&gt;&lt;br /&gt;&lt;a href=&quot;http://mathema.tician.de/software/pycuda&quot;&gt;http://mathema.tician.de/software/pycuda&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;One of the best biology blogs I have found so far:&lt;br /&gt;&lt;a href=&quot;http://bytesizebio.net/&quot;&gt;http://bytesizebio.net/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Sig.ma - Live views on the Web of Data (a demo video):&lt;br /&gt;&lt;a href=&quot;http://vimeo.com/5703809&quot;&gt;http://vimeo.com/5703809&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Lots of people know Java-style object-oriented programming, while knowledge of pure mathematics is much less common (even if knowing some group theory, etc, can be quite useful. Such topics are part of courses at the university if you want a degree in computer science). Marc Conrad has shown ways to explain some of those mathematical ideas using OOP ideas. 
It may be a good idea:&lt;br /&gt;&lt;a href=&quot;http://ring.perisic.com/&quot;&gt;http://ring.perisic.com/&lt;/a&gt;</description>
  <comments>http://leonardo-m.livejournal.com/85528.html</comments>
  <category>benchmark</category>
  <category>d language</category>
  <category>java</category>
  <category>c++</category>
  <category>links</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/85364.html</guid>
  <pubDate>Mon, 20 Jul 2009 11:32:55 GMT</pubDate>
  <title>Linpack benchmark in D</title>
  <link>http://leonardo-m.livejournal.com/85364.html</link>
  <description>I have translated a C Linpack benchmark to D, and I have timed it against the C version and a similar Java version.&lt;br /&gt;&lt;br /&gt;Some of the D Link-Time optimized (+ interning) versions segfault (binaries produced by DMD here never segfault). The C code compiled with GCC with the -msse3 -march=core2 compilation arguments segfaults too (that&apos;s why I have used llvm-gcc, which has never produced a segfault). I don&apos;t know why; it may be a pointer aliasing problem.&lt;br /&gt;&lt;br /&gt;For bigger matrices Java proves to be about as fast as gcc-C and LDC-D. Both float and real versions are very slow. Static arrays are a bit faster with LDC but not much, I don&apos;t know why.&lt;br /&gt;&lt;br /&gt;See inside the zip for the full timings:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/js/linpack.zip&quot;&gt;http://www.fantascienza.net/leonardo/js/linpack.zip&lt;/a&gt;</description>
  <comments>http://leonardo-m.livejournal.com/85364.html</comments>
  <category>benchmark</category>
  <category>c</category>
  <category>d language</category>
  <category>python</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/85166.html</guid>
  <pubDate>Fri, 17 Jul 2009 00:52:16 GMT</pubDate>
  <title>Updates</title>
  <link>http://leonardo-m.livejournal.com/85166.html</link>
  <description>Updated the rectangle packing code: added a D and ShedSkin version of the first algorithm:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/so/index.html#rectpack&quot;&gt;http://www.fantascienza.net/leonardo/so/index.html#rectpack&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Added a D version (fvd) of the fv tool shown here, the D version compiled with LDC is a bit faster than the C++ version (probably thanks to unlocked_get/set that can also be used in C++):&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/ar/string_repetition_statistics/string_repetition_statistics.html&quot;&gt;http://www.fantascienza.net/leonardo/ar/string_repetition_statistics/string_repetition_statistics.html&lt;/a&gt;</description>
  <comments>http://leonardo-m.livejournal.com/85166.html</comments>
  <category>programming</category>
  <category>update</category>
  <category>d language</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/84867.html</guid>
  <pubDate>Wed, 15 Jul 2009 00:01:26 GMT</pubDate>
  <title>Rectangle packing</title>
  <link>http://leonardo-m.livejournal.com/84867.html</link>
  <description>Efficient packing of many rectangles inside a bigger one:&lt;br /&gt;&lt;img src=&quot;http://www.fantascienza.net/leonardo/so/rect_pack.png&quot; alt=&quot;&quot; border=&quot;1&quot; /&gt;&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/so/index.html#rectpack&quot;&gt;http://www.fantascienza.net/leonardo/so/index.html#rectpack&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Improved from this C# code:&lt;br /&gt;&lt;a href=&quot;http://kossovsky.net/index.php/2009/07/cshar-rectangle-packing/&quot;&gt;http://kossovsky.net/index.php/2009/07/cshar-rectangle-packing/&lt;/a&gt;&lt;br /&gt;&lt;a href=&quot;https://devel.nuclex.org/framework/browser/game/Nuclex.Game/trunk/Source/Packing&quot;&gt;https://devel.nuclex.org/framework/browser/game/Nuclex.Game/trunk/Source/Packing&lt;/a&gt;</description>
  <comments>http://leonardo-m.livejournal.com/84867.html</comments>
  <category>psyco</category>
  <category>geometry</category>
  <category>d language</category>
  <category>python</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/84408.html</guid>
  <pubDate>Sat, 04 Jul 2009 21:57:38 GMT</pubDate>
  <title>Richards benchmark</title>
  <link>http://leonardo-m.livejournal.com/84408.html</link>
  <description>This old post of mine is about virtual methods and devirtualizations:&lt;br /&gt;&lt;a href=&quot;http://leonardo-m.livejournal.com/76547.html&quot;&gt;http://leonardo-m.livejournal.com/76547.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I have performed more benchmarks about virtual methods and their costs with LDC. I have used a small, well-known benchmark, the Richards one. I have used the code from here (on the Web Archive because it seems to be no longer online):&lt;br /&gt;&lt;a href=&quot;http://web.archive.org/web/20060715074131/lissett.port5.com/ben/bench1.htm&quot;&gt;http://web.archive.org/web/20060715074131/lissett.port5.com/ben/bench1.htm&lt;/a&gt;&lt;br /&gt;&lt;a href=&quot;http://web.archive.org/web/20060715074131/http://lissett.port5.com/ben/bench3.htm&quot;&gt;http://web.archive.org/web/20060715074131/http://lissett.port5.com/ben/bench3.htm&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;All the code of the following benchmarks:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/js/richards.zip&quot;&gt;http://www.fantascienza.net/leonardo/js/richards.zip&lt;/a&gt;&lt;br /&gt;&lt;pre&gt;
Timing results:

Windows, n = 10_000_000:
  C:       1.26
  D ~C:    2.01
  Java:    2.04 (final classes, -server)
  D2 ~C#:  2.36 (final classes)
  D3 ~C#:  2.36 (final classes, no getters/setters)  
  Java:    2.73 (final classes)
  D4 ~C#:  3.07 (no getters/setters)  
  C#:      3.98
  D1 ~C#:  4.23
  
Windows, n = 100_000_000:
  C:      12.16
  Java:   18.73 (final classes, -server)
  D ~C:   18.86
  D2 ~C#: 23.11 (final classes)
  D3 ~C#: 23.12 (final classes, no getters/setters)  
  Java:   25.40 (final classes)
  D4 ~C#: 30.16 (no getters/setters)    
  C#:     38.39
  D1 ~C#: 41.64


Pubuntu, n = 10_000_000:
  D ~C:    1.35
  C:       1.39
  D2 ~C#:  1.98 (final classes)  
  D3 ~C#:  2.00 (final classes, no getters/setters) 
  Java:    2.73 (final classes)  
  D4 ~C#:  2.94 (no getters/setters)      
  C#:      -  
  D1 ~C#:  4.03
  
Pubuntu, n = 100_000_000:
  D ~C:   13.24
  C:      13.77
  D2 ~C#: 19.64 (final classes)
  D3 ~C#: 19.92 (final classes, no getters/setters) 
  Java:   25.16 (final classes)    
  D4 ~C#: 29.16 (no getters/setters)       
  C#:      -
  D1 ~C#: 40.17

Key:
  D ~C# means D code that comes from the C# version.
  D ~C means D code that comes from, and looks like, the C version.
&lt;/pre&gt;&lt;br /&gt;Note that the classes in the C# code aren&apos;t final. As usual Java shows very good performance.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;On Windows XP:&lt;br /&gt;DMD Digital Mars D Compiler v1.042&lt;br /&gt;gcc version 4.3.3-dw2-tdm-1 (GCC)&lt;br /&gt;&lt;br /&gt;dmd used with:&lt;br /&gt;dmd -O -release -inline&lt;br /&gt;&lt;br /&gt;ldc used with:&lt;br /&gt;ldc -O5 -release -inline&lt;br /&gt;&lt;br /&gt;gcc used with:&lt;br /&gt;gcc -Wall -O3 -s -fomit-frame-pointer -msse3 -march=core2&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;On Pubuntu:&lt;br /&gt;gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4)&lt;br /&gt;ldc based on DMD v1.045 and llvm 2.6svn (Thu Jul  2 23:07:48 2009)&lt;br /&gt;&lt;br /&gt;ldc used with:&lt;br /&gt;ldc -O5 -release -inline&lt;br /&gt;&lt;br /&gt;gcc used with:&lt;br /&gt;gcc -Wall -O3 -s -fomit-frame-pointer -msse3 -march=native</description>
  <comments>http://leonardo-m.livejournal.com/84408.html</comments>
  <category>dmd</category>
  <category>ldc</category>
  <category>benchmark</category>
  <category>c</category>
  <category>d language</category>
  <category>python</category>
  <lj:security>public</lj:security>
  <lj:reply-count>8</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/84180.html</guid>
  <pubDate>Fri, 03 Jul 2009 16:13:28 GMT</pubDate>
  <title>amb chain - update</title>
  <link>http://leonardo-m.livejournal.com/84180.html</link>
  <description>Updated the &apos;amb chain&apos; article with better implementations:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/ar/amb_chain.html&quot;&gt;http://www.fantascienza.net/leonardo/ar/amb_chain.html&lt;/a&gt;</description>
  <comments>http://leonardo-m.livejournal.com/84180.html</comments>
  <category>benchmark</category>
  <category>d language</category>
  <category>python</category>
  <category>keywords: programming</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/83812.html</guid>
  <pubDate>Sun, 28 Jun 2009 13:31:52 GMT</pubDate>
  <link>http://leonardo-m.livejournal.com/83812.html</link>
  <description>A new informatics article of mine, about word chains, in Python and D:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/ar/amb_chain.html&quot;&gt;http://www.fantascienza.net/leonardo/ar/amb_chain.html&lt;/a&gt;</description>
  <comments>http://leonardo-m.livejournal.com/83812.html</comments>
  <category>programming</category>
  <category>benchmark</category>
  <category>d language</category>
  <category>python</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/83694.html</guid>
  <pubDate>Fri, 19 Jun 2009 22:01:48 GMT</pubDate>
  <title>Built-in hashing in D and Python</title>
  <link>http://leonardo-m.livejournal.com/83694.html</link>
  <description>Both the D1/D2 languages and Python have built-in hashing: in Python there are dicts, sets and frozensets, while in D there are associative arrays.&lt;br /&gt;&lt;br /&gt;Having them built-in is very handy, and it allows a nice syntax. In future the D2 language may keep such syntax while moving their implementation into the standard library, allowing more flexibility and keeping them about as handy as now (probably the compilation time may grow some and error messages may worsen, but the efficiency and flexibility will increase).&lt;br /&gt;&lt;br /&gt;Instances of any Python class you define can be used inside a dict/set, but by default the hash value of the object is the ID and the equality too is done among IDs. So to avoid duplications you must define __hash__() and __eq__() (or __cmp__()).&lt;br /&gt;&lt;br /&gt;In D you have to define toHash() and opEquals(), and opCmp() too, because the hash is currently implemented in a different way: hash collisions (when the hash values are equal but opEquals is false) are resolved with a sorted search tree (that requires a &amp;lt;).&lt;br /&gt;&lt;br /&gt;This means that while Python dicts are currently usually faster or much faster than D associative arrays when used on normal data with normally good hash functions, with many collisions D AAs switch from being about O(1) to being about O(log n) in insert and access time, while Python dicts in such situations may become O(n). So D AAs are safer when the hash function is bad (or even in situations of hash attacks).&lt;br /&gt;&lt;br /&gt;I have written small programs to benchmark the situation. (In the tests I have not used Psyco or the new LDC D compiler because their performance is the same or worse.)&lt;br /&gt;&lt;br /&gt;This is the Python test (only insertions are performed, so this benchmark is also strongly determined by the efficiency of the memory allocator):&lt;br /&gt;&lt;pre&gt;from sys import argv

class K(object):
    def __init__(self, x):
        self.x = x
    def __eq__(self, other):
        return self.x == other.x
    def __hash__(self):
        return 1
        #return self.x

def test(n):
    d = {}
    for i in xrange(n):
        d[K(i)] = i

n = int(argv[1]) if len(argv) == 2 else 10
test(n)
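# --- Illustrative sketch, not part of the timed test above ---
# It counts how many times __eq__ runs while a dict is filled, to make
# the cost of the constant hash visible. CountingKey and count_eq_calls
# are made-up names; nothing below is called here, so the timings of
# the test above are unaffected.
class CountingKey(object):
    eq_calls = 0  # shared counter, reset by count_eq_calls()

    def __init__(self, x, constant_hash):
        self.x = x
        self.constant_hash = constant_hash

    def __eq__(self, other):
        CountingKey.eq_calls += 1
        return self.x == other.x

    def __hash__(self):
        # 1 for every key (all collisions) or the own value of the key
        return 1 if self.constant_hash else self.x

def count_eq_calls(n, constant_hash):
    CountingKey.eq_calls = 0
    d = {}
    for i in range(n):
        d[CountingKey(i, constant_hash)] = i
    return CountingKey.eq_calls

# In CPython count_eq_calls(n, False) should stay at zero, while
# count_eq_calls(n, True) grows roughly as n * n / 2, because each new
# key is compared with the keys that are already stored.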
&lt;/pre&gt;&lt;br /&gt;Equivalent D1 code:&lt;br /&gt;&lt;pre&gt;import std.conv: toInt;

class K {
    int x;
    this(int xx) {
        this.x = xx;
    }
    int opEquals(Object other) {
        return this.x == (cast(K)other).x;
    }
    int opCmp(Object other) {
        int other_x = (cast(K)other).x;
        return (this.x == other_x) ? 0 : (this.x - other_x);
    }
    hash_t toHash() {
        return 1;
        //return cast(hash_t)this.x;
    }
}

void test(int n) {
    int[K] d;
    for (int i; i &amp;lt; n; i++)
        d[new K(i)] = i;
}

void main(string[] args) {
    int n = args.length == 2 ? toInt(args[1]) : 10;
    test(n);
}
&lt;/pre&gt;&lt;br /&gt;I have also tried the same D1 code using a struct, passed by value (not allocated on the heap; well, allocated inside the associative array itself), instead of objects:&lt;br /&gt;&lt;pre&gt;import std.conv: toInt;

struct K {
    int x;
    int opEquals(K other) {
        return this.x == other.x;
    }
    int opCmp(K other) {
        return (this.x == other.x) ? 0 : (this.x - other.x);
    }
    hash_t toHash() {
        return 1;
        //return cast(hash_t)this.x;
    }
}

void test(int n) {
    int[K] d;
    for (int i; i &amp;lt; n; i++)
        d[K(i)] = i;
}

void main(string[] args) {
    int n = args.length == 2 ? toInt(args[1]) : 10;
    test(n);
}
&lt;/pre&gt;As you can see the hash methods always return 1, so every item is a hash collision.&lt;br /&gt;&lt;br /&gt;The timings:&lt;br /&gt;&lt;pre&gt;Timings with all collisions (hash=1):
  N =       1000  2000  3000  4000  10_000  100_000
  Python:   0.53  1.27  2.45  4.30  30.27
  D class:  0.06  0.09  0.12  0.17   1.07
  D struct: 0.06  0.09  0.09  0.11   0.50    44.26

With no collisions (hash=this.x):
  N =       1000  2000  3000  4000  10_000  100_000  1_000_000
  Python:   0.28  0.28  0.28  0.28   0.28     0.53    6.97
  D class:  0.05  0.05  0.05  0.05   0.06     0.12    1.01
  D struct: 0.05  0.05  0.05  0.05   0.06     0.07    0.39
&lt;/pre&gt;The timings don&apos;t show the clean growth factors I was talking about (probably also because of memory allocations), but you can clearly see how much faster the timings grow in Python compared to D when there are many collisions.</description>
  <comments>http://leonardo-m.livejournal.com/83694.html</comments>
  <category>programming</category>
  <category>d language</category>
  <category>python</category>
  <lj:security>public</lj:security>
  <lj:reply-count>3</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/83309.html</guid>
  <pubDate>Tue, 16 Jun 2009 22:27:07 GMT</pubDate>
  <title>The Case for D - the other side of the coin</title>
  <link>http://leonardo-m.livejournal.com/83309.html</link>
  <description>Andrei Alexandrescu has written a nice article, &quot;The Case for D&quot; (click on &apos;Print&apos; to read it on a single page):&lt;br /&gt;&lt;a href=&quot;http://www.ddj.com/hpc-high-performance-computing/217801225&quot;&gt;http://www.ddj.com/hpc-high-performance-computing/217801225&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;D1 is a very nice language, and I use it often, but this article dwells too much on the good sides of the D2 language and its compilers, focusing on what they may do in the future and ignoring their numerous current downsides and problems. Giving false expectations to possible new D users is dangerous. I think that giving a more balanced account of the current situation is better, even if in future most of the current D problems may be fixed.&lt;br /&gt;&lt;br /&gt;A good article must show the current troubles of the language too, and not just talk about good implementations that may be found years from now. At the moment Java is a very fast language, the compiler helps the programmer avoid many bug-prone situations, and the toolchain is very good. But at the beginning Java was really slow and of limited usefulness; it was little more than a toy.&lt;br /&gt;&lt;br /&gt;This post isn&apos;t a list of all the faults I see in the D language, it&apos;s a list of comments about the article by Andrei Alexandrescu.&lt;br /&gt;&lt;br /&gt;From the article:&lt;br /&gt;&lt;br /&gt;&amp;gt;In the process, the language&apos;s complexity has increased, which is in fact a good indicator because no language in actual use has ever gotten smaller.&amp;lt;&lt;br /&gt;&lt;br /&gt;The D2 language is more complex than D1, and even if each thing added to D may have its justification, the C++ language clearly shows that too much complexity is bad. 
So higher complexity is not a good indicator.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;Other implementations are underway, notably including an a .NET port and one using the LLVM infrastructure as backend.&amp;lt;&lt;br /&gt;&lt;br /&gt;The LDC compiler (with LLVM backend) is already usable on Linux to compile D1 code with the Tango standard lib (but it lacks the built-in profiler). On Windows LLVM lacks exception support, so it can&apos;t be used yet.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;D could be best described as a high-level systems programming language.&amp;lt;&lt;br /&gt;&lt;br /&gt;It may be quite hard to think about using D to write something like the Linux kernel, or to write code for little embedded systems. Compiled D programs are too big for embedded systems with a few kilobytes of RAM, and the D language relies too much on the GC (even if it can be switched off, etc) to be a good tool for writing a real-world kernel.&lt;br /&gt;&lt;br /&gt;So D is currently more of a systems-programming-like language. 
It is a multi-level language that can be used to write code quite close to the &apos;metal&apos; or to write high-level generic code too.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;It encompasses features that are normally found in higher-level and even scripting languages -- such as a rapid edit-run cycle,&amp;lt;&lt;br /&gt;&lt;br /&gt;Because D programs are made of compiled modules, their edit-run cycle can be as fast as in other languages like C# and Java.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;In fact, D can link and call C functions directly with no intervening translation layer.&amp;lt;&lt;br /&gt;&lt;br /&gt;On Windows you usually have to compile the C code with DMC to do this.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;However, you&apos;d very rarely feel compelled to go that low because D&apos;s own facilities are often more powerful, safer, and just as efficient.&amp;lt;&lt;br /&gt;&lt;br /&gt;In practice there are currently situations where using C-style code can lead to higher performance in D1 (especially if you use the DMD compiler instead of the LDC one).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;support for documentation and unit testing is built-in.&amp;lt;&lt;br /&gt;&lt;br /&gt;Such things are very handy and nice. But the current built-in support for documentation has many bugs, and the built-in unit testing is very primitive and limited: for example tests have no name, they just contain normal code and assert(), and the run stops as soon as the first assert fails.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;return printf(&quot;hello, world\n&quot;) &amp;lt; 0;&lt;br /&gt;&lt;br /&gt;This may be more correct C:&lt;br /&gt;&lt;br /&gt;if (printf(&quot;hello, world\n&quot;) &amp;gt;= 0)&lt;br /&gt;    return EXIT_SUCCESS;&lt;br /&gt;else&lt;br /&gt;    return EXIT_FAILURE;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;(and T!(X) or simply T!X for T&amp;lt;X&amp;gt;)&amp;lt;&lt;br /&gt;&lt;br /&gt;In D1 the T!X syntax isn&apos;t supported. 
In D2 there&apos;s another rule: you can&apos;t write:&lt;br /&gt;T!(U!(X))&lt;br /&gt;As:&lt;br /&gt;T!U!X&lt;br /&gt;This is an example where things are more complex in D2 just to save two chars.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;D&apos;s unit of compilation, protection, and modularity is the file. The unit of packaging is a directory.&amp;lt;&lt;br /&gt;&lt;br /&gt;The D module system is nice and handy, but it currently has several bugs, and it has some semantic holes.&lt;br /&gt;&lt;br /&gt;The impression it leaves on the programmer is that its design was started well, but then the development of that design stopped mid-course, leaving some of its functionality half-finished.&lt;br /&gt;&lt;br /&gt;For example if you import the module &apos;foo&apos;, the current namespace receives not just the &apos;foo&apos; name itself, but also all the names contained in &apos;foo&apos;. This is silly.&lt;br /&gt;&lt;br /&gt;There are also troubles with circular import semantics, package semantics, and safety (there is no explicit syntax to import all names from a module: that&apos;s the default behaviour, and this is bad).&lt;br /&gt;&lt;br /&gt;Another downside is that none of the current D compilers is able to follow the module tree by itself to compile code, so you need to tell the compiler all the modules you need to compile, even if such information is already fully present in the code itself. 
There are several tools that try to patch this basic functionality hole (very big programs need more complex building strategies, but experience shows me that most small D programs can be fine with that automatic compilation model).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;* One, the language&apos;s grammar allows separate and highly optimized lexing, parsing, and analysis steps.&amp;lt;&lt;br /&gt;&lt;br /&gt;This also has the downside that it limits the possible syntax that can be used in the language, for example it makes this code impossible:&lt;br /&gt;foreach (i, item in items)&lt;br /&gt;This forces the language to use the following, which is a bit less readable and a little more bug-prone:&lt;br /&gt;foreach (i, item; items)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;* Three, Walter Bright, the creator and original implementor of D, is an inveterate expert in optimization. &amp;lt;&lt;br /&gt;&lt;br /&gt;This is probably true, but despite this the DMD backend doesn&apos;t produce very efficient code. LDC (LLVM backend) is generally much better at this.&lt;br /&gt;Update1, Jun 17 2009: DMD (especially DMD D1) is faster than LDC in compiling code.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;Other procedural and object-oriented languages made only little improvements,&amp;lt;&lt;br /&gt;&lt;br /&gt;Untrue, see Clojure and Scala. 
Hopefully D will do as well or better.&lt;br /&gt;Update1, Jun 17 2009: both Clojure and Scala run on the JVM, so the situation is different.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;a state of affairs that marked a recrudescence of functional languages&amp;lt;&lt;br /&gt;&lt;br /&gt;Some other people may talk about a renaissance, instead :-)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;SafeD is focussed only on eliminating memory corruption possibilities.&amp;lt;&lt;br /&gt;&lt;br /&gt;It may be better to add other kinds of safety to such SafeD modules.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;That makes Java and C# code remarkably easy to port into a working D implementation.&amp;lt;&lt;br /&gt;&lt;br /&gt;It&apos;s indeed quite easy to port C/Java code to D. But translating C headers to D may require some work. And currently the D garbage collector is much less efficient than the common Java ones, so D requires code that allocates less often.&lt;br /&gt;Update1, Jun 17 2009: there are tools that help convert C headers to D. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;such as an explicit override keyword to avoid accidental overriding,&amp;lt;&lt;br /&gt;&lt;br /&gt;It&apos;s optional.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;and a technique I can&apos;t mention because it&apos;s trademarked, so let&apos;s call it contract programming.&amp;lt;&lt;br /&gt;&lt;br /&gt;It&apos;s built into the language. It&apos;s not implemented in a very complete way, but it may be enough if you aren&apos;t used to Eiffel.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;The implementation now takes O(n) time, and tail call optimization (which D implements) takes care of the space complexity.&amp;lt;&lt;br /&gt;&lt;br /&gt;At the moment only the LDC compiler (a D1 compiler) is able to perform tail-call elimination (and probably only in simple situations. 
But probably as LLVM improves, LDC will improve).&lt;br /&gt;Update1, Jun 17 2009: I was wrong, DMD is able to tail-call optimize if the situation is simple.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;iron-clad functional purity guarantees, and comfortable implementation when iteration is the preferred method. If that&apos;s not cool, I don&apos;t know what is.&amp;lt;&lt;br /&gt;&lt;br /&gt;At the moment calls to pure functions aren&apos;t moved out of loops. There can be problems if the pure function generates an out of memory exception, or if a change in the floating point rounding mode is involved.&lt;br /&gt;&lt;br /&gt;Functional programming juggles a lot of immutable data, and this puts the garbage collector under high pressure. Currently the D GC isn&apos;t efficient enough for such quick cycles of memory allocation, so it&apos;s not yet well suited to functional-style programming (or to a Java-style object-oriented style of programming that allocates very frequently).&lt;br /&gt;&lt;br /&gt;All this isn&apos;t meant to discourage you from using the D1/D2 languages.&lt;br /&gt;&lt;br /&gt;-------------------------------&lt;br /&gt;&lt;br /&gt;Update1, Jun 17 2009:&lt;br /&gt;See also the discussion on Reddit:&lt;br /&gt;&lt;a href=&quot;http://www.reddit.com/r/programming/comments/8t7s1/the_case_for_d_the_other_side_of_the_coin/&quot;&gt;http://www.reddit.com/r/programming/comments/8t7s1/the_case_for_d_the_other_side_of_the_coin/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Answers to the received comments:&lt;br /&gt;&lt;br /&gt;Thank you Anonymous for your many comments. I&apos;ll fix the blog post where I see it&apos;s necessary. Your comments will help me a lot in improving my blog post.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;For exception support, it&apos;s more C++&apos;s LLVM and Windows SEH issue, to get it right.&amp;lt;&lt;br /&gt;&lt;br /&gt;Eventually LLVM/Clang developers will support exceptions on Windows. 
Several things tell me that LDC will be a good compiler.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;As for profiler, I believe you can compile to LLVM bytecode and profile that by LLVM tools, but well, it&apos;s ugly.&amp;lt;&lt;br /&gt;&lt;br /&gt;Some things are already possible (I am trying KCachegrind now), but DMD is much handier: you can just add a &quot;-profile&quot; and it just works. (The code coverage of DMD is handy too, but it doesn&apos;t work on some bigger programs of mine). Walter has said more than once that having easy to use tools helps people use them more often.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;but what we actually want are just more tools and more mature tools.&amp;lt;&lt;br /&gt;&lt;br /&gt;Command-line features like the DMD profiler are enough for me in many situations.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;Well, there is actually microkernel OS in D around:&amp;lt;&lt;br /&gt;&lt;br /&gt;I know, but I have read a half-serious proposal to create another compiler to compile the Linux kernel because GCC isn&apos;t very fit for this purpose. 
So I guess D compilers may be even less fit for that purpose.&lt;br /&gt;On the other hand Microsoft is trying to use a modified C# to write an OS (and they say the extra safety offered by C# allows avoiding some checks in the code, and this ends up creating globally efficient enough code), so it may be doable in D too.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;D programs are somewhat bigger minimal C apps (and esp., compiled by LLVM LDC) because of 3 things:&amp;lt;&lt;br /&gt;&lt;br /&gt;A GC can&apos;t be avoided, but maybe it&apos;s possible to keep it outside, dynamically linked.&lt;br /&gt;The runtime contains unicode management, associative arrays, dynamic arrays and more, but it may be possible to strip away some of such things when not used.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;(as a example of such multi-level language, but I&apos;d like to see OMeta-like stuff for D better).&amp;lt;&lt;br /&gt;&lt;br /&gt;OMeta is the future :-)&lt;br /&gt;See also Pymeta, Meta for Python:&lt;br /&gt;&lt;a href=&quot;http://washort.twistedmatrix.com/&quot;&gt;http://washort.twistedmatrix.com/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;Exactly, but you always can reimpement your wheels (read: modules/packages via classes, and some design pattern around that), and feed them thru CTFE/mixins.&amp;lt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I&apos;d like the built-in unittest system to be a bit more powerful. You can of course re-implement it outside the language, but then it&apos;s better to remove the built-in unittest features. Keeping both is not good.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;That&apos;s actually matter not compiler itself, but your build system.&amp;lt;&lt;br /&gt;&lt;br /&gt;The DMD compiler already has built-in things that are beyond the purposes of a normal compiler. 
Adding this automatic build feature isn&apos;t essential but it&apos;s handy and positive.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;Hey, that &apos;item in items&apos; stuff is not D semantic, and has nothing to with compiler itself.&amp;lt;&lt;br /&gt;&lt;br /&gt;The D compiler is designed in several separate layers. So it seems that to change the syntax by adding an &quot;in&quot; inside the foreach you have to add some feedback between layers, and this is seen as bad for the compiler (and probably Walter is right here).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;public/private import?&amp;lt;&lt;br /&gt;&lt;br /&gt;Imports are already private by default now in D. The problems here are much bigger.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;new instaneous dee0xd&amp;lt;&lt;br /&gt;&lt;br /&gt;Never seen that before.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;Arguable: dmd still compiles faster, and binary sizes are smaller. LLVM optimizations are much more promising, though.&amp;lt;&lt;br /&gt;&lt;br /&gt;In most of my benchmarks LDC produces programs that are faster or much faster. DMD indeed compiles faster (the D2 version of DMD is a bit slower). Binaries produced by LDC are sometimes bigger, but the LDC developers are working on this, and most times the size is similar.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;Somewhat different playgrounds here: JVM-based or self-hosted.&amp;lt;&lt;br /&gt;&lt;br /&gt;You are right, the situation is different. But I think you can implement Clojure multiprocessing ideas even without a VM.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;Just stub your own GC in. There are different GC strategies after all, why to hope &apos;one size fitts all&apos; on every cases?&amp;lt;&lt;br /&gt;&lt;br /&gt;Indeed, JavaVM ships with more than one GC to fulfill different purposes.&lt;br /&gt;My own GC is probably going to be worse than the current built-in one. I am not able to write a GC as good as the JavaVM ones. 
So this suggestion doesn&apos;t work for me.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;Java GC&apos;s was much worse than Oberon&apos;s btw, when it just appeared.&amp;lt;&lt;br /&gt;&lt;br /&gt;Java at the beginning was WAY worse, I know, I have stated this at the beginning of my blog post.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;And if you have many of &apos;quick cycles of memory allocation&apos;, something is wrong with your memory allocator. It&apos;s not better when you have lotso manual malloc/free, its better when you have memory pools, arenas, zones, and right allocation (or GC) strategy, which fits better for you app.&amp;lt;&lt;br /&gt;&lt;br /&gt;If you look at most Java programs you can often see many small objects allocated in loops.&lt;br /&gt;In the same way, in functional-style languages/programs you can see a lot of immutable data structures that are created and collected away all the time. From my benchmarks I think the current D GC isn&apos;t fit for such kinds of code.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;So I believe we can&apos;t rely on one single GC for all use cases, but we need lotso strategies and pluggable GC&apos;s for different uses cases and different strategies.&amp;lt;&lt;br /&gt;&lt;br /&gt;I agree, but probably 2-3 GCs (built-in and switchable at compile time) can be enough for most D purposes. I am sure there are many ways to improve the current D GC (for example having a type system able to tell apart GC-managed pointers, and a hybrid moving/conservative GC that pins down manually managed memory, and moves and compacts all the other memory). My purpose was just to show and talk about the current situation.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;gt;That shouldn&apos;t stop you in any way from using D&amp;lt;&lt;br /&gt;&lt;br /&gt;Of course. 
I don&apos;t waste hours of my time commenting about a language I don&apos;t like to program with :-)&lt;br /&gt;D is my second preferred language (after Python), I like it and I have written a lot of D code :-)&lt;br /&gt;&lt;br /&gt;Thank you again for all your comments, as you see I agree with most of the things you have written here.</description>
  <comments>http://leonardo-m.livejournal.com/83309.html</comments>
  <category>programming</category>
  <category>d language</category>
  <lj:security>public</lj:security>
  <lj:reply-count>18</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/82967.html</guid>
  <pubDate>Sun, 07 Jun 2009 16:09:39 GMT</pubDate>
  <title>fpaq0 - update</title>
  <link>http://leonardo-m.livejournal.com/82967.html</link>
  <description>The ldc D compiler (&lt;a href=&quot;http://www.dsource.org/projects/ldc&quot;&gt;http://www.dsource.org/projects/ldc&lt;/a&gt; ) is based on the LLVM backend and it&apos;s very good. LLVM is still a bit young compared to GCC, and it&apos;s often possible to find code where GCC produces faster code, but the difference is usually within 5-20%, so in practice it&apos;s acceptable in all situations where performance isn&apos;t critical (and where it is critical you may want to use something like the Intel compiler or inline asm anyway).&lt;br /&gt;&lt;br /&gt;So I have used ldc to recompile fpaq0, a simple order-0 arithmetic file compressor for stationary sources by Matt Mahoney:&lt;br /&gt;&lt;a href=&quot;http://cs.fit.edu/%7Emmahoney/compression/fpaq0.cpp&quot;&gt;http://cs.fit.edu/%7Emmahoney/compression/fpaq0.cpp&lt;/a&gt;&lt;br /&gt;I have translated it to D, and I have also added a version of the imports for the D Tango standard library. As input file I use a large purely ASCII text of 6_500_314 bytes, mixed English texts (originally by Peter Norvig, cleaned up a bit).&lt;br /&gt;&lt;br /&gt;The D code and more details about the timings:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/so/fpaq0.zip&quot;&gt;http://www.fantascienza.net/leonardo/so/fpaq0.zip&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The D timing results are very good, better than or equal to those of the C++ code compiled with G++. In both the D and C++ versions I have also used putc_unlocked/getc_unlocked to increase I/O performance (I&apos;d like to see such functions in Tango).&lt;br /&gt;&lt;pre&gt;Timings Windows, seconds:
  DMD:        3.01  (100%)
  LLVM-G++ C: 1.78  ( 59%) (100%)
  LLVM-G++ A: 1.73  ( 57%) ( 97%) (100%)
  G++ C:      1.41  ( 47%) ( 79%) ( 81%)
  G++ A:      1.35  ( 45%) ( 76%) ( 78%)

Timings Pubuntu, seconds:
  G++ A:      3.53
  LDC:        1.92 (100%)
  G++ B:      1.55 ( 81%)
  G++ B:      1.48 ( 78%) putc_unlocked/getc_unlocked
  LDC:        1.41        putc_unlocked/getc_unlocked 
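&lt;/pre&gt;The percentages in the first column of the Windows table are consistent with each time divided by the DMD time, then rounded. A minimal Python sketch of that arithmetic, assuming this is how they were obtained (relative_percent is a hypothetical helper name, not something from the benchmark archive):&lt;br /&gt;&lt;pre&gt;
```python
def relative_percent(seconds, baseline):
    """Return a run time as a rounded percentage of a baseline run time."""
    return round(100.0 * seconds / baseline)

# Windows timings copied from the table above (seconds).
timings = [
    ("DMD", 3.01),
    ("LLVM-G++ C", 1.78),
    ("LLVM-G++ A", 1.73),
    ("G++ C", 1.41),
    ("G++ A", 1.35),
]

# The first row (DMD) is the 100% reference of the first percentage column.
baseline = timings[0][1]
for name, seconds in timings:
    percent = relative_percent(seconds, baseline)
    print("%-12s %.2f  (%3d%%)" % (name + ":", seconds, percent))
```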
&lt;/pre&gt;&lt;strong&gt;Compilers used:&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;On WindowsXp:&lt;br /&gt;DMD Digital Mars D Compiler v1.042&lt;br /&gt;gcc version 4.3.3-dw2-tdm-1 (GCC)&lt;br /&gt;gcc version 4.2.1 (Based on Apple Inc. build 5636) (LLVM build)&lt;br /&gt;&lt;br /&gt;On Pubuntu:&lt;br /&gt;gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4)&lt;br /&gt;ldc based on DMD v1.045 and llvm 2.6svn (Sun Jun  7 14:18:55 2009)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;dmd used with:&lt;br /&gt;dmd -O -release -inline&lt;br /&gt;&lt;br /&gt;ldc used with:&lt;br /&gt;ldc -O5 -release -inline&lt;br /&gt;&lt;br /&gt;g++ A used with:&lt;br /&gt;g++ -O3 -s&lt;br /&gt;&lt;br /&gt;g++ B used with:&lt;br /&gt;g++ -Wall -O3 -s -fomit-frame-pointer -msse3 -march=native&lt;br /&gt;&lt;br /&gt;g++/llvm-gcc C used with:&lt;br /&gt;g++ -Wall -O3 -s -fomit-frame-pointer -msse3 -march=core2&lt;br /&gt;llvm-g++ -Wall -O3 -s -fomit-frame-pointer -msse3 -march=core2</description>
  <comments>http://leonardo-m.livejournal.com/82967.html</comments>
  <category>dmd</category>
  <category>g++</category>
  <category>programming</category>
  <category>ldc</category>
  <category>benchmark</category>
  <category>d language</category>
  <lj:security>public</lj:security>
  <lj:reply-count>2</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/82803.html</guid>
  <pubDate>Wed, 03 Jun 2009 21:07:54 GMT</pubDate>
  <title>SciMark 2.0 in C and D</title>
  <link>http://leonardo-m.livejournal.com/82803.html</link>
  <description>SciMark 2.0 is a Java benchmark for scientific and numerical computing, developed by Roldan Pozo and Bruce R Miller:&lt;br /&gt;&lt;a href=&quot;http://math.nist.gov/scimark2/index.html&quot;&gt;http://math.nist.gov/scimark2/index.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;On my PC (a Core2 at 2 GHz) the applet, run locally, scores 388. (The site shows that some people have scored 7300, I don&apos;t know how).&lt;br /&gt;&lt;br /&gt;The site also offers C code, which I have run and then ported to D for benchmarking (it is also possible to port the Java code to D, but probably the C code leads to higher performance). (Update1: I have converted the Java code to D too).&lt;br /&gt;&lt;br /&gt;Here you can find the C and D code (for the Phobos and Tango standard libraries), plus all the detailed results of the benchmarks:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/js/scimark2.zip&quot;&gt;http://www.fantascienza.net/leonardo/js/scimark2.zip&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Thanks to the latest LLVM backend the results of LDC are quite good, better than C compiled with GCC 4.2.1.&lt;br /&gt;&lt;br /&gt;I have tested LDC on the same PC using Pubuntu, that is a &quot;Portable Ubuntu&quot;, which gives slightly lower performance than a native Ubuntu running on the same CPU.&lt;br /&gt;&lt;pre&gt;Timings on WinXp 32 bit, 2.00 seconds min time:
               small   large
  D DMD:         299     207
  D DMD:         401     268 (OOP version)
  C GCC:         578     314
  C LLVM-GCC:    609     351

Timings on Pubuntu 32 bit, 2.00 seconds min time:
               small   large
  D LDC:         462     272 (OOP version)
  C GCC:         568     315
  D LDC:         575     327

CPU: Core2 at 2 GHz, 2 GB RAM.

Compilers used on WindowsXp:
  gcc version 4.3.3-dw2-tdm-1 (GCC)
  gcc version 4.2.1 (Based on Apple Inc. build 5636) (LLVM build)

Compilation arguments on WindowsXp:
  dmd -O -release -inline scimark2_d_phobos.d
  gcc -O3 -s -fomit-frame-pointer -msse3 -march=core2 scimark2_c.c -o scimark2_c_gcc
  llvm-gcc -O3 -s -fomit-frame-pointer -msse3 -march=core2 scimark2_c.c -o scimark2_c_llvm


Compilers used on Pubuntu:
  gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4)
  LLVM D Compiler rev. based on DMD v1.045 and llvm 2.6svn (Mon Jun  1 22:54:33 2009)

Compilation arguments on Pubuntu:
  ldc -O5 -release -inline scimark2_d.d
  gcc -O3 -s -fomit-frame-pointer -msse3 -march=native -lm scimark2_c.c -o scimark2_c
&lt;/pre&gt;&lt;br /&gt;-----------------&lt;br /&gt;&lt;br /&gt;Update 1, Jun 5 2009: added OOP D version translated from the Java code (the code is a bit raw, it&apos;s not an example of good code), plus its timings. The class-based version is slower with LDC but faster with DMD, I don&apos;t know why.&lt;br /&gt;&lt;br /&gt;-----------------&lt;br /&gt;&lt;br /&gt;Update Jul 17 2009:&lt;br /&gt;&lt;br /&gt;I have used an updated LDC, based on DMD v1.045 and llvm 2.6svn (Mon Jul 13 06:47:53 2009).&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
Scores, without LTO:
Timings on Pubuntu 32 bit, 2.00 seconds min time:
  C GCC:        570     317
  D LDC:        570     326 (OOP version)
  D LDC:        595     317
&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Then I have tried link-time optimization (LTO), which LDC doesn&apos;t perform by itself yet. To activate it I have used:&lt;br /&gt;&lt;pre&gt;
ldc -O5 -release -inline -output-bc scimark2_d.d

opt -std-compile-opts scimark2_d.bc &amp;gt; sci.bc

llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread sci.bc


Scores, with LTO:
Timings on Pubuntu 32 bit, 2.00 seconds min time:
  C GCC:        570     317
  D LDC:        582     325 (OOP version)
  D LDC:        626     329


With LTO:
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:          625.55
FFT             Mflops:   514.53    (N=1024)
SOR             Mflops:   573.50    (100 x 100)
MonteCarlo:     Mflops:   305.04
Sparse matmult  Mflops:   672.16    (N=1000, nz=5000)
LU              Mflops:  1062.52    (M=100, N=100)
&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;====================================================&lt;br /&gt;&lt;br /&gt;Another update Jul 17 2009:&lt;br /&gt;&lt;br /&gt;I have found a way to internalize the main correctly. This allows more optimizations during the LTO phase, and the new score is 680!&lt;br /&gt;&lt;br /&gt;Compile the code with:&lt;br /&gt;ldc -O5 -release -inline -output-bc scimark2_d.d&lt;br /&gt;&lt;br /&gt;opt -std-compile-opts scimark2_d.bc &amp;gt; scimark2_d_opt.bc&lt;br /&gt;&lt;br /&gt;llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=scimark2_d scimark2_d_opt.bc &lt;br /&gt;&lt;pre&gt;
Scores, with LTO + internalize:
Timings on Pubuntu 32 bit, 2.00 seconds min time:
  D LDC:        612     348 (OOP version)
  D LDC:        680     369

**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:          680.44
FFT             Mflops:   520.67    (N=1024)
SOR             Mflops:   573.50    (100 x 100)
MonteCarlo:     Mflops:   568.12
Sparse matmult  Mflops:   677.37    (N=1000, nz=5000)
LU              Mflops:  1062.52    (M=100, N=100)
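&lt;/pre&gt;&lt;br /&gt;In both outputs above the composite score equals the arithmetic mean of the five kernel Mflops values. A minimal Python sketch that checks this on the printed numbers (composite_score is a hypothetical helper name, not part of the SciMark sources):&lt;br /&gt;&lt;pre&gt;
```python
def composite_score(kernel_mflops):
    """Arithmetic mean of the per-kernel Mflops values."""
    return sum(kernel_mflops) / len(kernel_mflops)

# Kernel scores copied from the two outputs above, in the order
# FFT, SOR, MonteCarlo, Sparse matmult, LU.
with_lto = [514.53, 573.50, 305.04, 672.16, 1062.52]
with_lto_internalize = [520.67, 573.50, 568.12, 677.37, 1062.52]

print("%.2f" % composite_score(with_lto))              # 625.55
print("%.2f" % composite_score(with_lto_internalize))  # 680.44
```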
&lt;/pre&gt;</description>
  <comments>http://leonardo-m.livejournal.com/82803.html</comments>
  <category>llvm-gcc</category>
  <category>gcc</category>
  <category>ldc</category>
  <category>benchmark</category>
  <category>c</category>
  <category>d language</category>
  <category>java</category>
  <lj:security>public</lj:security>
  <lj:reply-count>4</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://leonardo-m.livejournal.com/82339.html</guid>
  <pubDate>Wed, 13 May 2009 16:15:18 GMT</pubDate>
  <title>List deletions problem</title>
  <link>http://leonardo-m.livejournal.com/82339.html</link>
  <description>Once in a while in my programs I need to keep an ordered sequence of items, process them all iteratively, and during each iteration remove a few of them that aren&apos;t good anymore. So in each iteration only a small number of items is removed, and such removals happen in random positions.&lt;br /&gt;&lt;br /&gt;I have written a small paper about how to solve this common programming problem in an efficient way, in Python+Psyco, D and C:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/ar/list_deletions.html&quot;&gt;http://www.fantascienza.net/leonardo/ar/list_deletions.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The code of the paper:&lt;br /&gt;&lt;a href=&quot;http://www.fantascienza.net/leonardo/js/list_deletions.zip&quot;&gt;http://www.fantascienza.net/leonardo/js/list_deletions.zip&lt;/a&gt;</description>
  <comments>http://leonardo-m.livejournal.com/82339.html</comments>
  <category>psyco</category>
  <category>programming</category>
  <category>c</category>
  <category>d language</category>
  <category>python</category>
  <category>benchmarks</category>
  <lj:security>public</lj:security>
  <lj:reply-count>2</lj:reply-count>
</item>
</channel>
</rss>
