leonardo ([info]leonardo_m) wrote,
@ 2009-09-19 17:05:00
Entry tags: benchmark, d language, dmd, java, ldc, programming

Code performance in D/Java
The code discussed in this post:
http://www.fantascienza.net/leonardo/js/life_bench.zip


To try to find "performance bugs" in both the LDC D compiler and the LLVM back-end I am exploring the performance signature of many small programs, often translating them to D. To perform such tests I compare the timings of the D code with the original C or Java code.

Java is useful here because its performance profile differs a little from the usual C one. Java HotSpot is able to inline many virtual calls, and its garbage collector is quite efficient (no current D implementation does either of these well).

I've found a small Java implementation of Life (John Horton Conway's game) that performs better than the equivalent D code (the original Java code isn't mine). I cleaned up the Java code, removed the useless parts (see the zip for the Java code), and translated it to D code that stays as close as possible to the original (and that runs on both Tango and Phobos). The result is the first D program (life1_d.d).

I have taken care to mark the main class as final in the D code, to allow inlining.

The Sun Java GC is more efficient than the non-moving D GC, so first of all I looked at the number of memory allocations, but they weren't the cause of the performance difference. I've profiled both the D and Java code, and I've seen that the calc_new() method was the one taking the most time. I've also seen that the amount of inlining done by LDC (using -O5 -release -inline) was not enough, so I've compiled the D code with the following command, which forces more aggressive inlining and improves performance:

ldc -O5 -release -inline -inline-threshold=2000000001 life1_d.d

I have also seen that for this program Link-Time Optimization plus internalization (the -internalize-public-api-list option below) improves the performance of the code:

ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1_d.d

opt -std-compile-opts life1_d.bc > life1_do.bc

llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1_do life1_do.bc

I have done many more experiments; here you can find the ones that have given good results. In life2_d.d I have split the Life class into its methods plus global values. I don't know why this increases performance with the LDC compiler (see the timings).

Later I have used simple pointers as function arguments instead of arrays (see life3_d.d). Again I don't know why this increases the performance with the LDC compiler, especially because most of such functions get inlined anyway.

Now the performance of life3_d.d was bad only for large sizes (the last values of the use_Sizes array). After several more experiments I found by chance that the cause was the inner loop of the calc_new() function, where the whole botRow array is reset to zero. I think the Java compiler recognizes this loop as a clearing operation and replaces it with a call to a function like memset(). This makes the Java code slower than the D code for small botRow arrays, but faster for longer ones (it seems an inlined loop is faster than a memset when the length is small, up to about 50-100 items if the array items are 4 bytes long).

So in life4_d.d I have replaced the loop with something a little more complex that takes a look at botRow length:

if (botRow.length > 100)
    botRow[] = 0;
else
    for (int c = 0; c < botRow.length; c++)
        botRow[c] = 0;

The timings of the 4th D version are now good enough, but still not the best, and I don't know why. (I am now trying to find a faster array reset that uses an asm routine containing the movntps SSE instruction).
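
A rough way to look for the crossover between an explicit clearing loop and a bulk fill is to time both on arrays of different lengths. The following is only a minimal sketch on the Java side, not code from the zip: the class name ClearBench, the array lengths and the repetition count are all assumptions chosen for illustration, and a micro-benchmark like this is easily distorted by the JIT. If the guess about HotSpot is right, the two columns may come out close, because the loop itself could be compiled into a bulk fill.

```java
import java.util.Arrays;

public class ClearBench {
    // Clears the array with an explicit loop, like the loop in calc_new().
    static void clearLoop(int[] row) {
        for (int c = 0; c < row.length; c++)
            row[c] = 0;
    }

    // Clears the array with the library bulk fill.
    static void clearFill(int[] row) {
        Arrays.fill(row, 0);
    }

    // Returns the average nanoseconds per clear over `reps` repetitions.
    static double time(int[] row, int reps, boolean useLoop) {
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            row[0] = 1; // dirty one item so the array is not already all zero
            if (useLoop)
                clearLoop(row);
            else
                clearFill(row);
        }
        return (System.nanoTime() - start) / (double) reps;
    }

    public static void main(String[] args) {
        // Hypothetical row lengths, not the exact botRow lengths of the benchmark.
        int[] lengths = {5, 10, 25, 50, 100, 250, 1000, 2500};
        int reps = 50000; // assumed repetition count
        for (int n : lengths) {
            int[] row = new int[n];
            time(row, reps, true);  // warm-up runs, results discarded
            time(row, reps, false);
            double loopNs = time(row, reps, true);
            double fillNs = time(row, reps, false);
            System.out.printf("%5d  loop %.1f ns  fill %.1f ns%n", n, loopNs, fillNs);
        }
    }
}
```

The same structure can be translated to D to compare the explicit loop with botRow[] = 0 and to tune the 100-item threshold used in life4_d.d.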

------------------------

Scores on Windows Vista, using DMD compiler for the D code (bigger is better):
java -server Life
Size    average
Adjusting 6744 to 2811246
5       14288
6       14974
8       14125
10      15697
15      14175
25      15369
50      14595
250     6331
1000    2218
2500    880


life1_d.exe
Size    average
5       9248
6       8938
8       10213
10      10469
15      10974
25      9898
50      7189
250     1694
1000    421
2500    170


life2_d.exe
Size    average
5       9424
6       8270
8       8083
10      7876
15      8389
25      7191
50      5288
250     1331
1000    331
2500    147


life3_d.exe
Size    average
5       10156
6       9172
8       9353
10      9043
15      8802
25      8452
50      5865
250     1357
1000    356
2500    151


life4_d.exe
Size    average
Adjusting 917315 to 0
5       10145
6       9275
8       9631
10      10282
15      10061
25      9083
50      6286
250     4747
1000    1596
2500    743

Windows Vista Home Basic with 2 GB RAM, Celeron 2.13 GHz

Compilers used:

Java version "1.7.0-ea"
Java(TM) SE Runtime Environment (build 1.7.0-ea-b66)
Java HotSpot(TM) Client VM (build 16.0-b06, mixed mode, sharing)

DMD Digital Mars D Compiler v1.042

------------------------

Scores on Ubuntu running on VirtualBox running on Vista, using LDC compiler for the D code (bigger is better):
java -server Life
Size	average
Adjusting 28766 to 951466
Adjusting 951466 to 2616313
5	13740
6	16494
8	14922
10	19481
15	20602
25	21249
50	19979
250	7493
1000	2609
2500	1004


ldc -O5 -release -inline life1_d.d
Size	average
5	17810
6	16661
8	15147
10	15201
15	14818
25	12901
50	4384
250	1893
1000	485
2500	198


ldc -O5 -release -inline -inline-threshold=2000000001 life1_d.d
Size	average
5	16016
6	14690
8	13183
10	11823
15	11100
25	10144
50	5203
250	2305
1000	571
2500	256


ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1_d.d
opt -std-compile-opts life1_d.bc > life1_do.bc
llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1_do life1_do.bc
Size	average
5	22628
6	18773
8	14326
10	18820
15	17558
25	15264
50	10155
250	2329
1000	578
2500	251


ldc -O5 -release -inline -inline-threshold=2000000001 life2_d.d
Size	average
5	16557
6	15205
8	13149
10	12037
15	10870
25	10051
50	6470
250	2327
1000	571
2500	247


ldc -O5 -release -inline -inline-threshold=2000000001 life3_d.d
Size	average
5	18205
6	17032
8	15948
10	17798
15	17907
25	15656
50	10191
250	2310
1000	574
2500	248


ldc -O5 -release -inline -inline-threshold=2000000001 life4_d.d
Size	average
5	18169
6	16576
8	16129
10	17616
15	17937
25	15486
50	9667
250	6714
1000	2006
2500	886


ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life4_d.d
opt -std-compile-opts life4_d.bc > life4_do.bc
llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life4_do life4_do.bc
Size	average
5	25522
6	22085
8	21079
10	20964
15	21987
25	19300
50	13160
250	6774
1000	1986
2500	900


This version (not included in the zip) is the first D version, but the botRow array in calc_new() is cleared as in life4_d.d:
ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1b_d.d
opt -std-compile-opts life1b_d.bc > life1b_do.bc
llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1b_do life1b_do.bc
Size	average
5	22957
6	19394
8	16457
10	17208
15	18177
25	15522
50	5089
250	6638
1000	1971
2500	889
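
The Ubuntu tables are easier to compare as ratios. The following small sketch divides the scores of the life4_d.d build with Link-Time Optimization by the scores of "java -server Life", using the numbers copied from the two tables above. The class name LifeRatios and the helper ratio() are made up for this illustration.

```java
public class LifeRatios {
    static final int[] SIZES = {5, 6, 8, 10, 15, 25, 50, 250, 1000, 2500};

    // Scores of "java -server Life" on Ubuntu, copied from the table above.
    static final int[] JAVA = {13740, 16494, 14922, 19481, 20602,
                               21249, 19979, 7493, 2609, 1004};

    // Scores of life4_d.d built with opt + llvm-ld, copied from the table above.
    static final int[] LIFE4_LTO = {25522, 22085, 21079, 20964, 21987,
                                    19300, 13160, 6774, 1986, 900};

    // Ratio of a D score to the Java score; above 1 means D scored higher.
    static double ratio(int dScore, int javaScore) {
        return (double) dScore / javaScore;
    }

    public static void main(String[] args) {
        System.out.println("Size   D/Java");
        for (int i = 0; i < SIZES.length; i++)
            System.out.printf("%-6d %.2f%n", SIZES[i], ratio(LIFE4_LTO[i], JAVA[i]));
    }
}
```

Since bigger scores are better, a ratio above 1 means the D build scored higher than Java for that size, and a ratio below 1 means Java scored higher.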


Code running on Ubuntu 9.04 running on VirtualBox 3.0.6 r52128.

Compilers used:

LDC based on DMD v1.045 and llvm 2.6 (Thu Sep 10 23:50:27 2009)

Java version "1.6.0_16"
Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
Java HotSpot(TM) Client VM (build 14.2-b01, mixed mode, sharing)

------------------------


