leonardo @ 2009-09-19 17:05:00
Entry tags: benchmark, d language, dmd, java, ldc, programming
Code performance in D/Java
The code discussed in this post:
http://www.fantascienza.net/leonard
To try to find "performance bugs" in both the LDC D compiler and the LLVM back-end I am exploring the performance signature of many small programs, often translating them to D. To perform such tests I compare the timings of the D code with the original C or Java code.
Java is useful here because its performance profile differs somewhat from the usual C one. Java HotSpot can inline many virtual calls, and its garbage collector is quite efficient (current D implementations do neither of these well).
I found a small Java implementation of Life (John Horton Conway's game) that runs faster than the equivalent D code (the original Java code isn't mine). I cleaned up the Java code, removed the useless parts (see the zip for the Java code), and translated it into D code that stays as close as possible to the original and runs on both Tango and Phobos. The result is the first D program (life1_d.d).
I took care to mark the main class as final in the D code, to allow inlining.
The Java GC (Sun's) is more efficient than the non-moving D GC, so I first looked at the number of memory allocations, but they weren't the cause of the performance difference. I profiled both the D and the Java code and saw that the calc_new() method took most of the time. I also saw that LDC (with -O5 -release -inline) did not inline enough, so I compiled the D code with the following command, which forces more aggressive inlining and improves performance:
ldc -O5 -release -inline -inline-threshold=2000000001 life1_d.d
I have also seen that, for this program, link-time optimization plus internalization (the -internalize-public-api-list option) improves performance:
ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1_d.d
opt -std-compile-opts life1_d.bc > life1_do.bc
llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1_do life1_do.bc
I have done many more experiments; here are the ones that gave good results. I split the Life class into free functions plus global variables (see life2_d.d). I don't know why this increases performance with the LDC compiler (see the timings).
Then I used plain pointers as function arguments instead of arrays (see life3_d.d). Again, I don't know why this increases performance with the LDC compiler, especially since most of these functions get inlined anyway.
At this point the performance of life3_d.d was bad only for large sizes (the last values of the use_Sizes array). After several more experiments I found, by chance, that the cause was in the inner loop of the calc_new() function, where the whole botRow array is reset to zero. I think the Java compiler recognizes this loop as an array clear and replaces it with a call to a function like memset(). This makes the Java code slower than the D code for small botRow arrays but faster for longer ones, because an inlined loop seems to be faster than a memset when the array is short (up to about 50-100 items, if the items are 4 bytes long).
So in life4_d.d I have replaced the loop with something a little more complex that takes a look at botRow length:
    if (botRow.length > 100)
        botRow[] = 0;
    else
        for (int c = 0; c < botRow.length; c++)
            botRow[c] = 0;
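For comparison, here is a minimal Java sketch of the same idea, not taken from the zip. The class name RowClear, the method names and the timing helper are hypothetical; only the cutoff of 100 comes from the D snippet. It puts the explicit loop, the library bulk fill and the hybrid side by side, with a rough timer for trying different row lengths.

```java
import java.util.Arrays;

final class RowClear {
    // Cutoff copied from the D snippet; the best value is machine dependent.
    static final int THRESHOLD = 100;

    /** Clears the row with an explicit element-by-element loop. */
    static void clearLoop(int[] row) {
        for (int c = 0; c < row.length; c++) {
            row[c] = 0;
        }
    }

    /** Clears the row with the library bulk fill. */
    static void clearFill(int[] row) {
        Arrays.fill(row, 0);
    }

    /** Hybrid as in life4_d.d: bulk fill for long rows, plain loop for short ones. */
    static void clearHybrid(int[] row) {
        if (row.length > THRESHOLD) {
            clearFill(row);
        } else {
            clearLoop(row);
        }
    }

    /** Rough wall-clock time in nanoseconds for reps clears of a row of len ints (len > 0). */
    static long timeClears(int len, int reps, boolean useFill) {
        int[] row = new int[len];
        long start = System.nanoTime();
        for (int r = 0; r < reps; r++) {
            row[r % len] = 1; // dirty one cell so each clear has something to undo
            if (useFill) {
                clearFill(row);
            } else {
                clearLoop(row);
            }
        }
        long elapsed = System.nanoTime() - start;
        if (row[0] != 0) {
            throw new AssertionError("row not cleared");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int[] sizes = {5, 25, 50, 100, 250, 1000, 2500};
        for (int len : sizes) {
            long loop = timeClears(len, 200_000, false);
            long fill = timeClears(len, 200_000, true);
            System.out.println(len + "\tloop " + loop + " ns\tfill " + fill + " ns");
        }
    }
}
```

This is only a crude timer with no warm-up control, so the numbers are indicative at best. If HotSpot turns the explicit loop into a bulk fill by itself, as guessed above, the two Java timings may come out close; the crossover described here was observed in the D code.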
The timings of the fourth D version are now good enough, but still not the best, and I don't know why. (I am now trying to find a faster array reset that uses an asm routine containing the movntps SSE instruction.)
------------------------
Scores on Windows Vista, using DMD compiler for the D code (bigger is better):
java -server Life
Size  average
Adjusting 6744 to 2811246
   5    14288
   6    14974
   8    14125
  10    15697
  15    14175
  25    15369
  50    14595
 250     6331
1000     2218
2500      880

life1_d.exe
Size  average
   5     9248
   6     8938
   8    10213
  10    10469
  15    10974
  25     9898
  50     7189
 250     1694
1000      421
2500      170

life2_d.exe
Size  average
   5     9424
   6     8270
   8     8083
  10     7876
  15     8389
  25     7191
  50     5288
 250     1331
1000      331
2500      147

life3_d.exe
Size  average
   5    10156
   6     9172
   8     9353
  10     9043
  15     8802
  25     8452
  50     5865
 250     1357
1000      356
2500      151

life4_d.exe
Size  average
Adjusting 917315 to 0
   5    10145
   6     9275
   8     9631
  10    10282
  15    10061
  25     9083
  50     6286
 250     4747
1000     1596
2500      743

Home Vista Basic with 2 GB RAM, Celeron 2.13 GHz
Compilers used:
Java version "1.7.0-ea"
Java(TM) SE Runtime Environment (build 1.7.0-ea-b66)
Java HotSpot(TM) Client VM (build 16.0-b06, mixed mode, sharing)
DMD Digital Mars D Compiler v1.042
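
To read the Vista scores above more easily, one can divide the Java score by the D score at each size. The following is a small Java sketch of that arithmetic. The class and method names are hypothetical, and the numbers are copied from the java -server, life1_d.exe and life4_d.exe results.

```java
final class ScoreRatios {
    // Scores copied from the Vista results (bigger is better).
    static final int[] SIZES = {5, 6, 8, 10, 15, 25, 50, 250, 1000, 2500};
    static final int[] JAVA  = {14288, 14974, 14125, 15697, 14175, 15369, 14595, 6331, 2218, 880};
    static final int[] LIFE1 = {9248, 8938, 10213, 10469, 10974, 9898, 7189, 1694, 421, 170};
    static final int[] LIFE4 = {10145, 9275, 9631, 10282, 10061, 9083, 6286, 4747, 1596, 743};

    /** How many times faster the Java run scored than the D run. */
    static double ratio(int javaScore, int dScore) {
        return (double) javaScore / dScore;
    }

    public static void main(String[] args) {
        System.out.println(" Size  java/life1  java/life4");
        for (int i = 0; i < SIZES.length; i++) {
            System.out.println(String.format("%5d  %10.2f  %10.2f",
                    SIZES[i], ratio(JAVA[i], LIFE1[i]), ratio(JAVA[i], LIFE4[i])));
        }
    }
}
```

At size 2500, for example, this gives 880/170, about 5.2, for life1_d.exe, against 880/743, about 1.2, for life4_d.exe, which is where the changed botRow clearing shows up most.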
------------------------
Scores on Ubuntu running on VirtualBox running on Vista, using LDC compiler for the D code (bigger is better):
java -server Life
Size  average
Adjusting 28766 to 951466
Adjusting 951466 to 2616313
   5    13740
   6    16494
   8    14922
  10    19481
  15    20602
  25    21249
  50    19979
 250     7493
1000     2609
2500     1004

ldc -O5 -release -inline life1_d.d
Size  average
   5    17810
   6    16661
   8    15147
  10    15201
  15    14818
  25    12901
  50     4384
 250     1893
1000      485
2500      198

ldc -O5 -release -inline -inline-threshold=2000000001 life1_d.d
Size  average
   5    16016
   6    14690
   8    13183
  10    11823
  15    11100
  25    10144
  50     5203
 250     2305
1000      571
2500      256

ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1_d.d
opt -std-compile-opts life1_d.bc > life1_do.bc
llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1_do life1_do.bc
Size  average
   5    22628
   6    18773
   8    14326
  10    18820
  15    17558
  25    15264
  50    10155
 250     2329
1000      578
2500      251

ldc -O5 -release -inline -inline-threshold=2000000001 life2_d.d
Size  average
   5    16557
   6    15205
   8    13149
  10    12037
  15    10870
  25    10051
  50     6470
 250     2327
1000      571
2500      247

ldc -O5 -release -inline -inline-threshold=2000000001 life3_d.d
Size  average
   5    18205
   6    17032
   8    15948
  10    17798
  15    17907
  25    15656
  50    10191
 250     2310
1000      574
2500      248

ldc -O5 -release -inline -inline-threshold=2000000001 life4_d.d
Size  average
   5    18169
   6    16576
   8    16129
  10    17616
  15    17937
  25    15486
  50     9667
 250     6714
1000     2006
2500      886

ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life4_d.d
opt -std-compile-opts life4_d.bc > life4_do.bc
llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life4_do life4_do.bc
Size  average
   5    25522
   6    22085
   8    21079
  10    20964
  15    21987
  25    19300
  50    13160
 250     6774
1000     1986
2500      900

This version (not included in the zip) is the first D program, but the clearing of the botRow array in calc_new() is done as in life4_d.d:
ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1b_d.d
opt -std-compile-opts life1b_d.bc > life1b_do.bc
llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1b_do life1b_do.bc
Size  average
   5    22957
   6    19394
   8    16457
  10    17208
  15    18177
  25    15522
  50     5089
 250     6638
1000     1971
2500      889

Code running on Ubuntu 9.04 running on VirtualBox 3.0.6 r52128.
Compilers used:
LDC based on DMD v1.045 and llvm 2.6 (Thu Sep 10 23:50:27 2009)
Java version "1.6.0_16"
Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
Java HotSpot(TM) Client VM (build 14.2-b01, mixed mode, sharing)
------------------------