| All the code of the tests: http://www.fantascienza.net/leonardo/js/mem_bench.zip
To see the relative performance of various ways to allocate memory I have timed various alternative versions of this program, in C and D languages:
// C99 code
#include "stdio.h"
#include "stdlib.h"
int compute(int n) {
int *arr = malloc(n * sizeof(int));
if (arr == NULL) exit(1);
for (int i = 0; i < n; i++)
arr[i] = i;
int tot = 0;
for (int i = 0; i < n; i++)
tot += arr[i];
free(arr);
return tot;
}
int main() {
int n = 2 * 1000 * 1000;
int tot = 0;
for (int i = 1; i < n; i++)
tot += compute(i % 500);
printf("%d\n", tot);
return 0;
}
In D deleting the dynamic array manually improves the timings a lot.
Timings on Virtual Box, n=2_000_000, % 500, best of 6, seconds:
C 1: 0.98 malloc + free
C 2: 0.81 C99 dynamic stack-allocated array
C 3: alloca
D 1: 2.83 dynamic array
D 2: 2.79 scoped dynamic array
D 3: 1.13 dynamic array + delete at the end
D 4: 0.88 C heap malloc + free
D 5: 0.74 alloca
D 6: 1.13 GC heap malloc + free/relloc
C code compiled with:
gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
gcc -Wall -s -std=c99 -O3
D code compiled with (with Tango):
ldc based on DMD v1.051 and llvm 2.6 (Fri Nov 27 12:54:12 2009)
ldc -O5 -release -inline memtest1_d.d
---------------------
Timings on Vista, n=2_000_000, % 500, best of 6, seconds:
C 1: 1.57 malloc + free
C 2: 0.69 C99 dynamic stack-allocated array
C 3: alloca
D 1: 2.60 dynamic array
D 2: 2.62 scoped dynamic array
D 3: 2.50 dynamic array + delete at the end
D 4: 1.24 C heap malloc + free
D 5: 0.74 alloca
D 6: 1.33 GC heap malloc + free/relloc
(Virtual box not started, so it's a little faster)
C code compiled with:
gcc version 4.3.3-dw2-tdm-1 (GCC)
gcc -Wall -s -std=c99 -O3
D code compiled with (with Phobos):
Digital Mars D Compiler v1.042
ldc -O -release -inline memtest1_d.d
---------------------
Timings on Vista, n=2_000_000, % 500, best of 6, seconds:
C 1: 1.47 malloc + free
C 2: 0.63 C99 dynamic stack-allocated array
C 3: alloca
D 1: 2.62 dynamic array
D 2: 2.58 scoped dynamic array
D 3: 2.31 dynamic array + delete at the end
D 4: 1.27 C heap malloc + free
D 5: 0.75 alloca
D 6: 1.40 GC heap malloc + free/relloc
(Virtual box not started, so it's a little faster)
C code compiled with:
gcc version 4.3.3-dw2-tdm-1 (GCC)
llvm-gcc -Wall -s -Wl,--enable-stdcall-fixup -std=c99 -O3
D code compiled with (with Phobos):
Digital Mars D Compiler v2.037
ldc -O -release -inline memtest1_d.d
| comments: 2 comments or Leave a comment  |
| The code discussed in this post: http://www.fantascienza.net/leonardo/js/life_bench.zip
To try to find "performance bugs" in both the LDC D compiler and the LLVM back-end I am exploring the performance signature of many small programs, often translating them to D. To perform such tests I compare the timings of the D code with the original C or Java code.
Java here is useful because its performance profiles are a little different from the usual C ones. Java HotSpot is able to inline many virtual calls, and its GarbageCollector is quite efficient (both things aren't good in all current D implementations).
I've found a small Life (Horton Conway's game) implementation in Java that shows a higher performance compared to D code (the original Java code isn't mine), so I cleaned up the Java code, I have removed the useless stuff from it (see the zip for the Java code), and I have translated it to as much close as possible D code (able to run both on Tango and Phobos). The result is the first D program (life1_d.d).
I have taken care of setting as final the main class in the D code, to allow inlining.
The Java GC (of Sun) is more efficient than the nonmoving D GC, so first of all I have taken a look at the number of memory allocations, bue they weren't the cause of the performance difference. I've profiled both the D and Java code, and I've seen that the calc_new() method was the one taking most time. I've also seen that the amount of inlining with LDC (using -O5 -release -inline) was not enough, so I've compiled the D code with the following, that forces a more aggressive inlining, that improves performance:
ldc -O5 -release -inline -inline-threshold=2000000001 life1_d.d
I have also seen that for this program Link-Time Optimization plus Interning improve the performance of the code:
ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1_d.d
opt -std-compile-opts life1_d.bc > life1_do.bc
llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1_do life1_do.bc
I have done many more experiments, here you can find the ones that have given good results. I have split the Life class in its methods plus global values, I don't know why this increases performance with the LDC compiler (see the timings), see life2_d.d.
Later I have uses simple pointers as function arguments instead of arrays (see see life3_d.d), again I don't know why this increases the performance with the LDC compiler, also because most of sych functions get inlined anyway.
Now the performance of life3_d.d was bad only for large values (the last values of the use_Sizes array). After several more experiments I have by chance found that the cause was in the inner loop of the calc_new() function, where the whole botRow array is reset to zero. I think here the Java compiler recognizes such loop as a clearing, and replaces it with a call to a function like memset(). This makes the Java code slower than the D one for small botRow arrays, but faster for the longer ones (because it seems an inlined loop is faster than a memset when the length is small, about 50-100 or so if array items are 4 byte long).
So in life4_d.d I have replaced the loop with something a little more complex that takes a look at botRow length:
if (botRow.length > 100)
botRow[] = 0;
else
for (int c = 0; c < botRow.length; c++)
botRow[c] = 0;
The timings of the 4th D version are now good enough, but not the best still, I don't know why. (I am now trying to find a faster array reset that uses an asm routine that contains the movntps SSE instuction).
------------------------
Scores on Windows Vista, using DMD compiler for the D code (bigger is better):
java -server Life
Size average
Adjusting 6744 to 2811246
5 14288
6 14974
8 14125
10 15697
15 14175
25 15369
50 14595
250 6331
1000 2218
2500 880
life1_d.exe
Size average
5 9248
6 8938
8 10213
10 10469
15 10974
25 9898
50 7189
250 1694
1000 421
2500 170
life2_d.exe
Size average
5 9424
6 8270
8 8083
10 7876
15 8389
25 7191
50 5288
250 1331
1000 331
2500 147
life3_d.exe
Size average
5 10156
6 9172
8 9353
10 9043
15 8802
25 8452
50 5865
250 1357
1000 356
2500 151
life4_d.exe
Size average
Adjusting 917315 to 0
5 10145
6 9275
8 9631
10 10282
15 10061
25 9083
50 6286
250 4747
1000 1596
2500 743
Home Vista Basic with 2 GB RAM, Celeron 2.13 GHz
Compilers used:
Java version "1.7.0-ea" Java(TM) SE Runtime Environment (build 1.7.0-ea-b66) Java HotSpot(TM) Client VM (build 16.0-b06, mixed mode, sharing)
DMD Digital Mars D Compiler v1.042
------------------------
Scores on Ubuntu running on VirtualBox running on Vista, using LDC compiler for the D code (bigger is better):
java -server Life
Size average
Adjusting 28766 to 951466
Adjusting 951466 to 2616313
5 13740
6 16494
8 14922
10 19481
15 20602
25 21249
50 19979
250 7493
1000 2609
2500 1004
ldc -O5 -release -inline life1_d.d
Size average
5 17810
6 16661
8 15147
10 15201
15 14818
25 12901
50 4384
250 1893
1000 485
2500 198
ldc -O5 -release -inline -inline-threshold=2000000001 life1_d.d
Size average
5 16016
6 14690
8 13183
10 11823
15 11100
25 10144
50 5203
250 2305
1000 571
2500 256
ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1_d.d opt -std-compile-opts life1_d.bc > life1_do.bc llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1_do life1_do.bc
Size average
5 22628
6 18773
8 14326
10 18820
15 17558
25 15264
50 10155
250 2329
1000 578
2500 251
ldc -O5 -release -inline -inline-threshold=2000000001 life2_d.d
Size average
5 16557
6 15205
8 13149
10 12037
15 10870
25 10051
50 6470
250 2327
1000 571
2500 247
ldc -O5 -release -inline -inline-threshold=2000000001 life3_d.d
Size average
5 18205
6 17032
8 15948
10 17798
15 17907
25 15656
50 10191
250 2310
1000 574
2500 248
ldc -O5 -release -inline -inline-threshold=2000000001 life4_d.d
Size average
5 18169
6 16576
8 16129
10 17616
15 17937
25 15486
50 9667
250 6714
1000 2006
2500 886
ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life4_d.d opt -std-compile-opts life4_d.bc > life4_do.bc llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life4_do life4_do.bc
Size average
5 25522
6 22085
8 21079
10 20964
15 21987
25 19300
50 13160
250 6774
1000 1986
2500 900
This version (not included in the zip) is the first D one, but the cleaning of the botRow array in calc_new() is done as in life4_d.d: ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1b_d.d opt -std-compile-opts life1b_d.bc > life1b_do.bc llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1b_do life1b_do.bc
Size average
5 22957
6 19394
8 16457
10 17208
15 18177
25 15522
50 5089
250 6638
1000 1971
2500 889
Code running on Ubuntu 9.4 running on VirtualBox 3.0.6 r52128.
Compilers used:
LDC based on DMD v1.045 and llvm 2.6 (Thu Sep 10 23:50:27 2009)
Java version "1.6.0_16" Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Java HotSpot(TM) Client VM (build 14.2-b01, mixed mode, sharing)
------------------------ | comments: Leave a comment  |
| All the code discussed here: http://www.fantascienza.net/leonardo/js/silly_bench.zip
Slava Pestov, author of the programming language Factor, has created a very simple benchmark. Despite being defined "silly" such benchmark has shown me some interesting things.
This the original posts: "Performance comparison between Factor and Java on a contrived benchmark" http://factor-language.blogspot.com/2009/08/performance-comparison-between-factor.html
"Struct arrays benchmark revisited: trig function calls are slow in Java, but without them Factor is still 3x faster": http://factor-language.blogspot.com/2009/08/struct-arrays-benchmark-revisited.html
What the benchmark does: - It works on points, which are triplets of single-precision floats, (x,y,z) - First, the benchmark creates a list of 5000000 points for i=0..4999999, where the ith point is (sin(i),cos(i)*3,sin(i)*sin(i)/2). -Then, each point is normalized; the x, y, and z components are divided by sqrt(x*x+y*y+z*z). - Finally, the maximum x, y, and z components are found, for all points, and this is printed out.
The purpose of this benchmark is to compare performance of an array of struct in Factor to an array of class references in Java. It also shows how trigonometric functions are managed.
Slava has given Java code, that I have modified a bit, adding more timings inside, pulling out a subclass, making it static. The overall running time of the Java version is improved just a little.
The Java version is slow for the large amount of memory allocations, and also because it doesn't use the CPU native trigonometric instructions (sin, cos or sincos) when the input is arge, because of cross platform issues and because the Intel CPU instructions return quite inaccurate results in some cases, like when the input numers are large.
So I have created a second Java version, that uses an helper class (probably written by Razzi, http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsumsā©=java&id=4 ) to reduce the input range of trig functions. The result is much faster. (2.4 seconds instead of 11.4).
Then I have created three D versions: - The first version is designed to be very similar to the Java version, the array is filled with class references. It's slower on performing the sqrt() and other code, but even in non-release mode it's overall quite faster than the first Java version. Bothe the Java versions and this D version use lot of RAM on each iteration of the program, the GC doesn't free it efficiently. - The second D version is more optimized. The rray is now composed of struct values. I also delete the array at the end of each loop, so the memory used by the program is about constant. I have also used free functions instead of Silly class, but this has not much effect. D1 structs don't allow constructors (D2 has struct constructors), so I have had to change code a bit there too. The result is much faster, almost two times faster than the second Java version. - In the third D version I've tried to manually optimize what possible. The DMD compiler doesn't inline functions with ref arguments and some other things. I have avoided an useless inizialization of the struct array and I have manually inlined its filling. I have used for loops insead of foreach, and allocated memory from the GC heap in a raw way. I have optimized the code using only one call to sin() because the DMD compiler isn't able to see the two calls to sin() have the same results. I have hand-optimized the computation of the max struct, even if takes only a small percentage of the whole runtime. The result is a nice speeup compared to the second D version. In D there's no simple way to allocate an array of uninitialized structs (I have to test something with LDC: it may not initialize the array if the struct data is float x=void,y=void,z=void; To be tested). LDC compiler can probably improve such timings.
Later, to have a baseline timing, I have translated the code to C++, and I have compiled it with G++ and LLVM-GCC. The second compiler uses calls to tle libc even using the -ffast-math compilation argument, so the running time is quite slower. Compiled with G++ the program is quite faster than the D code compiled with DMD.
This code in Haskell (or Python, or D with my dlibs) can be lazy, and reduce a lot memory used and probably running time too.
When possible I'll add timings with the LDC D compiler.
(I have not put the Python+Psyco timings here because I have not enough RAM to test it with n=5 millions. With n=1 million the best Python+Psyco time is about 3700 milliseconds. With the first Java program the best timing with n=1 million is 577 milliseconds, about seven times faster).
--------------------
Timings on Windows:
...>java -Xms512m -Xmx512m -server Silly1
Run #0
0.8944272, 1.0, 0.4472136
10810 422 234 0 Time: 11466
...>java -Xms512m -Xmx512m -server Silly2
Run #0
0.8944272, 1.0, 0.4472136
1966 406 124 0 Time: 2496
...>silly_bench_cpp (g++)
Run #0
0.894427, 1.000000, 0.447214
334 255 36 5 Total=632
...>silly_bench_cpp (llvm-g++)
Run #2
0.894427, 1.000000, 0.447214
1304 374 21 5 Total=1705
...>silly_bench1_d (No -release mode):
Run #0
0.894427, 1.000000, 0.447214
6941 444 80 0 total = 7466
...>silly_bench2_d
Run #1
0.894427, 1.000000, 0.447214
785 446 74 0 total = 1305
...>silly_bench3_d
Run #2
0.894427, 1.000000, 0.447214
488 443 28 0 Total = 960
------------------
Compilation arguments for g++ and llvm-g++: g++ -Wall -O3 -s -fomit-frame-pointer -msse -msse2 -msse3 -march=native -ffast-math silly_bench_cpp.cpp -o silly_bench_cpp
Compilation arguments for DMD: dmd -O -release -inline:
Intel CPU Celeron 560 at 2.13 GHz, 1GB RAM, Vista Home Basic.
Compilers used: DMD v1.042 gcc v. 4.3.3-dw2-tdm-1 (GCC) LLVM-G++ gcc version 4.2.1 (Based on Apple Inc. build 5636) (LLVM build) Java HotSpot(TM) Client VM (build 16.0-b06, mixed mode, sharing) | comments: 2 comments or Leave a comment  |
| This old post of mine is about virtual methods and devirtualizations: http://leonardo-m.livejournal.com/76547.html
I have performed more benchmarks about virtual methods and its costs with LDC. I have used a little well known benchmark, the Richards one. I have used the code from here (on the Web Archive because it seems to be not online anymore): http://web.archive.org/web/20060715074131/lissett.port5.com/ben/bench1.htm http://web.archive.org/web/20060715074131/http://lissett.port5.com/ben/bench3.htm
All the code of the following benchmarks: http://www.fantascienza.net/leonardo/js/richards.zip
Timing results:
Windows, n = 10_000_000:
C: 1.26
D ~C: 2.01
Java: 2.04 (final classes, -server)
D2 ~C#: 2.36 (final classes)
D3 ~C#: 2.36 (final classes, no getters/setters)
Java: 2.73 (final classes)
D4 ~C#: 3.07 (no getters/setters)
C#: 3.98
D1 ~C#: 4.23
Windows, n = 100_000_000:
C: 12.16
Java: 18.73 (final classes, -server)
D ~C: 18.86
D2 ~C#: 23.11 (final classes)
D3 ~C#: 23.12 (final classes, no getters/setters)
Java: 25.40 (final classes)
D4 ~C#: 30.16 (no getters/setters)
C#: 38.39
D1 ~C#: 41.64
Pubuntu, n = 10_000_000:
D ~C: 1.35
C: 1.39
D2 ~C#: 1.98 (final classes)
D3 ~C#: 2.00 (final classes, no getters/setters)
Java: 2.73 (final classes)
D4 ~C#: 2.94 (no getters/setters)
C#: -
D1 ~C#: 4.03
Pubuntu, n = 100_000_000:
D ~C: 13.24
C: 13.77
D2 ~C#: 19.64 (final classes)
D3 ~C#: 19.92 (final classes, no getters/setters)
Java: 25.16 (final classes)
D4 ~C#: 29.16 (no getters/setters)
C#: -
D1 ~C#: 40.17
Key:
D ~C# means D code that comes from the C# version.
D ~C means D code that comes and looks from the C version.
Note that the classes in C# code aren't final. As usual Java shows very good performance.
On WindowsXp: DMD Digital Mars D Compiler v1.042 gcc version 4.3.3-dw2-tdm-1 (GCC)
dmd used with: dmd -O -release -inline
ldc used with: ldc -O5 -release -inline
gcc used with: gcc -Wall -O3 -s -fomit-frame-pointer -msse3 -march=core2
On Pubuntu: gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4) ldc based on DMD v1.045 and llvm 2.6svn (Thu Jul 2 23:07:48 2009)
ldc used with: ldc -O5 -release -inline
gcc used with: gcc -Wall -O3 -s -fomit-frame-pointer -msse3 -march=native | comments: 8 comments or Leave a comment  |
| The ldc D compiler (http://www.dsource.org/projects/ldc ) is based on the LLVM backend and it's very good. LLVM is currently a bit young still compared to GCC, and it's often possible to find code where GCC produces faster code, but such difference is usually within 5-20%, so in practice it's acceptable in all situations where performance isn't critical (and in such situations you may want to use something like the Intel compiler or inline asm anyway).
So I have used ldc to recompile fpaq0, a simple order-0 arithmetic file compressor for stationary sources by by Matt Mahoney: http://cs.fit.edu/%7Emmahoney/compression/fpaq0.cpp That I have translated to D. I have also added a version of the imports for the D Tango standard library too. As input file I use a large purely ASCII text of 6_500_314 bytes, mixed English texts (originally by Peter Norvig, cleaned up a bit).
The D code and more details about the timings: http://www.fantascienza.net/leonardo/so/fpaq0.zip
The D timing results are very good, better or equal than the C++ code compiled with G++. In both the D and C++ version I have also used putc_unlocked/getc_unlocked to increase I/O performance (I'd like to see such functions in Tango).
Timings Windows, seconds:
DMD: 3.01 (100%)
LLVM-G++ C: 1.78 ( 59%) (100%)
LLVM-G++ A: 1.73 ( 57%) ( 97%) (100%)
G++ C: 1.41 ( 47%) ( 79%) ( 81%)
G++ A: 1.35 ( 45%) ( 76%) ( 78%)
Timings Pubuntu, seconds:
G++ A: 3.53
LDC: 1.92 (100%)
G++ B: 1.55 ( 81%)
G++ B: 1.48 ( 78%) putc_unlocked/getc_unlocked
LDC: 1.41 putc_unlocked/getc_unlocked
Compilers used:
On WindowsXp: DMD Digital Mars D Compiler v1.042 gcc version 4.3.3-dw2-tdm-1 (GCC) gcc version 4.2.1 (Based on Apple Inc. build 5636) (LLVM build)
On Pubuntu: gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4) ldc based on DMD v1.045 and llvm 2.6svn (Sun Jun 7 14:18:55 2009)
dmd used with: dmd -O -release -inline
ldc used with: ldc -O5 -release -inline
g++ A used with: g++ -O3 -s
g++ B used with: g++ -Wall -O3 -s -fomit-frame-pointer -msse3 -march=native
g++/llvm-gcc C used with: g++ -Wall -O3 -s -fomit-frame-pointer -msse3 -march=core2 llvm-g++ -Wall -O3 -s -fomit-frame-pointer -msse3 -march=core2 | comments: 2 comments or Leave a comment  |
| |