| leonardo ( @ 2009-08-30 21:45:00 |
| Entry tags: | benchmark, c++, d language, dmd, g++, java, llvm-g++, programming, python |
"Silly benchmark" in D/C++
All the code discussed here:
http://www.fantascienza.net/leonard
Slava Pestov, author of the programming language Factor, has created a very simple benchmark. Despite being defined "silly" such benchmark has shown me some interesting things.
This the original posts:
"Performance comparison between Factor and Java on a contrived benchmark"
http://factor-language.blogspot.com/200
"Struct arrays benchmark revisited: trig function calls are slow in Java, but without them Factor is still 3x faster":
http://factor-language.blogspot.com/200
What the benchmark does:
- It works on points, which are triplets of single-precision floats, (x,y,z)
- First, the benchmark creates a list of 5000000 points for i=0..4999999, where the ith point is (sin(i),cos(i)*3,sin(i)*sin(i)/2).
-Then, each point is normalized; the x, y, and z components are divided by sqrt(x*x+y*y+z*z).
- Finally, the maximum x, y, and z components are found, for all points, and this is printed out.
The purpose of this benchmark is to compare performance of an array of struct in Factor to an array of class references in Java. It also shows how trigonometric functions are managed.
Slava has given Java code, that I have modified a bit, adding more timings inside, pulling out a subclass, making it static. The overall running time of the Java version is improved just a little.
The Java version is slow for the large amount of memory allocations, and also because it doesn't use the CPU native trigonometric instructions (sin, cos or sincos) when the input is arge, because of cross platform issues and because the Intel CPU instructions return quite inaccurate results in some cases, like when the input numers are large.
So I have created a second Java version, that uses an helper class (probably written by Razzi, http://shootout.alioth.debian.org/gp4/b
Then I have created three D versions:
- The first version is designed to be very similar to the Java version, the array is filled with class references. It's slower on performing the sqrt() and other code, but even in non-release mode it's overall quite faster than the first Java version. Bothe the Java versions and this D version use lot of RAM on each iteration of the program, the GC doesn't free it efficiently.
- The second D version is more optimized. The rray is now composed of struct values. I also delete the array at the end of each loop, so the memory used by the program is about constant. I have also used free functions instead of Silly class, but this has not much effect. D1 structs don't allow constructors (D2 has struct constructors), so I have had to change code a bit there too. The result is much faster, almost two times faster than the second Java version.
- In the third D version I've tried to manually optimize what possible. The DMD compiler doesn't inline functions with ref arguments and some other things. I have avoided an useless inizialization of the struct array and I have manually inlined its filling. I have used for loops insead of foreach, and allocated memory from the GC heap in a raw way. I have optimized the code using only one call to sin() because the DMD compiler isn't able to see the two calls to sin() have the same results. I have hand-optimized the computation of the max struct, even if takes only a small percentage of the whole runtime. The result is a nice speeup compared to the second D version. In D there's no simple way to allocate an array of uninitialized structs (I have to test something with LDC: it may not initialize the array if the struct data is float x=void,y=void,z=void; To be tested). LDC compiler can probably improve such timings.
Later, to have a baseline timing, I have translated the code to C++, and I have compiled it with G++ and LLVM-GCC. The second compiler uses calls to tle libc even using the -ffast-math compilation argument, so the running time is quite slower. Compiled with G++ the program is quite faster than the D code compiled with DMD.
This code in Haskell (or Python, or D with my dlibs) can be lazy, and reduce a lot memory used and probably running time too.
When possible I'll add timings with the LDC D compiler.
(I have not put the Python+Psyco timings here because I have not enough RAM to test it with n=5 millions. With n=1 million the best Python+Psyco time is about 3700 milliseconds. With the first Java program the best timing with n=1 million is 577 milliseconds, about seven times faster).
--------------------
Timings on Windows: ...>java -Xms512m -Xmx512m -server Silly1 Run #0 0.8944272, 1.0, 0.4472136 10810 422 234 0 Time: 11466 ...>java -Xms512m -Xmx512m -server Silly2 Run #0 0.8944272, 1.0, 0.4472136 1966 406 124 0 Time: 2496 ...>silly_bench_cpp (g++) Run #0 0.894427, 1.000000, 0.447214 334 255 36 5 Total=632 ...>silly_bench_cpp (llvm-g++) Run #2 0.894427, 1.000000, 0.447214 1304 374 21 5 Total=1705 ...>silly_bench1_d (No -release mode): Run #0 0.894427, 1.000000, 0.447214 6941 444 80 0 total = 7466 ...>silly_bench2_d Run #1 0.894427, 1.000000, 0.447214 785 446 74 0 total = 1305 ...>silly_bench3_d Run #2 0.894427, 1.000000, 0.447214 488 443 28 0 Total = 960------------------
Compilation arguments for g++ and llvm-g++:
g++ -Wall -O3 -s -fomit-frame-pointer -msse -msse2 -msse3 -march=native -ffast-math silly_bench_cpp.cpp -o silly_bench_cpp
Compilation arguments for DMD:
dmd -O -release -inline:
Intel CPU Celeron 560 at 2.13 GHz, 1GB RAM, Vista Home Basic.
Compilers used:
DMD v1.042
gcc v. 4.3.3-dw2-tdm-1 (GCC)
LLVM-G++ gcc version 4.2.1 (Based on Apple Inc. build 5636) (LLVM build)
Java HotSpot(TM) Client VM (build 16.0-b06, mixed mode, sharing)