leonardo ([info]leonardo_m) wrote,
@ 2009-08-30 21:45:00
Previous Entry  Add to memories!  Tell a Friend  Next Entry
Entry tags:benchmark, c++, d language, dmd, g++, java, llvm-g++, programming, python

"Silly benchmark" in D/C++
All the code discussed here:
http://www.fantascienza.net/leonardo/js/silly_bench.zip

Slava Pestov, author of the programming language Factor, has created a very simple benchmark. Despite being defined "silly" such benchmark has shown me some interesting things.

This the original posts:
"Performance comparison between Factor and Java on a contrived benchmark"
http://factor-language.blogspot.com/2009/08/performance-comparison-between-factor.html

"Struct arrays benchmark revisited: trig function calls are slow in Java, but without them Factor is still 3x faster":
http://factor-language.blogspot.com/2009/08/struct-arrays-benchmark-revisited.html

What the benchmark does:
- It works on points, which are triplets of single-precision floats, (x,y,z)
- First, the benchmark creates a list of 5000000 points for i=0..4999999, where the ith point is (sin(i),cos(i)*3,sin(i)*sin(i)/2).
-Then, each point is normalized; the x, y, and z components are divided by sqrt(x*x+y*y+z*z).
- Finally, the maximum x, y, and z components are found, for all points, and this is printed out.

The purpose of this benchmark is to compare performance of an array of struct in Factor to an array of class references in Java. It also shows how trigonometric functions are managed.

Slava has given Java code, that I have modified a bit, adding more timings inside, pulling out a subclass, making it static. The overall running time of the Java version is improved just a little.

The Java version is slow for the large amount of memory allocations, and also because it doesn't use the CPU native trigonometric instructions (sin, cos or sincos) when the input is arge, because of cross platform issues and because the Intel CPU instructions return quite inaccurate results in some cases, like when the input numers are large.

So I have created a second Java version, that uses an helper class (probably written by Razzi, http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums〈=java&id=4 ) to reduce the input range of trig functions. The result is much faster. (2.4 seconds instead of 11.4).

Then I have created three D versions:
- The first version is designed to be very similar to the Java version, the array is filled with class references. It's slower on performing the sqrt() and other code, but even in non-release mode it's overall quite faster than the first Java version. Bothe the Java versions and this D version use lot of RAM on each iteration of the program, the GC doesn't free it efficiently.
- The second D version is more optimized. The rray is now composed of struct values. I also delete the array at the end of each loop, so the memory used by the program is about constant. I have also used free functions instead of Silly class, but this has not much effect. D1 structs don't allow constructors (D2 has struct constructors), so I have had to change code a bit there too. The result is much faster, almost two times faster than the second Java version.
- In the third D version I've tried to manually optimize what possible. The DMD compiler doesn't inline functions with ref arguments and some other things. I have avoided an useless inizialization of the struct array and I have manually inlined its filling. I have used for loops insead of foreach, and allocated memory from the GC heap in a raw way. I have optimized the code using only one call to sin() because the DMD compiler isn't able to see the two calls to sin() have the same results. I have hand-optimized the computation of the max struct, even if takes only a small percentage of the whole runtime. The result is a nice speeup compared to the second D version. In D there's no simple way to allocate an array of uninitialized structs (I have to test something with LDC: it may not initialize the array if the struct data is float x=void,y=void,z=void; To be tested). LDC compiler can probably improve such timings.

Later, to have a baseline timing, I have translated the code to C++, and I have compiled it with G++ and LLVM-GCC. The second compiler uses calls to tle libc even using the -ffast-math compilation argument, so the running time is quite slower. Compiled with G++ the program is quite faster than the D code compiled with DMD.

This code in Haskell (or Python, or D with my dlibs) can be lazy, and reduce a lot memory used and probably running time too.

When possible I'll add timings with the LDC D compiler.

(I have not put the Python+Psyco timings here because I have not enough RAM to test it with n=5 millions. With n=1 million the best Python+Psyco time is about 3700 milliseconds. With the first Java program the best timing with n=1 million is 577 milliseconds, about seven times faster).

--------------------

Timings on Windows:

...>java -Xms512m -Xmx512m -server Silly1
Run #0
0.8944272, 1.0, 0.4472136
10810 422 234 0  Time: 11466

...>java -Xms512m -Xmx512m -server Silly2
Run #0
0.8944272, 1.0, 0.4472136
1966 406 124 0  Time: 2496

...>silly_bench_cpp (g++)
Run #0
0.894427, 1.000000, 0.447214
334 255 36 5  Total=632

...>silly_bench_cpp (llvm-g++)
Run #2
0.894427, 1.000000, 0.447214
1304 374 21 5  Total=1705

...>silly_bench1_d (No -release mode):
Run #0
0.894427, 1.000000, 0.447214
6941 444 80 0  total = 7466

...>silly_bench2_d
Run #1
0.894427, 1.000000, 0.447214
785 446 74 0  total = 1305

...>silly_bench3_d
Run #2
0.894427, 1.000000, 0.447214
488 443 28 0  Total = 960
------------------

Compilation arguments for g++ and llvm-g++:
g++ -Wall -O3 -s -fomit-frame-pointer -msse -msse2 -msse3 -march=native -ffast-math silly_bench_cpp.cpp -o silly_bench_cpp

Compilation arguments for DMD:
dmd -O -release -inline:

Intel CPU Celeron 560 at 2.13 GHz, 1GB RAM, Vista Home Basic.

Compilers used:
DMD v1.042
gcc v. 4.3.3-dw2-tdm-1 (GCC)
LLVM-G++ gcc version 4.2.1 (Based on Apple Inc. build 5636) (LLVM build)
Java HotSpot(TM) Client VM (build 16.0-b06, mixed mode, sharing)



(2 comments) - (Post a new comment)


[info]scale
2009-08-31 08:08 am UTC (link)
I tried this benchmark quickly in C# to compare the performance of using a class or a struct for Point. On my machine the class version is about 30% slower as expected. Also the FastMath class is much slower than the System.Math class (about 70%), probably because the latter uses intrinsics.
This is the struct version, to test the class version just replace "struct Point" with "class Point" and uncomment the first constructor:
http://www.snowcovered.it/prog/Silly2.rar
Note that code for normalizePoint() is slightly different because the compiler complained about loss of data (in .Net all floating point operations are handled as double, so it wants an explicit cast to float).

(Reply to this) (Thread)


[info]leonardo_m
2009-08-31 01:11 pm UTC (link)
Thank you for your C# version and timings.

>I tried this benchmark quickly in C# to compare the performance of using a class or a struct for Point. On my machine the class version is about 30% slower as expected.<

On Java, D and C++ the version that uses reference-based classes is much slower than the value-based version, so your result is not expected at all.


>Also the FastMath class is much slower than the System.Math class (about 70%), probably because the latter uses intrinsics.<

So C# accepts the errors produced by the asm instruction, or it already performs input range reduction for trigonometric functions. I don't know. A look at the produced asm will tell.


>Note that code for normalizePoint() is slightly different because the compiler complained about loss of data (in .Net all floating point operations are handled as double, so it wants an explicit cast to float).<

D may uses reals (79 bit floats) for some FP operations on Intel CPUs (if the compiler doesn't use SSE, LDC usually uses SSE instructions, that don't use 79 bits), but D1 doesn't have such warnings. C# is probably safer here (D2 has added similar warnings for integral values).

This is how G++, with good compilation arguments, compiles the struct constructor, the two sin and the cos are done with a single quick fsincos instruction:
L20:
	pushl	%ecx
	incl	%ecx
	fildl	(%esp)
	addl	$4, %esp
	fsincos
	fld	%st(1)
	fmul	%st(2), %st
	fmuls	LC4
	fstps	(%edx)
	fmuls	LC5
	fstps	-4(%edx)
	fstps	-8(%edx)
	addl	$12, %edx
	cmpl	%ecx, %ebx
	jg	L20

I don't know if you can do better with the current CPUs.

(Reply to this) (Parent)


(2 comments) - (Post a new comment)

Create an Account
Forgot your login or password?
Login w/ OpenID
English • Español • Deutsch • Русский…