This is a small 3D rendering benchmark, ao bench, by Syoyo Fujita:
I have improved the C and Python versions, and I have created versions for ShedSkin, D, Java, etc:
- This time the timings for D are good.
- I have not tried the LDC D compiler, but other people may want to try it.
- In Python I have inlined one function (vdot) manually.
- ao2_py uses both the multiprocessing module of Python 2.6 and Psyco.
- The ShedSkin version is slow.
- Compiling ao_py with version 0.1 of ShedSkin is a bit tedious and slow. You need to use -i -w and then wait for it to perform the maximum of 30 iterations. Then, for maximum performance, you have to manually modify the CCFLAGS inside the Makefile before compiling the C++ code.
- To create the ao2_py version I only had to turn the render function into a pure function (well, it uses some global data, but doesn't change it), and the multiprocessing is done by a single line of code:
Pool().map(render, xrange(HEIGHT), chunksize=2)
- I'd like to see how well the Psyco plus multiprocessing version does with four cores. (If you want to run Python on a 4-core 64-bit CPU you may be tempted to use a 64-bit operating system, because 64-bit code is sometimes faster. But with 64 bits you can't use Psyco, so the end result is often a strong slowdown of your Python code. A 64-bit OS is good if you need a lot of RAM.)
- The Java version (AO.java) is a port of the original Processing version. It's quite naive (although I have removed some useless array allocations, which didn't slow the code down much), so it's probably easy to speed it up.
- The ao2_d D version is naive and very slow; it's a very close translation of the Java version. It shows that D code written in a Java-like style can be very slow when compiled with current D compilers. So Java programmers coming to D have to be careful. For example, current D compilers aren't able to inline virtual calls, as HotSpot is sometimes able to do.
- I have converted the Java code into an applet (http://www.fantascienza.net/leonardo/blog_pics/ao_bench/ao_benchmark.html ); it runs about as fast as the console Java code. On my PC the Processing version (on Syoyo's site) runs in about 8.7 seconds, while this naive pure Java applet needs 6.46 seconds. I don't know why Processing is so much slower. The code is almost the same.
- The ao3_d D version is derived from the ao2_d version, but I have declared some object creations as 'scope' to reduce heap allocations. The resulting code is significantly faster than ao2_d (though still much slower than the optimized D code), but you have to be careful when declaring 'scope', because it may lead to bugs if the created objects then escape the scope.
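The one-line multiprocessing call mentioned in the list above can be sketched as follows. This is only a minimal illustration, not the actual ao2_py code: it is written for Python 3 (so range replaces xrange), the image size is a small made-up value, and the body of render is a hypothetical placeholder standing in for the real per-row ambient occlusion computation.

```python
from multiprocessing import Pool

# Hypothetical small image size, chosen only to keep the demo fast.
WIDTH = HEIGHT = 16


def render(y):
    """Compute one image row.

    Pure function: it reads the global WIDTH but never modifies global
    state, so rows can be computed in any order, in any process.
    The formula is a placeholder, not the real AO computation.
    """
    return [(x * y) % 256 for x in range(WIDTH)]


def render_serial():
    # Reference result computed in a single process.
    return list(map(render, range(HEIGHT)))


def render_parallel():
    # Same mapping, but the rows are handed out to worker processes
    # in chunks of two, as in the line quoted in the post.
    with Pool() as pool:
        return pool.map(render, range(HEIGHT), chunksize=2)


if __name__ == "__main__":
    rows = render_parallel()
    assert rows == render_serial()
    print(len(rows), "rows rendered")
```

Because render has no side effects, Pool.map returns the rows in the same order as the serial map, so the rest of the program doesn't need to change.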
Timings, best of 3, seconds:
  ao_d, float:                 3.67
  ao_c with gcc-llvm, float:   3.72
  ao_c with gcc-llvm, double:  3.83
  ao_c with gcc, double:       3.99
  ao_c with gcc, float:        4.04
  ao_d, double:                4.10
  AO.java, float, naive:       6.35
  ao2_py with Psyco:          16.72
  ao3_d, float, naive:        24.09
  ao_py with Psyco:           29.75
  ao2_d, float, naive:        31.62
  ao_py with ShedSkin:        48.58
  ao2_py without Psyco:       70.6
  ao_py without Psyco:       138.46

Timings on Pubuntu:
  ao1_d:               2.95
  ao_c, gcc, float:    3.82
  ao3_d:               8.78 (BUG)
  ao2_d:              16.52 (BUG)

Timings on Pubuntu with LTO+internalizing:
  ao1_d:               2.87

Parameters used:
  WIDTH = HEIGHT = 256
  NSUBSAMPLES = 2
  NAO_SAMPLES = 8

--------------

CPU: Intel Core 2, 2 GHz (2 cores)

D code compiled with:
  DMD v1.041
  -O -release -inline

C code compiled with:
  gcc: V. 4.3.3-dw2-tdm-1 (GCC)
  LLVM: gcc version 4.2.1 (Based on Apple Inc. build 5636)
  For both: -Wall -O3 -s -fomit-frame-pointer -msse3 -march=core2

Python:
  ActivePython 126.96.36.199 (r261:67515, Dec 5 2008, 13:58:38) [MSC v.1500 32 bit (Intel)] on win32
  Psyco for Python 2.6, V.1.6.0 final 0

ShedSkin V. 0.1
  ss -i -w ao_ss.py
  using gcc version 4.3.3-dw2-tdm-1 (GCC)
  CCFLAGS=-O3 -s -fomit-frame-pointer -msse2 ...

Javac 1.6.0_12
Java SE runtime build 1.6.0_07-b06

To compile it this way (LTO+internalizing) you need three more complex commands:
ldc -O5 -release -inline -output-bc ao1_d.d
opt -std-compile-opts ao1_d.bc > ao1_d_opt.bc
llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=ao1_d ao1_d_opt.bc
Update 1, Mar 26 2009: removed the ao_ss version because I've found that ShedSkin is able to slowly compile ao_py too after all. Added the ao2_py version that uses the multiprocessing module. A D version that uses threads can be created.
Update 2, Mar 27 2009: added a naive Java version (AO.java) adapted from the original Processing version, plus a D translation (ao2_d.d) of this Java version to show how Java-style code can be slow in D.
Update 3, Mar 27 2009: improved the Java version a bit. Added a Java applet version of the Java code.
Update 4, Mar 28 2009: small changes in the ao2_d version, and added the ao3_d version, plus a few timings on Linux.
Update 5, Jun 21 2009: fixed ao1_d for Tango too. On Tango rand() returns a number in (0..maxint). ao2_d and ao3_d have a similar problem.
Update 6, Jul 17 2009: I have found a way to use Link-Time Optimization with LDC, and to internalize correctly, keeping only the main (_Dmain) public.
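The rand() problem noted in Update 5 is the usual mismatch between a generator that returns integers and sampling code that expects a float in [0, 1). A minimal sketch of the usual remedy, in Python for illustration only: MAXINT is an assumed bound (the exact Tango range isn't stated here), to_unit and unit_random are hypothetical helper names, and this is not the actual fix applied to the D sources.

```python
import random

# Assumed upper bound of the integer generator; the real Tango value
# is not given in the post.
MAXINT = 2**31 - 1


def to_unit(raw, maxint=MAXINT):
    """Scale an integer in [0, maxint] to a float in [0.0, 1.0)."""
    # Dividing by maxint + 1 keeps the result strictly below 1.0.
    return raw / (maxint + 1.0)


def unit_random(rng=random):
    """Draw an integer the way an integer-returning rand() would,
    then normalize it to the unit interval."""
    return to_unit(rng.randint(0, MAXINT))


if __name__ == "__main__":
    samples = [unit_random() for _ in range(5)]
    print(samples)
```

Sampling code that assumes values in [0, 1) but receives raw integers produces wrong directions, which would be consistent with the timings marked (BUG) above.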