ao benchmark in various languages - leonardo
Subject:ao benchmark in various languages
Time:12:03 pm
This is a small 3D rendering benchmark, ao bench, by Syoyo Fujita:

I have improved the C and Python versions, and I have created versions for ShedSkin, D, Java, etc:

- This time the timings for D are good.
- I have not tried the LDC D compiler, but other people may want to try it.
- In Python I have inlined one function (vdot) manually.
- ao2_py uses both the multiprocessing module of Python 2.6 and Psyco.
- The ShedSkin version is slow.
- Compiling ao_py with version 0.1 of ShedSkin is a bit tedious and slow. You need to use -i -w and then wait for it to perform the maximum of 30 iterations. Then, for maximum performance, you have to manually modify the CCFLAGS inside the Makefile before compiling the C++ code.
- To create the ao2_py version I just had to turn the render function into a pure function (well, it reads some global data, but doesn't change it), and the multiprocessing is done by a single line of code:
Pool().map(render, xrange(HEIGHT), chunksize=2)
- I'd like to see how well the Psyco + multiprocessing version (ao2_py) does with four cores. (If you want to run Python on a 4-core 64-bit CPU you may be tempted to use a 64-bit operating system, because 64-bit code is sometimes faster. But with 64 bits you can't use Psyco, so the end result is often a strong slowdown of your Python code. A 64-bit OS is good if you need a lot of RAM.)
- The Java version (AO.java) is a port of the original Processing version. It's quite naive (although I have removed some useless array allocations, which don't slow the code down much anyway), so it's probably easy to speed up.
- The ao2_d D version is naive and very slow; it's a very close translation of the Java version. It shows that code written in a Java-like style can be very slow in D, so Java programmers coming to D have to be careful. For example, current D compilers aren't able to inline virtual calls, as HotSpot is sometimes able to do.
- I have converted the Java code into an applet (http://www.fantascienza.net/leonardo/blog_pics/ao_bench/ao_benchmark.html ); it runs about as fast as the console Java code. On my PC the Processing version (on Syoyo's site) runs in about 8.7 seconds, while this naive pure Java applet needs 6.46 seconds. I don't know why Processing is so much slower, since the code is almost the same.
- The ao3_d D version is derived from the ao2_d version, but I have declared some object creations 'scope' to reduce heap allocations. The resulting code is significantly faster than ao2_d (though still much slower than the optimized D code), but you have to be careful when declaring 'scope', because it may lead to bugs if the created objects then escape the scope.
- I have done a few more tests. On the same PC a 128x128 benchmark in JavaScript takes about 22.36 s with Firefox 3.03 and about 3 seconds with Chrome build 21 running through Wine. I have also seen that on Ubuntu 8.10, compiling with gcc V.4.3.2, the C code needs just 2.54 s (and 2.8 s using the standard drand48); this timing is similar to the C timing on Syoyo's site. I don't know why MinGW on Windows gives so much slower code. I have not tested LDC on Linux.
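
The one-line parallelization described in the list above can be sketched roughly as follows. This is a minimal Python 3 sketch (the post targets Python 2.6, hence xrange in the original line). The tiny image size, the placeholder per-pixel formula and the render_image helper are assumptions made for illustration; they are not the benchmark's actual code.

```python
from multiprocessing import Pool

WIDTH = HEIGHT = 8  # tiny hypothetical image; the benchmark uses 256


def render(y):
    """Compute one image row from its index.

    It reads the global WIDTH but never modifies global state, so each
    row can be computed in any process, in any order.
    """
    # Placeholder per-pixel work (assumption), standing in for the AO code.
    return [(x * y) % 256 for x in range(WIDTH)]


def render_image(parallel=True):
    """Return the image as a list of rows, optionally using all cores."""
    if parallel:
        with Pool() as pool:
            # Same shape as the line quoted in the post: rows are handed
            # to the worker processes in chunks of two.
            return pool.map(render, range(HEIGHT), chunksize=2)
    return [render(y) for y in range(HEIGHT)]
```

Because render depends only on its argument and on read-only globals, the parallel and serial paths should return the same rows. On platforms that spawn rather than fork worker processes, render_image() should be called from under an `if __name__ == "__main__":` guard.
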
Timings, best of 3, seconds:
  ao_d, float:                  3.67
  ao_c with gcc-llvm, float:    3.72
  ao_c with gcc-llvm, double:   3.83
  ao_c with gcc, double:        3.99
  ao_c with gcc, float:         4.04
  ao_d, double:                 4.10
  AO.java, float, naive:        6.35
  ao2_py with Psyco:           16.72
  ao3_d, float, naive:         24.09
  ao_py with Psyco:            29.75
  ao2_d, float, naive:         31.62
  ao_py with ShedSkin:         48.58
  ao2_py without Psyco:        70.6
  ao_py without Psyco:        138.46
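
The post does not say how the runs were timed. One way to take a "best of 3" measurement from inside Python is sketched below; the best_of helper and the workload function are hypothetical names, and the choice of timer (time.perf_counter, Python 3) is an assumption.

```python
import time


def best_of(func, repeats=3):
    """Call func() `repeats` times; return the smallest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def workload():
    # Hypothetical stand-in for one full render.
    return sum(i * i for i in range(100_000))


print("best of 3: %.4f s" % best_of(workload))
```

Taking the minimum rather than the mean reduces the influence of other processes that happen to run during one of the repetitions.
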

Timings on Pubuntu:
  ao1_d:             2.95
  ao_c, gcc, float:  3.82
  ao3_d:             8.78 (BUG)
  ao2_d:            16.52 (BUG)

Timings on Pubuntu with LTO+internalizing:
  ao1_d:             2.87 

Parameters used:
  WIDTH = HEIGHT = 256


CPU: Intel Core 2, 2 GHz (2 cores)

D code compiled with:
  DMD v1.041
  -O -release -inline

C code compiled with:
  gcc: V. 4.3.3-dw2-tdm-1 (GCC)
  LLVM: gcc version 4.2.1 (Based on Apple Inc. build 5636)
  For both:
  -Wall -O3 -s -fomit-frame-pointer -msse3 -march=core2

Python 2.6.1 (r261:67515, Dec  5 2008, 13:58:38)
  [MSC v.1500 32 bit (Intel)] on win32

Psyco for Python 2.6, V.1.6.0 final 0

ShedSkin V. 0.1
  ss -i -w ao_ss.py
  using gcc version 4.3.3-dw2-tdm-1 (GCC)
  CCFLAGS=-O3 -s -fomit-frame-pointer -msse2 ...

Javac 1.6.0_12
Java SE runtime build 1.6.0_07-b06

To compile ao1_d this way (LTO+internalizing) you need three more complex commands:
ldc -O5 -release -inline -output-bc ao1_d.d

opt -std-compile-opts ao1_d.bc > ao1_d_opt.bc

llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=ao1_d ao1_d_opt.bc

Update 1, Mar 26 2009: removed the ao_ss version because I've found that ShedSkin is able to slowly compile ao_py too after all. Added the ao2_py version that uses the multiprocessing module. A D version that uses threads can be created.

Update 2, Mar 27 2009: added a naive Java version (AO.java) adapted from the original Processing version, plus a D translation (ao2_d.d) of this Java version to show how Java-style code can be slow in D.

Update 3, Mar 27 2009: improved the Java version a bit. Added a Java applet version of the Java code.

Update 4, Mar 28 2009: small changes in the ao2_d version, added the ao3_d version, plus a few timings on Linux.

Update 5, Jun 21 2009: fixed ao1_d for Tango too. On Tango rand() returns a number in (0..maxint). ao2_d and ao3_d have a similar problem.
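
Why the range of rand() matters can be illustrated with a small sketch, written here in Python rather than D. It assumes that the code needs a random float in [0, 1) and that "maxint" means 2**31 - 1; both are assumptions. rand_int and unit_random are hypothetical names, not functions from Tango or from the ao code.

```python
import random

INT_MAX = 2**31 - 1  # assumed value of "maxint"


def rand_int(rng):
    """Stand-in for a rand() that returns an integer in 0..INT_MAX."""
    return rng.randint(0, INT_MAX)


def unit_random(rng):
    """Scale the integer by its true upper bound + 1, giving a float in [0, 1)."""
    return rand_int(rng) / (INT_MAX + 1.0)


rng = random.Random(0)
samples = [unit_random(rng) for _ in range(1000)]
print(min(samples) >= 0.0 and max(samples) < 1.0)
```

If the divisor were smaller than the generator's real maximum, the scaled values could exceed 1, which is the kind of mismatch a change of runtime library can introduce.
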

Update 6, Jul 17 2009: I have found a way to use Link-Time Optimization with LDC, and to perform a correct internalize of the main.

Subject:Can you make it go faster?
Time:2010-01-02 10:18 am (UTC)
Just found this and I must say, thank you so much for sharing all this!

I'm hoping to get a version working in Houdini and learn enough to tackle the algorithm in the GPU Gems 2 book, which is posted online.

Do you have any suggestions?