So far the Shootout site has refused to add a comparison between the LLVM compiler and the other ones:
So I have done a few benchmarks myself, using (mostly) the C code. The code used is exactly the same for both compilers.
C source code (from the Shootout site), and timings in OpenOffice format:
In the nbody benchmark there's a large difference, and I don't know its origin (I hope LLVM will fix those problems, and I hope LLVM will someday support exceptions on Windows too).
For LLVM-gcc it's generally better to compile with -msse3 (without it some timings become quite bad, especially for the mandelbrot benchmark).
Compilers used:
  LLVM-gcc V. 2.4
  GCC V. 4.2.1-dw2 (mingw32-2)

Compiler options used:
  GCC: -O3 -s -fomit-frame-pointer
  LLVM-gcc: -O3 -s -fomit-frame-pointer
  Benchmarks using FP numbers are compiled with -msse3 too.

CPU used: Intel Core2, 2 GHz (32-bit mode). All benchmarks use only 1 core.

TIMINGS GCC, best of 3:
  bintrees, n=15: 4.24 s
  fannkuck, n=11: 5.24 s
  fasta, n=9_000_000 (> NUL): 3.76 s
  fasta, n=9_000_000 (a): 4.17 s
  k_nucleotide, (d): 4.63 s
  mandelbrot, (c) n=4_000: 2.49 s
  meteor_contest_ccp, n=2_098: 0.12 s
  meteor_contest_c, n=2_098: 0.17 s
  nbody, (c) n=10_000_000: 5.92 s
  nsieve, n=12: 5.47 s
  nsieve_bits, n=13: 4.31 s
  partial_sums, (c) n=7_000_000: 5.77 s
  recursive, (c) n=12: 5.82 s
  reverse_complement (b) (> NUL): 1.77 s
  reverse_complement (b) (a): 2.54 s
  spectral_norm, (c) n=3000: 6.78 s
  sum_file, input=71_974_912 bytes: 2.28 s

TIMINGS LLVM-gcc, best of 3:
  bintrees, n=15: 4.26 s
  fannkuck, n=11: 5.45 s
  fasta, n=9_000_000 (> NUL): 3.69 s
  fasta, n=9_000_000 (a): 4.01 s
  k_nucleotide, (d): 4.71 s
  mandelbrot, (c) n=4_000: 2.40 s
  meteor_contest_ccp, n=2_098: 0.13 s
  meteor_contest_c, n=2_098: 0.14 s
  nbody, (c) n=10_000_000: 16.63 s
  nsieve, n=12: 5.47 s
  nsieve_bits, n=13: 4.15 s
  partial_sums, (c) n=7_000_000: 6.52 s
  recursive, (c) n=12: 6.47 s
  reverse_complement (b) (> NUL): 1.90 s
  reverse_complement (b) (a): 2.60 s
  spectral_norm, (c) n=3000: 5.96 s
  sum_file, input=71_974_912 bytes: 3.28 s

Key:
  (a) = to no existing output file.
  (b) = input generated by fasta with N=9_000_000.
  (c) = compiled with -msse3 too.
  (d) = from fasta file n=1_000_000

Note, useful as reference point:
  nbody.java, N=10_000_000: 5.48 s
After a suggestion I have compiled all the programs again with a more fitting -march setting:
llvm-gcc -O3 -s -fomit-frame-pointer -msse3 -march=core2
Or:
llvm-g++ -O3 -s -fomit-frame-pointer -msse3 -march=core2

Some timings are changed a little:

TIMINGS LLVM-gcc core2, best of 3:
  fasta, n=9_000_000 (> NUL): 3.69 s ==> 2.75 s
  fasta, n=9_000_000 (a): 4.01 s ==> 3.08 s
  reverse_complement (b) (> NUL): 1.90 s ==> 1.88 s
  reverse_complement (b) (a): 2.60 s ==> 2.97 s
So overall there's an improvement. Ignoring nbody, the total of the remaining LLVM-gcc timings (with -march=core2) is close to the total for GCC.
In the meantime the LLVM developers have found the problem with nbody (and filed a performance bug report): the compiler doesn't inline the sqrt() call in the following line, which sits in the hottest loop:
double distance = sqrt(dx * dx + dy * dy + dz * dz);
For the Java code of 'nbody' see also here:
You can find that reformatted nbody Java code in the zip too.
Using ideas from the following two pages:
Installing the "self-extracting DEBUG Jar file", and then using:
java -XX:+PrintOptoAssembly -server -cp . nbody
I was able to find the asm code produced by the JavaVM for the nbody benchmark. It essentially uses only the SSE registers, and no floating point stack. It contains three inlined calls to sqrt (although the program contains only two of them). At a first look, that asm doesn't look much different from the asm produced by LLVM-gcc (but LLVM-gcc doesn't inline the call to sqrt).
I have seen that the latest C++ version of nbody (you can find it inside the zip too) compiled with LLVM-gcc runs in 4.98 s, but it uses a lot of intrinsics like __builtin_ia32_haddpd(), which will not be the best choice for future CPUs (while the Java code is perfectly general). In practice it's partially asm already.
Update 1: I have added the CPU and the compiler versions used, and changed the title of the post a little.
Update 2: I have cleaned up timings and the graph, leaving only the ones with -msse3 for benchmarks that use FP numbers.
Update 3: I have added timings for -march=core2, link to bug #3219, and fixed the key a little.
Update 4, Dec 19: I have added the Java code, the related asm, and comments.
See a follow up: http://leonardo-m.livejournal.com/7