leonardo
View:Recent Entries.
View:Archive.
View:Friends.
View:User Info.
View:Website (My Website).
You're looking at the latest 6 entries.

Tags:, , , , , ,
Subject:Benchmark of small memory allocations in D and C
Time:06:10 pm
All the code of the tests:
http://www.fantascienza.net/leonardo/js/mem_bench.zip

To see the relative performance of various ways to allocate memory I have timed various alternative versions of this program, in C and D languages:
// C99 code
#include "stdio.h"
#include "stdlib.h"

int compute(int n) {
    int *arr = malloc(n * sizeof(int));
    if (arr == NULL) exit(1);
    for (int i = 0; i < n; i++)
        arr[i] = i;
    int tot = 0;
    for (int i = 0; i < n; i++)
        tot += arr[i];
    free(arr);
    return tot;
}

int main() {
    int n = 2 * 1000 * 1000;
    int tot = 0;
    for (int i = 1; i < n; i++)
        tot += compute(i % 500);
    printf("%d\n", tot);
    return 0;
}

In D deleting the dynamic array manually improves the timings a lot.
Timings on Virtual Box, n=2_000_000, % 500, best of 6, seconds:
  C 1: 0.98  malloc + free
  C 2: 0.81  C99 dynamic stack-allocated array
  C 3:       alloca
  D 1: 2.83  dynamic array
  D 2: 2.79  scoped dynamic array
  D 3: 1.13  dynamic array + delete at the end
  D 4: 0.88  C heap malloc + free
  D 5: 0.74  alloca
  D 6: 1.13  GC heap malloc + free/relloc
  
C code compiled with:
gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4) 
gcc -Wall -s -std=c99 -O3

D code compiled with (with Tango):
ldc based on DMD v1.051 and llvm 2.6 (Fri Nov 27 12:54:12 2009)
ldc -O5 -release -inline memtest1_d.d

---------------------

Timings on Vista, n=2_000_000, % 500, best of 6, seconds:
  C 1: 1.57  malloc + free
  C 2: 0.69  C99 dynamic stack-allocated array
  C 3:       alloca
  D 1: 2.60  dynamic array
  D 2: 2.62  scoped dynamic array
  D 3: 2.50  dynamic array + delete at the end
  D 4: 1.24  C heap malloc + free
  D 5: 0.74  alloca
  D 6: 1.33  GC heap malloc + free/relloc
  
(Virtual box not started, so it's a little faster)

C code compiled with:
gcc version 4.3.3-dw2-tdm-1 (GCC)
gcc -Wall -s -std=c99 -O3

D code compiled with (with Phobos):
Digital Mars D Compiler v1.042
ldc -O -release -inline memtest1_d.d

---------------------

Timings on Vista, n=2_000_000, % 500, best of 6, seconds:
  C 1: 1.47  malloc + free
  C 2: 0.63  C99 dynamic stack-allocated array
  C 3:       alloca
  D 1: 2.62  dynamic array
  D 2: 2.58  scoped dynamic array
  D 3: 2.31  dynamic array + delete at the end
  D 4: 1.27  C heap malloc + free
  D 5: 0.75  alloca
  D 6: 1.40  GC heap malloc + free/relloc
  
(Virtual box not started, so it's a little faster)

C code compiled with:
gcc version 4.3.3-dw2-tdm-1 (GCC)
llvm-gcc -Wall -s -Wl,--enable-stdcall-fixup -std=c99 -O3

D code compiled with (with Phobos):
Digital Mars D Compiler v2.037
ldc -O -release -inline memtest1_d.d
comments: 2 comments or Leave a comment Add to Memories Tell a Friend

Tags:, , , , ,
Subject:Code performance in D/Java
Time:05:05 pm
The code discussed in this post:
http://www.fantascienza.net/leonardo/js/life_bench.zip


To try to find "performance bugs" in both the LDC D compiler and the LLVM back-end I am exploring the performance signature of many small programs, often translating them to D. To perform such tests I compare the timings of the D code with the original C or Java code.

Java here is useful because its performance profiles are a little different from the usual C ones. Java HotSpot is able to inline many virtual calls, and its GarbageCollector is quite efficient (both things aren't good in all current D implementations).

I've found a small Life (Horton Conway's game) implementation in Java that shows a higher performance compared to D code (the original Java code isn't mine), so I cleaned up the Java code, I have removed the useless stuff from it (see the zip for the Java code), and I have translated it to as much close as possible D code (able to run both on Tango and Phobos). The result is the first D program (life1_d.d).

I have taken care of setting as final the main class in the D code, to allow inlining.

The Java GC (of Sun) is more efficient than the nonmoving D GC, so first of all I have taken a look at the number of memory allocations, bue they weren't the cause of the performance difference. I've profiled both the D and Java code, and I've seen that the calc_new() method was the one taking most time. I've also seen that the amount of inlining with LDC (using -O5 -release -inline) was not enough, so I've compiled the D code with the following, that forces a more aggressive inlining, that improves performance:

ldc -O5 -release -inline -inline-threshold=2000000001 life1_d.d

I have also seen that for this program Link-Time Optimization plus Interning improve the performance of the code:

ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1_d.d

opt -std-compile-opts life1_d.bc > life1_do.bc

llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1_do life1_do.bc

I have done many more experiments, here you can find the ones that have given good results. I have split the Life class in its methods plus global values, I don't know why this increases performance with the LDC compiler (see the timings), see life2_d.d.

Later I have uses simple pointers as function arguments instead of arrays (see see life3_d.d), again I don't know why this increases the performance with the LDC compiler, also because most of sych functions get inlined anyway.

Now the performance of life3_d.d was bad only for large values (the last values of the use_Sizes array). After several more experiments I have by chance found that the cause was in the inner loop of the calc_new() function, where the whole botRow array is reset to zero. I think here the Java compiler recognizes such loop as a clearing, and replaces it with a call to a function like memset(). This makes the Java code slower than the D one for small botRow arrays, but faster for the longer ones (because it seems an inlined loop is faster than a memset when the length is small, about 50-100 or so if array items are 4 byte long).

So in life4_d.d I have replaced the loop with something a little more complex that takes a look at botRow length:
if (botRow.length > 100)
    botRow[] = 0;
else
    for (int c = 0; c < botRow.length; c++)
        botRow[c] = 0;
The timings of the 4th D version are now good enough, but not the best still, I don't know why. (I am now trying to find a faster array reset that uses an asm routine that contains the movntps SSE instuction).

------------------------

Scores on Windows Vista, using DMD compiler for the D code (bigger is better):
java -server Life
Size    average
Adjusting 6744 to 2811246
5       14288
6       14974
8       14125
10      15697
15      14175
25      15369
50      14595
250     6331
1000    2218
2500    880


life1_d.exe
Size    average
5       9248
6       8938
8       10213
10      10469
15      10974
25      9898
50      7189
250     1694
1000    421
2500    170


life2_d.exe
Size    average
5       9424
6       8270
8       8083
10      7876
15      8389
25      7191
50      5288
250     1331
1000    331
2500    147


life3_d.exe
Size    average
5       10156
6       9172
8       9353
10      9043
15      8802
25      8452
50      5865
250     1357
1000    356
2500    151


life4_d.exe
Size    average
Adjusting 917315 to 0
5       10145
6       9275
8       9631
10      10282
15      10061
25      9083
50      6286
250     4747
1000    1596
2500    743

Home Vista Basic with 2 GB RAM, Celeron 2.13 GHz

Compilers used:

Java version "1.7.0-ea"
Java(TM) SE Runtime Environment (build 1.7.0-ea-b66)
Java HotSpot(TM) Client VM (build 16.0-b06, mixed mode, sharing)

DMD Digital Mars D Compiler v1.042

------------------------

Scores on Ubuntu running on VirtualBox running on Vista, using LDC compiler for the D code (bigger is better):
java -server Life
Size	average
Adjusting 28766 to 951466
Adjusting 951466 to 2616313
5	13740
6	16494
8	14922
10	19481
15	20602
25	21249
50	19979
250	7493
1000	2609
2500	1004


ldc -O5 -release -inline life1_d.d
Size	average
5	17810
6	16661
8	15147
10	15201
15	14818
25	12901
50	4384
250	1893
1000	485
2500	198


ldc -O5 -release -inline -inline-threshold=2000000001 life1_d.d
Size	average
5	16016
6	14690
8	13183
10	11823
15	11100
25	10144
50	5203
250	2305
1000	571
2500	256


ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1_d.d
opt -std-compile-opts life1_d.bc > life1_do.bc
llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1_do life1_do.bc
Size	average
5	22628
6	18773
8	14326
10	18820
15	17558
25	15264
50	10155
250	2329
1000	578
2500	251


ldc -O5 -release -inline -inline-threshold=2000000001 life2_d.d
Size	average
5	16557
6	15205
8	13149
10	12037
15	10870
25	10051
50	6470
250	2327
1000	571
2500	247


ldc -O5 -release -inline -inline-threshold=2000000001 life3_d.d
Size	average
5	18205
6	17032
8	15948
10	17798
15	17907
25	15656
50	10191
250	2310
1000	574
2500	248


ldc -O5 -release -inline -inline-threshold=2000000001 life4_d.d
Size	average
5	18169
6	16576
8	16129
10	17616
15	17937
25	15486
50	9667
250	6714
1000	2006
2500	886


ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life4_d.d
opt -std-compile-opts life4_d.bc > life4_do.bc
llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life4_do life4_do.bc
Size	average
5	25522
6	22085
8	21079
10	20964
15	21987
25	19300
50	13160
250	6774
1000	1986
2500	900


This version (not included in the zip) is the first D one, but the cleaning of the botRow array in calc_new() is done as in life4_d.d:
ldc -O5 -release -inline -inline-threshold=2000000001 -output-bc life1b_d.d
opt -std-compile-opts life1b_d.bc > life1b_do.bc
llvm-ld -native -ltango-base-ldc -ltango-user-ldc -ldl -lm -lpthread -internalize-public-api-list=_Dmain -o=life1b_do life1b_do.bc
Size	average
5	22957
6	19394
8	16457
10	17208
15	18177
25	15522
50	5089
250	6638
1000	1971
2500	889


Code running on Ubuntu 9.4 running on VirtualBox 3.0.6 r52128.

Compilers used:

LDC based on DMD v1.045 and llvm 2.6 (Thu Sep 10 23:50:27 2009)

Java version "1.6.0_16"
Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
Java HotSpot(TM) Client VM (build 14.2-b01, mixed mode, sharing)

------------------------
comments: Leave a comment Add to Memories Tell a Friend

Tags:, , , , , , , ,
Subject:"Silly benchmark" in D/C++
Time:09:45 pm
All the code discussed here:
http://www.fantascienza.net/leonardo/js/silly_bench.zip

Slava Pestov, author of the programming language Factor, has created a very simple benchmark. Despite being defined "silly" such benchmark has shown me some interesting things.

This the original posts:
"Performance comparison between Factor and Java on a contrived benchmark"
http://factor-language.blogspot.com/2009/08/performance-comparison-between-factor.html

"Struct arrays benchmark revisited: trig function calls are slow in Java, but without them Factor is still 3x faster":
http://factor-language.blogspot.com/2009/08/struct-arrays-benchmark-revisited.html

What the benchmark does:
- It works on points, which are triplets of single-precision floats, (x,y,z)
- First, the benchmark creates a list of 5000000 points for i=0..4999999, where the ith point is (sin(i),cos(i)*3,sin(i)*sin(i)/2).
-Then, each point is normalized; the x, y, and z components are divided by sqrt(x*x+y*y+z*z).
- Finally, the maximum x, y, and z components are found, for all points, and this is printed out.

The purpose of this benchmark is to compare performance of an array of struct in Factor to an array of class references in Java. It also shows how trigonometric functions are managed.

Slava has given Java code, that I have modified a bit, adding more timings inside, pulling out a subclass, making it static. The overall running time of the Java version is improved just a little.

The Java version is slow for the large amount of memory allocations, and also because it doesn't use the CPU native trigonometric instructions (sin, cos or sincos) when the input is arge, because of cross platform issues and because the Intel CPU instructions return quite inaccurate results in some cases, like when the input numers are large.

So I have created a second Java version, that uses an helper class (probably written by Razzi, http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums〈=java&id=4 ) to reduce the input range of trig functions. The result is much faster. (2.4 seconds instead of 11.4).

Then I have created three D versions:
- The first version is designed to be very similar to the Java version, the array is filled with class references. It's slower on performing the sqrt() and other code, but even in non-release mode it's overall quite faster than the first Java version. Bothe the Java versions and this D version use lot of RAM on each iteration of the program, the GC doesn't free it efficiently.
- The second D version is more optimized. The rray is now composed of struct values. I also delete the array at the end of each loop, so the memory used by the program is about constant. I have also used free functions instead of Silly class, but this has not much effect. D1 structs don't allow constructors (D2 has struct constructors), so I have had to change code a bit there too. The result is much faster, almost two times faster than the second Java version.
- In the third D version I've tried to manually optimize what possible. The DMD compiler doesn't inline functions with ref arguments and some other things. I have avoided an useless inizialization of the struct array and I have manually inlined its filling. I have used for loops insead of foreach, and allocated memory from the GC heap in a raw way. I have optimized the code using only one call to sin() because the DMD compiler isn't able to see the two calls to sin() have the same results. I have hand-optimized the computation of the max struct, even if takes only a small percentage of the whole runtime. The result is a nice speeup compared to the second D version. In D there's no simple way to allocate an array of uninitialized structs (I have to test something with LDC: it may not initialize the array if the struct data is float x=void,y=void,z=void; To be tested). LDC compiler can probably improve such timings.

Later, to have a baseline timing, I have translated the code to C++, and I have compiled it with G++ and LLVM-GCC. The second compiler uses calls to tle libc even using the -ffast-math compilation argument, so the running time is quite slower. Compiled with G++ the program is quite faster than the D code compiled with DMD.

This code in Haskell (or Python, or D with my dlibs) can be lazy, and reduce a lot memory used and probably running time too.

When possible I'll add timings with the LDC D compiler.

(I have not put the Python+Psyco timings here because I have not enough RAM to test it with n=5 millions. With n=1 million the best Python+Psyco time is about 3700 milliseconds. With the first Java program the best timing with n=1 million is 577 milliseconds, about seven times faster).

--------------------
Timings on Windows:

...>java -Xms512m -Xmx512m -server Silly1
Run #0
0.8944272, 1.0, 0.4472136
10810 422 234 0  Time: 11466

...>java -Xms512m -Xmx512m -server Silly2
Run #0
0.8944272, 1.0, 0.4472136
1966 406 124 0  Time: 2496

...>silly_bench_cpp (g++)
Run #0
0.894427, 1.000000, 0.447214
334 255 36 5  Total=632

...>silly_bench_cpp (llvm-g++)
Run #2
0.894427, 1.000000, 0.447214
1304 374 21 5  Total=1705

...>silly_bench1_d (No -release mode):
Run #0
0.894427, 1.000000, 0.447214
6941 444 80 0  total = 7466

...>silly_bench2_d
Run #1
0.894427, 1.000000, 0.447214
785 446 74 0  total = 1305

...>silly_bench3_d
Run #2
0.894427, 1.000000, 0.447214
488 443 28 0  Total = 960
------------------

Compilation arguments for g++ and llvm-g++:
g++ -Wall -O3 -s -fomit-frame-pointer -msse -msse2 -msse3 -march=native -ffast-math silly_bench_cpp.cpp -o silly_bench_cpp

Compilation arguments for DMD:
dmd -O -release -inline:

Intel CPU Celeron 560 at 2.13 GHz, 1GB RAM, Vista Home Basic.

Compilers used:
DMD v1.042
gcc v. 4.3.3-dw2-tdm-1 (GCC)
LLVM-G++ gcc version 4.2.1 (Based on Apple Inc. build 5636) (LLVM build)
Java HotSpot(TM) Client VM (build 16.0-b06, mixed mode, sharing)
comments: 2 comments or Leave a comment Add to Memories Tell a Friend

Tags:, , , , , , ,
Subject:Updates and links
Time:12:36 am
Updated two toy raytracers (Sphereflake and Yopyra) in my two software pages.

I have also added a new benchmark, about the Boggle game:
http://www.fantascienza.net/leonardo/js/boggle.zip

---------------------

A nice and probably useful hierarchical allocator for C programs:
http://swapped.cc/halloc/


Why colors are important in scientific visualizations, and the risks of their bad usage:
http://www.research.ibm.com/people/l/lloydt/color/color.HTM


Theo discusses about what's important for young people to learn, the risks of violent video games, the low quality of educational software:
http://theodoregray.com/BrainRot/index.html
comments: Leave a comment Add to Memories Tell a Friend

Tags:, , , , ,
Subject:Richards benchmark
Time:11:57 pm
This old post of mine is about virtual methods and devirtualizations:
http://leonardo-m.livejournal.com/76547.html

I have performed more benchmarks about virtual methods and its costs with LDC. I have used a little well known benchmark, the Richards one. I have used the code from here (on the Web Archive because it seems to be not online anymore):
http://web.archive.org/web/20060715074131/lissett.port5.com/ben/bench1.htm
http://web.archive.org/web/20060715074131/http://lissett.port5.com/ben/bench3.htm

All the code of the following benchmarks:
http://www.fantascienza.net/leonardo/js/richards.zip
Timing results:

Windows, n = 10_000_000:
  C:       1.26
  D ~C:    2.01
  Java:    2.04 (final classes, -server)
  D2 ~C#:  2.36 (final classes)
  D3 ~C#:  2.36 (final classes, no getters/setters)  
  Java:    2.73 (final classes)
  D4 ~C#:  3.07 (no getters/setters)  
  C#:      3.98
  D1 ~C#:  4.23
  
Windows, n = 100_000_000:
  C:      12.16
  Java:   18.73 (final classes, -server)
  D ~C:   18.86
  D2 ~C#: 23.11 (final classes)
  D3 ~C#: 23.12 (final classes, no getters/setters)  
  Java:   25.40 (final classes)
  D4 ~C#: 30.16 (no getters/setters)    
  C#:     38.39
  D1 ~C#: 41.64


Pubuntu, n = 10_000_000:
  D ~C:    1.35
  C:       1.39
  D2 ~C#:  1.98 (final classes)  
  D3 ~C#:  2.00 (final classes, no getters/setters) 
  Java:    2.73 (final classes)  
  D4 ~C#:  2.94 (no getters/setters)      
  C#:      -  
  D1 ~C#:  4.03
  
Pubuntu, n = 100_000_000:
  D ~C:   13.24
  C:      13.77
  D2 ~C#: 19.64 (final classes)
  D3 ~C#: 19.92 (final classes, no getters/setters) 
  Java:   25.16 (final classes)    
  D4 ~C#: 29.16 (no getters/setters)       
  C#:      -
  D1 ~C#:  40.17

Key:
  D ~C# means D code that comes from the C# version.
  D ~C means D code that comes and looks from the C version.

Note that the classes in C# code aren't final. As usual Java shows very good performance.


On WindowsXp:
DMD Digital Mars D Compiler v1.042
gcc version 4.3.3-dw2-tdm-1 (GCC)

dmd used with:
dmd -O -release -inline

ldc used with:
ldc -O5 -release -inline

gcc used with:
gcc -Wall -O3 -s -fomit-frame-pointer -msse3 -march=core2


On Pubuntu:
gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4)
ldc based on DMD v1.045 and llvm 2.6svn (Thu Jul 2 23:07:48 2009)

ldc used with:
ldc -O5 -release -inline

gcc used with:
gcc -Wall -O3 -s -fomit-frame-pointer -msse3 -march=native
comments: 8 comments or Leave a comment Add to Memories Tell a Friend

Tags:, , , , ,
Subject:fpaq0 - update
Time:06:09 pm
The ldc D compiler (http://www.dsource.org/projects/ldc ) is based on the LLVM backend and it's very good. LLVM is currently a bit young still compared to GCC, and it's often possible to find code where GCC produces faster code, but such difference is usually within 5-20%, so in practice it's acceptable in all situations where performance isn't critical (and in such situations you may want to use something like the Intel compiler or inline asm anyway).

So I have used ldc to recompile fpaq0, a simple order-0 arithmetic file compressor for stationary sources by by Matt Mahoney:
http://cs.fit.edu/%7Emmahoney/compression/fpaq0.cpp
That I have translated to D. I have also added a version of the imports for the D Tango standard library too. As input file I use a large purely ASCII text of 6_500_314 bytes, mixed English texts (originally by Peter Norvig, cleaned up a bit).

The D code and more details about the timings:
http://www.fantascienza.net/leonardo/so/fpaq0.zip

The D timing results are very good, better or equal than the C++ code compiled with G++. In both the D and C++ version I have also used putc_unlocked/getc_unlocked to increase I/O performance (I'd like to see such functions in Tango).
Timings Windows, seconds:
  DMD:        3.01  (100%)
  LLVM-G++ C: 1.78  ( 59%) (100%)
  LLVM-G++ A: 1.73  ( 57%) ( 97%) (100%)
  G++ C:      1.41  ( 47%) ( 79%) ( 81%)
  G++ A:      1.35  ( 45%) ( 76%) ( 78%)

Timings Pubuntu, seconds:
  G++ A:      3.53
  LDC:        1.92 (100%)
  G++ B:      1.55 ( 81%)
  G++ B:      1.48 ( 78%) putc_unlocked/getc_unlocked
  LDC:        1.41        putc_unlocked/getc_unlocked 
Compilers used:

On WindowsXp:
DMD Digital Mars D Compiler v1.042
gcc version 4.3.3-dw2-tdm-1 (GCC)
gcc version 4.2.1 (Based on Apple Inc. build 5636) (LLVM build)

On Pubuntu:
gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4)
ldc based on DMD v1.045 and llvm 2.6svn (Sun Jun 7 14:18:55 2009)


dmd used with:
dmd -O -release -inline

ldc used with:
ldc -O5 -release -inline

g++ A used with:
g++ -O3 -s

g++ B used with:
g++ -Wall -O3 -s -fomit-frame-pointer -msse3 -march=native

g++/llvm-gcc C used with:
g++ -Wall -O3 -s -fomit-frame-pointer -msse3 -march=core2
llvm-g++ -Wall -O3 -s -fomit-frame-pointer -msse3 -march=core2
comments: 2 comments or Leave a comment Add to Memories Tell a Friend

Advertisement

leonardo
View:Recent Entries.
View:Archive.
View:Friends.
View:User Info.
View:Website (My Website).
You're looking at the latest 6 entries.