| Here I have collected few benchmarks where the D language, or the DMD compiler, or the Phobos std lib don't shine in their performance. Two of the following performance problems (see 'gc1' and 'sort' benchmarks) have already a solution.
http://www.fantascienza.net/leonardo/js/slow_d.zip
--------------------
Compilers/interpreters used: Digital Mars D Compiler v1.028
gcc version 4.2.1-dw2 (mingw32-2)
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)] on win32
Psyco V.1.5.2 (JIT for Python)
javac 1.6.0_03, Java version "1.6.0_03"
All timings are "warm", best of 3, in seconds. CPU used is a Pentium3 500 MHz, 256 MB RAM, with Win.
Compilation flags used (when not specified otherwise): gcc: -O3 -s dmd: -O -release -inline Python: no flags.
-------------------
BENCHMARKS:
recursive, n = 38:
C: 3.89 s (with __builtin_expect)
C: 3.99 s
D: 4.95 s
hash, n = 500_000:
D: 7.07 s (7.15 s with GC patched)
D: 5.89 s (GC disabled)
Psyco: 4.23 s
(Note that Python+Psyco has much bigger overhead compared to D in any instruction, simple loops too).
wordcount, on a txt file of 13_312_768 bytes:
C: 0.73 s
D: 5.77 s
gc1, with 6.3 MB of text:
loading splitting total
time time time
D: 1.87 s 3.58 s 13.76 s
Python: 0.28 s 1.98 s 3.52 s
Python: 0.1 s 2.0 s 3.38 s (load with 'rb', same result)
With a Patch to the D GC, plus disabling the GC:
D: 0.08 s 0.72 s 1.36 s (GC patched + GC disabled after load)
gc2, n=1_000, m=10_000:
seconds MB
DMD class: 18.95 1.7
GDC class: 17.91 1.8
DMD struct: 11.77 1.7
GDC struct: 12.31 1.8
Python: 37.10 3.1
Psyco: 15.68 3.5
Java: 2.19 7.3
gc3 (binarytrees), n = 15:
Java: 9.12 s
D: 35.01 s
sort, const n = 1_000_000:
Random distribution:
sort: 2.934
fastSort: 0.881
Already sorted arrays:
sort: 1.723
fastSort: 0.41
Inverted order arrays:
sort: 1.853
fastSort: 0.651
(For a better and more complete version of fastSort see my D libs: www.fantascienza.net/leonardo/so/libs_d.zip ). | comments: Leave a comment  |
| GCC supports gotos to variables addresses too, they can be useful in some special situations, like when you want to optimize an interpreter (and sometimes a finite state machine too, see below), you can find a short post about gcc-specific C extensions: http://majek4.blogspot.com/2008/03/useful-c-extensions-gcc-specific.html
This is example code from that blog post:
#include "stdio.h"
#include "assert.h"
int main(int argc, char** argv) {
int value = 2;
if (argc == 2)
value = atoi(argv[1]);
assert(value >= 0 && value <= 2);
const void *labels[] = {&&VAL0, &&VAL1, &&VAL2};
goto *labels[value];
assert(0);
VAL0:
puts("The value is 0");
goto END;
VAL1:
puts("The value is 1");
goto END;
VAL2:
puts("The value is 2");
goto END;
END:
return 0;
}
Compiling that C code with MinGW (GCC) V.4.2.1 it produces the following assembly (two parts in the middle of the code):
...
LC1:
.ascii "value >= 0 && value <= 2\0"
LC2:
.ascii "The value is 0\0"
LC3:
.ascii "The value is 1\0"
LC4:
.ascii "The value is 2\0"
...
...
call __assert ; corrisponde alla riga assert(0);
L6:
movl $L7, -20(%ebp)
movl $L8, -16(%ebp)
movl $L9, -12(%ebp)
movl -8(%ebp), %eax
movl -20(%ebp,%eax,4), %eax
jmp *%eax
L7:
movl $LC2, (%esp)
call _puts
jmp L10
L8:
movl $LC3, (%esp)
call _puts
jmp L10
L9:
movl $LC4, (%esp)
call _puts
L10:
movl $0, %eax
addl $36, %esp
popl %ecx
popl %ebp
leal -4(%ecx), %esp
ret
...
Where the jmp *%eax is the computed goto. Note that in ANSI C you can write the middle part of that code as:
switch (value) {
case 0:
puts("The value is 0");
break;
case 1:
puts("The value is 1");
break;
case 2:
puts("The value is 2");
break;
}
return 0;
But in more complex situations it may help.
I have tried to use that kind of goto in the C version of the Audioactive series generator, and it speeds up the code about 3%. Inside the S0 state the following code:
switch (*p1++) {
case '1': goto S1;
case '2': goto S2;
case '3': goto S3;
}
Is replaced by:
goto *labelsS123[*p1++ - '1'];
Where labelsS123 is:
const void *labelsS123[] = {&&S1, &&S2, &&S3};
| comments: Leave a comment  |
| I have removed a bug from the D/C code, and I have added a faster Python version. Psyco is able to speed it up *25* times, so it's just 3 times slower than the faster C version (that uses a different algorithm), and it uses essentially the same memory: http://www.fantascienza.net/leonardo/js/audioactive.zip
Beside that Python version, I have tried to translate to Python the finite state automata version too, producing something like this:
p1 = p2 = 0
pend = len1
S0, S1, S2, S3, S11, S22, S33, END = xrange(8)
state = S0
while True:
if state == S0:
if p1 == pend: state = END
elif s1[p1] == "1": p1 += 1; state = S1
elif s1[p1] == "2": p1 += 1; state = S2
elif s1[p1] == "3": p1 += 1; state = S3
elif state == S1:
if p1 == pend: s2[p2] = '1'; p2 += 1; s2[p2] = '1'; p2 += 1; state = END
elif s1[p1] == "1": p1 += 1; state = S11
elif s1[p1] == "2": p1 += 1; s2[p2] = '1'; p2 += 1; s2[p2] = '1'; p2 += 1; state = S2
elif s1[p1] == "3": p1 += 1; s2[p2] = '1'; p2 += 1; s2[p2] = '1'; p2 += 1; state = S3
...
But it results a bit slower than the version (derived from the switc-based C/D version that I have later removed) that you can find in the audioactive zip. | comments: Leave a comment  |
| I have updated the Look and say Sequence generators: http://www.fantascienza.net/leonardo/js/audioactive.zip
See for information: http://en.wikipedia.org/wiki/Look-and-say_sequence http://www.research.att.com/~njas/sequences/A005150
This is a very simple sequence, but it can be generated in many different ways, and I have explored many different interesting things. First of all I have found my fastest Python/Psyco versions, looking among more than ten versions.
The faster Python versions use a kind of pattern matching, splitting the precedent term (seen as a string of digits) with groupby, in groups of equal digits, then replacing those patterns using a dictionary of patterns.
Once I have created the fastest Python/Psyco versions, I have tried to use D to generate the terms of the Look and say sequence. First of all I have used my D libs to produce programs as short as Python ones. Then I have tried to write longer but much faster programs.
A first quite fast D version of mine uses two nested switches, and appends the corresponding pair of digits:
foreach (c; last) {
if (count == 0) {
cchar = c;
count++;
} else {
if (c == cchar)
count++;
else {
switch (cchar) {
case '1':
switch (count) {
case 1: r ~= "11"; break;
case 2: r ~= "21"; break;
case 3: r ~= "31"; break;
}
break;
case '2':
switch (count) {
case 1: r ~= "12"; break;
case 2: r ~= "22"; break;
case 3: r ~= "32"; break;
}
break;
case '3':
switch (count) {
case 1: r ~= "13"; break;
case 2: r ~= "23"; break;
case 3: r ~= "33"; break;
}
break;
}
cchar = c;
count = 1;
}
}
}
But later I have kept improving that. I have found an expression that gives an upper bound of the number of digits of the successive term of the series, so I have used that to pre-allocate the whole term, and avoid array appends. Later I have used char pointer to speed up the code, and I have allocated memory from the non-GC heap. I have stated to use C too, because the GCC 4.2.1 compiler is a better optimizer than the DMD v.1.027 D compiler.
One strategy that leads to some of the faster programs is to turn the program into a finite state machine. With a bit of thinking I have found the automata and I have simplified it to just seven states:
S0:
if (p1 == pend) goto END;
switch(*p1++) {
case '1': goto S1;
case '2': goto S2;
case '3': goto S3;
}
S1:
if (p1 == pend) { *p2++ = '1'; *p2++ = '1'; goto END; }
switch (*p1++) {
case '1': goto S11;
case '2': *p2++ = '1'; *p2++ = '1'; goto S2;
case '3': *p2++ = '1'; *p2++ = '1'; goto S3;
}
S2:
if (p1 == pend) { *p2++ = '1'; *p2++ = '2'; goto END; }
switch (*p1++) {
case '1': *p2++ = '1'; *p2++ = '2'; goto S1;
case '2': goto S22;
case '3': *p2++ = '1'; *p2++ = '2'; goto S3;
}
S3:
if (p1 == pend) { *p2++ = '1'; *p2++ = '3'; goto END; }
switch (*p1++) {
case '1': *p2++ = '1'; *p2++ = '3'; goto S1;
case '2': *p2++ = '1'; *p2++ = '3'; goto S2;
case '3': goto S33;
}
S11:
if (p1 == pend) { *p2++ = '2'; *p2++ = '1'; goto END; }
switch (*p1++) {
case '1': *p2++ = '3'; *p2++ = '1'; goto S0;
case '2': *p2++ = '2'; *p2++ = '1'; goto S2;
case '3': *p2++ = '2'; *p2++ = '1'; goto S3;
}
S22:
if (p1 == pend) { *p2++ = '2'; *p2++ = '2'; goto END; }
switch (*p1++) {
case '1': *p2++ = '2'; *p2++ = '2'; goto S1;
case '2': *p2++ = '3'; *p2++ = '2'; goto S0;
case '3': *p2++ = '2'; *p2++ = '2'; goto S3;
}
S33:
if (p1 == pend) { *p2++ = '2'; *p2++ = '3'; goto END; }
switch (*p1++) {
case '1': *p2++ = '2'; *p2++ = '3'; goto S1;
case '2': *p2++ = '2'; *p2++ = '3'; goto S2;
case '3': *p2++ = '3'; *p2++ = '3'; goto S0;
}
END:
s2.length = p2 - s2.ptr;
s1 = s2;
Converting that to C the code is really fast. One of those blocks is few (3-6) assembly instructions (the assembly code is full of jumps). Later I have performed the allocation just once up front, for both strings. In the C versions I have used the __builtin_expect() that allows me to tell the compiler that usually p1 != pend. The resulting versions are audioactive_seq2.d and audioactive_seq3.c, they are almost the same, but the C versions is faster: it's able to finds the 66th term in 7.1s (62_226_614 digits) on a Pentium3 @ 500MHz with 256MB RAM. With that I have found the 72th term, that has 305_351_794 digits in few minutes. Now it's clear that this program is fast enough, but pure speed isn't enough, few minutes of running time show that a different solution must be found. Someone has found the 91th term, that has 47_032_657_188 digits, this is incredible.
So I have to reduce memory used. There are just 3 digits, so 2 bits/digits suffice. So 4 digits can be packed in a byte, but this makes the finite state machine really complex, probably too much complex for the time I can use to this problem. A simpler compromise is to use 1 byte to store 2 digits, and seeing that all term lengths are even, that simplifies code.
Later I have read that digits aren't equally represented, "1" is more common (frequencies: 50%, 32%, 18%). A simple way to encode the digits as bits is to use 0 for "1", 10 for "2", and 11 for "3". This leads to an efficient (Huffman) encoding of the symbols. It turns out that it's easy enough to modify the state machine to manage such bits, and D language offers compiler intrinsic to manage arrays of bits in a very simple way. Here are the first two states of the 7 state machine:
S0:
if (p1 == pend) goto END;
if (testbit(s1, p1++)) {
if (testbit(s1, p1++)) {
goto S3;
} else {
goto S2;
}
} else {
goto S1;
}
S1:
if (p1 == pend) { writezero(s2, p2++); writezero(s2, p2++); goto END; }
if (testbit(s1, p1++)) {
if (testbit(s1, p1++)) {
writezero(s2, p2++); writezero(s2, p2++); goto S3;
} else {
writezero(s2, p2++); writezero(s2, p2++); goto S2;
}
} else {
goto S11;
}
...
This allows to use about 1.5 bits/digit, it's almost the optimal encoding (for single digits. But the digits of the look and say sequence contain less than 100 different blocks of digits, using such blocks as macro-symbols, plus a conversion table, the next term of the sequence can be computed in very short time, using a tiny amount of memory. This may be the trick to allow to compute the len of the 91th term). The resulting code is in audioactive_seq4.d, it's about two times slower than audioactive_seq2.d, so it shows that despite all the speed of this computation is CPU-bound, it's not bound to the transfer rate from cache and CPU (because audioactive_seq4.d needs such transfer rate to be much lower, just about 18% of the precedent one).
This has allowed me to find the 77th term, with 1_149_440_192 digits, in few minutes. So the limiting factor is memory still, it's not computing time yet.
Probably there is a way to pack the two sequences (two adjacent terms of the series) more in memory, because they are scanned monotonically and almost in parallel. So most of the same memory can be shared for both sequences. This allows to use almost half the (RAM) memory. Both sequences must be scanned in the same forward direction because backwards memory scanning is slower (but this isn't a problem in Python. So in Python you can just perform an alternating scanning of a single array.array("c"). In Python too an automata can be used, and with the help of Psyco this may lead to a fast program, faster than the "fastest ones" of audioactive_seq.py. I have seen that often a good way to speed-optimize a Python program is to translate it to D/C, optimize that again and again, and then back-translating it to Psyco).
Another simple way to "solve" the memory problem is to write the term on disk (encoded with 1.5 bit/digit), then read it to compute the next one, etc. The reads and writes are sequential, so they are fast enough (a memory mapped file can be an easy solution). | comments: Leave a comment  |
|