leonardo
View:Recent Entries.
View:Archive.
View:Friends.
View:User Info.
View:Website (My Website).
You're looking at the latest 4 entries.

Tags:, , , ,
Subject:Slow D
Time:03:25 am
Here I have collected few benchmarks where the D language, or the DMD compiler, or the Phobos std lib don't shine in their performance. Two of the following performance problems (see 'gc1' and 'sort' benchmarks) have already a solution.

http://www.fantascienza.net/leonardo/js/slow_d.zip

--------------------

Compilers/interpreters used:
Digital Mars D Compiler v1.028

gcc version 4.2.1-dw2 (mingw32-2)

Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)] on win32

Psyco V.1.5.2 (JIT for Python)

javac 1.6.0_03, Java version "1.6.0_03"


All timings are "warm", best of 3, in seconds.
CPU used is a Pentium3 500 MHz, 256 MB RAM, with Win.


Compilation flags used (when not specified otherwise):
gcc: -O3 -s
dmd: -O -release -inline
Python: no flags.

-------------------
BENCHMARKS:

recursive, n = 38:
  C: 3.89 s (with __builtin_expect)
  C: 3.99 s
  D: 4.95 s


hash, n = 500_000:
  D:     7.07 s (7.15 s with GC patched)
  D:     5.89 s (GC disabled)
  Psyco: 4.23 s
(Note that Python+Psyco has much bigger overhead compared to D in any instruction, simple loops too).
wordcount, on a txt file of 13_312_768 bytes:
  C: 0.73 s
  D: 5.77 s


gc1, with 6.3 MB of text:
        loading   splitting   total
         time        time     time
  D:       1.87 s     3.58 s   13.76 s
  Python:  0.28 s     1.98 s    3.52 s
  Python:  0.1 s      2.0 s     3.38 s (load with 'rb', same result)
With a Patch to the D GC, plus disabling the GC:
  D:       0.08 s     0.72 s    1.36 s (GC patched + GC disabled after load)


gc2, n=1_000, m=10_000:
               seconds  MB
  DMD class:   18.95    1.7
  GDC class:   17.91    1.8
  DMD struct:  11.77    1.7
  GDC struct:  12.31    1.8
  Python:      37.10    3.1
  Psyco:       15.68    3.5
  Java:         2.19    7.3


gc3 (binarytrees), n = 15:
  Java:   9.12 s
  D:     35.01 s


sort, const n = 1_000_000:
  Random distribution:
    sort: 2.934
    fastSort: 0.881
  Already sorted arrays:
    sort: 1.723
    fastSort: 0.41
 Inverted order arrays:
    sort: 1.853
    fastSort: 0.651
(For a better and more complete version of fastSort see my D libs:
www.fantascienza.net/leonardo/so/libs_d.zip ).
comments: Leave a comment Add to Memories Tell a Friend

Tags:,
Subject:Look and say Sequence - 4
Time:10:03 pm
GCC supports gotos to variables addresses too, they can be useful in some special situations, like when you want to optimize an interpreter (and sometimes a finite state machine too, see below), you can find a short post about gcc-specific C extensions:
http://majek4.blogspot.com/2008/03/useful-c-extensions-gcc-specific.html

This is example code from that blog post:
#include "stdio.h"
#include "assert.h"

int main(int argc, char** argv) {
    int value = 2;
    if (argc == 2)
        value = atoi(argv[1]);
    assert(value >= 0 && value <= 2);

    const void *labels[] = {&&VAL0, &&VAL1, &&VAL2};

    goto *labels[value];
    assert(0);

    VAL0:
        puts("The value is 0");
        goto END;
    VAL1:
        puts("The value is 1");
        goto END;
    VAL2:
        puts("The value is 2");
        goto END;
    END:
        return 0;
}
Compiling that C code with MinGW (GCC) V.4.2.1 it produces the following assembly (two parts in the middle of the code):
...
LC1:
	.ascii "value >= 0 && value <= 2\0"
LC2:
	.ascii "The value is 0\0"
LC3:
	.ascii "The value is 1\0"
LC4:
	.ascii "The value is 2\0"
...

...
	call	__assert   ; corrisponde alla riga assert(0);
L6:
	movl	$L7, -20(%ebp)
	movl	$L8, -16(%ebp)
	movl	$L9, -12(%ebp)
	movl	-8(%ebp), %eax
	movl	-20(%ebp,%eax,4), %eax
	jmp	*%eax
L7:
	movl	$LC2, (%esp)
	call	_puts
	jmp	L10
L8:
	movl	$LC3, (%esp)
	call	_puts
	jmp	L10
L9:
	movl	$LC4, (%esp)
	call	_puts
L10:
	movl	$0, %eax
	addl	$36, %esp
	popl	%ecx
	popl	%ebp
	leal	-4(%ecx), %esp
	ret
...
Where the jmp *%eax is the computed goto.
Note that in ANSI C you can write the middle part of that code as:
switch (value) {
    case 0:
        puts("The value is 0");
        break;
    case 1:
        puts("The value is 1");
        break;
    case 2:
        puts("The value is 2");
        break;
}
return 0;
But in more complex situations it may help.

I have tried to use that kind of goto in the C version of the Audioactive series generator, and it speeds up the code about 3%. Inside the S0 state the following code:
switch (*p1++) {
    case '1': goto S1;
    case '2': goto S2;
    case '3': goto S3;
}
Is replaced by:
goto *labelsS123[*p1++ - '1'];
Where labelsS123 is:
const void *labelsS123[] = {&&S1, &&S2, &&S3};
comments: Leave a comment Add to Memories Tell a Friend

Tags:, , ,
Subject:Look and say Sequence - 3
Time:11:46 am
I have removed a bug from the D/C code, and I have added a faster Python version. Psyco is able to speed it up *25* times, so it's just 3 times slower than the faster C version (that uses a different algorithm), and it uses essentially the same memory:
http://www.fantascienza.net/leonardo/js/audioactive.zip

Beside that Python version, I have tried to translate to Python the finite state automata version too, producing something like this:
p1 = p2 = 0
pend = len1
S0, S1, S2, S3, S11, S22, S33, END = xrange(8)
state = S0

while True:
    if state == S0:
        if p1 == pend: state = END
        elif s1[p1] == "1": p1 += 1; state = S1
        elif s1[p1] == "2": p1 += 1; state = S2
        elif s1[p1] == "3": p1 += 1; state = S3

    elif state == S1:
        if p1 == pend: s2[p2] = '1'; p2 += 1; s2[p2] = '1'; p2 += 1; state = END
        elif s1[p1] == "1": p1 += 1; state = S11
        elif s1[p1] == "2": p1 += 1; s2[p2] = '1'; p2 += 1; s2[p2] = '1'; p2 += 1; state = S2
        elif s1[p1] == "3": p1 += 1; s2[p2] = '1'; p2 += 1; s2[p2] = '1'; p2 += 1; state = S3
...
But it results a bit slower than the version (derived from the switc-based C/D version that I have later removed) that you can find in the audioactive zip.
comments: Leave a comment Add to Memories Tell a Friend

Tags:, , ,
Subject:Look and say Sequence - 2
Time:03:44 am
I have updated the Look and say Sequence generators:
http://www.fantascienza.net/leonardo/js/audioactive.zip

See for information:
http://en.wikipedia.org/wiki/Look-and-say_sequence
http://www.research.att.com/~njas/sequences/A005150

This is a very simple sequence, but it can be generated in many different ways, and I have explored many different interesting things. First of all I have found my fastest Python/Psyco versions, looking among more than ten versions.

The faster Python versions use a kind of pattern matching, splitting the precedent term (seen as a string of digits) with groupby, in groups of equal digits, then replacing those patterns using a dictionary of patterns.

Once I have created the fastest Python/Psyco versions, I have tried to use D to generate the terms of the Look and say sequence. First of all I have used my D libs to produce programs as short as Python ones. Then I have tried to write longer but much faster programs.

A first quite fast D version of mine uses two nested switches, and appends the corresponding pair of digits:
foreach (c; last) {
    if (count == 0) {
        cchar = c;
        count++;
    } else {
        if (c == cchar)
            count++;
        else {
            switch (cchar) {
                case '1':
                    switch (count) {
                        case 1: r ~= "11"; break;
                        case 2: r ~= "21"; break;
                        case 3: r ~= "31"; break;
                    }
                    break;
                case '2':
                    switch (count) {
                        case 1: r ~= "12"; break;
                        case 2: r ~= "22"; break;
                        case 3: r ~= "32"; break;
                    }
                    break;
                case '3':
                    switch (count) {
                        case 1: r ~= "13"; break;
                        case 2: r ~= "23"; break;
                        case 3: r ~= "33"; break;
                    }
                    break;
            }
            cchar = c;
            count = 1;
        }
    }
}

But later I have kept improving that. I have found an expression that gives an upper bound of the number of digits of the successive term of the series, so I have used that to pre-allocate the whole term, and avoid array appends. Later I have used char pointer to speed up the code, and I have allocated memory from the non-GC heap. I have stated to use C too, because the GCC 4.2.1 compiler is a better optimizer than the DMD v.1.027 D compiler.

One strategy that leads to some of the faster programs is to turn the program into a finite state machine. With a bit of thinking I have found the automata and I have simplified it to just seven states:
S0:
    if (p1 == pend) goto END;
    switch(*p1++) {
        case '1': goto S1;
        case '2': goto S2;
        case '3': goto S3;
    }

S1:
    if (p1 == pend) { *p2++ = '1'; *p2++ = '1'; goto END; }
    switch (*p1++) {
        case '1': goto S11;
        case '2': *p2++ = '1'; *p2++ = '1'; goto S2;
        case '3': *p2++ = '1'; *p2++ = '1'; goto S3;
    }

S2:
    if (p1 == pend) { *p2++ = '1'; *p2++ = '2'; goto END; }
    switch (*p1++) {
        case '1': *p2++ = '1'; *p2++ = '2'; goto S1;
        case '2': goto S22;
        case '3': *p2++ = '1'; *p2++ = '2'; goto S3;
    }

S3:
    if (p1 == pend) { *p2++ = '1'; *p2++ = '3'; goto END; }
    switch (*p1++) {
        case '1': *p2++ = '1'; *p2++ = '3'; goto S1;
        case '2': *p2++ = '1'; *p2++ = '3'; goto S2;
        case '3': goto S33;
    }

S11:
    if (p1 == pend) { *p2++ = '2'; *p2++ = '1'; goto END; }
    switch (*p1++) {
        case '1': *p2++ = '3'; *p2++ = '1'; goto S0;
        case '2': *p2++ = '2'; *p2++ = '1'; goto S2;
        case '3': *p2++ = '2'; *p2++ = '1'; goto S3;
    }

S22:
    if (p1 == pend) { *p2++ = '2'; *p2++ = '2'; goto END; }
    switch (*p1++) {
        case '1': *p2++ = '2'; *p2++ = '2'; goto S1;
        case '2': *p2++ = '3'; *p2++ = '2'; goto S0;
        case '3': *p2++ = '2'; *p2++ = '2'; goto S3;
    }

S33:
    if (p1 == pend) { *p2++ = '2'; *p2++ = '3'; goto END; }
    switch (*p1++) {
        case '1': *p2++ = '2'; *p2++ = '3'; goto S1;
        case '2': *p2++ = '2'; *p2++ = '3'; goto S2;
        case '3': *p2++ = '3'; *p2++ = '3'; goto S0;
    }

END:
    s2.length = p2 - s2.ptr;
    s1 = s2;
Converting that to C the code is really fast. One of those blocks is few (3-6) assembly instructions (the assembly code is full of jumps). Later I have performed the allocation just once up front, for both strings.
In the C versions I have used the __builtin_expect() that allows me to tell the compiler that usually p1 != pend. The resulting versions are audioactive_seq2.d and audioactive_seq3.c, they are almost the same, but the C versions is faster: it's able to finds the 66th term in 7.1s (62_226_614 digits) on a Pentium3 @ 500MHz with 256MB RAM.
With that I have found the 72th term, that has 305_351_794 digits in few minutes. Now it's clear that this program is fast enough, but pure speed isn't enough, few minutes of running time show that a different solution must be found. Someone has found the 91th term, that has 47_032_657_188 digits, this is incredible.

So I have to reduce memory used. There are just 3 digits, so 2 bits/digits suffice. So 4 digits can be packed in a byte, but this makes the finite state machine really complex, probably too much complex for the time I can use to this problem. A simpler compromise is to use 1 byte to store 2 digits, and seeing that all term lengths are even, that simplifies code.

Later I have read that digits aren't equally represented, "1" is more common (frequencies: 50%, 32%, 18%). A simple way to encode the digits as bits is to use 0 for "1", 10 for "2", and 11 for "3". This leads to an efficient (Huffman) encoding of the symbols. It turns out that it's easy enough to modify the state machine to manage such bits, and D language offers compiler intrinsic to manage arrays of bits in a very simple way. Here are the first two states of the 7 state machine:
S0:
    if (p1 == pend) goto END;
    if (testbit(s1, p1++)) {
        if (testbit(s1, p1++)) {
            goto S3;
        } else {
            goto S2;
        }
    } else {
        goto S1;
    }

S1:
    if (p1 == pend) { writezero(s2, p2++); writezero(s2, p2++); goto END; }
    if (testbit(s1, p1++)) {
        if (testbit(s1, p1++)) {
            writezero(s2, p2++); writezero(s2, p2++); goto S3;
        } else {
            writezero(s2, p2++); writezero(s2, p2++); goto S2;
        }
    } else {
        goto S11;
    }
...
This allows to use about 1.5 bits/digit, it's almost the optimal encoding (for single digits. But the digits of the look and say sequence contain less than 100 different blocks of digits, using such blocks as macro-symbols, plus a conversion table, the next term of the sequence can be computed in very short time, using a tiny amount of memory. This may be the trick to allow to compute the len of the 91th term). The resulting code is in audioactive_seq4.d, it's about two times slower than audioactive_seq2.d, so it shows that despite all the speed of this computation is CPU-bound, it's not bound to the transfer rate from cache and CPU (because audioactive_seq4.d needs such transfer rate to be much lower, just about 18% of the precedent one).

This has allowed me to find the 77th term, with 1_149_440_192 digits, in few minutes. So the limiting factor is memory still, it's not computing time yet.

Probably there is a way to pack the two sequences (two adjacent terms of the series) more in memory, because they are scanned monotonically and almost in parallel. So most of the same memory can be shared for both sequences. This allows to use almost half the (RAM) memory. Both sequences must be scanned in the same forward direction because backwards memory scanning is slower (but this isn't a problem in Python. So in Python you can just perform an alternating scanning of a single array.array("c"). In Python too an automata can be used, and with the help of Psyco this may lead to a fast program, faster than the "fastest ones" of audioactive_seq.py. I have seen that often a good way to speed-optimize a Python program is to translate it to D/C, optimize that again and again, and then back-translating it to Psyco).

Another simple way to "solve" the memory problem is to write the term on disk (encoded with 1.5 bit/digit), then read it to compute the next one, etc. The reads and writes are sequential, so they are fast enough (a memory mapped file can be an easy solution).
comments: Leave a comment Add to Memories Tell a Friend

leonardo
View:Recent Entries.
View:Archive.
View:Friends.
View:User Info.
View:Website (My Website).
You're looking at the latest 4 entries.