leonardo ([info]leonardo_m) wrote,
@ 2009-10-19 12:10:00
Previous Entry  Add to memories!  Tell a Friend  Next Entry
Entry tags:benchmark, c++, d language, programming

Slow allocation of D objects
Allocating objects in D language, usign the good and efficient LDC compiler, may seem slower than doing the same thing in C++, but the situation is a little more complex, so few examples can show what's going on.

This is a syntetic C++ benchmark, it allocates an array of n pointers to Foo object, and then allocates the instances, that contain ten 32 bit integers:

// C++ version 1, 0.46 s
#include "stdio.h"
#include "stdlib.h"

using namespace std;

class Foo {
    public:
        int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
        Foo(int, int, int, int, int, int, int, int, int, int);
};

Foo::Foo(int x1, int x2, int x3, int x4, int x5, int x6, int x7, int x8, int x9, int x10):
  y1(x1), y2(x2), y3(x3), y4(x4), y5(x5), y6(x6), y7(x7), y8(x8), y9(x9), y10(x10) {}

int main(int argc, char *argv[]) {
    int n = (argc == 2) ? atoi(argv[1]) : 5;
    Foo* foos[n];

    for (int i = 0; i < n; i++)
        foos[i] = new Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);

    printf("%d %d %d\n", foos[n-1]->y1, foos[n-1]->y2, foos[n-1]->y10);
    return 0;
}

This is an almost equivalent D code. I use printf and those imports are so hairy because this code is designed to work with both Phobos or Tango standard libraries, and with D1 and D2 languages:
// D version 1, 0.89 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
} else {
    import std.c.stdio: printf;
    version (D_Version2) {
        import std.conv: to;
        alias to!(int, char[]) toInt;
    } else
        import std.conv: toInt;
}

class Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;

    this(int x1, int x2, int x3, int x4, int x5, int x6, int x7, int x8, int x9, int x10) {
        y1 = x1;
        y2 = x2;
        y3 = x3;
        y4 = x4;
        y5 = x5;
        y6 = x6;
        y7 = x7;
        y8 = x8;
        y9 = x9;
        y10 = x10;
    }
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo[] foos = new Foo[n];

    for (int i = 0; i < n; i++)
        foos[i] = new Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);

    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

Replacing in the C++ the class like this doesn't change the running time of the program:
// piece of the C++ version 2, 0.47 s
class Foo {
    public:
        int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
        Foo(int, int, int, int, int, int, int, int, int, int);
};

Foo::Foo(int x1, int x2, int x3, int x4, int x5, int x6, int x7, int x8, int x9, int x10) {
    y1 = x1;
    y2 = x2;
    y3 = x3;
    y4 = x4;
    y5 = x5;
    y6 = x6;
    y7 = x7;
    y8 = x8;
    y9 = x9;
    y10 = x10;
}

The running times on a Ubuntu running on VirtualBox, running on Vista 32 bit, running on a Celeron CPU, using LDC compiler based on DMD v1.045 and llvm 2.6 (Thu Sep 10 23:50:27 2009):
Timings, n = 1_000_000, best of 3, seconds:
  C++ 1: 0.46  |#########                |
  C++ 2: 0.47  |#########                |
  D 1:   0.89  |##################       |
  D 2:   0.82  |################         |
  D 3:   0.87  |#################        |
  D 4:   0.46  |#########                |
  D 5:   0.46  |#########                |
  D 6:   1.22  |#########################|
  D 6b:  0.86  |#################        |
  D 7:   0.63  |############             |
  D 8:   0.71  |##############           |

I have compiled the C++ code with gcc V.4.3.3 with:
g++ -Wall -O3 -s -fomit-frame-pointer -msse3 -march=native objbench1_cpp.cpp -o objbench1_cpp
And the D code with:
ldc -O5 -release -inline objbench1_d.d

As you can see the D version is about twice slower than the C++ code, despite usually the LDC compiler is almost as efficient as GCC (LLVM is not able to perform some optimizations yet, like auto-vectorization, but things are already good and they will improve).

D classes are different from C++ ones, D objects always contain a pointer to the virtual table (even if no virtual methods are present, it's used by the reflection and GC too) and a monitor, so on 32 bit systems they need 8 extra bytes. So we may try to remove such memory (and bookeeping) overhead using a struct (this version can be changed a little, so it works with Phobos of D2 too):
// D version 2, 0.82 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    
    Foo*[] foos = new Foo*[n];

    for (int i = 0; i < n; i++) {
        foos[i] = new Foo;
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }
    
    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

But you can see the performance improves only a little, so the problem is elsewhere. I have had to use that two-line initialization because only in D2 language structs are allowed to have an explicit constructor.

We can try allocating the pointer array on the stack but the performance gets a little worse:
// D version 3, 0.87 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.stdc.stdlib: alloca;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.c.stdlib: alloca;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo** ptr = cast(Foo**)alloca((Foo*).sizeof * n);
    Foo*[] foos = ptr[0 .. n];
    
    for (int i = 0; i < n; i++) {
        foos[i] = new Foo;
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }

    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

A next step is to allocate the structs from the C heap, this time the performance is about the same as the C++ code, showing that the cause of the slowdown is the (not efficient yet) D Garbage Collector:
// D version 4, 0.46 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.stdc.stdlib: alloca, malloc;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.c.stdlib: alloca, malloc;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo** ptr = cast(Foo**)alloca((Foo*).sizeof * n);
    Foo*[] foos = ptr[0 .. n];
    
    for (int i = 0; i < n; i++) {
        foos[i] = cast(Foo*)malloc(Foo.sizeof);
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }

    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

So we can restore the original allocation of the dynamic array of pointers, the performance improves a tiny bit (less than 0.01 s):
// D version 5, 0.46 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.stdc.stdlib: malloc;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.c.stdlib: malloc;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo*[] foos = new Foo*[n];
    
    for (int i = 0; i < n; i++) {
        foos[i] = cast(Foo*)malloc(Foo.sizeof);
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }

    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

We may think that the problem is in the double initalization of the structs, it seems that's not the case, as the following code is the slowest, here I use uninitialized memory from the GC heap:
// D version 6, 1.22 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.core.Memory: GC;
    alias GC.malloc gcmalloc;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.gc: gcmalloc = malloc;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo*[] foos = new Foo*[n];
    
    for (int i = 0; i < n; i++) {
        foos[i] = cast(Foo*)gcmalloc(Foo.sizeof);
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }

    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

I don't fully know why the performance of the version 6 is so low (if you have ideas please tell me), part of that lower performance comes from the GC that scans the memory of those structs. You can see it in the version 6b, where I have disabled the scanning of those memory blocks, there's indeed a significant performance improvement:
// D version 6b, 0.86 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.core.Memory: GC;
    alias GC.malloc gcmalloc;
    void hasNoPointers(void* p) {
        GC.setAttr(p, GC.BlkAttr.NO_SCAN);
    }
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.gc: gcmalloc = malloc, hasNoPointers;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    Foo*[] foos = new Foo*[n];

    for (int i = 0; i < n; i++) {
        foos[i] = cast(Foo*)gcmalloc(Foo.sizeof);
        hasNoPointers(foos[i]);
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);
    }

    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

But even if the GC doesn't scan those memory blocks, it performs several operations anyway, so when you allocate many pieces of memory and you need performance, you may want to disable the GC (and enable it just after the allocation, here I have not added the enable(), but in your code you are supposed to put it), the performance is intermediate:
// D version 7, 0.63 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.core.Memory: GC;
    alias GC.disable disable;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.gc: disable;
}

struct Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    disable();
    
    Foo*[] foos = new Foo*[n];

    for (int i = 0; i < n; i++) {
        foos[i] = new Foo;
        *foos[i] = Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);  
    }
    
    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

We can also go back ot the object-based version, with a further little reduction of performance:
// D version 8, 0.71 s
version (Tango) {
    import tango.stdc.stdio: printf;
    import Integer = tango.text.convert.Integer;
    alias Integer.parse toInt;
    import tango.core.Memory: GC;
    alias GC.disable disable;
} else {
    import std.c.stdio: printf;
    import std.conv: toInt;
    import std.gc: disable;
}

class Foo {
    int y1, y2, y3, y4, y5, y6, y7, y8, y9, y10;

    this(int x1, int x2, int x3, int x4, int x5, int x6, int x7, int x8, int x9, int x10) {
        y1 = x1;
        y2 = x2;
        y3 = x3;
        y4 = x4;
        y5 = x5;
        y6 = x6;
        y7 = x7;
        y8 = x8;
        y9 = x9;
        y10 = x10;
    }
}

void main(char[][] args) {
    int n = (args.length == 2) ? toInt(args[1]) : 5;
    disable();
    Foo[] foos = new Foo[n];

    for (int i = 0; i < n; i++)
        foos[i] = new Foo(i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10);

    printf("%d %d %d\n", foos[$-1].y1, foos[$-1].y2, foos[$-1].y10);
}

Having a garbage collector is handy, and sometimes safer too, but it has a price too (in the size of the binary too). D allows you to not use the GC when you need C++-like performance, but then you have to remember to manage and deallocate memory manually as in C, or you need to implement/use other forms or memory management as in C++ (and D allows the scope(exit) idiom too to deallocate memory when a scope ends). If the D language will get widespread, its GC will improve, restoring part of the lost performance.



Create an Account
Forgot your login or password?
Login w/ OpenID
English • Español • Deutsch • Русский…