More on D associative array performance - leonardo
Subject:More on D associative array performance
Time:03:54 pm
Sometimes the simplest code is the best way to show a problem (update: I have added a Python version 3 that better shows the true speed differences):
Timings, N = 10_000_000, best of 3:
  Python2:            1.22 s
  Python1 + Psyco:    1.48 s 
  Python3:            1.56 s
  Python1:            2.07 s
  D DMD:             24.34 s

Timings, N = 20_000_000, best of 3:
  Python2:            2.35 s
  Python1 + Psyco:    2.87 s
  Python3:            3.04 s
  Python1:            4.05 s
  D DMD:            129.8 s

DMD v1.029, compiled with -O -release -inline
Python 2.5.2
WinXP, Pentium Dual Core 2 GHz

---------------------

SOURCE CODE:

Python version 1
(comment out the Psyco line to disable it):

def main():
    N = 20000000
    aa = {}
    for i in xrange(N):
        aa[i] = 0
import psyco; psyco.bind(main)
main()

---------------------

Python version 2:

set(xrange(20000000))

---------------------

Python version 3:

dict.fromkeys(xrange(20000000), 0)

---------------------

D code:

void main() {
    const uint N = 20_000_000;
    size_t[long] aa;
    for (uint i; i < N; ++i)
        aa[i] = 0;
}

---------------------

(Anonymous)
Subject:Rehash
Time:2008-06-09 04:57 pm (UTC)
Did you try to rehash the AA at regular intervals?
What you see might simply be a degenerate hash table (or not?).

KB
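
KB's degenerate-table hypothesis is about the bucket count failing to keep up with the element count. As a rough analogy only (CPython's dict resizes automatically, unlike the D AA being measured), the resize points can be observed with sys.getsizeof; this is a sketch of the mechanism, not a measurement of the D runtime:

```python
import sys

d = {}
grow_points = []
prev = sys.getsizeof(d)
for i in range(100000):
    d[i] = 0
    size = sys.getsizeof(d)
    if size != prev:            # the hash table was resized (a rehash happened)
        grow_points.append(i)
        prev = size

print(grow_points)              # resizes become geometrically rarer as the table grows
```

If the table never grew, every bucket would accumulate a long collision chain and each insert would degrade toward linear cost, which is one way to end up with the super-linear totals seen above.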

leonardo_m
Subject:Re: Rehash
Time:2008-06-09 05:07 pm (UTC)
I have tried rehashing every 1 million elements, but it seems the net result is more memory usage and more running time (like 30+ seconds for the N=10 million case instead of 24+ seconds).

(Anonymous)
Subject:Re: Rehash
Time:2008-06-09 06:18 pm (UTC)
What's even worse is the scaling implied in the numbers;
though more data would be nice, this looks like > N^2
for this particular case.

I would like to see this with pseudo-random numbers
and some other permutation of xrange(10**9).

-phoku

leonardo_m
Subject:Re: Rehash
Time:2008-06-09 07:42 pm (UTC)
Thank you for all the answers. I'll take some time to try to answer everyone.

phoku:
>What's even worse is the scaling implied in the numbers, though more data would be nice, this looks like > N^2 for this particular case.<
Here are more data points for the D program, best of 3:
 100_000      0.05 s
 500_000      0.20 s
1_000_000     0.47 s
2_000_000     1.25 s
5_000_000     5.93 s
10_000_000   24.34 s
20_000_000  129.8  s

I have found those timings with this D code:

import std.conv: toUint;
void main(string[] args) {
    uint n = toUint(args[1]);
    size_t[long] aa;
    for (uint i; i < n; ++i)
        aa[i] = 0;
}
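
phoku's > N^2 reading can be sanity-checked by fitting a local exponent k (assuming time ~ N^k) between consecutive rows of the table above; a small Python sketch using those data points:

```python
from math import log

# (N, seconds) pairs from the table above
data = [(100000, 0.05), (500000, 0.20), (1000000, 0.47),
        (2000000, 1.25), (5000000, 5.93),
        (10000000, 24.34), (20000000, 129.8)]

exponents = []
for (n1, t1), (n2, t2) in zip(data, data[1:]):
    k = log(t2 / t1) / log(n2 / n1)  # local exponent: t ~ N^k
    exponents.append(k)
    print("%10d -> %10d: k = %.2f" % (n1, n2, k))
```

The exponent drifts upward with N: near-linear for the small sizes, but worse than quadratic between the 10 and 20 million runs, which matches phoku's reading.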

phoku:
>I would like to see this with pseudo-random numbers and some other permutation of xrange(10**9).<

I have tried it with prime numbers (long), with similar results.


pragma_x
Subject:Ugh.
Time:2008-06-09 06:02 pm (UTC)
Those are some pretty embarrassing numbers. I think it's time to start diving into the generated assembler and see if it's a codegen problem or a library problem.

OOC, how do the timings look if you use a plain array instead of a hashtable?


pragma_x
Subject:One more thing.
Time:2008-06-09 06:38 pm (UTC)
I recall seeing very poor performance in DDL when putting arrays and maps through the wringer due to lots and lots of discrete appends. So it could be that the pre-allocation for the map is just very conservative, and is re-allocating itself more times than the equivalent operation would do in other languages.

While I do know how to pre-allocate an array in D, I have no clue how to do it for a map.

You could try running the test *twice* back-to-back in the same code; just time the second pass after the map is already allocated. If you see a significant speed increase (one order of magnitude or more), you'll know why.
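
pragma_x's two-pass idea, sketched with a Python dict purely for illustration (the map analogue, not the D measurement): fill once so the table reaches its final size, then time a second pass over the same keys.

```python
import time

def fill(d, n):
    for i in range(n):
        d[i] = 0

n = 1000000
d = {}

t0 = time.time()
fill(d, n)                  # first pass: includes all table growth and rehashing
first = time.time() - t0

t0 = time.time()
fill(d, n)                  # second pass: the table is already fully sized
second = time.time() - t0

print("first: %.3f s  second: %.3f s" % (first, second))
```

If the second pass is dramatically faster, most of the first-pass cost was allocation and rehashing rather than the hashing and lookups themselves.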

leonardo_m
Subject:Re: One more thing.
Time:2008-06-09 08:03 pm (UTC)
pragma_x:
>You could try running the test *twice* back-to-back in the same code; just time the second pass after the map is already allocated. If you see a significant speed increase (one order of magnitude or more), you'll know why.<

The timings for the second pass seem nearly linear (note that the second pass is considerably faster in Python too):
 1_000_000  0.03 s
 2_000_000  0.06 s
 5_000_000  0.20 s
10_000_000  0.33 s
20_000_000  0.66 s


The D code I have used:

import std.conv: toUint;
import d.time: clock; // www.fantascienza.net/leonardo/so/libs_d.zip
void main(string[] args) {
    uint n = toUint(args[1]);
    size_t[long] aa;
    for (uint i; i < n; ++i)
        aa[i] = 0;
    float t = clock();
    for (uint i; i < n; ++i)
        aa[i] = ulong.max;
    printf("%f\n", clock() - t);
}

leonardo_m
Subject:Re: Ugh.
Time:2008-06-09 07:50 pm (UTC)
pragma_x
>how do the timings look if you use a plain array instead of a hashtable?<

Not great with DMD, but much better than the AA versions:
D3:
10_000_000   0.10 s
100_000_000  0.64 s
200_000_000  1.25 s
300_000_000  1.87 s

D4:
10_000_000   0.08 s
100_000_000  0.57 s
200_000_000  1.13 s
300_000_000  1.69 s

D5:
10_000_000   0.06 s
100_000_000  0.35 s
200_000_000  0.68 s
300_000_000  1.00 s

CODE:

// D3
import std.conv: toUint;
void main(string[] args) {
    uint n = toUint(args[1]);
    auto a = new uint[n];
    for (uint i; i < n; ++i)
        a[i] = 0;
}


// D4
import std.conv: toUint;
import std.gc: malloc;
void main(string[] args) {
    uint n = toUint(args[1]);
    uint* a = cast(uint*)malloc(n * uint.sizeof);
    for (uint i; i < n; ++i)
        a[i] = 0;
}


// D5
import std.conv: toUint;
import std.c.stdlib: malloc;
void main(string[] args) {
    uint n = toUint(args[1]);
    uint* a = cast(uint*)malloc(n * uint.sizeof);
    for (uint i; i < n; ++i)
        a[i] = 0;
}

leonardo_m
Subject:Re: Ugh.
Time:2008-06-09 08:06 pm (UTC)
Note that in those tests I have used uints and not ulongs, sorry. But I think the overall result isn't much different.

(Anonymous)
Time:2008-06-09 07:07 pm (UTC)
I think the GC is mostly to blame here. When using the HashAA from MinTL, I'm getting results similar to those of Python. I've tried the MallocNoRoots allocator, since it seems to make the most sense here. But it still kind of sucks, since it doesn't do any buffering, and I'd guess that Python has a specialized pool for ints.

// The rest of this post is not about MinTL:

When a pooling allocator is used, I'm getting 3.2 secs for D (might be lower with some tweaking of the allocator, I presume) and 2 secs for the fastest Python version (version 2).

When setting the number of hash buckets to N, the time goes down to 2.5 secs for D. Disabling the GC makes it 1.5 sec.

It should be kept in mind that the current GC in D is very simplistic and is capable of killing performance rather quickly, thus low-level management has to be used sometimes. As the built-in AA-s rely on the GC, they tend to have rather limited usage.

As for the hash map implementation I've used for testing... I'm obliged not to disclose any details yet, but it will make quite a splash when the parent container lib sees the light of day, so stay tuned :>

-h3r3tic

leonardo_m
Time:2008-06-09 08:10 pm (UTC)
h3r3tic:
>and 2 secs for the fastest Python version (version 2).<

It's better to use Python version 3 as the reference, because it's the fairest comparison. Python version 2 builds a set, which requires no memory for the values (though in practice the timings aren't far apart, I know).


h3r3tic:
>When setting the number of hash buckets to N, the time goes down to 2.5 secs for D. Disabling the GC makes it 1.5 sec.<

The GC can be disabled in Python too; it improved some of those timings a little.
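
For reference, a sketch of how the Python-side GC can be toggled around the insert loop; note that the gc module controls only the cyclic collector (reference counting still runs), so for plain int keys the effect is expectedly small:

```python
import gc
import time

n = 1000000
gc.disable()                  # turn off the cyclic collector during the bulk insert
t0 = time.time()
aa = {}
for i in range(n):
    aa[i] = 0
elapsed = time.time() - t0
gc.enable()

print("%.3f s for %d inserts" % (elapsed, n))
```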


h3r3tic:
>It should be kept in mind that the current GC in D is very simplistic and is capable of killing performance rather quickly, thus low-level management has to be used sometimes. As the built-in AA-s rely on the GC, they tend to have rather limited usage.<

I see. How can you use a different allocator with D built-in AAs?

(Anonymous)
Time:2008-06-09 08:17 pm (UTC)
OK, version 3 gives 2.5 sec for Python. Disabling the GC doesn't change the performance at all, but perhaps this is because I used 5M items instead of 20M, in order to keep my poor RAM safe ;)


> I see. How can you use a different allocator with D built AAs?

You can't do it without changing the runtime, AFAICS. I've used the custom allocator with an upcoming D container package.

-h3r3tic

zzzzrrr
Time:2008-06-11 01:54 am (UTC)
// I think the GC is mostly to blame here

Now that I know associative arrays have performance issues, I may switch to Tango HashMaps in Blaze, considering my use of AAs in important places....


This makes me wonder about dynamic array performance...? Any side-by-side performance comparisons with Python?

leonardo_m
Time:2008-06-11 09:45 am (UTC)
>Now that I know associative arrays have performance issues, I may switch to Tango HashMaps in Blaze, considering my use of AAs in important places....<

I think the Tango HashMaps may be too slow anyway. Do you have timings showing they are fast enough (for your purposes)? You may need C if you want enough speed.


>This makes me wonder about dynamic array performance...? Any side-by-side performance comparisons with Python?<

Python is a slow language (in most of its implementations), so such comparisons aren't very meaningful. For a better apples-to-apples comparison you may want to try ShedSkin, a compiler for Python. The following timings show that for dynamic arrays D is nearly as fast as Python, and it uses less memory:
Append timings, seconds:
               D      Psy
    100_000   0.04    0.10
  1_000_000   0.11    0.15
 10_000_000   0.83    0.68
 20_000_000   2.00    1.25
 50_000_000   4.90    3.00

Python 2.5.2 + Psyco
D DMD 1.029, uint[]

The programs:

// D:
import std.conv: toUint;
void main(string[] args) {
    uint n = toUint(args[1]);
    auto a = new uint[0];
    for (uint i; i < n; ++i)
        a ~= 0;
}


# Python + Psyco:
from sys import argv
def main():
    n = int(argv[1])
    a = []
    for i in xrange(n):
        a.append(0)
import psyco; psyco.bind(main)
main()

leonardo_m
Time:2008-06-11 11:32 am (UTC)
With C++ STL too (no Boehm-Demers-Weiser GC used here):
Append timings, seconds:
               D      Psy    C++STL
    100_000   0.04    0.10    0.03
  1_000_000   0.11    0.15    0.03
 10_000_000   0.83    0.68    0.15
 20_000_000   2.00    1.25    0.27
 50_000_000   4.90    3.00    0.58


#include <cstdio>
#include <cstdlib>
#include <vector>
using namespace std;

int main(int argc, char** argv) {
    int n = argc == 2 ? atoi(argv[1]) : 100;

    vector<int> a;
    for (int i = 0; i < n; ++i)
        a.push_back(0);

    return 0;
}

zzzzrrr
Time:2008-06-11 02:22 pm (UTC)
> I think the Tango HashMaps may be too much slow anyway.

On the contrary, I get the following results:

Assign timings, seconds:
                D AA    Tango HashMap
    100_000     0.02           0.0028
  1_000_000     0.46           0.028
 10_000_000    24.56           0.292
 20_000_000   231.68           0.634

D DMD 1.029

The programs:

// D AA
import tango.text.convert.Integer;
import tango.time.StopWatch;

void main(char[][] args) {
    auto elapsed = new StopWatch();
    char[] test = args[1];
    int n = toInt(test);
    size_t[long] aa;
    elapsed.start;
    for (uint i; i < n; ++i)
        aa[i] = 0;
    double time = elapsed.stop;
}

// Tango HashMap
import tango.text.convert.Integer;
import tango.time.StopWatch;
import tango.util.collection.HashMap;

alias HashMap!(int,int) MyIntList;

void main(char[][] args) {
    auto elapsed = new StopWatch();
    auto list = new MyIntList;
    char[] test = args[1];
    int n = toInt(test);
    elapsed.start;
    for (uint i; i < n; ++i)
        list.add(n, 0);
    double time = elapsed.stop;
}

leonardo_m
Time:2008-06-11 03:20 pm (UTC)
Good, I'm glad to be wrong. Note that HashMap has integer keys, not longs as in the Phobos version (but it probably won't make a big difference).

zzzzrrr
Time:2008-06-11 04:28 pm (UTC)
I changed the long to an integer and posted the results on this thread:
http://www.dsource.org/projects/tango/forums/topic/524

Not much of a change in performance....

leonardo_m
Time:2008-06-11 05:25 pm (UTC)
That code is wrong:
list.add(n, 0);
It adds the n key over and over.
I have installed Tangobos and timed a correct version, and it's slower than the built-in AAs for me (about 30 seconds for 10 million items).

zzzzrrr
Time:2008-06-11 05:59 pm (UTC)
No wonder it was so fast!

A very dumb mistake on my part! Thanks for the correction.

(Anonymous)
Time:2008-06-11 04:32 pm (UTC)
> I think the Tango HashMaps may be too slow anyway

Tango is what my code used. Since it's no longer a secret that the new containers are out ( http://dsource.org/projects/tango/forums/topic/523 ), I guess I can paste my code :)


import tango.util.container.HashMap;

void main() {
	const uint N = 5_000_000;

	auto aa = new HashMap!(long, int, Container.hash, Container.reap, Container.Chunk);
	aa.allocator.config (5000, 1000);
	aa.buckets = 7_500_000;

	for (uint i; i < N; ++i) {
		aa[i] = 0;
	}
}


BTW, why are the D versions supposed to use long as the key? On my machine (winxp sp2, python 2.5.1), Python integers seem to be 32bit.

-h3r3tic

leonardo_m
Time:2008-06-11 04:55 pm (UTC)
h3r3tic:
>I guess I can paste my code :)<

Good.


>import tango.util.container.HashMap;<

I think the DMD style guides say modules must start with a lowercase letter. And I agree with those style guides.


>auto aa = new HashMap!(long, int, Container.hash, Container.reap, Container.Chunk);<

That's a really long line, compared to the built-in AA syntax.
I like the idea of built-in AAs, but I have changed my opinion regarding built-ins like AAs and array.sort: I now think a language like D should start with everything OUT of the language, and pull things into itself (with a better syntax) only later, once a refined and efficient implementation of that functionality has already been developed. Putting the code inside the language, especially when it's not very open source, leads to ridiculous situations like the one in this thread, and in many others before it.


>aa.allocator.config (5000, 1000);
>aa.buckets = 7_500_000;

I don't know what those parameters mean, but I presume they are explained in the docs. Such parameters are probably cheating compared to the Python code :-)


>aa[i] = 0;

Good, I can see the add() method is no longer needed.


>BTW, why are the D versions supposed to use long as the key? On my machine (winxp sp2, python 2.5.1), Python integers seem to be 32bit.<

I have given a longer answer to this to Walter in the D newsgroup. The short answer is that I forgot the size of Python integers, sorry. But they are boxed and they require much more than 32 bits anyway. The good thing is that the D benchmarks give essentially the same timings in both cases.
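
The boxing point can be spot-checked with sys.getsizeof (added in Python 2.6, so not available in the 2.5.2 used above; the exact sizes are CPython implementation details):

```python
import sys

# a small Python int is a full heap object, far larger than 4 raw bytes
print(sys.getsizeof(1))

# a C-style uint would be 4 bytes; a D long key is 8
overhead = sys.getsizeof(1) - 4
print("boxing overhead vs. a raw uint: %d bytes" % overhead)
```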

(Anonymous)
Time:2008-06-11 07:31 pm (UTC)
> I think the DMD style guides say modules must start with a lowercase letter. And I agree with those style guides.

I guess you haven't browsed through Tango in a while :P Using MixedCase for modules means that you can have a module and a package with the same name, so e.g. foo.Bar might import all of foo.bar.*;


> That's a really long line, compared to the built-in AA syntax.

I'm willing to pay that price for much better speed.


> I don't know what those parameters mean, but I presume they are explained in the docs. Such parameters are probably cheating compared to the Python code :-)

The code is fast even without them, as I've explained in my earlier post. These lines are there 'because I can.' :P

-h3r3tic


leonardo_m
Time:2008-06-11 07:56 pm (UTC)
-h3r3tic:
>I guess you haven't browsed through Tango in a while :P<

I've just installed Tangobos for the first time; after more than a year of D usage, I still know little about Tango.


>Using MixedCase for modules means that you can have a module and a package with the same name, so e.g. foo.Bar might import all of foo.bar.*;<

I see...


>I'm willing to pay that price for much better speed.<

Can your HashMap code be merged back into Phobos, to replace the built-in AAs? Otherwise the built-in ones become useless filler :-)


>The code is fast even without them, as I've explained in my earlier post. These lines are there 'because I can.' :P<

You are right. How can I try your HashMap, having just installed Tangobos? Thank you.

(Anonymous)
Time:2008-06-11 08:03 pm (UTC)
> Can your HashMap code be merged back into Phobos, to replace the built-in AAs? Otherwise the built-in ones become useless filler :-)

No idea if it could be merged into Phobos, but it might be possible to use it as the implementation for the built-in AAs when Tango is used. Keep in mind that Tango replaces the runtime entirely, so it might redefine the GC, the built-ins, and a whole lot more.

> You are right. How can I try your HashMap, having just installed Tangobos? Thank you.

Hm. I think you got the idea behind Tangobos wrong ;) It's meant to provide Phobos modules for Tango users, not the other way around. You should probably install Tango and have Tangobos around to retain the std.* modules.


-h3r3tic
