leonardo ([info]leonardo_m) wrote,
@ 2009-06-20 00:01:00
Previous Entry  Add to memories!  Tell a Friend  Next Entry
Entry tags:d language, programming, python

Built-in hashing in D and Python
Both the D1/D2 languages and Python have built-in hash, in Python there are dicts, sets and frozensets, while in D there are associative arrays.

Having them built-in is very handy, and it allows a nice syntax. In future D2 language may keep such syntax while moving their implementation into the standard library, allowing more flexibility and keeping them about as handy as now (probably the compilation time may grow some and error messages may worsen, but the efficiency and flexibility will increase).

Any Python object can be used inside a dict/set, but by default the hash value of the object is the ID and the equality too is done amond IDs. So to avoid duplications you must define __hash__() and __eq__ (or __cmp__()).

In D you have to define the opEquals() and opEquals() and opCmp() too, because the hash is currently implemented in a different way, in case of hash collisions (when the hash value is equal when opEquals is true) they are resolved with a sorted search tree (that requires a <).

This means that while Python dicts are currently usually faster or much faster than D associative arrays when used on normal data with normally good hash functions, D AAs switch from being about O(1) to being about O(log n) in insert and access time, while Python dicts in such situations may become O(n). So D AAs are safer when the hash function is bad (or even in situation of hash attacks).

I have written small programs to benchmark the situation. (In the tests I have not used Psyco or the new LDC D compiler because their performance is the same or worse.)

This is the Python test (only insertions are performed, so this benchameks is also strongly determined by the efficiency of the memory allocator):

from sys import argv

class K(object):
    def __init__(self, x):
        self.x = x
    def __eq__(self, other):
        return self.x == other.x
    def __hash__(self):
        return 1
        #return self.x

def test(n):
    d = {}
    for i in xrange(n):
        d[K(i)] = i

n = int(argv[1]) if len(argv) == 2 else 10
test(n)

Equivalent D1 code:
import std.conv: toInt;

class K {
    int x;
    this(int xx) {
        this.x = xx;
    }
    int opEquals(Object other) {
        return this.x == (cast(K)other).x;
    }
    int opCmp(Object other) {
        int other_x = (cast(K)other).x;
        return (this.x == other_x) ? 0 : (this.x - other_x);
    }
    hash_t toHash() {
        return 1;
        //return cast(hash_t)this.x;
    }
}

void test(int n) {
    int[K] d;
    for (int i; i < n; i++)
        d[new K(i)] = i;
}

void main(string[] args) {
    int n = args.length == 2 ? toInt(args[1]) : 10;
    test(n);
}

I have also tried the same D1 code t, using struct used by value (not allocated on the heap, well, allocated inside the associative array itself) instead of objects:
import std.conv: toInt;

struct K {
    int x;
    int opEquals(K other) {
        return this.x == other.x;
    }
    int opCmp(K other) {
        return (this.x == other.x) ? 0 : (this.x - other.x);
    }
    hash_t toHash() {
        return 1;
        //return cast(hash_t)this.x;
    }
}

void test(int n) {
    int[K] d;
    for (int i; i < n; i++)
        d[K(i)] = i;
}

void main(string[] args) {
    int n = args.length == 2 ? toInt(args[1]) : 10;
    test(n);
}
As you can see the hash methods alaways return 1, so every item is an hash collision.

The timings:
Timings with all collisions (hash=1):
  N =       1000  2000  3000  4000  10_000  100_000
  Python:   0.53  1.27  2.45  4.30  30.27
  D class:  0.06  0.09  0.12  0.17   1.07
  D struct: 0.06  0.09  0.09  0.11   0.50    44.26

With no collisions (hash=this.x):
  N =       1000  2000  3000  4000  10_000  100_000  1_000_000
  Python:   0.28  0.28  0.28  0.28   0.28     0.53    6.97
  D class:  0.05  0.05  0.05  0.05   0.06     0.12    1.01
  D struct: 0.05  0.05  0.05  0.05   0.06     0.07    0.39
The timings don't show the clean grow factors I was talking about (probably also because of memory allocations), but you can clearly see how much faster timings grow in Python compared to D when there are many collisions.



(3 comments) - (Post a new comment)

Slightly flawed
(Anonymous)
2009-06-20 10:55 am UTC (link)
In Python you rarely use custom classes as the key in dicts. Things like tuples are much more common.

Also try using __slots__ for your K class.

(Reply to this) (Thread)

Re: Slightly flawed
[info]leonardo_m
2009-06-20 11:53 am UTC (link)
Tuples have anonymous fields, this may lead to troubles (so Hettinger has created the named tuples).
Probably your suggestions can improve the timings of the Python code.
The purpose of this comparison isn't to show who is the faster one, or to show that D AAs are faster than Python ones (usually they are aren't).
The purpose is to show a difference in the algorithm used to resolve collisions; in D it's tree-based, in Python it is not, so in this test I was interested in how timings grow and not in actual baseline performance.

(Reply to this) (Parent)

Tango Hashmaps
(Anonymous)
2009-06-20 04:54 pm UTC (link)
I'd be interested to see the benchmarks when using tango's hashmaps. In my experience they are a lot faster than built in associative arrays, and work the same way (other than instantiation):

----
HashMap!( char[], int ) hm = new HashMap!( char[], int );
// The above works the same as below, the are accessed the same way etc.
int[char[]] aa;
----

If Tango's hashmaps are faster, it might be a good idea to implement them in a similar way.

(Reply to this)


(3 comments) - (Post a new comment)

Create an Account
Forgot your login or password?
Login w/ OpenID
English • Español • Deutsch • Русский…