
Subject:Permutations exercise in D (and C)
Time:10:48 pm
By leonardo maffi, V.1.0, 1 Aug 2014.


A nice programming exercise in Haskell by Harry Garrood, "Permutations: an exercise" (21 Jul 2014):
http://harry.garrood.me/blog/permutations-an-exercise/
http://www.reddit.com/r/haskell/comments/2ba54b/a_little_programming_exercise_about_permutations/

Harry Garrood writes:
We're going to use Haskell, because this is all about functions (in the mathematical sense), and so Haskell, being a functional programming language, is especially well suited to the job.
Most of the code you see below is in the D language (with a C version at the end), because it's a good language for solving this kind of problem.

The problem statement defines a card shuffling technique:
1) Take one card from the top of the deck and discard it into a second pile.
2) Take another card from the top of the deck, and put it at the bottom of the deck.
3) Repeat these two steps, putting all discarded cards from step 1 into the same pile, until the original deck is all gone and the second pile has all the cards in it.

The problem is: how many shuffles does it take until a deck is in the same order as when you started, for a deck with an arbitrary number of cards? Write a function, f :: Int -> Int, such that, for a deck with n cards, f n is the minimum number of shuffles required to return it to its original order. (For example, a 4-card deck [1,2,3,4] becomes [4,2,3,1] after one shuffle and is back to [1,2,3,4] after a second one, so f 4 = 2.)

Below I explain the various D implementations, which you can find in the archive:
http://www.fantascienza.net/leonardo/js/permutations.zip

The fantascienza site is down, so you can temporarily find the programs here:

perms1.hs: http://lpaste.net/2412259684489625600
perms2.hs: http://lpaste.net/390308567523000320
perms3.d: http://dpaste.dzfl.pl/2eb987664dfd
perms3b.d: http://dpaste.dzfl.pl/0aa48d395cb2
perms4.d: http://dpaste.dzfl.pl/0b0cb344ccce
perms5.d: http://dpaste.dzfl.pl/9f929c2d064d
perms6.d: http://dpaste.dzfl.pl/dbf1d3f19321
perms7.d: http://dpaste.dzfl.pl/a8eddf54067d
perms7b.d: http://dpaste.dzfl.pl/8a9a1afc10f8
perms7c.d: http://dpaste.dzfl.pl/45ff250cafa7
perms8.d: http://dpaste.dzfl.pl/32614ae6430c
perms9.c: http://dpaste.dzfl.pl/608dd06e44a5


I compile all the code on 32-bit Windows using:
dmd -O -release -inline -noboundscheck
ghc -O3
gcc -Ofast -flto -std=c99 -Wall -Wextra

dmd 2.066beta4
Glasgow Haskell Compiler, Version 7.6.1, stage 2 booted by GHC version 7.4.2
gcc 4.0.0
- - - - - - - - - - - - - - - - - - -

perms1.hs:
perms2.hs:

The perms1.hs and perms2.hs programs are adapted from the two solutions by Harry Garrood, with small changes.

This is the original 'permutation' function:
permutation :: (Deck -> Deck) -> Int -> PermutationGraph
permutation f n = go 1 xs
    where
    xs = unCardAll . f . makeDeck $ n
    go m (y:ys) = (y, m) : go (m+1) ys
    go _ []     = []

This is my version:
permutation :: (Deck -> Deck) -> Int -> PermutationGraph
permutation shuf n = zip perm [1 ..]
    where perm = unCardAll $ shuf $ makeDeck n
perms2.hs runs on the six test cases (n = 4, 5, 52, 53, 100, 200) in about 0.08 seconds (such small timings are not precise).

- - - - - - - - - - - - - - - - - - -

perms3.d:
perms3b.d:

This program is equivalent to the perms1.hs program.
Harry Garrood defines a very nice and clean Haskell function that performs the first two steps of the shuffling:
step :: (Deck, Deck) -> (Deck, Deck)
step (deck, pile) = case deck of
    a:b:cs -> (cs ++ [b], a : pile)
    [a]    -> ([], a : pile)
    []     -> ([], pile)

I have implemented this 'step' function in various ways in D; in the end this is the version I have used (here I have removed the pre/post-conditions; see the files for the full code). It performs the whole shuffling (including the third step):
Deck deckShuffle(in Deck Ain) pure nothrow @safe {
    typeof(return) B;
    const(Card)[] A = Ain;

    while (!A.empty) {
        // Step 1:
        B = [Card(A.front)] ~ B;
        A.popFront;

        // Step 2:
        if (!A.empty) {
            A ~= A.front;
            A.popFront;
        }
    }

    return B;
}

This function is (strongly) pure: it accepts a constant array of Cards and performs the shuffling using slices and concatenations, building a new output array B. Here a Deck is a dynamic array, but perhaps a more efficient data structure for this algorithm is a (singly linked) list (roughly as in the Haskell code).

In perms3 I have a commented-out the range-based version of 'makeDeck' because currently the 'array' function is not @safe for that use case. So I've written a more internally imperative version of 'makeDeck':
enum makeDeck = (in uint n) pure nothrow => iota(1, n + 1).map!Card.array;

I have used a Typedef to implement the 'newtype' of the Haskell code (even though Typedef currently has some bugs):
alias Card = Typedef!int;
The function 'f1' uses a do-while loop because Phobos still lacks a range-based 'iterate' function, as used in the 'shuffle' Haskell function.
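
A minimal sketch of how 'f1' could look (the actual code is in perms3.d); it assumes the 'makeDeck' and 'deckShuffle' shown in this post, and alias Deck = Card[]:

ulong f1(in uint n) {
    const Deck deck0 = makeDeck(n);
    Deck deck = makeDeck(n);
    ulong count = 0;
    do { // A do-while in place of the missing 'iterate'.
        deck = deck.deckShuffle;
        count++;
    } while (deck != deck0);
    return count;
}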

I have run both perms1.hs and perms3.d on the few test values. perms3.d prints:
4 2
5 5
52 510
53 53
100 120
200 8460

perms1.hs produces the equivalent output in about 2.72 seconds:
[2,5,510,53,120,8460]

perms3.d produces its output (best of 3) in about 1.39 seconds.
I have then created a perms3b.d program that is very similar to perms3.d, but represents a Deck with an SList (a singly linked list), hoping to improve performance by avoiding the O(n) copies of the slices of the array-based version. But perms3b.d runs in about 30.4 seconds: in most cases linked lists don't give any performance advantage.

- - - - - - - - - - - - - - - - - - -

perms4.d:

This version is similar to perms3.d, but here I have implemented 'deckShuffle' in a lower-level way. This deckShuffle is now @nogc (it doesn't allocate on the heap) and works in-place on two given arrays, using indexes (it could also be implemented using pointer arithmetic, but there's no need for that: this function is fast and @safe). deckShuffle is still annotated with 'pure', but now it's only weakly pure:
void deckShuffle(Deck A, Deck B) pure nothrow @safe @nogc {
    size_t aTop = 0,
           aLength = A.length,
           bTop = B.length - 1; // B is empty at start.

    while (aLength) {
        // Step 1:
        B[bTop] = A[aTop];
        aTop = (aTop + 1) % A.length;
        bTop--;
        aLength--;

        // Step 2:
        if (aLength) {
            A[(aTop + aLength) % A.length] = A[aTop];
            aTop = (aTop + 1) % A.length;
        }
    }
}
This kind of code is more fiddly and requires more care and testing compared to a more functional style, but it's also much faster and produces no heap garbage. The run-time of this function is also more predictable.

The run time of 'perms4.d' to produce the six output pairs is about 0.06 seconds, so this is about 23 times faster than 'perms3.d'. Unlike many other languages, in D it's easy to mix low level code with higher level code, and produce code that still looks natural and free of most accidental complexity.

- - - - - - - - - - - - - - - - - - -

perms5.d:

This implements the algorithm from perms2.hs. For simplicity I have used the slower but simpler 'deckShuffle' from perms3.d.

Instead of writing the various functions of the Haskell version, building the various intermediate data structures (like the list of pairs), I have written a single 'f2' function that computes the lcm of the cycle lengths without storing the actual cycles. D has less 'glue' than Haskell, so in D it's often more convenient to write slightly larger functions (with the side effect that they are often faster than the Haskell versions, but also less easy to combine in different ways):
ulong f2(in uint n) pure nothrow {
    uint[uint] graph;
    foreach (immutable uint i, immutable x; n.makeDeck.deckShuffle)
        graph[i + 1] = cast(uint)x;

    typeof(return) result = 1;

    while (graph.length) {
        auto x = graph.byKey.front;
        uint cycleLen = 0;
        while (x in graph) {
            cycleLen++;
            immutable y = graph[x];
            graph.remove(x);
            x = y;
        }
        result = lcm(result, cycleLen);
    }

    return result;
}
This code is less general and more specialized than the Haskell functions, and it's less easy to see what it is doing. Here the 'graph' of the pairs is an associative array, to avoid the linear searches in the lists (but a hidden linear scan is still present: the one performed by byKey.front).
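
The 'lcm' helper isn't shown here (the actual definition is in the zip archive); a minimal sketch of it could be built on std.numeric.gcd:

import std.numeric: gcd;

ulong lcm(in ulong a, in ulong b) pure nothrow {
    // Divide before multiplying to keep the intermediate value smaller.
    return (a / gcd(a, b)) * b;
}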

The run-time of perms5.d is very small, so besides calling 'f2' on the six test cases, this program also computes 'f2' for the numbers n in [1, 1_000[. The total run-time is about 1.29 seconds.

If I modify perms2.hs to compute f2 on the same [1, 1_000[ interval, the run-time is very large (about half an hour; and it uses 32-bit integers, so some results for the [1, 1_000[ interval are wrong). So the 'f2' function in D has some advantages (and it's still strongly pure). The D 'f2' function returns a ulong because when n grows large, the output grows a lot.

- - - - - - - - - - - - - - - - - - -

perms6.d:

This is similar to perms5.d, but uses the more efficient 'deckShuffle' function from perms4.d.
The run-time of perms6.d is only 0.20 seconds (this includes the computations for n in [1, 1_000[), so it's more than six times faster than perms5.d.

- - - - - - - - - - - - - - - - - - -

perms7.d:
perms7b.d:
perms7c.d:

This is a lower-level version (easy to port to C), similar to perms6.d, but here in 'f2', instead of the associative array 'graph', I have re-used the array D2, filling it with 'emptyItem' when I remove a value. 'f2' is now also @nogc because it accepts fixed-size arrays as arguments and then slices them. So the program allocates the arrays only inside 'main' (and there they are stack-allocated, so even main is @nogc):
ulong f2(size_t m)(const uint n, ref Card[m] D1, ref Card[m] D2)
pure nothrow @safe @nogc {
    for (uint i = 0; i < n; i++)
        D1[i] = cast(Card)i;
    deck_shuffle(D1, D2, n);

    typeof(return) result = 1;
    uint graph_len = n;
    enum emptyItem = uint.max;

    while (graph_len) {
        uint x = uint.max;
        for (uint i = 0; i < n; i++)
            if (D2[i] != emptyItem)
                x = i;

        uint cycle_len = 0;
        while (D2[x] != emptyItem) {
            cycle_len++;
            const uint y = D2[x];
            D2[x] = emptyItem;
            graph_len--;
            x = y;
        }
        result = lcm(result, cycle_len);
    }

    return result;
}

void main() nothrow @nogc {
    import core.stdc.stdio: printf;

    const uint nMax = 1_000;
    Card[nMax] d1 = void, d2 = void;

    for (uint n = 1; n <= nMax; n++)
        printf("%llu\n", f2(n, d1, d2));
}
perms7.d runs in about 0.05 seconds. So it's about 4 times faster than perms6.d.

To avoid overflow bugs I have created the perms7b.d and perms7c.d versions, which use core.checkedint (even though core.checkedint is not meant for user code). For both perms7b.d and perms7c.d the run-time is similar to the unchecked version, and perms7b.d spots the overflow bugs:
...
1389: 4247163320400
1390: 465585120
1391: 1391
1392: 6570
1393: 4294945235192621194 (overflow)
1394: 3655769040
1395: 1194
1396: 930
1397: 1748154975840
1398: 1398
...
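
The overflow detection could look about like this minimal sketch (the real code is in the perms7b.d/perms7c.d pastes); core.checkedint.mulu sets a sticky flag instead of silently wrapping around:

import core.checkedint: mulu;
import std.numeric: gcd;

// Hypothetical checked variant of the 'lcm' helper.
ulong checkedLcm(in ulong a, in ulong b, ref bool overflow) {
    return mulu(a / gcd(a, b), b, overflow);
}
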
- - - - - - - - - - - - - - - - - - -

perms8.d:

This version is very similar to perms7.d, but I've used BigInt instead of ulong to compute all correct results (if there are no other bugs or design mistakes).

perms8.d is still quite fast: it runs in about 0.08 seconds (this includes the computations for n in [1, 1_000[). If I increase nMax to 10_000 it runs in about 6.6 seconds. D BigInts are not very fast, but I think this is not a significant problem for this program.

- - - - - - - - - - - - - - - - - - -

perms9.c:

This is C code derived from perms7.d.

perms9.c runs in about 0.03 seconds or less. perms9.c is a little faster than perms7.d, but the speed difference becomes really visible only when nMax is much larger (and then the program gives wrong output because of overflows). The difference between 0.03 and 0.05 seconds is mostly caused by the time taken to initialize the D runtime.

- - - - - - - - - - - - - - - - - - -

Subject:K-Nearest Neighbor in D language
Time:12:26 am
By leonardo maffi,
version 1.1.

Warning: currently nearest4/nearest5 don't work correctly; I'll try to fix them.

(Later this article will be moved to my site.)

These blog posts present a performance comparison between F# and OCaml implementations of a simple K-Nearest Neighbor (with K=1) classifier algorithm to recognize handwritten digits:

http://philtomson.github.io/blog/2014/05/29/comparing-a-machine-learning-algorithm-implemented-in-f-number-and-ocaml/

http://philtomson.github.io/blog/2014/05/30/stop-the-presses-ocaml-wins/

One of the Reddit discussions:

http://www.reddit.com/r/fsharp/comments/26vl3w/stop_the_presses_ocaml_wins_in_terms_of_speed_vs/

The original F# and OCaml code:
https://github.com/philtomson/ClassifyDigits

So I've studied and compared versions of this code in the D language.

First I've written a functional-style version, 27 lines of code long, that is similar to the array-based OCaml version:
import std.stdio, std.algorithm, std.range, std.array, std.conv,
       std.string, std.math, std.typecons;

struct LabelPixels { ubyte label; ubyte[] pixels; }

immutable readData = (in string fileName) =>
    fileName
    .File
    .byLine
    .dropOne
    .map!(r => r.chomp.split(",").to!(ubyte[]))
    .map!(a => LabelPixels(a[0], a.dropOne))
    .array;

immutable distance = (in ubyte[] p1, in ubyte[] p2)
pure /*nothrow*/ @safe /*@nogc*/ =>
    double(p1.zip(p2).map!(ab => (ab[0] - ab[1]) ^^ 2).sum).sqrt;

immutable classify = (in LabelPixels[] trainingSet, in ubyte[] pixels)
/*pure nothrow*/ @safe =>
    trainingSet
    .map!(s => tuple(pixels.distance(s.pixels), ubyte(s.label)))
    .reduce!min[1];

void main() {
    const trainingSet = "training_sample.csv".readData;
    const validationSample = "validation_sample.csv".readData;

    immutable nCorrect = validationSample
                         .map!(s => trainingSet.classify(s.pixels) == s.label)
                         .sum;
    writeln("Percentage correct: ", double(nCorrect) / validationSample.length * 100);
}

Notes to this first version:
- Inside LabelPixels the pixels are represented with one unsigned byte each. This reduces the RAM used to store the data and allows the CPU cache to be used more efficiently.
- "readData" is a lambda assigned to a global immutable value. This is not fully idiomatic D style, but I've used this because it looks more similar to the F# code. In practice I usually don't use such global lambdas because the output type gives me useful information to understand the purpose of the function, and they are less easy to find in the assembly (you can't just search for "readData" in the asm listing to find this lambda.
- The D standard library contains a module to read CSV files (std.csv), but I find its API rather bad, so I've used File.byLine and I have processed the lines lazily.
- In readData I've used the eager function split instead of the lazy splitter (that uses less RAM) because for the data sets used by this program split is faster (readData loads both CSV files in about 0.8-0.9 seconds on my PC).
- In the "distance" lambda the nothrow and @nogc attributes are commented out because zip() doesn't yet respect such attributes (zip throws an exception in certain cases and the code to throw the exception allocates it on the heap). The "classify" lambda calls "distance" so it can't be nothrow and @nogc (and because reduce doesn't have a default value in case its input range is empty, so it can throw, and because classify contains a heap-allocating lambda).
- The distance lambda contains code like "map!(ab => (int(ab[0]) - ab[1])" because D is not yet able to destructure tuples, like the 2-tuples generated lazily by zip.
- The classify lambda contains a map of tuples because the max/min functions of Phobos still lack an optional "key" function like the max/min functions of Python. So I've created tuples of the mapped value plus the label, and I've used reduce to find the min. I have used "ubyte(s.label)" to produce a fully mutable tuple, because reduce doesn't accept immutable items (unless you use a mutable seed). A small isolated sketch of this pattern follows.
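
Here is that min-by-key pattern in isolation (my toy example, not code from the program): pair each item with its key and reduce with min, which compares the tuples lexicographically:

import std.algorithm, std.stdio, std.typecons;

void main() {
    immutable items = [5, -3, 2];
    // (key, item) tuples; min compares the keys first.
    immutable best = items.map!(x => tuple(x < 0 ? -x : x, x)).reduce!min;
    writeln(best[1]); // 2, the item with the smallest absolute value
}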

Most of the run-time of this program is spent inside the distance function. So if you want to optimize this program, that's the function to improve.

In the zip you can find a "nearest1b.d" file that is very similar to this first version, but adapted to the current version of the LDC2 compiler, which is based on a slightly older front-end (it doesn't support the @nogc attribute or the T(x) syntax for implicit casts, and its Phobos doesn't have a sum function).

The D compilers (even the well-optimizing LDC2 compiler) are not yet very good at optimizing that functional distance function. So in the "nearest2.d" version of the program (which you can find in the zip archive) I've written distance in an imperative way; compiled with ldc this produces a program that is more than 3 times faster:
uint distance(in ubyte[] p1, in ubyte[] p2) pure nothrow @safe {
    uint tot = 0;
    foreach (immutable i, immutable p1i; p1)
        tot += (p1i - p2[i]) ^^ 2;
    return tot;
}

Now this distance function can be pure nothrow @safe. Note that this function also cheats a little by not computing the square root: sqrt is monotonic, so it's not needed if you only want to find the closest sample. This function also computes the total using integers.

I have not shown the asm generated by LDC2 for the distance lambda of the "nearest1.d" program because it's too complex and messy, but I can show the asm for this second, simpler distance function:
__D8nearest38distanceFNaNbNfyAhyAhZk:
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	24(%esp), %ecx
	xorl	%eax, %eax
	testl	%ecx, %ecx
	je	LBB8_3
	movl	28(%esp), %edx
	movl	20(%esp), %esi
	.align	16, 0x90
LBB8_2:
	movzbl	(%edx), %edi
	movzbl	(%esi), %ebx
	subl	%ebx, %edi
	imull	%edi, %edi
	addl	%edi, %eax
	incl	%edx
	incl	%esi
	decl	%ecx
	jne	LBB8_2
LBB8_3:
	popl	%esi
	popl	%edi
	popl	%ebx
	ret	$16

As you see, the loop gets compiled to quite simple asm, with 9 x86 (32-bit) instructions.

To improve performance I have changed the data layout a little. With the "binary_generator1.d" program (in the zip archive) I have generated binary files of the training set and validation sample. So now the loading of the data is very simple and very fast (see the zip archive for the full source of this "nearest3.d" program):
enum nCols = 785;
enum size_t labelIndex = 0;
alias TData = ubyte;
alias TLabel = TData;

immutable(TData[nCols])[] readData(in string fileName) {
    return cast(typeof(return))std.file.read(fileName);
}

uint distance(immutable ref TData[nCols - 1] p1,
              immutable ref TData[nCols - 1] p2)
pure nothrow @safe /*@nogc*/ {
    uint tot = 0;
    foreach (immutable i, immutable p1i; p1)
        tot += (p1i - p2[i]) ^^ 2;
    return tot;
}

TLabel classify(immutable TData[nCols][] trainingSet,
                immutable ref TData[nCols - 1] pixels)
pure nothrow @safe /*@nogc*/ {
    auto closestDistance = uint.max;
    auto closestLabel = TLabel.max;

    foreach (immutable ref s; trainingSet) {
        immutable dist = pixels.distance(s[1 .. $]);
        if (dist < closestDistance) {
            closestDistance = dist;
            closestLabel = s[labelIndex];
        }
    }

    return closestLabel;
}

Notes:
- The classify function is now imperative, but this improves the performance just a little.
- @nogc is still commented out because the current version of the LDC2 compiler doesn't support it yet.
- The main performance difference of this third program comes from the data layout. Now I have hard-coded (defined statically) the number of columns nCols, so the data sets are essentially a ubyte[N][]. In D this is represented as a dense array, which minimizes cache misses and removes one level of indirection. Thanks to the compile-time knowledge of the loop bounds inside the distance function, the LLVM back-end of the LDC2 compiler performs some loop unrolling (in theory it could be performed even with dynamic bounds, but in practice LLVM 3.4.1 does not perform loop unrolling on dynamic bounds).

So the asm of the distance function of the "nearest3.d" program is longer, because of the partial loop unrolling:
__D8nearest38distanceFNaNbNfKyG784hKyG784hZk:
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	xorl	%ecx, %ecx
	movl	$7, %edx
	movl	16(%esp), %esi
	.align	16, 0x90
LBB1_1:
	movzbl	-7(%esi,%edx), %edi
	movzbl	-7(%eax,%edx), %ebx
	subl	%ebx, %edi
	imull	%edi, %edi
	addl	%ecx, %edi
	movzbl	-6(%esi,%edx), %ecx
	movzbl	-6(%eax,%edx), %ebx
	subl	%ebx, %ecx
	imull	%ecx, %ecx
	addl	%edi, %ecx
	movzbl	-5(%esi,%edx), %edi
	movzbl	-5(%eax,%edx), %ebx
	subl	%ebx, %edi
	imull	%edi, %edi
	addl	%ecx, %edi
	movzbl	-4(%esi,%edx), %ecx
	movzbl	-4(%eax,%edx), %ebx
	subl	%ebx, %ecx
	imull	%ecx, %ecx
	addl	%edi, %ecx
	movzbl	-3(%esi,%edx), %edi
	movzbl	-3(%eax,%edx), %ebx
	subl	%ebx, %edi
	imull	%edi, %edi
	addl	%ecx, %edi
	movzbl	-2(%esi,%edx), %ecx
	movzbl	-2(%eax,%edx), %ebx
	subl	%ebx, %ecx
	imull	%ecx, %ecx
	addl	%edi, %ecx
	movzbl	-1(%esi,%edx), %edi
	movzbl	-1(%eax,%edx), %ebx
	subl	%ebx, %edi
	imull	%edi, %edi
	addl	%ecx, %edi
	movzbl	(%esi,%edx), %ecx
	movzbl	(%eax,%edx), %ebx
	subl	%ebx, %ecx
	imull	%ecx, %ecx
	addl	%edi, %ecx
	addl	$8, %edx
	cmpl	$791, %edx
	jne	LBB1_1
	movl	%ecx, %eax
	popl	%esi
	popl	%edi
	popl	%ebx
	ret	$4

Thanks to the data layout changes and other smaller improvements, the "nearest3.d" program is almost three times faster than "nearest2.d".

To reduce the run time and better utilize one CPU core, we can use the SIMD registers of the CPU.

To perform the subtractions of the arrays I have changed the data layout again: this time I represent the pixels with shorts (signed 16-bit integers in D), so I can use a short8 to perform several subtractions in parallel. A problem comes from the label: if I slice it away as in the "nearest3.d" program, the array pointers are not aligned to 16 bytes, and this is not accepted by my CPU. To solve this problem I add a padding of 7 shorts after the label in every line of the matrix. This is performed by the "binary_generator2.d" program.

So the 4th version of the program contains (see in the zip archive for the full code of "nearest4.d"):
enum nCols = 785;
enum size_t labelIndex = 0;
alias TData = short;
alias TLabel = TData;

immutable(TData[nCols + 7])[] readData(in string fileName) {
    return cast(typeof(return))std.file.read(fileName);
}

uint distance(immutable ref TData[nCols - 1] p1,
              immutable ref TData[nCols - 1] p2) pure nothrow /*@nogc*/ {
    alias TV = short8;
    enum size_t Vlen = TV.init.array.length;
    assert(p1.length % Vlen == 0);
    immutable v1 = cast(immutable TV*)p1.ptr;
    immutable v2 = cast(immutable TV*)p2.ptr;

    TV totV = 0;
    foreach (immutable i; 0 .. p1.length / Vlen) {
        TV d = v1[i] - v2[i];
        totV += d * d;
    }

    uint tot = 0;
    foreach (immutable t; totV.array)
        tot += t;
    return tot;
}

TLabel classify(immutable TData[nCols + 7][] trainingSet,
                immutable ref TData[nCols - 1] pixels) pure nothrow /*@nogc*/ {
    auto closestDistance = uint.max;
    auto closestLabel = short.max;

    foreach (immutable ref s; trainingSet) {
        immutable dist = pixels.distance(s[8 .. $]);
        if (dist < closestDistance) {
            closestDistance = dist;
            closestLabel = s[labelIndex];
        }
    }

    return closestLabel;
}

Notes:
- The classify function has not changed much: it performs the slicing to throw away the first eight items (the short label plus 7 padding shorts).
- The distance function is a little more complex, but not too much. It uses the basic SIMD operations like subtraction, sum, product, plus the ".array" attribute that allows me to manage a short8 as a fixed-size value to sum all its contents.
- The code of distance is designed to be adaptable to other sizes of SIMD registers (but it's not able to adapt automatically).
- This "nearest4.d" program is only 1.6 times faster than "nearest4.d" probably because the SIMD code that manages shorts is not very clean.

The asm for the distance function of the "nearest4.d" program is rather long, despite being fast:
__D8nearest48distanceFNaNbKyG784sKyG784sZk:
	pushl	%edi
	pushl	%esi
	pxor	%xmm0, %xmm0
	movl	$208, %ecx
	movl	12(%esp), %edx
	.align	16, 0x90
LBB1_1:
	movdqa	-208(%edx,%ecx), %xmm1
	movdqa	-192(%edx,%ecx), %xmm2
	psubw	-208(%eax,%ecx), %xmm1
	pmullw	%xmm1, %xmm1
	paddw	%xmm0, %xmm1
	psubw	-192(%eax,%ecx), %xmm2
	pmullw	%xmm2, %xmm2
	paddw	%xmm1, %xmm2
	movdqa	-176(%edx,%ecx), %xmm0
	psubw	-176(%eax,%ecx), %xmm0
	pmullw	%xmm0, %xmm0
	paddw	%xmm2, %xmm0
	movdqa	-160(%edx,%ecx), %xmm1
	psubw	-160(%eax,%ecx), %xmm1
	pmullw	%xmm1, %xmm1
	paddw	%xmm0, %xmm1
	movdqa	-144(%edx,%ecx), %xmm0
	psubw	-144(%eax,%ecx), %xmm0
	pmullw	%xmm0, %xmm0
	paddw	%xmm1, %xmm0
	movdqa	-128(%edx,%ecx), %xmm1
	psubw	-128(%eax,%ecx), %xmm1
	pmullw	%xmm1, %xmm1
	paddw	%xmm0, %xmm1
	movdqa	-112(%edx,%ecx), %xmm0
	psubw	-112(%eax,%ecx), %xmm0
	pmullw	%xmm0, %xmm0
	paddw	%xmm1, %xmm0
	movdqa	-96(%edx,%ecx), %xmm1
	psubw	-96(%eax,%ecx), %xmm1
	pmullw	%xmm1, %xmm1
	paddw	%xmm0, %xmm1
	movdqa	-80(%edx,%ecx), %xmm0
	psubw	-80(%eax,%ecx), %xmm0
	pmullw	%xmm0, %xmm0
	paddw	%xmm1, %xmm0
	movdqa	-64(%edx,%ecx), %xmm1
	psubw	-64(%eax,%ecx), %xmm1
	pmullw	%xmm1, %xmm1
	paddw	%xmm0, %xmm1
	movdqa	-48(%edx,%ecx), %xmm0
	psubw	-48(%eax,%ecx), %xmm0
	pmullw	%xmm0, %xmm0
	paddw	%xmm1, %xmm0
	movdqa	-32(%edx,%ecx), %xmm1
	psubw	-32(%eax,%ecx), %xmm1
	pmullw	%xmm1, %xmm1
	paddw	%xmm0, %xmm1
	movdqa	-16(%edx,%ecx), %xmm2
	psubw	-16(%eax,%ecx), %xmm2
	pmullw	%xmm2, %xmm2
	paddw	%xmm1, %xmm2
	movdqa	(%edx,%ecx), %xmm0
	psubw	(%eax,%ecx), %xmm0
	pmullw	%xmm0, %xmm0
	paddw	%xmm2, %xmm0
	addl	$224, %ecx
	cmpl	$1776, %ecx
	jne	LBB1_1
	pshufd	$3, %xmm0, %xmm1
	movd	%xmm1, %eax
	movdqa	%xmm0, %xmm1
	movhlps	%xmm1, %xmm1
	movd	%xmm1, %ecx
	pshufd	$1, %xmm0, %xmm1
	movd	%xmm1, %edx
	movd	%xmm0, %esi
	movswl	%si, %edi
	sarl	$16, %esi
	addl	%edi, %esi
	movswl	%dx, %edi
	addl	%esi, %edi
	sarl	$16, %edx
	addl	%edi, %edx
	movswl	%cx, %esi
	addl	%edx, %esi
	sarl	$16, %ecx
	addl	%esi, %ecx
	movswl	%ax, %edx
	addl	%ecx, %edx
	sarl	$16, %eax
	addl	%edx, %eax
	popl	%esi
	popl	%edi
	ret	$4

The run times of the various versions, in seconds, best of 3:
  nearest1:   91     (dmd)
  nearest1b:  19.7   (ldc2)
  nearest2:    5.91  (ldc2)
  nearest3:    2.05  (ldc2)
  nearest4:    1.28  (ldc2)

I have compiled the programs with:
dmd -wi -d -O -release -inline -boundscheck=off nearest1.d
ldmd2 -wi -unroll-allow-partial -O -release -inline -noboundscheck nearest1b.d
ldmd2 -wi -unroll-allow-partial -O -release -inline -noboundscheck nearest2.d
ldmd2 -wi -unroll-allow-partial -O -release -inline -noboundscheck nearest3.d
ldmd2 -wi -unroll-allow-partial -O -release -inline -noboundscheck nearest4.d
strip nearest1b.exe nearest2.exe nearest3.exe nearest4.exe

I have used the compilers:

DMD32 D Compiler v2.066

And ldc2 V.0.13.0-beta1, based on DMD v2.064 and LLVM 3.4.1, Default target: i686-pc-mingw32, Host CPU: core2.

On my 32-bit Windows the sizes (in bytes) of the sources and generated binaries are:
    1.692 nearest1.d
  221.212 nearest1.exe
    1.757 nearest1b.d
1.079.822 nearest1b.exe
    1.716 nearest2.d
1.070.606 nearest2.exe
    2.059 nearest2b.d
1.071.118 nearest2b.exe
    2.037 nearest3.d
1.353.230 nearest3.exe
    2.365 nearest4.d
1.353.742 nearest4.exe
    2.513 nearest5.d

In the zip archive I have also added a "nearest5.d" program that shows a recent improvement of the D type system (not yet available in ldc2): support for fixed-size array length inference for slices passed to templated functions.

With a run-time of 1.28 seconds for the "nearest4.d" program, there are still ways to reduce the run-time. I am using an old 2.3 GHz CPU, so on a modern 4 GHz Intel CPU like the Core i7-4790K, with faster memory, you should see a significant speedup. On a modern CPU you can also use the YMM registers with short16 SIMD, which should offer some speedup (LDC2 is already able to target the YMM registers; you need to replace short8 with short16 in the nearest4.d program and change the padding). And then this program is very easy to parallelize for two or four cores: you can use the std.parallelism module for the computation of nCorrect with map+reduce in the main function, which should give a further speedup.
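
For instance, the map+reduce in the main function could become about this (a sketch, untested, reusing the trainingSet/validationSample/classify names from the programs above):

import std.algorithm, std.parallelism;

// Classify the validation samples using all the CPU cores:
immutable nCorrect = taskPool.reduce!q{a + b}(
    validationSample.map!(s => trainingSet.classify(s.pixels) == s.label ? 1 : 0));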

You can find all the code and data here:
http://www.fantascienza.net/leonardo/js/nearest.7z

(This 7z archive contains the data sets too, I hope this is OK. Otherwise I'll remove them from the 7z archive.)

Subject:Links and updates
Time:12:49 am
Updates:

Given the natural numbers n and m, find the minimum number of integer-sided squares that tile an (n,m) rectangle:
http://www.fantascienza.net/leonardo/js/index.html#squares_problem
A D solution that improves the performance of a C++ solution.

A benchmark from the Computer Language Benchmarks Game site, in D:
http://www.fantascienza.net/leonardo/js/index.html#meteor_bench

A D implementation of a simple and famous simulation:
http://www.fantascienza.net/leonardo/js/segregation_game.zip

Various type-level solutions in D language of the "Instant Insanity" puzzle (inspired by a Haskell solution):
http://www.fantascienza.net/leonardo/js/instant_insanity.zip

-------------

Links:


A testing experience in Haskell:
http://www.serpentine.com/blog/2013/12/30/testing-a-utf-8-decoder-with-vigour/

A simple but sometimes useful optimization for modern CPUs:
http://blog.libtorrent.org/2013/12/memory-cache-optimizations/

A nice talk about the Julia language:
http://vimeo.com/84661077

"Types for Units-of-Measure: Theory and Practice" by Andrew Kennedy, the complexity behind the units of measure feature of F# (PDF file):
http://research.microsoft.com/en-us/um/people/akenn/units/cefp09typesforunitsofmeasure.pdf

A little story that reminds us that teaching debugging at the University is important:
http://danluu.com/teach-debugging/

Subject:Links and updates
Time:10:50 pm
Updated software:

http://www.fantascienza.net/leonardo/js/index.html#fasta
http://www.fantascienza.net/leonardo/js/index.html#audioactive


Added software:

http://www.fantascienza.net/leonardo/js/index.html#puzzle15game
http://www.fantascienza.net/leonardo/js/index.html#naval_simulation
http://www.fantascienza.net/leonardo/js/index.html#water_filling
http://www.fantascienza.net/leonardo/js/mark_sweep.zip
http://www.fantascienza.net/leonardo/js/index.html#challenge139


Links, mostly programming:

Easy use of Wikipedia from Python:
https://github.com/goldsmith/Wikipedia

A little sad Python-Haskell comparison:
http://eyallotem.blogspot.co.il/2013/05/comparing-python-and-haskell.html

Introductions to Parser Combinators:
http://blog.fogcreek.com/fparsec/
http://yannesposito.com/Scratch/en/blog/Parsec-Presentation/

"What I Wish I Knew When Learning Haskell", not easy to understand:
http://dev.stephendiehl.com/hask/

Mathics is a general-purpose computer algebra system:
http://www.mathics.net/

Slides of the StrangeLoop 2013 conference:
https://github.com/strangeloop/StrangeLoop2013/tree/master/slides/sessions

Introduction to Typed Clojure:
http://nathanic.org/posts/2013/typed-clojure-tour/

Simple introduction to Monoids for programmers:
http://fsharpforfunandprofit.com/posts/monoids-without-tears/
http://fsharpforfunandprofit.com/posts/monoids-part2/

Several simple C++ implementations of software renderers:
http://kagamin.net/hole/simple/index.htm

Cute, but I have several reservations:
http://www.xfront.com/Haskell/Why-Functional-Programming-Matters-Part-2.pdf

Doctest for Haskell:
https://github.com/sol/doctest-haskell#readme

Units-of-Measure in F#:
http://clear-lines.com/blog/post/Safe-refactoring-with-FSharp-Units-of-Measure.aspx

Nice and simple Pythagoras Tree in F#:
http://fsharpnews.blogspot.it/2009/05/pythagoras-tree.html

Eye-opening, "Dispelling the Myths of Parallel Computing" by Patrick H. Madden:
http://www.cs.binghamton.edu/~pmadden/pubs/dispelling-ieeedt-2013.pdf

Parse, of the Rebol language:
http://www.red-lang.org/2013/11/041-introducing-parse.html

Subject:Graph Walk
Time:01:48 am
In the "Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers" book by John MacCormick there is a simple simulation of a graph of Web pages, to implement a basic page ranking. This is modified from Figure 3-7 (from: http://press.princeton.edu/releases/m9528_illus.pdf ):

[Figure: the 16-node graph of Web pages used by the code below.]

I have implemented the algorithm using the D language. The simulation is just a random walk on the graph, where at each step there is also a small probability (0.15) of jumping to a random node. There are various ways to implement this walk; I have chosen a fast one:
// graph_walk1.d
import std.stdio, std.random, std.algorithm, std.math, std.typetuple;

alias NodeIndex = int;

enum nVisits = 20_000_000;
enum nUnroll = 8;
static assert(nVisits % nUnroll == 0);
enum restartProbability = 0.15;
enum nMaxOutArcs = 5;
enum nNodes = 16;

template Iota(int stop) {
    static if (stop <= 0)
        alias TypeTuple!() Iota;
    else
        alias TypeTuple!(Iota!(stop - 1), stop - 1) Iota;
}

struct Node {
    immutable uint nOutArcs;
    immutable uint[nMaxOutArcs] outbound;
    uint hitCount = 0;
}

void main() {
    auto rng = Xorshift(1);
    Node[nNodes] graph = [{/* 0*/ 3, [13, 9, 8]},
                          {/* 1*/ 1, [2]},
                          {/* 2*/ 2, [0, 8]},
                          {/* 3*/ 1, [1]},
                          {/* 4*/ 2, [1, 2]},
                          {/* 5*/ 2, [2, 8]},
                          {/* 6*/ 3, [3, 4, 5]},
                          {/* 7*/ 1, [6]},
                          {/* 8*/ 1, [15]},
                          {/* 9*/ 5, [14, 12, 11, 10, 8]},
                          {/*10*/ 1, [15]},
                          {/*11*/ 1, [15]},
                          {/*12*/ 1, [15]},
                          {/*13*/ 1, [9]},
                          {/*14*/ 1, [15]},
                          {/*15*/ 1, [7]}];

    uint position = uniform(0, graph.length, rng);
    graph[position].hitCount++;

    foreach (immutable _; 1 .. nVisits / nUnroll)
        foreach (immutable __; Iota!nUnroll) {
            if (graph[position].nOutArcs == 0 || uniform(0.0, 1.0, rng) < restartProbability)
                position = uniform(0, graph.length, rng);
            else
                position = graph[position].outbound[uniform(0, graph[position].nOutArcs, rng)];
            graph[position].hitCount++;
        }

    foreach (immutable i, immutable node; graph)
        writefln("%2d %2.0f", i, round(100.0 * node.hitCount / cast(real)nVisits));
}


The program performs twenty million jumps on the graph. The graph is represented with an adjacency array. I have used a fixed-size 2D matrix that contains some holes, because it has five columns but only node 9 has five outbound arcs. Each Node must also contain the actual number of outbound arcs, plus a counter of the visits.

There is a more compact representation for this graph: a linear array that has no holes and contains the indexes of the outbound arcs, plus a little extra array that contains the start indexes of the nodes (the number of outbound arcs follows from the start indexes, and the counts can go in another small array). But this representation is more complex, and I prefer simpler code. A small sketch of it follows.
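
A sketch of that compact representation (not used in the actual programs; only the first four nodes of the graph are filled in here):

// 'arcs' packs all the outbound indexes back to back; node i owns
// the slice arcs[start[i] .. start[i + 1]].
immutable uint[] arcs  = [13, 9, 8,  2,  0, 8,  1 /*, ...*/];
immutable uint[] start = [0, 3, 4, 6, 7 /*, ...*/];

uint nOutArcs(in uint node) pure nothrow {
    return start[node + 1] - start[node];
}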

The Iota!nUnroll is used to unroll the loop 8 times, to speed up the walk a little.

To improve the performance of the code I have created a second version of the D program. It uses a similar data structure, but the counts are kept in a separate array.

This program performs the random walk by jumping between the nodes using gotos, so the state is kept in the program counter. The finite state machine is implemented with code generated at compile time from the graph data structure:

// graph_walk2.d
import std.stdio, std.random, std.algorithm, std.math, std.string;

alias NodeIndex = int;

enum nVisits = 20_000_000;
enum restartProbability = 0.15;
enum nMaxOutArcs = 5;
enum nNodes = 16;

struct Node {
    immutable uint nOutArcs;
    immutable int[nMaxOutArcs] outbound;
}

string genSwitchCode(in int nMax) /*pure*/ {
    string switchCode = "switch (position) {\n";
    foreach (immutable i; 0 .. nMax)
        switchCode ~= format("    case %d: goto NODE%d;\n", i, i);
    switchCode ~= "    default: assert(0);\n";
    switchCode ~= "}\n";
    return switchCode;
}

string genNodeCode(in int nMax, in string switchCode) /*pure*/ {
    string nodeCode;
    foreach (immutable i; 0 .. nMax)
        nodeCode ~= format("
        NODE%d:
            hits[%d]++;
            steps++;
            if (steps >= nVisits) goto END;
            if (uniform(0.0, 1.0, rng) < restartProbability)
                position = uniform(0, graph.length, rng);
            else
                position = graph[%d].outbound[uniform(0, graph[%d].nOutArcs, rng)];
            %s
            ", i, i, i, i, switchCode);
    return nodeCode;
}

void main() {
    auto rng = Xorshift(1);

    static enum Node[nNodes] graph = [
        {/* 0*/ 3, [13, 9, 8]},
        {/* 1*/ 1, [2]},
        {/* 2*/ 2, [0, 8]},
        {/* 3*/ 1, [1]},
        {/* 4*/ 2, [1, 2]},
        {/* 5*/ 2, [2, 8]},
        {/* 6*/ 3, [3, 4, 5]},
        {/* 7*/ 1, [6]},
        {/* 8*/ 1, [15]},
        {/* 9*/ 5, [14, 12, 11, 10, 8]},
        {/*10*/ 1, [15]},
        {/*11*/ 1, [15]},
        {/*12*/ 1, [15]},
        {/*13*/ 1, [9]},
        {/*14*/ 1, [15]},
        {/*15*/ 1, [7]}];
    uint[graph.length] hits;
    uint steps = 0;

    uint position = uniform(0, graph.length, rng);
    hits[position]++;
    steps++;

    enum switchCode = genSwitchCode(graph.length);
    enum nodeCode = genNodeCode(graph.length, switchCode);
    mixin(switchCode);
    mixin(nodeCode);

    END:
    foreach (immutable i, immutable h; hits)
        writefln("%2d %2.0f", i, round(100.0 * h / cast(real)nVisits));
}

genSwitchCode() generates a simple jump table like this:

switch (position) {
    case 0: goto NODE0;
    case 1: goto NODE1;
    case 2: goto NODE2;
    case 3: goto NODE3;
    case 4: goto NODE4;
    case 5: goto NODE5;
    case 6: goto NODE6;
    case 7: goto NODE7;
    case 8: goto NODE8;
    case 9: goto NODE9;
    case 10: goto NODE10;
    case 11: goto NODE11;
    case 12: goto NODE12;
    case 13: goto NODE13;
    case 14: goto NODE14;
    case 15: goto NODE15;
    default: assert(0);
}


While genNodeCode() generates the whole finite state machine, each node looks like this:

NODE0:
hits[0]++;
steps++;
if (steps >= nVisits) goto END;
if (uniform(0.0, 1.0, rng) < restartProbability)
    position = uniform(0, graph.length, rng);
else
    position = graph[0].outbound[uniform(0, graph[0].nOutArcs, rng)];
switch (position) {
    case 0: goto NODE0;
    case 1: goto NODE1;
    case 2: goto NODE2;
    case 3: goto NODE3;
    case 4: goto NODE4;
    case 5: goto NODE5;
    case 6: goto NODE6;
    case 7: goto NODE7;
    case 8: goto NODE8;
    case 9: goto NODE9;
    case 10: goto NODE10;
    case 11: goto NODE11;
    case 12: goto NODE12;
    case 13: goto NODE13;
    case 14: goto NODE14;
    case 15: goto NODE15;
    default: assert(0);
}


The LLVM back-end of the LDC2 compiler seems able to recognize this kind of jump table, and replaces it with a computed jump (a jmp *%eax instruction), so this machine is quick. Most of the time is spent generating random numbers.

To see if the computed gotos of GNU C (http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Labels-as-Values.html ) lead to better performance, I have translated the D code to GNU C99:

// graph_walk3.c
#include "stdio.h"
#include "stdlib.h"
#include "time.h"

#define nVisits 20000000
#define restartProbability 0.15
#define nMaxOutArcs 5
#define nNodes 16

inline static double uniform01() {
    return rand() / (double)RAND_MAX;
}

inline static int uniform(const int max) {
    int k = (int)(max * (rand() / (double)RAND_MAX));
    return (k == max) ? k - 1 : k;
}

typedef struct {
    unsigned int nOutArcs;
    void *outbound[nMaxOutArcs];
} Node;


#define NODE_BLOCK(N)                                                  \
    do {                                                               \
        NODE##N:                                                       \
            hits[N]++;                                                 \
            steps++;                                                   \
            if (__builtin_expect(steps >= nVisits, 0)) goto END;       \
            if (__builtin_expect(uniform01() < restartProbability, 0)) \
                goto *NodeAddresses[uniform(nNodes)];                  \
            else                                                       \
                goto *graph[N].outbound[uniform(graph[N].nOutArcs)];   \
    } while (0)


int main() {
    srand(time(NULL));
    static const void *NodeAddresses[] = {
        &&NODE0,  &&NODE1,  &&NODE2,  &&NODE3,
        &&NODE4,  &&NODE5,  &&NODE6,  &&NODE7,
        &&NODE8,  &&NODE9,  &&NODE10, &&NODE11,
        &&NODE12, &&NODE13, &&NODE14, &&NODE15};

    static const Node graph[nNodes] = {
        /* 0*/ {3, {&&NODE13, &&NODE9, &&NODE8}},
        /* 1*/ {1, {&&NODE2}},
        /* 2*/ {2, {&&NODE0, &&NODE8}},
        /* 3*/ {1, {&&NODE1}},
        /* 4*/ {2, {&&NODE1, &&NODE2}},
        /* 5*/ {2, {&&NODE2, &&NODE8}},
        /* 6*/ {3, {&&NODE3, &&NODE4, &&NODE5}},
        /* 7*/ {1, {&&NODE6}},
        /* 8*/ {1, {&&NODE15}},
        /* 9*/ {5, {&&NODE14, &&NODE12, &&NODE11, &&NODE10, &&NODE8}},
        /*10*/ {1, {&&NODE15}},
        /*11*/ {1, {&&NODE15}},
        /*12*/ {1, {&&NODE15}},
        /*13*/ {1, {&&NODE9}},
        /*14*/ {1, {&&NODE15}},
        /*15*/ {1, {&&NODE7}}};

    unsigned int hits[nNodes] = {0};
    unsigned int steps = 0;
    goto *NodeAddresses[uniform(nNodes)];

    NODE_BLOCK(0);
    NODE_BLOCK(1);
    NODE_BLOCK(2);
    NODE_BLOCK(3);
    NODE_BLOCK(4);
    NODE_BLOCK(5);
    NODE_BLOCK(6);
    NODE_BLOCK(7);
    NODE_BLOCK(8);
    NODE_BLOCK(9);
    NODE_BLOCK(10);
    NODE_BLOCK(11);
    NODE_BLOCK(12);
    NODE_BLOCK(13);
    NODE_BLOCK(14);
    NODE_BLOCK(15);

    END:
    for (int i = 0; i < nNodes; i++)
        printf("%2d %2.1f\n", i, 100.0 * hits[i] / (double)nVisits);

    return 0;
}

I have used C macros to keep the C code about as short as the D code.

The "do { } while (0)" is a useful trick to require a semicolon to the macro calls.

The "NODE##N:" syntax allows to generate the numered jump labels from the NODE_BLOCK() macro argument N.

Now there is no need for jump tables; there are just gotos to the addresses. I think this is one of the fastest ways to implement a hard-coded finite state machine.

GCC 4.8.0 generates code like this for each node:

L14:
    addl $1, %ebx
    addl $1, 20(%esp)
    cmpl $19999999, %ebx
    ja L15
    call _rand
    movsd LC1, %xmm1
    cvtsi2sd %eax, %xmm0
    mulsd LC0, %xmm0
    comisd %xmm0, %xmm1
    ja L125
    call _rand
    cvtsi2sd %eax, %xmm0
    xorl %eax, %eax
    mulsd LC0, %xmm0
    cvttsd2si %xmm0, %edx
    cmpl $1, %edx
    cmovne %edx, %eax
    movl _graph.4128+28(,%eax,4), %eax
    jmp *%eax


The Intel compiler ICC 13.0.1 instead splits the code into a fast zone and a slow zone. I have not timed this, but I think it could be a little more efficient:
    incl      %ebx
    incl      40(%rsp)
    cmpl      $20000000, %ebx
    jae       ..B1.108      # Prob 4%
    call      uniform01()
    comisd    .L_2il0floatpacket.20(%rip), %xmm0
    jbe       ..B1.139      # Prob 18%
    movl      $1, %edi
    call      uniform(int)
    movslq    %eax, %rax
    movq      488+graph.351.0.3(,%rax,8), %rax
    jmp       *%rax
...


..B1.139:
    movl      $16, %edi
    call      uniform(int)
    movslq    %eax, %rax
    movq      NodeAddresses.351.0.3(,%rax,8), %rax
    jmp       *%rax


Run-times in seconds, for nVisits = 20_000_000:
  graph_walk1: 1.35 (nUnroll = 8)
  graph_walk2: 1.14
  graph_walk3: 1.08

Compilation arguments used:
GCC:
-Wall -Wextra -std=c99 -mfpmath=sse -march=native -Ofast -flto -s graph_walk3.c -o graph_walk3
LDC2 (ldmd2):
-O -release -inline -noboundscheck

Compilers:
GCC V.4.8.0
LDC2 V.0.12.1 based on DMD v2.063.2 and LLVM 3.3.1.

The code shown in this post:
http://www.fantascienza.net/leonardo/js/graph_walk.zip

Subject:Coin game probabilities
Time:01:53 pm
A simple programming problem, here solved in Matlab:
http://blogs.mathworks.com/pick/2008/03/31/puzzler-coin-game/

There is a table composed of four squares in a circle. When you are on the last square and you move one step forward, you go back to the first square.

Starting at the first square, flip two coins at each move. If the coins are:
- Head Head: move one square forward;
- Tail Tail: move two squares forward;
- Head Tail or Tail Head: go to the first square.

With such rules, what is the probability of being on each of the four squares after twenty moves?

A first simple way to find the probabilities numerically is by simulation, in D language:

import std.stdio, std.random, std.algorithm, std.traits,
       std.bigint, std.exception;

enum uint nSquares = 4;

uint move(uint pos, ref Xorshift rng) {
    final switch (rng.front % nSquares) {
        case 0, 1: // Move one square forward.
            pos = (pos + 1) % nSquares;
            break;
        case 2: // Move two squares forward.
            pos = (pos + 2) % nSquares;
            break;
        case 3: // Go to the first square.
            pos = 0;
            break;
    }
    rng.popFront;
    return pos;
}

void main() {
    enum nTurns = 20,
         nTries = 100_000_000;

    auto rng = Xorshift(unpredictableSeed);
    enforce(isUnsigned!(typeof(rng.front)) &&
            (rng.front.max.BigInt + 1) % nSquares == 0,
            "move() needs to use std.random.uniform().");

    uint[nSquares] finalPosCounts;
    foreach (immutable _; 0 .. nTries) {
        uint pos = 0; // Starting position.
        foreach (immutable __; 0 .. nTurns)
            pos = move(pos, rng);

        finalPosCounts[pos]++;
    }

    finalPosCounts[].map!(x => x / cast(real)nTries).writeln;
}

I have used rng.front % nSquares instead of std.random.uniform() because, compiled with the LDC2 compiler, it's almost twice as fast.

If I compile that first program with:
ldmd2 -O -release -inline -noboundscheck

LDC2 produces the following good 32 bit asm from the two loops of the main():

LBB1_1:
    xorl    %esi, %esi
    xorl    %ecx, %ecx
    .align  16, 0x90
LBB1_2:
    movl    %edi, %edx
    movl    %esi, %edi
    movl    %edx, %ebx
    andl    $3, %ebx
    xorl    %esi, %esi
    cmpl    $3, %ebx
    je  LBB1_7
    cmpl    $2, %ebx
    je  LBB1_13
    cmpl    $1, %ebx
    ja  LBB1_14
    incl    %edi
    jmp LBB1_6
    .align  16, 0x90
LBB1_13:
    addl    $2, %edi
LBB1_6:
    andl    $3, %edi
    movl    %edi, %esi
LBB1_7:
    movl    48(%esp), %ebx
    movl    52(%esp), %edi
    movl    %edi, 48(%esp)
    movl    56(%esp), %edi
    movl    %edi, 52(%esp)
    movl    %ebx, %edi
    shll    $11, %edi
    xorl    %ebx, %edi
    movl    %edx, 56(%esp)
    movl    %edx, %ebx
    shrl    $19, %ebx
    xorl    %edx, %ebx
    xorl    %edi, %ebx
    shrl    $8, %edi
    xorl    %ebx, %edi
    movl    %edi, 60(%esp)
    incl    %ecx
    cmpl    $20, %ecx
    jl  LBB1_2
    incl    32(%esp,%esi,4)
    incl    %eax
    cmpl    $100000000, %eax
    jl  LBB1_1


A different algorithm, as explained in the linked article, is to raise the transition probability matrix to a power (here the matrix multiplication is implemented in a slow, high-level way):

import std.stdio, std.range, std.array, std.numeric, std.algorithm;

T[][] mul(T)(in T[][] A, in T[][] B) {
    const Bt = B[0].length.iota.map!(i=> B.transversal(i).array).array;
    return A.map!(a => Bt.map!(b => a.dotProduct(b)).array).array;
}

void main() {
    real[][] M = [[1, 2, 1, 0], [1, 0, 2, 1], [2, 0, 0, 2], [3, 1, 0, 0]];
    foreach (row; M)
        row[] *= 0.25;
    auto P = M.mul(M);
    foreach (immutable _; 2 .. 20)
        P = P.mul(M);

    P[0].writeln;
}

One run of the simulation (nTries = 100_000_000, runtime less than 21 seconds) gives:
[0.386108, 0.234545, 0.213841, 0.165506]

The second program gives a similar output:
[0.386203, 0.234484, 0.213797, 0.165516]

Running the first program with (ulong nTries = 10_000_000_000) in less than an hour gives:
[0.386202, 0.234477, 0.213796, 0.165525]

The code:
http://www.fantascienza.net/leonardo/js/index.html#table_probs

Subject:Better C++/D code
Time:03:50 pm
When I read the "The Little Schemer" I have done countless small coding exercises using recursion:
https://github.com/pkrumins/the-little-schemer

But later, when I started learning Haskell, I saw that explicit recursion is uncommon in good Haskell code. Most of the time in Haskell you use map, folds, and other higher-order functions instead of recursion. (Sometimes in Haskell you use explicit recursion as an optimization.)

C and C++ code contains lots of 'for' and 'while' loops, which are sometimes hard to read and understand.


Recently the videos and slides of the talks of the GoingNative 2013 conference have been released. Among the interesting talks, the one I have appreciated most is "C++ Seasoning" by Sean Parent (and from the comments I see that others agree with me, and even ask the speaker to write a book on the topic):
http://channel9.msdn.com/Events/GoingNative/2013/Cpp-Seasoning

I suggest starting with the updated slides:
https://github.com/sean-parent/sean-parent.github.com/wiki/presentations/2013-09-11-cpp-seasoning/cpp-seasoning.pdf

The talk gives some suggestions for C++ code:
- No Raw Loops
- No Raw Synchronization Primitives
- No Raw Pointers

The first part suggests reducing the number of loops in C++ code and replacing them with algorithms (often STL algorithms), a little like it's done in Haskell.

C++ is an abstraction over C, and the heavy usage of STL-like algorithms is a kind of new language that abstracts over raw loop-heavy C++ code. Writing code like that in C++ takes time and some brain power, but the resulting code is good: short, simpler to keep bug-free, and often efficient too.

In the D language I suggest using std.algorithm for similar purposes. I also suggest making all variables const/immutable, unless they need to mutate, or unless something in D or in Phobos makes that impossible (like ranges, which often can't be const). A tiny example of this style follows.
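
This example is mine, not from the talk: the raw loop and the std.algorithm chain below compute the same sum of the squares of the even numbers.

import std.algorithm, std.stdio;

void main() {
    immutable data = [1, 2, 3, 4, 5, 6];

    // Raw loop version:
    int tot1 = 0;
    foreach (immutable x; data)
        if (x % 2 == 0)
            tot1 += x ^^ 2;

    // "No raw loops" version:
    immutable tot2 = data.filter!(x => x % 2 == 0)
                         .map!(x => x ^^ 2)
                         .sum;

    writeln(tot1, " ", tot2); // 56 56
}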

Walter Bright calls a variant of that coding style, which relies heavily on ranges, "component programming". I have shown several examples of such linear-flow programming in past blog posts (today it's becoming very common: LINQ in C#, streams in Java 8, Python itertools, F#, and so on).

Here is another simple example, the Euler Problem number 50 (http://projecteuler.net/problem=50 ), the problem:
The prime 41, can be written as the sum of six consecutive primes: 41 = 2 + 3 + 5 + 7 + 11 + 13
This is the longest sum of consecutive primes that adds to a prime below one-hundred.
The longest sum of consecutive primes below one-thousand that adds to a prime, contains 21 terms, and is equal to 953.
Which prime, below one-million, can be written as the sum of the most consecutive primes?
This is a simple imperative solution in D (modified and ported from Python code found here: http://blog.dreamshire.com/2009/04/12/project-euler-problem-50-solution/ ):
import std.stdio, std.algorithm, std.range, std.functional;
...

int euler50(in int top) {
    /*const*/ auto primes = top.sieve.assumeSorted;

    int[] prime_sum = [0];
    int tot = 0, count = 0;
    while (tot < top) {
        tot += primes[count];
        prime_sum ~= tot;
        count++;
    }

    int terms = 1;
    int max_prime = -1;
    foreach (immutable i; 0 .. count)
        foreach (immutable j; i + terms .. count) {
            immutable n = prime_sum[j] - prime_sum[i];
            if (j - i > terms && primes.contains(n)) {
                terms = j - i;
                max_prime = n;
            }
        }
    return max_prime;
}

void main() {
    1_000_000.euler50.writeln; // 997651
}
The idea is to compute a cumulative sum of the prime numbers, and then examine the N^2 subintervals, subtracting the cumulative sum at the start of an interval from the one at its end to compute the sum of its items in O(1). The contains() function is a linear search, so this code is roughly O(n^3), but for the given problem it's fast enough, finding the correct result of 997651 in less than half a second.

A very simple sieve implementation used in that euler50() function could be:

uint[] sieve(in uint limit) {
    if (limit < 2)
        return [];
    auto composite = new bool[limit];

    foreach (immutable uint n; 2 .. cast(uint)(limit ^^ 0.5) + 1)
        if (!composite[n])
            for (uint k = n * n; k < limit; k += n)
                composite[k] = true;

    return iota(2, limit).filter!(i => !composite[i]).array;
}

If you take a look at the euler50() function you see some general patterns. The cumulative sum of an array could be written as a general function (even better would be a lazy range that works on any input range):
T[] accumulate(T)(in T[] items) pure nothrow {
    auto result = new T[items.length];
    if (!items.empty) {
        result[0] = items[0];
        foreach (immutable i; 1 .. items.length)
            result[i] = result[i - 1] + items[i];
    }
    return result;
}

The iteration on all the pairs of indexes could be done with a handy "Pairwise" range:

import std.range, std.traits, std.typecons;

struct Pairwise(Range) {
    alias R = Unqual!Range;
    alias Pair = Tuple!(ForeachType!R, "a", ForeachType!R, "b");
    R _input;
    size_t i, j;

    this(Range r_) {
        this._input = r_;
        j = 1;
    }

    @property bool empty() {
        return j >= _input.length;
    }

    @property Pair front() {
        return Pair(_input[i], _input[j]);
    }

    void popFront() {
        if (j >= _input.length - 1) {
            i++;
            j = i + 1;
        } else {
            j++;
        }
    }
}

Pairwise!Range pairwise(Range)(Range r)
if (isRandomAccessRange!Range && hasLength!Range && !isNarrowString!Range) {
    return typeof(return)(r);
}
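
A quick usage example of this range:

void main() {
    import std.stdio;
    foreach (immutable p; [10, 20, 30].pairwise)
        writeln(p.a, " ", p.b); // 10 20, then 10 30, then 20 30
}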

Then what's left is a search among all the subsequences, throwing away the ones that are too long and the ones that don't sum to a prime number. Also, the pre-computed array of primes is sorted, so it's better to use a binary search. Now we can organize the whole computation in a few simple chains:
int euler50b(in int top) {
    auto primes = top.sieve.assumeSorted;
    auto primeSum = primes.release.accumulate;

    return primeSum
           .until!(p => p > top)
           .walkLength
           .iota
           .pairwise
           .map!(ij => tuple(ij[1] - ij[0], primeSum[ij[1]] - primeSum[ij[0]]))
           .filter!(dn => primes.contains(dn[1]))
           .reduce!max[1];
}
This code works and runs in something like 0.3-0.4 seconds, so it's still fast enough.

This euler50b() function has several problems:
- A range like pairwise() is not yet present in std.range;
- Something like accumulate() (with a given reducing function) is missing in std.algorithm;
- In D there is no syntax to destructure (unpack) tuples, so you have to use things like: ij => tuple(ij[1] - ij[0] ...
- std.algorithm.max is not yet accepting a key function.
- Currently "primes" can't be immutable or even const because you can't call the "release" method on a const SortedRange;
- Currently "primeSum" can't be immutable otherwise reduce() doesn't work.
- Currently "iota" is not pure nor nothrow.

How the main function could look, once D and Phobos become more refined:
int euler50c(in int top) pure nothrow {
    immutable primes = top.sieve.assumeSorted;
    immutable primeSum = primes.accumulate!q{a + b};

    return primeSum
           .until!(p => p > top)
           .walkLength
           .iota
           .pairwise
           .map!({i, j} => tuple(j - i, primeSum[j] - primeSum[i]))
           .filter!({d, n} => primes.contains(n))
           .reduce!(max!({d, n} => d))[1];
}

The full code of the two versions:
http://www.fantascienza.net/leonardo/js/euler50.zip

Subject:A little Morse Code decoder in D
Time:11:19 pm
A little Morse Code decoder in D language, adapted from here:
http://apfelmus.nfshost.com/articles/fun-with-morse-code.html
void main() {
    import std.stdio, std.string, std.array, std.algorithm, std.conv;

    // Each Morse letter walks down an implicit binary tree stored in the
    // table string: from index a, a dot goes to child 2a+1 and a dash to
    // child 2a+2. The (b & 1) trick works because '.' is 0x2E (even)
    // while '-' is 0x2D (odd).
    "-- --- .-. ... .    -.-. --- -.. ."
    .split("    ")     // four spaces separate words
    .map!(w => w
               .split  // single spaces separate letters
               .map!(s => " ETIANMSURWDKGOHVF L PJBXCYZQ  "
                          [reduce!q{2 * a + 1 + (b & 1)}(0, s)])
               .text)
    .join(" ")
    .writeln;
}
It prints: MORSE CODE

Security:
Subject:Updates and links
Time:10:45 pm
A lot of time has passed since the last update.

News:

Solutions to the first three problems of the Google Code Jam 2013 qualification round:
http://www.fantascienza.net/leonardo/js/gcj13.rar

A D solution for the Euler problem n.395, translated from Python:
http://www.fantascienza.net/leonardo/js/index.html#euler395

Updated code in the zip file:
http://www.fantascienza.net/leonardo/ar/d_notes2.html

Updated and improved (now "foo-bar" is digested as "foo bar" instead of "foobar"):
http://www.fantascienza.net/leonardo/js/index.html#markov

Added the "Spectral-norm" benchmark using SIMD:
http://www.fantascienza.net/leonardo/js/index.html#spectralnorm_simd

Updated path tracing demo:
http://www.fantascienza.net/leonardo/js/index.html#path_tracing

Added a De Bruijn sequence generator in D:
http://www.fantascienza.net/leonardo/js/index.html#de_bruijn_sequence

Added two photos of a plush toy of the Pegasus pony Bluespark:
http://www.fantascienza.net/leonardo/im/index.html#bluespark_toy


Links:

A nice example of not so good functional programming, and a better solution:
http://pragprog.com/magazines/2013-03/uncle-bob-and-functional-programming

A talk video, Simon Peyton Jones on Adventures with types, nice and interesting:
http://skillsmatter.com/podcast/java-jee/keynote-3842
The slides:
http://laser.inf.ethz.ch/2012/slides/PeytonJones/Adventures%20with%20types.pdf
All the slides of the interesting LASER 2013 conference, including more slides from Simon Peyton Jones:
http://laser.inf.ethz.ch/2012/slides/

Many interesting slides from the CPPNOW 2013 conference:
https://github.com/boostcon/cppnow_presentations_2013

A simple optimization story:
http://users.softlab.ntua.gr/~ttsiod/straylight.html

Some things to know about tin:
http://www.youtube.com/watch?v=Rh8yWyuoL2A

An article on why mobile web apps are slow:
http://sealedabstract.com/rants/why-mobile-web-apps-are-slow/

A simple intro to finite state machines for games:
http://gameprogrammingpatterns.com/state.html

Less Wrong's / Eliezer Yudkowsky's Harry Potter fanfic-in-progress, "Harry Potter and the Methods of Rationality", a very long and very good fanfiction novel that introduces ways to think rationally:
http://hpmor.com/

"Flutterwonder", a 3D musical video about the little pony Fluttershy:
http://www.youtube.com/watch?&v=iQYrFecvCGo

Security:
Subject:A word counting problem
Time:04:31 pm
A little programming problem from this blog post:
http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/

The problem is:
Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.
A solution with shell scripting:
Here's the script that does that, with each command given its own line:
bash:
1:  tr -cs A-Za-z '\n' |
2:  tr A-Z a-z |
3:  sort |
4:  uniq -c |
5:  sort -rn |
6:  sed ${1}q
Explanation:
1) Make one-word lines by transliterating the complement (-c) of the alphabet into newlines (note the quoted newline), and squeezing out (-s) multiple newlines.
2) Transliterate upper case to lower case.
3) Sort to bring identical words together.
4) Replace each run of duplicate words with a single representative and include a count (-c).
5) Sort in reverse (-r) numeric (-n) order.
6) Pass through a stream editor; quit (q) after printing the number of lines designated by the script's first parameter (${1}).

Even if data-filtering programs for these exact purposes were not at hand, these operations would well be implemented separately: for separation of concerns, for easier development, for piecewise debugging, and for potential reuse. The simple pipeline given above will suffice to get answers right now, not next week or next month. It could well be enough to finish the job.
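To try the pipeline yourself you can save the six commands in a little script, say wordfreq.sh (the file name is mine), and run it like "sh wordfreq.sh 10 < test_data.txt" to print the ten most frequent words of the test file with their counts.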
Some comments on that blog post:
Andre is quite right that this is an example of the elegance of FP vs. imperative -- but the flip side is that FP often fails when it runs up against real-world edge cases. Knuth's version may not handle contractions -- but it could, easily. Could McIlroy's? What if I want British/US spellings (such as "color" and "colour") to be counted as one word? What about hyphenated words?
20+ years later, we still don't have sed or awk available as a library, afaik. We also don't have a tr-like function which works on streams, or a sort for text files in standard libraries. Which is why you simply can't use it when you write something that isn't completely solvable via shell scripting - it doesn't get compiled in, and your team leader/senior programmer/whatever will definitely not allow you to call shell commands from C, C++ or Pascal, even when it pays off.
This is a modern solution (from the blog post http://franklinchen.com/blog/2011/12/08/revisiting-knuth-and-mcilroys-word-count-programs/ with several small changes; it can probably be improved further):
import System.Environment (getArgs)
import Data.List (sortBy)
import Data.Char (toLower, isLetter)
import qualified Data.HashMap.Strict as HM

createReport :: Int -> String -> String
createReport n text =
    unlines $
    map (\(count, w) -> w ++ " " ++ show count) $
    take n $
    sortBy (flip compare) $
    map (\(w, count) -> (count, w)) $
    HM.toList $
    HM.fromListWith (+) $
    map (\w -> (w, 1)) $
    words $
    map toLower $
    map (\c -> if isLetter c then c else '\n') $
    text

main = do
    [fileName, nstr] <- getArgs
    let n = read nstr
    text <- readFile fileName
    putStr $ createReport n text
Usage: runhaskell wfreqh.hs test_data.txt 10

A comment from that blog post:
The shell script operates on raw text and everything is just strings being parsed and reparsed by the respective Unix utility programs.

The Haskell program is statically typed. It is type-checked by the compiler, which generates native code. The program uses standard libraries and data types, such as lists and hash maps.

Also, the Haskell program could be refined, extended, optimized in various ways. The most important optimizations I can think of off the top of my head:
- Using a better representation of strings than the default built-in "string as list of characters". Easily accessible advice can be found on Stack Overflow and browsing through Haskell library documentation, such as for the text package.
- Loop fusion, deforestation can be applied to deal with the apparent allocation of lots of new lists in the pipeline. One of the selling points of using a language like Haskell is the opportunity for the compiler to perform radical optimizations that are impossible for languages that have side effects.

I don't write many bash scripts these days. General-purpose programming languages can do a decent job munging data without difficulty. The situation was different decades ago when there was C, and few standard high-level libraries for the C world.
This is a D language version (it requires no installation of libraries, unlike the unordered-containers package from Cabal used by the Haskell program). It works with Unicode text; if you only have ASCII text the code becomes simpler. That Haskell version seems to have problems with Unicode text (on the other hand, don't expect the D code to perform fully correct Unicode sorting, which is a complex problem).
import std.stdio, std.conv, std.file, std.string,
       std.algorithm, std.range;

void main(string[] args) {
    args[1]
    .readText
    .toLower
    .tr("A-Za-z", "\n", "cs")
    .split
    .sort()
    .group
    .array
    .schwartzSort!q{ -a[1] }
    .take(args[2].to!uint)
    .map!q{ text(a[0], " ", a[1]) }
    .join("\n")
    .writeln;
}
D version: 17 nonblank lines, Haskell: 23 lines.
(The sort function has () to avoid calling the built-in array sort, which will be deprecated.)
File sizes:
9.005.142 test_data.txt
      365 wfreqd.d
      701 wfreqh.hs
1.565.710 wfreqh.exe (stripped)
  510.492 wfreqd.exe
Haskell code compiled with:
ghc -O3 wfreqh.hs
Glasgow Haskell Compiler, Version 7.6.1, stage 2 booted by GHC version 7.4.2

D code compiled with:
dmd -O -release -inline -noboundscheck wfreqd.d
DMD Version 2.064alpha.
Runtime, best of 3 (top 100 words):
  wfreqh: 7.45 seconds
  wfreqd: 6.24 seconds
Tests performed on Windows 32 bit, using a roughly 9 MB English text file.

In theory the shell program is fully lazy (even the sort can be done with an external sorting algorithm), while the D (and probably the Haskell) versions load the whole file in memory.
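Just to illustrate, a more streaming D variant could read the file line by line (a rough sketch with illustrative names; the word counts still live in memory, so it's only partially lazy):
void main(string[] args) {
    import std.stdio, std.algorithm, std.array, std.range,
           std.uni, std.conv;

    size_t[string] counts;
    foreach (line; File(args[1]).byLine) {
        // Copy and lower-case each line, then split it on non-letters.
        foreach (word; line.idup.toLower.splitter!(c => !c.isAlpha))
            if (!word.empty)
                counts[word]++;
    }

    // The same report as before: decreasing count, top n words.
    foreach (p; counts.byKey
                      .zip(counts.byValue)
                      .array
                      .sort!q{ a[1] > b[1] }
                      .take(args[2].to!uint))
        writeln(p[0], " ", p[1]);
}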

The D and Haskell programs are strongly statically typed so they are able to catch several programming errors at compile-time.

The shell program uses text, so it has to convert numbers to text and then from text back to numbers, while the D and Haskell programs use lazy sequences of arbitrary types, which can be more powerful and save the time of the textual serialization and deserialization. And unless floating point values are serialized as hex floats, such textual conversions can also introduce some error.
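A tiny sketch of that conversion error (illustrative only):
void main() {
    import std.stdio, std.string, std.conv;

    immutable double x = 1.0 / 3.0;
    // Serialize with the usual 6 significant digits, then parse back:
    immutable double y = format("%g", x).to!double;
    writeln(x == y);          // false: the decimal round trip has lost bits
    writeln(format("%a", x)); // a hex float like 0x1.5555555555555p-2 keeps them all
}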

If you have to perform some operation not available among the Unix tools, you can write a small program (in C or another language) that uses text pipes, and call it like the other shell commands. In Haskell or D you just write a function (or, in D, often a lazy range struct) and call it; this is simpler and more integrated, but less modular.

Update Jun 21 2013: wiz offers this Haskell version (gist.github.com/wiz/5826131):
{-# LANGUAGE TupleSections #-}

import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TL
import           Data.Char (isLetter)
import           Data.List (sortBy)
import           Data.Function (on)
import qualified Data.HashMap.Strict as HM
import System.Environment (getArgs)

collect = sortBy (flip compare `on` snd)
        . HM.toList
        . HM.fromListWith (+)
        . map (,1)
        . filter (not . TL.null)
        . TL.split (not . isLetter)
        . TL.toLower

display (word, count) = do
    TL.putStr word
    putStr ": "
    print count

main = do
    [fname, count] <- getArgs
    text <- TL.readFile fname
    mapM_ display $ take (read count) (collect text)

This Haskell version is slower than the previous one; it takes about 10.8 seconds to run. The D code compiled with the ldc2 compiler (which optimizes better than dmd) takes 4.5 seconds to run (if you use ldc2 2.063 then use .sort!q{-a[1] < -b[1]} instead of the schwartzSort, which was improved later in Phobos).

There are many ways to improve the performance of the D code, like using a hash counter, which is missing in Phobos (runtime with ldc2 is about 3.17 seconds):
import std.stdio, std.conv, std.file, std.string,
       std.algorithm, std.range, std.traits;

auto hashCounter(R)(R items) if (isForwardRange!R) {
    size_t[ForeachType!R] result;
    foreach (x; items)
        result[x]++;
    // byKey and byValue visit the associative array in the same order,
    // so zip pairs every key with its own count.
    return result.byKey.zip(result.byValue);
}

void main(string[] args) {
    args[1]
    .readText
    .toLower
    .tr("A-Za-z", "\n", "cs")
    .split
    .hashCounter
    .array
    .sort!q{ -a[1] < -b[1] }
    .take(args[2].to!uint)
    .map!q{ text(a[0], " ", a[1]) }
    .join("\n")
    .writeln;
}
