 Years ago I wrote a D V.1 language implementation of the little Norvig spell corrector, using my own library: http://leonardom.livejournal.com/59589.html
The original code an explanation (the Python code is changed in the meantime): http://norvig.com/spellcorrect.html
In the meantime the D language and its standard library Phobos have improved a lot, so now I rewrite the D solution using just the standad library.
This is partially "golfed" D code, it's not an example of good D code.
import std.stdio, std.regex, std.algorithm, std.range, std.File, std.string, std.ascii;
int[string] nWords;
auto set(R)(R r) { return r.zip(0.repeat).assocArray; }
auto edits1(string W) @safe {
immutable n = W.length;
auto deletion = n.iota.map!(i=>W[0..i]~W[i+1..$]);
auto transposition = iota(n1).map!(i=>W[0..i]~W[i+1]~W[i]~W[i+2..$]);
auto alteration = n.iota.cartesianProduct(lowercase).map!(p=>W[0..p[0]]~cast(char)p[1]~W[p[0]+1..$]);
auto insertion = iota(n+1).cartesianProduct(lowercase).map!(p=>W[0..p[0]]~cast(char)p[1]~W[p[0]..$]);
return chain(deletion, transposition, alteration, insertion).set;
}
auto known(R)(R words) { return words.filter!(w => w in nWords).set.keys; }
enum knownEdits2 = (string word) => word.edits1.byKey.map!(e1 => e1.edits1.byKey).joiner.known;
T or(T)(T x, lazy T y) { return x.length ? x : y; }
string correct(string word) {
auto candidates = [word].known.or(word.edits1.byKey.known.or(word.knownEdits2.or([word])));
return candidates.minPos!((w1, w2) => nWords.get(w1, 1) < nWords.get(w2, 1)).front;
}
void main() {
foreach (mo; "big.txt".readText.toLower.matchAll("[az]+"))
nWords[mo.front]++;
writeln("speling".correct, '\n', "korrecter".correct);
} The output is:
spelling
corrected Phobos currently lacks a set implementation, so I've defined a function that uses the builtin associative arrays. The "or" function is from the D1 code, with a small improvement.
The runtime of the D code compiled with ldc2 (V.0.15.1) is about 1.5 seconds (32 bit Windows). Python 2.6.6 runs the updated Python code (searching for both words) in about 1.3 seconds.
The Python version creates the NWORDS dict in about 1.1 seconds. The D version takes about 1 second to create it.  comments: Leave a comment 
 A nice blog post, ""The "99 Bottles of Beers" of Type Systems"":
http://gelisam.blogspot.ca/2014/12/the99bottlesofbeersoftypesystems_21.html
The problem is to manage integers, pairs of integers, pairs of pairs of integers, and so on, to represent them in the type system, and to print the nth iteration of them:
0. Int 1. (Int,Int) 2. ((Int,Int),(Int,Int)) 3. (((Int,Int),(Int,Int)),((Int,Int),(Int,Int))) 4. ...
An Haskell solution from that blog:
printTree :: Show a => a > Int > IO ()
printTree v 0 = print v
printTree v n = printTree (v,v) (n1)
main :: IO ()
main = readLn >>= printTree 42
This Haskell solution is short, and it's also efficient at runtime (quite faster than the Java solution shown in that blog).
One D language solution uses the static (monomorphization) version of D generics, similar to C++ templates:
import std.stdio, std.conv, std.typetuple;
// For higher maxDepth increase stack with L/STACK:10000000
enum maxDepth = 14;
template Iota(uint stop) {
static if (stop <= 0)
alias Iota = TypeTuple!();
else
alias Iota = TypeTuple!(Iota!(stop  1), stop  1);
}
struct Pair(T) {
T x, y;
string toString() {
return text('(', x, ',', y, ')');
}
}
void printTree(uint depth, T)(T v) {
static if (depth == 0)
writeln(v);
else
printTree!(depth  1)(Pair!T(v, v));
}
void main(in string[] args) {
immutable depth = (args.length == 2) ? args[1].to!uint : 5;
enum value = 42;
switch (depth) {
foreach (immutable d; Iota!maxDepth)
case d: return printTree!d(value);
default: return writeln("Too much large depth.");
}
}
This gives efficient code, and the Pairs are stackallocated, but the compiler allows only a limited amount of nesting. The foreach inside the switch generates the switch cases statically. This works up to a depth of 13.
To reach higher levels of nesting in D language I use dynamic (runtime) polymorphism:
import std.stdio, std.conv, std.format;
interface Value {
void toString(scope void delegate(const(char)[]) sink) const;
}
final class Integer : Value {
immutable int x;
this(int x_) pure nothrow @safe @nogc {
this.x = x_;
}
override void toString(scope void delegate(const(char)[]) sink) const {
static fmt = singleSpec("%d");
sink.formatValue(x, fmt);
}
}
final class Pair : Value {
const Value fst, snd;
this(Value fst, Value snd) pure nothrow @safe @nogc {
this.fst = fst;
this.snd = snd;
}
override void toString(scope void delegate(const(char)[]) sink) const {
sink("(");
fst.toString(sink);
sink(",");
snd.toString(sink);
sink(")");
}
}
struct PairWrapper {
Value v;
void toString(scope void delegate(const(char)[]) sink) const {
v.toString(sink);
}
}
void printTree(Value v, in uint n) {
if (n == 0)
writeln(PairWrapper(v));
else
printTree(new Pair(v, v), n  1);
}
void main(in string[] args) {
import core.memory, core.stdc.stdlib;
GC.disable;
immutable depth = (args.length == 2) ? args[1].to!uint : 5;
printTree(new Integer(42), depth);
exit(0);
}
The toString() inside the interface and PairWrapper are used because D classes don't yet support the sink. The sink is important to increase performance, avoiding allocating lot of strings. This version is more efficient also because it leads to more coherence for the L1 code cache. This D solution is just a little slower than the Haskell solution (perhaps because the Haskell solution caches the formatting of the integer 42, I don't know. If I perform such caching the D code becomes faster than the Haskell code), and it's much faster than the Java version (despite currently the D garbage collector is less efficient than the OracleVM ones).  comments: Leave a comment 
 By leonardo maffi, V.1.0, 1 Aug 2014.
A nice programming exercise in Haskell by Harry Garrood, "Permutations: an exercise" (21 Jul 2014): http://harry.garrood.me/blog/permutationsanexercise/ http://www.reddit.com/r/haskell/comments/2ba54b/a_little_programming_exercise_about_permutations/
Harry Garrood writes:
We're going to use Haskell, because this is all about functions (in the mathematical sense), and so Haskell, being a functional programming language, is especially well suited to the job. Most of the code you below is in D language (with a C version at the end) because it's a good language to solve similar problems.
The problem statement defines a card shuffling technique:
1) Take one card from the top of the deck and discard it into a second pile. 2) Take another card from the top of the deck, and put it at the bottom of the deck. 3) Repeat these two steps, putting all discarded cards from step 1 into the same pile, until the original deck is all gone and the second pile has all the cards in it.
The problem is: how many shuffles does it take until a deck is in the same order as when you started, for a deck with an arbitrary number of cards? Write a function, f :: Int > Int, such that, for a deck with n cards, f n is the minimum number of shuffles required to return it to its original order. Below I explain the varios D implementations, that you can find in the archive: http://www.fantascienza.net/leonardo/js/permutations.zip
The fantascienza site is down, so you can temporarily find the programs here:
perms1.hs: http://lpaste.net/2412259684489625600 perms2.hs: http://lpaste.net/390308567523000320 perms3.d: http://dpaste.dzfl.pl/2eb987664dfd perms3b.d: http://dpaste.dzfl.pl/0aa48d395cb2 perms4.d: http://dpaste.dzfl.pl/0b0cb344ccce perms5.d: http://dpaste.dzfl.pl/9f929c2d064d perms6.d: http://dpaste.dzfl.pl/dbf1d3f19321 perms7.d: http://dpaste.dzfl.pl/a8eddf54067d perms7b.d: http://dpaste.dzfl.pl/8a9a1afc10f8 perms7c.d: http://dpaste.dzfl.pl/45ff250cafa7 perms8.d: http://dpaste.dzfl.pl/32614ae6430c perms9.c: http://dpaste.dzfl.pl/608dd06e44a5
I compile all the code on a 32 bit Windows using:
dmd O release inline noboundscheck
ghc O3
gcc Ofast flto std=c99 Wall Wextra
dmd 2.066beta4
Glasgow Haskell Compiler, Version 7.6.1, stage 2 booted by GHC version 7.4.2
gcc 4.0.0
                  
perms1.hs: perms2.hs:
The perms1.hs and perms2.hs programs are adapted from the two solutions by Harry Garrood, with small changes.
This is the original 'permutations' function:
permutation :: (Deck > Deck) > Int > PermutationGraph
permutation f n = go 1 xs
where
xs = unCardAll . f . makeDeck $ n
go m (y:ys) = (y, m) : go (m+1) ys
go _ [] = []
This is my version:
permutation :: (Deck > Deck) > Int > PermutationGraph
permutation shuf n = zip perm [1 ..]
where perm = unCardAll $ shuf $ makeDeck n
perms2.hs runs on the six test cases (n = 4, 5, 52, 53, 100, 200) in about 0.08 seconds (such small timings are not precise).
                  
perms3.d: perms3b.d:
This program is equivalent to the perms1.hs program. Harry Garrood defines a very nice and clean Haskell function that performs the two first steps of the shuffling:
step :: (Deck, Deck) > (Deck, Deck)
step (deck, pile) = case deck of
a:b:cs > (cs ++ [b], a : pile)
[a] > ([], a : pile)
[] > ([], pile)
I have implemented this 'step' function in various ways in D, and at the end this is the version I have used (here I have removed the pre/postconditions, see the files for the full code), this performs the whole shuffling (including the third step):
Deck deckShuffle(in Deck Ain) pure nothrow @safe @nogc {
typeof(return) B;
const(Card)[] A = Ain;
while (!A.empty) {
// Step 1:
B = [Card(A.front)] ~ B;
A.popFront;
// Step 2:
if (!A.empty) {
A ~= A.front;
A.popFront;
}
}
return B;
}
This function is (strongly) pure, it accepts a constant array of Cards and performs the shuffling using slices and concatenations, building a new output array B. Here a Deck is a dynamic array, but perhaps for this algorithm a more efficient data structure for a Deck is a (singly linked) list (about as in the Haskell code).
In perms3 I have a commentedout the rangebased version of 'makeDeck' because currently the 'array' function is not @safe for that use case. So I've written a more internally imperative version of 'makeDeck':
enum makeDeck = (in uint n) pure nothrow => iota(1, n + 1).map!Card.array;
I have used a Typedef to implement the 'newtype' of the Haskell code (despite currently Typedef has some bugs):
alias Card = Typedef!int;
The function 'f1' uses a dowhile because Phobos still lacks an 'iterate' rangebased function, as used in the 'shuffle' Haskell function.
I have run both perms1.hs and perms3.d on the few testing values, both print:
4 2
5 5
52 510
53 53
100 120
200 8460
perms1.hs produces this output in about 2.72 seconds:
[2,5,510,53,120,8460]
perms3.d produces this output in (best of 3) in about 1.39 seconds:
4 2
5 5
52 510
53 53
100 120
200 8460
I have then created a perms3b.d program that is very similar to perms3.d, but represents a Deck with a SList (a singly linked list) hoping to improve performance avoiding the O(n) copies of the slices of the arraybased version. But perms3b.d runs in about 30.4 seconds. In most cases linked lists don't give any performance advantage.
                  
perms4.d:
This version is similar to perms3.d, but here I have implemented 'deckShuffle' in a lower level way. This deckShuffle is now @nogc (it doesn't allocate on the heap), and works inplace on two given arrays, using indexes (it could also be implemented using pointer arithmetic, but there's no need for that, this function is fast and @safe). Now deckShuffle is still annotated with 'pure' but now it's weakly pure:
void deckShuffle(Deck A, Deck B) pure nothrow @safe @nogc {
size_t aTop = 0,
aLength = A.length,
bTop = B.length  1; // B is empty at start.
while (aLength) {
// Step 1:
B[bTop] = A[aTop];
aTop = (aTop + 1) % A.length;
bTop;
aLength;
// Step 2:
if (aLength) {
A[(aTop + aLength) % A.length] = A[aTop];
aTop = (aTop + 1) % A.length;
}
}
}
This kind of code is more fiddly and requires more care and testing compared to more a functional style, but it's also much faster and produces no heap garbage. The runtime of this function is also more predictable.
The run time of 'perms4.d' to produce the six output pairs is about 0.06 seconds, so this is about 23 times faster than 'perms3.d'. Unlike many other languages, in D it's easy to mix low level code with higher level code, and produce code that still looks natural and free of most accidental complexity.
                  
perms5.d:
This implements the algorithm from perms2.hs. For simplicity I have used the slower but simpler 'deckShuffle' from perms3.d.
Instead of writing the various functions of the Haskell version, building the various intermediate data stutures (like the list of pairs), I have written a single 'f2' function that computes the lcm of the cycleLen without storing the actual cycles. D has less 'glue' than Haskell, so often in D it's more convenient to write a little larger functions (with the side effect that they are often faster than the Haskell versions, but also less easy to combine in different ways):
ulong f2(in uint n) pure nothrow {
uint[uint] graph;
foreach (immutable uint i, immutable x; n.makeDeck.deckShuffle)
graph[i + 1] = cast(uint)x;
typeof(return) result = 1;
while (graph.length) {
auto x = graph.byKey.front;
uint cycleLen = 0;
while (x in graph) {
cycleLen++;
immutable y = graph[x];
graph.remove(x);
x = y;
}
result = lcm(result, cycleLen);
}
return result;
}
This code is less general and more specialized than the Haskell functions. It's less easy to see what the code is doing. Here 'graph' of the pairs is an associative array to avoid the linear searches in the lists (but a linear scan is still present despite being hidden, the one performed by byKey.front).
The runtime of perms5.d is very small, so beside calling 'f2' on the six test cases, this function also computes 'f2' on the numbers n in [1, 1_000[. The total runtime is about 1.29 seconds.
If I modify perms2.hs to compute f2 in the same [1, 1_000[ interval, the runtime is very large (about half an hour, and it uses 32 bit integers, so some results for the [1, 1_000[ interval are wrong). So the 'f2' function in D has some advantages (and it's still strongly pure). The D 'f2' function returns a ulong because when n grows large, the output grows a lot.
                  
perms6.d:
This is similar to perms5.d, but uses the more efficient 'deckShuffle' function from perms4.d. The runtime of perms6.d is only 0.20 seconds (this includes the computations for n in [1, 1_000[), so it's more than six times faster than perms5.d.
                  
perms7.d: perms7b.d: perms7c.d:
This is a lowerlevel version (that is easy to port to C language), it's similar to perms6.d, but here in 'f2' instead of the associative array 'graph' I have reused the array D2, filling it with 'emptyItem' when I remove a value. 'f2' is now also @nogc because it accepts fixedsize arrays as arguments and then slices them. So the program allocates the arrays only inside the 'main' (and here are stackallocated, so even the main is @nogc):
ulong f2(size_t m)(const uint n, ref Card[m] D1, ref Card[m] D2)
pure nothrow @safe @nogc {
for (uint i = 0; i < n; i++)
D1[i] = cast(Card)i;
deck_shuffle(D1, D2, n);
typeof(return) result = 1;
uint graph_len = n;
enum emptyItem = uint.max;
while (graph_len) {
uint x = uint.max;
for (uint i = 0; i < n; i++)
if (D2[i] != emptyItem)
x = i;
uint cycle_len = 0;
while (D2[x] != emptyItem) {
cycle_len++;
const uint y = D2[x];
D2[x] = emptyItem;
graph_len;
x = y;
}
result = lcm(result, cycle_len);
}
return result;
}
void main() nothrow @nogc {
import core.stdc.stdio: printf;
const uint nMax = 1_000;
Card[nMax] d1 = void, d2 = void;
for (uint n = 1; n <= nMax; n++)
printf("%llu\n", f2(n, d1, d2));
}
perms7.d runs in about 0.05 seconds. So it's about 4 times faster than perms6.d.
To avoid overflow bugs I have created the perms7b.d and perms7c.d versions, that use core.checkedint (despite core.checkedint is not meant for user code). For both perms7b.d and perms7c.d the runtime is similar to the unchecked versions, and perms7b.d spots the overflow bugs:
...
1389: 4247163320400
1390: 465585120
1391: 1391
1392: 6570
1393: 4294945235192621194 (overflow)
1394: 3655769040
1395: 1194
1396: 930
1397: 1748154975840
1398: 1398
...
                  
perms8.d:
This version is very similar to perms7.d, but I've used BigInt instead of ulong to compute all correct results (if there are no other bugs or design mistakes).
perms8.d is still quite fast, it runs in about 0.08 seconds (this includes the computations for n in [1, 1_000[). If I increase nMax to 10_000 it runs in about 6.6 seconds. D BigInts are not very fast, but I think this is not a significant problem for this program.
                  
perms9.c:
This is C code derived from perms7.d.
perms9.c runs in about 0.03 seconds or less. perms9.c is a little faster than perms7.d, but the speed difference is really visible when nMax is much larger (but then the program gives wrong outputs becuse of overflows). The difference between 0.03 and 0.05 seconds is mostly caused by the time taken to initialize the Druntime.
                    comments: Leave a comment 
 By leonardo maffi, version 1.1.
Warning: currently nearest4/nearest5 don't work correctly, I'll try to fix them.
(Later this article will be moved to my site.)
These blog posts present a performance comparison between F# and OCaml implementations of a simple KNearest Neighbor (with K=1) classifier algorithm to recognize handwritten digits:
http://philtomson.github.io/blog/2014/05/29/comparingamachinelearningalgorithmimplementedinfnumberandocaml/
http://philtomson.github.io/blog/2014/05/30/stopthepressesocamlwins/
One of the Reddit discussions:
http://www.reddit.com/r/fsharp/comments/26vl3w/stop_the_presses_ocaml_wins_in_terms_of_speed_vs/
The original F# and OCaml code: https://github.com/philtomson/ClassifyDigits
So I've studied and compared versions of this code in D language.
First I've written a functionalstyle version, 27 CLOC lines long, that is similar to the arraybased OcaML version:
import std.stdio, std.algorithm, std.range, std.array, std.conv,
std.string, std.math, std.typecons;
struct LabelPixels { ubyte label; ubyte[] pixels; }
immutable readData = (in string fileName) =>
fileName
.File
.byLine
.dropOne
.map!(r => r.chomp.split(",").to!(ubyte[]))
.map!(a => LabelPixels(a[0], a.dropOne))
.array;
immutable distance = (in ubyte[] p1, in ubyte[] p2)
pure /*nothrow*/ @safe /*@nogc*/ =>
double(p1.zip(p2).map!(ab => (ab[0]  ab[1]) ^^ 2).sum).sqrt;
immutable classify = (in LabelPixels[] trainingSet, in ubyte[] pixels)
/*pure nothrow*/ @safe =>
trainingSet
.map!(s => tuple(pixels.distance(s.pixels), ubyte(s.label)))
.reduce!min[1];
void main() {
const trainingSet = "training_sample.csv".readData;
const validationSample = "validation_sample.csv".readData;
immutable nCorrect = validationSample
.map!(s => trainingSet.classify(s.pixels) == s.label)
.sum;
writeln("Percentage correct: ", double(nCorrect) / validationSample.length * 100);
} Notes to this first version:  Inside LabelPixels the pixes are represented with an unsigned byte each. This reduces the RAM used to store the data and allows to use the CPU cache more efficiently.  "readData" is a lambda assigned to a global immutable value. This is not fully idiomatic D style, but I've used this because it looks more similar to the F# code. In practice I usually don't use such global lambdas because the output type gives me useful information to understand the purpose of the function, and they are less easy to find in the assembly (you can't just search for "readData" in the asm listing to find this lambda.  The D standard library contains a module to read CSV files (std.csv), but I always find its API bad, so I've used File.byLine and I have processed the lines lazily.  In readData I've used the eager function split instead of the lazy splitter (that uses less RAM) because for the data sets used by this program split is faster (readData loads both CSV files in about 0.80.9 seconds on my PC).  In the "distance" lambda the nothrow and @nogc attributes are commented out because zip() doesn't yet respect such attributes (zip throws an exception in certain cases and the code to throw the exception allocates it on the heap). The "classify" lambda calls "distance" so it can't be nothrow and @nogc (and because reduce doesn't have a default value in case its input range is empty, so it can throw, and because classify contains a heapallocating lambda).  The distance lambda contains code like "map!(ab => (int(ab[0])  ab[1])" because D is not yet able to destructure tuples, like the 2tuples generated lazily by zip.  The classify lambda contains a map of tuples because the max/min functions of Phobos still lack an optional "key" function like the max/min functions of Python. So I've created tuples of the mapped value plus the label and I've used reduce to find the min. I have used "ubyte(s.label)" to produce a mutable tuple, that reduce doesn't accept (unless you use a mutable seed).
Most runtime of this program is used inside the distance function. So if you want to optimize this program that's the function to improve.
In the zip you can find a "nearest1b.d" file that is very similar to this first function, but it's adapted to the current version of the LDC2 compiler, that is a little older (it doesn't support the @nogc attribute, T(x) syntax for implicit casts, and its Phobos doesn't have a sum function).
The D compilers (even the well optimizing LDC2 complier) is not yet very good at optimizing that functional distance function. So in the "nearest2.d" version of the program (that you can find in the zip archive) I've written distance in imperative way, this produces a program that compiled with ldc is more than 3 times faster:
uint distance(in ubyte[] p1, in ubyte[] p2) pure nothrow @safe {
uint tot = 0;
foreach (immutable i, immutable p1i; p1)
tot += (p1i  p2[i]) ^^ 2;
return tot;
} Now this distance function can be pure nothrow @safe. If you look this function also cheats a little not computing the square root, because it's not useful if you want to search the closest. This function also computes the total using integers.
I have not shown the asm generated by LDC2 for the nearest lambda in the "nearest1.d" program because it's too much complex and messy, but I can show the asm for this second simpler distance function:
__D8nearest38distanceFNaNbNfyAhyAhZk:
pushl %ebx
pushl %edi
pushl %esi
movl 24(%esp), %ecx
xorl %eax, %eax
testl %ecx, %ecx
je LBB8_3
movl 28(%esp), %edx
movl 20(%esp), %esi
.align 16, 0x90
LBB8_2:
movzbl (%edx), %edi
movzbl (%esi), %ebx
subl %ebx, %edi
imull %edi, %edi
addl %edi, %eax
incl %edx
incl %esi
decl %ecx
jne LBB8_2
LBB8_3:
popl %esi
popl %edi
popl %ebx
ret $16
As you see the loop gets compiled to quite simple asm with 9 X86 (32 bit) instructions.
To improve performance I have changed the data layout a little. With the "binary_generator1.d" program you can find in the zip archive I have generated binary files of the training set and validation sample. So now the loading of the data is very simple and very fast (see in the zip archive for the full source of this "nearest3.d" program:
enum nCols = 785;
enum size_t labelIndex = 0;
alias TData = ubyte;
alias TLabel = TData;
immutable(TData[nCols])[] readData(in string fileName) {
return cast(typeof(return))std.file.read(fileName);
}
uint distance(immutable ref TData[nCols  1] p1,
immutable ref TData[nCols  1] p2)
pure nothrow @safe /*@nogc*/ {
uint tot = 0;
foreach (immutable i, immutable p1i; p1)
tot += (p1i  p2[i]) ^^ 2;
return tot;
}
TLabel classify(immutable TData[nCols][] trainingSet,
immutable ref TData[nCols  1] pixels)
pure nothrow @safe /*@nogc*/ {
auto closestDistance = uint.max;
auto closestLabel = TLabel.max;
foreach (immutable ref s; trainingSet) {
immutable dist = pixels.distance(s[1 .. $]);
if (dist < closestDistance) {
closestDistance = dist;
closestLabel = s[labelIndex];
}
}
return closestLabel;
} Notes:  The classify function is now imperative, but this improves the performance just a little.  @nogc is still commented out because the current version of the LDC2 compiler doesn't support it yet.  The main performance difference of this third program comes the data layout. Now I have hardcoded (defined statically) the number of columns nCols. So now the data sets are essentially a ubyte[N][]. In D this is represented as a dense array, that minimize cache misses and reduce by one the number of levels of indirection. Thanks to the compiletime knowledge of the loop bounds inside the distance function, the LLVM backend of the LDC2 compiler performs some loop unrolling (in theory it can be performed even before, but in practice the the version 3.4.1 of the LLVM is not performing loop unrolling on dynamic bounds).
So the asm of the distance function of the "nearest3.d" program is longer because of a partial loop unwinding:
__D8nearest38distanceFNaNbNfKyG784hKyG784hZk:
pushl %ebx
pushl %edi
pushl %esi
xorl %ecx, %ecx
movl $7, %edx
movl 16(%esp), %esi
.align 16, 0x90
LBB1_1:
movzbl 7(%esi,%edx), %edi
movzbl 7(%eax,%edx), %ebx
subl %ebx, %edi
imull %edi, %edi
addl %ecx, %edi
movzbl 6(%esi,%edx), %ecx
movzbl 6(%eax,%edx), %ebx
subl %ebx, %ecx
imull %ecx, %ecx
addl %edi, %ecx
movzbl 5(%esi,%edx), %edi
movzbl 5(%eax,%edx), %ebx
subl %ebx, %edi
imull %edi, %edi
addl %ecx, %edi
movzbl 4(%esi,%edx), %ecx
movzbl 4(%eax,%edx), %ebx
subl %ebx, %ecx
imull %ecx, %ecx
addl %edi, %ecx
movzbl 3(%esi,%edx), %edi
movzbl 3(%eax,%edx), %ebx
subl %ebx, %edi
imull %edi, %edi
addl %ecx, %edi
movzbl 2(%esi,%edx), %ecx
movzbl 2(%eax,%edx), %ebx
subl %ebx, %ecx
imull %ecx, %ecx
addl %edi, %ecx
movzbl 1(%esi,%edx), %edi
movzbl 1(%eax,%edx), %ebx
subl %ebx, %edi
imull %edi, %edi
addl %ecx, %edi
movzbl (%esi,%edx), %ecx
movzbl (%eax,%edx), %ebx
subl %ebx, %ecx
imull %ecx, %ecx
addl %edi, %ecx
addl $8, %edx
cmpl $791, %edx
jne LBB1_1
movl %ecx, %eax
popl %esi
popl %edi
popl %ebx
ret $4
Thanks to the data layout changes and other smaller improvements, the "nearest3.d" function is almost three times faster than "nearest2.d".
To reduce the run time and better utilize one CPU core, we can use the SIMD registers of the CPU.
To perform a subtraction of the arrays I change the data layout again, this time I represent the pixels with short (in D signed 16 bit integers). So I use a short8 to perform several subtractions in parallel. A problem comes from the label, if I slice it away as in the "nearest3.d" program, the array pointers are not aligned to 16 bytes and this is not accepted by my CPU. To solve this problem I add a padding of 7 short in every line of the matrix after the label. This is performed by the "binary_generator2.d" program.
So the 4th version of the program contains (see in the zip archive for the full code of "nearest4.d"):
enum nCols = 785;
enum size_t labelIndex = 0;
alias TData = short;
alias TLabel = TData;
immutable(TData[nCols + 7])[] readData(in string fileName) {
return cast(typeof(return))std.file.read(fileName);
}
uint distance(immutable ref TData[nCols  1] p1,
immutable ref TData[nCols  1] p2) pure nothrow /*@nogc*/ {
alias TV = short8;
enum size_t Vlen = TV.init.array.length;
assert(p1.length % Vlen == 0);
immutable v1 = cast(immutable TV*)p1.ptr;
immutable v2 = cast(immutable TV*)p2.ptr;
TV totV = 0;
foreach (immutable i; 0 .. p1.length / Vlen) {
TV d = v1[i]  v2[i];
totV += d * d;
}
uint tot = 0;
foreach (immutable t; totV.array)
tot += t;
return tot;
}
TLabel classify(immutable TData[nCols + 7][] trainingSet,
immutable ref TData[nCols  1] pixels) pure nothrow /*@nogc*/ {
auto closestDistance = uint.max;
auto closestLabel = short.max;
foreach (immutable ref s; trainingSet) {
immutable dist = pixels.distance(s[8 .. $]);
if (dist < closestDistance) {
closestDistance = dist;
closestLabel = s[labelIndex];
}
}
return closestLabel;
} Notes:  The classify function is not changed much, it performs the slicing to throw away the first eight (short label + 7 padding short) items.  The distance function is a little more complex, but not too much. It uses the basic SIMD operations like subtraction, sum, product, plus the ".array" attribute that allows me to manage a short8 as a fixedsize value to sum all its contents.  The code of distance is designed to be adaptable to other sizes of SIMD registers (but it's not able to adapt automatically).  This "nearest4.d" program is only 1.6 times faster than "nearest4.d" probably because the SIMD code that manages shorts is not very clean.
The asm for the distance function of the "nearest4.d" program is rather long, despite being fast:
__D8nearest48distanceFNaNbKyG784sKyG784sZk:
pushl %edi
pushl %esi
pxor %xmm0, %xmm0
movl $208, %ecx
movl 12(%esp), %edx
.align 16, 0x90
LBB1_1:
movdqa 208(%edx,%ecx), %xmm1
movdqa 192(%edx,%ecx), %xmm2
psubw 208(%eax,%ecx), %xmm1
pmullw %xmm1, %xmm1
paddw %xmm0, %xmm1
psubw 192(%eax,%ecx), %xmm2
pmullw %xmm2, %xmm2
paddw %xmm1, %xmm2
movdqa 176(%edx,%ecx), %xmm0
psubw 176(%eax,%ecx), %xmm0
pmullw %xmm0, %xmm0
paddw %xmm2, %xmm0
movdqa 160(%edx,%ecx), %xmm1
psubw 160(%eax,%ecx), %xmm1
pmullw %xmm1, %xmm1
paddw %xmm0, %xmm1
movdqa 144(%edx,%ecx), %xmm0
psubw 144(%eax,%ecx), %xmm0
pmullw %xmm0, %xmm0
paddw %xmm1, %xmm0
movdqa 128(%edx,%ecx), %xmm1
psubw 128(%eax,%ecx), %xmm1
pmullw %xmm1, %xmm1
paddw %xmm0, %xmm1
movdqa 112(%edx,%ecx), %xmm0
psubw 112(%eax,%ecx), %xmm0
pmullw %xmm0, %xmm0
paddw %xmm1, %xmm0
movdqa 96(%edx,%ecx), %xmm1
psubw 96(%eax,%ecx), %xmm1
pmullw %xmm1, %xmm1
paddw %xmm0, %xmm1
movdqa 80(%edx,%ecx), %xmm0
psubw 80(%eax,%ecx), %xmm0
pmullw %xmm0, %xmm0
paddw %xmm1, %xmm0
movdqa 64(%edx,%ecx), %xmm1
psubw 64(%eax,%ecx), %xmm1
pmullw %xmm1, %xmm1
paddw %xmm0, %xmm1
movdqa 48(%edx,%ecx), %xmm0
psubw 48(%eax,%ecx), %xmm0
pmullw %xmm0, %xmm0
paddw %xmm1, %xmm0
movdqa 32(%edx,%ecx), %xmm1
psubw 32(%eax,%ecx), %xmm1
pmullw %xmm1, %xmm1
paddw %xmm0, %xmm1
movdqa 16(%edx,%ecx), %xmm2
psubw 16(%eax,%ecx), %xmm2
pmullw %xmm2, %xmm2
paddw %xmm1, %xmm2
movdqa (%edx,%ecx), %xmm0
psubw (%eax,%ecx), %xmm0
pmullw %xmm0, %xmm0
paddw %xmm2, %xmm0
addl $224, %ecx
cmpl $1776, %ecx
jne LBB1_1
pshufd $3, %xmm0, %xmm1
movd %xmm1, %eax
movdqa %xmm0, %xmm1
movhlps %xmm1, %xmm1
movd %xmm1, %ecx
pshufd $1, %xmm0, %xmm1
movd %xmm1, %edx
movd %xmm0, %esi
movswl %si, %edi
sarl $16, %esi
addl %edi, %esi
movswl %dx, %edi
addl %esi, %edi
sarl $16, %edx
addl %edi, %edx
movswl %cx, %esi
addl %edx, %esi
sarl $16, %ecx
addl %esi, %ecx
movswl %ax, %edx
addl %ecx, %edx
sarl $16, %eax
addl %edx, %eax
popl %esi
popl %edi
ret $4
The run time of the varions versions, in seconds, best of 3:
nearest1: 91 (dmd)
nearest1b: 19.7 (ldc2)
nearest2: 5.91 (ldc2)
nearest3: 2.05 (ldc2)
nearest4: 1.28 (ldc2)
I have compiled the programs with:
dmd wi d O release inline boundscheck=off nearest1.d
ldmd2 wi unrollallowpartial O release inline noboundscheck nearest1b.d
ldmd2 wi unrollallowpartial O release inline noboundscheck nearest2.d
ldmd2 wi unrollallowpartial O release inline noboundscheck nearest3.d
ldmd2 wi unrollallowpartial O release inline noboundscheck nearest4.d
strip nearest1b.exe nearest2.exe nearest3.exe nearest4.exe
I have used the compilers:
DMD32 D Compiler v2.066
And ldc2 V.0.13.0beta1, based on DMD v2.064 and LLVM 3.4.1, Default target: i686pcmingw32, Host CPU: core2.
On my 32 bit Windows the binaries generated are:
1.692 nearest1.d
221.212 nearest1.exe
1.757 nearest1b.d
1.079.822 nearest1b.ex
1.716 nearest2.d
1.070.606 nearest2.exe
2.059 nearest2b.d
1.071.118 nearest2b.exe
2.037 nearest3.d
1.353.230 nearest3.exe
2.365 nearest4.d
1.353.742 nearest4.exe
2.513 nearest5.d
In the zip archive I have also added a "nearest5.d" program that shows a recent improvement of the D type system (not yet available in ldc2), to support fixedsize array length inference for slices passed to templated functions.
With a runtime 1.28 seconds for the "nearest4.d" program there are still ways to reduce the runtime. I am using an old 2.3 GHz CPU, so if you use a modern Intel 4 GHz CPU like the Core i74790K and with faster memory, you can see a significant speedup. If you use a modern CPU you can also use YMM registers with short16 SIMD, that should offer some speedup (LDC2 is already able to target such YMM registers, you need to replace short8 with short16 in the nearest4.d program and change the padding). And then this program is very easy to parallelize for two or four cores, you can use the std.parallelism module for the computation of nCorrect with map+reduce in the main function, that should give some speedup.
You can find all the code and data here: http://www.fantascienza.net/leonardo/js/nearest.7z
(This 7z archive contains the data sets too, I hope this is OK. Otherwise I'll remove them from the 7z archive.)  comments: 5 comments or Leave a comment 
 In the "Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers" book by John MacCormick there is a simple simulation of a graph of Web pages, to implement a basic page ranking. This is modified from Figure 37 (from: http://press.princeton.edu/releases/m9528_illus.pdf ):
I have implemented the algorithm using the D language. The simulation is just a random walk on the graph, and at each step there is also a small probability (0.15) to jump to a random node. There are various ways to implement this walk, I have chosen a fast one:
// graph_walk1.d
import std.stdio, std.random, std.algorithm, std.math, std.typetuple;
alias NodeIndex = int;
enum nVisits = 20_000_000;
enum nUnroll = 8;
static assert(nVisits % nUnroll == 0);
enum restartProbability = 0.15;
enum nMaxOutArcs = 5;
enum nNodes = 16;
template Iota(int stop) {
static if (stop <= 0)
alias TypeTuple!() Iota;
else
alias TypeTuple!(Iota!(stop  1), stop  1) Iota;
}
struct Node {
immutable uint nOutArcs;
immutable uint[nMaxOutArcs] outbound;
uint hitCount = 0;
}
void main() {
auto rng = Xorshift(1);
Node[nNodes] graph = [{/* 0*/ 3, [13, 9, 8]},
{/* 1*/ 1, [2]},
{/* 2*/ 2, [0, 8]},
{/* 3*/ 1, [1]},
{/* 4*/ 2, [1, 2]},
{/* 5*/ 2, [2, 8]},
{/* 6*/ 3, [3, 4, 5]},
{/* 7*/ 1, [6]},
{/* 8*/ 1, [15]},
{/* 9*/ 5, [14, 12, 11, 10, 8]},
{/*10*/ 1, [15]},
{/*11*/ 1, [15]},
{/*12*/ 1, [15]},
{/*13*/ 1, [9]},
{/*14*/ 1, [15]},
{/*15*/ 1, [7]}];
uint position = uniform(0, graph.length, rng);
graph[position].hitCount++;
foreach (immutable _; 1 .. nVisits / nUnroll)
foreach (immutable __; Iota!nUnroll) {
if (graph[position].nOutArcs == 0  uniform(0.0, 1.0, rng) < restartProbability)
position = uniform(0, graph.length, rng);
else
position = graph[position].outbound[uniform(0, graph[position].nOutArcs, rng)];
graph[position].hitCount++;
}
foreach (immutable i, immutable node; graph)
writefln("%2d %2.0f", i, round(100.0 * node.hitCount / cast(real)nVisits));
}
The program performs twenty millions jumps on the graph. The graph is represented with an adjacency array. I have used a fixedsize 2D matrix, that contains some holes because it has five columns, but only the node 9 has five outbound arcs. Each Node must contain the real number of outbound arcs, plus a counter of the visits.
There is a more compact representation for this graph, a linear array that has no holes and contains both the indexes of the outbound arcs and the number of outbound arcs, plus a little extra array that contains the start indexes of the nodes and the counts. But this representation is more complex, and I prefer a simpler code.
The Iota!nUnroll is used to unroll the loop 8 times, to speed up the walk a little.
To improve the performance of the code I have created a second version of the D program. It uses a similar data structure, but the counts are kept in a separated array.
This program performs the random walks jumping between the nodes using goto, so the state is kept in the program counter. The finite state machine is implemented with code generated at compiletime from the Graph data structure:
// graph_walk2.d
import std.stdio, std.random, std.algorithm, std.math;
alias NodeIndex = int;
enum nVisits = 20_000_000;
enum restartProbability = 0.15;
enum nMaxOutArcs = 5;
enum nNodes = 16;
struct Node {
immutable uint nOutArcs;
immutable int[nMaxOutArcs] outbound;
}
string genSwitchCode(in int nMax) /*pure*/ {
string switchCode = "switch (position) {\n";
foreach (immutable i; 0 .. nMax)
switchCode ~= format(" case %d: goto NODE%d;\n", i, i);
switchCode ~= " default: assert(0);\n";
switchCode ~= "}\n";
return switchCode;
}
string genNodeCode(in int nMax, in string switchCode) /*pure*/ {
string nodeCode;
foreach (immutable i; 0 .. nMax)
nodeCode ~= format("
NODE%d:
hits[%d]++;
steps++;
if (steps >= nVisits) goto END;
if (uniform(0.0, 1.0, rng) < restartProbability)
position = uniform(0, graph.length, rng);
else
position = graph[%d].outbound[uniform(0, graph[%d].nOutArcs, rng)];
%s
", i, i, i, i, switchCode);
return nodeCode;
}
void main() {
auto rng = Xorshift(1);
static enum Node[nNodes] graph = [
{/* 0*/ 3, [13, 9, 8]},
{/* 1*/ 1, [2]},
{/* 2*/ 2, [0, 8]},
{/* 3*/ 1, [1]},
{/* 4*/ 2, [1, 2]},
{/* 5*/ 2, [2, 8]},
{/* 6*/ 3, [3, 4, 5]},
{/* 7*/ 1, [6]},
{/* 8*/ 1, [15]},
{/* 9*/ 5, [14, 12, 11, 10, 8]},
{/*10*/ 1, [15]},
{/*11*/ 1, [15]},
{/*12*/ 1, [15]},
{/*13*/ 1, [9]},
{/*14*/ 1, [15]},
{/*15*/ 1, [7]}];
uint[graph.length] hits;
uint steps = 0;
uint position = uniform(0, graph.length, rng);
hits[position]++;
steps++;
enum switchCode = genSwitchCode(graph.length);
enum nodeCode = genNodeCode(graph.length, switchCode);
mixin(switchCode);
mixin(nodeCode);
END:
foreach (immutable i, immutable h; hits)
writefln("%2d %2.0f", i, round(100.0 * h / cast(real)nVisits));
} genSwitchCode() generates a simple jump table like this:
switch (position) {
case 0: goto NODE0;
case 1: goto NODE1;
case 2: goto NODE2;
case 3: goto NODE3;
case 4: goto NODE4;
case 5: goto NODE5;
case 6: goto NODE6;
case 7: goto NODE7;
case 8: goto NODE8;
case 9: goto NODE9;
case 10: goto NODE10;
case 11: goto NODE11;
case 12: goto NODE12;
case 13: goto NODE13;
case 14: goto NODE14;
case 15: goto NODE15;
default: assert(0);
}
While genNodeCode() generates the whole finite state machine, each node looks like this:
NODE0:
hits[0]++;
steps++;
if (steps >= nVisits) goto END;
if (uniform(0.0, 1.0, rng) < restartProbability)
position = uniform(0, graph.length, rng);
else
position = graph[0].outbound[uniform(0, graph[0].nOutArcs, rng)];
switch (position) {
case 0: goto NODE0;
case 1: goto NODE1;
case 2: goto NODE2;
case 3: goto NODE3;
case 4: goto NODE4;
case 5: goto NODE5;
case 6: goto NODE6;
case 7: goto NODE7;
case 8: goto NODE8;
case 9: goto NODE9;
case 10: goto NODE10;
case 11: goto NODE11;
case 12: goto NODE12;
case 13: goto NODE13;
case 14: goto NODE14;
case 15: goto NODE15;
default: assert(0);
}
The LLVM backend of the LDC2 compiler seems able to recognize this kind of jumps tables, and replaces them with a goto (like an jmp *%eax instruction), so this machine is quick. Most of the time is spent generating random numbers.
To see if the computed gotos of the GNU C of GCC (http://gcc.gnu.org/onlinedocs/gcc3.2/gcc/LabelsasValues.html ) lead to a better performance, I have translated the D code to GNUC99:
// graph_walk3.c
#include "stdio.h"
#include "stdlib.h"
#include "time.h"
#define nVisits 20000000
#define restartProbability 0.15
#define nMaxOutArcs 5
#define nNodes 16
inline static double uniform01() {
return rand() / (double)RAND_MAX;
}
inline static int uniform(const int max) {
int k = (int)(max * (rand() / (double)RAND_MAX));
return (k == max) ? k  1 : k;
}
typedef struct {
unsigned int nOutArcs;
void *outbound[nMaxOutArcs];
} Node;
#define NODE_BLOCK(N) \
do { \
NODE##N: \
hits[N]++; \
steps++; \
if (__builtin_expect(steps >= nVisits, 0)) goto END; \
if (__builtin_expect(uniform01() < restartProbability, 0)) \
goto *NodeAddresses[uniform(nNodes)]; \
else \
goto *graph[N].outbound[uniform(graph[N].nOutArcs)]; \
} while (0)
int main() {
srand(time(NULL));
static const void *NodeAddresses[] = {
&&NODE0, &&NODE1, &&NODE2, &&NODE3,
&&NODE4, &&NODE5, &&NODE6, &&NODE7,
&&NODE8, &&NODE9, &&NODE10, &&NODE11,
&&NODE12, &&NODE13, &&NODE14, &&NODE15};
static const Node graph[nNodes] = {
/* 0*/ {3, {&&NODE13, &&NODE9, &&NODE8}},
/* 1*/ {1, {&&NODE2}},
/* 2*/ {2, {&&NODE0, &&NODE8}},
/* 3*/ {1, {&&NODE1}},
/* 4*/ {2, {&&NODE1, &&NODE2}},
/* 5*/ {2, {&&NODE2, &&NODE8}},
/* 6*/ {3, {&&NODE3, &&NODE4, &&NODE5}},
/* 7*/ {1, {&&NODE6}},
/* 8*/ {1, {&&NODE15}},
/* 9*/ {5, {&&NODE14, &&NODE12, &&NODE11, &&NODE10, &&NODE8}},
/*10*/ {1, {&&NODE15}},
/*11*/ {1, {&&NODE15}},
/*12*/ {1, {&&NODE15}},
/*13*/ {1, {&&NODE9}},
/*14*/ {1, {&&NODE15}},
/*15*/ {1, {&&NODE7}}};
unsigned int hits[nNodes] = {0};
unsigned int steps = 0;
goto *NodeAddresses[uniform(nNodes)];
NODE_BLOCK(0);
NODE_BLOCK(1);
NODE_BLOCK(2);
NODE_BLOCK(3);
NODE_BLOCK(4);
NODE_BLOCK(5);
NODE_BLOCK(6);
NODE_BLOCK(7);
NODE_BLOCK(8);
NODE_BLOCK(9);
NODE_BLOCK(10);
NODE_BLOCK(11);
NODE_BLOCK(12);
NODE_BLOCK(13);
NODE_BLOCK(14);
NODE_BLOCK(15);
END:
for (int i = 0; i < nNodes; i++)
printf("%2d %2.1f\n", i, 100.0 * hits[i] / (double)nVisits);
return 0;
} I have used C macros to keep the C code short about as the D code.
The "do { } while (0)" is a useful trick to require a semicolon to the macro calls.
The "NODE##N:" syntax allows to generate the numered jump labels from the NODE_BLOCK() macro argument N.
Now there is no need of jump tables, there are just gotos to the addresses. I think this is one of the fastest ways to implement a hardcoded finite state machine.
GCC 4.8.0 generates code like this for each node:
L14:
addl $1, %ebx
addl $1, 20(%esp)
cmpl $19999999, %ebx
ja L15
call _rand
movsd LC1, %xmm1
cvtsi2sd %eax, %xmm0
mulsd LC0, %xmm0
comisd %xmm0, %xmm1
ja L125
call _rand
cvtsi2sd %eax, %xmm0
xorl %eax, %eax
mulsd LC0, %xmm0
cvttsd2si %xmm0, %edx
cmpl $1, %edx
cmovne %edx, %eax
movl _graph.4128+28(,%eax,4), %eax
jmp *%eax
While the Intel compiler ICC 13.0.1 splits the code in a fast zone and a slow zone, I have not timed this but I think this could be a little more efficient:
incl %ebx
incl 40(%rsp)
cmpl $20000000, %ebx
jae ..B1.108 # Prob 4%
call uniform01()
comisd .L_2il0floatpacket.20(%rip), %xmm0
jbe ..B1.139 # Prob 18%
movl $1, %edi
call uniform(int)
movslq %eax, %rax
movq 488+graph.351.0.3(,%rax,8), %rax
jmp *%rax
...
..B1.139:
movl $16, %edi
call uniform(int)
movslq %eax, %rax
movq NodeAddresses.351.0.3(,%rax,8), %rax
jmp *%rax
Runtime per nVisits = 20_000_000:
graph_walk1: 1.35 (nUnroll = 8)
graph_walk2: 1.14
graph_walk3: 1.08
Compilation arguments used:
GCC:
Wall Wextra std=c99 mfpmath=sse march=native Ofast flto s graph_walk3.c o graph_walk3
LDC2 (ldmd2):
O release inline noboundscheck
Compilers:
GCC V.4.8.0
LDC2 V.0.12.1 based on DMD v2.063.2 and LLVM 3.3.1. The code shown in this post: http://www.fantascienza.net/leonardo/js/graph_walk.zip  comments: Leave a comment 
 A simple programming problem, here solved in Matlab: http://blogs.mathworks.com/pick/2008/03/31/puzzlercoingame/
There is a table composed of four squares in a circle. When you are on the last square and you move one step forward, you go back to the first square.
Starting at the first square, flip two coins at each move. If the coins are:  Head Head: move one square forward;  Tail Tail: move two squares forward;  Head Tail or Tail Head: go to the first square.
With such rules, what is the probability of being on each of the four squares after twenty moves?
A first simple way to find the probabilities numerically is by simulation, in D language:
import std.stdio, std.random, std.algorithm, std.traits,
std.bigint, std.exception;
enum uint nSquares = 4;
uint move(uint pos, ref Xorshift rng) {
final switch (rng.front % nSquares) {
case 0, 1: // Move one square forward.
pos = (pos + 1) % nSquares;
break;
case 2: // Move two squares forward.
pos = (pos + 2) % nSquares;
break;
case 3: // Go to the first square.
pos = 0;
break;
}
rng.popFront;
return pos;
}
void main() {
enum nTurns = 20,
nTries = 100_000_000;
auto rng = Xorshift(unpredictableSeed);
enforce(isUnsigned!(typeof(rng.front)) &&
(rng.front.max.BigInt + 1) % nSquares == 0,
"move() needs to use std.random.uniform().");
uint[nSquares] finalPosCounts;
foreach (immutable _; 0 .. nTries) {
uint pos = 0; // Starting position.
foreach (immutable __; 0 .. nTurns)
pos = move(pos, rng);
finalPosCounts[pos]++;
}
finalPosCounts[].map!(x => x / cast(real)nTries).writeln;
} I have used rng.front % nSquares instead of std.random.uniform() because compiled with the LDC2 compiler it's almost twice faster.
If I compile that first program with:
ldmd2 O release inline noboundscheck
LDC2 produces the following good 32 bit asm from the two loops of the main():
LBB1_1:
xorl %esi, %esi
xorl %ecx, %ecx
.align 16, 0x90
LBB1_2:
movl %edi, %edx
movl %esi, %edi
movl %edx, %ebx
andl $3, %ebx
xorl %esi, %esi
cmpl $3, %ebx
je LBB1_7
cmpl $2, %ebx
je LBB1_13
cmpl $1, %ebx
ja LBB1_14
incl %edi
jmp LBB1_6
.align 16, 0x90
LBB1_13:
addl $2, %edi
LBB1_6:
andl $3, %edi
movl %edi, %esi
LBB1_7:
movl 48(%esp), %ebx
movl 52(%esp), %edi
movl %edi, 48(%esp)
movl 56(%esp), %edi
movl %edi, 52(%esp)
movl %ebx, %edi
shll $11, %edi
xorl %ebx, %edi
movl %edx, 56(%esp)
movl %edx, %ebx
shrl $19, %ebx
xorl %edx, %ebx
xorl %edi, %ebx
shrl $8, %edi
xorl %ebx, %edi
movl %edi, 60(%esp)
incl %ecx
cmpl $20, %ecx
jl LBB1_2
incl 32(%esp,%esi,4)
incl %eax
cmpl $100000000, %eax
jl LBB1_1
A different algorithm, as explained in the linked article is to power the transition probability matrix (here the matrix multiplication is implemented in a slow high level way):
import std.stdio, std.range, std.array, std.numeric, std.algorithm;
T[][] mul(T)(in T[][] A, in T[][] B) {
const Bt = B[0].length.iota.map!(i=> B.transversal(i).array).array;
return A.map!(a => Bt.map!(b => a.dotProduct(b)).array).array;
}
void main() {
real[][] M = [[1, 2, 1, 0], [1, 0, 2, 1], [2, 0, 0, 2], [3, 1, 0, 0]];
foreach (row; M)
row[] *= 0.25;
auto P = M.mul(M);
foreach (immutable _; 2 .. 20)
P = P.mul(M);
P[0].writeln;
} One run of the simulation (nTries = 100_000_000, runtime less than 21 seconds) gives: [0.386108, 0.234545, 0.213841, 0.165506]
The second program gives a similar output: [0.386203, 0.234484, 0.213797, 0.165516]
Running the first program with (ulong nTries = 10_000_000_000) in less than an hour gives: [0.386202, 0.234477, 0.213796, 0.165525]
The code: http://www.fantascienza.net/leonardo/js/index.html#table_probs  comments: Leave a comment 
 When I read the "The Little Schemer" I have done countless small coding exercises using recursion: https://github.com/pkrumins/thelittleschemer
But later when I've started learning Haskell I've seen than explicit recursion is uncommon in good Haskell code. Most times in Haskell you use map, folds, and other higher order functions instead of recursion. (Sometimes in Haskell you use explicit recursion as an optimization).
C and C++ code contains lot of 'for' and 'while' loops, that are sometimes hard to read and understand.
Recently they have released videos and slides of the talks of the GoingNative 2013 conference. Among the interesting talks the one I have appreciated more is "C++ Seasoning" by Sean Parent (and from the comments I see that others agree with me, and even ask the speaker to write a book on the topic): http://channel9.msdn.com/Events/GoingNative/2013/CppSeasoning
I suggest to start reading the updated slides: https://github.com/seanparent/seanparent.github.com/wiki/presentations/20130911cppseasoning/cppseasoning.pdf
The talk gives some suggestions for C++ code:  No Raw Loops  No Raw Syntonization Primitives  No Raw Pointers
The first part suggests to reduce the number of loops in C++ code and replace them with algorithms (often with STL algorithms), a little like it's done in Haskell.
C++ is an abstraction over C, and the heavy usage of STLlike algorithms is kind of a new language that abstracts over raw loopheavy C++ code. Writing code like that in C++ takes time and some brain power, but the resulting code is good, short, simpler to keep bugfree, and often efficient too.
In D language I suggest to use std.algorithm for similar purposes. I also suggest to make const/immutable all variables, unless they need to mutate, or unless something in D or in Phobos makes that impossible (like Ranges, that often can't be const).
Walter Bright calls "component programming" a variant of that coding style that relies on ranges a lot. I have shown several examples of such linear flow programming in past blog posts (that today is becoming very common, LINQ in C#, in Java8, with Python itertools, in F#, and so on).
Here is another simple example, the Euler Problem number 50 (http://projecteuler.net/problem=50 ), the problem:
The prime 41, can be written as the sum of six consecutive primes: 41 = 2 + 3 + 5 + 7 + 11 + 13 This is the longest sum of consecutive primes that adds to a prime below onehundred. The longest sum of consecutive primes below onethousand that adds to a prime, contains 21 terms, and is equal to 953. Which prime, below onemillion, can be written as the sum of the most consecutive primes? This is a simple imperative solution in D (modified and ported from Python code found here: http://blog.dreamshire.com/2009/04/12/projecteulerproblem50solution/ ):
import std.stdio, std.algorithm, std.range, std.functional;
...
int euler50(in int top) {
/*const*/ auto primes = top.sieve.assumeSorted;
int[] prime_sum = [0];
int tot = 0, count = 0;
while (tot < top) {
tot += primes[count];
prime_sum ~= tot;
count++;
}
int terms = 1;
int max_prime = 1;
foreach (immutable i; 0 .. count)
foreach (immutable j; i + terms .. count) {
immutable n = prime_sum[j]  prime_sum[i];
if (j  i > terms && primes.contains(n)) {
terms = j  i;
max_prime = n;
}
}
return max_prime;
}
void main() {
1_000_000.euler50.writeln; // 997651
} The idea is to compute a cumulative of the prime numbers, and then take a subset of all the N^2 subintervals, removing the sum of the starting end of the interval from the sum of its end, to compute in O(1) the sum of its items. The contains() function is a linear search, so this code is kind of O(n^3), but for the given problem it's fast enough, finding the correct result of 997651 in less than half second.
A very simple sieve implementation used in that euler50() function could be:
uint[] sieve(in uint limit) {
if (limit < 2)
return [];
auto composite = new bool[limit];
foreach (immutable uint n; 2 .. cast(uint)(limit ^^ 0.5) + 1)
if (!composite[n])
for (uint k = n * n; k < limit; k += n)
composite[k] = true;
return iota(2, limit).filter!(i => !composite[i]).array;
} If you take a look at the euler50() functions you see general patterns. The cumulative of an array could be written as general function (even better is to write this as a lazy range that works on any input range):
T[] accumulate(T)(in T[] items) pure nothrow {
auto result = new T[items.length];
if (!items.empty) {
result[0] = items[0];
foreach (immutable i; 1 .. items.length)
result[i] = result[i  1] + items[i];
}
return result;
} The iteration on all subsets could be done with a handy "Pairwise" range (imports not included):
struct Pairwise(Range) {
alias R = Unqual!Range;
alias Pair = Tuple!(ForeachType!R, "a", ForeachType!R, "b");
R _input;
size_t i, j;
this(Range r_) {
this._input = r_;
j = 1;
}
@property bool empty() {
return j >= _input.length;
}
@property Pair front() {
return Pair(_input[i], _input[j]);
}
void popFront() {
if (j >= _input.length  1) {
i++;
j = i + 1;
} else {
j++;
}
}
}
Pairwise!Range pairwise(Range)(Range r)
if (isRandomAccessRange!Range && hasLength!Range && !isNarrowString!Range) {
return typeof(return)(r);
} Then what's left is a search among all the subsequences, throwing away all the ones that are too much long and the ones that don't sum to a prime number. Also the precomputed array of primes is sorted, so it's better to use a binary search. Now we can organize all the computation in few simple chains:
int euler50b(in int top) {
auto primes = top.sieve.assumeSorted;
auto primeSum = primes.release.accumulate;
return primeSum
.until!(p => p > top)
.walkLength
.iota
.pairwise
.map!(ij => tuple(ij[1]  ij[0], primeSum[ij[1]]  primeSum[ij[0]]))
.filter!(dn => primes.contains(dn[1]))
.reduce!max[1];
} This code works and runs in something like 0.30.4 seconds, so it's still fast enough.
This euler50b() function has several problems:  A range like pairwise() is not yet present in std.range;  Something like accumulate() (with a given reducing function) is missing in std.algoritm;  In D there is no syntax to destructure (unpack) tuples, so you have to use things like: ij => tuple(ij[1]  ij[0] ...  std.algorithm.max is not yet accepting a key function.  Currently "primes" can't be immutable or even const because you can't call the "release" method on a const SortedRange;  Currently "primeSum" can't be immutable otherwise reduce() doesn't work.  Currently "iota" is not pure nor nothrow.
How the main function could look, once D and Phobos become more refined:
int euler50c(in int top) pure nothrow {
immutable primes = top.sieve.assumeSorted;
immutable primeSum = primes.accumulate!q{a + b};
return primeSum
.until!(p => p > top)
.walkLength
.iota
.pairwise
.map!({i, j} => tuple(j  i, primeSum[j]  primeSum[i]))
.filter!({d, n} => primes.contains(n))
.reduce!(max!({d, n} => d))[1];
} The full code of the two versions: http://www.fantascienza.net/leonardo/js/euler50.zip  comments: 4 comments or Leave a comment 
