Some links

Lately I'm writing less posts, but I plan to write some blog posts soon enough.

Bisecting Floating Point Numbers:
http://squishythinking.com/julia/2014/02/22/bisecting-floats/

A very nice interactive book with little coding examples for in Rust language:
http://rustbyexample.com/

The little programming task FizzBuzz in Rust:
http://chrismorgan.info/blog/rust-fizzbuzz.html

Nice slides, "Rust: A Type System You Didn't Know You Wanted":
http://pnkfelix.github.io/curry-on2015.html#/

Safe Native Code, by Joe Duffy, about Midori:
http://joeduffyblog.com/2015/12/19/safe-native-code/

A nice interactive book on a significant topic, "Programming with Refinement Types. An Introduction to LiquidHaskell":
http://ucsd-progsys.github.io/lh-workshop/

Segment Trees, a data structure:
http://algosaur.us/segment-tree/

A nice long article with good images about the evolution of vision in nature:
http://ngm.nationalgeographic.com/2016/02/evolution-of-eyes-text

"Multidimensional algorithms and iteration", in Julia language, the LLVM is optimizing a lot:
http://julialang.org/blog/2016/02/iteration

This explains why using fuzzers is important, "USENIX Enigma 2016 - Sanitize, Fuzz, and Harden Your C++ Code":
https://www.youtube.com/watch?v=FP8zFhB_cOo

"Cap'n Proto v0.2: Compiler rewritten Haskell -> C++":
https://capnproto.org/news/2013-08-12-capnproto-0.2-no-more-haskell.html

"Why there is no Hitchhiker’s Guide to Mathematics for Programmers" (about the basic four methods of proof in mathematics):
http://jeremykun.com/2013/02/08/why-there-is-no-hitchhikers-guide-to-mathematics-for-programmers/

"Why and when you should use SoA" (Structure of Arrays, a different way to lay data in memory):
https://maikklein.github.io/post/soa-d/

"Exploring 4D Hyperdonuts", very nice animations:
http://imgur.com/a/asnRZ
http://imgur.com/a/ZSTVs

A honest good list of Haskell problems, I wish to find similar list for other languages: https://www.reddit.com/r/haskell/comments/4f47ou/why_does_haskell_in_your_opinion_suck/
programming

D Brainfuck benchmark

I've found some interesting benchmarks in various languages:
https://github.com/kostya/benchmarks

I have seen some ways to improve the performance Rust version of the Base64 benchmark, the performance of the Rust and Python versions of the Brainfuck benchmarks, but in this post I'll focus on the D version of the Brainfuck benchmark (currently my site is frozen, so for the moment I'll just inline all the code in this post. Later I'll offer links to zips on my site).

The benchmark asks to run this long branfuck program that generates a Mandelbrot Set ASCII art, "mandel.b":
      A mandelbrot set fractal viewer in brainf*** written by Erik Bosman
+++++++++++++[->++>>>+++++>++>+<<<<<<]>>>>>++++++>--->>>>>>>>>>+++++++++++++++[[
>>>>>>>>>]+[<<<<<<<<<]>>>>>>>>>-]+[>>>>>>>>[-]>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>[-]+
<<<<<<<+++++[-[->>>>>>>>>+<<<<<<<<<]>>>>>>>>>]>>>>>>>+>>>>>>>>>>>>>>>>>>>>>>>>>>
>+<<<<<<<<<<<<<<<<<[<<<<<<<<<]>>>[-]+[>>>>>>[>>>>>>>[-]>>]<<<<<<<<<[<<<<<<<<<]>>
>>>>>[-]+<<<<<<++++[-[->>>>>>>>>+<<<<<<<<<]>>>>>>>>>]>>>>>>+<<<<<<+++++++[-[->>>
>>>>>>+<<<<<<<<<]>>>>>>>>>]>>>>>>+<<<<<<<<<<<<<<<<[<<<<<<<<<]>>>[[-]>>>>>>[>>>>>
>>[-<<<<<<+>>>>>>]<<<<<<[->>>>>>+<<+<<<+<]>>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>
[>>>>>>>>[-<<<<<<<+>>>>>>>]<<<<<<<[->>>>>>>+<<+<<<+<<]>>>>>>>>]<<<<<<<<<[<<<<<<<
<<]>>>>>>>[-<<<<<<<+>>>>>>>]<<<<<<<[->>>>>>>+<<+<<<<<]>>>>>>>>>+++++++++++++++[[
>>>>>>>>>]+>[-]>[-]>[-]>[-]>[-]>[-]>[-]>[-]>[-]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>-]+[
>+>>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>[>->>>>[-<<<<+>>>>]<<<<[->>>>+<<<<<[->>[
-<<+>>]<<[->>+>>+<<<<]+>>>>>>>>>]<<<<<<<<[<<<<<<<<<]]>>>>>>>>>[>>>>>>>>>]<<<<<<<
<<[>[->>>>>>>>>+<<<<<<<<<]<<<<<<<<<<]>[->>>>>>>>>+<<<<<<<<<]<+>>>>>>>>]<<<<<<<<<
[>[-]<->>>>[-<<<<+>[<->-<<<<<<+>>>>>>]<[->+<]>>>>]<<<[->>>+<<<]<+<<<<<<<<<]>>>>>
>>>>[>+>>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>[>->>>>>[-<<<<<+>>>>>]<<<<<[->>>>>+
<<<<<<[->>>[-<<<+>>>]<<<[->>>+>+<<<<]+>>>>>>>>>]<<<<<<<<[<<<<<<<<<]]>>>>>>>>>[>>
>>>>>>>]<<<<<<<<<[>>[->>>>>>>>>+<<<<<<<<<]<<<<<<<<<<<]>>[->>>>>>>>>+<<<<<<<<<]<<
+>>>>>>>>]<<<<<<<<<[>[-]<->>>>[-<<<<+>[<->-<<<<<<+>>>>>>]<[->+<]>>>>]<<<[->>>+<<
<]<+<<<<<<<<<]>>>>>>>>>[>>>>[-<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<+>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>]>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>+++++++++++++++[[>>>>
>>>>>]<<<<<<<<<-<<<<<<<<<[<<<<<<<<<]>>>>>>>>>-]+>>>>>>>>>>>>>>>>>>>>>+<<<[<<<<<<
<<<]>>>>>>>>>[>>>[-<<<->>>]+<<<[->>>->[-<<<<+>>>>]<<<<[->>>>+<<<<<<<<<<<<<[<<<<<
<<<<]>>>>[-]+>>>>>[>>>>>>>>>]>+<]]+>>>>[-<<<<->>>>]+<<<<[->>>>-<[-<<<+>>>]<<<[->
>>+<<<<<<<<<<<<[<<<<<<<<<]>>>[-]+>>>>>>[>>>>>>>>>]>[-]+<]]+>[-<[>>>>>>>>>]<<<<<<
<<]>>>>>>>>]<<<<<<<<<[<<<<<<<<<]<<<<<<<[->+>>>-<<<<]>>>>>>>>>+++++++++++++++++++
+++++++>>[-<<<<+>>>>]<<<<[->>>>+<<[-]<<]>>[<<<<<<<+<[-<+>>>>+<<[-]]>[-<<[->+>>>-
<<<<]>>>]>>>>>>>>>>>>>[>>[-]>[-]>[-]>>>>>]<<<<<<<<<[<<<<<<<<<]>>>[-]>>>>>>[>>>>>
[-<<<<+>>>>]<<<<[->>>>+<<<+<]>>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>[>>[-<<<<<<<<
<+>>>>>>>>>]>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>+++++++++++++++[[>>>>>>>>>]+>[-
]>[-]>[-]>[-]>[-]>[-]>[-]>[-]>[-]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>-]+[>+>>>>>>>>]<<<
<<<<<<[<<<<<<<<<]>>>>>>>>>[>->>>>>[-<<<<<+>>>>>]<<<<<[->>>>>+<<<<<<[->>[-<<+>>]<
<[->>+>+<<<]+>>>>>>>>>]<<<<<<<<[<<<<<<<<<]]>>>>>>>>>[>>>>>>>>>]<<<<<<<<<[>[->>>>
>>>>>+<<<<<<<<<]<<<<<<<<<<]>[->>>>>>>>>+<<<<<<<<<]<+>>>>>>>>]<<<<<<<<<[>[-]<->>>
[-<<<+>[<->-<<<<<<<+>>>>>>>]<[->+<]>>>]<<[->>+<<]<+<<<<<<<<<]>>>>>>>>>[>>>>>>[-<
<<<<+>>>>>]<<<<<[->>>>>+<<<<+<]>>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>[>+>>>>>>>>
]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>[>->>>>>[-<<<<<+>>>>>]<<<<<[->>>>>+<<<<<<[->>[-<<+
>>]<<[->>+>>+<<<<]+>>>>>>>>>]<<<<<<<<[<<<<<<<<<]]>>>>>>>>>[>>>>>>>>>]<<<<<<<<<[>
[->>>>>>>>>+<<<<<<<<<]<<<<<<<<<<]>[->>>>>>>>>+<<<<<<<<<]<+>>>>>>>>]<<<<<<<<<[>[-
]<->>>>[-<<<<+>[<->-<<<<<<+>>>>>>]<[->+<]>>>>]<<<[->>>+<<<]<+<<<<<<<<<]>>>>>>>>>
[>>>>[-<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<+>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
]>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>[>>>[-<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<+>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>]>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>++++++++
+++++++[[>>>>>>>>>]<<<<<<<<<-<<<<<<<<<[<<<<<<<<<]>>>>>>>>>-]+[>>>>>>>>[-<<<<<<<+
>>>>>>>]<<<<<<<[->>>>>>>+<<<<<<+<]>>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>[>>>>>>[
-]>>>]<<<<<<<<<[<<<<<<<<<]>>>>+>[-<-<<<<+>>>>>]>[-<<<<<<[->>>>>+<++<<<<]>>>>>[-<
<<<<+>>>>>]<->+>]<[->+<]<<<<<[->>>>>+<<<<<]>>>>>>[-]<<<<<<+>>>>[-<<<<->>>>]+<<<<
[->>>>->>>>>[>>[-<<->>]+<<[->>->[-<<<+>>>]<<<[->>>+<<<<<<<<<<<<[<<<<<<<<<]>>>[-]
+>>>>>>[>>>>>>>>>]>+<]]+>>>[-<<<->>>]+<<<[->>>-<[-<<+>>]<<[->>+<<<<<<<<<<<[<<<<<
<<<<]>>>>[-]+>>>>>[>>>>>>>>>]>[-]+<]]+>[-<[>>>>>>>>>]<<<<<<<<]>>>>>>>>]<<<<<<<<<
[<<<<<<<<<]>>>>[-<<<<+>>>>]<<<<[->>>>+>>>>>[>+>>[-<<->>]<<[->>+<<]>>>>>>>>]<<<<<
<<<+<[>[->>>>>+<<<<[->>>>-<<<<<<<<<<<<<<+>>>>>>>>>>>[->>>+<<<]<]>[->>>-<<<<<<<<<
<<<<<+>>>>>>>>>>>]<<]>[->>>>+<<<[->>>-<<<<<<<<<<<<<<+>>>>>>>>>>>]<]>[->>>+<<<]<<
<<<<<<<<<<]>>>>[-]<<<<]>>>[-<<<+>>>]<<<[->>>+>>>>>>[>+>[-<->]<[->+<]>>>>>>>>]<<<
<<<<<+<[>[->>>>>+<<<[->>>-<<<<<<<<<<<<<<+>>>>>>>>>>[->>>>+<<<<]>]<[->>>>-<<<<<<<
<<<<<<<+>>>>>>>>>>]<]>>[->>>+<<<<[->>>>-<<<<<<<<<<<<<<+>>>>>>>>>>]>]<[->>>>+<<<<
]<<<<<<<<<<<]>>>>>>+<<<<<<]]>>>>[-<<<<+>>>>]<<<<[->>>>+>>>>>[>>>>>>>>>]<<<<<<<<<
[>[->>>>>+<<<<[->>>>-<<<<<<<<<<<<<<+>>>>>>>>>>>[->>>+<<<]<]>[->>>-<<<<<<<<<<<<<<
+>>>>>>>>>>>]<<]>[->>>>+<<<[->>>-<<<<<<<<<<<<<<+>>>>>>>>>>>]<]>[->>>+<<<]<<<<<<<
<<<<<]]>[-]>>[-]>[-]>>>>>[>>[-]>[-]>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>[>>>>>[-<
<<<+>>>>]<<<<[->>>>+<<<+<]>>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>+++++++++++++++[
[>>>>>>>>>]+>[-]>[-]>[-]>[-]>[-]>[-]>[-]>[-]>[-]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>-]+
[>+>>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>[>->>>>[-<<<<+>>>>]<<<<[->>>>+<<<<<[->>
[-<<+>>]<<[->>+>+<<<]+>>>>>>>>>]<<<<<<<<[<<<<<<<<<]]>>>>>>>>>[>>>>>>>>>]<<<<<<<<
<[>[->>>>>>>>>+<<<<<<<<<]<<<<<<<<<<]>[->>>>>>>>>+<<<<<<<<<]<+>>>>>>>>]<<<<<<<<<[
>[-]<->>>[-<<<+>[<->-<<<<<<<+>>>>>>>]<[->+<]>>>]<<[->>+<<]<+<<<<<<<<<]>>>>>>>>>[
>>>[-<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<+>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>]>
>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>[-]>>>>+++++++++++++++[[>>>>>>>>>]<<<<<<<<<-<<<<<
<<<<[<<<<<<<<<]>>>>>>>>>-]+[>>>[-<<<->>>]+<<<[->>>->[-<<<<+>>>>]<<<<[->>>>+<<<<<
<<<<<<<<[<<<<<<<<<]>>>>[-]+>>>>>[>>>>>>>>>]>+<]]+>>>>[-<<<<->>>>]+<<<<[->>>>-<[-
<<<+>>>]<<<[->>>+<<<<<<<<<<<<[<<<<<<<<<]>>>[-]+>>>>>>[>>>>>>>>>]>[-]+<]]+>[-<[>>
>>>>>>>]<<<<<<<<]>>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>[-<<<+>>>]<<<[->>>+>>>>>>[>+>>>
[-<<<->>>]<<<[->>>+<<<]>>>>>>>>]<<<<<<<<+<[>[->+>[-<-<<<<<<<<<<+>>>>>>>>>>>>[-<<
+>>]<]>[-<<-<<<<<<<<<<+>>>>>>>>>>>>]<<<]>>[-<+>>[-<<-<<<<<<<<<<+>>>>>>>>>>>>]<]>
[-<<+>>]<<<<<<<<<<<<<]]>>>>[-<<<<+>>>>]<<<<[->>>>+>>>>>[>+>>[-<<->>]<<[->>+<<]>>
>>>>>>]<<<<<<<<+<[>[->+>>[-<<-<<<<<<<<<<+>>>>>>>>>>>[-<+>]>]<[-<-<<<<<<<<<<+>>>>
>>>>>>>]<<]>>>[-<<+>[-<-<<<<<<<<<<+>>>>>>>>>>>]>]<[-<+>]<<<<<<<<<<<<]>>>>>+<<<<<
]>>>>>>>>>[>>>[-]>[-]>[-]>>>>]<<<<<<<<<[<<<<<<<<<]>>>[-]>[-]>>>>>[>>>>>>>[-<<<<<
<+>>>>>>]<<<<<<[->>>>>>+<<<<+<<]>>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>+>[-<-<<<<+>>>>
>]>>[-<<<<<<<[->>>>>+<++<<<<]>>>>>[-<<<<<+>>>>>]<->+>>]<<[->>+<<]<<<<<[->>>>>+<<
<<<]+>>>>[-<<<<->>>>]+<<<<[->>>>->>>>>[>>>[-<<<->>>]+<<<[->>>-<[-<<+>>]<<[->>+<<
<<<<<<<<<[<<<<<<<<<]>>>>[-]+>>>>>[>>>>>>>>>]>+<]]+>>[-<<->>]+<<[->>->[-<<<+>>>]<
<<[->>>+<<<<<<<<<<<<[<<<<<<<<<]>>>[-]+>>>>>>[>>>>>>>>>]>[-]+<]]+>[-<[>>>>>>>>>]<
<<<<<<<]>>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>[-<<<+>>>]<<<[->>>+>>>>>>[>+>[-<->]<[->+
<]>>>>>>>>]<<<<<<<<+<[>[->>>>+<<[->>-<<<<<<<<<<<<<+>>>>>>>>>>[->>>+<<<]>]<[->>>-
<<<<<<<<<<<<<+>>>>>>>>>>]<]>>[->>+<<<[->>>-<<<<<<<<<<<<<+>>>>>>>>>>]>]<[->>>+<<<
]<<<<<<<<<<<]>>>>>[-]>>[-<<<<<<<+>>>>>>>]<<<<<<<[->>>>>>>+<<+<<<<<]]>>>>[-<<<<+>
>>>]<<<<[->>>>+>>>>>[>+>>[-<<->>]<<[->>+<<]>>>>>>>>]<<<<<<<<+<[>[->>>>+<<<[->>>-
<<<<<<<<<<<<<+>>>>>>>>>>>[->>+<<]<]>[->>-<<<<<<<<<<<<<+>>>>>>>>>>>]<<]>[->>>+<<[
->>-<<<<<<<<<<<<<+>>>>>>>>>>>]<]>[->>+<<]<<<<<<<<<<<<]]>>>>[-]<<<<]>>>>[-<<<<+>>
>>]<<<<[->>>>+>[-]>>[-<<<<<<<+>>>>>>>]<<<<<<<[->>>>>>>+<<+<<<<<]>>>>>>>>>[>>>>>>
>>>]<<<<<<<<<[>[->>>>+<<<[->>>-<<<<<<<<<<<<<+>>>>>>>>>>>[->>+<<]<]>[->>-<<<<<<<<
<<<<<+>>>>>>>>>>>]<<]>[->>>+<<[->>-<<<<<<<<<<<<<+>>>>>>>>>>>]<]>[->>+<<]<<<<<<<<
<<<<]]>>>>>>>>>[>>[-]>[-]>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>[-]>[-]>>>>>[>>>>>[-<<<<+
>>>>]<<<<[->>>>+<<<+<]>>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>[>>>>>>[-<<<<<+>>>>>
]<<<<<[->>>>>+<<<+<<]>>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>+++++++++++++++[[>>>>
>>>>>]+>[-]>[-]>[-]>[-]>[-]>[-]>[-]>[-]>[-]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>-]+[>+>>
>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>[>->>>>[-<<<<+>>>>]<<<<[->>>>+<<<<<[->>[-<<+
>>]<<[->>+>>+<<<<]+>>>>>>>>>]<<<<<<<<[<<<<<<<<<]]>>>>>>>>>[>>>>>>>>>]<<<<<<<<<[>
[->>>>>>>>>+<<<<<<<<<]<<<<<<<<<<]>[->>>>>>>>>+<<<<<<<<<]<+>>>>>>>>]<<<<<<<<<[>[-
]<->>>>[-<<<<+>[<->-<<<<<<+>>>>>>]<[->+<]>>>>]<<<[->>>+<<<]<+<<<<<<<<<]>>>>>>>>>
[>+>>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>[>->>>>>[-<<<<<+>>>>>]<<<<<[->>>>>+<<<<
<<[->>>[-<<<+>>>]<<<[->>>+>+<<<<]+>>>>>>>>>]<<<<<<<<[<<<<<<<<<]]>>>>>>>>>[>>>>>>
>>>]<<<<<<<<<[>>[->>>>>>>>>+<<<<<<<<<]<<<<<<<<<<<]>>[->>>>>>>>>+<<<<<<<<<]<<+>>>
>>>>>]<<<<<<<<<[>[-]<->>>>[-<<<<+>[<->-<<<<<<+>>>>>>]<[->+<]>>>>]<<<[->>>+<<<]<+
<<<<<<<<<]>>>>>>>>>[>>>>[-<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<+>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>]>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>+++++++++++++++[[>>>>>>>>
>]<<<<<<<<<-<<<<<<<<<[<<<<<<<<<]>>>>>>>>>-]+>>>>>>>>>>>>>>>>>>>>>+<<<[<<<<<<<<<]
>>>>>>>>>[>>>[-<<<->>>]+<<<[->>>->[-<<<<+>>>>]<<<<[->>>>+<<<<<<<<<<<<<[<<<<<<<<<
]>>>>[-]+>>>>>[>>>>>>>>>]>+<]]+>>>>[-<<<<->>>>]+<<<<[->>>>-<[-<<<+>>>]<<<[->>>+<
<<<<<<<<<<<[<<<<<<<<<]>>>[-]+>>>>>>[>>>>>>>>>]>[-]+<]]+>[-<[>>>>>>>>>]<<<<<<<<]>
>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>->>[-<<<<+>>>>]<<<<[->>>>+<<[-]<<]>>]<<+>>>>[-<<<<
->>>>]+<<<<[->>>>-<<<<<<.>>]>>>>[-<<<<<<<.>>>>>>>]<<<[-]>[-]>[-]>[-]>[-]>[-]>>>[
>[-]>[-]>[-]>[-]>[-]>[-]>>>]<<<<<<<<<[<<<<<<<<<]>>>>>>>>>[>>>>>[-]>>>>]<<<<<<<<<
[<<<<<<<<<]>+++++++++++[-[->>>>>>>>>+<<<<<<<<<]>>>>>>>>>]>>>>+>>>>>>>>>+<<<<<<<<
<<<<<<[<<<<<<<<<]>>>>>>>[-<<<<<<<+>>>>>>>]<<<<<<<[->>>>>>>+[-]>>[>>>>>>>>>]<<<<<
<<<<[>>>>>>>[-<<<<<<+>>>>>>]<<<<<<[->>>>>>+<<<<<<<[<<<<<<<<<]>>>>>>>[-]+>>>]<<<<
<<<<<<]]>>>>>>>[-<<<<<<<+>>>>>>>]<<<<<<<[->>>>>>>+>>[>+>>>>[-<<<<->>>>]<<<<[->>>
>+<<<<]>>>>>>>>]<<+<<<<<<<[>>>>>[->>+<<]<<<<<<<<<<<<<<]>>>>>>>>>[>>>>>>>>>]<<<<<
<<<<[>[-]<->>>>>>>[-<<<<<<<+>[<->-<<<+>>>]<[->+<]>>>>>>>]<<<<<<[->>>>>>+<<<<<<]<
+<<<<<<<<<]>>>>>>>-<<<<[-]+<<<]+>>>>>>>[-<<<<<<<->>>>>>>]+<<<<<<<[->>>>>>>->>[>>
>>>[->>+<<]>>>>]<<<<<<<<<[>[-]<->>>>>>>[-<<<<<<<+>[<->-<<<+>>>]<[->+<]>>>>>>>]<<
<<<<[->>>>>>+<<<<<<]<+<<<<<<<<<]>+++++[-[->>>>>>>>>+<<<<<<<<<]>>>>>>>>>]>>>>+<<<
<<[<<<<<<<<<]>>>>>>>>>[>>>>>[-<<<<<->>>>>]+<<<<<[->>>>>->>[-<<<<<<<+>>>>>>>]<<<<
<<<[->>>>>>>+<<<<<<<<<<<<<<<<[<<<<<<<<<]>>>>[-]+>>>>>[>>>>>>>>>]>+<]]+>>>>>>>[-<
<<<<<<->>>>>>>]+<<<<<<<[->>>>>>>-<<[-<<<<<+>>>>>]<<<<<[->>>>>+<<<<<<<<<<<<<<[<<<
<<<<<<]>>>[-]+>>>>>>[>>>>>>>>>]>[-]+<]]+>[-<[>>>>>>>>>]<<<<<<<<]>>>>>>>>]<<<<<<<
<<[<<<<<<<<<]>>>>[-]<<<+++++[-[->>>>>>>>>+<<<<<<<<<]>>>>>>>>>]>>>>-<<<<<[<<<<<<<
<<]]>>>]<<<<.>>>>>>>>>>[>>>>>>[-]>>>]<<<<<<<<<[<<<<<<<<<]>++++++++++[-[->>>>>>>>
>+<<<<<<<<<]>>>>>>>>>]>>>>>+>>>>>>>>>+<<<<<<<<<<<<<<<[<<<<<<<<<]>>>>>>>>[-<<<<<<
<<+>>>>>>>>]<<<<<<<<[->>>>>>>>+[-]>[>>>>>>>>>]<<<<<<<<<[>>>>>>>>[-<<<<<<<+>>>>>>
>]<<<<<<<[->>>>>>>+<<<<<<<<[<<<<<<<<<]>>>>>>>>[-]+>>]<<<<<<<<<<]]>>>>>>>>[-<<<<<
<<<+>>>>>>>>]<<<<<<<<[->>>>>>>>+>[>+>>>>>[-<<<<<->>>>>]<<<<<[->>>>>+<<<<<]>>>>>>
>>]<+<<<<<<<<[>>>>>>[->>+<<]<<<<<<<<<<<<<<<]>>>>>>>>>[>>>>>>>>>]<<<<<<<<<[>[-]<-
>>>>>>>>[-<<<<<<<<+>[<->-<<+>>]<[->+<]>>>>>>>>]<<<<<<<[->>>>>>>+<<<<<<<]<+<<<<<<
<<<]>>>>>>>>-<<<<<[-]+<<<]+>>>>>>>>[-<<<<<<<<->>>>>>>>]+<<<<<<<<[->>>>>>>>->[>>>
>>>[->>+<<]>>>]<<<<<<<<<[>[-]<->>>>>>>>[-<<<<<<<<+>[<->-<<+>>]<[->+<]>>>>>>>>]<<
<<<<<[->>>>>>>+<<<<<<<]<+<<<<<<<<<]>+++++[-[->>>>>>>>>+<<<<<<<<<]>>>>>>>>>]>>>>>
+>>>>>>>>>>>>>>>>>>>>>>>>>>>+<<<<<<[<<<<<<<<<]>>>>>>>>>[>>>>>>[-<<<<<<->>>>>>]+<
<<<<<[->>>>>>->>[-<<<<<<<<+>>>>>>>>]<<<<<<<<[->>>>>>>>+<<<<<<<<<<<<<<<<<[<<<<<<<
<<]>>>>[-]+>>>>>[>>>>>>>>>]>+<]]+>>>>>>>>[-<<<<<<<<->>>>>>>>]+<<<<<<<<[->>>>>>>>
-<<[-<<<<<<+>>>>>>]<<<<<<[->>>>>>+<<<<<<<<<<<<<<<[<<<<<<<<<]>>>[-]+>>>>>>[>>>>>>
>>>]>[-]+<]]+>[-<[>>>>>>>>>]<<<<<<<<]>>>>>>>>]<<<<<<<<<[<<<<<<<<<]>>>>[-]<<<++++
+[-[->>>>>>>>>+<<<<<<<<<]>>>>>>>>>]>>>>>->>>>>>>>>>>>>>>>>>>>>>>>>>>-<<<<<<[<<<<
<<<<<]]>>>]


Note: on Rosettacode there is a simple D program that with the LDC2 compiler runs the mandel brainfuck program in less than 1.1 seconds (and LDC2 compiles it in 2.3 seconds with full optimizations):
http://rosettacode.org/wiki/Execute_Brain****/D
But here I'll use an interpreter.

All the D code in this post is compiled with the LDC2 compiler:
https://github.com/ldc-developers/ldc

I am using the 32 bit version 0.15.2-beta2 (DMD v2.066.1, LLVM 3.6.1) on a 64 bit Windows 8.1.

I compile all the code with a command like:
ldmd2 -O -release -inline -noboundscheck brainfuck.d



This is the original D version on https://github.com/kostya/benchmarks :

// brainfuck0.d
import std.algorithm;
import std.stdio;
import std.file;
import std.array;
import std.conv;

class Tape {
  int pos;
  int[] tape;

  this() {
    pos = 0;
    tape ~= 0;
  }

final:
  int get() { return tape[pos]; }
  void inc() { tape[pos]++; }
  void dec() { tape[pos]--; }
  void advance() { pos++; if (tape.length <= pos) tape ~= 0; }
  void devance() { if (pos > 0) { pos--; } }
};

class Program {
  string code;
  int[int] bracket_map;

  this(string text) {
    int[] leftstack;
    int pc = 0;

    for (int i = 0; i < text.length; i++) {
      char c = text[i];
      if (!canFind("[]<>+-,.", c)) continue;

      if (c == '[') leftstack ~= pc;
      else
        if (c == ']' && leftstack.length != 0) {
          int left = leftstack[leftstack.length - 1];
          leftstack.popBack();
          int right = pc;
          bracket_map[left] = right;
          bracket_map[right] = left;
        }

      pc++;
      code ~= c;
    }
  }

  void run() {
    auto tape = new Tape();
    for (int pc = 0; pc < code.length; pc++) {
      switch (code[pc]) {
        case '+':
          tape.inc();
          break;
        case '-':
          tape.dec();
          break;
        case '>':
          tape.advance();
          break;
        case '<':
          tape.devance();
          break;
        case '[':
          if (tape.get() == 0) pc = bracket_map[pc];
          break;
        case ']':
          if (tape.get() != 0) pc = bracket_map[pc];
          break;
        case '.':
          write(tape.get().to!char);
          stdout.flush();
          break;
        default:
          break;
      }
    }
  }
};

void main(string[] args){
  if (args.length == 2) {
      string text = readText(args[1]);
      auto p = new Program(text);
      p.run();
  }
}



The keys of the bracket_map associative array are points in the cleaned up Brainfuck program. Usually Brainfuck programs aren't huge, so using an associative array is a performance waste. This improved program uses a dynamic array (of ints) for the bracket_map:

// brainfuck1.d
import std.algorithm;
import std.stdio;
import std.file;
import std.array;
import std.conv;

class Tape {
  int pos;
  int[] tape;

  this() {
    pos = 0;
    tape ~= 0;
  }

final:
  int get() { return tape[pos]; }
  void inc() { tape[pos]++; }
  void dec() { tape[pos]--; }
  void advance() { pos++; if (tape.length <= pos) tape ~= 0; }
  void devance() { if (pos > 0) { pos--; } }
};

class Program {
  string code;
  int[] bracket_map;

  this(string text) {
    bracket_map.length = text.length;
    int[] leftstack;
    int pc = 0;

    for (int i = 0; i < text.length; i++) {
      char c = text[i];
      if (!canFind("[]<>+-,.", c)) continue;

      if (c == '[') leftstack ~= pc;
      else
        if (c == ']' && leftstack.length != 0) {
          int left = leftstack[leftstack.length - 1];
          leftstack.popBack();
          int right = pc;
          bracket_map[left] = right;
          bracket_map[right] = left;
        }

      pc++;
      code ~= c;
    }
  }

  void run() {
    auto tape = new Tape();
    for (int pc = 0; pc < code.length; pc++) {
      switch (code[pc]) {
        case '+':
          tape.inc();
          break;
        case '-':
          tape.dec();
          break;
        case '>':
          tape.advance();
          break;
        case '<':
          tape.devance();
          break;
        case '[':
          if (tape.get() == 0) pc = bracket_map[pc];
          break;
        case ']':
          if (tape.get() != 0) pc = bracket_map[pc];
          break;
        case '.':
          write(tape.get().to!char);
          stdout.flush();
          break;
        default:
          break;
      }
    }
  }
};

void main(string[] args){
  if (args.length == 2) {
      string text = readText(args[1]);
      auto p = new Program(text);
      p.run();
  }
}


Here I have reformatted the code a little, I have converted the classes to structs, I have converted code from string to uint[], and I have converted bracket_map to uint[]. Most of such changes don't improve performance, and indeed they decrease performance a little. But they lay the foundations for successive improvements.

// brainfuck2.d
import std.stdio, std.algorithm, std.file, std.array, std.conv;

struct Tape {
    uint pos;
    uint[] tape = [0];

    uint get() const nothrow { return tape[pos]; }
    void inc() nothrow { tape[pos]++; }
    void dec() nothrow { tape[pos]--; }
    void advance() nothrow { pos++; if (tape.length <= pos) tape ~= 0; }
    void devance() nothrow { if (pos > 0) pos--; }
}

struct Program {
    uint[] code, bracket_map;

    this(in string text) {
        bracket_map.length = text.length;
        uint[] left_stack;
        uint pc = 0;

        foreach (immutable c; text) {
            immutable pos = "+-><[].,".countUntil(c);
            if (pos == -1)
                continue;
            code ~= pos;

            if (c == '[') {
                left_stack ~= pc;
            } else {
                if (c == ']' && left_stack.length != 0) {
                    immutable left = left_stack[$ - 1];
                    left_stack.popBack;
                    immutable right = pc;
                    bracket_map[left] = right;
                    bracket_map[right] = left;
                }
            }

            pc++;
        }
    }

    void run() const {
        Tape tape;
        for (uint pc = 0; pc < code.length; pc++) {
            switch (code[pc]) {
                case 0:
                    tape.inc;
                    break;
                case 1:
                    tape.dec;
                    break;
                case 2:
                    tape.advance;
                    break;
                case 3:
                    tape.devance;
                    break;
                case 4:
                    if (tape.get == 0) pc = bracket_map[pc];
                    break;
                case 5:
                    if (tape.get != 0) pc = bracket_map[pc];
                    break;
                case 6:
                    tape.get.to!char.write;
                    stdout.flush;
                    break;
                default: // ',' ignored.
                    break;
            }
        }
    }
}

void main(in string[] args) {
    if (args.length == 2)
        args[1].readText.Program.run;
}


Unlike the GCC compiler, the D language doesn't have computed gotos, so you can't use them to speed up this Brainfuck interpreter. So I've added a sequence of if-goto to the switch cases (using a string mixin to keep code short), and thankfully the powerful LLVM back-end of the LDC2 compiler recognizes this pattern and turns it into poor's man computed gotos:

// brainfuck3.d
import std.stdio, std.algorithm, std.file, std.array, std.conv;

struct Tape {
    uint pos;
    uint[] tape = [0];

    uint get() const nothrow { return tape[pos]; }
    void inc() nothrow { tape[pos]++; }
    void dec() nothrow { tape[pos]--; }
    void advance() nothrow { pos++; if (tape.length <= pos) tape ~= 0; }
    void devance() nothrow { if (pos > 0) pos--; }
}

struct Program {
    uint[] code, bracket_map;

    this(in string text) {
        bracket_map.length = text.length;
        uint[] left_stack;
        uint pc = 0;

        foreach (immutable c; text) {
            immutable pos = "+-><[].,".countUntil(c);
            if (pos == -1)
                continue;
            code ~= pos;

            if (c == '[') {
                left_stack ~= pc;
            } else {
                if (c == ']' && left_stack.length != 0) {
                    immutable left = left_stack[$ - 1];
                    left_stack.popBack;
                    immutable right = pc;
                    bracket_map[left] = right;
                    bracket_map[right] = left;
                }
            }

            pc++;
        }
    }

    void run() const {
        enum jumps = "
                    if (pc < code.length) {
                        pc++;
                        if (code[pc] == 0) goto case 0;
                        if (code[pc] == 1) goto case 1;
                        if (code[pc] == 2) goto case 2;
                        if (code[pc] == 3) goto case 3;
                        if (code[pc] == 4) goto case 4;
                        if (code[pc] == 5) goto case 5;
                        if (code[pc] == 6) goto case 6;
                    }
                    break;";

        Tape tape;
        for (uint pc = 0; pc < code.length; pc++) {
            switch (code[pc]) {
                case 0:
                    tape.inc;
                    mixin(jumps);
                case 1:
                    tape.dec;
                    mixin(jumps);
                case 2:
                    tape.advance;
                    mixin(jumps);
                case 3:
                    tape.devance;
                    mixin(jumps);
                case 4:
                    if (tape.get == 0) pc = bracket_map[pc];
                    mixin(jumps);
                case 5:
                    if (tape.get != 0) pc = bracket_map[pc];
                    mixin(jumps);
                case 6:
                    tape.get.to!char.write;
                    stdout.flush;
                    mixin(jumps);
                default: // ',' ignored.
                    break;
            }
        }
    }
}

void main(in string[] args) {
    if (args.length == 2)
        args[1].readText.Program.run;
}



One last performance optimization is to use a fixed-size array to represent "tape". This is not so bad because traditionally Brainfuck runtimes don't have unbounded tape memory, as you see here:

https://en.wikipedia.org/wiki/Brainfuck#Language_design

The brainfuck language uses a simple machine model consisting of the program and instruction pointer, as well as an array of at least 30,000 byte cells initialized to zero;


// brainfuck4.d
import std.stdio, std.algorithm, std.file, std.array, std.conv;

struct Tape {
    uint pos;
    uint[100_000] tape;

    uint get() const nothrow { return tape[pos]; }
    void inc() nothrow { tape[pos]++; }
    void dec() nothrow { tape[pos]--; }
    void advance() nothrow { pos++; }
    void devance() nothrow { if (pos > 0) pos--; }
}

struct Program {
    uint[] code, bracket_map;

    this(in string text) {
        bracket_map.length = text.length;
        uint[] left_stack;
        uint pc = 0;

        foreach (immutable c; text) {
            immutable pos = "+-><[].,".countUntil(c);
            if (pos == -1)
                continue;
            code ~= pos;

            if (c == '[') {
                left_stack ~= pc;
            } else {
                if (c == ']' && left_stack.length != 0) {
                    immutable left = left_stack[$ - 1];
                    left_stack.popBack;
                    immutable right = pc;
                    bracket_map[left] = right;
                    bracket_map[right] = left;
                }
            }

            pc++;
        }
    }

    void run() const {
        enum jumps = "
                    if (pc < code.length) {
                        pc++;
                        if (code[pc] == 0) goto case 0;
                        if (code[pc] == 1) goto case 1;
                        if (code[pc] == 2) goto case 2;
                        if (code[pc] == 3) goto case 3;
                        if (code[pc] == 4) goto case 4;
                        if (code[pc] == 5) goto case 5;
                        if (code[pc] == 6) goto case 6;
                    }
                    break;";

        Tape tape;
        for (uint pc = 0; pc < code.length; pc++) {
            switch (code[pc]) {
                case 0:
                    tape.inc;
                    mixin(jumps);
                case 1:
                    tape.dec;
                    mixin(jumps);
                case 2:
                    tape.advance;
                    mixin(jumps);
                case 3:
                    tape.devance;
                    mixin(jumps);
                case 4:
                    if (tape.get == 0) pc = bracket_map[pc];
                    mixin(jumps);
                case 5:
                    if (tape.get != 0) pc = bracket_map[pc];
                    mixin(jumps);
                case 6:
                    tape.get.to!char.write;
                    stdout.flush;
                    mixin(jumps);
                default: // ',' ignored.
                    break;
            }
        }
    }
}

void main(in string[] args) {
    if (args.length == 2)
        args[1].readText.Program.run;
}


Timings "mandel", in seconds, best of 3:
  brainfuck0:  33.10
  brainfuck1:  21.11
  brainfuck2:  22.47
  brainfuck3:  15.82
  brainfuck4:  11.79


So the last version is almost three times faster than the original version.


Another smaller Brainfuck benchmark, "bench.d":

 Benchmark brainf*ck program
>++[<+++++++++++++>-]<[[>+>+<<-]>[<+>-]++++++++
[>++++++++<-]>.[-]<<>++++++++++[>++++++++++[>++
++++++++[>++++++++++[>++++++++++[>++++++++++[>+
+++++++++[-]<-]<-]<-]<-]<-]<-]<-]++++++++++.


Timings "mandel", in seconds, best of 3:
  brainfuck0:  4.82
  brainfuck1:  1.75
  brainfuck2:  2.16
  brainfuck3:  1.36
  brainfuck4:  1.25


This is a part of the brainfuck4 32 bit asm (generated with -output-s switch) generated by LDC2 (the whole run() method gets compiled to about 150 32 bit asm instructions):

...
LBB8_2:
	movl	4(%esi), %ecx
	movl	(%ecx,%edi,4), %edx
	cmpl	$6, %edx
	ja	LBB8_45
	jmpl	*LJTI8_0(,%edx,4)
LBB8_4:
	movl	%edx, 24(%esp)
	.align	16, 0x90
LBB8_5:
	movl	24(%esp), %edx
	incl	28(%esp,%edx,4)
	cmpl	%eax, %edi
	jae	LBB8_45
	movl	4(%ecx,%edi,4), %edx
	incl	%edi
	cmpl	$6, %edx
	ja	LBB8_45
	jmpl	*LJTI8_1(,%edx,4)
LBB8_8:
	movl	%edx, 24(%esp)
	.align	16, 0x90
LBB8_9:
	movl	24(%esp), %edx
	decl	28(%esp,%edx,4)
	cmpl	%eax, %edi
	jae	LBB8_45
	movl	4(%ecx,%edi,4), %edx
	incl	%edi
	cmpl	$6, %edx
	ja	LBB8_45
	jmpl	*LJTI8_2(,%edx,4)
LBB8_12:
	movl	24(%esp), %edx
...


It contains the instructions like "jmpl *LJTI8_0(,%edx,4)" that are similar to the ones produced by C-GCC with computed gotos and a jump table.

Where the jump tables LJTI8_0, LJTI8_1, etc, are all the same (LLVM seems unable to see they are all equal, wasting space on the L1 caches):

LJTI8_0:
	.long	LBB8_5
	.long	LBB8_9
	.long	LBB8_12
	.long	LBB8_17
	.long	LBB8_24
	.long	LBB8_32
	.long	LBB8_40
LJTI8_1:
	.long	LBB8_5
	.long	LBB8_9
	.long	LBB8_12
	.long	LBB8_17
	.long	LBB8_24
	.long	LBB8_32
	.long	LBB8_40
LJTI8_2:
	.long	LBB8_5
	.long	LBB8_9
...


Perhaps we can speed up the last D version grouping the most common short sequences of instructions, like ">>".

Links

Some links:

"The Unreasonable Effectiveness of Recurrent Neural Networks":
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Slides, "Statistics for Hackers" by Jake VanderPlas. Minimal but nice:
https://speakerdeck.com/jakevdp/statistics-for-hackers

"Pedestrian Statistics" by Pavel Panchekha, some more basic Statistics:
https://pavpanchekha.com/blog/stats1.html

Nice posts about bayesian thinking:
http://slatestarcodex.com/2015/08/24/probabilities-without-models/
http://slatestarcodex.com/2013/05/02/if-its-worth-doing-its-worth-doing-with-made-up-statistics/

The very simple inspection paradox:
http://allendowney.blogspot.co.uk/2015/08/the-inspection-paradox-is-everywhere.html

This shows the linerized solar system in scale (where the moon diameter is 1 pixel), very good at giving an idea of the distances between planets (something similar for interstellar distances should be created):
http://joshworth.com/dev/pixelspace/pixelspace_solarsystem.html

Links

News:

A new image of Chakasa Bluefire:
http://www.fantascienza.net/leonardo/im/index.html#bluefire_by_taru_2015

New software:
http://www.fantascienza.net/leonardo/js/index.html#backtracking

---------------

Links:

The Case for Formal Verification:
http://permalink.gmane.org/gmane.comp.encryption.general/14818

Why History isn't Scientific (And Why It Can Still Tell Us About the Past):
http://armariummagnus.blogspot.it/2013/11/why-history-isnt-scientific-and-why-it.html

Nice data structures of LLVM: http://llvm.org/devmtg/2014-04/PDFs/LightningTalks/data_structure_llvm.pdf

Paper Microscope:
http://www.youtube.com/watch?v=h8cF5QPPmWU

Reinventing Git interface:
http://tonsky.me/blog/reinventing-git-interface/

Nice intro to some parts of functional programming:
http://www.slideshare.net/ScottWlaschin/railway-oriented-programming

An interesting usage case study of SPARK (a safer Ada dialect) for critical robot navigation code:
http://www.spark-2014.org/uploads/spark-robot-navigation-iros2014.pdf

A shallow survey of formal methods for C code:
https://www.imperialviolet.org/2014/09/07/provers.html

A Derivation of Conway’s Degree-71 "Look-and-Say" Polynomial
http://www.njohnston.ca/2010/10/a-derivation-of-conways-degree-71-look-and-say-polynomial/

"The Unreasonable Effectiveness of Dynamic Typing for Practical Programs" by Robert Smallshire:
http://vimeo.com/74354480

Refined types: a better type system for more secure software:
https://github.com/tomprimozic/type-systems/tree/master/refined_types

Should Airplanes Be Flying Themselves?
http://www.vanityfair.com/business/2014/10/air-france-flight-447-crash

Types and Testing in Haskell, by Daniel Patterson:
http://dbp.io/static/types-testing-haskell-meetup-2014.pdf

Functional Programming Patterns:
http://www.slideshare.net/ScottWlaschin/fp-patterns-buildstufflt
http://schd.ws/hosted_files/buildstuff14/c5/Fp_Patterns.pptx

Comparing Rust and C++:
http://kukuruku.co/hub/rust/comparing-rust-and-cpp

Analysis of Chutes and Ladders:
http://www.datagenetics.com/blog/november12011/index.html

Mathematical speculations about bees:
http://aperiodical.com/2015/01/apiological-mathematical-speculations-about-bees-part-1-honeycomb-geometry/

The Reindeer Riders, with wonderful photos:
http://www.messynessychic.com/2015/02/25/the-reindeer-riders/

Engineers of addiction, on slot machines:
http://www.theverge.com/2015/5/6/8544303/casino-slot-machine-gambling-addiction-psychology-mobile-games

Addictions and their sources:
http://www.iflscience.com/health-and-medicine/many-people-use-drugs-here-s-why-most-don-t-become-addicts

Whale tongue nerves are elastic:
http://www.theglobeandmail.com/news/british-columbia/new-findings-on-whale-tongues-may-lead-to-clues-on-human-nerve-damage/article24251250/

Religion against religion:
http://www.slate.com/blogs/xx_factor/2015/05/04/satanists_in_missouri_they_argue_that_the_state_s_waiting_period_for_abortion.html

Living with being dead:
https://medium.com/matter/living-with-being-dead-844b4b1e668

You don't need to be genius to do mathematics:
http://www.wsj.com/articles/the-wrong-way-to-treat-child-geniuses-1401484790

A warm blooded fish and its adaptations:
http://news.sciencemag.org/biology/2015/05/scientists-discover-first-warm-bodied-fish

Friendships and gift economy used for better business:
http://www.theguardian.com/media/2015/may/07/how-friendship-became-tool-of-powerful
programming

High-level Spelling Corrector in D - reprise

Years ago I wrote a D V.1 language implementation of the little Norvig spell corrector, using my own library:
http://leonardo-m.livejournal.com/59589.html

The original code an explanation (the Python code is changed in the meantime):
http://norvig.com/spell-correct.html

In the meantime the D language and its standard library Phobos have improved a lot, so now I rewrite the D solution using just the standad library.

This is partially "golfed" D code, it's not an example of good D code.
import std.stdio, std.regex, std.algorithm, std.range, std.File, std.string, std.ascii;

int[string] nWords;

auto set(R)(R r) { return r.zip(0.repeat).assocArray; }

auto edits1(string W) @safe {
    immutable n = W.length;
    auto deletion = n.iota.map!(i=>W[0..i]~W[i+1..$]);
    auto transposition = iota(n-1).map!(i=>W[0..i]~W[i+1]~W[i]~W[i+2..$]);
    auto alteration = n.iota.cartesianProduct(lowercase).map!(p=>W[0..p[0]]~cast(char)p[1]~W[p[0]+1..$]);
    auto insertion = iota(n+1).cartesianProduct(lowercase).map!(p=>W[0..p[0]]~cast(char)p[1]~W[p[0]..$]);
    return chain(deletion, transposition, alteration, insertion).set;
}

auto known(R)(R words) { return words.filter!(w => w in nWords).set.keys; }

enum knownEdits2 = (string word) => word.edits1.byKey.map!(e1 => e1.edits1.byKey).joiner.known;

T or(T)(T x, lazy T y) { return x.length ? x : y; }

string correct(string word) {
    auto candidates = [word].known.or(word.edits1.byKey.known.or(word.knownEdits2.or([word])));
    return candidates.minPos!((w1, w2) => nWords.get(w1, -1) < nWords.get(w2, -1)).front;
}

void main() {
    foreach (mo; "big.txt".readText.toLower.matchAll("[a-z]+"))
        nWords[mo.front]++;
    writeln("speling".correct, '\n', "korrecter".correct);
}

The output is:
spelling
corrected

Phobos currently lacks a set implementation, so I've defined a function that uses the built-in associative arrays.
The "or" function is from the D1 code, with a small improvement.

The run-time of the D code compiled with ldc2 (V.0.15.1) is about 1.5 seconds (32 bit Windows). Python 2.6.6 runs the updated Python code (searching for both words) in about 1.3 seconds.

The Python version creates the NWORDS dict in about 1.1 seconds. The D version takes about 1 second to create it.
programming

The '99 Bottles of Beers' of Type Systems in D

A nice blog post, ""The "99 Bottles of Beers" of Type Systems"":

http://gelisam.blogspot.ca/2014/12/the-99-bottles-of-beers-of-type-systems_21.html


The problem is to manage integers, pairs of integers, pairs of pairs of integers, and so on, to represent them in the type system, and to print the n-th iteration of them:

0. Int
1. (Int,Int)
2. ((Int,Int),(Int,Int))
3. (((Int,Int),(Int,Int)),((Int,Int),(Int,Int)))
4. ...

An Haskell solution from that blog:
printTree :: Show a => a -> Int -> IO ()
printTree v 0 = print v
printTree v n = printTree (v,v) (n-1)

main :: IO ()
main = readLn >>= printTree 42

This Haskell solution is short, and it's also efficient at run-time (quite faster than the Java solution shown in that blog).

One D language solution uses the static (monomorphization) version of D generics, similar to C++ templates:
import std.stdio, std.conv, std.typetuple;

// For higher maxDepth increase stack with -L/STACK:10000000
enum maxDepth = 14;

template Iota(uint stop) {
    static if (stop <= 0)
        alias Iota = TypeTuple!();
    else
        alias Iota = TypeTuple!(Iota!(stop - 1), stop - 1);
}

struct Pair(T) {
    T x, y;

    string toString() {
        return text('(', x, ',', y, ')');
    }
}

void printTree(uint depth, T)(T v) {
    static if (depth == 0)
        writeln(v);
    else
        printTree!(depth - 1)(Pair!T(v, v));
}

void main(in string[] args) {
    immutable depth = (args.length == 2) ? args[1].to!uint : 5;
    enum value = 42;

    switch (depth) {
        foreach (immutable d; Iota!maxDepth)
            case d: return printTree!d(value);
        default: return writeln("Too much large depth.");
    }
}

This gives efficient code, and the Pairs are stack-allocated, but the compiler allows only a limited amount of nesting. The foreach inside the switch generates the switch cases statically. This works up to a depth of 13.

To reach higher levels of nesting in D language I use dynamic (runtime) polymorphism:
import std.stdio, std.conv, std.format;

interface Value {
    void toString(scope void delegate(const(char)[]) sink) const;
}

final class Integer : Value {
    immutable int x;

    this(int x_) pure nothrow @safe @nogc {
        this.x = x_;
    }

    override void toString(scope void delegate(const(char)[]) sink) const {
        static fmt = singleSpec("%d");
        sink.formatValue(x, fmt);
    }
}

final class Pair : Value {
    const Value fst, snd;

    this(Value fst, Value snd) pure nothrow @safe @nogc {
        this.fst = fst;
        this.snd = snd;
    }

    override void toString(scope void delegate(const(char)[]) sink) const {
        sink("(");
        fst.toString(sink);
        sink(",");
        snd.toString(sink);
        sink(")");
    }
}

struct PairWrapper {
    Value v;
    void toString(scope void delegate(const(char)[]) sink) const {
        v.toString(sink);
    }
}

void printTree(Value v, in uint n) {
    if (n == 0)
        writeln(PairWrapper(v));
    else
        printTree(new Pair(v, v), n - 1);
}

void main(in string[] args) {
    import core.memory, core.stdc.stdlib;
    GC.disable;
    immutable depth = (args.length == 2) ? args[1].to!uint : 5;
    printTree(new Integer(42), depth);
    exit(0);
}

The toString() inside the interface and PairWrapper are used because D classes don't yet support the sink. The sink is important to increase performance, avoiding allocating lot of strings. This version is more efficient also because it leads to more coherence for the L1 code cache. This D solution is just a little slower than the Haskell solution (perhaps because the Haskell solution caches the formatting of the integer 42, I don't know. If I perform such caching the D code becomes faster than the Haskell code), and it's much faster than the Java version (despite currently the D garbage collector is less efficient than the OracleVM ones).
programming

Permutations exercise in D (and C)

By leonardo maffi, V.1.0, 1 Aug 2014.


A nice programming exercise in Haskell by Harry Garrood, "Permutations: an exercise" (21 Jul 2014):
http://harry.garrood.me/blog/permutations-an-exercise/
http://www.reddit.com/r/haskell/comments/2ba54b/a_little_programming_exercise_about_permutations/

Harry Garrood writes:
We're going to use Haskell, because this is all about functions (in the mathematical sense), and so Haskell, being a functional programming language, is especially well suited to the job.
Most of the code you below is in D language (with a C version at the end) because it's a good language to solve similar problems.

The problem statement defines a card shuffling technique:
1) Take one card from the top of the deck and discard it into a second pile.
2) Take another card from the top of the deck, and put it at the bottom of the deck.
3) Repeat these two steps, putting all discarded cards from step 1 into the same pile, until the original deck is all gone and the second pile has all the cards in it.

The problem is: how many shuffles does it take until a deck is in the same order as when you started, for a deck with an arbitrary number of cards? Write a function, f :: Int -> Int, such that, for a deck with n cards, f n is the minimum number of shuffles required to return it to its original order.

Below I explain the varios D implementations, that you can find in the archive:
http://www.fantascienza.net/leonardo/js/permutations.zip

The fantascienza site is down, so you can temporarily find the programs here:

perms1.hs: http://lpaste.net/2412259684489625600
perms2.hs: http://lpaste.net/390308567523000320
perms3.d: http://dpaste.dzfl.pl/2eb987664dfd
perms3b.d: http://dpaste.dzfl.pl/0aa48d395cb2
perms4.d: http://dpaste.dzfl.pl/0b0cb344ccce
perms5.d: http://dpaste.dzfl.pl/9f929c2d064d
perms6.d: http://dpaste.dzfl.pl/dbf1d3f19321
perms7.d: http://dpaste.dzfl.pl/a8eddf54067d
perms7b.d: http://dpaste.dzfl.pl/8a9a1afc10f8
perms7c.d: http://dpaste.dzfl.pl/45ff250cafa7
perms8.d: http://dpaste.dzfl.pl/32614ae6430c
perms9.c: http://dpaste.dzfl.pl/608dd06e44a5


I compile all the code on a 32 bit Windows using:
dmd -O -release -inline -noboundscheck
ghc -O3
gcc -Ofast -flto -std=c99 -Wall -Wextra

dmd 2.066beta4
Glasgow Haskell Compiler, Version 7.6.1, stage 2 booted by GHC version 7.4.2
gcc 4.0.0
- - - - - - - - - - - - - - - - - - -

perms1.hs:
perms2.hs:

The perms1.hs and perms2.hs programs are adapted from the two solutions by Harry Garrood, with small changes.

This is the original 'permutations' function:
permutation :: (Deck -> Deck) -> Int -> PermutationGraph
permutation f n = go 1 xs
    where
    xs = unCardAll . f . makeDeck $ n
    go m (y:ys) = (y, m) : go (m+1) ys
    go _ []     = []

This is my version:
permutation :: (Deck -> Deck) -> Int -> PermutationGraph
permutation shuf n = zip perm [1 ..]
    where perm = unCardAll $ shuf $ makeDeck n
perms2.hs runs on the six test cases (n = 4, 5, 52, 53, 100, 200) in about 0.08 seconds (such small timings are not precise).

- - - - - - - - - - - - - - - - - - -

perms3.d:
perms3b.d:

This program is equivalent to the perms1.hs program.
Harry Garrood defines a very nice and clean Haskell function that performs the two first steps of the shuffling:
step :: (Deck, Deck) -> (Deck, Deck)
step (deck, pile) = case deck of
    a:b:cs -> (cs ++ [b], a : pile)
    [a]    -> ([], a : pile)
    []     -> ([], pile)

I have implemented this 'step' function in various ways in D, and at the end this is the version I have used (here I have removed the pre/post-conditions, see the files for the full code), this performs the whole shuffling (including the third step):
Deck deckShuffle(in Deck Ain) pure nothrow @safe @nogc {
    typeof(return) B;
    const(Card)[] A = Ain;

    while (!A.empty) {
        // Step 1:
        B = [Card(A.front)] ~ B;
        A.popFront;

        // Step 2:
        if (!A.empty) {
            A ~= A.front;
            A.popFront;
        }
    }

    return B;
}

This function is (strongly) pure, it accepts a constant array of Cards and performs the shuffling using slices and concatenations, building a new output array B. Here a Deck is a dynamic array, but perhaps for this algorithm a more efficient data structure for a Deck is a (singly linked) list (about as in the Haskell code).

In perms3 I have a commented-out the range-based version of 'makeDeck' because currently the 'array' function is not @safe for that use case. So I've written a more internally imperative version of 'makeDeck':
enum makeDeck = (in uint n) pure nothrow => iota(1, n + 1).map!Card.array;

I have used a Typedef to implement the 'newtype' of the Haskell code (despite currently Typedef has some bugs):
alias Card = Typedef!int;
The function 'f1' uses a do-while because Phobos still lacks an 'iterate' range-based function, as used in the 'shuffle' Haskell function.

I have run both perms1.hs and perms3.d on the few testing values, both print:
4 2
5 5
52 510
53 53
100 120
200 8460

perms1.hs produces this output in about 2.72 seconds:
[2,5,510,53,120,8460]

perms3.d produces this output in (best of 3) in about 1.39 seconds:
4 2
5 5
52 510
53 53
100 120
200 8460
I have then created a perms3b.d program that is very similar to perms3.d, but represents a Deck with a SList (a singly linked list) hoping to improve performance avoiding the O(n) copies of the slices of the array-based version. But perms3b.d runs in about 30.4 seconds. In most cases linked lists don't give any performance advantage.

- - - - - - - - - - - - - - - - - - -

perms4.d:

This version is similar to perms3.d, but here I have implemented 'deckShuffle' in a lower level way. This deckShuffle is now @nogc (it doesn't allocate on the heap), and works in-place on two given arrays, using indexes (it could also be implemented using pointer arithmetic, but there's no need for that, this function is fast and @safe). Now deckShuffle is still annotated with 'pure' but now it's weakly pure:
void deckShuffle(Deck A, Deck B) pure nothrow @safe @nogc {
    size_t aTop = 0,
           aLength = A.length,
           bTop = B.length - 1; // B is empty at start.

    while (aLength) {
        // Step 1:
        B[bTop] = A[aTop];
        aTop = (aTop + 1) % A.length;
        bTop--;
        aLength--;

        // Step 2:
        if (aLength) {
            A[(aTop + aLength) % A.length] = A[aTop];
            aTop = (aTop + 1) % A.length;
        }
    }
}
This kind of code is more fiddly and requires more care and testing compared to more a functional style, but it's also much faster and produces no heap garbage. The run-time of this function is also more predictable.

The run time of 'perms4.d' to produce the six output pairs is about 0.06 seconds, so this is about 23 times faster than 'perms3.d'. Unlike many other languages, in D it's easy to mix low level code with higher level code, and produce code that still looks natural and free of most accidental complexity.

- - - - - - - - - - - - - - - - - - -

perms5.d:

This implements the algorithm from perms2.hs. For simplicity I have used the slower but simpler 'deckShuffle' from perms3.d.

Instead of writing the various functions of the Haskell version, building the various intermediate data stutures (like the list of pairs), I have written a single 'f2' function that computes the lcm of the cycleLen without storing the actual cycles. D has less 'glue' than Haskell, so often in D it's more convenient to write a little larger functions (with the side effect that they are often faster than the Haskell versions, but also less easy to combine in different ways):
ulong f2(in uint n) pure nothrow {
    uint[uint] graph;
    foreach (immutable uint i, immutable x; n.makeDeck.deckShuffle)
        graph[i + 1] = cast(uint)x;

    typeof(return) result = 1;

    while (graph.length) {
        auto x = graph.byKey.front;
        uint cycleLen = 0;
        while (x in graph) {
            cycleLen++;
            immutable y = graph[x];
            graph.remove(x);
            x = y;
        }
        result = lcm(result, cycleLen);
    }

    return result;
}
This code is less general and more specialized than the Haskell functions. It's less easy to see what the code is doing. Here 'graph' of the pairs is an associative array to avoid the linear searches in the lists (but a linear scan is still present despite being hidden, the one performed by byKey.front).

The run-time of perms5.d is very small, so beside calling 'f2' on the six test cases, this function also computes 'f2' on the numbers n in [1, 1_000[. The total run-time is about 1.29 seconds.

If I modify perms2.hs to compute f2 in the same [1, 1_000[ interval, the run-time is very large (about half an hour, and it uses 32 bit integers, so some results for the [1, 1_000[ interval are wrong). So the 'f2' function in D has some advantages (and it's still strongly pure). The D 'f2' function returns a ulong because when n grows large, the output grows a lot.

- - - - - - - - - - - - - - - - - - -

perms6.d:

This is similar to perms5.d, but uses the more efficient 'deckShuffle' function from perms4.d.
The run-time of perms6.d is only 0.20 seconds (this includes the computations for n in [1, 1_000[), so it's more than six times faster than perms5.d.

- - - - - - - - - - - - - - - - - - -

perms7.d:
perms7b.d:
perms7c.d:

This is a lower-level version (that is easy to port to C language), it's similar to perms6.d, but here in 'f2' instead of the associative array 'graph' I have re-used the array D2, filling it with 'emptyItem' when I remove a value. 'f2' is now also @nogc because it accepts fixed-size arrays as arguments and then slices them. So the program allocates the arrays only inside the 'main' (and here are stack-allocated, so even the main is @nogc):
ulong f2(size_t m)(const uint n, ref Card[m] D1, ref Card[m] D2)
pure nothrow @safe @nogc {
    for (uint i = 0; i < n; i++)
        D1[i] = cast(Card)i;
    deck_shuffle(D1, D2, n);

    typeof(return) result = 1;
    uint graph_len = n;
    enum emptyItem = uint.max;

    while (graph_len) {
        uint x = uint.max;
        for (uint i = 0; i < n; i++)
            if (D2[i] != emptyItem)
                x = i;

        uint cycle_len = 0;
        while (D2[x] != emptyItem) {
            cycle_len++;
            const uint y = D2[x];
            D2[x] = emptyItem;
            graph_len--;
            x = y;
        }
        result = lcm(result, cycle_len);
    }

    return result;
}

void main() nothrow @nogc {
    import core.stdc.stdio: printf;

    const uint nMax = 1_000;
    Card[nMax] d1 = void, d2 = void;

    for (uint n = 1; n <= nMax; n++)
        printf("%llu\n", f2(n, d1, d2));
}
perms7.d runs in about 0.05 seconds. So it's about 4 times faster than perms6.d.

To avoid overflow bugs I have created the perms7b.d and perms7c.d versions, that use core.checkedint (despite core.checkedint is not meant for user code). For both perms7b.d and perms7c.d the run-time is similar to the unchecked versions, and perms7b.d spots the overflow bugs:
...
1389: 4247163320400
1390: 465585120
1391: 1391
1392: 6570
1393: 4294945235192621194 (overflow)
1394: 3655769040
1395: 1194
1396: 930
1397: 1748154975840
1398: 1398
...
- - - - - - - - - - - - - - - - - - -

perms8.d:

This version is very similar to perms7.d, but I've used BigInt instead of ulong to compute all correct results (if there are no other bugs or design mistakes).

perms8.d is still quite fast, it runs in about 0.08 seconds (this includes the computations for n in [1, 1_000[). If I increase nMax to 10_000 it runs in about 6.6 seconds. D BigInts are not very fast, but I think this is not a significant problem for this program.

- - - - - - - - - - - - - - - - - - -

perms9.c:

This is C code derived from perms7.d.

perms9.c runs in about 0.03 seconds or less. perms9.c is a little faster than perms7.d, but the speed difference is really visible when nMax is much larger (but then the program gives wrong outputs becuse of overflows). The difference between 0.03 and 0.05 seconds is mostly caused by the time taken to initialize the D-runtime.

- - - - - - - - - - - - - - - - - - -
programming

K-Nearest Neighbor in D language

By leonardo maffi,
version 1.1.

Warning: currently nearest4/nearest5 don't work correctly, I'll try to fix them.

(Later this article will be moved to my site.)

These blog posts present a performance comparison between F# and OCaml implementations of a simple K-Nearest Neighbor (with K=1) classifier algorithm to recognize handwritten digits:

http://philtomson.github.io/blog/2014/05/29/comparing-a-machine-learning-algorithm-implemented-in-f-number-and-ocaml/

http://philtomson.github.io/blog/2014/05/30/stop-the-presses-ocaml-wins/

One of the Reddit discussions:

http://www.reddit.com/r/fsharp/comments/26vl3w/stop_the_presses_ocaml_wins_in_terms_of_speed_vs/

The original F# and OCaml code:
https://github.com/philtomson/ClassifyDigits

So I've studied and compared versions of this code in D language.

First I've written a functional-style version, 27 CLOC lines long, that is similar to the array-based OcaML version:
import std.stdio, std.algorithm, std.range, std.array, std.conv,
       std.string, std.math, std.typecons;

struct LabelPixels { ubyte label; ubyte[] pixels; }

immutable readData = (in string fileName) =>
    fileName
    .File
    .byLine
    .dropOne
    .map!(r => r.chomp.split(",").to!(ubyte[]))
    .map!(a => LabelPixels(a[0], a.dropOne))
    .array;

immutable distance = (in ubyte[] p1, in ubyte[] p2)
pure /*nothrow*/ @safe /*@nogc*/ =>
    double(p1.zip(p2).map!(ab => (ab[0] - ab[1]) ^^ 2).sum).sqrt;

immutable classify = (in LabelPixels[] trainingSet, in ubyte[] pixels)
/*pure nothrow*/ @safe =>
    trainingSet
    .map!(s => tuple(pixels.distance(s.pixels), ubyte(s.label)))
    .reduce!min[1];

void main() {
    const trainingSet = "training_sample.csv".readData;
    const validationSample = "validation_sample.csv".readData;

    immutable nCorrect = validationSample
                         .map!(s => trainingSet.classify(s.pixels) == s.label)
                         .sum;
    writeln("Percentage correct: ", double(nCorrect) / validationSample.length * 100);
}

Notes to this first version:
- Inside LabelPixels the pixes are represented with an unsigned byte each. This reduces the RAM used to store the data and allows to use the CPU cache more efficiently.
- "readData" is a lambda assigned to a global immutable value. This is not fully idiomatic D style, but I've used this because it looks more similar to the F# code. In practice I usually don't use such global lambdas because the output type gives me useful information to understand the purpose of the function, and they are less easy to find in the assembly (you can't just search for "readData" in the asm listing to find this lambda.
- The D standard library contains a module to read CSV files (std.csv), but I always find its API bad, so I've used File.byLine and I have processed the lines lazily.
- In readData I've used the eager function split instead of the lazy splitter (that uses less RAM) because for the data sets used by this program split is faster (readData loads both CSV files in about 0.8-0.9 seconds on my PC).
- In the "distance" lambda the nothrow and @nogc attributes are commented out because zip() doesn't yet respect such attributes (zip throws an exception in certain cases and the code to throw the exception allocates it on the heap). The "classify" lambda calls "distance" so it can't be nothrow and @nogc (and because reduce doesn't have a default value in case its input range is empty, so it can throw, and because classify contains a heap-allocating lambda).
- The distance lambda contains code like "map!(ab => (int(ab[0]) - ab[1])" because D is not yet able to de-structure tuples, like the 2-tuples generated lazily by zip.
- The classify lambda contains a map of tuples because the max/min functions of Phobos still lack an optional "key" function like the max/min functions of Python. So I've created tuples of the mapped value plus the label and I've used reduce to find the min. I have used "ubyte(s.label)" to produce a mutable tuple, that reduce doesn't accept (unless you use a mutable seed).

Most run-time of this program is used inside the distance function. So if you want to optimize this program that's the function to improve.

In the zip you can find a "nearest1b.d" file that is very similar to this first function, but it's adapted to the current version of the LDC2 compiler, that is a little older (it doesn't support the @nogc attribute, T(x) syntax for implicit casts, and its Phobos doesn't have a sum function).

The D compilers (even the well optimizing LDC2 complier) is not yet very good at optimizing that functional distance function. So in the "nearest2.d" version of the program (that you can find in the zip archive) I've written distance in imperative way, this produces a program that compiled with ldc is more than 3 times faster:
uint distance(in ubyte[] p1, in ubyte[] p2) pure nothrow @safe {
    uint tot = 0;
    foreach (immutable i, immutable p1i; p1)
        tot += (p1i - p2[i]) ^^ 2;
    return tot;
}

Now this distance function can be pure nothrow @safe. If you look this function also cheats a little not computing the square root, because it's not useful if you want to search the closest. This function also computes the total using integers.

I have not shown the asm generated by LDC2 for the nearest lambda in the "nearest1.d" program because it's too much complex and messy, but I can show the asm for this second simpler distance function:
__D8nearest38distanceFNaNbNfyAhyAhZk:
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	24(%esp), %ecx
	xorl	%eax, %eax
	testl	%ecx, %ecx
	je	LBB8_3
	movl	28(%esp), %edx
	movl	20(%esp), %esi
	.align	16, 0x90
LBB8_2:
	movzbl	(%edx), %edi
	movzbl	(%esi), %ebx
	subl	%ebx, %edi
	imull	%edi, %edi
	addl	%edi, %eax
	incl	%edx
	incl	%esi
	decl	%ecx
	jne	LBB8_2
LBB8_3:
	popl	%esi
	popl	%edi
	popl	%ebx
	ret	$16

As you see the loop gets compiled to quite simple asm with 9 X86 (32 bit) instructions.

To improve performance I have changed the data layout a little. With the "binary_generator1.d" program you can find in the zip archive I have generated binary files of the training set and validation sample. So now the loading of the data is very simple and very fast (see in the zip archive for the full source of this "nearest3.d" program:
enum nCols = 785;
enum size_t labelIndex = 0;
alias TData = ubyte;
alias TLabel = TData;

immutable(TData[nCols])[] readData(in string fileName) {
    return cast(typeof(return))std.file.read(fileName);
}

uint distance(immutable ref TData[nCols - 1] p1,
              immutable ref TData[nCols - 1] p2)
pure nothrow @safe /*@nogc*/ {
    uint tot = 0;
    foreach (immutable i, immutable p1i; p1)
        tot += (p1i - p2[i]) ^^ 2;
    return tot;
}

TLabel classify(immutable TData[nCols][] trainingSet,
                immutable ref TData[nCols - 1] pixels)
pure nothrow @safe /*@nogc*/ {
    auto closestDistance = uint.max;
    auto closestLabel = TLabel.max;

    foreach (immutable ref s; trainingSet) {
        immutable dist = pixels.distance(s[1 .. $]);
        if (dist < closestDistance) {
            closestDistance = dist;
            closestLabel = s[labelIndex];
        }
    }

    return closestLabel;
}

Notes:
- The classify function is now imperative, but this improves the performance just a little.
- @nogc is still commented out because the current version of the LDC2 compiler doesn't support it yet.
- The main performance difference of this third program comes the data layout. Now I have hard-coded (defined statically) the number of columns nCols. So now the data sets are essentially a ubyte[N][]. In D this is represented as a dense array, that minimize cache misses and reduce by one the number of levels of indirection. Thanks to the compile-time knowledge of the loop bounds inside the distance function, the LLVM back-end of the LDC2 compiler performs some loop unrolling (in theory it can be performed even before, but in practice the the version 3.4.1 of the LLVM is not performing loop unrolling on dynamic bounds).

So the asm of the distance function of the "nearest3.d" program is longer because of a partial loop unwinding:
__D8nearest38distanceFNaNbNfKyG784hKyG784hZk:
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	xorl	%ecx, %ecx
	movl	$7, %edx
	movl	16(%esp), %esi
	.align	16, 0x90
LBB1_1:
	movzbl	-7(%esi,%edx), %edi
	movzbl	-7(%eax,%edx), %ebx
	subl	%ebx, %edi
	imull	%edi, %edi
	addl	%ecx, %edi
	movzbl	-6(%esi,%edx), %ecx
	movzbl	-6(%eax,%edx), %ebx
	subl	%ebx, %ecx
	imull	%ecx, %ecx
	addl	%edi, %ecx
	movzbl	-5(%esi,%edx), %edi
	movzbl	-5(%eax,%edx), %ebx
	subl	%ebx, %edi
	imull	%edi, %edi
	addl	%ecx, %edi
	movzbl	-4(%esi,%edx), %ecx
	movzbl	-4(%eax,%edx), %ebx
	subl	%ebx, %ecx
	imull	%ecx, %ecx
	addl	%edi, %ecx
	movzbl	-3(%esi,%edx), %edi
	movzbl	-3(%eax,%edx), %ebx
	subl	%ebx, %edi
	imull	%edi, %edi
	addl	%ecx, %edi
	movzbl	-2(%esi,%edx), %ecx
	movzbl	-2(%eax,%edx), %ebx
	subl	%ebx, %ecx
	imull	%ecx, %ecx
	addl	%edi, %ecx
	movzbl	-1(%esi,%edx), %edi
	movzbl	-1(%eax,%edx), %ebx
	subl	%ebx, %edi
	imull	%edi, %edi
	addl	%ecx, %edi
	movzbl	(%esi,%edx), %ecx
	movzbl	(%eax,%edx), %ebx
	subl	%ebx, %ecx
	imull	%ecx, %ecx
	addl	%edi, %ecx
	addl	$8, %edx
	cmpl	$791, %edx
	jne	LBB1_1
	movl	%ecx, %eax
	popl	%esi
	popl	%edi
	popl	%ebx
	ret	$4

Thanks to the data layout changes and other smaller improvements, the "nearest3.d" function is almost three times faster than "nearest2.d".

To reduce the run time and better utilize one CPU core, we can use the SIMD registers of the CPU.

To perform a subtraction of the arrays I change the data layout again, this time I represent the pixels with short (in D signed 16 bit integers). So I use a short8 to perform several subtractions in parallel. A problem comes from the label, if I slice it away as in the "nearest3.d" program, the array pointers are not aligned to 16 bytes and this is not accepted by my CPU. To solve this problem I add a padding of 7 short in every line of the matrix after the label. This is performed by the "binary_generator2.d" program.

So the 4th version of the program contains (see in the zip archive for the full code of "nearest4.d"):
enum nCols = 785;
enum size_t labelIndex = 0;
alias TData = short;
alias TLabel = TData;

immutable(TData[nCols + 7])[] readData(in string fileName) {
    return cast(typeof(return))std.file.read(fileName);
}

uint distance(immutable ref TData[nCols - 1] p1,
              immutable ref TData[nCols - 1] p2) pure nothrow /*@nogc*/ {
    alias TV = short8;
    enum size_t Vlen = TV.init.array.length;
    assert(p1.length % Vlen == 0);
    immutable v1 = cast(immutable TV*)p1.ptr;
    immutable v2 = cast(immutable TV*)p2.ptr;

    TV totV = 0;
    foreach (immutable i; 0 .. p1.length / Vlen) {
        TV d = v1[i] - v2[i];
        totV += d * d;
    }

    uint tot = 0;
    foreach (immutable t; totV.array)
        tot += t;
    return tot;
}

TLabel classify(immutable TData[nCols + 7][] trainingSet,
                immutable ref TData[nCols - 1] pixels) pure nothrow /*@nogc*/ {
    auto closestDistance = uint.max;
    auto closestLabel = short.max;

    foreach (immutable ref s; trainingSet) {
        immutable dist = pixels.distance(s[8 .. $]);
        if (dist < closestDistance) {
            closestDistance = dist;
            closestLabel = s[labelIndex];
        }
    }

    return closestLabel;
}

Notes:
- The classify function is not changed much, it performs the slicing to throw away the first eight (short label + 7 padding short) items.
- The distance function is a little more complex, but not too much. It uses the basic SIMD operations like subtraction, sum, product, plus the ".array" attribute that allows me to manage a short8 as a fixed-size value to sum all its contents.
- The code of distance is designed to be adaptable to other sizes of SIMD registers (but it's not able to adapt automatically).
- This "nearest4.d" program is only 1.6 times faster than "nearest4.d" probably because the SIMD code that manages shorts is not very clean.

The asm for the distance function of the "nearest4.d" program is rather long, despite being fast:
__D8nearest48distanceFNaNbKyG784sKyG784sZk:
	pushl	%edi
	pushl	%esi
	pxor	%xmm0, %xmm0
	movl	$208, %ecx
	movl	12(%esp), %edx
	.align	16, 0x90
LBB1_1:
	movdqa	-208(%edx,%ecx), %xmm1
	movdqa	-192(%edx,%ecx), %xmm2
	psubw	-208(%eax,%ecx), %xmm1
	pmullw	%xmm1, %xmm1
	paddw	%xmm0, %xmm1
	psubw	-192(%eax,%ecx), %xmm2
	pmullw	%xmm2, %xmm2
	paddw	%xmm1, %xmm2
	movdqa	-176(%edx,%ecx), %xmm0
	psubw	-176(%eax,%ecx), %xmm0
	pmullw	%xmm0, %xmm0
	paddw	%xmm2, %xmm0
	movdqa	-160(%edx,%ecx), %xmm1
	psubw	-160(%eax,%ecx), %xmm1
	pmullw	%xmm1, %xmm1
	paddw	%xmm0, %xmm1
	movdqa	-144(%edx,%ecx), %xmm0
	psubw	-144(%eax,%ecx), %xmm0
	pmullw	%xmm0, %xmm0
	paddw	%xmm1, %xmm0
	movdqa	-128(%edx,%ecx), %xmm1
	psubw	-128(%eax,%ecx), %xmm1
	pmullw	%xmm1, %xmm1
	paddw	%xmm0, %xmm1
	movdqa	-112(%edx,%ecx), %xmm0
	psubw	-112(%eax,%ecx), %xmm0
	pmullw	%xmm0, %xmm0
	paddw	%xmm1, %xmm0
	movdqa	-96(%edx,%ecx), %xmm1
	psubw	-96(%eax,%ecx), %xmm1
	pmullw	%xmm1, %xmm1
	paddw	%xmm0, %xmm1
	movdqa	-80(%edx,%ecx), %xmm0
	psubw	-80(%eax,%ecx), %xmm0
	pmullw	%xmm0, %xmm0
	paddw	%xmm1, %xmm0
	movdqa	-64(%edx,%ecx), %xmm1
	psubw	-64(%eax,%ecx), %xmm1
	pmullw	%xmm1, %xmm1
	paddw	%xmm0, %xmm1
	movdqa	-48(%edx,%ecx), %xmm0
	psubw	-48(%eax,%ecx), %xmm0
	pmullw	%xmm0, %xmm0
	paddw	%xmm1, %xmm0
	movdqa	-32(%edx,%ecx), %xmm1
	psubw	-32(%eax,%ecx), %xmm1
	pmullw	%xmm1, %xmm1
	paddw	%xmm0, %xmm1
	movdqa	-16(%edx,%ecx), %xmm2
	psubw	-16(%eax,%ecx), %xmm2
	pmullw	%xmm2, %xmm2
	paddw	%xmm1, %xmm2
	movdqa	(%edx,%ecx), %xmm0
	psubw	(%eax,%ecx), %xmm0
	pmullw	%xmm0, %xmm0
	paddw	%xmm2, %xmm0
	addl	$224, %ecx
	cmpl	$1776, %ecx
	jne	LBB1_1
	pshufd	$3, %xmm0, %xmm1
	movd	%xmm1, %eax
	movdqa	%xmm0, %xmm1
	movhlps	%xmm1, %xmm1
	movd	%xmm1, %ecx
	pshufd	$1, %xmm0, %xmm1
	movd	%xmm1, %edx
	movd	%xmm0, %esi
	movswl	%si, %edi
	sarl	$16, %esi
	addl	%edi, %esi
	movswl	%dx, %edi
	addl	%esi, %edi
	sarl	$16, %edx
	addl	%edi, %edx
	movswl	%cx, %esi
	addl	%edx, %esi
	sarl	$16, %ecx
	addl	%esi, %ecx
	movswl	%ax, %edx
	addl	%ecx, %edx
	sarl	$16, %eax
	addl	%edx, %eax
	popl	%esi
	popl	%edi
	ret	$4

The run time of the varions versions, in seconds, best of 3:
  nearest1:   91     (dmd)
  nearest1b:  19.7   (ldc2)
  nearest2:    5.91  (ldc2)
  nearest3:    2.05  (ldc2)
  nearest4:    1.28  (ldc2)

I have compiled the programs with:
dmd -wi -d -O -release -inline -boundscheck=off nearest1.d
ldmd2 -wi -unroll-allow-partial -O -release -inline -noboundscheck nearest1b.d
ldmd2 -wi -unroll-allow-partial -O -release -inline -noboundscheck nearest2.d
ldmd2 -wi -unroll-allow-partial -O -release -inline -noboundscheck nearest3.d
ldmd2 -wi -unroll-allow-partial -O -release -inline -noboundscheck nearest4.d
strip nearest1b.exe nearest2.exe nearest3.exe nearest4.exe

I have used the compilers:

DMD32 D Compiler v2.066

And ldc2 V.0.13.0-beta1, based on DMD v2.064 and LLVM 3.4.1, Default target: i686-pc-mingw32, Host CPU: core2.

On my 32 bit Windows the binaries generated are:
    1.692 nearest1.d
  221.212 nearest1.exe
    1.757 nearest1b.d
1.079.822 nearest1b.ex
    1.716 nearest2.d
1.070.606 nearest2.exe
    2.059 nearest2b.d
1.071.118 nearest2b.exe
    2.037 nearest3.d
1.353.230 nearest3.exe
    2.365 nearest4.d
1.353.742 nearest4.exe
    2.513 nearest5.d

In the zip archive I have also added a "nearest5.d" program that shows a recent improvement of the D type system (not yet available in ldc2), to support fixed-size array length inference for slices passed to templated functions.

With a run-time 1.28 seconds for the "nearest4.d" program there are still ways to reduce the run-time. I am using an old 2.3 GHz CPU, so if you use a modern Intel 4 GHz CPU like the Core i7-4790K and with faster memory, you can see a significant speedup. If you use a modern CPU you can also use YMM registers with short16 SIMD, that should offer some speedup (LDC2 is already able to target such YMM registers, you need to replace short8 with short16 in the nearest4.d program and change the padding). And then this program is very easy to parallelize for two or four cores, you can use the std.parallelism module for the computation of nCorrect with map+reduce in the main function, that should give some speedup.

You can find all the code and data here:
http://www.fantascienza.net/leonardo/js/nearest.7z

(This 7z archive contains the data sets too, I hope this is OK. Otherwise I'll remove them from the 7z archive.)
programming

Links and updates

Updates:

Given the natural numbers n and m, find the minimum number of integer-sided squares that tile an (n,m) rectangle:
http://www.fantascienza.net/leonardo/js/index.html#squares_problem
A D solution that improves the performance of a C++ solution.

A benchmark from the Computer Game site, in D:
http://www.fantascienza.net/leonardo/js/index.html#meteor_bench

A D implementation of a simple and famous simulation:
http://www.fantascienza.net/leonardo/js/segregation_game.zip

Various type-level solutions in D language of the "Instant Insanity" puzzle (inspired by a Haskell solution):
http://www.fantascienza.net/leonardo/js/instant_insanity.zip

-------------

Links:


A testing experience in Haskell:
http://www.serpentine.com/blog/2013/12/30/testing-a-utf-8-decoder-with-vigour/

A simple but sometimes useful optimization for modern CPUs:
http://blog.libtorrent.org/2013/12/memory-cache-optimizations/

A nice talk about the Julia language:
http://vimeo.com/84661077

"Types for Units-of-Measure: Theory and Practice" by Andrew Kennedy, the complexity behind the units of measure feature of F# (PDF file):
http://research.microsoft.com/en-us/um/people/akenn/units/cefp09typesforunitsofmeasure.pdf

A little story that reminds us that teaching debugging at the University is important:
http://danluu.com/teach-debugging/
programming

Links and updates

Updated software:

http://www.fantascienza.net/leonardo/js/index.html#fasta
http://www.fantascienza.net/leonardo/js/index.html#audioactive


Added software:

http://www.fantascienza.net/leonardo/js/index.html#puzzle15game
http://www.fantascienza.net/leonardo/js/index.html#naval_simulation
http://www.fantascienza.net/leonardo/js/index.html#water_filling
http://www.fantascienza.net/leonardo/js/mark_sweep.zip
http://www.fantascienza.net/leonardo/js/index.html#challenge139


Links, mostly programming:

Easy use of Wikipedia from Python:
https://github.com/goldsmith/Wikipedia

A little sad Python-Haskell comparison:
http://eyallotem.blogspot.co.il/2013/05/comparing-python-and-haskell.html

Introductions to Parser Combinators:
http://blog.fogcreek.com/fparsec/
http://yannesposito.com/Scratch/en/blog/Parsec-Presentation/

"What I Wish I Knew When Learning Haskell", not easy to understand:
http://dev.stephendiehl.com/hask/

Mathics is a general-purpose computer algebra system:
http://www.mathics.net/

Slides of the StrangeLoop 2013 conference:
https://github.com/strangeloop/StrangeLoop2013/tree/master/slides/sessions

Introduction to Typed Clojure:
http://nathanic.org/posts/2013/typed-clojure-tour/

Simple introduction to Monoids for programmers:
http://fsharpforfunandprofit.com/posts/monoids-without-tears/
http://fsharpforfunandprofit.com/posts/monoids-part2/

Several simple C++ implementations of software rendeerers:
http://kagamin.net/hole/simple/index.htm

Cute but with several reserves:
http://www.xfront.com/Haskell/Why-Functional-Programming-Matters-Part-2.pdf

Doctest for Haskell:
https://github.com/sol/doctest-haskell#readme

Units-of-Measure in F#:
http://clear-lines.com/blog/post/Safe-refactoring-with-FSharp-Units-of-Measure.aspx

Nice and simple Pythagoras Tree in F#:
http://fsharpnews.blogspot.it/2009/05/pythagoras-tree.html

Eye-opening, "Dispelling the Myths of Parallel Computing" by Patrick H. Madden:
http://www.cs.binghamton.edu/~pmadden/pubs/dispelling-ieeedt-2013.pdf

Parse, of the Rebol language:
http://www.red-lang.org/2013/11/041-introducing-parse.html