| A little puzzle: Given an array of n integers, find all Pythagorean triples in the array, that is, three elements such that a^2 + b^2 = c^2. Do this in O(n^2) time.
A simple solution is to test the equation with a hash of the squares on every unordered (a^2,b^2) pair (see the first Python version below).
First I have written a Python solution that uses a set, then I have translated it to D using my set module, then using D associative arrays, then I have translated the code to C adapting a simple hash code to integer keys. Then I have simplified the C code, removing the values from the hash.
That's the faster general solution, but the code can be improved if the problem is made less general. So I have tried to count the number of distinct Pythagorean triples among the integers [1,10_000]. So I have tried to simplify the hash some more. I have tried a really simple hash, and while counting the collision frequency I have found a very simple hash configuration where every number is present at its place in the array, or it's absent. Then as usual I have backported that code to Psyco and D.
Click to see the source code ( Click here )
Using MinGW 4.2.1 I have compiled that C code with: gcc -O3 -s -march=pentiumpro -fprofile-generate triples.c -o triples Followed by a run, followed by: gcc -O3 -s -march=pentiumpro -fprofile-use triples.c -o triples
The general D version works with any group of input numbers (not tested below).
The D1 version is really close to the C version. The D2 version is faster and higher-level, it uses my d libs: www.fantascienza.net/leonardo/so/libs_d.zip
The timing results (Pentium3 500 MHz):
Timings triples, N = 10_000:
D general: 13.38 s
Psyco: 6.89 s (4.12 X) (1.87 X)
D1: 4.02 s (2.40 X) (1.09 X)
D2: 3.68 s (2.20 X) (1 X)
C: 1.67 s (1 X) (0.45 X)
Triples in [1, 10000): 12467
Triples in [1, 20000): 27171
Triples in [1, 40000): 58728
Triples in [1, 80000): 394138
The timings are quite good for Psyco but quite bad for D: the C-like D version is 2.4 times slower than the C version, and the Psyco version is just 1.7 times slower than the C-like D version (and the Psyco version is 1.9 times slower than the faster D version). This means that with this tiny program there's not that much purpose in using D instead of Psyco. | comments: 1 comment or Leave a comment  |
| A little programming challenge, "Jolly Jumpers" in D: http://mikael.jansson.be/journal/2008/03/jolly-jumpers-challenge
The version jolly() below comes from a C version by the Reddit user manannan (indents reduced to 2 spaces for LiveJournal). The version jolly2() is for maniacs :-) It uses 1/8 of memory of jolly() using a bitmap with the bts compiler intrinsic, and alloca() to speed up memory allocation.
import std.math: abs, ceil;
import std.intrinsic: bts;
import std.c.stdlib: alloca;
import std.stdio: writefln;
bool jolly(int[] arr) {
auto flags = new ubyte[arr.length - 1];
foreach (i, el; arr[0 .. $-1]) {
auto diff = abs(arr[i+1] - el) - 1;
if (diff < 0 || diff > arr.length || flags[diff]++)
return false;
}
return true;
}
bool jolly2(int[] arr) {
int nuints = cast(int)ceil(arr.length / cast(float)(uint.sizeof * 8));
uint[] flags = (cast(uint*)alloca(nuints * uint.sizeof))[0 .. nuints];
if (flags.ptr is null)
throw new Exception("Stack overflow");
flags[] = 0;
foreach (i, el; arr[0 .. $-1]) {
int diff = abs(arr[i+1] - el) - 1;
if (diff < 0 || diff > arr.length || bts(flags.ptr, diff))
return false;
}
return true;
}
void main() {
foreach (a; [[1, 4, 2, 3], [1, 4, 2, -1, 6]])
writefln(jolly(a), " ", jolly2(a));
}
| comments: Leave a comment  |
| Slow updates because of limited net access.
The C language is designed for the hardware of the computer. Python is designed for the programmer. The D language is designed for both the designer of its compiler and for its compiler.
-----------------------
Few links:
Many free video courses, for science and engineering, there are an couple of them for Python too, with interesting PDF files, that show how to use Python in various scientific situations: http://www.datawrangling.com/hidden-video-courses-in-math-science-and-engineering.html
Very nice, how to teach some Computer Science without computer to young children, some of the things shown are nice, like the sorting algorithms: http://csunplugged.com/ But I don't know if it's an advantage for young children to learn sorting nets and other things. I think they have more important things to learn first, like reading complex texts, writing well, some mathematics, statistics, geometry, combinatorics, etc.
A really interesting parser+interepreter, PyMeta, it's a version of OMeta+Python: http://washort.twistedmatrix.com/ OMeta is a part of a dream to create now foundations of computer programming, this is the third part of an article that discusses related matters: http://www.moserware.com/2008/04/towards-moores-law-software-part-3-of-3.html
-----------------------
As soon as possible I'll update my D libs, there are many news. New things: - editDistance, editDistanceFast: they find the Levenshtein distance between two arrays (strings too). - ALL: template that's true if the template predicate is true for all the items of the given tuple. - MAP: template that maps a template on all the items of the given tuple and return the resulting tuple. - BKtree: fast implementation of BK-trees to find nearby objects, given a distance function. - polyCentroid: to find area and centroid coordinate of a given polygon. - delVoidC, delVoidGC, NewVoidCArray, NewVoidCMatrix, NewVoidGCArray, NewVoidGCMatrix: high-performance functions to allocate/deallocate 1D/2D void arrays. - xchain, Chainable: to chain any number of any kind of iterable, plus now all x-something lazy iterable classes support the ~ operator to chain them. Changes: - Removed boxarray and strarray functions because they are of little use. - ExprType and xsplit simplified and improved. - 'select' module renamed to 'ranking' to avoid any name clashing with func.select. - Removed dependencies from 'meta' package, it's not included anymore.
-----------------------
I have developed the editDistanceFast/BKtree code in Python and D, and I have found that there are some advantages in developing in two (quite different) languages at a time. - If during the development I put a bug in one of the two implementations, they may give different outputs, so I am able to see there's a bug. - Optimizing the code for D allows me to focus to low level things. While if I program just in Python I seem not much able to "see" some of such details; and in this code I have found that every time I have invent a way to speed up the D code, the same trick was able to speed up the Psyco (the Python JIT) version too. - Python allows me to quickly write a program that works, that is nearly bug-free. It's great for prototyping, while developing in D is slower. Later I can usually translate the code to D. Python allows me to write very short code too, and short programs are very useful, because they often hide less bugs and you can understand them better and faster. Later I may use my D libs to translate the Python code to a very short D code (using the high-order functions, etc). While if I program in D from the start I may miss spots where I can shorten the code with little performance loss. - The short original Python code, plus the wonderful doctests allow me to set a starting point, to define what's the correct output of the code. Later I can translate the Python doctests to D unittests, and even later I can use the short and easy Python version to test if the D version is correct. - So jumping between the two languages allows me to find many things that I may miss when I develop in just one language. The result is code that is readable, debugged, short, and fast :-)
As an example of the results of such language jumping, this is the D version of the editDistance, that's back-translated from Python, showing a quite short and readable code (I have created a pair of editDistance and editDistanceFast in both languages, the fast version has limits in the length of the possible input iterables/arrays, and avoid dynamic allocations):
int editDistance(T)(T[] s1, T[] s2) {
if (len(s1) > len(s2)) { auto sa = s1; s1 = s2; s2 = sa; }
auto r1 = range(len(s2) + 1);
auto r2 = new int[len(r1)];
foreach (i, c1; s1) {
r2[0] = i + 1;
foreach (j, c2; s2)
r2[j+1] = c1 == c2 ? r1[j] : min(r2[j], r1[j], r1[j+1]) + 1;
auto ra = r1; r1 = r2; r2 = ra;
}
return r1[$ - 1];
}
While the following Psyco code is essentially back-translated from the fast D version:
def editDistanceFast(s1, s2, r1=[0]*35, r2=[0]*35):
if s1 == s2: return 0
if len(s1) > len(s2):
s1, s2 = s2, s1
len_s2 = len(s2)
assert len(s2) <= 34, "Error: one input sequence"\
"is too much long (> 34), use editDistance()."
for i in xrange(len_s2 + 1):
r1[i] = i
r2[i] = 0
i = 0
for c1 in s1:
r2[0] = i + 1
j = 0
for c2 in s2:
if c1 == c2:
r2[j+1] = r1[j]
else:
a1 = r2[j]
a2 = r1[j]
a3 = r1[j+1]
if a1 > a2:
if a2 > a3:
r2[j+1] = 1 + a3
else:
r2[j+1] = 1 + a2
else:
if a1 > a3:
r2[j+1] = 1 + a3
else:
r2[j+1] = 1 + a1
j += 1
aux = r1; r1 = r2; r2 = aux
i += 1
return r1[len_s2]
There you can even see those r1=[0]*35 and r2=[0]*35 that are a way to implement a kind of the static arrays I have used in the fast D version. The end result is that the fast D version is just 4.5 times faster than the fast Psyco version.
You can find the Python version here: http://www.fantascienza.net/leonardo/so/bktree.py | comments: 2 comments or Leave a comment  |
| Here I have collected few benchmarks where the D language, or the DMD compiler, or the Phobos std lib don't shine in their performance. Two of the following performance problems (see 'gc1' and 'sort' benchmarks) have already a solution.
http://www.fantascienza.net/leonardo/js/slow_d.zip
--------------------
Compilers/interpreters used: Digital Mars D Compiler v1.028
gcc version 4.2.1-dw2 (mingw32-2)
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)] on win32
Psyco V.1.5.2 (JIT for Python)
javac 1.6.0_03, Java version "1.6.0_03"
All timings are "warm", best of 3, in seconds. CPU used is a Pentium3 500 MHz, 256 MB RAM, with Win.
Compilation flags used (when not specified otherwise): gcc: -O3 -s dmd: -O -release -inline Python: no flags.
-------------------
BENCHMARKS:
recursive, n = 38:
C: 3.89 s (with __builtin_expect)
C: 3.99 s
D: 4.95 s
hash, n = 500_000:
D: 7.07 s (7.15 s with GC patched)
D: 5.89 s (GC disabled)
Psyco: 4.23 s
(Note that Python+Psyco has much bigger overhead compared to D in any instruction, simple loops too).
wordcount, on a txt file of 13_312_768 bytes:
C: 0.73 s
D: 5.77 s
gc1, with 6.3 MB of text:
loading splitting total
time time time
D: 1.87 s 3.58 s 13.76 s
Python: 0.28 s 1.98 s 3.52 s
Python: 0.1 s 2.0 s 3.38 s (load with 'rb', same result)
With a Patch to the D GC, plus disabling the GC:
D: 0.08 s 0.72 s 1.36 s (GC patched + GC disabled after load)
gc2, n=1_000, m=10_000:
seconds MB
DMD class: 18.95 1.7
GDC class: 17.91 1.8
DMD struct: 11.77 1.7
GDC struct: 12.31 1.8
Python: 37.10 3.1
Psyco: 15.68 3.5
Java: 2.19 7.3
gc3 (binarytrees), n = 15:
Java: 9.12 s
D: 35.01 s
sort, const n = 1_000_000:
Random distribution:
sort: 2.934
fastSort: 0.881
Already sorted arrays:
sort: 1.723
fastSort: 0.41
Inverted order arrays:
sort: 1.853
fastSort: 0.651
(For a better and more complete version of fastSort see my D libs: www.fantascienza.net/leonardo/so/libs_d.zip ). | comments: Leave a comment  |
| A little amusing post about language syntax: http://cadrlife.blogspot.com/2008/03/returning-considered-difficult.html
"Suppose you want a function to find all subsequences of a list with a certain length. For instance, the subsequences of (1...5) with length 2 are as follows."
(1 2) (1 3) (1 4) (1 5)
(2 3) (2 4) (2 5)
(3 4) (3 5)
(4 5)
In a lisp dialect:
(def subn (n li)
(if (is 0 n) '(())
(no li) '()
(join (map [cons (car li) _] (subn (- n 1) (cdr li)))
(subn n (cdr li)))))
Haskell:
subn 0 _ = [[]]
subn _ [] = []
subn n (x:xs) = map (x:) (subn (n-1) xs) ++ subn n xs
Python:
def subn(n, a):
if not n: return [[]]
if not a: return []
return [[a[0]] + sub for sub in subn(n-1, a[1:])] + subn(n, a[1:])
The Scala version too uses pattern matching, showing how much it can be useful (it may be useful for D language too):
def subn[T](n: int, li : List[T]) : List[List[T]] = (n,li) match {
case (0,_) => List(Nil)
case (_,Nil) => Nil
case (_,x::xs) => (subn(n-1, xs) map (x :: _)) ::: subn(n, xs)
}
D version using my libs:
import d.func: putr, map, not, range, xrange, select, BaseType, array;
T[][] subn(T)(int n, T[] a) {
if (!n) return new T[][](1, 0);
if (not(a)) return new T[][0];
return map((T[] s){return a[0..1] ~ s;}, subn(n-1, a[1..$])) ~ subn(n, a[1..$]);
}
If you want shorter lines you can use select (reverted from the 'list' name):
T[][] subn2(T)(int n, T[] a) {
if (!n) return new T[][](1, 0);
if (not(a)) return new T[][0];
T[] s;
return select(a[0..1] ~ s, s, subn2(n-1, a[1..$])) ~ subn2(n, a[1..$]);
}
This version is able to digest any iterable:
BaseType!(TyIter)[][] subn3(TyIter)(int n, TyIter iter) {
auto a = array(iter);
alias BaseType!(TyIter) T;
if (!n) return new T[][](1, 0);
if (not(a)) return new T[][0];
return map((T[] s){return a[0..1] ~ s;}, subn3(n-1, a[1..$])) ~ subn3(n, a[1..$]);
}
void main() {
putr( subn(2, range(5)) );
// Output: [[0, 1], [0, 2], [0, 3], [0, 4], [1, 2], [1, 3], [1, 4],
// [2, 3], [2, 4], [3, 4]]
putr( subn2(3, range(5)) );
// [[0, 1, 2], [0, 1, 3], [0, 1, 4], [0, 2, 3], [0, 2, 4], [0, 3, 4],
// [1, 2, 3], [1, 2, 4], [1, 3, 4], [2, 3, 4]]
putr( subn3(2, xrange(5)) );
// Output: [[0, 1], [0, 2], [0, 3], [0, 4], [1, 2], [1, 3], [1, 4],
// [2, 3], [2, 4], [3, 4]]
} | comments: Leave a comment  |
| I have (slightly) adapted the Python-like generators for D written by Witold Baryluk to my libs. They allow a simpler syntax that can be used for single iterations (not parallel scan allowed, you have to use some threads or light threads). So I have used them to translate to D some self-recursive generators from Python code by David Eppstein:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/221457
As an example this yields Fibbinary numbers (no consecutive ones in binary representation): http://www.research.att.com/projects/OEIS?Anum=A003714
struct A003714 {
void generator() {
yield(1);
foreach(x; A003714()) {
yield(2 * x);
if (!(x & 1))
yield(2 * x + 1);
}
}
mixin Generator!(int);
}
The original Python version:
def A003714():
yield 1
for x in A003714():
yield 2 * x
if not (x & 1):
yield 2 * x + 1
The syntax isn't as clean as Python one, but it's simple enough anyway. They are 2-3 times slower than handwritten opApply struct methods, so you may want to use them only where code size/readability matters more.
( Click here to see more examples ) | comments: Leave a comment  |
| The Euler problem n.14:
The following iterative sequence is defined for the set of positive integers: n -> n/2 (n is even) n -> 3n + 1 (n is odd) Using the rule above and starting with 13, we generate the following sequence of 10 terms: 13 -> 40 -> 20 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1 Although it has not been proved yet (Collatz Problem), it is thought that all starting numbers finish at 1. Which starting number, under one million, produces the longest chain of Collatz numbers?
My Python solution, that uses memoization of results to speed up computation:
import array
def main(m):
lens = array.array('H', [0]) * m
max_steps = max_steps_pos = 1
for i in xrange(1, m):
n = i
n_steps = 1
while n != 1:
if n < i:
n_steps += lens[n] - 1
break
n = 3 * n + 1 if n & 1 else n // 2
n_steps += 1
if n_steps > max_steps:
max_steps = n_steps
max_steps_pos = i
lens[i] = n_steps
return max_steps, max_steps_pos
import psyco; psyco.bind(main)
print "max_steps, max_steps_pos:", main(1000000)
Psyco is often able to manage array.array very efficiently. 'H' represents unsigned shorts [0 .. 2^16-1], because the chains are never too much long. This allows to reduce memory used and improve CPU cache efficiency (mostly in the D version). In Python you can use 'l' that is safer and doesn't slow down the code much, but uses twice the memory (so you can manage smaller problems).
The D version is essentially the same, the only significant difference is that the array isn't pre-initialized:
import std.stdio, std.c.stdlib;
void main() {
const uint m = 1_000_000;
auto lens = (cast(ushort*)malloc(m * ushort.sizeof))[0 .. m];
uint max_steps = 1;
uint max_steps_pos = 1;
for (uint i = 1; i < m; i++) {
uint n = i;
uint n_steps = 1;
while (n != 1) {
if (n < i) {
n_steps += lens[n] - 1;
break;
}
n = n & 1 ? 3 * n + 1 : n / 2;
n_steps++;
}
if (n_steps > max_steps) {
max_steps = n_steps;
max_steps_pos = i;
}
lens[i] = n_steps;
}
writefln("max_steps, max_steps_pos: ", max_steps, " ", max_steps_pos);
}
Timings:
m = 1_000_000:
D/C: 0.29 s ( 1 X)
Psyco: 1.60 s ( 5.5 X) with array 'H'
Python: 19.20 s (66 X) with list
A C version is very similar to that D version, and the running time (with GCC 4.2.1) is the same, but the executable is almost 20 times smaller.
The D code scans m = 100_000_000 in 25 s finding a chain of length 758 on n= 83_706_505.
An Haskell version: http://www.haskell.org/haskellwiki/Euler_problems/11_to_20#Problem_14
To improve cache efficiency I have tried to store half of the values, but the code is slower (it allows to scan twice the values with the same max memory). | comments: Leave a comment  |
| GCC supports gotos to variables addresses too, they can be useful in some special situations, like when you want to optimize an interpreter (and sometimes a finite state machine too, see below), you can find a short post about gcc-specific C extensions: http://majek4.blogspot.com/2008/03/useful-c-extensions-gcc-specific.html
This is example code from that blog post:
#include "stdio.h"
#include "assert.h"
int main(int argc, char** argv) {
int value = 2;
if (argc == 2)
value = atoi(argv[1]);
assert(value >= 0 && value <= 2);
const void *labels[] = {&&VAL0, &&VAL1, &&VAL2};
goto *labels[value];
assert(0);
VAL0:
puts("The value is 0");
goto END;
VAL1:
puts("The value is 1");
goto END;
VAL2:
puts("The value is 2");
goto END;
END:
return 0;
}
Compiling that C code with MinGW (GCC) V.4.2.1 it produces the following assembly (two parts in the middle of the code):
...
LC1:
.ascii "value >= 0 && value <= 2\0"
LC2:
.ascii "The value is 0\0"
LC3:
.ascii "The value is 1\0"
LC4:
.ascii "The value is 2\0"
...
...
call __assert ; corrisponde alla riga assert(0);
L6:
movl $L7, -20(%ebp)
movl $L8, -16(%ebp)
movl $L9, -12(%ebp)
movl -8(%ebp), %eax
movl -20(%ebp,%eax,4), %eax
jmp *%eax
L7:
movl $LC2, (%esp)
call _puts
jmp L10
L8:
movl $LC3, (%esp)
call _puts
jmp L10
L9:
movl $LC4, (%esp)
call _puts
L10:
movl $0, %eax
addl $36, %esp
popl %ecx
popl %ebp
leal -4(%ecx), %esp
ret
...
Where the jmp *%eax is the computed goto. Note that in ANSI C you can write the middle part of that code as:
switch (value) {
case 0:
puts("The value is 0");
break;
case 1:
puts("The value is 1");
break;
case 2:
puts("The value is 2");
break;
}
return 0;
But in more complex situations it may help.
I have tried to use that kind of goto in the C version of the Audioactive series generator, and it speeds up the code about 3%. Inside the S0 state the following code:
switch (*p1++) {
case '1': goto S1;
case '2': goto S2;
case '3': goto S3;
}
Is replaced by:
goto *labelsS123[*p1++ - '1'];
Where labelsS123 is:
const void *labelsS123[] = {&&S1, &&S2, &&S3};
| comments: Leave a comment  |
| I have removed a bug from the D/C code, and I have added a faster Python version. Psyco is able to speed it up *25* times, so it's just 3 times slower than the faster C version (that uses a different algorithm), and it uses essentially the same memory: http://www.fantascienza.net/leonardo/js/audioactive.zip
Beside that Python version, I have tried to translate to Python the finite state automata version too, producing something like this:
p1 = p2 = 0
pend = len1
S0, S1, S2, S3, S11, S22, S33, END = xrange(8)
state = S0
while True:
if state == S0:
if p1 == pend: state = END
elif s1[p1] == "1": p1 += 1; state = S1
elif s1[p1] == "2": p1 += 1; state = S2
elif s1[p1] == "3": p1 += 1; state = S3
elif state == S1:
if p1 == pend: s2[p2] = '1'; p2 += 1; s2[p2] = '1'; p2 += 1; state = END
elif s1[p1] == "1": p1 += 1; state = S11
elif s1[p1] == "2": p1 += 1; s2[p2] = '1'; p2 += 1; s2[p2] = '1'; p2 += 1; state = S2
elif s1[p1] == "3": p1 += 1; s2[p2] = '1'; p2 += 1; s2[p2] = '1'; p2 += 1; state = S3
...
But it results a bit slower than the version (derived from the switc-based C/D version that I have later removed) that you can find in the audioactive zip. | comments: Leave a comment  |
|