leonardo

Subject:Programming languages for bioinformatics
Time:04:49 pm
I have found this interesting suite of benchmarks:
http://www.bioinformatics.org/benchmark/

It is related to the article "A comparison of common programming languages used in bioinformatics":
http://www.biomedcentral.com/1471-2105/9/82

See this too:
http://hackmap.blogspot.com/2008/02/fast-python-with-shedskin.html

I have tried to repeat some of the benchmarks, but I can't recreate the 9+ GB file used for one of them.
So far I have tested only the "alignment.py" program, because it's the only one with data available.

A few comments on that work:

1) For Python, the Psyco just-in-time compiler helps a lot on these 3 benchmarks. Just install it:
http://psyco.sourceforge.net/
and add the following line to the code:
import psyco; psyco.full()
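
Psyco may not be installed on every machine that runs the script, so a guarded import keeps the program usable either way. This is only a minimal sketch: the function count_matches is a hypothetical stand-in for a benchmark's hot loop, not code from the benchmark suite.

```python
# Enable Psyco when it is installed; otherwise run as plain Python.
try:
    import psyco
    psyco.full()
    HAVE_PSYCO = True
except ImportError:
    HAVE_PSYCO = False


def count_matches(a, b):
    """Hypothetical hot loop: count positions where two sequences agree."""
    total = 0
    for x, y in zip(a, b):
        if x == y:
            total += 1
    return total


if __name__ == "__main__":
    print(count_matches("GATTACA", "GACTATA"))
```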


2) I haven't run the program "parse.py", but it shows various inefficiencies:
- No Psyco
- Heavy loop-processing code outside functions. Just putting that big while 1: loop inside a main() function will help.
- Loading a file in binary mode is faster than in normal (text) mode; just open the file with "rb".
- Python files are iterables, so to scan a file you just need:
for line in file("somefile", "rb"): print line
- To take the next line you just need:
f = file("somefile", "rb")
print f.next()
- You don't need to use string functions, Python strings have methods, so you can do "something".replace("some", "any")
- Sometimes in Python you can replace REs with string methods.
- This is a bad way to strip a line:
line = string.replace(line,'\n','')
This is better and probably faster (note that it also strips any other trailing whitespace):
line = line.rstrip()
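
The points above can be combined into one small sketch. It assumes a hypothetical input file of text lines and a made-up task (counting the lines that start with ">"); it is not the benchmark's parse.py. It is written to run on a current Python, so it uses open() and the print function where the snippets above use file() and the print statement, and it reads in text mode because on Python 3 "rb" yields bytes instead of strings.

```python
import sys


def count_headers(path):
    """Count lines starting with '>' using file iteration and string methods."""
    count = 0
    with open(path) as handle:
        for line in handle:              # files are iterable, line by line
            line = line.rstrip("\n")     # drop only the trailing newline
            if line.startswith(">"):     # string method instead of a regex
                count += 1
    return count


def main(path):
    # Keeping the loop inside functions makes its variables local,
    # which CPython accesses faster than module-level globals.
    print(count_headers(path))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Here rstrip("\n") is used instead of a bare rstrip() so that trailing spaces or tabs in the data are left alone; which one is right depends on the file format.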


3) I like the D language, so I have tried it:
http://www.digitalmars.com/d/1.0/index.html
My timings of the "alignment" program (run time, memory used, source length):
  D3:        0.96 s    41 MB    66 lines
  C:         1.05 s    41 MB   146 lines
  C++:       1.22 s    41 MB    87 lines
  D2:        1.95 s    54 MB    60 lines
  D1:        2.73 s    57 MB    58 lines
  Java:      3.72 s    53 MB    79 lines
  Psyco:    11.22 s   168 MB    64 lines
  Python2: 115.3  s  ~160 MB    63 lines
Notes:
3a) C is the C code compiled in two passes (first with -fprofile-generate, then with -fprofile-use):
-pipe -O3 -s -ffast-math -fomit-frame-pointer -funroll-loops -march=pentiumpro -fprofile-generate
-pipe -O3 -s -ffast-math -fomit-frame-pointer -funroll-loops -march=pentiumpro -fprofile-use
3b) D code (compiler V. 1.026) compiled with:
-O -release -inline
D1: a direct translation of the Java version.
D2: inverts the building of the strings, and reverses them at the end (inserting at the start of an array is a slow thing in any language).
D3: like D2, but it doesn't pre-clear the allocated memory.
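
The idea behind the D2 change can be illustrated in Python, used here only because it is the language of the other snippets in this post; this is not a translation of the D code in the zip file, and the function names and toy input are hypothetical. Both functions return the same string, but the first inserts at the front of a list, which shifts every existing element each time, while the second appends and reverses once at the end.

```python
def build_by_prepending(chars):
    """Slow pattern: each insert at index 0 shifts all existing items."""
    out = []
    for c in chars:
        out.insert(0, c)
    return "".join(out)


def build_by_appending(chars):
    """Faster pattern: append in traceback order, reverse once at the end."""
    out = []
    for c in chars:
        out.append(c)
    out.reverse()
    return "".join(out)


if __name__ == "__main__":
    # Toy data, as if emitted while walking an alignment from its end.
    traceback_order = "TCA-G"
    assert build_by_prepending(traceback_order) == build_by_appending(traceback_order)
    print(build_by_appending(traceback_order))
```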

Finally, my D code can be found here:
http://www.fantascienza.net/leonardo/js/bio_bench.zip

Update Mar 7 2008: I later found one version of the L Gene of the Hantaan virus:
http://beta.uniprot.org/uniprot/P23456.fasta
