leonardo
View:Recent Entries.
View:Archive.
View:Friends.
View:User Info.
View:Website (My Website).
You're looking at the latest 2 entries.

Subject:Programming languages for bioinformatics
Time:04:49 pm
I have found this interesting suite of benchmarks:
http://www.bioinformatics.org/benchmark/

It accompanies the article "A comparison of common programming languages used in bioinformatics":
http://www.biomedcentral.com/1471-2105/9/82

See this too:
http://hackmap.blogspot.com/2008/02/fast-python-with-shedskin.html

I have tried to reproduce some of the benchmarks, but I can't recreate the 9+ GB file used for one of them.
So far I have tested only the "alignment.py" program, because it's the only one with data available.

A few comments on that work:

1) For Python, the Psyco just-in-time compiler helps a lot on these 3 benchmarks. Just install it:
http://psyco.sourceforge.net/
and add the following line to the code:
import psyco; psyco.full()
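
Psyco may not be installed, or may not exist for the interpreter in use, so a guarded import keeps the program runnable either way. This is only a sketch: enable_psyco() and gc_fraction() are hypothetical names, not part of the benchmark code.

```python
def enable_psyco():
    """Try to switch on Psyco; return True if it was enabled."""
    try:
        import psyco
    except ImportError:
        # Psyco is missing: run as plain, slower Python.
        return False
    psyco.full()
    return True


def gc_fraction(seq):
    """Example of a pure-Python hot loop, the kind of code a JIT targets."""
    gc = 0
    for base in seq:
        if base in "GC":
            gc += 1
    return gc / float(len(seq))


if __name__ == "__main__":
    print("Psyco enabled: %s" % enable_psyco())
    print(gc_fraction("ACGTGGCC"))
```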


2) I haven't run the program "parse.py", but it shows various inefficiencies:
- No Psyco
- The heavy loop-processing code sits outside functions. Just putting that big while 1: inside a main() function will help, because local variables are faster to access than globals.
- Binary file loading is faster than text-mode loading: just open the file with "rb".
- Python files are iterables, so to scan a file you just need:
for line in file("somefile", "rb"): print line
- To take the next line you just need:
f = file("somefile", "rb")
print f.next()
- You don't need the functions of the string module: Python strings have methods, so you can do "something".replace("some", "any")
- Sometimes in Python you can replace regular expressions with string methods.
- This is a bad way to strip a line:
line = string.replace(line,'\n','')
This is better and probably faster:
line = line.rstrip()
(Note that rstrip() with no argument removes all trailing whitespace; use line.rstrip('\n') to remove only the newline.)
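
The points above can be put together in a short sketch. It is not a rewrite of parse.py (which I haven't run): parse() and the sample data are hypothetical, and it uses Python 3 spellings, so next(f) stands in for f.next() and an in-memory io.StringIO stands in for a file opened from disk.

```python
import io


def parse(handle):
    """Return (header line, number of lines starting with '>')."""
    # Take just the next line; the default "" covers an empty input.
    header = next(handle, "").rstrip("\n")
    records = 0
    # File objects are iterable, so no explicit readline() loop is needed.
    for line in handle:
        line = line.rstrip("\n")      # string method, not string.replace
        if line.startswith(">"):      # string method instead of a regex
            records += 1
    return header, records


def main():
    # All the loop-heavy work happens inside functions.
    sample = io.StringIO("# header\n>seq1\nACGT\n>seq2\nTTGA\n")
    print(parse(sample))


if __name__ == "__main__":
    main()
```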


3) I like the D language, so I have tried it:
http://www.digitalmars.com/d/1.0/index.html
My timings of the "alignment" program:
  D3:        0.96 s    41 MB    66 lines
  C:         1.05 s    41 MB   146 lines
  C++:       1.22 s    41 MB    87 lines
  D2:        1.95 s    54 MB    60 lines
  D1:        2.73 s    57 MB    58 lines
  Java:      3.72 s    53 MB    79 lines
  Psyco:    11.22 s   168 MB    64 lines
  Python2: 115.3  s  ~160 MB    63 lines
Notes:
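3a) C is the C code compiled in two passes for profile-guided optimization, first with -fprofile-generate and then with -fprofile-use: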
-pipe -O3 -s -ffast-math -fomit-frame-pointer -funroll-loops -march=pentiumpro -fprofile-generate
-pipe -O3 -s -ffast-math -fomit-frame-pointer -funroll-loops -march=pentiumpro -fprofile-use
3b) The D code (compiler v. 1.026) was compiled with:
-O -release -inline
D1: a direct translation of the Java version.
D2: builds the strings in reverse order and reverses them at the end (inserting at the start of an array is slow in any language).
D3: like D2, but it doesn't pre-clear the allocated memory.
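
The idea behind the D2 variant is not specific to D. Here is a minimal Python sketch of it, with hypothetical function names; it only illustrates the two ways of building the result and is not a translation of the D code.

```python
def build_by_prepending(chars):
    """Insert each new character at the front of the list."""
    out = []
    for c in chars:
        out.insert(0, c)   # shifts every existing element each time
    return "".join(out)


def build_then_reverse(chars):
    """Append each character, then reverse once at the end."""
    out = []
    for c in chars:
        out.append(c)      # cheap append at the end
    out.reverse()          # a single pass over the finished list
    return "".join(out)


if __name__ == "__main__":
    # A traceback yields the alignment from its last character to its first.
    traceback_order = "TGCA"
    print(build_by_prepending(traceback_order))
    print(build_then_reverse(traceback_order))
```

Both functions return the same string; the second one avoids moving the whole list on every step.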

Finally, my D code can be found here:
http://www.fantascienza.net/leonardo/js/bio_bench.zip

Update Mar 7 2008: I later found a version of the L gene of the Hantaan virus:
http://beta.uniprot.org/uniprot/P23456.fasta

Subject:Programs: bioutil and dotplot (and more)
Time:07:58 pm
I have updated the bioutil.py library (renamed from the old name bio_util.py):
http://www.fantascienza.net/leonardo/so/index.html#bioutil1

A Python program that computes the dotplot of two given genetic sequences:
http://www.fantascienza.net/leonardo/so/dotplot/dotplot.html
An example output: (image not included)

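For readers who don't know dotplots, here is a minimal sketch of the basic idea, not the code of the linked program: a cell is marked when the window starting at position i of the first sequence equals the window starting at position j of the second. The dotplot() function and its window parameter are hypothetical.

```python
def dotplot(seq1, seq2, window=1):
    """Return text rows: '*' where the two windows match, '.' elsewhere."""
    rows = []
    for i in range(len(seq1) - window + 1):
        row = []
        for j in range(len(seq2) - window + 1):
            same = seq1[i:i + window] == seq2[j:j + window]
            row.append("*" if same else ".")
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    # Similar regions show up as diagonal runs of '*'.
    for row in dotplot("ACGTAC", "ACGTAC"):
        print(row)
```

A larger window removes much of the random background, at the cost of hiding short matches.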
One of the problems of bioinformatics is that biology is analog: exact regularities never exist, and even the genetic code has various special cases, variants, modifications, etc. More irregularities come from the fact that the genomes we have read are experimental results, so they carry an intrinsic approximation. So working with a computer on genomes, and on bioinformatics in general, can be irritating even when using Python, because you never manage to frame or define something cleanly: there are always not-quite-right-but-almost cases to take into account. This makes it hard, for example, to build "definitive" tables, such as a simple dict that converts something into something else.
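
A small sketch of the kind of "not definitive" table this refers to: a dict for complementing DNA bases has to decide what to do with ambiguity codes and unexpected symbols. The table is deliberately partial, and complement() with its unknown parameter is hypothetical.

```python
# The four plain bases plus a few IUPAC ambiguity codes
# (R = A or G, Y = C or T, N = any base).
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "N": "N",
}


def complement(seq, unknown="N"):
    """Complement a DNA string, mapping anything unexpected to `unknown`."""
    return "".join(COMPLEMENT.get(base, unknown) for base in seq.upper())


if __name__ == "__main__":
    print(complement("ACGT"))
    print(complement("acgRn"))
    print(complement("AX-"))
```

The fallback in dict.get() is what keeps the function from failing on odd input, and it is also where real information can be silently lost.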

Quote from "How Perl Saved the Human Genome Project", by Lincoln Stein (http://bioperl.org/wiki/How_Perl_saved_human_genome ):
>Perl is forgiving. Biological data is often incomplete, fields can be missing, or a field that is expected to be present once occurs several times (because, for example, an experiment was run in duplicate), or the data was entered by hand and doesn't quite fit the expected format. Perl doesn't particularly mind if a value is empty or contains odd characters. Regular expressions can be written to pick up and correct a variety of common errors in data entry. Of course this flexibility can be also be a curse.<
