| Here I have collected few benchmarks where the D language, or the DMD compiler, or the Phobos std lib don't shine in their performance. Two of the following performance problems (see 'gc1' and 'sort' benchmarks) have already a solution.
http://www.fantascienza.net/leonardo/js/slow_d.zip
--------------------
Compilers/interpreters used: Digital Mars D Compiler v1.028
gcc version 4.2.1-dw2 (mingw32-2)
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)] on win32
Psyco V.1.5.2 (JIT for Python)
javac 1.6.0_03, Java version "1.6.0_03"
All timings are "warm", best of 3, in seconds. CPU used is a Pentium3 500 MHz, 256 MB RAM, with Win.
Compilation flags used (when not specified otherwise): gcc: -O3 -s dmd: -O -release -inline Python: no flags.
-------------------
BENCHMARKS:
recursive, n = 38:
C: 3.89 s (with __builtin_expect)
C: 3.99 s
D: 4.95 s
hash, n = 500_000:
D: 7.07 s (7.15 s with GC patched)
D: 5.89 s (GC disabled)
Psyco: 4.23 s
(Note that Python+Psyco has much bigger overhead compared to D in any instruction, simple loops too).
wordcount, on a txt file of 13_312_768 bytes:
C: 0.73 s
D: 5.77 s
gc1, with 6.3 MB of text:
loading splitting total
time time time
D: 1.87 s 3.58 s 13.76 s
Python: 0.28 s 1.98 s 3.52 s
Python: 0.1 s 2.0 s 3.38 s (load with 'rb', same result)
With a Patch to the D GC, plus disabling the GC:
D: 0.08 s 0.72 s 1.36 s (GC patched + GC disabled after load)
gc2, n=1_000, m=10_000:
seconds MB
DMD class: 18.95 1.7
GDC class: 17.91 1.8
DMD struct: 11.77 1.7
GDC struct: 12.31 1.8
Python: 37.10 3.1
Psyco: 15.68 3.5
Java: 2.19 7.3
gc3 (binarytrees), n = 15:
Java: 9.12 s
D: 35.01 s
sort, const n = 1_000_000:
Random distribution:
sort: 2.934
fastSort: 0.881
Already sorted arrays:
sort: 1.723
fastSort: 0.41
Inverted order arrays:
sort: 1.853
fastSort: 0.651
(For a better and more complete version of fastSort see my D libs: www.fantascienza.net/leonardo/so/libs_d.zip ). | comments: Leave a comment  |
| I have found this interesting suite of benchmarks: http://www.bioinformatics.org/benchmark/
Related to this "A comparison of common programming languages used in bioinformatics" article: http://www.biomedcentral.com/1471-2105/9/82
See this too: http://hackmap.blogspot.com/2008/02/fast-python-with-shedskin.html
I have tried to repeat some of your benchmarks, but I can't repeat the 9+ GB file used for one of them. So far I have tested only the "alignment.py" program, because it's the only with data available.
Few comments on that work:
1) For Python, for such 3 benchmarks the Psyco JustInTime compiler helps a lot. Just install it: http://psyco.sourceforge.net/ and add the following line to the code: import psyco; psyco.full()
2) I haven't run the program "parse.py", but it shows various inefficiencies: - No Psyco - heavy-loop processing code outside funtions. Just put that big while 1: inside a main() function will help. - Binary file loading is faster than normal one, just read the file with "rb". - Python files are iterables, so to scan a file you just need: for line in file("somefile", "rb"): print line - To take the next line you just need: f = file("somefile", "rb"): print f.next() - You don't need to use string functions, Python strings have methods, so you can do "something".replace("some", "any") - Sometimes in Python you can replace REs with string methods. - This is a bad way to strip a line: line = string.replace(line,'\n','') This is better and probably faster: line = line.rstrip()
3) I like the D language, so I have tried it: http://www.digitalmars.com/d/1.0/index.html
My timings of the "alignment" program:
D3: 0.96 s 41 MB 66 lines
C: 1.05 s 41 MB 146 lines
C++: 1.22 s 41 MB 87 lines
D2: 1.95 s 54 MB 60 lines
D1: 2.73 s 57 MB 58 lines
Java: 3.72 s 53 MB 79 lines
Psyco: 11.22 s 168 MB 64 lines
Python2: 115.3 s ~160 MB 63 lines
Notes: 3a) C is the C code compiled with: -pipe -O3 -s -ffast-math -fomit-frame-pointer -funroll-loops -march=pentiumpro -fprofile-generate -pipe -O3 -s -ffast-math -fomit-frame-pointer -funroll-loops -march=pentiumpro -fprofile-use 3b) D code (compiler V. 1.026) compiled with: -O -release -inline D1 is a direct translation of the Java version D2: inverts the building of the strings, and reverses them at the end (appending at the start of an array is slow thing in any language) D3: is like D4 but it doesn't pre-clear the allocated memory.
Finally, my D code can be found here: http://www.fantascienza.net/leonardo/js/bio_bench.zip
Update Mar 7 2008: later I have found one version of the L Gene of the Hantaan virus: http://beta.uniprot.org/uniprot/P23456.fasta | comments: Leave a comment  |
| This page (related to the Oz/Mozart language) shows a very simple concurrency benchmark ("Death by Concurrency") that creates lot of threads: http://www.mozart-oz.org/documentation/apptut/node9.html#chapter.concurrency.cheap The Oz language uses light threads, so it can manage lot of them.
Phobos std lib of the D language has a std.thread that allows to use threads much like in Java6, so I have compared them. Under the cut you can see that the code is much similar:
( Java and D source code here )
The results are interesting, n=1_000, on a Pentium3 at 500 MHz:
CTime RTime MB
Java6: 4.45 7.18 ~31
D DMD: 0.40 1.33 ~10.5
Where:
CTime is the compile time, seconds
RTime is the running time, seconds
MB is the peak memory used
Note:
CTime 40 / 4.45 = 8.98
RTime 7.18 / 1.33 = 5.4
MB 31 / 10.5 = 2.95
Used:
DMD v1.024
javac 1.6.0_03
Compiled with:
dmd death.d
javac death.java
But the std.thread of DMD supports up to 1024 threads, while I have succed running up to 2000 with that Java6 program. Note that on D if you use the Tango lib, you can use fibers that are light(er) threads. | comments: Leave a comment  |
| When you find a bug in a compiler (like the DMD compiler) and you submit the bug, you are usually encouraged to give the smaller and simpler program (often just about 5 lines long) that shows the same problem/bug. This helps locate the problem, and remove it faster. At the same way microbenchmarks too can be useful, despite being so little they can't really represent what goes on in normal sized programs. They can show you in the simpler way what's better to improve.
This tiny benchmark focus on GC performance and shows similar results to the "binary trees" benchmark I have shown here few weeks ago, but it's quite shorter, so it may be more useful. This code isn't meant to represent normal programs with thousands of classes, etc, it just shows something quite limited. This is an adaptation of the "Object Test" that you can find used and discussed here: http://www.twistedmatrix.com/users/glyph/rant/python-vs-java.html http://blog.snaplogic.org/?p=55 http://programming.reddit.com/info/24ynh/comments Note that I test the GC of Phobos only, I don't use Tango yet.
Object Test timings (on PIII @ 500 MHz), best of 3 runs (seconds, approximate Mbytes memory used), n=1_000, m=10_000:
seconds MB
DMD class: 18.95 1.7
GDC class: 17.91 1.8
DMD struct: 11.77 1.7
GDC struct: 12.31 1.8
Python: 37.10 3.1 (ShedSkin version)
Psyco: 15.68 3.5
ShedSkin 6.35 1.6
Java (1): 2.19 7.3
Java (2): 2.10 7.8
See below for the sources. You can see the Java is about 9 (~= 18.95 / 2.10) times faster than the program produced by DMD. Even Psyco (the JIT for the turtle-slow language Python) gives faster performance (and the Python GC is simple, it's just a reference count plus cycle detection). You can also see Java (as usual) uses much more RAM (7.8/1.7 ~= 4.6).
Note that with some manual GC optimization the running time for the DMD class benchmark can become about one-two seconds smaller, but the memory used can be about 4 MB:
import std.gc;
class ObjectTest { ObjectTest next; }
void main() {
for (uint k = 0; k < 50; k++) {
std.gc.disable();
for (uint i = 0; i < 20; i++) {
auto root = new ObjectTest();
for (uint j = 0; j < 10_000; j++) {
root.next = new ObjectTest();
root = root.next;
}
}
std.gc.fullCollect();
}
}
-------------------
COMPILED/RUN WITH: gcc version 3.4.5 (mingw special) (gdc 0.24, using dmd 1.020)
DMD v1.024
javac 1.6.0_03 java version "1.6.0_03"
Python 2.5.1
ShedSkin 0.0.25 (shedskin.sourceforge.net , a Python to C++ compiler)
-------------------
COMPILATION/RUNNING PARAMETERS: GDC used as: gdc -O3 -s -frelease -finline-functions -ffast-math -fomit-frame-pointer -funroll-loops -march=pentiumpro obj_test_class.d -o obj_test_class_gdc
DMD used as: dmd -O -release -inline obj_test_class.d -ofobj_test_class_dmd.exe
Java compiled with: javac ObjectTest.java
Java (1) run with: java ObjectTest
Java (2) run with: java -Xms20m -Xbatch ObjectTest
SS and Psyco compiled and run without flags.
-------------------
SOURCES:
// obj_test_class.d
class ObjectTest { ObjectTest next; }
void main() {
for (uint i = 0; i < 1_000; i++) {
auto root = new ObjectTest();
for (uint j = 0; j < 10_000; j++) {
root.next = new ObjectTest();
root = root.next;
}
}
}
-------------------
// obj_test_struct.d
struct ObjectTest { ObjectTest* next; }
void main() {
for (uint i = 0; i < 1_000; i++) {
auto root = new ObjectTest();
for (uint j = 0; j < 10_000; j++) {
root.next = new ObjectTest();
root = root.next;
}
}
}
-------------------
// ObjectTest.java
public class ObjectTest {
public ObjectTest next;
public static void main(String[] args) {
for (int i = 0; i < 1000; i++) {
ObjectTest root = new ObjectTest();
for (int j = 0; j < 10000; j++) {
root.next = new ObjectTest();
root = root.next;
}
}
}
}
Note that the HotSpot Java compiler FAQ warns to not use code like that: http://java.sun.com/docs/hotspot/HotSpotFAQ.html >if you insist on using/writing microbenchmarks like this, you can work around the problem by moving the body of main to a new method and calling it once from main to give the compiler a chance to compile the code, then calling it again in the timing bracket to see how fast HotSpot is.<
-------------------
# obj_test.py
from psyco.classes import __metaclass__
class ObjectTest: pass
def main():
for i in xrange(1000): # N
root = ObjectTest()
for j in xrange(10000): # M
root.next = ObjectTest()
root = root.next
import psyco; psyco.full()
main()
-------------------
# obj_test_ss.py
class ObjectTest: pass
def main():
for i in xrange(1000): # N
root = ObjectTest()
for j in xrange(10000): # M
root.next = ObjectTest()
root = root.next
main()
| comments: Leave a comment  |
| | Tags: | java, python | | Subject: | Modeling with Java and Python | | Time: | 11:02 pm |
|
| In one of the last Dr Dobbs Journal (DDJ n. 390, November 2006) there is an interesting article: "Faster Development Through Modeling", by Jeff Cahoon, it can be seen here fully: http://www.ddj.com/193104822 Plus discussions: http://www.forums.ddj.com/jive3/thread.jspa?threadID=300396120
It explains that to create a program made of many similar parts you can create a working "model" of its repeated module, then transform it into a template, and then use it to create the whole code:
- Create a model of the application. - Write a miniapplication that implements the first instance of the repeated set of steps. - Pull apart the miniapplication into template files replacing the named parts in the repeated set of steps with recognizable strings for substitution. - Write code that can rebuild the miniapplication from the templates and the model. - Generate all the code for the whole application.
This is a portion of an example template code:
private HashMap build<%TfmName%>Lookup() throws Exception {
CustomTransform customTransform = new CustomTransform();
<%DimName%>ManagerOlapFactory <%dimName%>MOFactory =
new <%DimName%>ManagerOlapFactory();
<%dimName%>MOFactory.setProperties(properties);
<%DimName%>ManagerOlap <%dimName%>ManagerOlap =
<%dimName%>MOFactory.create<%DimName%>ManagerOlap();
<%DimName%>Olap <%dimName%>Olap[] =
<%dimName%>ManagerOlap.loadAll();
HashMap lookup = new HashMap();
Class[] preformatArray = {Object.class};
boolean preformatSet = false;
Method preformat = null;
try {
preformat = customTransform.getClass().getDeclaredMethod(
"pre<%TfmName%>", preformatArray);
preformatSet = true;
} catch (NoSuchMethodException nsme) {
preformatSet = false;
}<%...%> are templates.
This a comment from a reader:
>With a warehouse scenario he shows how different kind of code can be generated and he estimates that development time was reduced 30-50 percent. I consider this a very modest improvement and believe that one possible reason can be detected from the design model shown with the article (fig 1): The abstraction of the model is very low (code level) and seems to have some issues with low cohesion/high coupling. I haven't seen the whole model, or other models made. Anyway, industry experiences show that the speed will be fundamentally faster (500-1000%) if modeling language can lift the abstraction higher.<
Using my templat.py: http://www.fantascienza.net/leonardo/so/templat.html it's very easy to implement such templating (a CLisp/Scheme programmer can probably laugh at such efforts). (In few days I'd like to improve templat.py, so it can ignore templates inside Python strings and comments). Python supports various kinds of templating, this is available with Python 2.4+:
>>> from string import Template
>>> s = Template('${who}likes$what')
>>> s.substitute(who='tim', what='kung pao')
'timlikeskung pao'
So using templat.py it becomes (this is a templating of a template, it can be done with just one templating):
{{
from string import Template
# name initializations. The vars can be loaded from a dict,
# or from a file, etc.
TfmName = ...
DimName = ...
dimName = ...
print Template("""
private HashMap build${TfmName}Lookup() throws Exception {
CustomTransform customTransform = new CustomTransform();
${DimName}ManagerOlapFactory ${dimName}MOFactory =
new ${DimName}ManagerOlapFactory();
${dimName}MOFactory.setProperties(properties);
${DimName}ManagerOlap ${dimName}ManagerOlap =
${dimName}MOFactory.create${DimName}ManagerOlap();
${DimName}Olap ${dimName}Olap[] =
${dimName}ManagerOlap.loadAll();
HashMap lookup = new HashMap();
Class[] preformatArray = {Object.class};
boolean preformatSet = false;
Method preformat = null;
try {
preformat = customTransform.getClass().getDeclaredMethod(
"pre${TfmName}", preformatArray);
preformatSet = true;
} catch (NoSuchMethodException nsme) {
preformatSet = false;
}
""").substitute(locals())
}}But Python is a dynamic language, so it doesn't need the creation of all such boring code, probably there are ways to create the multiple methods without any templating, just making the code itself a bit more flexible. | comments: Leave a comment  |
|