leonardo

Subject:String processing with D
Time:01:02 pm
If you are even a little interested in natural language processing, I suggest you look at the NLTK book, Natural Language Processing in Python:
http://nltk.org/doc/en/book.pdf

It's a whole book written by experts that uses Python plus the NLTK library, and it looks quite nice.

For certain kinds of string/text processing Perl is the "best" language, but its syntax and semantics aren't as easy to learn as Python's, and I think Python programs may end up "better organized". (If you are a good programmer who keeps programs very tidy, and you know Perl well enough, you can write well-organized programs in Perl too.) So the authors have chosen Python among many other possible languages, and I believe they have chosen the right language for this task.

Near the end of that book (page 363) they show their little Python program again (#1):
import sys
for line in sys.stdin:
    for word in line.split():
        if word.endswith('ing'):
            print word

They compare it against the same algorithm written in other languages, like Perl (#2):
while (<>) {
    foreach my $word (split) {
        if ($word =~ /ing$/) {
            print "$word\n";
        }
    }
}

[Note: that code is *broken*: it can't spot words like the last one in "...in the beginning.", that is, words ending with "ing" that are followed by something other than whitespace or the end of the string. In the following discussion I have used the same broken algorithm.]
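
For reference, one possible way to avoid that problem is to match on word boundaries instead of splitting on whitespace. This is only a minimal sketch in Python 3 (the listings in this post are Python 2); the helper name ing_words and the choice of \w as "word character" are assumptions made for illustration, not what the book does.

```python
import re

# Assumption: a "word" is a run of \w characters (letters, digits, underscore).
# \b anchors the match at word boundaries, so trailing punctuation is ignored.
ING_WORD = re.compile(r"\b\w*ing\b")

def ing_words(text):
    """Return every word in text that ends with 'ing'."""
    return ING_WORD.findall(text)

print(ing_words("In the beginning. Singing, ring kingdom"))
```

None of the timings below use this variant; they all keep the original whitespace-splitting algorithm.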

I am learning the D language now, so I have compared these programs with D versions. D isn't a scripting language (today "dynamic language" is the PC way to call them) like Perl, Python, Ruby, Tcl, etc., but the string functions of Phobos, its built-in standard library, are heavily inspired by Python's. First of all, though, to make the tests fairer I have written this Python version that uses Psyco, which is faster (#3):
import sys, psyco
def main():
    for line in sys.stdin:
        for word in line.split():
            if word.endswith('ing'):
                print word
psyco.full()
main()

This is a native D version of that Python code (#4):
import std.stdio, std.string;
void main() {
    string line;
    while (readln(stdin, line))
        foreach(word; line.split())
            if (word.length > 2 && word[$-3 .. $] == "ing")
                writefln(word);
}

That's not as nice as the Python version, but it's *far* simpler and cleaner than a C version (which can be seen in the book on page 365; it uses regex.h and it's less broken, because it also takes into account words ending with "ing" followed by a comma).

I have written (soon to be released on my site) a package with a few functions/classes that let D programmers use some functional-style coding and perform some operations in a more Pythonic way, so this is a little test for it. Using my libs, a word-by-word translation of that Python code to D is (#5):
import std.string, d.func, d.string;
void main() {
    foreach(line; xstdin())
        foreach(word; line.split())
            if (word.endsWith("ing"))
                putr(word);
}

I believe it's close enough :-) It uses my xstdin() class, putr alias, and endsWith() function.

This is a faster D version; it uses my xsplit() (#6):
import std.stdio, d.func, d.string;
void main() {
    string line;
    while (readln(stdin, line))
        foreach(word; line.xsplit())
            if (word.length > 2 && word[$-3]=='i' && word[$-2]=='n' && word[$-1]=='g')
                putr(word);
}

If you want to go even faster you can use my xsplitArray(), which splits on just a single given char. But this version shows that lower-level code has bugs more often, because it misses one word in my test case (#7):
import std.stdio, std.string, d.func;
void main() {
    string line;
    while (readln(stdin, line))
        foreach(word; xsplitArray(line, ' '))
            if (word.length > 2 && word[$-3]=='i' && word[$-2]=='n' && word[$-1]=='g')
                writefln(word);
}

This is the previous version, debugged (#8):
import std.stdio, std.string, d.func;
void main() {
    string line;
    while (readln(stdin, line))
        foreach(word; xsplitArray(line.chomp(), ' '))
            if (word.length > 2 && word[$-3]=='i' && word[$-2]=='n' && word[$-1]=='g')
                writefln(word);
}
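
The difference between #7 and #8 is the chomp() call, so the missed word was presumably the last one on a line, still carrying its line ending. The same effect can be reproduced in Python 3 (a small sketch; the sample line is made up for illustration):

```python
line = "I was singing\n"

# Splitting on a single space keeps the newline glued to the last word,
# so it no longer ends with "ing".
naive = line.split(" ")

# Stripping the line ending first (the role chomp() plays in #8) fixes that.
fixed = line.rstrip("\r\n").split(" ")

print([w for w in naive if w.endswith("ing")])   # []
print([w for w in fixed if w.endswith("ing")])   # ['singing']
```

Splitting on any whitespace, as split() and xsplit() do in the other versions, hides this problem because the newline is treated as a separator.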

This is an easy D version that I'd probably use in most situations (#9):
import d.func, d.string;
void main() {
    foreach(line; xstdin())
        foreach(word; line.xsplit())
            if (word.endsWith("ing"))
                putr(word);
}

Some speed tests on a very large txt novel (~6.65 MB) on a PIII @ 500 MHz. The D code is compiled with -O -release -inline; each timing is the best of 3 runs (which also warms up the file cache):
#1: 7.15 s
#3: 3.83 s
#4: 1.37 s
#5: 1.58 s
#6: 0.87 s
#8: 0.75 s
#9: 1.23 s
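
The speedups relative to the pure Python version (#1) can be recomputed from these numbers with a few lines of Python 3. The dictionary below only copies the timings listed above; nothing in it is a new measurement.

```python
# Timings in seconds, copied from the list above.
timings = {"#1": 7.15, "#3": 3.83, "#4": 1.37, "#5": 1.58,
           "#6": 0.87, "#8": 0.75, "#9": 1.23}

baseline = timings["#1"]

# Speedup of each version relative to the pure Python version (#1).
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, factor in speedups.items():
    print("%s: %.2fx" % (name, factor))
```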

The #8 D version is about 9.5 times faster than the pure Python version, but the code is longer and it required a bit of debugging. I think the #9 D version is a good compromise for most situations: it's often fast enough and you can write it quickly in a bug-free way. It's essentially like the Python version, but it uses xsplit, which works lazily. (I think Python too could benefit from a lazy xsplit() string method; it's especially useful when the strings are quite long, like the paragraphs of this novel.)
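
To make the idea of a lazy split concrete, here is a rough Python 3 sketch. Python strings have no xsplit() method, so the function below is a hypothetical stand-in built on re.finditer; it is not the D library's implementation.

```python
import re

def xsplit(text):
    """Lazily yield the whitespace-separated words of text, one at a time.

    Hypothetical helper: unlike str.split(), it never builds the full
    list of words, which matters when text is a very long paragraph.
    """
    for match in re.finditer(r"\S+", text):
        yield match.group()

paragraph = "  the king is singing\n"
print([word for word in xsplit(paragraph) if word.endswith("ing")])
```

Because xsplit() is a generator, a loop that stops early never scans the rest of the string.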
