leonardo (leonardo_m) wrote,

High-level Spelling Corrector in D

In the last blog posts I have shown low-level examples of coding in D, to simulate a finite state machine used to generate terms of the audioactive sequence.

And now an example of high-level style coding in D. The following is related to the article "How to Write a Spelling Corrector" by the famous Peter Norvig:

Python version (21 lines, not counting the caller last one):
import re, collections
from string import ascii_lowercase as alphabet

words = lambda text: re.findall(r'[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

nwords = train(words(file('big3.txt').read()))

def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in xrange(n)] +                     # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in xrange(n-1)] + # transposition
               [word[0:i]+c+word[i+1:] for i in xrange(n) for c in alphabet] + # alteration
               [word[0:i]+c+word[i:] for i in xrange(n+1) for c in alphabet])  # insertion

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in nwords)

known = lambda words: set(w for w in words if w in nwords)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=lambda w: nwords[w] if w in nwords else -1)

print correct('speling'), "\n", correct('korrecter')

D version using my libs (23 lines, not counting the caller last one):
import std.string, std.regexp, std.file, d.all;

int[string] nwords;
static this() {
    foreach (w; RegExp("[a-z]+", "g").match(inToLower(cast(string)read("big3.txt"))))

Set!(string) edits1(string w) {
    int i, n = len(w);
    char c;
    return set(list(w[0..i] ~ w[i+1..$], i, xrange(n)) ~                     // deletion
               list(w[0..i] ~ w[i+1] ~ w[i] ~ w[i+2..$], i, xrange(n-1)) ~   // transposition
               list(w[0..i] ~ c ~ w[i+1..$], i, xrange(n), c, lowercase[]) ~ // alteration
               list(w[0..i] ~ c ~ w[i..$], i, xrange(n+1), c, lowercase[])); // insertion

Set!(string) knownEdits2(string word) {
    string s1, s2;
    return set(list(s2, s1, edits1(word), s2, edits1(s1), s2 in nwords));

Set!(string) known(T)(T words) { return set(words) & nwords; }

string correct(string w) {
    auto found = lazyOr(known([w]), known(edits1(w)), knownEdits2(w), set([w]));
    return max(found, (string wo){ return wo in nwords ? nwords[wo] : -1; });

void main() { putr( correct("speling"), \n, correct("korrecter")); }

This D program has found one performance bug (problem already fixed, fix not available yet) in the D garbage collector(s), one possible bug using AAs with the ?: operator, a bug in my libs, has made me add that lazyOr function to my libs, and it was fun to write :-)
I have created lazyOr() to write this program (but it will be useful elsewhere, it's similar to the Python 'or' operator), if you don't want to use lazyOr() you can define just this line:
T or(T)(T x, lazy T y) { return isTrue(x) ? x : y(); }
That you can use like this:
auto found = or(or(or(known([w]), known(edits1(w))), knownEdits2(w)), set([w]));

That "g" in the Phobos REgex API is ugly, that API needs some cleaning and improviments (maybe PCRE can just be plug into Phobos (with an improved API) intead of the current regex engine, this will allow to remove most regex bugs, improve its speed, work with a very standard re syntax, etc.).

Note that Python program has corner-case bugs, is fragile, it's quite slow, it's not a good example of programming, etc. Much better Python programs can be written for this purpose. Probably its only good side is to be short, so it can show in a short space the idea behind a spelling corrector. The D code shares most of the same problems. Such high-level style coding in D can be useful to weite prototypes too, that later if necessary can be translated to faster lower-level style D (or even assembly) code.

With the patch to the GC the D version runs a little faster than the Python version (about 7.5 seconds on my very old PC, probably a little more than 1.5 seconds on a more modern PC).

Note the "big3.txt" is just a large block of texts in English, it comes from this version:
But I have removed non-ASCII chars (> 127).

Update: http://leonardo-m.livejournal.com/112262.html
Tags: d language, python

  • Post a new comment


    default userpic
    When you submit the form an invisible reCAPTCHA check will be performed.
    You must follow the Privacy Policy and Google Terms of use.