?

Log in

No account? Create an account

High-level Spelling Corrector in D - reprise - leonardo
View:Recent Entries.
View:Archive.
View:Friends.
View:Profile.
View:Website (My Website).

Security:
Subject:High-level Spelling Corrector in D - reprise
Time:06:24 pm
Years ago I wrote a D V.1 language implementation of the little Norvig spell corrector, using my own library:
http://leonardo-m.livejournal.com/59589.html

The original code an explanation (the Python code is changed in the meantime):
http://norvig.com/spell-correct.html

In the meantime the D language and its standard library Phobos have improved a lot, so now I rewrite the D solution using just the standad library.

This is partially "golfed" D code, it's not an example of good D code.
import std.stdio, std.regex, std.algorithm, std.range, std.File, std.string, std.ascii;

int[string] nWords;

auto set(R)(R r) { return r.zip(0.repeat).assocArray; }

auto edits1(string W) @safe {
    immutable n = W.length;
    auto deletion = n.iota.map!(i=>W[0..i]~W[i+1..$]);
    auto transposition = iota(n-1).map!(i=>W[0..i]~W[i+1]~W[i]~W[i+2..$]);
    auto alteration = n.iota.cartesianProduct(lowercase).map!(p=>W[0..p[0]]~cast(char)p[1]~W[p[0]+1..$]);
    auto insertion = iota(n+1).cartesianProduct(lowercase).map!(p=>W[0..p[0]]~cast(char)p[1]~W[p[0]..$]);
    return chain(deletion, transposition, alteration, insertion).set;
}

auto known(R)(R words) { return words.filter!(w => w in nWords).set.keys; }

enum knownEdits2 = (string word) => word.edits1.byKey.map!(e1 => e1.edits1.byKey).joiner.known;

T or(T)(T x, lazy T y) { return x.length ? x : y; }

string correct(string word) {
    auto candidates = [word].known.or(word.edits1.byKey.known.or(word.knownEdits2.or([word])));
    return candidates.minPos!((w1, w2) => nWords.get(w1, -1) < nWords.get(w2, -1)).front;
}

void main() {
    foreach (mo; "big.txt".readText.toLower.matchAll("[a-z]+"))
        nWords[mo.front]++;
    writeln("speling".correct, '\n', "korrecter".correct);
}

The output is:
spelling
corrected

Phobos currently lacks a set implementation, so I've defined a function that uses the built-in associative arrays.
The "or" function is from the D1 code, with a small improvement.

The run-time of the D code compiled with ldc2 (V.0.15.1) is about 1.5 seconds (32 bit Windows). Python 2.6.6 runs the updated Python code (searching for both words) in about 1.3 seconds.

The Python version creates the NWORDS dict in about 1.1 seconds. The D version takes about 1 second to create it.
comments: Leave a comment Previous Entry Share Next Entry

High-level Spelling Corrector in D - reprise - leonardo
View:Recent Entries.
View:Archive.
View:Friends.
View:Profile.
View:Website (My Website).