leonardo - A word counting problem

Subject:A word counting problem
Time:04:31 pm
A little programming problem from this blog post:
http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/

The problem is:
Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.
A solution with shell scripting:
Here's the script that does that, with each command given its own line:
bash:
1:  tr -cs A-Za-z '\n' |
2:  tr A-Z a-z |
3:  sort |
4:  uniq -c |
5:  sort -rn |
6:  head -${1}
Explanation:
1) Make one-word lines by transliterating the complement (-c) of the alphabet into newlines (note the quoted newline), and squeezing out (-s) multiple newlines.
2) Transliterate upper case to lower case.
3) Sort to bring identical words together.
4) Replace each run of duplicate words with a single representative and include a count (-c).
5) Sort in reverse (-r) numeric (-n) order.
6) Print only the first lines of the result, as many as designated by the script's first parameter (${1}). (McIlroy's original used a stream editor here, sed ${1}q, which quits (q) after printing that many lines; head -${1} does the same job.)

Even if data-filtering programs for these exact purposes were not at hand, these operations would well be implemented separately: for separation of concerns, for easier development, for piecewise debugging, and for potential reuse. The simple pipeline given above will suffice to get answers right now, not next week or next month. It could well be enough to finish the job.
Some comments to that blog post:
Andre is quite right that this is an example of the elegance of FP vs. imperative -- but the flip side is that FP often fails when it runs up against real-world edge cases. Knuth's version may not handle contractions -- but it could, easily. Could McIlroy's? What if I want British/US spellings (such as "color" and "colour") to be counted as one word? What about hyphenated words?
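For what it's worth, the shell version can absorb the contraction case (and hyphenated words) with a one-character-class change to the first tr stage. This is my sketch, not from the post; merging British/US spellings would still need a real substitution table (e.g. a sed or awk stage):

```shell
# Adding ' and - to the kept character set makes "don't" and "well-known"
# single words. Surrounding punctuation is still handled imperfectly
# (e.g. a 'quoted' word keeps its quotes); the sample input is illustrative.
printf "Don't stop, don't give up\n" |
tr -cs "A-Za-z'-" '\n' |
tr A-Z a-z |
sort | uniq -c | sort -rn
```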
20+ years later, we still don't have sed or awk available as a library, afaik. We also don't have a tr-like function that works on streams, or a sort for text files, in standard libraries. Which is why you simply can't use this approach when you write something that isn't completely solvable via shell scripting: it doesn't get compiled in, and your team leader/senior programmer/whatever will definitely not allow you to call shell commands from C, C++ or Pascal, even when it would pay off.
This is a modern solution (from the blog post http://franklinchen.com/blog/2011/12/08/revisiting-knuth-and-mcilroys-word-count-programs/ with several small changes; it can probably be improved further):
import System.Environment (getArgs)
import Data.List (sortBy)
import Data.Char (toLower, isLetter)
import qualified Data.HashMap.Strict as HM

createReport :: Int -> String -> String
createReport n text =
    unlines $
    map (\(count, w) -> w ++ " " ++ show count) $
    take n $
    sortBy (flip compare) $
    map (\(w, count) -> (count, w)) $
    HM.toList $
    HM.fromListWith (+) $
    map (\w -> (w, 1)) $
    words $
    map toLower $
    map (\c -> if isLetter c then c else '\n') $
    text

main = do
    [fileName, nstr] <- getArgs
    let n = read nstr
    text <- readFile fileName
    putStr $ createReport n text
Usage: runhaskell wfreqh.hs test_data.txt 10

A comment from that blog post:
The shell script operates on raw text and everything is just strings being parsed and reparsed by the respective Unix utility programs.

The Haskell program is statically typed. It is type-checked by the compiler, which generates native code. The program uses standard libraries and data types, such as lists and hash maps.

Also, the Haskell program could be refined, extended, optimized in various ways. The most important optimizations I can think of off the top of my head:
- Using a better representation of strings than the default built-in "string as list of characters". Easily accessible advice can be found on Stack Overflow and browsing through Haskell library documentation, such as for the text package.
- Loop fusion, deforestation can be applied to deal with the apparent allocation of lots of new lists in the pipeline. One of the selling points of using a language like Haskell is the opportunity for the compiler to perform radical optimizations that are impossible for languages that have side effects.

I don't write many bash scripts these days. General-purpose programming languages can now do a decent job of munging data. The situation was different decades ago, when there was C and few standard high-level libraries in the C world.
This is a D language version (it requires no installation of libraries, unlike the unordered-containers package from Cabal used by the Haskell program). It works with Unicode text; if you only have ASCII text the code becomes simpler. That Haskell version seems to have problems with Unicode text (on the other hand, don't expect the D code to handle Unicode sorting fully correctly; that is a complex problem).
import std.stdio, std.conv, std.file, std.string,
       std.algorithm, std.range;

void main(string[] args) {
    args[1]
    .readText
    .toLower
    .tr("A-Za-z", "\n", "cs")
    .split
    .sort()
    .group
    .array
    .schwartzSort!q{ -a[1] }
    .take(args[2].to!uint)
    .map!q{ text(a[0], " ", a[1]) }
    .join("\n")
    .writeln;
}
D version: 17 nonblank lines, Haskell: 23 lines.
(The cast is needed because of a small D bug that will probably be fixed. The sort function has () to avoid calling the built-in sort, which will be deprecated.)
File sizes:
9.005.142 test_data.txt
      365 wfreqd.d
      701 wfreqh.hs
1.565.710 wfreqh.exe (stripped)
  510.492 wfreqd.exe
Haskell code compiled with:
ghc -O3 wfreqh.hs
Glasgow Haskell Compiler, Version 7.6.1, stage 2 booted by GHC version 7.4.2

D code compiled with:
dmd -O -release -inline -noboundscheck wfreqd.d
DMD Version 2.064alpha.
Runtime in seconds, best of 3 (top 100 words):
  wfreqh: 7.45
  wfreqd: 6.24
Tests performed on Windows 32 bit, using an about 9 MB text file in English.

In theory the shell program is fully lazy (even the sort can be done with an external sorting algorithm), while the D (and probably the Haskell) version load the whole file in memory.
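The laziness claim is easy to check from the shell: every stage before the first sort is a line-by-line filter, and sort itself spills to temporary files when its buffer fills, so the pipeline never needs the whole input in memory. A small demonstration on generated input (the -S buffer-size flag is GNU sort; the repetition counts are arbitrary):

```shell
# Feed 100000 generated lines through the pipeline with a deliberately
# small sort buffer, so GNU sort falls back to external (on-disk) merging.
awk 'BEGIN { for (i = 0; i < 100000; i++) print "alpha beta beta" }' |
tr -cs A-Za-z '\n' |
sort -S 1M | uniq -c | sort -rn
```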

The D and Haskell programs are strongly statically typed so they are able to catch several programming errors at compile-time.

The shell program uses text, so it has to convert numbers to text and then from text to numbers, while the D and Haskell programs use lazy sequences of arbitrary types, that can be more powerful and save time in the textual serialization and deserialization. And unless the floating point values are serialized as hex floats, such textual conversions can also introduce some error.
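The serialization cost is not only time; correct ordering also depends on re-parsing the text as numbers. A two-line illustration of why stage 5 needs -n at all (the data is made up):

```shell
# Lexicographic comparison treats counts as strings, so "9" sorts after "10";
# the -n flag re-parses the leading field as a number.
printf '10 the\n9 and\n2 of\n' | sort -r     # textual: "9 and" comes first
printf '10 the\n9 and\n2 of\n' | sort -rn    # numeric: "10 the" comes first
```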

If you have to perform some operation not present among the unix tools, you can write a small program (in C or another language) that uses text pipes, and call it like the other shell commands. In Haskell or D you just write a function (or a lazy Range in D, often a struct), and call it, this is simpler and more integrated, but it's less modular.
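To illustrate the shell side of that trade-off: any program that maps stdin lines to stdout lines drops straight into the pipeline without recompiling anything. Here a one-line awk filter (my example, not from the post) is spliced between the existing stages to discard one-letter words:

```shell
# The awk stage is a drop-in filter; a small C or Pascal program reading
# stdin and writing stdout could be substituted for it unchanged.
printf 'A cat I saw a cat\n' |
tr -cs A-Za-z '\n' |
tr A-Z a-z |
awk 'length($0) > 1' |   # custom stage: keep only words longer than 1 char
sort | uniq -c | sort -rn
```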

Update Jun 21 2013: wiz offers this Haskell version (gist.github.com/wiz/5826131):
{-# LANGUAGE TupleSections #-}

import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TL
import           Data.Char (isLetter)
import           Data.List (sortBy)
import           Data.Function (on)
import qualified Data.HashMap.Strict as HM
import System.Environment (getArgs)

collect = sortBy (flip compare `on` snd)
        . HM.toList
        . HM.fromListWith (+)
        . map (,1)
        . filter (not . TL.null)
        . TL.split (not . isLetter)
        . TL.toLower

display (word, count) = do
    TL.putStr word
    putStr ": "
    print count

main = do
    [fname, count] <- getArgs
    text <- TL.readFile fname
    mapM_ display $ take (read count) (collect text)

This Haskell version is slower than the preceding one: it takes about 10.8 seconds to run. The D code compiled with the ldc2 compiler (which optimizes better than dmd) takes 4.5 seconds to run (if you use ldc2 2.063, then use .sort!q{-a[1] < -b[1]} instead of the schwartzSort, which was improved later in Phobos).

There are many ways to improve the performance of the D code, such as using a hash counter, which is missing in Phobos (run time with ldc2 is about 3.17 seconds):
import std.stdio, std.conv, std.file, std.string,
       std.algorithm, std.range, std.traits;

auto hashCounter(R)(R items) if (isForwardRange!R) {
    size_t[ForeachType!R] result;
    foreach (x; items)
        result[x]++;
    return result.byKey.zip(result.byValue);
}

void main(string[] args) {
    args[1]
    .readText
    .toLower
    .tr("A-Za-z", "\n", "cs")
    .split
    .hashCounter
    .array
    .sort!q{ -a[1] < -b[1] }
    .take(args[2].to!uint)
    .map!q{ text(a[0], " ", a[1]) }
    .join("\n")
    .writeln;
}


livejournal
Subject:Haskell vs. D
Time:2013-06-20 05:03 pm (UTC)
User thedeemon referenced to your post from Haskell vs. D saying: [...] A nice little practical comparison: http://leonardo-m.livejournal.com/109201.html [...]

versusvoid
Time:2013-06-20 06:23 pm (UTC)
It feels like the Haskell code was bloated deliberately. While the D version writes args[2].to!uint, the Haskell one declares n separately instead of just using (read nstr). An extra function was declared too, and it even got a type annotation, and the map is done as a separate step.
And it's surely not an optimal solution either.
And the judges have been bought.

(Deleted comment)

dpwiz
Time:2013-06-20 08:05 pm (UTC)
Try this one: https://gist.github.com/wiz/5826131 at least it looks better (:

leonardo_m
Time:2013-06-20 08:23 pm (UTC)
Thank you. I will benchmark it soon.

Edit: This Haskell version is slower than the preceding one: it takes about 10.7 seconds to run. (It's also longer, but it still doesn't deal with UTF-8 input text.)

I have also found a problem. If the input text contains a line like:
http://www.reddit.com/r/programming/
It outputs a line like:
httpwwwredditcomrprogramming: 1

Edited at 2013-06-20 10:13 pm (UTC)

dpwiz
Time:2013-06-21 05:41 am (UTC)
Oops, I think the slowdown was caused by me stupidly reversing the sorted list instead of reverse-sorting like the original.

I've dug into profiler results and two slow points are sort and isLetter.

https://gist.github.com/wiz/5826131/revisions

leonardo_m
Time:2013-06-21 11:42 am (UTC)
I have updated the benchmark at the end of the post. Unless I am missing something this version is not faster...

dpwiz
Time:2013-06-21 04:57 pm (UTC)
That's quite unusual...

I've benchmarked (http://campfire.aenor.ru/ltVSstrings.html) a recent version and it is measurably faster, even than the one in your post.

leonardo_m
Time:2013-06-20 07:38 pm (UTC)
The Haskell code comes from that franklinchen.com blog; I have improved it just a little. I am sure there are ways to improve the Haskell code further, but I am not a Haskell expert. If you have suggestions to improve the Haskell code, I will accept them. Maybe there are ways to speed it up using a different representation for the text (like Text, bytestrings, etc.).

dmzlj
Time:2013-06-20 08:18 pm (UTC)
import Control.Monad
import Data.Char
import Data.List
import Data.Maybe
import Data.Ord
import qualified Data.Map as M
import System.Environment (getArgs)
import Text.Printf

main = do
  (f:n:_) <- getArgs
  ws <- liftM (words.map (\c -> if isLetter c then toLower c else ' ')) (readFile f)
  let d = foldr (flip (M.insertWith (+)) 1) M.empty ws
  let es = take (read n) $ reverse $ sortBy (comparing snd) (M.toList d)
  forM_ es $ \(s,n) -> printf "%4d %s\n" (n :: Int) s :: IO ()

leonardo_m
Time:2013-06-20 09:55 pm (UTC)
Your program gives me "Stack space overflow: current size 8388608 bytes.".

dmzlj
Time:2013-06-21 03:25 am (UTC)
What size is the file?

leonardo_m
Time:2013-06-21 11:22 am (UTC)
9.005.142 bytes, as written in the post.

migmit
Time:2013-06-21 04:21 am (UTC)
Could you give a link to the test_data file, so that we can test our solutions on the same data you have?

leonardo_m
Time:2013-06-21 11:24 am (UTC)
It's private data. But a commenter on Reddit has used the Complete Works of William Shakespeare (5.3 MiB plaintext): http://www.gutenberg.org/ebooks/100

dpwiz
Time:2013-06-21 05:42 am (UTC)
Don't ever map toLower over a [Char]. It calls a huge chunk of FFI for each character.

inv2004
Time:2013-06-20 08:52 pm (UTC)
K. I am sure this is not the shortest solution, but it includes definitions of the primitive functions: il (isLetter), lc (lower).

Az:_ci (65+!26),97+!26 / define "ABC....Zabc...z"
AZ:26#Az / define "A..Z"
il:{x _in Az}' / is in "A...z"
lc:{:[x _in AZ;_ci (_ic x)+32;x]}' / if "A..Z" -> "a..z"

s:,/" ",'0:"1.txt" / read text into string
ws:1_'(&~il s)_ s / split into words
ws:ws@&{0<#x}'ws / remove empty words
o:3#>{+/x _sm ws}'?ws / find order of occurrence, take 3
{(x;(?ws)@x)}'o / output occurrence and word

leonardo_m
Time:2013-06-20 09:11 pm (UTC)
Thank you. Now we need an Ursala version... ;-) (http://rosettacode.org/wiki/Category:Ursala )

lelf
Time:2013-06-20 08:54 pm (UTC)
> It works with unicode text.

> .tr("A-Za-z", "\n", "cs")

oh yeah

leonardo_m
Time:2013-06-20 09:28 pm (UTC)
The D code loads the text with readText, which loads UTF-8, instead of the faster read, which loads bytes. And toLower tries to convert the Unicode text to lower case. Then tr filters only a small subset of the chars.

At this stage of development of the D standard library I think there is no way to write a D program this short that deals correctly with Unicode. You need a Unicode collation algorithm and more. There is much left to do to make it more correct.

lelf
Time:2013-06-21 12:50 pm (UTC)
So "nowhere near working with Unicode" would be more correct ;-)

inv2004
Time:2013-06-21 07:27 pm (UTC)
Because it is Friday:

J:


lc=:(LATIN_LC,a.) {~ (LATIN_UC,a.) i. ] NB. to lower case
nl=:' ' ((I.@:(0=e.&LATIN))@:]) } ] NB. replace non-LATIN with ' '
3 {. \:~ (#;{.)/.~ ;: lc nl freads '1.txt' NB. main program

dpwiz
Time:2013-06-21 08:04 pm (UTC)
Now add proper unicode support.

Edited at 2013-06-21 08:10 pm (UTC)

inv2004
Time:2013-06-21 08:32 pm (UTC)
require 'ufread'

+ replace fread with ufread, which supports UTF-16, UTF-8, and 8-bit ANSI.

This script reads a text file in various formats and returns a text string in UTF-8 format, which is the default for J.




inv2004
Time:2013-06-21 07:29 pm (UTC)
Of course, with the help of the J mailing list.

ext_2032977
Time:2013-06-22 10:10 am (UTC)
I must say that what you are comparing here is rather meaningless. Proper Unicode handling is a big deal and hard to do; it really slows things down a lot. People put serious effort into making sure it works fast. So comparing a solution that doesn't work with Unicode to one that really does is at least unfair.

The other thing is using String and expecting it to be fast: it's not. If you care about correctness and speed you should use either Text from the "text" library or ByteString from "bytestring". The first handles Unicode properly. The second simply works on bytes (represented as the Word8 datatype). Here are my timings:

-- ldc2 -O -release -disable-boundscheck w
-- ./wfreqd pg100.txt 100 0,75s user 0,01s system 99% cpu 0,764 total
-- ./wordfreq-text pg100.txt 100 1,33s user 0,03s system 99% cpu 1,367 total
-- ./wordfreq-bs pg100.txt 100 0,44s user 0,01s system 99% cpu 0,453 total
-- ./wordfreq-bs pg100.txt 100 +RTS -A2m -H2m 0,39s user 0,02s system 99% cpu 0,409 total

The proper-Unicode solution is 1.78x slower, but the byte-level equivalent is 1.86x faster (with proper RTS flags).

Code here: https://gist.github.com/Tener/5840164

ext_2033254
Time:2013-06-22 02:20 pm (UTC)
I just want to clarify the last sentence of the above comment: the Haskell version using ByteString is 86% faster than the D code.

(Anonymous)
Time:2013-06-22 09:07 pm (UTC)
I just improved the D code to run in less than half the time. All I did was create a "byWord" range to iterate more efficiently over the original string, instead of using "tr" and "split".

Here are the timings (always using dmd -noboundscheck -inline -O -release, and testing against the Shakespeare Complete Works (5.3 MiB plaintext)):

Minimum of 10 runs of your hashcounter version:
2781 ms
Minimum of 10 runs of byWord version:
805 ms

Here is my version using "byWord":

import std.stdio, std.conv, std.file, std.string,
       std.algorithm, std.range, std.traits, std.ascii;

auto hashCounter(R)(R items) if (isForwardRange!R) {
    size_t[ForeachType!R] result;
    foreach (x; items)
        result[x]++;
    return result.byKey.zip(result.byValue);
}

void main(string[] args) {
    //Slow:
    // args[1]
    // .readText
    // .toLower
    // .tr("A-Za-z", "\n", "cs")
    // .split

    //Faster:
    args[1]
    .readText
    .byWord
    .map!toLower()
    .array
    .hashCounter
    .array
    .sort!"-a[1] < -b[1]"()
    .take(args[2].to!uint)
    .map!q{ text(a[0], " ", a[1]) }
    .join("\n")
    .writeln;
}

/** Range that extracts words from a string. Words are
    strings composed only of chars accepted by std.ascii.isAlpha(). */
struct byWord {
    string s;
    size_t pos;
    string word;

    this(string s) {
        this.s = s;
        popFront();
    }

    @property bool empty() const {
        return s.length == 0;
    }

    @property string front() {
        return word;
    }

    void popFront() {
        if (pos == s.length) {
            //Mark the range as empty, only after popFront fails:
            s = null;
            return;
        }

        while (pos < s.length && !std.ascii.isAlpha(s[pos])) {
            ++pos;
        }
        auto start = pos;
        while (pos < s.length && std.ascii.isAlpha(s[pos])) {
            ++pos;
        }

        if (start == s.length) {
            //No more words. Range empty:
            s = null;
        } else {
            word = s[start .. pos];
        }
    }
}

unittest {
    assert([] == array(byWord("")));
    assert([] == array(byWord("!@#$")));
    assert(["a", "b"] == array(byWord("a b")));
    assert(["a", "b", "c"] == array(byWord("a b c")));
}

--Juan Manuel Cabo


(Anonymous)
Time:2013-06-26 04:01 pm (UTC)
I wrote a simple F# script. It processed pg100.txt in 300 milliseconds on the first pass, without any tweaking.

leonardo_m
Time:2013-06-26 04:37 pm (UTC)
Good, let's see the code!

(Anonymous)
Time:2013-06-26 05:28 pm (UTC)
Here's the F# code.

//wfreqFS.fs
//fsc -O wfreqFS.fs
//wfreqFS pg100.txt 100 -> elapsed 1662 ms

open System
open System.IO
open System.Diagnostics

let wordFreqs (inputStr:string) =
    inputStr.Split([|' '|], StringSplitOptions.RemoveEmptyEntries)
    |> Array.toSeq
    |> Seq.map (fun (s:string) -> s.ToLower())
    |> Seq.countBy id
    |> Seq.sortBy (fun (x,y) -> y*(-1))

[<EntryPoint>]
let main args =
    match args with
    | [| fileName; count |] ->
        let timer = new Stopwatch()
        timer.Start()
        let testStr = File.ReadAllText fileName
        let result = wordFreqs testStr |> Seq.take (int count)
        Seq.iter (fun (x,y) -> printfn "'%s' : %d " x y) result
        timer.Stop()
        printf "elapsed %d ms" timer.ElapsedMilliseconds
    | _ ->
        printfn "USAGE: wfreqFS fileName count"
    0
