Programming Puzzles: Letter and Word Frequency

2015-04-22
programming, haskell, ruby

Motivation

I recently purchased an ErgoDox keyboard1, and I’ve been thinking about creating my own keyboard layout in the spirit of the Dvorak Simplified Keyboard (DSK). One of DSK’s guiding principles was to reduce finger travel by placing the most commonly occurring letters of the English language on the home row of the keyboard. The Dvorak layout was patented in 1936, according to Wikipedia, with subsequent adjustments culminating in the present form of DSK.

I want to find out for myself which letters are the most common. Sure, I could blindly trust an online source like Wikipedia, but this feels like an easy problem to solve on my own. Also, I am not sure whether Dvorak considered word frequency as well as letter frequency. Ideally, one should combine both word and letter frequency data to determine the most commonly typed “letter” on a US ASCII keyboard for English writers.

The Problems

Letter Frequency

Write a program that reads a plaintext file and counts how many times each letter occurs in it. You must treat A as the same letter as a. You may limit yourself to the plain US ASCII 26-letter alphabet, discarding all letters with diacritics. Sort the letters by their frequency; for each letter, display the letter itself, its frequency as a percentage of all counted letters in the file, and the number of times this letter appears (raw count). E.g., the letter a counted by this program might look like this: a = 2.00% (200 occurrences).

Word Frequency

Write a program that reads a plaintext file and counts how many times each word occurs in it. The precise definition of a “word” is up to you, but you must exclude Arabic numerals and standalone punctuation characters (e.g., “* is not a word”). Display the 100 most common words in a fashion similar to the Letter Frequency problem.

Both the letter and word frequency problems will use the Project Gutenberg plaintext file of Moby Dick.

Now, before you go on to read my solutions, I encourage you to write a solution on your own using your favorite programming language.

Ruby Version

First, the library file, text_freq.rb:

module TextFreq
  # Given a string, count every occurrence of letters a-z (case insensitively).
  def TextFreq.freq_l(src)
    # Construct the array to hold the running totals (occurrences) of each
    # letter. There are 26 letters in the alphabet, so we can just have an array
    # of 26 integers.
    occs = Array.new(26, 0)

    # Count occurrences of each letter.
    src.each_char do |c|
      idx = char_to_idx(c.downcase)
      occs[idx] += 1 unless idx.nil?
    end

    occs
  end

  # Simply check if the given character belongs to the range of lowercase ASCII
  # characters that make up the alphabet. "a" is 97, and "z" is 122; the numbers
  # for bounds-checking "c" come from these two (offset by 1 to account for the
  # exclusive comparison).
  def TextFreq.char_to_idx(c)
    96 < c.ord && c.ord < 123 ? c.ord - 97 : nil
  end

  # Given a string, count every occurrence of a particular word. We define a
  # "word" as a sequence of characters that
  #   - does not have any punctuation characters at the beginning or end, and
  #   - does not have any numbers in it.
  # We take into account that text files from Project Gutenberg use a double
  # dash for an em dash to separate two words.
  def TextFreq.freq_w(src)
    occs = {}
    words = src.split(/\W*\s\W*/).map do |w|
      w.empty? ? nil : w.downcase
    end.compact
    words.each do |w|
      # Guard against cases like "*" for bullet points and such.
      next unless w =~ /\w/
      if w =~ /--/
        # Splitting on "--" can yield empty or punctuation-only pieces (e.g.,
        # for a token like "--word"), so guard each piece before counting.
        w.split("--").each do |y|
          count_word(occs, lstrip_punc(y)) if y =~ /\w/
        end
      else
        count_word(occs, lstrip_punc(w))
      end
    end

    occs
  end

  # Add 1 to the hash for an existing key (word); otherwise, store a new
  # instance of that word.
  def TextFreq.count_word(hash, w)
    hash.key?(w) ? hash[w] += 1 : hash.store(w, 1)
    hash
  end

  # Remove leading punctuation.
  def TextFreq.lstrip_punc(w)
    w.match(/\w.*/)[0]
  end

  # Display the frequencies of letters or words. For letters, we are only
  # concerned with 26 different values, so we print all of them out. However,
  # for words, depending on the corpus there might be thousands, or even
  # millions, of different words; thus, we only display the top 100 most
  # common words.
  def TextFreq.disp_freq(occs)
    if occs.is_a?(Array)
      sum = occs.inject(0, :+)
      occs.zip(("a".."z").to_a).sort.reverse.each do |cnt, c|
        puts "#{c} = "\
          + "%.2f%%" % (cnt/sum.to_f * 100.0)\
          + " (#{cnt} occurrences)"
      end
    else
      sum = occs.values.inject(0, :+)
      occs.sort_by {|w, cnt| cnt}.reverse.take(100).each do |w, cnt|
        puts "#{w} = "\
          + "%.2f%%" % (cnt/sum.to_f * 100.0)\
          + " (#{cnt} occurrences)"
      end
    end
  end
end
And here is the driver script, analyze.rb:

#!/usr/bin/env ruby

require_relative './text_freq'

fname = ARGV[0]
file = File.open(fname, 'r:utf-8')
corpus = file.read

occs_l = TextFreq.freq_l(corpus)
TextFreq.disp_freq(occs_l)

puts "-" * 80

occs_w = TextFreq.freq_w(corpus)
TextFreq.disp_freq(occs_w)

file.close

And here is the output (pg2701.txt is Moby Dick):

$ ./analyze.rb ~/pg2701.txt
e = 12.29% (118967 occurrences)
t = 9.25% (89549 occurrences)
a = 8.16% (78959 occurrences)
o = 7.31% (70698 occurrences)
n = 6.89% (66670 occurrences)
i = 6.88% (66585 occurrences)
s = 6.72% (65012 occurrences)
h = 6.56% (63444 occurrences)
r = 5.51% (53342 occurrences)
l = 4.47% (43298 occurrences)
d = 4.01% (38769 occurrences)
u = 2.81% (27217 occurrences)
m = 2.44% (23655 occurrences)
c = 2.39% (23122 occurrences)
w = 2.33% (22500 occurrences)
g = 2.19% (21239 occurrences)
f = 2.19% (21228 occurrences)
p = 1.83% (17711 occurrences)
y = 1.78% (17209 occurrences)
b = 1.77% (17165 occurrences)
v = 0.90% (8721 occurrences)
k = 0.85% (8196 occurrences)
q = 0.16% (1567 occurrences)
j = 0.12% (1176 occurrences)
x = 0.11% (1062 occurrences)
z = 0.07% (636 occurrences)
--------------------------------------------------------------------------------
the = 6.74% (14616 occurrences)
of = 3.10% (6708 occurrences)
and = 2.99% (6488 occurrences)
a = 2.20% (4760 occurrences)
to = 2.16% (4677 occurrences)
in = 1.95% (4223 occurrences)
that = 1.38% (2999 occurrences)
his = 1.17% (2530 occurrences)
it = 1.12% (2419 occurrences)
i = 0.92% (1988 occurrences)
but = 0.84% (1823 occurrences)
he = 0.82% (1777 occurrences)
with = 0.82% (1770 occurrences)
as = 0.81% (1751 occurrences)
is = 0.81% (1747 occurrences)
for = 0.76% (1645 occurrences)
was = 0.76% (1645 occurrences)
all = 0.70% (1523 occurrences)
this = 0.66% (1440 occurrences)
at = 0.62% (1334 occurrences)
by = 0.56% (1223 occurrences)
not = 0.54% (1169 occurrences)
from = 0.51% (1105 occurrences)
on = 0.49% (1069 occurrences)
him = 0.49% (1062 occurrences)
so = 0.49% (1061 occurrences)
be = 0.49% (1060 occurrences)
whale = 0.45% (972 occurrences)
you = 0.44% (944 occurrences)
one = 0.42% (906 occurrences)
or = 0.37% (797 occurrences)
there = 0.37% (792 occurrences)
now = 0.36% (779 occurrences)
had = 0.36% (779 occurrences)
have = 0.36% (772 occurrences)
were = 0.32% (683 occurrences)
they = 0.31% (664 occurrences)
which = 0.30% (655 occurrences)
then = 0.29% (628 occurrences)
me = 0.29% (621 occurrences)
their = 0.29% (620 occurrences)
are = 0.29% (619 occurrences)
some = 0.29% (619 occurrences)
when = 0.28% (607 occurrences)
an = 0.28% (600 occurrences)
no = 0.27% (594 occurrences)
my = 0.27% (589 occurrences)
like = 0.27% (581 occurrences)
upon = 0.26% (567 occurrences)
what = 0.26% (566 occurrences)
out = 0.24% (528 occurrences)
into = 0.24% (523 occurrences)
up = 0.24% (516 occurrences)
more = 0.23% (506 occurrences)
if = 0.23% (500 occurrences)
them = 0.22% (471 occurrences)
we = 0.21% (455 occurrences)
man = 0.21% (445 occurrences)
old = 0.20% (444 occurrences)
ahab = 0.20% (432 occurrences)
ye = 0.20% (428 occurrences)
would = 0.20% (428 occurrences)
other = 0.19% (416 occurrences)
been = 0.19% (415 occurrences)
these = 0.19% (405 occurrences)
over = 0.19% (403 occurrences)
will = 0.18% (396 occurrences)
ship = 0.18% (391 occurrences)
though = 0.18% (383 occurrences)
sea = 0.18% (382 occurrences)
its = 0.18% (382 occurrences)
only = 0.17% (378 occurrences)
such = 0.17% (376 occurrences)
down = 0.17% (367 occurrences)
any = 0.17% (363 occurrences)
who = 0.16% (345 occurrences)
yet = 0.16% (344 occurrences)
her = 0.15% (329 occurrences)
time = 0.15% (326 occurrences)
very = 0.15% (323 occurrences)
do = 0.15% (321 occurrences)
long = 0.15% (319 occurrences)
about = 0.15% (318 occurrences)
than = 0.14% (311 occurrences)
still = 0.14% (311 occurrences)
those = 0.14% (307 occurrences)
great = 0.14% (303 occurrences)
said = 0.14% (301 occurrences)
captain = 0.14% (300 occurrences)
before = 0.14% (300 occurrences)
here = 0.14% (299 occurrences)
has = 0.14% (294 occurrences)
must = 0.13% (292 occurrences)
two = 0.13% (288 occurrences)
most = 0.13% (284 occurrences)
seemed = 0.13% (283 occurrences)
last = 0.13% (276 occurrences)
head = 0.13% (275 occurrences)
see = 0.12% (268 occurrences)
thou = 0.12% (267 occurrences)

The file, of course, contains remarks and legalese from Project Gutenberg, so if you want more accuracy, you would have to redact those parts before running this script.

Letter Frequency

The freq_l method only considers letters in the US ASCII range and uses a crude, C-like letter-to-integer mapping via char_to_idx. We use a simple array of 26 integers, one per letter. But thanks to this simplicity, freq_l runs quite fast, chugging through Moby Dick in a few seconds on my 4GHz Core i7-4770K machine.
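To make the mapping concrete, here is a quick check of char_to_idx (a hypothetical snippet, not part of the solution, assuming the module above is saved as text_freq.rb):

require_relative './text_freq'

# char_to_idx maps "a".."z" onto array indices 0..25 and rejects
# everything else.
p TextFreq.char_to_idx("a") # => 0
p TextFreq.char_to_idx("z") # => 25
p TextFreq.char_to_idx("!") # => nil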

Word Frequency

The freq_w method relies almost entirely on a single regex, /\W*\s\W*/, to split the input into words: it splits at each run of whitespace, stripping any punctuation stuck to either side. These words are then processed further; we perform a basic sanity check with the /\w/ regex to make sure we are not dealing with just numbers or punctuation, and we also take into account the em dash --. We use a basic hash to store the words as keys and their counts as values.
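To see the split in action, here is what it produces on a made-up sample line (a minimal sketch, not part of the solution above):

# Punctuation next to whitespace is consumed by the split, but "--" between
# two words is not; freq_w handles that case separately.
p "Call me Ishmael. Some years ago--never mind how long.\n".split(/\W*\s\W*/)
# => ["Call", "me", "Ishmael", "Some", "years", "ago--never", "mind", "how", "long"]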

Haskell Version

First, the TextFreq module:

module TextFreq where

import Data.Char
import Data.List
import qualified Data.Map.Strict as M
import Data.Ord (comparing)
import qualified Data.Text.Lazy as T
import Data.Word
import qualified Text.Printf as TP

type LHash = M.Map Char Word64

type WProto = T.Text
type WHash = M.Map WProto Word64
data WFSM
  = WordIn
  | WordOutMaybe
  | WordOut
  deriving (Eq)
data WBuild = WBuild WFSM WProto WHash

freqL :: T.Text -> LHash
freqL = T.foldl step occs
  where
  occs = M.empty
  step lhash c
    -- The guard below could instead be `isAlpha c`, which picks up every
    -- non-ASCII 'word' letter, even Korean/Japanese ones! (The French run
    -- near the end of this post uses that variant.)
    -- | isAlpha c = M.insertWith (+) (toLower c) 1 lhash
    | elem c (['a'..'z'] ++ ['A'..'Z']) = M.insertWith (+) (toLower c) 1 lhash
    | elem c (concat puncKeys) = case lookup c puncTuples of
      Just pkey -> M.insertWith (+) pkey 1 lhash
      Nothing -> lhash
    | otherwise = lhash
  puncKeys =
    [ "`~"
    , "-_"
    , "=+"
    , "[{"
    , "}]"
    , "\\|"
    , ";:"
    , "'\""
    , ",<"
    , ".>"
    , "/?"
    ]
  puncTuples = concatMap (\keyPair -> [(head keyPair, head keyPair), (last keyPair, head keyPair)]) puncKeys

freqW :: T.Text -> WHash
freqW = finish . T.foldl step occs
  where
  -- Flush any word still being built when the input runs out, so that the
  -- final word of the text is not dropped.
  finish (WBuild wfsm wproto whash)
    | wfsm == WordOut = whash
    | otherwise = M.insertWith (+) wproto 1 whash
  -- Use WordOut as the initial state for WFSM, because we're starting from
  -- nothing!
  occs :: WBuild
  occs = WBuild WordOut T.empty M.empty
  step wb@(WBuild wfsm wproto whash) c
    -- Letter.
    | isAlpha c = case wfsm of
      -- This is when we first encounter a letter.
      WordOut -> WBuild WordIn (T.singleton c') whash
      _ -> WBuild WordIn (T.snoc wproto c') whash
    -- Apostrophe. We ignore all leading apostrophes, but keep apostrophes in
    -- the middle or at the end of a word, such as "don't" or "goin'".
    | c == '\'' = case wfsm of
      -- This is when we encounter an apostrophe either at the middle or
      -- end of a word.
      WordIn -> WBuild WordOutMaybe (T.snoc wproto c') whash
      -- E.g., "goin''": the word "goin'" followed by a closing inner quote.
      -- We store it as "goin'".
      WordOutMaybe -> WBuild WordOut T.empty
        $ M.insertWith (+) wproto 1 whash
      -- Already out of a word area, such as a space character. We do
      -- nothing.
      WordOut -> wb
    -- If we're looking at neither a letter nor an apostrophe.
    | otherwise = case wfsm of
      -- A series of nonsense chars; ignore.
      WordOut -> wb
      -- End of a word.
      _ -> WBuild WordOut T.empty
        $ M.insertWith (+) wproto 1 whash
    where
    c' = toLower c

dispFreqL :: LHash -> IO ()
dispFreqL lhash = mapM_ f . reverse . sortBy (comparing snd) $ M.toList lhash
  where
  total :: Word64
  total = sum $ M.elems lhash
  f :: (Char, Word64) -> IO ()
  f (c, n) = putStrLn $ msg1 ++ msg2 ++ msg3
    where
    perc :: Double
    perc
      | total == 0 = 0
      | otherwise = (fromIntegral n) / (fromIntegral total) * 100
    msg1 = [c] ++ " = "
    msg2 = TP.printf "%.2f%%" perc
    msg3 = " (" ++ show n ++ " occurrences)"

dispFreqW :: WHash -> IO ()
dispFreqW whash = mapM_ f . take 100 . reverse . sortBy (comparing snd) $ M.toList whash
  where
  total :: Word64
  total = sum $ M.elems whash
  f :: (WProto, Word64) -> IO ()
  f (w, n) = putStrLn $ msg1 ++ msg2 ++ msg3
    where
    perc :: Double
    perc
      | total == 0 = 0
      | otherwise = (fromIntegral n) / (fromIntegral total) * 100
    msg1 = T.unpack w ++ " = "
    msg2 = TP.printf "%.2f%%" perc
    msg3 = " (" ++ show n ++ " occurrences)"
And the main driver:

module Main where

import Control.Monad
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as T
import System.Environment (getArgs)

import TextFreq

main :: IO ()
main = do
  -- Read and concatenate the file(s) named on the command line, to match
  -- the invocation shown below.
  fnames <- getArgs
  src <- liftM T.concat $ mapM T.readFile fnames
  dispFreqL $ freqL src
  putStrLn $ replicate 80 '-'
  dispFreqW $ freqW src

Here is the same run against our copy of Moby Dick:

$ ./analyze ~/pg2701.txt
e = 12.29% (118967 occurrences)
t = 9.25% (89549 occurrences)
a = 8.16% (78959 occurrences)
o = 7.31% (70698 occurrences)
n = 6.89% (66670 occurrences)
i = 6.88% (66585 occurrences)
s = 6.72% (65012 occurrences)
h = 6.56% (63444 occurrences)
r = 5.51% (53342 occurrences)
l = 4.47% (43298 occurrences)
d = 4.01% (38769 occurrences)
u = 2.81% (27217 occurrences)
m = 2.44% (23655 occurrences)
c = 2.39% (23122 occurrences)
w = 2.33% (22500 occurrences)
g = 2.19% (21239 occurrences)
f = 2.19% (21228 occurrences)
p = 1.83% (17711 occurrences)
y = 1.78% (17209 occurrences)
b = 1.77% (17165 occurrences)
v = 0.90% (8721 occurrences)
k = 0.85% (8196 occurrences)
q = 0.16% (1567 occurrences)
j = 0.12% (1176 occurrences)
x = 0.11% (1062 occurrences)
z = 0.07% (636 occurrences)
--------------------------------------------------------------------------------
the = 6.67% (14620 occurrences)
of = 3.07% (6732 occurrences)
and = 2.97% (6502 occurrences)
a = 2.18% (4788 occurrences)
to = 2.15% (4706 occurrences)
in = 1.93% (4231 occurrences)
that = 1.37% (3005 occurrences)
his = 1.15% (2530 occurrences)
it = 1.11% (2434 occurrences)
i = 0.91% (1993 occurrences)
but = 0.83% (1823 occurrences)
he = 0.81% (1780 occurrences)
with = 0.81% (1770 occurrences)
as = 0.80% (1752 occurrences)
is = 0.80% (1748 occurrences)
was = 0.75% (1646 occurrences)
for = 0.75% (1646 occurrences)
all = 0.70% (1543 occurrences)
this = 0.66% (1443 occurrences)
at = 0.61% (1335 occurrences)
by = 0.56% (1226 occurrences)
not = 0.53% (1171 occurrences)
whale = 0.51% (1108 occurrences)
from = 0.50% (1105 occurrences)
on = 0.49% (1073 occurrences)
him = 0.49% (1067 occurrences)
so = 0.49% (1066 occurrences)
be = 0.49% (1064 occurrences)
you = 0.43% (946 occurrences)
one = 0.42% (914 occurrences)
there = 0.37% (805 occurrences)
or = 0.36% (797 occurrences)
now = 0.36% (783 occurrences)
had = 0.36% (779 occurrences)
have = 0.35% (773 occurrences)
were = 0.31% (683 occurrences)
they = 0.30% (664 occurrences)
which = 0.30% (655 occurrences)
like = 0.30% (647 occurrences)
me = 0.29% (632 occurrences)
then = 0.29% (630 occurrences)
their = 0.28% (620 occurrences)
some = 0.28% (619 occurrences)
are = 0.28% (619 occurrences)
when = 0.28% (607 occurrences)
an = 0.27% (600 occurrences)
no = 0.27% (596 occurrences)
my = 0.27% (589 occurrences)
upon = 0.26% (568 occurrences)
what = 0.26% (566 occurrences)
out = 0.25% (539 occurrences)
up = 0.24% (524 occurrences)
into = 0.24% (523 occurrences)
more = 0.23% (508 occurrences)
if = 0.23% (501 occurrences)
man = 0.22% (476 occurrences)
them = 0.22% (474 occurrences)
we = 0.21% (455 occurrences)
sea = 0.21% (454 occurrences)
old = 0.21% (452 occurrences)
ship = 0.20% (438 occurrences)
ahab = 0.20% (436 occurrences)
ye = 0.20% (431 occurrences)
would = 0.20% (430 occurrences)
other = 0.19% (416 occurrences)
been = 0.19% (415 occurrences)
over = 0.19% (409 occurrences)
these = 0.19% (406 occurrences)
will = 0.18% (398 occurrences)
though = 0.18% (384 occurrences)
its = 0.17% (382 occurrences)
only = 0.17% (378 occurrences)
down = 0.17% (378 occurrences)
such = 0.17% (376 occurrences)
any = 0.17% (364 occurrences)
who = 0.16% (347 occurrences)
yet = 0.16% (345 occurrences)
head = 0.16% (344 occurrences)
time = 0.15% (334 occurrences)
long = 0.15% (334 occurrences)
her = 0.15% (332 occurrences)
do = 0.15% (324 occurrences)
very = 0.15% (323 occurrences)
about = 0.15% (318 occurrences)
still = 0.14% (312 occurrences)
than = 0.14% (311 occurrences)
captain = 0.14% (308 occurrences)
those = 0.14% (307 occurrences)
great = 0.14% (306 occurrences)
said = 0.14% (305 occurrences)
here = 0.14% (302 occurrences)
before = 0.14% (301 occurrences)
two = 0.14% (298 occurrences)
boat = 0.14% (297 occurrences)
has = 0.13% (294 occurrences)
must = 0.13% (293 occurrences)
most = 0.13% (284 occurrences)
seemed = 0.13% (283 occurrences)
white = 0.13% (281 occurrences)
last = 0.13% (278 occurrences)


Letter Frequency

freqL handles letter frequency. It is a simple fold (T.foldl) over the input, using the Map data structure from Data.Map.Strict (in the containers package), which acts as a simple hash structure with keys and values. The de facto standard Haskell compiler, GHC, ships with the base library, which includes the Data.Char module; unlike the hand-rolled range check in the Ruby version, we can simply ask whether a character is a letter with isAlpha, and convert it to lowercase with toLower. freqL owes its brevity to these standard library functions.

Thanks to these standard library functions, we can easily keep track of more than just the basic 26 letters of the alphabet: the commented-out isAlpha guard in freqL counts every Unicode letter, and that is the variant used for the French run near the end of this post (in the case of Moby Dick, there do not seem to be any such characters).
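To be fair to Ruby, it can make the same Unicode-aware test with a POSIX character class; a minimal sketch, separate from the solutions above:

# Ruby's counterpart to Haskell's isAlpha: [[:alpha:]] matches any Unicode
# letter, not just a-z.
["a", "Q", "é", "7", "!"].each do |c|
  puts "#{c.inspect}: letter? #{!(c =~ /[[:alpha:]]/).nil?}"
end
# "a": letter? true
# "Q": letter? true
# "é": letter? true
# "7": letter? false
# "!": letter? false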

Word Frequency

This is probably a convoluted way to keep track of words. I did not opt for regular expressions because I wanted to try a different approach instead of just translating the Ruby solution. I could have used the excellent Parsec library, but I felt like rolling my own solution.

freqW works by looking at just one character at a time, just like freqL. It also carries the state derived from the previously examined character in the wfsm field (for Word Finite State Machine, a fancy but still pertinent name). wfsm can record that the last character put us inside a word (WordIn), definitely outside a word (WordOut), or possibly outside a word (WordOutMaybe). Depending on the state of wfsm and the current character, freqW makes various choices.

Now, this mechanism isn’t without its warts. But still, I consider it somewhat elegant in its description of all possible states.
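To make those states concrete, here is a minimal Ruby sketch of the same idea (hypothetical code, not one of the solutions above; it also flushes a word that ends with the input):

# The same three states as WFSM: :in (inside a word), :out_maybe (just saw
# an apostrophe inside a word), :out (between words).
def words_fsm(src)
  state = :out
  proto = ""
  occs = Hash.new(0)
  emit = lambda do
    occs[proto] += 1
    proto = ""
  end
  src.each_char do |c|
    if c =~ /[[:alpha:]]/
      proto << c.downcase
      state = :in
    elsif c == "'" && state == :in
      proto << c
      state = :out_maybe
    else
      # Any other character ends the word, if we were in one.
      emit.call unless state == :out
      state = :out
    end
  end
  emit.call unless state == :out # flush the final word
  occs
end

p words_fsm("Call me Ishmael--don't wait")
# => {"call"=>1, "me"=>1, "ishmael"=>1, "don't"=>1, "wait"=>1}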

A Diff

For fun, let’s look at the diff of the outputs of the Ruby and Haskell versions. Interestingly, the letter frequency outputs were identical. The word frequency outputs did have some significant differences, such as the word whale occurring 972 and 1108 times in the Ruby and Haskell versions, respectively. The gap comes down to tokenization: the Ruby version only splits words at whitespace and double dashes, so hyphenated compounds like whale-boat count as single words, whereas the Haskell state machine treats any non-letter, including a single hyphen, as a word boundary (which is also why boat shows up only in the Haskell top 100). I’ve sorted the output by lines for saner diffing.
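Here is a tiny illustration of that divergence (a hypothetical snippet, not from either solution):

# The Ruby split keeps a hyphenated compound together, because the hyphen
# is not adjacent to whitespace...
p "the whale-boat sailed".split(/\W*\s\W*/)
# => ["the", "whale-boat", "sailed"]
# ...while the Haskell state machine, seeing that '-' is neither a letter
# nor an apostrophe, would emit "whale" and "boat" as separate words.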

$ diff -u routW houtW
--- routW   2015-04-22 22:01:59.061404962 -0700
+++ houtW   2015-04-22 22:02:57.679828155 -0700
@@ -1,100 +1,100 @@
-a = 2.20% (4760 occurrences)
+a = 2.18% (4788 occurrences)
 about = 0.15% (318 occurrences)
-ahab = 0.20% (432 occurrences)
-all = 0.70% (1523 occurrences)
-an = 0.28% (600 occurrences)
-and = 2.99% (6488 occurrences)
-any = 0.17% (363 occurrences)
-are = 0.29% (619 occurrences)
-as = 0.81% (1751 occurrences)
-at = 0.62% (1334 occurrences)
-be = 0.49% (1060 occurrences)
+ahab = 0.20% (436 occurrences)
+all = 0.70% (1543 occurrences)
+an = 0.27% (600 occurrences)
+and = 2.97% (6502 occurrences)
+any = 0.17% (364 occurrences)
+are = 0.28% (619 occurrences)
+as = 0.80% (1752 occurrences)
+at = 0.61% (1335 occurrences)
+be = 0.49% (1064 occurrences)
 been = 0.19% (415 occurrences)
-before = 0.14% (300 occurrences)
-but = 0.84% (1823 occurrences)
-by = 0.56% (1223 occurrences)
-captain = 0.14% (300 occurrences)
-do = 0.15% (321 occurrences)
-down = 0.17% (367 occurrences)
-for = 0.76% (1645 occurrences)
-from = 0.51% (1105 occurrences)
-great = 0.14% (303 occurrences)
+before = 0.14% (301 occurrences)
+boat = 0.14% (297 occurrences)
+but = 0.83% (1823 occurrences)
+by = 0.56% (1226 occurrences)
+captain = 0.14% (308 occurrences)
+do = 0.15% (324 occurrences)
+down = 0.17% (378 occurrences)
+for = 0.75% (1646 occurrences)
+from = 0.50% (1105 occurrences)
+great = 0.14% (306 occurrences)
 had = 0.36% (779 occurrences)
-has = 0.14% (294 occurrences)
-have = 0.36% (772 occurrences)
-he = 0.82% (1777 occurrences)
-head = 0.13% (275 occurrences)
-her = 0.15% (329 occurrences)
-here = 0.14% (299 occurrences)
-him = 0.49% (1062 occurrences)
-his = 1.17% (2530 occurrences)
-i = 0.92% (1988 occurrences)
-if = 0.23% (500 occurrences)
-in = 1.95% (4223 occurrences)
+has = 0.13% (294 occurrences)
+have = 0.35% (773 occurrences)
+he = 0.81% (1780 occurrences)
+head = 0.16% (344 occurrences)
+her = 0.15% (332 occurrences)
+here = 0.14% (302 occurrences)
+him = 0.49% (1067 occurrences)
+his = 1.15% (2530 occurrences)
+i = 0.91% (1993 occurrences)
+if = 0.23% (501 occurrences)
+in = 1.93% (4231 occurrences)
 into = 0.24% (523 occurrences)
-is = 0.81% (1747 occurrences)
-it = 1.12% (2419 occurrences)
-its = 0.18% (382 occurrences)
-last = 0.13% (276 occurrences)
-like = 0.27% (581 occurrences)
-long = 0.15% (319 occurrences)
-man = 0.21% (445 occurrences)
-me = 0.29% (621 occurrences)
-more = 0.23% (506 occurrences)
+is = 0.80% (1748 occurrences)
+it = 1.11% (2434 occurrences)
+its = 0.17% (382 occurrences)
+last = 0.13% (278 occurrences)
+like = 0.30% (647 occurrences)
+long = 0.15% (334 occurrences)
+man = 0.22% (476 occurrences)
+me = 0.29% (632 occurrences)
+more = 0.23% (508 occurrences)
 most = 0.13% (284 occurrences)
-must = 0.13% (292 occurrences)
+must = 0.13% (293 occurrences)
 my = 0.27% (589 occurrences)
-no = 0.27% (594 occurrences)
-not = 0.54% (1169 occurrences)
-now = 0.36% (779 occurrences)
-of = 3.10% (6708 occurrences)
-old = 0.20% (444 occurrences)
-on = 0.49% (1069 occurrences)
-one = 0.42% (906 occurrences)
+no = 0.27% (596 occurrences)
+not = 0.53% (1171 occurrences)
+now = 0.36% (783 occurrences)
+of = 3.07% (6732 occurrences)
+old = 0.21% (452 occurrences)
+on = 0.49% (1073 occurrences)
+one = 0.42% (914 occurrences)
 only = 0.17% (378 occurrences)
-or = 0.37% (797 occurrences)
+or = 0.36% (797 occurrences)
 other = 0.19% (416 occurrences)
-out = 0.24% (528 occurrences)
-over = 0.19% (403 occurrences)
-said = 0.14% (301 occurrences)
-sea = 0.18% (382 occurrences)
-see = 0.12% (268 occurrences)
+out = 0.25% (539 occurrences)
+over = 0.19% (409 occurrences)
+said = 0.14% (305 occurrences)
+sea = 0.21% (454 occurrences)
 seemed = 0.13% (283 occurrences)
-ship = 0.18% (391 occurrences)
-so = 0.49% (1061 occurrences)
-some = 0.29% (619 occurrences)
-still = 0.14% (311 occurrences)
+ship = 0.20% (438 occurrences)
+so = 0.49% (1066 occurrences)
+some = 0.28% (619 occurrences)
+still = 0.14% (312 occurrences)
 such = 0.17% (376 occurrences)
 than = 0.14% (311 occurrences)
-that = 1.38% (2999 occurrences)
-the = 6.74% (14616 occurrences)
-their = 0.29% (620 occurrences)
-them = 0.22% (471 occurrences)
-then = 0.29% (628 occurrences)
-there = 0.37% (792 occurrences)
-these = 0.19% (405 occurrences)
-they = 0.31% (664 occurrences)
-this = 0.66% (1440 occurrences)
+that = 1.37% (3005 occurrences)
+the = 6.67% (14620 occurrences)
+their = 0.28% (620 occurrences)
+them = 0.22% (474 occurrences)
+then = 0.29% (630 occurrences)
+there = 0.37% (805 occurrences)
+these = 0.19% (406 occurrences)
+they = 0.30% (664 occurrences)
+this = 0.66% (1443 occurrences)
 those = 0.14% (307 occurrences)
-thou = 0.12% (267 occurrences)
-though = 0.18% (383 occurrences)
-time = 0.15% (326 occurrences)
-to = 2.16% (4677 occurrences)
-two = 0.13% (288 occurrences)
-up = 0.24% (516 occurrences)
-upon = 0.26% (567 occurrences)
+though = 0.18% (384 occurrences)
+time = 0.15% (334 occurrences)
+to = 2.15% (4706 occurrences)
+two = 0.14% (298 occurrences)
+up = 0.24% (524 occurrences)
+upon = 0.26% (568 occurrences)
 very = 0.15% (323 occurrences)
-was = 0.76% (1645 occurrences)
+was = 0.75% (1646 occurrences)
 we = 0.21% (455 occurrences)
-were = 0.32% (683 occurrences)
-whale = 0.45% (972 occurrences)
+were = 0.31% (683 occurrences)
+whale = 0.51% (1108 occurrences)
 what = 0.26% (566 occurrences)
 when = 0.28% (607 occurrences)
 which = 0.30% (655 occurrences)
-who = 0.16% (345 occurrences)
-will = 0.18% (396 occurrences)
-with = 0.82% (1770 occurrences)
-would = 0.20% (428 occurrences)
-ye = 0.20% (428 occurrences)
-yet = 0.16% (344 occurrences)
-you = 0.44% (944 occurrences)
+white = 0.13% (281 occurrences)
+who = 0.16% (347 occurrences)
+will = 0.18% (398 occurrences)
+with = 0.81% (1770 occurrences)
+would = 0.20% (430 occurrences)
+ye = 0.20% (431 occurrences)
+yet = 0.16% (345 occurrences)
+you = 0.43% (946 occurrences)

French?

Here is the Haskell version’s output on the first volume of Les Misérables in the original French:

e = 14.68% (77528 occurrences)
a = 8.12% (42892 occurrences)
i = 7.65% (40424 occurrences)
t = 7.62% (40270 occurrences)
s = 7.27% (38395 occurrences)
n = 6.76% (35704 occurrences)
r = 6.25% (32985 occurrences)
u = 6.16% (32553 occurrences)
l = 5.81% (30686 occurrences)
o = 5.17% (27315 occurrences)
d = 3.46% (18262 occurrences)
c = 3.06% (16150 occurrences)
m = 2.99% (15800 occurrences)
p = 2.61% (13784 occurrences)
v = 1.95% (10285 occurrences)
é = 1.87% (9852 occurrences)
q = 1.26% (6637 occurrences)
f = 1.18% (6245 occurrences)
h = 1.06% (5623 occurrences)
b = 0.99% (5244 occurrences)
g = 0.93% (4910 occurrences)
j = 0.56% (2973 occurrences)
à = 0.53% (2795 occurrences)
x = 0.40% (2102 occurrences)
y = 0.39% (2051 occurrences)
è = 0.32% (1702 occurrences)
ê = 0.30% (1584 occurrences)
z = 0.18% (964 occurrences)
â = 0.08% (410 occurrences)
ç = 0.07% (355 occurrences)
û = 0.06% (335 occurrences)
ô = 0.05% (290 occurrences)
ù = 0.05% (285 occurrences)
w = 0.05% (284 occurrences)
î = 0.05% (276 occurrences)
k = 0.03% (151 occurrences)
ï = 0.01% (47 occurrences)
ë = 0.00% (5 occurrences)
ü = 0.00% (2 occurrences)
ñ = 0.00% (2 occurrences)
--------------------------------------------------------------------------------
de = 3.89% (4472 occurrences)
la = 2.64% (3040 occurrences)
et = 2.57% (2949 occurrences)
il = 2.25% (2582 occurrences)
le = 2.22% (2548 occurrences)
à = 1.94% (2236 occurrences)
les = 1.34% (1538 occurrences)
un = 1.27% (1459 occurrences)
que = 1.17% (1350 occurrences)
qui = 1.11% (1278 occurrences)
dans = 0.99% (1134 occurrences)
une = 0.92% (1062 occurrences)
ce = 0.92% (1062 occurrences)
en = 0.90% (1036 occurrences)
des = 0.82% (948 occurrences)
pas = 0.76% (879 occurrences)
se = 0.75% (859 occurrences)
ne = 0.73% (843 occurrences)
était = 0.69% (792 occurrences)
vous = 0.68% (783 occurrences)
je = 0.67% (770 occurrences)
avait = 0.66% (760 occurrences)
lui = 0.63% (721 occurrences)
du = 0.62% (714 occurrences)
elle = 0.57% (660 occurrences)
sur = 0.56% (640 occurrences)
sa = 0.55% (635 occurrences)
pour = 0.54% (620 occurrences)
son = 0.53% (611 occurrences)
au = 0.50% (579 occurrences)
cette = 0.48% (556 occurrences)
on = 0.47% (537 occurrences)
est = 0.46% (533 occurrences)
qu'il = 0.46% (528 occurrences)
a = 0.46% (524 occurrences)
tout = 0.45% (514 occurrences)
plus = 0.44% (508 occurrences)
comme = 0.44% (503 occurrences)
dit = 0.39% (446 occurrences)
avec = 0.38% (432 occurrences)
c'est = 0.36% (416 occurrences)
y = 0.35% (404 occurrences)
par = 0.34% (392 occurrences)
mais = 0.30% (350 occurrences)
nous = 0.30% (340 occurrences)
ses = 0.28% (321 occurrences)
là = 0.27% (308 occurrences)
bien = 0.27% (305 occurrences)
deux = 0.26% (303 occurrences)
monsieur = 0.26% (296 occurrences)
même = 0.26% (295 occurrences)
cela = 0.26% (295 occurrences)
ces = 0.26% (294 occurrences)
si = 0.24% (273 occurrences)
où = 0.23% (269 occurrences)
m = 0.23% (266 occurrences)
me = 0.21% (238 occurrences)
l'évêque = 0.21% (236 occurrences)
homme = 0.20% (234 occurrences)
sans = 0.20% (233 occurrences)
aux = 0.20% (232 occurrences)
fait = 0.20% (230 occurrences)
madeleine = 0.19% (214 occurrences)
qu'on = 0.18% (210 occurrences)
jean = 0.18% (210 occurrences)
d'un = 0.18% (208 occurrences)
c'était = 0.17% (199 occurrences)
valjean = 0.17% (197 occurrences)
être = 0.17% (196 occurrences)
fantine = 0.17% (192 occurrences)
d'une = 0.17% (190 occurrences)
javert = 0.15% (177 occurrences)
the = 0.15% (176 occurrences)
peu = 0.15% (173 occurrences)
cet = 0.15% (173 occurrences)
faire = 0.15% (172 occurrences)
puis = 0.15% (169 occurrences)
moi = 0.15% (168 occurrences)
j'ai = 0.14% (164 occurrences)
chose = 0.14% (164 occurrences)
été = 0.14% (163 occurrences)
maire = 0.14% (162 occurrences)
dire = 0.14% (159 occurrences)
rien = 0.14% (158 occurrences)
quand = 0.14% (157 occurrences)
sont = 0.13% (153 occurrences)
quelque = 0.13% (153 occurrences)
tous = 0.13% (152 occurrences)
porte = 0.13% (150 occurrences)
ou = 0.13% (148 occurrences)
toute = 0.13% (147 occurrences)
chapitre = 0.13% (144 occurrences)
sous = 0.12% (142 occurrences)
peut = 0.12% (140 occurrences)
mon = 0.12% (138 occurrences)
moment = 0.12% (138 occurrences)
dieu = 0.12% (137 occurrences)
encore = 0.12% (134 occurrences)
l'homme = 0.11% (130 occurrences)
eût = 0.11% (130 occurrences)

The most common French word in this book is de, meaning of in English. This is because the French word for the is spread across several distinct forms, most notably la and le, not to mention l’ as in l’homme (which you can see near the end of the list), due to the French language’s gender and vowel contraction rules (unlike in English, contractions like l’homme are mandatory in French regardless of register).

As a bit of trivia, dieu (God) edges out l’homme (man) by 7 occurrences in this text.

Conclusion

I hope you’ve had some fun working on these letter and word frequency problems. The word frequency problem, if you really want to solve it correctly, should be handled by a proper parser built on a robust library. By writing these programs, I learned that the input of a program (Unicode? ASCII only?) is just as important as its output.

Happy hacking!


  1. After I receive, assemble, and acclimate myself to it, I will post a review.↩︎