    dicio-wc word-to-count.tbl

      lines   words     bytes file
     ------ ------- --------- ------------
       4730    9460     45540 word-to-count.tbl

  Let's also create a list of the word patterns that have either
  very high strangeness on a few pages, or moderately high
  strangeness on a fair number of pages.

  To do that, we start with a set of "chosen" words, initially
  empty, and sort the word-page strangeness file by page and by
  decreasing strangeness. Then we assign points to each word with at
  least 2 occurrences and strangeness above a certain minimum, up to
  a certain number of words per page. We exclude from consideration
  words that have already been chosen for coloring. The N
  highest-ranked words in the result are then added to the chosen
  words. This process is iterated until we have a reasonably large
  set of chosen words.

    cat page-word.stn \
      | fgrep -v -w -f words-special.dic \
      | sort +1 -2 +5 -6nr \
      | gawk \
          -v minStrang=3 \
          -v minOccs=2 \
          -v maxCands=3 \
          ' (($1 >= minOccs)&&($6 >= minStrang)){ \
              if ($2!=fp) { p = maxCands; fp=$2; } \
              if (p>0) { printf "%s %d\n", $0, p; p--; } \
            } ' \
      | sort +3 -4 \
      > page-word-strangest.stp

    cat page-word-strangest.stp \
      | gawk \
          ' /./ { \
              if ($4!=w) { \
                if(w!=""){ printf "%7d %7d %s\n", t, c, w; } \
                t=0; c=0; w=$4; \
              } \
              t += $7; c++; \
            } \
            END { if(w!="") { printf "%7d %7d %s\n", t, c, w; } } \
          ' \
      | sort +0 -1nr \
      > word-strangest.pts

    function strangeness(m, n, M, N, p, q) {
      # Computes strangeness of having m or more occurrences
      # of a word in n trials, where M is the count of that
      # word in the book, and N the total number of word
      # occurrences in the book.
      p = M/N; q = (N-M)/N;
      return (m - p*n)/sqrt(n*p*q);
    }

    function printout(mw, fn, i) {
      # prints $0 with "mw" inserted as field "$(fn)"
      if (NF < fn-1) { error("not enough output fields\n"); }
      if (fn == 1)
        { print mw, $0; }
      else if (fn == NF+1)
        { print $0, mw; }
      else
        { for (i=1;i [...]
      > page-word.stn

    dicio-wc page-word.stn

      loaded 4730 word counts (35177 word occurrences)
      loaded 225 page sizes (35177 word occurrences)

      lines   words     bytes file
     ------ ------- --------- ------------
      20111  120666    514381 page-word.stn

==== JUNK

  Now let's make a table that counts the number of pages where each
  pattern occurs at least once:

    /bin/rm .tmp
    foreach f ( `cat all.pages` )
      set ffile = "pages-frq/${f}.frq"
      echo "${ffile}"
      cat ${ffile} \
        | gawk '/./{ print $2; }' \
        | sort | uniq \
        >> .tmp
    end
    cat .tmp | sort | uniq -c | expand \
      | sort +0 -1nr \
      > pages-per-word-occurs.frq

  Ditto, counting pages where the word appears at least twice:

    /bin/rm .tmp
    foreach f ( `cat all.pages` )
      set ffile = "pages-frq/${f}.frq"
      echo "${ffile}"
      cat ${ffile} \
        | gawk '($1 >= 2){ print $2; }' \
        | sort | uniq \
        >> .tmp
    end
    cat .tmp | sort | uniq -c | expand \
      | sort +0 -1nr \
      > pages-per-word-plural.frq

  Ditto, counting pages where the word appears exactly once:

    /bin/rm .tmp
    foreach f ( `cat all.pages` )
      set ffile = "pages-frq/${f}.frq"
      echo "${ffile}"
      cat ${ffile} \
        | gawk '($1 == 1){ print $2; }' \
        | sort | uniq \
        >> .tmp
    end
    cat .tmp | sort | uniq -c | expand \
      | sort +0 -1nr \
      > pages-per-word-single.frq

    dicio-wc pages-per-word-{occurs,plural,single}.frq

      lines   words     bytes file
     ------ ------- --------- ------------
       4711    9422     73137 pages-per-word-occurs.frq
        546    1092      7857 pages-per-word-plural.frq
       4677    9354     72620 pages-per-word-single.frq

  Now let's get the words that occur
  {once,more-than-once,any_amount} on {many,some,a-few} pages:

    foreach k ( occurs plural single )
      cat pages-per-word-${k}.frq \
        | gawk '($1 >= 60){ print $2; }' \
        | sort \
        > words-${k}-on-many-pages.dic
      cat pages-per-word-${k}.frq \
        | gawk '(($1 < 60) && ($1 >= 6)){ print $2; }' \
        | sort \
        > words-${k}-on-some-pages.dic
      cat pages-per-word-${k}.frq \
        | gawk '($1 < 6){ print $2; }' \
        | sort \
        > words-${k}-on-rare-pages.dic
    end

    dicio-wc words-{occurs,plural,single}-on-{many,some,rare}-pages.dic

      lines   words     bytes file
     ------ ------- --------- ------------
         59      59       312 words-occurs-on-many-pages.dic
        513     513      3281 words-occurs-on-some-pages.dic
       4139    4139     31856 words-occurs-on-rare-pages.dic
         18      18        88 words-plural-on-many-pages.dic
        136     136       800 words-plural-on-some-pages.dic
        392     392      2601 words-plural-on-rare-pages.dic
          3       3        14 words-single-on-many-pages.dic
        539     539      3380 words-single-on-some-pages.dic
       4135    4135     31810 words-single-on-rare-pages.dic

  Now get from each page (1) the words that are not too common and
  occur at least twice on that page, and (2) the words that are not
  too common and occur once on that page.

    mkdir pages-clr
    foreach f ( `cat all.pages` )
      set ffile = "pages-frq/${f}.frq"
      set pfile = "pages-clr/${f}-p.tbl"
      set sfile = "pages-clr/${f}-s.tbl"
      /bin/rm -f ${sfile} ${pfile}
      cat ${ffile} \
        | gawk '($1 >= 2){ print $2; }' \
        | sort \
        | bool 1-2 - words-plural-on-many-pages.dic \
        | add-color-field -v outField=2 -v brightness=0.30 \
        > ${pfile}
      cat ${ffile} \
        | gawk '($1 == 1){ print $2; }' \
        | sort \
        | bool 1-2 - words-single-on-many-pages.dic \
        | add-color-field -v outField=2 -v brightness=0.21 \
        > ${sfile}
      dicio-wc ${sfile} ${pfile}
    end

  Let's make some color tables:

    Black: words that occur on >= N pages.
    Vivid colors: words that occur on < N pages, and twice on this page.
    Dimmer colors: words that occur on < N pages, and only once on this page.

    gawk \
      ' function expand(y,c) { \
          # add to c the largest multiple of y that \
          # fits in the RGB cube \
        BEGIN { ' \

  I started making a list of such words, special-words.txt. I also
  noted repeated words. Perhaps I should make a colorized version of
  all pages showing special and repeated words.
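
  The three-way rule listed above for the color tables (black, vivid,
  dimmer) can be sketched in Python as follows. This is only an
  illustration, not the gawk script used in these notes: the name
  `color_class` and the parameter `many_pages` are hypothetical, and
  the default of 60 is an assumption borrowed from the "many pages"
  cutoff used for the .dic lists, since the notes leave N unspecified.
  "Twice" is read as "at least twice", matching the `($1 >= 2)` test in
  the per-page commands.

```python
def color_class(pages_with_word, occs_on_page, many_pages=60):
    """Pick a color class for one word on one page.

    pages_with_word: number of pages on which the word occurs.
    occs_on_page:    occurrences of the word on the current page.
    many_pages:      hypothetical stand-in for the threshold N.
    """
    if occs_on_page < 1:
        raise ValueError("word does not occur on this page")
    if pages_with_word >= many_pages:
        return "black"      # too common to be characteristic
    if occs_on_page >= 2:
        return "vivid"      # uncommon, repeated on this page
    return "dim"            # uncommon, single occurrence here


# Made-up sample inputs, for illustration only.
print(color_class(120, 3))
print(color_class(10, 2))
print(color_class(10, 1))
```

  Keeping the threshold as a parameter makes it easy to try other
  values of N without touching the rule itself.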
  I created a preliminary table with some of those characteristic
  words, mapped to colors. Let's reduce the list by equivalence
  classes:

    cat word-to-color.tbl \
      | add-match-key -f eva2erg.gawk \
          -v inField=1 \
          -v outField=3 \
          -v erase_ligatures=1 \
          -v erase_plumes=0 \
          -v ignore_gallows_eyes=1 \
          -v join_ei=0 \
          -v equate_aoy=1 \
          -v collapse_ii=0 \
          -v equate_eights=1 \
          -v equate_pt=1 \
          -v erase_q=1 \
          -v erase_word_spaces=0 \
      | sort +2 -3 | uniq -f 2 \
      | gawk '/./{print $3,$2}' \
      > pat-to-color.tbl
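
  The reduction above (append a match key, sort on it, keep one line
  per key, output key and color) can be sketched in Python as follows.
  The function `toy_match_key` is a hypothetical stand-in: it only
  mimics the spirit of three of the options (`erase_q`, `equate_aoy`,
  `equate_pt`) and is not the real mapping in eva2erg.gawk. The sample
  words and color names are made up. Sorting the tuples by (key, word,
  color) approximates the whole-line tie-break of `sort`, after which
  the first entry per key is kept, as `uniq -f 2` does.

```python
def toy_match_key(word):
    """Hypothetical equivalence key; NOT the real eva2erg.gawk mapping."""
    key = word.replace("q", "")                         # in the spirit of erase_q
    return key.translate(str.maketrans("oyp", "aat"))   # equate a/o/y and p/t


def reduce_by_key(word_to_color, key_fn=toy_match_key):
    """Collapse (word, color) pairs to one color per equivalence key."""
    rows = sorted((key_fn(w), w, c) for w, c in word_to_color)
    pat_to_color = {}
    for key, _word, color in rows:
        # keep only the first row seen for each key
        pat_to_color.setdefault(key, color)
    return pat_to_color


# Made-up sample table, for illustration only.
table = [("qokedy", "red"), ("okeda", "blue"), ("daiin", "green")]
result = reduce_by_key(table)
print(result)
```

  Note that which color survives for a class depends only on sort
  order, so two words in the same class with different colors are
  resolved arbitrarily rather than by any notion of priority.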