Hacking at the Voynich manuscript - Side notes
011 Trying to identify possible plant names in the herbal pages.

Last edited on 1999-07-28 01:49:30 by stolfi

UNDER WORK
READ AT YOUR OWN COST AND RISK
HARD HATS REQUIRED
WE APOLOGIZE FOR THE INCONVENIENCE

Summary of previous relevant tasks:

  I obtained Landini's interlinear transcription of the VMs, version 1.6
  (landini-interln16.evt) from
  http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]

  Around 97-11-01 I split landini-interln16.evt into many files, with
  one text unit per page. [Notebook-12.txt]

  On 97-11-05 I mapped those files from FSG and other ad-hoc alphabets
  to EVA. [Notebook-12.txt] The files are L16-eva/fNNxx.YY, and a
  machine-readable description of their contents and logical order is
  in L16-eva/INDEX.

  These files were eventually superseded by L16+H-eva/*.

I. LOCATING PAGE-SPECIFIC WORDS

1998-01-25 [redone 1999-01-15]

  The idea is that a plant's name should appear once or twice on its
  own page, but very rarely on other pages.

  Of course, if the language is Chinese-like, the plant names will
  consist of two or more words, so we should consider phrases too.

  We can extract the information we need from the machine-readable
  concordance file Notes/037/vms-17-ok.hoc.gz. Recall that its format
  was

    LOC TRANS START LENGTH LCTX PHRASE RCTX PATT STAG PNUM HNUM
    1   2     3     4      5    6      7    8    9    10   11

  where

    LOC    is a line locator, like "f1r.11", "f86v2.R1.12a" etc.

    TRANS  is a letter identifying a transcriber, e.g. "F" for Friedman.

    START  is the index of the first byte of the occurrence
           in the text line (counting from 1).

    LENGTH is the original length of the occurrence in the text,
           including fillers, comments, spaces, etc.

    LCTX   is a word or word sequence, the left context of PHRASE.

    PHRASE is the non-empty phrase in question, without any fillers,
           comments, non-significant spaces, line breaks, etc.

    RCTX   is a word or word sequence, the right context of PHRASE.

    PATT   is a sorting pattern derived from PHRASE; see below.
    STAG   is a tag identifying a section of the VMS, e.g. "hea" or "bio".

    PNUM   is the page's p-number, which is sequential and better for sorting.

    HNUM   is the section number in the HTML-formatted concordance.

  First, let's make a list of the herbal text units and pages:

    cat L16+H-eva/INDEX \
      | gawk -v FS=':' '($3=="her" && $6=="parags"){print $0}' \
      > her.index

    cat her.index \
      | gawk -v FS=':' '/./{print $2;}' \
      > her.units

    cat her.index \
      | gawk -v FS=':' '/./{print substr($7,2,3);}' \
      | sort | uniq \
      > her.pages

  Checking the page sequence:

    seq 1 200 | gawk '//{printf "%03d\n",$1;}' > .foo
    diff her.pages .foo > .diff

  There are 128 herbal pages, of which 127 have text: p002-p095,
  p097-p111, p116, p118, p177-p178, and p185-p198. Page p115 (f65r)
  has an herbal-like drawing but contains only a "title" and no text.

  Pages 115 and 116 were mis-classified as "unk" instead of
  "hea"/"heb" in the L16+H-eva/INDEX file, release 1.6e6. Therefore we
  should remember to fix their STAG fields. Since the other half of
  the bifolio is language A, we can assume the same for pages 115 and
  116. (Besides, page 115 has only a title.) If we do that, then we
  have 96 herbal-A pages and 32 herbal-B pages.

  Which version shall we use? We could use the majority version ("A"),
  which is more reliable, or Takeshi's version ("H"), which is more
  complete (since the "A" version omits all words for which majority
  was not achieved).
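  The comparison between the two versions amounts to two set
  differences over location records. Here is a minimal Python sketch of
  that idea, assuming that "bool 1-2 X Y" (a local tool not shown in
  this note) keeps the lines of X that are absent from Y. The function
  name and the toy records are made up for illustration.

```python
def only_in_first(first, second):
    """Return the records of `first` that do not appear in `second`, sorted."""
    return sorted(set(first) - set(second))

# Toy records in the form "PNUM LOC START LENGTH" (hypothetical values).
a_locs = ["002 f1v.1 1 5", "002 f1v.1 7 4", "003 f2r.1 1 6"]
h_locs = ["002 f1v.1 1 5", "003 f2r.1 1 6", "003 f2r.2 1 3"]

a_minus_h = only_in_first(a_locs, h_locs)   # in "A" but not in "H"
h_minus_a = only_in_first(h_locs, a_locs)   # in "H" but not in "A"

# Sanity check: both differences must leave the same number of shared records.
shared_a = len(set(a_locs)) - len(a_minus_h)
shared_h = len(set(h_locs)) - len(h_minus_a)
assert shared_a == shared_h

print(a_minus_h)
print(h_minus_a)
```

  If the real files contain no duplicate lines, the same identity can
  be applied to their line counts as a quick consistency check:
  10934 - 457 = 10579 - 102 = 10477 for the counts reported below.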
  Let's compare them:

    foreach v ( A H )
      zcat ../037/vms-17-ok.hoc.gz \
        | gawk \
            ' (index($2,"'"$v"'") && ($6 \!~ /[-=/.,]/)) { \
                print substr($10,2,3), $1, $3, $4; \
              } \
            ' \
        | select-herbal-pages \
        | sort \
        > .$v.locs
    end

    bool 1-2 .A.locs .H.locs > .A-H.locs
    bool 1-2 .H.locs .A.locs > .H-A.locs

    dicio-wc .[A-Z]*.locs

      lines   words     bytes file
    ------- ------- --------- ------------
        457    1828      8329 .A-H.locs
      10934   43736    196935 .A.locs
        102     408      1883 .H-A.locs
      10579   42316    190489 .H.locs

  So there are only 457 words of the "A" version that are not words in
  Takeshi's; and there are only 102 words in Takeshi's that are not in
  "A". We might as well use the "A" version.

  OK, so let's extract from the concordance the phrases from the herbal
  sections, keeping only

    PNUM LOC PATT PHRASE STAG
    1    2   3    4      5

  Note that the PNUM is written without the "p".

    foreach f ( hea heb )
      zcat ../037/vms-17-ok.hoc.gz \
        | gawk \
            ' (index($2,"A") && ($6 \!~ /[-=/.,]/)) { \
                print substr($10,2,3), $1, $8, $6, $9; \
              } \
            ' \
        | gawk -v section=${f} \
            ' (($1 == "115")||($1 == "116")){$5 = "hea";} \
              ($5 == section){print;} \
            ' \
        | sort +2 -4 +0 -1n +1 -2 \
        > ${f}-17.roc
    end

    dicio-wc {hea,heb}-17.roc

      lines   words     bytes file
    ------- ------- --------- ------------
       7597   37985    216872 hea-17.roc
       3337   16685     96303 heb-17.roc

  Next we reduce the data to occurrence counts per page:

    COUNT PNUM STRING
    1     2    3

  where STRING is either WORD or PHRASE, and COUNT is the count of
  occurrences of STRING on page PNUM.
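  For readers who do not want to decode the sort/uniq pipeline, the
  reduction to COUNT PNUM STRING records can be sketched in Python as
  follows. This is only an illustration: the function name and the
  sample occurrences are hypothetical, and the output order (page,
  then decreasing count, then string) is my reading of the sort keys
  used in the commands that follow.

```python
from collections import Counter

def per_page_counts(records):
    """records: iterable of (pnum, string) pairs, one per occurrence.

    Returns a list of (count, pnum, string) tuples ordered by page number,
    then by decreasing count, then by string.
    """
    counts = Counter(records)
    rows = [(n, pnum, s) for (pnum, s), n in counts.items()]
    rows.sort(key=lambda r: (int(r[1]), -r[0], r[2]))
    return rows

# Hypothetical occurrences: two of "daiin" and one of "chol" on page 002,
# and one of "chol" on page 003.
occurrences = [("002", "daiin"), ("002", "chol"), ("002", "daiin"), ("003", "chol")]

for count, pnum, string in per_page_counts(occurrences):
    print(count, pnum, string)
```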
    foreach f ( hea heb )
      cat ${f}-17.roc \
        | gawk '/./{print $1, $3;}' \
        | sort +0 -1n +1 -2 | uniq -c | expand \
        | sort +1 -2n +0 -1nr +2 -3 \
        > ${f}-17-pat.spfr
      cat ${f}-17.roc \
        | gawk '/./{print $1, $4;}' \
        | sort +0 -1n +1 -2 | uniq -c | expand \
        | sort +1 -2n +0 -1nr +2 -3 \
        > ${f}-17-phr.spfr
    end

    dicio-wc {hea,heb}-17-{phr,pat}.spfr

      lines   words     bytes file
    ------- ------- --------- ------------
       6102   18306    109817 hea-17-phr.spfr
       4805   14415     85885 hea-17-pat.spfr
       2673    8019     48277 heb-17-phr.spfr
       2060    6180     36920 heb-17-pat.spfr

  As a control experiment, we will take all the phrases/patterns (with
  correct multiplicities) and assign them randomly to 128 pages with
  the correct sizes:

    foreach g ( a b )
      foreach f ( phr pat )
        cat he${g}-17-${f}.spfr \
          | randomize-distribution \
          > rn${g}-17-${f}.spfr
      end
    end

    dicio-wc {rna,rnb}-17-{phr,pat}.spfr

      lines   words     bytes file
    ------- ------- --------- ------------
       6424   19272    115356 rna-17-phr.spfr
       5168   15504     92175 rna-17-pat.spfr
       2808    8424     50585 rnb-17-phr.spfr
       2176    6528     38957 rnb-17-pat.spfr

  Let's compute the overall frequency of each pattern and phrase:

    foreach t ( hea rna heb rnb )
      foreach f ( phr pat )
        cat ${t}-17-${f}.spfr \
          | gawk '/./{print $1, $3;}' \
          | combine-counts \
          | sort +0 -1nr +1 -2 \
          > ${t}-17-${f}.sfr
      end
    end

    dicio-wc {hea,rna,heb,rnb}-17-{phr,pat}.sfr

      lines   words     bytes file
    ------- ------- --------- ------------
       2283    4566     33946 hea-17-phr.sfr
        996    1992     15004 hea-17-pat.sfr
       2283    4566     33946 rna-17-phr.sfr
        996    1992     15004 rna-17-pat.sfr
       1248    2496     18303 heb-17-phr.sfr
        622    1244      9213 heb-17-pat.sfr
       1248    2496     18303 rnb-17-phr.sfr
        622    1244      9213 rnb-17-pat.sfr

    foreach f ( phr pat )
      diff {hea,rna}-17-${f}.sfr
      diff {heb,rnb}-17-${f}.sfr
    end

    (no output)

  Page sizes:

    foreach t ( hea rna heb rnb )
      foreach f ( phr pat )
        cat ${t}-17-${f}.spfr \
          | gawk '/./{print $1, $2;}' \
          | combine-counts \
          | sort +0 -1nr +1 -2 \
          > ${t}-17-${f}.pfr
      end
    end

    dicio-wc {hea,rna,heb,rnb}-17-{phr,pat}.pfr

      lines   words     bytes file
    ------- ------- --------- ------------
         96     192      1152 hea-17-phr.pfr
         96     192      1152 hea-17-pat.pfr
         96     192      1152 rna-17-phr.pfr
         96     192      1152 rna-17-pat.pfr
         32      64       384 heb-17-phr.pfr
         32      64       384 heb-17-pat.pfr
         32      64       384 rnb-17-phr.pfr
         32      64       384 rnb-17-pat.pfr

    foreach f ( phr pat )
      diff {hea,rna}-17-${f}.pfr
      diff {heb,rnb}-17-${f}.pfr
    end

    (no output)

  Then we create a file that shows, for each string "w", the shape of
  its distribution over the pages, defined as the multiset of the
  nonzero per-page counts of that word, sorted in decreasing order.
  The file has one record per string in the format

    STRING TOTCT NPAGES NMISS SHAPE
    1      2     3      4     5

  where TOTCT is the total occurrence count of the string, NPAGES is
  the number of pages where the STRING occurs, NMISS is the number of
  pages where the word doesn't occur, and SHAPE is the nonzero counts,
  braced by "()" and separated by ",".

    foreach t ( hea rna heb rnb )
      foreach f ( phr pat )
        cat ${t}-17-${f}.spfr \
          | compute-distr-shape \
          | sort +1 -2n +3 -4n +4 -5 +0 -1 \
          > ${t}-17-${f}.sdsh
      end
    end

    dicio-wc {hea,rna,heb,rnb}-17-{phr,pat}.sdsh

      lines   words     bytes file
    ------- ------- --------- ------------
       2283   11415     48704 hea-17-phr.sdsh
        996    4980     25856 hea-17-pat.sdsh
       2283   11415     49355 rna-17-phr.sdsh
        996    4980     26570 rna-17-pat.sdsh
       1248    6240     24996 heb-17-phr.sdsh
        622    3110     14072 heb-17-pat.sdsh
       1248    6240     25270 rnb-17-phr.sdsh
        622    3110     14292 rnb-17-pat.sdsh

  Then we compute the number of words that have each histogram shape:

    foreach t ( hea rna heb rnb )
      foreach f ( phr pat )
        cat ${t}-17-${f}.sdsh \
          | gawk '//{print $2, $3, $5;}' \
          | sort | uniq -c | expand \
          | sort +1 -2n +0 -1nr +2 -3n +3 -4 \
          > ${t}-17-${f}.shfr
      end
    end

    dicio-wc {hea,rna,heb,rnb}-17-{phr,pat}.shfr

      lines   words     bytes file
    ------- ------- --------- ------------
        121     484      6594 hea-17-phr.shfr
        120     480      8060 hea-17-pat.shfr
        102     408      6347 rna-17-phr.shfr
        103     412      8074 rna-17-pat.shfr
         77     308      2875 heb-17-phr.shfr
         87     348      3583 heb-17-pat.shfr
         69     276      2821 rnb-17-phr.shfr
         79     316      3555 rnb-17-pat.shfr

STOPPED HERE -----

  Then we
  produce, for each distinct STRING, one record with the format

    TOTFR MAXFR PMAX SPECF CTMAX SZMAX TOTCT PATT
    1     2     3    4     5     6     7     8

  where

    TOTFR is the overall frequency of PATT.

    MAXFR is the maximum frequency of PATT in any page.

    PMAX  is the PNUM of a page where the freq of PATT is MAXFR.

    SPECF is the ratio MAXFR/TOTFR.

    CTMAX is the occurrence count of PATT in page PMAX.

    SZMAX is the total word occurrence count in page PMAX.

    TOTCT is the total occurrence count of PATT.

  Here it is:

    cat her-17.wpfr \
      | compute-tot-max-freqs \
      | sort -b +2 -3n +3 -4gr \
      > her-17.spc

    dicio-wc her-17.spc

      lines   words     bytes file
    ------- ------- --------- ------------
       1306   10448     51730 her-17.spc

  Let's format it a bit:

    cat her-17.spc \
      | gawk \
          ' BEGIN {pg="";} \
            /./ { if($3!=pg){printf "=== page %s word count = %s\n", $3, $6; pg=$3;} \
              printf "%7.5f %7.5f %7.5f %4d %4d %s\n", $1, $2, $4, $5, $7, $8; \
            } \
          ' \
      > her-17.spf

  Note that for each page there are many words with TOTCT=1, CTMAX=1
  that end up with high SPECF (around 8).

  Let's define a word or phrase as "super-specific" if it occurs two
  or more times on the same page, and zero times elsewhere. Here are
  the super-specific phrases:

    cat her-17.spc \
      | gawk '(($5>=$7)&&($7 >= 2)){print}' \
      > her-17-super.spc

    dicio-wc her-17-super.spc

      lines   words     bytes file
    ------- ------- --------- ------------
          4      32       162 her-17-super.spc

    0.0002 0.0022 058 8.9408 2  63 2 eeeoldo
    0.0002 0.0021 066 8.6319 2 112 2 oleedoin
    0.0003 0.0027 084 8.4124 3 149 3 otedol
    0.0002 0.0021 089 8.3664 2 157 2 dotedo

OLD VERSION

  Looking at the listing above, it seems that the first word of each
  page, as well as the last one, is characteristic of the page.
  Moreover, the first word often occurs at the beginning of the second
  paragraph, sometimes with slight modification.

  To check the first-word hunch, let's make a list of the first words
  and word clusters in each page.
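  One way to read "first words and word clusters" is: every run of
  consecutive words that starts at the first word of a line and stays
  within a letter budget. The Python sketch below illustrates that
  reading only. It assumes words are separated by "." and already
  free of fillers and comments; the function name, the default budget
  of 15 letters, and the sample line are illustrative and are not
  taken from the enum-text-phrases tool.

```python
def leading_clusters(line, maxlen=15):
    """Return the runs of leading words of `line`, shortest first.

    Words are separated by "."; a run is kept only while the total number
    of letters (separators not counted) does not exceed `maxlen`.
    """
    words = [w for w in line.split(".") if w]
    clusters = []
    for k in range(1, len(words) + 1):
        letters = sum(len(w) for w in words[:k])
        if letters > maxlen:
            break
        clusters.append(".".join(words[:k]))
    return clusters

# Toy line (hypothetical): 4 + 5 + 6 = 15 letters fit, the fourth word does not.
print(leading_clusters("pcho.daiin.chopol.shol"))
```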
  We begin by extracting the first unit of each page (from Friedman's
  transcription, which is the one we used to build the starting
  concordance in Note-010):

    cat her.units \
      | gawk \
          -v FS='.' \
          'BEGIN{pg="";} /./{if($1!=pg){print;pg=$1} next;}' \
      > her-first.units

    dicio-wc her-first.units

     lines   words     bytes file
    ------ ------- --------- ------------
       127     127       884 her-first.units

  Next we extract the first lines (usually numbered "1") from each of
  these units:

    cat `cat her-first.units | sed -e 's/^/L16-eva\//g'` \
      | egrep '[.][01][a]*;F*>' \
      > her-first-lines.txt

    dicio-wc her-first-lines.txt

     lines   words     bytes file
    ------ ------- --------- ------------
       126     252      9093 her-first-lines.txt

  (The missing page is f65v = p116, which doesn't have a Friedman
  version.)

  Next we output all initial word groups with up to 15 EVA chars (not
  counting spaces):

    cat her-first-lines.txt \
      | enum-text-phrases -f eva2erg.gawk \
          -v maxlen=15 \
      | gawk '($5==1){print $1, $8;}' \
      > her-first-phrases.tbl

      read 1036 words
      wrote 2332 phrases

  Now let's tabulate the specificity of each of those first-phrases:

    cat her-f-eva-15.spc \
      | gawk '/./ {printf "%s %5.3f\n", $6, $3;}' \
      > pat-to-specificity.tbl

    dicio-wc pat-to-specificity.tbl

     lines   words     bytes file
    ------ ------- --------- ------------
     18750   37500    329436 pat-to-specificity.tbl

    cat her-f-eva-15.spc \
      | gawk '/./ {printf "%s %7d\n", $6, $1;}' \
      > pat-to-totct.tbl

    cat her-first-phrases.tbl \
      | gawk '/./{print $1,$2,$2;}' \
      | fix-words -f word-equiv.gawk \
          -v field=2 \
          -v stripq=1 \
          -v equatekt=${equatekt} \
          -v equatepf=${equatepf} \
      | map-field \
          -v inField=2 -v outField=2 \
          -v table=pat-to-specificity.tbl \
          -v default="9.999" \
      | map-field \
          -v inField=3 -v outField=3 \
          -v table=pat-to-totct.tbl \
          -v default="0" \
      | gawk \
          ' BEGIN {fnum="";} \
            /./ { \
              if (($1 != fnum) &&(fnum != "")) {printf "\n";} \
              printf "%-6s %5.3f %4d %-15s %-15s\n", $1, $2, $3, $4, $5; \
              fnum=$1; \
            } \
          ' \
      > her-first-pats.fmt

  Checking whether all first-phrases were found in
  the concordance:

    grep ' 9\.999' her-first-pats.fmt

  Let's now select the first unique (specificity=1) prefix in each
  page:

    cat her-first-pats.fmt \
      | gawk \
          ' BEGIN {prevf="";done=1;} \
            /./ { \
              if($1!=prevf) \
                { if (! done) {print prevlin;} \
                  done=0; \
                } \
              if (($2==1.0)&&(! done)) {print; done=1;} \
              prevf = $1; prevlin = $0; \
            } \
          ' \
      > her-first-unique-pats.fmt

  These are the shortest page-specific phrases at the beginning of
  each page:

    page   spect totn pattern         phrase
    -----  ----- ---- --------------- ------------------
    f1v    1.000    1 kchro           kchry
    f2r    1.000    1 kydaino         kydainy
    f2v    1.000    1 kooiincheo      kooiin.cheo
    f3r    1.000    1 ksheos          tsheos
    f3v    1.000    1 koaiin          koaiin
    f4r    1.000    1 kodalcho        kodalchy
    f4v    1.000    1 pchooiin        pchooiin
    f5r    1.000    1 kshodopchoo     kshody.pchoy
    f5v    1.000    1 kocheor         k.o.cheor
    f6r    1.000    1 poaro           foar.y
    f6v    1.000    1 koaro           koary
    f7r    0.200    5 pchodaiin       fchodaiin
    f7v    1.000    1 polysho         polyshy
    f8r    1.000    1 pshol           pshol
    f8v    1.000    1 ckhodsoockh     cthod.soocth
    f9r    1.000    1 kydlo           tydlo
    f9v    1.000    1 pochor          fochor
    f10r   1.000    1 pchockhoshor    pchocthy.shor
    f10v   1.000    1 paiindaiin      paiin.daiin
    f11r   1.000    1 ksholschoal     tshol.schoal
    f11v   1.000    1 poldchodo       poldchody
    f13r   1.000    1 korshor         torshor
    f13v   1.000    1 koair           koair
    f14r   1.000    1 pchodaiinchopol pcho.daiin.chopol
    f14v   1.000    1 pdychoiin       pdychoiin
    f15r   1.000    1 kshorsheokchalo tshor.shey.tchaly
    f15v   1.000    1 poror           poror
    f16r   1.000    1 pocheodo        pocheody
    f16v   1.000    1 pchraiin        pchraiin
    f17r   1.000    1 pshododaram     fshody.daram
    f17v   1.000    1 pchodol         pchodol
    f18r   1.000    1 pdrairdo        pdrairdy
    f18v   1.000    1 kopd            tofd
    f19r   1.000    1 pchorodcho      pchor.qodchy
    f19v   1.000    1 pochaiinckhor   pochaiin.cthor
    f20r   1.000    1 kdchodo         kdchody
    f20v   1.000    1 paiis           faiis
    f21r   1.000    1 pchorochockho   pchor.oeeockhy
    f21v   1.000    1 koldsho         toldshy
    f22r   1.000    1 pololsho        pololshy
    f22v   1.000    1 pysaiinor       pysaiinor
    f23r   1.000    1 pydchdom        pydchdom
    f23v   1.000    1 podairol        podairol
    f24r   1.000    1 pororo          porory
    f24v   1.000    1 kchodarchocpho  tchodar.chocfhy
    f25r   1.000    1 pcholdososho    fcholdy.soshy
    f25v   1.000    1 pochaiinoko     poeeaiin.qoky
    f26r   1.000    1 psheoko         psheoky
    f26v   1.000    1 pchedarodaro    pchedar.qodary
    f27r   1.000    1 ksor            ksor
    f27v   1.000    1 pochop          fochof
    f28r   1.000    1 pchodar         pchodar
    f28v   1.000    1 ksholooiiin     kshol.qooiiin
    f29r   1.000    1 poraiin         poraiin
    f29v   1.000    1 kooiinshor      kooiin.shor
    f30r   1.000    1 okchesocheo     okchesy.chey
    f30v   1.000    1 ckhsckhain      cthscthain
    f31r   1.000    1 kchdeo          keedey
    f31v   1.000    1 podair          podair
    f32r   1.000    1 pchaiinshykeodo fchaiin.shykeody
    f32v   1.000    1 kcheodaiinchol  kcheodaiin.chol     (*)[1]
    f33r   1.000    1 kshdar          tshdar
    f33v   1.000    1 karardaiin      tar.ar.daiin
    f34r   1.000    1 pcheoepcho      pcheoepchy
    f34v   1.000    1 kechdochdo      kechdy.chdy
    f35r   1.000    1 ckhoorcholo     cthoo.rcholy
    f35v   1.000    1 parchor         parchor
    f36r   1.000    1 pchapdan        pchafdan
    f36v   1.000    1 pcharaso        pcharasy
    f37r   1.000    1 kocphol         tocphol
    f37v   1.000    1 kshodoockho     kshody.qocthy
    f38r   1.000    1 kolor           tolor
    f38v   1.000    1 okchopchol      okchop.chol         (*)[2]
    f39r   1.000    1 kedochshd       tedo.chshd
    f39v   1.000    1 pdair           pdair
    f40r   1.000    1 pcheokeodar     pchey.keodar
    f40v   1.000    1 pchedain        pchedain
    f41r   1.000    1 p*eo            p*ey
    f41v   0.500    2 pcheodo         pcheody
    f42r   1.000    1 ckhsho          cthsho
    f42v   1.000    1 pchockhoshcho   pcho.ctho.sheey     (*)[3]
    f43r   1.000    1 karodaiin       tarodaiin
    f43v   1.000    1 pdsairo         pdsairy
    f44r   1.000    1 kshodpo         tshodpy
    f44v   1.000    1 kshookshockhol  tsho.qotshy.cthol
    f45r   1.000    1 pykydal         pykydal
    f45v   1.000    1 koraro          korary
    f46r   1.000    1 pcheocpho       pcheocphy
    f46v   1.000    1 podolshed       pody.lshed
    f47r   1.000    1 pchair          pchair
    f47v   1.000    1 pcheok          pcheot
    f48r   1.000    1 pshdaiin        pshdaiin
    f48v   1.000    1 pcheodcho       pcheodchy
    f49r   1.000    1 pychol          pychol
    f50r   1.000    1 psheor          psheor
    f50v   1.000    1 kchodoldar      tchy.do.ldar
    f51r   1.000    1 kcholdchookcho  tcholdchy.qotchy
    f51v   1.000    1 poshodo         poshody
    f52r   1.000    1 kdokchcpho      tdokchcfhy
    f52v   1.000    1 pchorchcphol    pchor.chcphol
    f53r   1.000    1 kadam           kadam
    f53v   0.500    2 kshorsheo       tshor.shey
    f54r   1.000    1 podaiinshodal   podaiin.shodal
    f54v   1.000    1 pcheodar        pcheodar
    f55r   1.000    1 podaiinshekcho  podaiin.shekchy
    f55v   1.000    1 kchchdchdo      kcheedchdy
    f56r   1.000    1 o*chal          o*chal
    f56v   1.000    1 kcheak          kcheat
    f57r   1.000    1 pocho           poeeo
    f66v   1.000    1 okeodop         okeodof
    f87r   1.000    1 poal            poal
    f87v   1.000    1 pchchodaiin     pcheey.daiin
    f90r1  1.000    1 polcholokeol    poleeol.qokeol
    f90r2  1.000    1 koealchs        toealchs
    f90v2  1.000    1 cphdacho        cphdachy
    f90v1  1.000    1 pcheor          pcheor
    f93r   1.000    1 kodshol         kodshol
    f93v   1.000    1 porsheodo       porsheody
    f94r   1.000    1 kchdoopainr     tchdy.opainr
    f94v   1.000    1 kshedochedar    tshedy.chedar
    f95r1  1.000    1 kshdor          kshdor
    f95r2  1.000    1 kshedoor        kshedy.or
    f95v2  1.000    1 kchodopodar     tchody.podar
    f95v1  0.500    2 kolkchdo        toltchdy
    f96r   1.000    1 korchchor       tor.cheeor
    f96v   1.000    1 psheas          psheas

    (*) Entries fixed since first release of this note.
        See NOTES section below.

  Now let's create a table that maps the specific word patterns to
  colors:

    several occurrences, all on one page:     ff0000 red
    one occurrence:                           9900ff light purple
    several occurrences, >=1/2 on one page:   0000ff blue
    several occurrences, <1/2 on any page:    000000 black

    cat her-f-eva-15.spc \
      | gawk \
          ' ($1>1){ \
              if ($2>=$1) {print $6, "ff0000";} \
              else if ($2>=$1/2) {print $6, "0000ff";} \
              else {print $6, "000000";} \
            } \
          ' \
      > pat-to-color.tbl

    dicio-wc pat-to-color.tbl

     lines   words     bytes file
    ------ ------- --------- ------------
      1795    3590     27107 pat-to-color.tbl

    colorize-pages \
      -v stripq=1 \
      -v equatekt=${equatekt} \
      -v equatepf=${equatepf} \
      `cat her.units`

    make-page-index \
      `cat her.units`

  Rene questions whether the first paragraph is more special than the
  other paragraphs.
  So let's extract the first line out of the second paragraph:

    cat `cat her.units | sed -e 's/^/L16-eva\//g'` \
      | egrep ';F*>' \
      | gawk \
          ' BEGIN { curfn=""; parno=9999; done=0; } \
            /^[#]/ { next; } \
            /./ { \
              fn = $1; \
              # Reconstructed guess (original text lost in this copy): \
              # reduce the line locator to its page name. \
              gsub(/^<|[.].*$/, "", fn); \
              if (fn != curfn) \
                { # Reconstructed guess: warn if the previous page had no second paragraph. \
                  if ((curfn != "") && (! done)) \
                    { printf "%s: no second paragraph\n", curfn > "/dev/stderr"; } \
                  curfn=fn; parno=1; lineno=1; done=0; \
                } \
              else \
                { if (lineno == 9999) { parno++; lineno=0; } \
                  lineno++; \
                } \
              if ((parno == 2) && (lineno == 1)) { print; done=1; } \
              if (match($0, /[=]/)) { lineno=9999; } \
              next; \
            } \
          ' \
      > her-second-lines.txt

    dicio-wc her-second-lines.txt

    cat her-second-lines.txt \
      | enum-text-phrases -f eva2erg.gawk \
          -v maxlen=15 \
      | gawk '($5==1){print $1, $8;}' \
      > her-second-phrases.tbl

      read 672 words
      wrote 1561 phrases

    cat her-second-phrases.tbl \
      | gawk '/./{print $1,$2,$2;}' \
      | fix-words -f word-equiv.gawk \
          -v field=2 \
          -v stripq=1 \
          -v equatekt=${equatekt} \
          -v equatepf=${equatepf} \
      | map-field \
          -v inField=2 -v outField=2 \
          -v table=pat-to-specificity.tbl \
          -v default="9.999" \
      | map-field \
          -v inField=3 -v outField=3 \
          -v table=pat-to-totct.tbl \
          -v default="0" \
      | gawk \
          ' BEGIN {fnum="";} \
            /./ { \
              if (($1 != fnum) &&(fnum != "")) {printf "\n";} \
              printf "%-6s %5.3f %4d %-15s %-15s\n", $1, $2, $3, $4, $5; \
              fnum=$1; \
            } \
          ' \
      > her-second-pats.fmt

    grep ' 9\.999' her-second-pats.fmt

    cat her-second-pats.fmt \
      | gawk \
          ' BEGIN {prevf="";done=1;} \
            /./ { \
              if($1!=prevf) \
                { if (! done) {print prevlin;} \
                  done=0; \
                } \
              if (($2==1.0)&&(! done)) {print; done=1;} \
              prevf = $1; prevlin = $0; \
            } \
          ' \
      > her-second-unique-pats.fmt

  So, here are the shortest page-specific phrases at the beginning of
  the SECOND paragraph of each page:

    page   spect totn pattern         phrase
    -----  ----- ---- --------------- ------------------
    f1v    1.000    1 pokoo           potoy
    f2r    1.000    1 kydain          kydain
    f2v    1.000    1 kchorsho        kchor.shy
    f3r    1.000    1 pcheolshol      pcheol.shol
    f3v    1.000    1 kchorokcham     tchor.otcham
    f4r    1.000    1 pydaiinokcho    pydaiin.qotchy
    f4v    1.000    1 korchoshchor    torchy.sheeor
    f5r    1.000    1 kshoshodo       tshy.shody
    f6v    1.000    1 kchodoshockhol  tchody.shocthol
    f7r    1.000    1 ksholo          ksholo
    f7v    1.000    1 kchorsheod      kchor.sheod
    f8r    1.000    1 kchoep          tchoep
    f8v    1.000    1 pcharcho        pchar.cho
    f9r    1.000    1 pshoain         pshoain
    f9v    1.000    1 pchoropchcho    pchor.ypcheey
    f10r   1.000    1 ocheorckho      ycheor.cthy
    f10v   1.000    1 okchokor        qotchy.tor
    f11r   1.000    1 kcholshor       tchol.shor
    f13r   1.000    1 shorodosho      shorodo.shy
    f13v   1.000    1 poldaiin        fol.daiin
    f14r   1.000    1 sosho*chol      soshy.*chol
    f14v   1.000    1 ochododaiin     ychy.dy.daiin
    f16r   1.000    1 kchorchorchs    tchor.chor.chs
    f16v   1.000    1 pchockhochypcho pchocthy.chypchy
    f17r   1.000    1 kcho*           tcho*
    f18r   1.000    1 kchorshor       tchor.shor
    f19v   1.000    1 kookcheo        toy.tchey
    f20r   0.250    4 pchockho        pchocthy
    f20v   1.000    1 ksholpol        tshol.fol
    f21r   1.000    1 pchopo          pchofy
    f22r   1.000    1 pchaiinopcho    pchaiin.opchy
    f22v   1.000    1 pshor           fshor
    f23r   1.000    1 okoldookaiir    qokoldy.okaiir
    f23v   1.000    1 ksholshor       tshol.shor
    f24v   1.000    1 kocholchor      tochol.chor
    f26r   1.000    1 pchookedo       pcho.qokedy
    f27r   1.000    1 kcheocheo       kchey.chey
    f28v   1.000    1 kshoiin         tshoiin
    f29r   1.000    1 kcheolcheor     kcheol.cheor
    f29v   1.000    1 kochon          tochon
    f30r   1.000    1 opcholol        opchol.ol
    f31r   1.000    1 kshokeodo       tshoteody
    f31v   1.000    1 pchchodo        pcheeody
    f32r   1.000    1 pchokcheychedo  fcho.tcheychedy
    f32v   1.000    1 kshocphor       ksho.cphor
    f33v   1.000    1 kshdoshepchdo   tshdy.shefchdy
    f34r   1.000    1 kcheoolchckho   tcheo.olchckhy
    f34v   1.000    1 pchedarshear    pchedar.shear
    f35r   1.000    1 paiinchear      paiin.chear
    f36r   1.000    1 podaiir         podaiir
    f36v   1.000    1 kchorckhoiin    tchor.ckhoiin
    f37r   1.000    1 pchokcho        pchotchy
    f37v   1.000    1 okorchoiin      qotor.choiin
    f39r   1.000    1 pchdaiin        pchdaiin
    f39v   1.000    1 pardo           pardy
    f40r   1.000    1 ksheokcheo      ksheo.kchey
    f40v   1.000    1 kochschedo      toees.chedy
    f41r   1.000    1 shedeo          shedey
    f42r   1.000    1 pchocho         pcho.chy
    f42v   1.000    1 posheor         posheor
    f43r   1.000    1 psheso          pshesy
    f43v   1.000    1 oshedoockho     y.shedy.octhy
    f44r   1.000    1 kooshysho       toy.shysho
    f44v   1.000    1 ookalod         yokalod
    f45r   1.000    1 kolshopchor     kolsho.pchor
    f45v   1.000    1 okolchoiin      qotol.choiin
    f46r   1.000    1 kedodokedo      tedy.dotedy
    f47r   1.000    1 polr            folr
    f47v   1.000    1 pchodaiindair   pchodaiin.dair
    f48v   1.000    1 pchedarcheo     pchedar.chey
    f49r   1.000    1 kshoodain       ksho.qodain
    f52v   1.000    1 pcheolsholoiin  pcheol.sholoiin
    f54r   1.000    1 korari          korari
    f55r   1.000    1 kchedar         tchedar
    f55v   1.000    1 okchddaiin      okchd.daiin
    f56r   1.000    1 kchokokchol     tchoky.kchol
    f56v   1.000    1 kchocho         kcho.chy
    f57r   1.000    1 kcheodarokam    tcheodar.okam
    f66v   1.000    1 kchodsheodo     tchod.sheody
    f87r   1.000    1 psheodsho       psheodshy
    f87v   1.000    1 opchchockheo    opcheey.cthey
    f90v2  1.000    1 kcheodocpheal   tcheody.cpheal
    f90v1  1.000    1 ksheodal        ksheodal
    f94v   1.000    1 kedaiinchedo    tedaiin.chedy
    f95r1  1.000    1 kchdor          tchdor
    f95r2  1.000    1 kshod           tshod
    f95v1  1.000    1 kshdal          tshdal

NOTES

  (*) In the original release of this file, the word equivalence
  function had a minor bug that caused it to miss a few equivalences
  between composite and single words. Because of the bug, three pages
  had names that were not really specific:

    [1] f32v was "kcheodaiin", now is "kcheodaiin.chol"
    [2] f38v was "okchop",     now is "okchop.chol"
    [3] f42v was "pcho.ctho",  now is "pcho.ctho.sheey"
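
  For reference, the selection rule behind both tables of page-specific
  phrases (take the first candidate with specificity 1.0, otherwise
  fall back to the last candidate of the page) can be sketched in
  Python as below. This is an illustration under my reading of the
  gawk filter, not a replacement for it; the function name and the
  sample rows are hypothetical. Unlike the gawk filter, which has no
  END block, the sketch also applies the fallback to the last page of
  the input.

```python
def first_unique_per_page(rows):
    """rows: list of (page, specificity, phrase) tuples in file order.

    Returns {page: (specificity, phrase)}, using the first row of each page
    whose specificity is 1.0, or the last row of the page if none qualifies.
    """
    chosen = {}
    last = {}
    for page, spec, phrase in rows:
        last[page] = (spec, phrase)
        if page not in chosen and spec == 1.0:
            chosen[page] = (spec, phrase)
    # Fallback for pages without any fully specific candidate.
    for page, row in last.items():
        chosen.setdefault(page, row)
    return chosen

# Hypothetical candidates: pA has a unique two-word phrase, pB has none.
rows = [
    ("pA", 0.5, "abc"),
    ("pA", 1.0, "abc.def"),
    ("pA", 1.0, "abc.def.ghi"),
    ("pB", 0.1, "xyz"),
    ("pB", 0.2, "xyz.uvw"),
]
result = first_unique_per_page(rows)
print(result)
```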