Hacking at the Voynich manuscript - Side notes
010 Word distribution maps

This is partly a remake of work from Notebook-2.txt, originally done
around 97-07-05.

Summary of previous relevant tasks:

  I obtained Landini's interlinear transcription of the VMs, version
  1.6 (landini-interln16.evt) from
  http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]

  Around 97-11-01 I split landini-interln16.evt into many files, with
  one text unit per page. [Notebook-12.txt]

  On 97-11-05 I mapped those files from FSG and other ad-hoc
  alphabets to EVA. [Notebook-12.txt] The files are L16-eva/fNNxx.YY,
  and a machine-readable description of their contents and logical
  order is in L16-eva/INDEX.

Then I went back and started redoing some of the previous tasks using
the new encoding.

97-12-04 stolfi
===============

I decided it was time to rebuild the word and label location maps,
with EVA-based encoding, in the light of the three-way paradigm.

The main intermediate file for a location map is a "raw concordance".
Each line of this file represents one occurrence of some string in
the text, in the format

    PNUM FNUM UNIT LINE TRANS START LENGTH POS STRING OBS
      1    2    3    4    5     6     7     8    9    10

where

  PNUM    is a sequential page number, "001" to "234".

  FNUM    is the corresponding folio-based page number, "f1r" thru
          "f86v5" to "f106v".

  UNIT    is the code of a text unit within the page, e.g. "P" or "R1".

  LINE    is the code of a line within that unit, e.g. "27" or "10a".

  TRANS   is a letter identifying a transcriber, e.g. "F" for Friedman.

  START   is the index of the first byte of the occurrence in the
          text line (counting from 1).

  LENGTH  is the original length of the occurrence in the text,
          including fillers, comments, spaces, etc.

  POS     is a number giving the approximate position of the
          occurrence within the whole text; used for sorting, etc.

  STRING  is the non-empty string in question, without any fillers,
          comments, non-significant spaces, line breaks, etc.

  OBS     is an arbitrary non-empty string, without embedded blanks.

The STRING field may extend across one or more line breaks. In that
case the line breaks are not included in the string and not counted
in the LENGTH.

For EVA-format text, the START is relative to column 20. In that case
LENGTH does not include columns 1-19 and "#"-comments. It does
include "{}"-comments, "!" and "%" fillers, and any ASCII blanks
beyond column 20.

The START and LENGTH fields are used only by programs that list or
highlight the occurrences in the original text. They may be 0 if not
known. Similarly, the POS field is used only for computing positional
correlations and building block-based (as opposed to page-based)
occurrence maps. It too can be 0 if not known.

For word-based maps, the STRING is a single and whole VMs word,
delimited by EVA word separators [-=,.]; or a sequence of two or more
consecutive words. The string may extend across comments, fillers,
and ordinary line breaks ("-"), but not across paragraph breaks ("=")
or changes of textual unit. (To simplify processing, the words in the
STRING field are always separated by a single ".", irrespective of
the original separators used in the text. The delimiters surrounding
the STRING are *not* included.)

In word-based concordances, a reasonable choice for the POS field is
the number of non-empty words preceding the occurrence in the sample.
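As a quick sanity check on this format, here is a minimal gawk filter
(a sketch of my own, not one of the scripts used below; "foo.roc" is
just a placeholder name) that flags raw-concordance lines with the
wrong field count or non-numeric START/LENGTH/POS:

  cat foo.roc \
    | gawk \
        ' (NF \!= 10) { printf "line %d: %d fields\n", NR, NF; next; } \
          ($6 \!~ /^[0-9]+$/) || ($7 \!~ /^[0-9]+$/) || ($8 \!~ /^[0-9]+$/) { \
            printf "line %d: bad START/LENGTH/POS\n", NR; \
          } \
        '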
97-11-17 stolfi
===============

First, I collected some "interesting" words: the ones whose
distribution is worth mapping, and which may have non-trivial
semantics attached.

From a message by R. Zandbergen I got the labels of f67r2, and
entered them as a new text unit, L16-eva/67r2.L. (Robert Firth once
conjectured they were the Ptolemaic planets.)

Using data posted by John Grove, I split several of my textual units
(L16-eva/f*) into smaller units, distinguishing real "parags" from
his so-called "titles". (Almost all of his "titles" are actually
short lines placed at the *end* of a paragraph.) The affected files
are:

  f1r.P     -> f1r.P1 f1r.T1 f1r.P2 f1r.T2 f1r.P3 f1r.T3 f1r.P4 f1r.T4
  f8r.P     -> f8r.P1 f8r.T1 f8r.P2 f8r.T2 f8r.P3 f8r.T3
  f9r.P     -> f9r.P f9r.T
  f16r.P    -> f16r.P1 f16r.T1 f16r.P2
  f18r.P    -> f18r.P f18r.T
  f19v.P    -> f19v.P f19v.T
  f22v.P    -> f22v.P f22v.T
  f24r.P    -> f24r.P f24r.T
  f25r.P    -> f25r.P f25r.T
  f27r.P    -> f27r.P f27r.T
  f28v.P    -> f28v.P1 f28v.T1 f28v.P2 f28v.T2
  f31r.P    -> f31r.P f31r.T
  f39r.P    -> f39r.P f39r.T
  f40v.P    -> f40v.P f40v.T
  f41v.P    -> f41v.P f41v.T
  f42r.P    -> f42r.P1 f42r.T1 f42r.P2 f42r.T2 f42r.P3 f42r.T3
  f42v.P    -> f42v.P f42v.T
  (new)     -> f57v.T
  (new)     -> f58v.T
  (new)     -> f65r.L
  (old)     -> f66r.W {entered months ago}
  f82r.P    -> f82r.P1 f82r.T1 f82r.P2
  (new)     -> f85r2.T
  f85r1.P   -> f85r1.P f85r1.T
  f86v5.P   -> f86v5.P f86v5.T
  f94r.P    -> f94r.P f94r.T
  f101v1.P  -> f101v1.P f101v1.T
  f101v2.P  -> f101v2.P f101v2.T
  f105r.P   -> f105r.P1 f105r.T1 f105r.P2 f105r.T2
  f108v.P   -> f108v.P f108v.T
  f114r.P   -> f114r.P1 f114r.T1 f114r.P2 f114r.T2

I collected all the labels, titles, and isolated words in one big
file:

  cat L16-eva/INDEX \
    | egrep -e '^[^:]*:[^:]*:[^:]*:[^:]*:(labels|words|titles):' \
    | sed -e 's/:.*$//g' \
    > lwt.units

  cat `cat lwt.units | sed -e 's/^/L16-eva\//g'` \
    > labtit.evt

I then reformatted this data by hand, producing a small database of
"interesting words and phrases", called labtit.idx. Each line of
this file describes one line of a label or title. There are six
fields, separated by "|":

    LOCATION|LABEL|CLASS|MEANING|SECTION|COMMENTS
       1      2     3      4       5        6

where

  LOCATION  a location code, e.g. "f86v5.T2.5a;C".

  LABEL     full label or title, in EVA, with EVA word separators.

  CLASS     class of label/title, one of

              P  label of plant/vessel, mostly in pharma and herbal
                 sections.
              T  short "title" line under a paragraph, various
                 sections.
              I  "item" label in the list of f66r.
              B  label on "biological" illustration, like f77v.
              N  conjectured planet name from f67r1.
              S  star label on astronomical maps.
              Z  label on "day" sectors of zodiac pictures.
              A  other label on astro/cosmo/zodiac diagram.

  MEANING   conjectured meaning, ending with "?", or just "?".

  SECTION   "herbal", "bio", etc., as in the L16-eva/INDEX file.

  COMMENTS  free format, with "_" instead of blanks, or just "-".

Example:

  f100r.t.5;C|sar.chas-daiind|P|plant?|pharma|two line label
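For instance, a one-liner in the same style as the scripts above (a
sketch, assuming labtit.idx is exactly as described) lists every
plant/vessel label together with its conjectured meaning:

  cat labtit.idx \
    | gawk 'BEGIN{FS="|"} ($3 == "P"){print $2, $4;}'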
97-12-06 stolfi
===============

Next I obtained the text proper. Since the frequency of occurrence
of a label may depend on the transcriber, it is important that we
use a single transcription to build the block-based map.

It is not advisable to use a mechanical consensus version for that.
For one thing, doing so requires mapping the text to an
error-tolerant alphabet, which makes the resulting map less valuable
and preempts some useful options, such as strict matching. More
importantly, the consensus-builder will tend to eliminate certain
easy-to-misread words (such as those ending with -i*n) in sections
where there are two versions, and keep them where there is only one
version---which only adds more noise to an already noisy map.

I chose the Friedman version (first [|] alternative, code ";F")
because it was the most complete.

I made a list of all normal-text units, in binding order:

  cat L16-eva/INDEX \
    | gawk \
        ' BEGIN{FS=":"} \
          ($5 \!~ /^(labels|letters|words|titles|[-?])/){print $1;}' \
    > vtx.units

Then I concatenated all units together, keeping only Friedman's
transcription "F":

  cat `cat vtx.units | sed -e 's/^/L16-eva\//g'` \
    | gawk '/^#/{next;} ($1 ~ /;F>/){print;}' \
    > vtx-f-eva.evt

  dicio-wc vtx-f-eva.evt

    lines   words     bytes file
   ------ ------- --------- ------------
     3901    7867    281253 vtx-f-eva.evt

Then I made a complete concordance from this file:

  cat vtx-f-eva.evt \
    | enum-text-phrases -f eva2erg.gawk \
        -v maxlen=15 \
    | map-field \
        -v inField=1 \
        -v outField=1 \
        -v table=fnum-to-pnum.tbl \
    | gawk '/./{print $0, "-";}' \
    > vtx-f-eva-15.roc

  read 33256 words
  wrote 85626 phrases

  cat vtx-f-eva-15.roc \
    | gawk '($9 \!~ /[.?*]/) {print;}' \
    | ( printf "good words = "; wc -l )

  good words = 33026

Checking the length distribution of single words:

  cat vtx-f-eva-15.roc \
    | gawk '($9 \!~ /[.]/) {print $9;}' \
    | count-word-lengths

    len nwords example
    --- ------ ------------------
      1    339 l
      2   1753 ol
      3   3239 ary
      4   6033 aiin
      5   8900 sodal
      6   6784 chckhy
      7   4167 okeolan
      8   1435 oqokaiin
      9    453 orchcthdy
     10    114 ykchedaiin
     11     25 chedyotaiin
     12      8 lshedyoraiin
     13      4 aiinaiiiriiir
     14      1 *******chedylo
     15      1 pchodolchopchal

Checking the length distribution with dots and all:

  cat vtx-f-eva-15.roc \
    | gawk '/./ {print $9;}' \
    | count-word-lengths

    len nwords example
    --- ------ ------------------
      1    339 l
      2   1753 ol
      3   3249 ary
      4   6111 aiin
      5   9096 sodal
      6   7344 chckhy
      7   5440 ol.lkan
      8   3864 aiin.ary
      9   4228 cheey.qor
     10   4956 chckhy.qol
     11   5822 chal.chcthy
     12   6173 qol.aiin.ary
     13   5786 chcthy.chckhy
     14   5303 cheey.qor.aram
     15   4932 chckhy.qol.aiin
     16   4935 qor.aram.ol.lkan
     17   4905 chcthy.chckhy.qol
     18   1265 ol.lkan.sodal.chal
     19    119 qokam.cham.**.ar.al
     20      6 lol.tar.shr.r.ol.ols

I recorded this as a shell variable:

  set maxlen = 20

Next I made a similar raw concordance file for the words and phrases
in "labtit.idx". The space and possible-space codes ".", ",", "-"
were interpreted as word breaks. (This policy maximizes the chance
of finding the label in the text.) The POS field was set to zero. I
considered multi-word phrases up to 15 characters long, which was
the maximum length of any label or title word present in the
database.

Once again the format is

    PNUM FNUM UNIT LINE TRANS START LENGTH POS STRING OBS
      1    2    3    4    5     6     7     8    9    10

where the OBS field is formed from the CLASS and MEANING fields of
the label/title database. I also eliminated identical phrases with
same location and offset, which originated from different but
concordant transcriptions of the same label.

  cat labtit.idx \
    | enum-label-phrases -f eva2erg.gawk \
        -v maxlen=15 \
    | map-field \
        -v inField=1 \
        -v outField=1 \
        -v table=fnum-to-pnum.tbl \
    | sort -b +8 -9 +1 -4 \
    | gawk \
        ' /./ { \
            f = $2; u = $3; l = $4; w = $9; \
            if ((w==wa)&&(f==fa)&&(u==ua)&&(l==la)) next; \
            print; wa = w; fa = f; ua = u; la = l; \
          } \
        ' \
    | sort -b +0 -1n +2 -5 +5 -6n \
    > labtit-def.roc

Checking the format:

  foreach f ( labtit-def vtx-f-eva-15 )
    echo " "; echo '=== '$f
    cat ${f}.roc \
      | egrep -v '^[0-9][0-9][0-9] f[0-9]+[rv][1-6]? [A-Za-z][0-9]* [0-9]+[a-z]? [A-Z] [0-9]+ [0-9]+ [0-9]+ [a-z.*?]+ [^ ]+$'
  end
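The map-field program used in the pipelines above is a small
table-driven field mapper. A minimal gawk equivalent (a stand-in
sketch, assuming the table contains "key value" lines and that the
mapped value of field inField simply overwrites field outField; the
real script may differ, e.g. by inserting a new field) would be:

  gawk \
    -v inField=1 \
    -v outField=1 \
    -v table=fnum-to-pnum.tbl \
    ' BEGIN { \
        while ((getline ln < table) > 0) \
          { split(ln, f, " "); map[f[1]] = f[2]; } \
      } \
      /./ { $outField = map[$inField]; print; } \
    '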
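Likewise, count-word-lengths is a small auxiliary script; a gawk
stand-in (a sketch, assuming it merely histograms the lengths of the
input words, one per line, and keeps the last example seen of each
length) could be:

  gawk \
    ' /./ { len = length($1); n[len]++; ex[len] = $1; } \
      END { for (len in n) printf "%3d %6d %s\n", len, n[len], ex[len]; } \
    ' \
  | sort -b +0 -1n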
Just for the record, I extracted the good label and title words by
themselves:

  cat labtit-def.roc \
    | gawk '($9 \!~ /[.*?]/) {print $9;}' \
    | sort | uniq \
    > labtit.dic

  dicio-wc labtit.dic

    lines   words     bytes file
   ------ ------- --------- ------------
      526     526      3656 labtit.dic

Just to make sure, I checked the size distribution of these words:

  cat labtit.dic \
    | count-word-lengths

    len nwords example
    --- ------ ------------------
      1      5 y
      2     12 yy
      3     28 sar
      4     80 ykdy
      5    110 ytain
      6    121 yteody
      7     84 qokeedy
      8     68 ychekchy
      9     27 ydaraishy
     10      4 sochorcfhy
     11      4 ilnirsireik
     12      2 saloiinsheol
     13      2 otcholcheaiin

So the choice of maxlen=15 was quite reasonable.

The next step was to assign each location in the text to a certain
"bin". Ideally, each bin should contain the same number of good
words, so that the number of matches in a bin is proportional to the
local density of references, undisturbed by bin size.

The results are hard to interpret if we mix different languages and
subject matters in the same bin. Ideally, a bin should contain a
single language and section. That is, we should split the text into
"divisions", each containing a maximal set of pages from the same
section and language, and then split each division into equal-sized
bins, as evenly as possible.

To try this idea, I created a file that maps textual unit to section
and language, starting from L16-eva/INDEX:

  cat L16-eva/INDEX \
    | gawk 'BEGIN{FS=":"} /./{print $1, ($2 "." $3)}' \
    > unit-to-division.tbl

Then I counted the number of good words in each division:

  cat vtx-f-eva-15.roc \
    | gawk '($9 \!~ /[.?*]/){print ($2 "." $3);}' \
    | map-field \
        -v inField=1 \
        -v outField=1 \
        -v table=unit-to-division.tbl \
    | gawk '{print $1}' \
    | sort | uniq -c | expand

    words  division
    -----  ---------
      687  ?.A
     1462  ?.B
      173  astro.?
     6690  bio.B
      170  cosmo.?
      139  cosmo.B
     7571  herbal.A
     3336  herbal.B
     2171  pharma.A
    10627  stars.B

Unfortunately, some divisions are too small, so this approach would
be messy --- it would leave too many leftover blocks. A compromise
is to separate the two languages only, ignoring sections, and then
divide each language group into blocks containing the same number of
good words.

To that end, I needed a table mapping page p-numbers to language:

  cat L16-eva/INDEX \
    | gawk 'BEGIN{FS=":"} ($6 \!= "-"){gsub(/p/,"",$6); print $6, $3;}' \
    | sort | uniq \
    > pnum-to-language.tbl

Pages should be homogeneous with respect to language, but let's
check anyway:

  cat pnum-to-language.tbl \
    | gawk '/./{if($1==p) print; p=$1;}'

Let's count how many good words we got for each language:

  cat vtx-f-eva-15.roc \
    | gawk '($9 \!~ /[.?*]/){print $1;}' \
    | map-field \
        -v inField=1 \
        -v outField=1 \
        -v table=pnum-to-language.tbl \
    | gawk '{print $1}' \
    | sort | uniq -c | expand

    words  lang
    -----  ----
      343  ?
    10429  A
    22254  B

We could have 10 blocks of A, 21 blocks of B, and one odd block of
indeterminate language in between, for a total of 32 blocks:

  10429/10 = 1043-
  22254/21 = 1060-

For the main text concordance, the block index can be computed by
counting good single words per language (after sorting by position
and length) and dividing the word count by the desired block sizes.
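(A quick check on these block sizes: 9 x 1043 + 1042 = 10429 and
20 x 1060 + 1054 = 22254, so with sizes 1043 for A and 1060 for B
only the last block of each language comes up slightly short; this
matches the per-block counts obtained below.)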
I added the block number at the end of the line:

    PNUM FNUM UNIT LINE TRANS START LENGTH POS STRING OBS LANG BLOCK
      1    2    3    4    5     6     7     8    9    10   11    12

Here it is:

  set blNumA = 10
  set blNumB = 21
  @ nblocks = ${blNumA} + 1 + ${blNumB}

  cat vtx-f-eva-15.roc \
    | sort +7 -8n +6 -7n \
    | map-field \
        -v inField=1 \
        -v outField=11 \
        -v table=pnum-to-language.tbl \
    | gawk \
        -v blNumA=${blNumA} -v blNumB=${blNumB} \
        ' BEGIN { \
            lo["A"] = 00; lo["?"] = blNumA; lo["B"] = blNumA+1; \
            sz["A"] = 1043; sz["?"] = 1050; sz["B"] = 1060; \
          } \
          ($9 \!~ /[?*]/) { \
            word = $9; lang = $11; \
            if (word \!~ /[.]/) n[lang]++; \
            bnum = lo[lang] + int((n[lang]-1)/sz[lang]); \
            print $0, sprintf("%02d", bnum); \
          } \
        ' \
    > vtx-f-eva-15.boc

For the label and title concordance, the block number must be
estimated indirectly from the page number. To do that we need a
table that maps sequential page numbers to blocks.

To create that file, I first extracted from the block-oriented text
concordance above a file with one line per single word (not phrase),
containing only

  PNUM   the sequential page number, e.g. "023".
  FNUM   the page f-number, e.g. "f103r2" (just in case).
  LANG   the language, "A", "?", or "B".
  BLOCK  the sequential block number.

Here it is:

  cat vtx-f-eva-15.boc \
    | gawk \
        ' ($9 \!~ /[.?*]/) { \
            lang = $11; bnum = $12; pnum = $1; fnum = $2; \
            print pnum, fnum, lang, bnum; \
          } ' \
    > vtx-f-eva-15.blk

First, I checked whether all blocks indeed had comparable amounts of
good words, and were language-homogeneous:

  cat vtx-f-eva-15.blk \
    | gawk '/./{print $4, $3}' \
    | sort +0 -1n +1 -2 | uniq -c | expand

    1043 00 A
    1043 01 A
    1043 02 A
    1043 03 A
    1043 04 A
    1043 05 A
    1043 06 A
    1043 07 A
    1043 08 A
    1042 09 A
     343 10 ?
    1060 11 B
    1060 12 B
    1060 13 B
    1060 14 B
    1060 15 B
    1060 16 B
    1060 17 B
    1060 18 B
    1060 19 B
    1060 20 B
    1060 21 B
    1060 22 B
    1060 23 B
    1060 24 B
    1060 25 B
    1060 26 B
    1060 27 B
    1060 28 B
    1060 29 B
    1060 30 B
    1054 31 B

From this file I extracted a table that gives the min, average, and
max block number in each page:

  cat vtx-f-eva-15.blk \
    | gawk \
        ' /./ { \
            pn = $1; lang = $3; bn = ($4 + 0); bc = 1000-bn; \
            if (bc > lo[pn]) lo[pn] = bc; \
            if (bn > hi[pn]) hi[pn] = bn; \
            sb[pn] += bn; ct[pn] ++; \
            lg[pn,lang] = lang; \
          } \
          END { \
            for(pn in ct) \
              { lang = (lg[pn,"A"] lg[pn,"?"] lg[pn,"B"]); \
                bn=int(sb[pn]/ct[pn]+0.5); \
                printf "%03d %s %2d %2d %2d\n", pn, lang, 1000-lo[pn], bn, hi[pn]; \
              } \
          } \
        ' \
    | sort +0 -1n \
    > pnum-block-ranges.tbl
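The bc = 1000-bn trick above computes a minimum with max-style
updates, exploiting the fact that block numbers never get anywhere
near 1000. A more direct gawk sketch of the same min/max bookkeeping
(my illustration, not the script actually used):

  cat vtx-f-eva-15.blk \
    | gawk \
        ' /./ { \
            pn = $1; bn = ($4 + 0); \
            if (\!(pn in lo) || (bn < lo[pn])) lo[pn] = bn; \
            if (bn > hi[pn]) hi[pn] = bn; \
          } \
          END { \
            for(pn in lo) printf "%03d %2d %2d\n", pn, lo[pn], hi[pn]; \
          } \
        ' \
    | sort +0 -1n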
This list omits pages that have no transcribed plain text. Since
those pages may have labels, we need to assign them by proximity:

  cat pnum-block-ranges.tbl pnum-to-language.tbl \
    | sort -b +0 -1n +1 -2 +3 -4r \
    | gawk \
        -v blNumA=${blNumA} -v blNumB=${blNumB} \
        ' BEGIN { bn["A"] = 0; bn["?"] = blNumA; bn["B"] = blNumA+1; } \
          /./ { \
            pnum = $1; lang = $2; \
            if (NF > 2 ) { bn[lang] = $4; next; } \
            else { printf "%03d %02d\n", pnum, bn[lang]; }\
          } \
        ' \
    > pnum-to-block.tbl

For the table headers, we also need a table that lists the f-number
of the first page in each block:

  cat vtx-f-eva-15.blk \
    | sort +3 -4n +0 -1n \
    | gawk 'BEGIN{bo="?"} /./{b=$4;f=$2;if(b\!=bo){print b,f;} bo=b;}' \
    > block-to-first-fnum.tbl

  cat block-to-first-fnum.tbl \
    | gawk '/./{print $2;}' \
    | rotate-labels -v width=3 -v shift=0 \
    > block-headings.txt

I also counted the number of good words per page:

  cat vtx-f-eva-15.blk \
    | gawk '/./{printf "p%03d %s\n", $1, $2}' \
    | sort +0 -1n | uniq -c | expand \
    > vtx-f-eva-words-per-page.cts

With the pnum-to-language and pnum-to-block tables, I prepared a
block-augmented concordance file for the "defining" occurrences of
all label and title words (in label/title units). Like the text
version, this one too has fields

    PNUM FNUM UNIT LINE TRANS START LENGTH POS STRING OBS LANG BLOCK
      1    2    3    4    5     6     7     8    9    10   11    12

Here it is:

  cat labtit-def.roc \
    | gawk '($9 \!~ /[?*]/) {print;}' \
    | map-field \
        -v inField=1 \
        -v outField=11 \
        -v table=pnum-to-language.tbl \
    | map-field \
        -v inField=1 \
        -v outField=12 \
        -v table=pnum-to-block.tbl \
    > labtit-def.boc

  dicio-wc labtit-def.boc

    lines   words     bytes file
   ------ ------- --------- ------------
      894   10728     42471 labtit-def.boc

Finally, I combined the two block-oriented concordance files
(labels/titles and text) into a single file, and added as a last
field

  PATT  the "pattern" of the string: a mapping of STRING that
        deletes "unimportant" details.

I also added a TAG field that can be used by the map-building script
to distinguish the "special" and "ordinary" occurrences. So the
resulting file has format

    PNUM FNUM UNIT LINE TRANS START LENGTH POS STRING OBS LANG BLOCK PATT TAG
      1    2    3    4    5     6     7     8    9    10   11    12   13   14

  foreach f ( labtit-def/+ vtx-f-eva-15/- )
    set name = "${f:h}"
    set tag = "${f:t}"
    echo "name=${name} tag=${tag}"
    cat ${name}.boc \
      | add-match-key -f eva2erg.gawk \
          -v inField=9 \
          -v outField=13 \
          -v erase_ligatures=1 \
          -v erase_plumes=1 \
          -v ignore_gallows_eyes=1 \
          -v join_ei=1 \
          -v equate_aoy=1 \
          -v collapse_ii=1 \
          -v equate_eights=1 \
          -v erase_q=1 \
          -v erase_word_spaces=1 \
      | gawk -v tag="${tag}" '/./{print $0, tag;}' \
      > ${name}.moc
  end

  dicio-wc {labtit-def,vtx-f-eva-15}.moc

    lines   words     bytes file
   ------ ------- --------- ------------
      894   11622     49267 labtit-def.moc
    84469 1098097   4538232 vtx-f-eva-15.moc
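Each add-match-key option above erases or equates some graphic
detail of the EVA string. A minimal gawk illustration of the idea (a
sketch only; the real eva2erg.gawk rules are more elaborate),
applying just erase_q (first gsub), equate_aoy (second), and
erase_word_spaces (third) to the STRING field, and appending the
resulting pattern as field 13 of a 12-field .boc line:

  cat vtx-f-eva-15.boc \
    | gawk \
        ' /./ { \
            pat = $9; \
            gsub(/q/, "", pat); \
            gsub(/[aoy]/, "o", pat); \
            gsub(/[.]/, "", pat); \
            print $0, pat; \
          } \
        '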
echo "nblocks = ${nblocks}" cat {labtit-def,vtx-f-eva-15}.moc \ | sort -b +12 -13 +13 -14 +8 -9 +0 -1n +7 -8n \ > f15-full.msc cat f15-full.msc \ | make-word-location-map \ -v nblocks=${nblocks} \ -v omitSingles=1 \ -v totOnly=0 \ > f15-full.map cat f15-full.map \ | format-word-location-map \ -v nblocks=${nblocks} \ -v html=1 \ -v maxlen=${maxlen} \ -v ctwd=3 \ -v blockHeadings=block-headings.txt \ -v title="Non-unique word patterns" \ -v showProps=1 \ -v showPattern=0 \ -v showAbsCounts=1 \ -v showRelCounts=0 \ -v showAvgPos=0 \ > f15-full.html Even with "omitSingles=1", this map was humongous (5 MBytes), so I manually split it into sections, and wrote an index f15-full-index.html pointing to the pieces. I also built a smaller version of this map, deleting all ordinary occurrences which were not equivalent to some special occurrence: echo "nblocks = ${nblocks}" cat {labtit-def,vtx-f-eva-15}.moc \ | sort -b +12 -13 +13 -14 +8 -9 +0 -1n +7 -8n \ | gawk '{pt=$13;tg=$14;if(tg\!="-")pts=pt; if(pt==pts) print;}' \ | make-word-location-map \ -v nblocks=${nblocks} \ -v omitSingles=0 \ -v totOnly=0 \ > f15-spec.map echo "maxlen = ${maxlen}" echo "nblocks = ${nblocks}" cat f15-spec.map \ | format-word-location-map \ -v nblocks=${nblocks} \ -v html=1 \ -v maxlen=${maxlen} \ -v ctwd=3 \ -v blockHeadings=block-headings.txt \ -v title="Label and title words" \ -v showProps=1 \ -v showPattern=0 \ -v showAbsCounts=1 \ -v showRelCounts=0 \ -v showAvgPos=0 \ > f15-spec.html I generated another map only with the totals per abstract pattern, for simple words only, all of them, main text only: echo "nblocks = ${nblocks}" cat {labtit-def,vtx-f-eva-15}.moc \ | sort -b +12 -13 +13 -14 +8 -9 +0 -1n +7 -8n \ | gawk '{st=$9;tg=$14;if((tg=="-")&&(st\!~/[.]/))print;}' \ | make-word-location-map \ -v nblocks=${nblocks} \ -v totOnly=1 \ > f-00-tot.map dicio-wc f-00-tot.map lines words bytes file ------ ------- --------- ------------ 2871 114840 404554 f-00-tot.map Let's sort the distributions by similarity and make a picture. There are too many of them, so let's omit the patterns that occur too rarely. First, let's make an histogram of the total counts: cat f-00-tot.map \ | gawk '{print $1;}' \ | sort +0 -1nr | uniq -c | expand \ | compute-cum-freqs It seems we can get enough patterns by looking just at the popular ones: echo "nblocks = ${nblocks}" cat Nofte-010/f-00-tot.map \ | gawk '($1 >= 10){print;}' \ > f-00-tot-p.map cat f-00-tot-p.map \ | sort-distr \ --numValues ${nblocks} \ --skip 6 \ --discrete --geometric \ --cluSort \ --delReSort --repeat 1 \ --verbose \ > f-00-tot-s.map I edited the file f-00-tot-s.map by hand, in an attempt to cluster similar entries toegther. Not very successful, though. I made a picture of it: cat f-00-tot-s.map \ | egrep -v '^ *$' \ | sort-distr \ --numValues ${nblocks} \ --skip 6 \ --discrete --geometric \ --repeat 0 \ --showPicture f-00-tot-s.ppm \ --verbose \ > /dev/null cat f-00-tot-s.ppm \ | ppmtogif \ > f-00-tot-s.gif xv f-00-tot-s.gif & /bin/rm -f f-00-tot-s.ppm I also compared the distribution of words with and without their "o" prefix: cat f-00-tot-p.map \ | gawk '/./ {k=($34 "~"); gsub(/^o*/, "", k); print k, $0;}' \ | sort +0 -1 +34 -35 \ | gawk '/./ {k=$1; if(k\!=ok){print "";} ok=k; print;}' \ | sed -e 's/^[^ ]* //g' \ > f-00-tot-o.map It looks like the distributions of these pairs are surprisingly similar... 
It seems we can get enough patterns by looking just at the popular
ones:

  echo "nblocks = ${nblocks}"
  cat Nofte-010/f-00-tot.map \
    | gawk '($1 >= 10){print;}' \
    > f-00-tot-p.map

  cat f-00-tot-p.map \
    | sort-distr \
        --numValues ${nblocks} \
        --skip 6 \
        --discrete --geometric \
        --cluSort \
        --delReSort --repeat 1 \
        --verbose \
    > f-00-tot-s.map

I edited the file f-00-tot-s.map by hand, in an attempt to cluster
similar entries together. Not very successful, though. I made a
picture of it:

  cat f-00-tot-s.map \
    | egrep -v '^ *$' \
    | sort-distr \
        --numValues ${nblocks} \
        --skip 6 \
        --discrete --geometric \
        --repeat 0 \
        --showPicture f-00-tot-s.ppm \
        --verbose \
    > /dev/null

  cat f-00-tot-s.ppm \
    | ppmtogif \
    > f-00-tot-s.gif

  xv f-00-tot-s.gif &

  /bin/rm -f f-00-tot-s.ppm

I also compared the distribution of words with and without their "o"
prefix:

  cat f-00-tot-p.map \
    | gawk '/./ {k=($34 "~"); gsub(/^o*/, "", k); print k, $0;}' \
    | sort +0 -1 +34 -35 \
    | gawk '/./ {k=$1; if(k\!=ok){print "";} ok=k; print;}' \
    | sed -e 's/^[^ ]* //g' \
    > f-00-tot-o.map

It looks like the distributions of these pairs are surprisingly
similar...

Then I made HTML versions of those maps:

  echo "maxlen = ${maxlen}"
  echo "nblocks = ${nblocks}"
  foreach f ( p s o )
    cat f-00-tot-${f}.map \
      | format-word-location-map \
          -v html=1 \
          -v nblocks=${nblocks} -v maxlen=${maxlen} \
          -v ctwd=3 \
          -v blockHeadings=block-headings.txt \
          -v title="Occurrences of all word patterns per block" \
          -v showProps=0 \
          -v showLineNumber=0 \
          -v showPattern=0 \
          -v totOnly=1 \
          -v showAbsCounts=0 \
          -v showRelCounts=1 \
          -v showAvgPos=0 \
      > f-00-tot-${f}.html
  end

  dicio-wc f-*.html

    lines   words     bytes file
   ------ ------- --------- ------------
      370   14659     65848 f-00-tot-o.html
      370   14659     65848 f-00-tot-p.html
      370   14659     65848 f-00-tot-s.html

Looking at these maps, it seems that these patterns are both very
common and uniformly distributed throughout the manuscript:

  totn  pattern   most common wd
  ----  --------  -----------------
   221  toin      kaiin~
   635  otoe      qokar~
   563  oe        or~
   764  eeeo      chey~
    62  eeee      ches~
    33  oteee     qokees~
    87  eeeteeo   chckhey~
    53  oeo       ary~
   813  otol      qokal~
   153  toe       kar~
   177  eoe       sar~
    37  eeeeol    sheeol~
    33  ot        ot~
    48  opeeeo    opchey~
    78  oeeeo     yshey~
    46  eetoin    chkaiin~
  1049  oteeo     qokeey~
    88  od        am~
   484  oto       qoky~
    47  eeto      chky~
    24  todo      tody~
    24  tod       kam~
    64  odo       ody~
    45  odoe      odar~
    95  teeeo     kchey~
    14  oeeee     oeees~
    48  deeeo     dchey~
   113  doie      dair~
    30  eeeeoe    cheeor~
    44  ooe       qoar~
    57  otodo     qotody~
    40  teo       key~
    43  teol      keol~
    62  oeteo     qockhy~
   259  eeeoe     cheor~
    55  eee       she~
    82  o         y~
   348  doe       dar~
    75  eo        sy~