Hacking at the Voynich manuscript
Notebook - volume 12

Warning: these notebooks aren't strictly chronological logs.
Sometimes I go back and redo things, clarify comments, delete garbage, etc.

97-11-04 stolfi
===============

I decided to unfold the "[|]" groups into separate lines. This
unfolding should make the consensus more consistent. Note that a
group like "[A.|.A]" or "[A|O]" may be considered a consensus,
whereas "[A|P]" may not be, depending on the definition of
consensus. Hence it seems sensible to unfold the alternatives
before computing the consensus. (In previous attempts at computing
the consensus, I would just take the first choice out of every
alternation.)

(I had tried doing the unfolding after mapping to EVA, but there
are some half-character choices like P[Z|] which would require
manual editing afterwards. Besides, it seems better to have an
unfolded version of the FSG encoding, preserving the "%" and "!"
alignment markers.)

For the unfolding, I wrote a filter "unfold-alternatives" to be
used with "filter-files". New transcriber codes were introduced
for the variant lines; see "f0.U" for details.

Note also that a line with n groups, like "A[P|F]ETR[II|O]G[A|P]E",
need only generate two lines, "APETRIIGAE" and "AFETROGPE", and
not 2^n, since the consensus should not be affected too much by
crossovers. (Perhaps this is true only if the alternations are
well-separated?) Besides, this interpretation is generally closer
to the way the '[|]' constructs are used in the file: each branch
represents one specific version of the transcription.
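The unfolding rule described above (one output line per branch position, rather than every combination of branches) can be sketched as follows. This is only an illustration in Python, not the actual "unfold-alternatives" filter: the function name, the regular expression, and the choice to reuse the last branch when a group has fewer branches than the widest group are all assumptions, and the new transcriber codes assigned to variant lines are not modeled.

```python
import re

# A "[...|...]" group with no nested brackets (assumed syntax).
GROUP = re.compile(r"\[([^\[\]]*)\]")

def unfold_alternatives(line):
    """Return the variants of `line`, one per branch position.

    The k-th variant takes the k-th branch of every group, so a line
    with n two-way groups yields 2 variants, not 2**n.
    """
    groups = [m.group(1).split("|") for m in GROUP.finditer(line)]
    if not groups:
        return [line]
    n_variants = max(len(g) for g in groups)
    variants = []
    for k in range(n_variants):
        def pick(match, k=k):
            branches = match.group(1).split("|")
            # Assumption: a group with fewer branches repeats its last one.
            return branches[min(k, len(branches) - 1)]
        variants.append(GROUP.sub(pick, line))
    return variants

if __name__ == "__main__":
    for variant in unfold_alternatives("A[P|F]ETR[II|O]G[A|P]E"):
        print(variant)
```

An empty branch, as in "P[Z|]", simply contributes nothing to its variant, so the half-character case needs no special handling in this sketch.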

    cat L16/INDEX \
      | sed -e 's/:.*$//g' \
      > .units.dir

    mkdir L16-unf
    foreach f ( `cat .units.dir ` )
      echo $f
      cat L16/$f \
        | unfold-alternatives \
        > L16-unf/$f
    end

    /bin/rm -f .diff
    foreach f ( `cat .units.dir ` )
      echo $f
      echo ' ' >> .diff
      echo '=== '$f' ===' >> .diff
      echo ' ' >> .diff
      diff L16/$f L16-unf/$f \
        | prettify-diff-output \
        >> .diff
    end

Expanded and complemented Landini's initial comments, producing
L16/f0.{A,I,J,E,S,U}. Included comments about my unfoldings
and edits.

    cp L16/INDEX L16-unf/

    tar cvf - L16 | gzip > L16.tgz

      -rw-r--r--   1 stolfi   staff     170606 Nov  5 19:08 L16.tgz

    rm -rf L16

Also added new unit L16/f77v.L (and L16-eva/f77v.L), with the labels
on figures of page f77v. (I should ask the folks in the mailing list
to check the labels...)

Then I converted these files to the new EVA encoding. I plan to work
as much as possible with that encoding, since it is "the way of the
future".

    mkdir L16-eva
    foreach f ( L16-unf/f[0-9]* )
      echo "$f -> L16-eva/${f:t}"
      cat ${f} \
        | fsg2eva \
        > L16-eva/${f:t}
    end

    /bin/rm .bugs
    foreach f ( L16-eva/f[0-9]* )
      echo "checking $f"
      cat ${f} \
        | validate-new-evt-format \
            -v chars='aoeilmnrchtpkfsqgjdvxy' \
        >>& .bugs
    end

Edited manually some occurrences of FSG and Currier codes within '{}'
comments. Also fixed a few dozen bugs (bad letters, leading ".",
missing lines). The file "f0.V" describes the recoding and the fixes.

[Oops, made a mistake in fsg2eva (mapped 'T' to 'th' instead of 'ch').
So now I am trying to redo the mapping without losing the manual edits:

    mv L16-eva L16-eva-th

    (recreate L16-eva mechanically as above)

    mkdir L16-eva-xx
    foreach f ( L16-eva/f[0-9]* )
      set fxx = "L16-eva-xx/${f:t}"
      echo "$f -> $fxx"
      cat ${f} \
        | sed -e '/^ ${fxx}
    end

    diff -r L16-eva-th L16-eva-xx \
      | prettify-diff-output \
      > .diff

    (check differences and edit as appropriate)

    cp -p L16-eva{-th,}/INDEX
    cp -p L16-eva{-th,}/f0.A
    cp -p L16-eva{-th,}/f0.V

    (fix fsg2eva code in f0.V)

OK, let's redo everything we were doing...

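The character check done above with "validate-new-evt-format" (flagging any text character outside the set given in "chars") might look roughly like the sketch below. It is a hedged illustration, not the real validator: the function name is hypothetical, and the treatment of a leading "<...>" locator, of '{}' comments, and of the filler characters in FILLERS are assumptions about the file format rather than facts taken from this notebook.

```python
import re

# Letter set passed to the validator in the notebook.
EVA_CHARS = set("aoeilmnrchtpkfsqgjdvxy")

# Assumed non-letter characters tolerated in the text (hypothetical set).
FILLERS = set(".,-=!% ")

# Assumed syntax: inline comments in braces, optional "<...>" locator at line start.
COMMENT = re.compile(r"\{[^}]*\}")
LOCATOR = re.compile(r"^<[^>]*>")

def find_bad_chars(line, allowed=EVA_CHARS, fillers=FILLERS):
    """Return the sorted list of characters in `line` that are neither
    allowed letters nor fillers, ignoring the locator and comments."""
    text = line.rstrip("\n")
    text = LOCATOR.sub("", text)
    text = COMMENT.sub("", text)
    return sorted({c for c in text if c not in allowed and c not in fillers})

if __name__ == "__main__":
    for sample in ["<f1r.1;H> fachys.ykal{T?}", "daiin.TOE"]:
        print(sample, "->", find_bad_chars(sample))
```

A leftover FSG or Currier capital letter in the text would be reported here; the same letter inside a '{}' comment is skipped, which is why those occurrences had to be edited by hand.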
97-11-08 stolfi =============== Let's compute again the digraph frequencies for English: cat engl-poi.txt | head -685 > .foo dicio-wc .foo lines words bytes file ------ ------- --------- ------------ 685 6929 36813 .foo cat .foo \ | tr ' ' '\012' \ | sed -e 's/$/./g' \ | count-digraph-freqs \ -v pad='.' \ -v chars='.abcdefghijklmnopqrstuvwxyz0123456789' \ -v showentropy=1 Digraph counts: TT . a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 5 6 7 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . 6929 . 799 249 298 235 145 235 104 516 555 57 37 200 419 137 388 173 15 168 531 868 93 64 492 . 148 . . 1 . 1 1 . . a 2413 280 3 40 73 114 . 35 40 . 105 . 25 168 52 380 3 34 . 255 266 311 33 78 44 . 71 3 . . . . . . . b 359 2 39 2 . . 115 . . . 12 3 . 34 . . 47 . . 22 10 2 45 1 . . 25 . . . . . . . . c 694 5 107 . 12 . 118 . . 92 40 . 35 19 . . 112 . 1 34 . 44 26 . . . 49 . . . . . . . . d 1405 873 36 . . 21 146 . 4 . 105 . . 14 8 9 97 . . 37 27 . 10 . 2 . 16 . . . . . . . . e 3710 1281 157 3 58 401 116 39 6 4 27 . 3 124 79 316 10 26 1 536 215 138 3 74 26 41 25 1 . . . . . . . f 652 219 39 . . . 79 39 . . 49 . . 15 . . 88 . . 58 . 35 31 . . . . . . . . . . . . g 577 198 24 . . . 70 . 6 87 24 . . 52 3 5 37 . . 37 24 . 10 . . . . . . . . . . . . h 1813 184 327 1 1 1 767 1 . . 216 . . 1 . 33 181 . . 15 8 53 17 . . . 7 . . . . . . . . i 2061 171 63 16 73 90 79 53 56 2 . . 14 100 96 562 81 10 . 69 228 245 3 46 . 1 . 3 . . . . . . . j 61 . 2 . . . 4 . . . . . . . . . 39 . . . . . 16 . . . . . . . . . . . . k 213 63 1 . . 1 85 1 . . 25 . . . . 27 2 . . . 8 . . . . . . . . . . . . . . l 1279 170 125 . 2 71 230 33 4 . 118 . 9 193 9 2 98 2 . . 2 14 14 1 11 . 168 . . . . . . 2 1 m 865 122 121 13 . . 220 1 . . 98 . . 3 7 3 83 26 . 58 8 . 35 . . . 67 . . . . . . . . 
n 2001 509 29 1 80 344 145 6 286 1 60 1 15 22 2 26 130 1 3 2 66 227 15 12 . 1 17 . . . . . . . . o 2249 291 9 8 24 33 5 195 10 44 50 . 38 66 137 258 115 25 . 211 63 115 369 31 149 . 3 . . . . . . . . p 482 86 42 3 . . 87 . . 5 16 . . 57 . . 57 33 . 57 20 9 10 . . . . . . . . . . . . q 23 . . . . . . . . . . . . . . . . . . . . . 23 . . . . . . . . . . . . r 1806 444 92 4 10 67 388 5 14 5 119 . 19 27 24 38 144 44 . 34 146 57 25 8 4 . 88 . . . . . . . . s 1933 741 78 8 21 . 212 . . 184 93 . 18 16 14 2 85 41 2 1 101 248 56 . 5 . 7 . . . . . . . . t 2547 657 90 . 10 . 274 4 . 772 143 . . 61 2 4 254 . . 56 53 65 43 . 13 . 46 . . . . . . . . u 878 77 19 11 20 23 26 1 47 . 19 . . 90 13 107 1 54 . 134 130 101 . 3 . 2 . . . . . . . . . v 319 2 13 . . . 242 . . . 45 . . . . . 17 . . . . . . . . . . . . . . . . . . w 746 87 186 . . 2 116 3 . 100 126 . . 4 . 40 58 . . 21 1 . 1 . . . 1 . . . . . . . . x 45 4 7 . 11 . 1 . . 1 2 . . . . . . 10 1 . . 8 . . . . . . . . . . . . . y 738 460 2 . 1 2 38 1 . . 13 . . 13 . 52 122 3 . 1 26 3 . 1 . . . . . . . . . . . z 7 1 3 . . . 2 . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . 0 1 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . 2 1 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . 5 1 . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . 6 2 . . . . . . . . . . . . . . . . . . . . 2 . . . . . . . . . . . . . 7 1 . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . 
----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 36813 6929 2413 359 694 1405 3710 652 577 1813 2061 61 213 1279 865 2001 2249 482 23 1806 1933 2547 878 319 746 45 738 7 1 1 1 1 1 2 1 Next-symbol probability (× 99): TT . a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 5 6 7 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 99 . 11 4 4 3 2 3 1 7 8 1 1 3 6 2 6 2 . 2 8 12 1 1 7 . 2 . . . . . . . . a 99 11 . 2 3 5 . 1 2 . 4 . 1 7 2 16 . 1 . 10 11 13 1 3 2 . 3 . . . . . . . . b 99 1 11 1 . . 32 . . . 3 1 . 9 . . 13 . . 6 3 1 12 . . . 7 . . . . . . . . c 99 1 15 . 2 . 17 . . 13 6 . 5 3 . . 16 . . 5 . 6 4 . . . 7 . . . . . . . . d 99 62 3 . . 1 10 . . . 7 . . 1 1 1 7 . . 3 2 . 1 . . . 1 . . . . . . . . e 99 34 4 . 2 11 3 1 . . 1 . . 3 2 8 . 1 . 14 6 4 . 2 1 1 1 . . . . . . . . f 99 33 6 . . . 12 6 . . 7 . . 2 . . 13 . . 9 . 5 5 . . . . . . . . . . . . g 99 34 4 . . . 12 . 1 15 4 . . 9 1 1 6 . . 6 4 . 2 . . . . . . . . . . . . h 99 10 18 . . . 42 . . . 12 . . . . 2 10 . . 1 . 3 1 . . . . . . . . . . . . i 99 8 3 1 4 4 4 3 3 . . . 1 5 5 27 4 . . 3 11 12 . 2 . . . . . . . . . . . j 99 . 3 . . . 6 . . . . . . . . . 63 . . . . . 26 . . . . . . . . . . . . k 99 29 . . . . 40 . . . 12 . . . . 13 1 . . . 4 . . . . . . . . . . . . . . l 99 13 10 . . 5 18 3 . . 9 . 1 15 1 . 8 . . . . 1 1 . 1 . 13 . . . . . . . . m 99 14 14 1 . . 25 . . . 11 . . . 1 . 9 3 . 7 1 . 4 . . . 8 . . . . . . . . n 99 25 1 . 4 17 7 . 14 . 3 . 1 1 . 1 6 . . . 3 11 1 1 . . 1 . . . . . . . . o 99 13 . . 1 1 . 9 . 2 2 . 2 3 6 11 5 1 . 9 3 5 16 1 7 . . . . . . . . . . p 99 18 9 1 . . 18 . . 1 3 . . 12 . . 12 7 . 12 4 2 2 . . . . . . . . . . . . q 99 . . . . . . . . . . . . . . . . . . . . . 99 . . . . . . . . . . . . r 99 24 5 . 1 4 21 . 1 . 7 . 1 1 1 2 8 2 . 2 8 3 1 . . . 5 . . . . 
. . . . s 99 38 4 . 1 . 11 . . 9 5 . 1 1 1 . 4 2 . . 5 13 3 . . . . . . . . . . . . t 99 26 3 . . . 11 . . 30 6 . . 2 . . 10 . . 2 2 3 2 . 1 . 2 . . . . . . . . u 99 9 2 1 2 3 3 . 5 . 2 . . 10 1 12 . 6 . 15 15 11 . . . . . . . . . . . . . v 99 1 4 . . . 75 . . . 14 . . . . . 5 . . . . . . . . . . . . . . . . . . w 99 12 25 . . . 15 . . 13 17 . . 1 . 5 8 . . 3 . . . . . . . . . . . . . . . x 99 9 15 . 24 . 2 . . 2 4 . . . . . . 22 2 . . 18 . . . . . . . . . . . . . y 99 62 . . . . 5 . . . 2 . . 2 . 7 16 . . . 3 . . . . . . . . . . . . . . z 99 14 42 . . . 28 . . . 14 . . . . . . . . . . . . . . . . . . . . . . . . 0 99 99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 . . . . 2 99 99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 99 . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 . . . . . . 5 99 . . . . . . . . . . . . . . . . . . . . 99 . . . . . . . . . . . . . 6 99 . . . . . . . . . . . . . . . . . . . . 99 . . . . . . . . . . . . . 7 99 . . . . . . . . . . . . . . . . . . . . 99 . . . . . . . . . . . . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 19 6 1 2 4 10 2 2 5 6 0 1 3 2 5 6 1 0 5 5 7 2 1 2 0 2 0 0 0 0 0 0 0 0 Previous-symbol probability (× 99): TT . a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 5 6 7 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 19 . 33 69 43 17 4 36 18 28 27 93 17 15 48 7 17 36 65 9 27 34 10 20 65 . 20 . . 99 . 99 99 . . a 6 4 . 11 10 8 . 5 7 . 5 . 12 13 6 19 . 7 . 14 14 12 4 24 6 . 10 42 . . . . . . . b 1 . 2 1 . . 3 . . . 1 5 . 3 . . 2 . . 1 1 . 5 . . . 3 . . . . . . . . c 2 . 4 . 2 . 3 . . 5 2 . 16 1 . . 5 . 4 2 . 2 3 . . . 7 . . . . . . . . d 4 12 1 . . 1 4 . 1 . 5 . . 1 1 . 4 . . 2 1 . 1 . . . 2 . . . . . . . . e 10 18 6 1 8 28 3 6 1 . 1 . 1 10 9 16 . 5 4 29 11 5 . 
23 3 90 3 14 . . . . . . . f 2 3 2 . . . 2 6 . . 2 . . 1 . . 4 . . 3 . 1 3 . . . . . . . . . . . . g 2 3 1 . . . 2 . 1 5 1 . . 4 . . 2 . . 2 1 . 1 . . . . . . . . . . . . h 5 3 13 . . . 20 . . . 10 . . . . 2 8 . . 1 . 2 2 . . . 1 . . . . . . . . i 6 2 3 4 10 6 2 8 10 . . . 7 8 11 28 4 2 . 4 12 10 . 14 . 2 . 42 . . . . . . . j 0 . . . . . . . . . . . . . . . 2 . . . . . 2 . . . . . . . . . . . . k 1 1 . . . . 2 . . . 1 . . . . 1 . . . . . . . . . . . . . . . . . . . l 3 2 5 . . 5 6 5 1 . 6 . 4 15 1 . 4 . . . . 1 2 . 1 . 23 . . . . . . 99 99 m 2 2 5 4 . . 6 . . . 5 . . . 1 . 4 5 . 3 . . 4 . . . 9 . . . . . . . . n 5 7 1 . 11 24 4 1 49 . 3 2 7 2 . 1 6 . 13 . 3 9 2 4 . 2 2 . . . . . . . . o 6 4 . 2 3 2 . 30 2 2 2 . 18 5 16 13 5 5 . 12 3 4 42 10 20 . . . . . . . . . . p 1 1 2 1 . . 2 . . . 1 . . 4 . . 3 7 . 3 1 . 1 . . . . . . . . . . . . q 0 . . . . . . . . . . . . . . . . . . . . . 3 . . . . . . . . . . . . r 5 6 4 1 1 5 10 1 2 . 6 . 9 2 3 2 6 9 . 2 7 2 3 2 1 . 12 . . . . . . . . s 5 11 3 2 3 . 6 . . 10 4 . 8 1 2 . 4 8 9 . 5 10 6 . 1 . 1 . . . . . . . . t 7 9 4 . 1 . 7 1 . 42 7 . . 5 . . 11 . . 3 3 3 5 . 2 . 6 . . . . . . . . u 2 1 1 3 3 2 1 . 8 . 1 . . 7 1 5 . 11 . 7 7 4 . 1 . 4 . . . . . . . . . v 1 . 1 . . . 6 . . . 2 . . . . . 1 . . . . . . . . . . . . . . . . . . w 2 1 8 . . . 3 . . 5 6 . . . . 2 3 . . 1 . . . . . . . . . . . . . . . x 0 . . . 2 . . . . . . . . . . . . 2 4 . . . . . . . . . . . . . . . . y 2 7 . . . . 1 . . . 1 . . 1 . 3 5 1 . . 1 . . . . . . . . . . . . . . z 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 . . . . 2 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 . . . . . . 5 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . 7 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 Symbol entropy: 4.094 Next-symbol entropy: 3.334 And now for Portuguese: cat port.txt \ | tr ' -' '\012\012' \ | tr -d '~' \ | egrep -v '^[bcçdfghijklmnpqrstuvwxyz]$' \ | head -6035 > .bar dicio-wc .bar lines words bytes file ------ ------- --------- ------------ 6035 6035 36915 .bar cat .bar \ | sed -e 's/$/./g' \ | count-digraph-freqs \ -v pad='.' \ -v chars='.aàáâãbcçdeéêfghiíjklmnoóôõöpqrstuúüvwxyz0123456789' \ -v showentropy=1 Digraph counts: TT . a à á â ã b c ç d e é ê f g h i í j k l m n o ó ô õ ö p q r s t u ú ü v w x y z ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . 6035 . 662 15 2 3 . 44 518 . 903 542 99 . 226 65 17 129 . 11 4 141 182 205 304 3 . . . 452 223 125 405 304 253 11 . 186 1 . . . a 3604 1209 . . . . . 30 105 123 424 . . . 8 53 . 76 2 1 1 200 166 183 39 . . . . 44 6 464 382 60 12 . . 12 . 1 . 3 à 15 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 . . . . . . . . . á 65 22 . . . . . . 5 . . . . . 5 . . . . . . . . . . . . . . . . 13 1 6 . . . 11 . 2 . . â 42 . . . . . . . . . . . . . . . . . . . . . 3 39 . . . . . . . . . . . . . . . . . . ã 225 . . . . . . . . . . . . . . . . . . . . . . . 225 . . . . . . . . . . . . . . . . . b 192 . 34 . . . . . 1 . 6 15 19 . . . . 25 . 18 . 9 . . 17 . . . . . . 23 11 7 5 . . 2 . . . . c 1148 5 191 . 2 1 . . . 1 . 241 . 9 . . 11 159 4 . . 8 . 6 436 . . . . . . 21 . 16 36 1 . . . . . . ç 216 . 17 . . . 129 . . . . . . . . . . . . . . . . . 14 . . 56 . . . . . . . . . . . . . . d 1781 8 396 . 1 . . . . 
. . 723 2 2 . 8 . 129 . 13 . . 1 . 373 . . . . . . 75 . . 50 . . . . . . . e 3859 1236 8 . . . . 1 51 13 44 . . . 43 46 . 75 1 50 . 157 278 438 27 . . . . 58 16 324 740 87 10 . . 26 . 116 . 14 é 250 101 . . . . . 1 8 . 1 . . . . . . 3 . . . . 26 . . . . . . . . 90 1 18 . . . . . . . 1 ê 35 . . . . . . . . . . . . . . . . . . . . . 3 23 . . . . . . . . 9 . . . . . . . . . f 436 4 78 . 4 . . . . . . 33 1 . . . . 156 56 . . . . . 57 . . . . . . 13 . 1 32 . . . . . 1 . g 431 6 12 . . . . . . . . 108 1 . . . . 90 . . . 1 4 3 32 . . . . . . 39 . . 134 . 1 . . . . . h 168 1 44 . . . . . . . . 5 . . . . . 1 . . . . . . 113 . . . . . . . . . 4 . . . . . . . i 1911 24 131 . 3 35 7 6 232 25 136 106 1 1 43 88 . . . 4 . 98 117 262 64 . . . . 29 . 96 224 94 2 1 . 40 . 6 . 36 í 101 1 . . . . . . 56 . 3 . . . 1 7 . . . . . . 3 8 . . . . . 4 . . 4 1 . . . 13 . . . . j 118 . 64 . 9 . . . . . . 26 . . . . . . . . . . . . 3 . . . . . . . . . 16 . . . . . . . k 6 . 2 . . . . . . . . 1 . . . . . . . . 1 2 . . . . . . . . . . . . . . . . . . . . l 1008 105 292 . 3 . . . 3 1 5 131 6 2 2 15 103 83 9 . . 1 20 . 123 12 . . . . 8 . 1 16 57 . . 10 . . . . m 1435 460 257 . 5 . . 27 . . . 222 19 . . . . 65 3 . . . . . 232 . . . 1 130 . . . . 11 3 . . . . . . n 1448 3 146 . . . 28 . 70 42 87 90 1 . 20 79 37 91 1 10 . . . . 125 7 9 . . . . . 50 503 24 10 . 15 . . . . o 3054 1200 2 . 2 . . 59 21 . 125 2 . . 1 30 . 42 . 5 . 131 248 198 5 . . . . 48 . 289 540 36 56 . . 14 . . . . ó 47 7 . . . . . 2 . . . . . . . 12 . . . . . 2 . . . . . . . 4 . 3 9 2 . . . . . 6 . . ô 9 . . . . . . . . . . . . . . . . . . . . . 7 2 . . . . . . . . . . . . . . . . . . õ 59 . . . . . . . . . . 59 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ö 1 . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p 878 1 161 . . . . . . . . 133 2 1 . . . 6 1 . . 128 . . 291 2 . . . . . 143 . . 9 . . . . . . . q 267 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259 . 8 . . 
. . . r 2089 307 340 . 20 2 1 6 42 3 33 450 . 12 69 25 . 321 2 . . . 58 13 121 12 . 2 . 1 1 25 22 149 37 . . 15 . . . . s 2538 1271 83 . 2 . 48 . 25 . 1 286 1 . 14 . . 67 12 . . . 53 . 74 9 . 1 . 42 13 . 87 316 122 . . 11 . . . . t 1673 . 434 . 12 1 12 . 6 . . 351 13 3 . . . 249 5 . . . 7 . 271 2 . . . . . 259 . . 47 . . . 1 . . . u 1176 45 138 . . . . 15 1 8 13 201 . . 3 3 . 45 5 6 . 124 251 61 . . . . . 65 . 86 39 50 . . . . . 2 . 15 ú 26 . . . . . . . . . . . . . . . . . . . . 5 7 7 . . . . . . . 1 4 2 . . . . . . . . ü 9 . . . . . . . . . . 2 . 5 . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . v 355 . 68 . . . . . . . . 116 85 . 1 . . 62 . . . 1 . . 22 . . . . . . . . . . . . . . . . . w 2 . 1 . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . x 133 . 12 . . . . . 4 . . 5 . . . . . 21 . . . . . . 85 . . . . 1 . . . 5 . . . . . . . . y 1 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . z 69 12 31 . . . . . . . . 11 . . . . . 13 . . . . 1 . 1 . . . . . . . . . . . . . . . . . ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 36915 6035 3604 15 65 42 225 192 1148 216 1781 3859 250 35 436 431 168 1911 101 118 6 1008 1435 1448 3054 47 9 59 1 878 267 2089 2538 1673 1176 26 9 355 2 133 1 69 Next-symbol probability (× 99): TT . a à á â ã b c ç d e é ê f g h i í j k l m n o ó ô õ ö p q r s t u ú ü v w x y z -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 99 . 11 . . . . 1 8 . 15 9 2 . 4 1 . 2 . . . 2 3 3 5 . . . . 7 4 2 7 5 4 . . 3 . . . . a 99 33 . . . . . 1 3 3 12 . . . . 1 . 2 . . . 5 5 5 1 . . . . 1 . 13 10 2 . . . . . . . . à 99 40 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 . . . . . . . . . 
á 99 34 . . . . . . 8 . . . . . 8 . . . . . . . . . . . . . . . . 20 2 9 . . . 17 . 3 . . â 99 . . . . . . . . . . . . . . . . . . . . . 7 92 . . . . . . . . . . . . . . . . . . ã 99 . . . . . . . . . . . . . . . . . . . . . . . 99 . . . . . . . . . . . . . . . . . b 99 . 18 . . . . . 1 . 3 8 10 . . . . 13 . 9 . 5 . . 9 . . . . . . 12 6 4 3 . . 1 . . . . c 99 . 16 . . . . . . . . 21 . 1 . . 1 14 . . . 1 . 1 38 . . . . . . 2 . 1 3 . . . . . . . ç 99 . 8 . . . 59 . . . . . . . . . . . . . . . . . 6 . . 26 . . . . . . . . . . . . . . d 99 . 22 . . . . . . . . 40 . . . . . 7 . 1 . . . . 21 . . . . . . 4 . . 3 . . . . . . . e 99 32 . . . . . . 1 . 1 . . . 1 1 . 2 . 1 . 4 7 11 1 . . . . 1 . 8 19 2 . . . 1 . 3 . . é 99 40 . . . . . . 3 . . . . . . . . 1 . . . . 10 . . . . . . . . 36 . 7 . . . . . . . . ê 99 . . . . . . . . . . . . . . . . . . . . . 8 65 . . . . . . . . 25 . . . . . . . . . f 99 1 18 . 1 . . . . . . 7 . . . . . 35 13 . . . . . 13 . . . . . . 3 . . 7 . . . . . . . g 99 1 3 . . . . . . . . 25 . . . . . 21 . . . . 1 1 7 . . . . . . 9 . . 31 . . . . . . . h 99 1 26 . . . . . . . . 3 . . . . . 1 . . . . . . 67 . . . . . . . . . 2 . . . . . . . i 99 1 7 . . 2 . . 12 1 7 5 . . 2 5 . . . . . 5 6 14 3 . . . . 2 . 5 12 5 . . . 2 . . . 2 í 99 1 . . . . . . 55 . 3 . . . 1 7 . . . . . . 3 8 . . . . . 4 . . 4 1 . . . 13 . . . . j 99 . 54 . 8 . . . . . . 22 . . . . . . . . . . . . 3 . . . . . . . . . 13 . . . . . . . k 99 . 33 . . . . . . . . 17 . . . . . . . . 17 33 . . . . . . . . . . . . . . . . . . . . l 99 10 29 . . . . . . . . 13 1 . . 1 10 8 1 . . . 2 . 12 1 . . . . 1 . . 2 6 . . 1 . . . . m 99 32 18 . . . . 2 . . . 15 1 . . . . 4 . . . . . . 16 . . . . 9 . . . . 1 . . . . . . . n 99 . 10 . . . 2 . 5 3 6 6 . . 1 5 3 6 . 1 . . . . 9 . 1 . . . . . 3 34 2 1 . 1 . . . . o 99 39 . . . . . 2 1 . 4 . . . . 1 . 1 . . . 4 8 6 . . . . . 2 . 9 18 1 2 . . . . . . . ó 99 15 . . . . . 4 . . . . . . . 25 . . . . . 4 . . . . . . . 8 . 6 19 4 . . . . . 13 . . ô 99 . . . . . . . . 
. . . . . . . . . . . . . 77 22 . . . . . . . . . . . . . . . . . . õ 99 . . . . . . . . . . 99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ö 99 . . . . . . 99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p 99 . 18 . . . . . . . . 15 . . . . . 1 . . . 14 . . 33 . . . . . . 16 . . 1 . . . . . . . q 99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 . 3 . . . . . r 99 15 16 . 1 . . . 2 . 2 21 . 1 3 1 . 15 . . . . 3 1 6 1 . . . . . 1 1 7 2 . . 1 . . . . s 99 50 3 . . . 2 . 1 . . 11 . . 1 . . 3 . . . . 2 . 3 . . . . 2 1 . 3 12 5 . . . . . . . t 99 . 26 . 1 . 1 . . . . 21 1 . . . . 15 . . . . . . 16 . . . . . . 15 . . 3 . . . . . . . u 99 4 12 . . . . 1 . 1 1 17 . . . . . 4 . 1 . 10 21 5 . . . . . 5 . 7 3 4 . . . . . . . 1 ú 99 . . . . . . . . . . . . . . . . . . . . 19 27 27 . . . . . . . 4 15 8 . . . . . . . . ü 99 . . . . . . . . . . 22 . 55 . . . 22 . . . . . . . . . . . . . . . . . . . . . . . . v 99 . 19 . . . . . . . . 32 24 . . . . 17 . . . . . . 6 . . . . . . . . . . . . . . . . . w 99 . 50 . . . . . . . . . . . . . . 50 . . . . . . . . . . . . . . . . . . . . . . . . x 99 . 9 . . . . . 3 . . 4 . . . . . 16 . . . . . . 63 . . . . 1 . . . 4 . . . . . . . . y 99 99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . z 99 17 44 . . . . . . . . 16 . . . . . 19 . . . . 1 . 1 . . . . . . . . . . . . . . . . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 16 10 0 0 0 1 1 3 1 5 10 1 0 1 1 0 5 0 0 0 3 4 4 8 0 0 0 0 2 1 6 7 4 3 0 0 1 0 0 0 0 Previous-symbol probability (× 99): TT . a à á â ã b c ç d e é ê f g h i í j k l m n o ó ô õ ö p q r s t u ú ü v w x y z -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 16 . 18 99 3 7 . 23 45 . 50 14 39 . 51 15 10 7 . 9 66 14 13 14 10 6 . . . 51 83 6 16 18 21 42 . 52 50 . . . 
a 10 20 . . . . . 15 9 56 24 . . . 2 12 . 4 2 1 17 20 11 13 1 . . . . 5 2 22 15 4 1 . . 3 . 1 . 4 à 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . á 0 . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . 1 . . . . . 3 . 1 . . â 0 . . . . . . . . . . . . . . . . . . . . . . 3 . . . . . . . . . . . . . . . . . . ã 1 . . . . . . . . . . . . . . . . . . . . . . . 7 . . . . . . . . . . . . . . . . . b 1 . 1 . . . . . . . . . 8 . . . . 1 . 15 . 1 . . 1 . . . . . . 1 . . . . . 1 . . . . c 3 . 5 . 3 2 . . . . . 6 . 25 . . 6 8 4 . . 1 . . 14 . . . . . . 1 . 1 3 4 . . . . . . ç 1 . . . . . 57 . . . . . . . . . . . . . . . . . . . . 94 . . . . . . . . . . . . . . d 5 . 11 . 2 . . . . . . 19 1 6 . 2 . 7 . 11 . . . . 12 . . . . . . 4 . . 4 . . . . . . . e 10 20 . . . . . 1 4 6 2 . . . 10 11 . 4 1 42 . 15 19 30 1 . . . . 7 6 15 29 5 1 . . 7 . 86 . 20 é 1 2 . . . . . 1 1 . . . . . . . . . . . . . 2 . . . . . . . . 4 . 1 . . . . . . . 1 ê 0 . . . . . . . . . . . . . . . . . . . . . . 2 . . . . . . . . . . . . . . . . . . f 1 . 2 . 6 . . . . . . 1 . . . . . 8 55 . . . . . 2 . . . . . . 1 . . 3 . . . . . 99 . g 1 . . . . . . . . . . 3 . . . . . 5 . . . . . . 1 . . . . . . 2 . . 11 . 11 . . . . . h 0 . 1 . . . . . . . . . . . . . . . . . . . . . 4 . . . . . . . . . . . . . . . . . i 5 . 4 . 5 83 3 3 20 11 8 3 . 3 10 20 . . . 3 . 10 8 18 2 . . . . 3 . 5 9 6 . 4 . 11 . 4 . 52 í 0 . . . . . . . 5 . . . . . . 2 . . . . . . . 1 . . . . . . . . . . . . . 4 . . . . j 0 . 2 . 14 . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . k 0 . . . . . . . . . . . . . . . . . . . 17 . . . . . . . . . . . . . . . . . . . . . l 3 2 8 . 5 . . . . . . 3 2 6 . 3 61 4 9 . . . 1 . 4 25 . . . . 3 . . 1 5 . . 3 . . . . m 4 8 7 . 8 . . 14 . . . 6 8 . . . . 3 3 . . . . . 8 . . . 99 15 . . . . 1 11 . . . . . . n 4 . 4 . . . 12 . 6 19 5 2 . . 5 18 22 5 1 8 . . . . 4 15 99 . . . . . 2 30 2 38 . 4 . . . . o 8 20 . . 3 . . 30 2 . 7 . . . . 7 . 2 . 4 . 
13 17 14 . . . . . 5 . 14 21 2 5 . . 4 . . . . ó 0 . . . . . . 1 . . . . . . . 3 . . . . . . . . . . . . . . . . . . . . . . . 4 . . ô 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . õ 0 . . . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ö 0 . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p 2 . 4 . . . . . . . . 3 1 3 . . . . 1 . . 13 . . 9 4 . . . . . 7 . . 1 . . . . . . . q 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 . 88 . . . . . r 6 5 9 . 30 5 . 3 4 1 2 12 . 34 16 6 . 17 2 . . . 4 1 4 25 . 3 . . . 1 1 9 3 . . 4 . . . . s 7 21 2 . 3 . 21 . 2 . . 7 . . 3 . . 3 12 . . . 4 . 2 19 . 2 . 5 5 . 3 19 10 . . 3 . . . . t 4 . 12 . 18 2 5 . 1 . . 9 5 8 . . . 13 5 . . . . . 9 4 . . . . . 12 . . 4 . . . 50 . . . u 3 1 4 . . . . 8 . 4 1 5 . . 1 1 . 2 5 5 . 12 17 4 . . . . . 7 . 4 2 3 . . . . . 1 . 22 ú 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ü 0 . . . . . . . . . . . . 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . v 1 . 2 . . . . . . . . 3 34 . . . . 3 . . . . . . 1 . . . . . . . . . . . . . . . . . w 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x 0 . . . . . . . . . . . . . . . . 1 . . . . . . 3 . . . . . . . . . . . . . . . . . y 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . z 0 . 1 . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 Symbol entropy: 4.137 Next-symbol entropy: 3.094 Finally, for Latin: cat latn-ock.txt \ | sed \ -e 's/^Discipulus://' \ -e 's/^Magister://' \ | tr '.,;:?\!()"-' ' ' \ | tr ' ' '\012' \ | egrep '.' 
\ | head -5338 \ | tr 'A-Z' 'a-z' \ > .foo cat .foo \ | tr ' ' '\012' \ | sed -e 's/$/./g' \ | count-digraph-freqs \ -v pad='.' \ -v chars='.abcdefghijklmnopqrstuvwxyz0123456789' \ -v showentropy=1 Digraph counts: TT . a b c d e f g h i l m n o p q r s t u v x y 0 1 2 3 4 5 6 7 8 9 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . 5338 . 459 42 414 272 675 134 30 123 418 70 165 185 116 605 398 162 448 189 213 184 . . . 22 6 . 1 1 2 2 2 . a 2328 277 . 119 59 99 165 5 13 4 33 242 207 171 . 93 4 171 121 446 88 10 1 . . . . . . . . . . . b 475 32 23 . . 3 116 . . . 81 1 . . 35 . . 11 20 . 151 1 . 1 . . . . . . . . . . c 1291 83 162 . 26 . 127 . . 29 300 24 . . 247 . 1 14 . 98 179 . . 1 . . . . . . . . . . d 1025 202 45 . 1 5 302 . . 8 267 . 5 . 99 . . . . . 91 . . . . . . . . . . . . . e 3455 511 46 32 183 97 5 12 130 1 21 226 258 262 22 30 27 498 486 468 16 10 114 . . . . . . . . . . . f 224 . 19 . . . 29 20 . . 133 1 . . 8 . . 1 . . 13 . . . . . . . . . . . . . g 306 . 65 . . . 46 . 1 . 54 . . 43 37 . . 28 . . 32 . . . . . . . . . . . . . h 192 7 72 . . . 2 . . . 31 . . . 25 . . 26 . . 29 . . . . . . . . . . . . . i 3880 426 214 171 274 163 67 21 51 2 73 152 131 464 170 167 32 49 423 486 252 86 6 . . . . . . . . . . . l 1078 47 61 . 1 . 187 . 1 . 473 96 1 . 45 . . . . 53 109 3 . 1 . . . . . . . . . . m 1514 856 81 8 1 . 65 . . . 82 . 77 45 96 60 2 . . . 134 6 . 1 . . . . . . . . . . n 1880 229 107 . 158 136 154 16 35 . 280 . . 21 159 . 27 . 95 308 139 15 . 1 . . . . . . . . . . o 1773 267 1 22 68 139 14 1 3 4 . 77 184 336 . 99 4 252 130 153 1 15 . 3 . . . . . . . . . . p 1239 3 179 . . . 199 . . 10 60 68 . . 220 14 . 311 36 59 80 . . . . . . . . . . . . . q 518 5 . . . . . . . . . . . . . . . . . . 513 . . . . . . . . . . . . . r 1909 350 209 14 18 24 374 10 31 . 422 . 3 25 117 20 . 27 34 70 128 33 . . . . . . . . 
. . . . s 2331 1013 52 1 55 12 219 1 6 . 247 . 19 . 31 58 14 1 117 320 165 . . . . . . . . . . . . . t 2880 921 309 . 2 . 502 . . 11 516 . . . 155 . 8 61 . 5 382 . . 8 . . . . . . . . . . u 2727 31 187 66 27 75 86 4 5 . 215 121 463 328 141 37 . 289 417 219 9 . 7 . . . . . . . . . . . v 363 . 35 . . . 117 . . . 166 . . . 41 . . . . . 3 . 1 . . . . . . . . . . . x 129 42 2 . 4 . 4 . . . 8 . . . 9 54 1 . . 5 . . . . . . . . . . . . . . y 16 . . . . . . . . . . . 1 . . 2 . 8 4 1 . . . . . . . . . . . . . . 0 2 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 24 8 . . . . . . . . . . . . . . . . . . . . . . 1 1 2 2 3 4 1 1 1 . 2 8 3 . . . . . . . . . . . . . . . . . . . . . . 1 1 . 1 . . . . 1 1 3 3 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 5 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 4 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 36928 5338 2328 475 1291 1025 3455 224 306 192 3880 1078 1514 1880 1773 1239 518 1909 2331 2880 2727 363 129 16 2 24 8 3 4 5 3 3 4 1 Next-symbol probability (× 99): TT . a b c d e f g h i l m n o p q r s t u v x y 0 1 2 3 4 5 6 7 8 9 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 99 . 9 1 8 5 13 2 1 2 8 1 3 3 2 11 7 3 8 4 4 3 . . . . . . . . . . . . a 99 12 . 5 3 4 7 . 1 . 1 10 9 7 . 4 . 7 5 19 4 . . . . . . . . . . . . . b 99 7 5 . . 1 24 . . . 17 . . . 7 . . 2 4 . 31 . . . . . . . . . . . . . c 99 6 12 . 2 . 10 . . 
2 23 2 . . 19 . . 1 . 8 14 . . . . . . . . . . . . . d 99 20 4 . . . 29 . . 1 26 . . . 10 . . . . . 9 . . . . . . . . . . . . . e 99 15 1 1 5 3 . . 4 . 1 6 7 8 1 1 1 14 14 13 . . 3 . . . . . . . . . . . f 99 . 8 . . . 13 9 . . 59 . . . 4 . . . . . 6 . . . . . . . . . . . . . g 99 . 21 . . . 15 . . . 17 . . 14 12 . . 9 . . 10 . . . . . . . . . . . . . h 99 4 37 . . . 1 . . . 16 . . . 13 . . 13 . . 15 . . . . . . . . . . . . . i 99 11 5 4 7 4 2 1 1 . 2 4 3 12 4 4 1 1 11 12 6 2 . . . . . . . . . . . . l 99 4 6 . . . 17 . . . 43 9 . . 4 . . . . 5 10 . . . . . . . . . . . . . m 99 56 5 1 . . 4 . . . 5 . 5 3 6 4 . . . . 9 . . . . . . . . . . . . . n 99 12 6 . 8 7 8 1 2 . 15 . . 1 8 . 1 . 5 16 7 1 . . . . . . . . . . . . o 99 15 . 1 4 8 1 . . . . 4 10 19 . 6 . 14 7 9 . 1 . . . . . . . . . . . . p 99 . 14 . . . 16 . . 1 5 5 . . 18 1 . 25 3 5 6 . . . . . . . . . . . . . q 99 1 . . . . . . . . . . . . . . . . . . 98 . . . . . . . . . . . . . r 99 18 11 1 1 1 19 1 2 . 22 . . 1 6 1 . 1 2 4 7 2 . . . . . . . . . . . . s 99 43 2 . 2 1 9 . . . 10 . 1 . 1 2 1 . 5 14 7 . . . . . . . . . . . . . t 99 32 11 . . . 17 . . . 18 . . . 5 . . 2 . . 13 . . . . . . . . . . . . . u 99 1 7 2 1 3 3 . . . 8 4 17 12 5 1 . 10 15 8 . . . . . . . . . . . . . . v 99 . 10 . . . 32 . . . 45 . . . 11 . . . . . 1 . . . . . . . . . . . . . x 99 32 2 . 3 . 3 . . . 6 . . . 7 41 1 . . 4 . . . . . . . . . . . . . . y 99 . . . . . . . . . . . 6 . . 12 . 50 25 6 . . . . . . . . . . . . . . 0 99 99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 99 33 . . . . . . . . . . . . . . . . . . . . . . 4 4 8 8 12 17 4 4 4 . 2 99 37 . . . . . . . . . . . . . . . . . . . . . . 12 12 . 12 . . . . 12 12 3 99 99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 99 99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 99 99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 99 99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 99 99 . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . 8 99 99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 99 99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 14 6 1 3 3 9 1 1 1 10 3 4 5 5 3 1 5 6 8 7 1 0 0 0 0 0 0 0 0 0 0 0 0 Previous-symbol probability (× 99): TT . a b c d e f g h i l m n o p q r s t u v x y 0 1 2 3 4 5 6 7 8 9 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 14 . 20 9 32 26 19 59 10 63 11 6 11 10 6 48 76 8 19 6 8 50 . . . 91 74 . 25 20 66 66 50 . a 6 5 . 25 5 10 5 2 4 2 1 22 14 9 . 7 1 9 5 15 3 3 1 . . . . . . . . . . . b 1 1 1 . . . 3 . . . 2 . . . 2 . . 1 1 . 5 . . 6 . . . . . . . . . . c 3 2 7 . 2 . 4 . . 15 8 2 . . 14 . . 1 . 3 6 . . 6 . . . . . . . . . . d 3 4 2 . . . 9 . . 4 7 . . . 6 . . . . . 3 . . . . . . . . . . . . . e 9 9 2 7 14 9 . 5 42 1 1 21 17 14 1 2 5 26 21 16 1 3 87 . . . . . . . . . . . f 1 . 1 . . . 1 9 . . 3 . . . . . . . . . . . . . . . . . . . . . . . g 1 . 3 . . . 1 . . . 1 . . 2 2 . . 1 . . 1 . . . . . . . . . . . . . h 1 . 3 . . . . . . . 1 . . . 1 . . 1 . . 1 . . . . . . . . . . . . . i 10 8 9 36 21 16 2 9 17 1 2 14 9 24 9 13 6 3 18 17 9 23 5 . . . . . . . . . . . l 3 1 3 . . . 5 . . . 12 9 . . 3 . . . . 2 4 1 . 6 . . . . . . . . . . m 4 16 3 2 . . 2 . . . 2 . 5 2 5 5 . . . . 5 2 . 6 . . . . . . . . . . n 5 4 5 . 12 13 4 7 11 . 7 . . 1 9 . 5 . 4 11 5 4 . 6 . . . . . . . . . . o 5 5 . 5 5 13 . . 1 2 . 7 12 18 . 8 1 13 6 5 . 4 . 19 . . . . . . . . . . p 3 . 8 . . . 6 . . 5 2 6 . . 12 1 . 16 2 2 3 . . . . . . . . . . . . . q 1 . . . . . . . . . . . . . . . . . . . 19 . . . . . . . . . . . . . r 5 6 9 3 1 2 11 4 10 . 11 . . 1 7 2 . 1 1 2 5 9 . . . . . . . . . . . . s 6 19 2 . 4 1 6 . 2 . 6 . 1 . 2 5 3 . 5 11 6 . . . . . . . . . . . . . t 8 17 13 . . . 14 . . 6 13 . . . 9 . 2 3 . . 14 . . 50 . . . . . . . . . . 
u 7 1 8 14 2 7 2 2 2 . 5 11 30 17 8 3 . 15 18 8 . . 5 . . . . . . . . . . . v 1 . 1 . . . 3 . . . 4 . . . 2 . . . . . . . 1 . . . . . . . . . . . x 0 1 . . . . . . . . . . . . 1 4 . . . . . . . . . . . . . . . . . . y 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 0 . . . . . . . . . . . . . . . . . . . . . . . 50 4 25 66 74 79 33 33 25 . 2 0 . . . . . . . . . . . . . . . . . . . . . . . 50 4 . 33 . . . . 25 99 3 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 Symbol entropy: 4.023 Next-symbol entropy: 3.255 Summarizing, here are the letter frequency tables for Voynichese (bio/EVA), English (LC), and Latin (LC): Ra Friedm. Currier English Latin nk L freq L freq r freq r freq -- - ----- - ----- - ----- - ----- 01 . 5910 . 6141 . 6929 . 
                              5338
02 e  3974 e  3821 e  3710 i  3880
03 o  3635 o  3619 t  2547 e  3455
04 y  3481 y  3468 a  2413 t  2880
05 h  2605 h  2625 o  2249 u  2727
06 d  2534 d  2534 i  2061 s  2331
07 l  2172 l  2157 n  2001 a  2328
08 k  2026 k  2002 s  1933 r  1909
09 a  1792 a  1865 h  1813 n  1880
10 c  1626 c  1633 r  1806 o  1773
11 q  1565 i  1634 d  1405 m  1514
12 i  1361 q  1547 l  1279 c  1291
13 s  1321 s  1352 u   878 p  1239
14 t   928 t   928 m   865 l  1078
15 n   861 n   860 w   746 d  1025
16 r   837 r   845 y   738 q   518
17 p   192 p   192 c   694 b   475
18 m    56 m    72 f   652 v   363
19 f    33 f    27 g   577 g   306
20 g    11 -     - p   482 f   224
21 -     - -     - b   359 h   192
22 -     - -     - v   319 x   129
23 -     - -     - k   213 y    16
24 -     - -     - j    61 -     -
25 -     - -     - x    45 -     -
26 -     - -     - q    23 -     -
27 -     - -     - z     7 -     -
-- - ----- - ----- - ----- - -----
xx T 36920 T 37322 T 36805 L 36871
xx h 3.826 h 3.827 h 4.092 1 4.008

  gnuplot
  set term x11
  # set term pbm color medium
  # set output ".letfreqs.ppm"
  set xrange [-1:28]
  plot \
    ".letfreqs" using 01:03 title "Friedman" with boxes,\
    ".letfreqs" using 04:06 title "Currier" with boxes,\
    ".letfreqs" using 07:09 title "English" with boxes,\
    ".letfreqs" using 10:12 title "Latin" with boxes
  pause 120
  plot \
    ".letfreqs" using 03 title "Friedman" with steps,\
    ".letfreqs" using 06 title "Currier" with steps,\
    ".letfreqs" using 09 title "English" with steps,\
    ".letfreqs" using 12 title "Latin" with steps

97-11-11 stolfi
===============

Perhaps the "d" should be included in the suffix set.
Let's see whether it can be preceded by suffixy letters:

  cat bio-f-eva-gut.wds \
    | sed \
        -e 's/^/ /' \
        -e 's/\(.d\)/ \1 /g' \
    | tr ' ' '\012' \
    | egrep -e 'd' \
    | sort | uniq -c | expand | sort +0 -1nr \
    > .foo

Result:

  1733 ed
   507  d
   145 hd
    73 ld
    27 od
    13 yd
     7 ad

Let's look at the '[loya]d' words:

  foreach f ( l o y a )
    echo '=== '$f
    cat bio-f-eva-gut.dic \
      | egrep -e "$f"'d'
  end

  al.al.dy al.dy che.ol.dy ch.l.d.aiin d.air.ol.dy d.al.dy d.ol.dy
  k.al.dy k.ol.dy l.d l.d.aiin l.d.al ld.al.or l.d.ar ld.chey ld.dy
  l.d.ol l.dy l.ke.ol.dy l.l.d.ar l.ol.dy ok.al.dy okee.dy.l.dy
  ok.ol.dy ol.d ol.d.a ol.d.air ol.d.y ol.k.ol.d.y ot.al.d.y
  ote.ol.d.y otoldy p.ol.d.ak.y p.ol.d.she.dy pshe.al.dy pshe.ol.dy
  qokaldy qokoldy qok.y.l.d.d.y q.ol.d.y qot.al.d.y rche.al.d
  shed.al.d.y sheol.d.y sh.l.d.y sh.ol.d.y s.ol.d.y t.ol.d.al
  yshe.al.d.y

  chckh.od.y che.od.y d.ar.od.y lk.od.al l.od od.aiin od.al od.ar
  odched.y oddche.y od.y ot.ar.od.l otee.od.y p.ar.od.y qod.aiin
  qod.ar qod.che.d qod.ee.d.y qodee.y qod.y qod.yke.y sh.od.y s.od.ar

  chckh.yd che.d.che.yd.aiin d.air.yd.y dsh.ol.yd olt.yd.y qok.yd.y
  yd.aiin yd.air.ol yd.ar.al yd.ar.she.y yd.y

  ch.ad.y d.ar.ad.y ok.ad.y ot.ad.y qok.ad.y t.ad.y

So it seems that -od, -ad, -ld, -yd are letters, as well as qod.

Let's look again at suffixes, ab initio:

  cat bio-f-eva-gut.wds \
    | sed \
        -e 's/\([[oaydirslmn]*\)$/- -\1/' \
    | egrep -e '- -' \
    | gawk '/./ {print $2}' \
    | sort | uniq -c | expand | sort +0 -1nr \
    > .foo

It seems that the following is the "suffix alphabet":

  -[aoydlsm]
  -i*n
  -i*r

97-11-16 stolfi
===============

While waiting for a big compile, I began splitting Jim Reeds's mail
archives into separate messages:

  cd ../docs
  mkdir email-arch
  mv email-BIG[123]* email-arch
  cd email-arch

I then split the mail folders into individual files using MH.
Must clean them, convert to HTML, and add links.
Adding keys for links:

  <> statistics and structural analysis
  <> transcribed text and corrections
  <> discussion on file organization, format, and logistics
  <> mailing list administration
  <> discussion about alphabet
  <> historical and cultural issues
  <> bibliographic references
  <> about people
  <> software
  <> discussions related to crypto hypothesis
  <> physical format and properties of the book
  <> pictures and discussion thereabout
  <> humor (?)

97-11-20 stolfi
===============

At Rene's suggestion, I have plotted Rayman's counts of distinct
characters per page:

  cat rayman-char-counts.txt | tr ',' ' ' > .tmp
  gnuplot < rayman-char-counts.gif
  xv rayman-char-counts.{ppm,gif}

97-11-22 stolfi
===============

Let's compute the number of EVA characters per page in each
transcription:

  cat L16-eva/INDEX \
    | sed -e 's/:.*$//g' \
    > all.units

  set u = ( `cat all.units | sed -e 's/^/L16-eva\//g'` )

  cat ${u} \
    | count-bytes-per-scribe \
        -v scribes='BCDFGIJKLQRTUZ' \
    > .bytes-per-page-and-scribe

On a different track, I wrote a distribution-sorting program
(sort-distr.c) and used it to sort the new label location maps
(Note-010.txt, in preparation).

97-11-23 stolfi
===============

Let's have another quick look at the A/B differences in midfix
frequencies of Note-009.txt. Perhaps the difference will become
sharper (or disappear) if we collapse k/t and replace ch=sh=ee,
cth=ete, etc.
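For reference, the kind of collapsing meant here can be sketched with a plain sed filter. This is only an illustration under stated assumptions: "collapse_eva" is a made-up name, it is written for a POSIX shell rather than csh, and it is not the eva2erb filter, whose actual rules are not recorded in this entry. Treating ckh like cth is an assumption that follows from merging k and t.

```shell
# Hypothetical stand-in for the collapsing idea; NOT the notebook's eva2erb.
collapse_eva() {
  # Order matters: rewrite the c-gallows-h groups first, then the bare
  # ch/sh pairs, and only then fold k into t.
  sed -e 's/c[kt]h/ete/g' \
      -e 's/[cs]h/ee/g' \
      -e 's/k/t/g'
}

# Sample words, one per line, filtered through the mapping.
printf '%s\n' qokchedy shol ckhy | collapse_eva
```

If the k-to-t rule ran first, "ckh" would already have turned into "cth" by the time the gallows rule is tried, which is harmless here but would matter for any rule that treats the two differently.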
  foreach guy ( Friedman.f Currier.c )
    foreach lang ( A.a B.b )
      cat Note-009/he${lang:e}-${guy:e}.factored \
        | grep -v -e '- -' \
        | eva2erb \
        | sort | uniq -c | expand | sort +0 -1nr \
        > .he${lang:e}-${guy:e}-unifs-all.frq
      cat Note-009/he${lang:e}-${guy:e}.factored \
        | grep -e '- -' \
        | gawk '/./ {print $1}' \
        | eva2erb \
        | sort | uniq -c | expand | sort +0 -1nr \
        > .he${lang:e}-${guy:e}-prefs-all.frq
      cat Note-009/he${lang:e}-${guy:e}.factored \
        | grep -e '- -' \
        | gawk '/./ {print $2}' \
        | eva2erb \
        | sort | uniq -c | expand | sort +0 -1nr \
        > .he${lang:e}-${guy:e}-midfs-all.frq
      cat Note-009/he${lang:e}-${guy:e}.factored \
        | grep -e '- -' \
        | gawk '/./ {print $3}' \
        | eva2erb \
        | sort | uniq -c | expand | sort +0 -1nr \
        > .he${lang:e}-${guy:e}-suffs-all.frq
      foreach elem ( pref midf suff unif )
        set file = "he${lang:e}-${guy:e}-${elem}s-all"
        echo "${file}.frq -> ${file}.fmt"
        cat .${file}.frq \
          | compute-freqs \
          | gawk '\
              BEGIN {\
                printf "by '"${guy:r}"'\nlanguage '"${lang:r}"'\n"; \
                printf "freq pc '"${elem}"'ix\n---- -- ----------------\n";} \
              /./ {printf "%4d %2d %s\n",$1,int($2*100+0.5),$3; t+=$1;} \
              END {printf "---- -- ----------------\n%4d 99 TOTAL\n",t;} \
            ' \
          > .${file}.fmt
      end
    end
  end
  dicio-wc .he{a,b}-{f,c}-{pref,midf,suff,unif}s-all.fmt

  foreach elem ( pref midf suff unif )
    set tfiles = ( )
    foreach guy ( f c )
      foreach lang ( a b )
        set file = "he${lang}-${guy}-${elem}s-all"
        set tfiles = ( ${tfiles} .${file}.fmt )
      end
    end
    pr -m -t -i' '1 -w 88 ${tfiles} \
      | expand \
      > .herbal-${elem}-cmp.txt
  end
  dicio-wc .herbal-{pref,midf,suff,unif}-cmp.txt

Looking at the suffix frequencies, it seems that the main difference
between A and B is that the latter uses "d" instead of some letter
that should be in the midfix.
If we eliminate the "-do" suffix and renormalize, we get the third
table below:

  by Friedman             by Friedman             by Friedman
  language A              language B              language B minus -do
  freq pc suffix          freq pc suffix          freq pc suffix
  ---- -- -------------   ---- -- -------------   ---- -- -------------
  2200 37 -o               642 26 -do              583 32 -o
  1008 17 -ol              583 24 -o               254 14 -or
   960 16 -or              254 10 -or              183 10 -ol
   365  6 -oiin            183  8 -ol              145  8 -oiin
   239  4 -odo             145  6 -oiin            116  6 -odo
   127  2 -om              116  5 -odo              63  4 -
   124  2 -                 63  3 -                 41  2 -om
    85  1 -odoiin           41  2 -om               34  2 -doiin
  ---- -- -------------   ---- -- -------------   ---- -- -------------
  5967 99 TOTAL           2431 99 TOTAL           1789 99 TOTAL

I still can't see much resemblance in the midfixes. Moreover, the B
midfixes seem longer on average. So it is not a matter of moving
some suffix letter to the midfix.

Basically the B language does not use the ckh/cth gallows. Perhaps
ckh = ked, or something of the sort?

  by Friedman             by Friedman             by Currier              by Currier
  language A              language B              language A              language B
  freq pc midfix          freq pc midfix          freq pc midfix          freq pc midfix
  ---- -- -------------   ---- -- -------------   ---- -- -------------   ---- -- -------------
  1595 27 -ee-             590 24 -k-             1472 28 -ee-             404 21 -k-
   913 15 -k-              279 12 -eee-            865 16 -k-              240 13 -ke-
   856 14 -kee-            274 11 -ke-             705 13 -kee-            219 12 -eee-
   459  8 -eke-            269 11 -ee-             385  7 -eke-            202 11 -ee-
   418  7 -eee-            261 11 -kee-            316  6 -eee-            187 10 -kee-
   155  3 -ke-              76  3 -eeeke-          128  2 -eeok-            62  3 -eeeke-
   152  3 -keee-            72  3 -keee-           128  2 -keee-            48  3 -eeee-
   132  2 -eeok-            60  3 -eeek-           101  2 -ke-              48  3 -eeek-
   110  2 -pee-             49  2 -pee-            100  2 -pee-             44  2 -keee-
    99  2 -eeee-            48  2 -eeee-            75  1 -epe-             34  2 -pee-
    93  2 -ekee-            48  2 -p-               72  1 -eeee-            34  2 -peee-
    81  1 -epe-             39  2 -peee-            69  1 -eeeke-           33  2 -eke-
    73  1 -eeeke-           33  1 -eke-             66  1 -eeokee-          32  2 -p-
    60  1 -p-               25  1 -ekee-            56  1 -ekee-            23  1 -ekee-
    57  1 -eek-             24  1 -eek-             55  1 -p-               18  1 -eek-
    55  1 -eeokee-          20  1 -eeekee-          52  1 -eek-             16  1 -eeeeke-

Well, since the prefixes seem OK, let's compare the midfix+suffix
together:

  foreach guy ( Friedman.f )
    foreach lang (
A.a B.b ) cat Note-009/he${lang:e}-${guy:e}.factored \ | grep -e '- -' \ | gawk '/./ {print ($2 $3)}' \ | sed -e 's/--//g' \ | eva2erb \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang:e}-${guy:e}-tails-all.frq foreach elem ( tail ) set file = "he${lang:e}-${guy:e}-${elem}s-all" echo "${file}.frq -> ${file}.fmt" cat .${file}.frq \ | compute-freqs \ | gawk '\ BEGIN {\ printf "by '"${guy:r}"'\nlanguage '"${lang:r}"'\n"; \ printf "freq pc '"${elem}"'ix\n---- -- ----------------\n";} \ /./ {printf "%4d %2d %s\n",$1,int($2*100+0.5),$3; t+=$1;} \ END {printf "---- -- ----------------\n%4d 99 TOTAL\n",t;} \ ' \ > .${file}.fmt end end end dicio-wc .he{a,b}-{f}-{tail}s-all.fmt foreach elem ( tail ) set tfiles = ( ) foreach guy ( f ) foreach lang ( a b ) set file = "he${lang}-${guy}-${elem}s-all" set tfiles = ( ${tfiles} .${file}.fmt ) end end pr -m -t -i' '1 -w 88 ${tfiles} \ | expand \ > .herbal-${elem}-cmp.txt end dicio-wc .herbal-{tail}-cmp.txt lines words bytes file ------ ------- --------- ------------ 774 3722 41887 .herbal-tail-cmp.txt by Friedman by Friedman language A language B freq pc tailix freq pc tailix ---- -- ---------------- ---- -- ---------------- 404 7 -keeo 153 6 -kor 395 7 -eeo 150 6 -kedo 370 6 -eeol 129 5 -keedo 337 6 -eeor 117 5 -eeedo 197 3 -ko 108 4 -koiin 189 3 -kol 88 4 -kol 176 3 -ekeo 87 4 -eedo 166 3 -eeeo 71 3 -keeo 162 3 -koiin 65 3 -ko 146 2 -keeor 55 2 -eeekeo 132 2 -keeol 51 2 -eeeo 119 2 -kor 31 1 -keeedo 99 2 -keeeo 31 1 -keo 91 2 -eeeor 30 1 -eeor 81 1 -ekeor 30 1 -kom 80 1 -ekeol 29 1 -eeeko 78 1 -eeoiin 28 1 -eeo 64 1 -eeodo 28 1 -keeeo 61 1 -eeeeo 27 1 -eeol 60 1 -eeoko 24 1 -eeodo 57 1 -ekeeo 23 1 -keodo 48 1 -eeeol 23 1 -koin 47 1 -eeekeo 22 1 -eeeor 44 1 -keol 21 1 -eeeodo 42 1 -eeodoiin 21 1 -ekeo 40 1 -eeoekeo 20 1 -kodo 39 1 -eeokeeo 19 1 -eeeeo 39 1 -keo 19 1 -peeedo 37 1 -keeodo 18 1 -keol 33 1 -eeom 18 1 -peedo 30 1 -peeo 17 1 -eeeol 29 1 -kom 16 1 -eeeekeo 28 1 -keor 13 1 -eeko 27 1 -peeor 12 1 -eeeedo 24 0 -ekeodo 
12 1 -eeekeeo 24 0 -epeo 12 1 -ekeeo 24 0 -kodo 12 1 -koldo 23 0 -keeoiin 11 1 -eedoiin 22 0 -eeoro 11 1 -eeekedo 21 0 -eekeeo 11 1 -koir 20 0 -ekeoiin 10 0 -ekeedo 19 0 -eee 10 0 -k 19 0 -k 10 0 -keeodo 19 0 -koldo 10 0 -kolo 18 0 -eeoin 9 0 -keor 17 0 -eeeko 9 0 -peeo 17 0 -eeer 8 0 -eeed 17 0 -eeod 8 0 -eeekoiin 17 0 -ekeom 8 0 -por 17 0 -kod 7 0 -eedol 16 0 -eeekeeo 7 0 -eeedoiin 16 0 -eeko 7 0 -eeee 16 0 -eeolo 7 0 -eeek 16 0 -eeon 7 0 -eeer 16 0 -keeod 7 0 -eeoko 16 0 -peeol 7 0 -ekedo 15 0 -eeeodo 7 0 -koro 15 0 -eeokol 7 0 -peeeo 15 0 -eer 6 0 -eeekol 15 0 -keeeeo 6 0 -ked 14 0 -eekoiin 6 0 -kedoiin 14 0 -eeoeeo 6 0 -keedor 14 0 -eeokoiin 6 0 -keeol 14 0 -epeol 6 0 -poiin 14 0 -keeom 5 0 -eed 13 0 -eeoldo 5 0 -eee 13 0 -ekeeeo 5 0 -eeekeedo 12 0 -ee 5 0 -eeekeeeo 12 0 -eeeer 5 0 -eeekor 12 0 -eeeoiin 5 0 -kedor 12 0 -eeoo 5 0 -keeor 11 0 -ekeeor 5 0 -keer 11 0 -keeeor 5 0 -kodoiin 11 0 -keodo 5 0 -koror 11 0 -kodoiin 5 0 -peeedor 11 0 -koin 4 0 -eedom 10 0 -eedo 4 0 -eedor 10 0 -eekor 4 0 -eeedor 10 0 -eeok 4 0 -eeeekeeo 10 0 -eeokeo 4 0 -eeeer 10 0 -ekeeol 4 0 -eeeoekeo 10 0 -epeor 4 0 -eeepeedo 10 0 -kee 4 0 -eeepo 10 0 -keeeol 4 0 -eekoiin 10 0 -koo 4 0 -eer 10 0 -peeeo 4 0 -keed 9 0 -eeeeor 4 0 -keeeeo 9 0 -eeokor 4 0 -keeod 9 0 -ekeeodo 4 0 -peedoiin 8 0 -eeeeol 4 0 -peeol 8 0 -eeeodoiin 3 0 -eedolo 8 0 -eeeom 3 0 -eeedol 8 0 -eeodol 3 0 -eeeeko 8 0 -eeodor 3 0 -eeeked 8 0 -eeokeeeo 3 0 -eeeod 8 0 -eeokeeol 3 0 -eekedo ---- -- ---------------- ---- -- ---------------- 5967 99 TOTAL 2431 99 TOTAL Inspired by Landini's paper, let me prepare a graph of A-freq × B-freq for each segment: foreach guy ( Friedman.f ) foreach elem ( pref midf suff unif tail ) set pfile = "herbal-${guy:e}-${elem}s-all" set afile = "hea-${guy:e}-${elem}s-all" set bfile = "heb-${guy:e}-${elem}s-all" echo "${afile}.frq, ${bfile}.frq -> ${pfile}.plt" /n/gnu/bin/join \ -a 1 -a 2 -e 0 \ -j1 2 -j2 2 \ -o1.1,2.1,0 \ ${afile}.frq ${bfile}.frq \ > .${pfile}.plt plot-lang-diffs ${guy:r} 
        ${elem} ${pfile}.plt
    end
  end
  dicio-wc .he-{f}-{tail}s-all.fmt

97-11-24 stolfi
===============

I stole some text in pinyin from
http://www-personal.umich.edu/~wbaxter/, cleaned it some and saved
it to chin-mch.txt.

This is a bad sample: in the first statistics I ran, "zhong1 guo2"
(China) came out near the top. That's because half the sample is a
Voice of America semi-political speech... So I removed all (but
one) occurrences of "zhong1 guo2" from the sample.

Let's run some statistics. First, words overall:

  cat chin-mch.txt \
    | tr ' ' '\012' \
    | grep '.' \
    | sort | uniq -c | expand \
    | sort +0 -1nr \
    | compute-freqs \
    > .chin.frq

  count freqy word
  ----- ----- -----------
    244 0.065 de
    118 0.031 shi4
     78 0.021 ren2
     62 0.016 you3
     61 0.016 ta1
     55 0.015 xue2
     54 0.014 wen2
     50 0.013 shi2
     50 0.013 zai4
     42 0.011 guo2
     41 0.011 yi2
     40 0.011 yi4
     37 0.010 le
     35 0.009 ge
     34 0.009 shuo1
     33 0.009 bu4
    ... ..... .....

Now, without tones:

  cat chin-mch.txt \
    | tr ' ' '\012' \
    | tr -d '0-9' \
    | grep '.' \
    | sort | uniq -c | expand \
    | sort +0 -1nr \
    | compute-freqs \
    > .chin-notone.frq

  count freqy word
  ----- ----- -----------
    245 0.065 de
    223 0.059 shi
    129 0.034 yi
     93 0.025 ren
     71 0.019 you
     65 0.017 bu
     61 0.016 ta
     60 0.016 guo
     58 0.015 wen
     55 0.015 xue
     55 0.015 zi
     50 0.013 zai
     47 0.012 ji
     44 0.012 yu
     44 0.012 zhi
     43 0.011 ge
     40 0.011 mei

Now for the initial consonant:

  cat chin-mch.txt \
    | tr ' ' '\012' \
    | tr -d '0-9' \
    | sed -e 's/[aeiouü].*$//g' \
    | grep '.' \
    | sort | uniq -c | expand \
    | sort +0 -1nr \
    | compute-freqs \
    > .chin-initial.frq

  count freqy word
  ----- ----- -----------
    473 0.126 d
    402 0.107 y
    364 0.097 sh
    215 0.057 j
    209 0.056 x
    198 0.053 h
    197 0.053 zh
    178 0.047 g
    173 0.046 l
    166 0.044 z
    157 0.042 b
    138 0.037 w
    130 0.035 r
    130 0.035 t
     91 0.024 f
     90 0.024 m
     89 0.024 n
     89 0.024 q
     75 0.020 k
     74 0.020 ch

Now for the final (vowels plus terminators):

  cat chin-mch.txt \
    | tr ' ' '\012' \
    | tr -d '0-9' \
    | sed -e 's/^[^aeiouü]*//g' \
    | grep '.'
\ | sort | uniq -c | expand \ | sort +0 -1nr \ | compute-freqs \ > .chin-final.frq count freqy word ----- ----- ----------- 654 0.173 i 434 0.115 e 311 0.082 u 238 0.063 en 179 0.047 ai 168 0.045 uo 145 0.038 ou 130 0.034 a 126 0.033 an 123 0.033 ing 122 0.032 ong 118 0.031 ei 113 0.030 ian 109 0.029 eng 102 0.027 ao 98 0.026 ang 73 0.019 ui 67 0.018 ue 59 0.016 iao Changing subject again, I have been looking at the differences between languages A and B, particularly the tail (midfix+suffix) distribution. They really look like different languages. Even taking into account possible letter confusion, there seems no simple correspondence between the tails of one and those of the other. Just to be sure, let's try to recompute the tail distributions after collapsing everything that could be equivalent: t,k ---------> t p,f ---------> p r,s ---------> e ei ----------> o o,a,y -------> o ch,sh -------> ee cth,ckh -----> tee cph,cfh -----> pee iiii,iii,ii -> i foreach lang ( a b ) cat Note-009/he${lang}-f.factored \ | sed \ -e 's/sh/ee/g' \ -e 's/ch/ee/g' \ -e 's/s/e/g' \ -e 's/r/e/g' \ -e 's/k/t/g' \ -e 's/f/p/g' \ -e 's/cth/tee/g' \ -e 's/ckh/tee/g' \ -e 's/cph/pee/g' \ -e 's/cfh/pee/g' \ -e 's/ei/o/g' \ -e 's/a/o/g' \ -e 's/y/o/g' \ -e 's/iiii/i/g' \ -e 's/iii/i/g' \ -e 's/ii/i/g' \ > .he${lang}-f-ere.factored cat .he${lang}-f-ere.factored \ | grep -e '- -' \ | gawk '/./ {print $2}' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-ere-midfs-all.frq cat .he${lang}-f-ere.factored \ | grep -e '- -' \ | gawk '/./ {print $3}' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-ere-suffs-all.frq cat .he${lang}-f-ere.factored \ | grep -e '- -' \ | gawk '/./ {print ($2 $3)}' \ | sed -e 's/--//g' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-ere-tails-all.frq end dicio-wc .he{a,b}-f-ere-{midf,suff,tail}s-all.frq lines words bytes file ------ ------- --------- ------------ 179 358 2964 .hea-f-ere-midfs-all.frq 131 262 1815 .hea-f-ere-suffs-all.frq 
655 1310 10600 .hea-f-ere-tails-all.frq 133 266 2169 .heb-f-ere-midfs-all.frq 82 164 1118 .heb-f-ere-suffs-all.frq 420 840 6716 .heb-f-ere-tails-all.frq foreach elem ( midf suff tail ) foreach lang ( A.a B.b ) set file = "he${lang:e}-f-ere-${elem}s-all" echo "${file}.frq -> ${file}.fmt" cat .${file}.frq \ | compute-freqs \ | gawk '\ BEGIN {\ printf "by Friedman\nlanguage '"${lang:r}"'\n"; \ printf "freq pc '"${elem}"'ix\n---- -- ------------------\n";} \ /./ {printf "%4d %2d %s\n",$1,int($2*100+0.5),$3; t+=$1;} \ END {printf "---- -- ------------------\n%4d 99 TOTAL\n",t;} \ ' \ > .${file}.fmt end end foreach elem ( midf suff tail ) set tfiles = ( ) foreach lang ( a b ) set file = "he${lang}-f-ere-${elem}s-all" set tfiles = ( ${tfiles} .${file}.fmt ) end pr -m -t -i' '1 -w 54 ${tfiles} \ | expand \ > .herbal-f-ere-${elem}-cmp.txt end dicio-wc .herbal-f-ere-{midf,suff,tail}-cmp.txt Here are the results: by Friedman by Friedman language A language B freq pc midfix freq pc midfix ---- -- ------------------ ---- -- ------------------ 1595 27 -ee- 590 24 -t- 1313 22 -tee- 293 12 -tee- 913 15 -t- 279 12 -eee- 419 7 -eee- 274 11 -te- 241 4 -teee- 269 11 -ee- 191 3 -pee- 95 4 -teee- 155 3 -te- 66 3 -eetee- 132 2 -eeot- 60 3 -eeet- 100 2 -eeee- 52 2 -pee- 100 2 -eeotee- 49 2 -eeee- 99 2 -eetee- 48 2 -p- 60 1 -p- 45 2 -peee- 57 1 -eet- 25 1 -eeetee- 46 1 -peee- 24 1 -eet- 40 1 -teeee- 15 1 -eeete- 24 0 -eeet- 14 1 -eeteee- .... .. ....... .... .. ..... 
---- -- ------------------ ---- -- ------------------ 5967 99 TOTAL 2431 99 TOTAL Tails: by Friedman by Friedman language A language B freq pc tailix freq pc tailix ---- -- ------------------ ---- -- ------------------ 579 10 -teeo 153 6 -toe 395 7 -eeo 150 6 -tedo 370 6 -eeol 135 6 -teedo 337 6 -eeoe 131 5 -toin 226 4 -teeoe 118 5 -eeedo 212 4 -teeol 92 4 -teeo 197 3 -to 88 4 -tol 189 3 -tol 87 4 -eedo 178 3 -toin 65 3 -to 167 3 -eeeo 52 2 -eeteeo 156 3 -teeeo 51 2 -eeeo 119 2 -toe 41 2 -teeedo 96 2 -eeoin 39 2 -teeeo 91 2 -eeeoe 31 1 -teo Hmm, it seems that scribe A does not use "d" in the suffixes very much. Perhaps if we delete "d" we will get a better resemblance: foreach lang ( a b ) cat Note-009/he${lang}-f.factored \ | sed \ -e 's/d//g' \ -e 's/sh/ee/g' \ -e 's/ch/ee/g' \ -e 's/s/e/g' \ -e 's/r/e/g' \ -e 's/k/t/g' \ -e 's/f/p/g' \ -e 's/cth/tee/g' \ -e 's/ckh/tee/g' \ -e 's/cph/pee/g' \ -e 's/cfh/pee/g' \ -e 's/ei/o/g' \ -e 's/a/o/g' \ -e 's/y/o/g' \ -e 's/iiii/i/g' \ -e 's/iii/i/g' \ -e 's/ii/i/g' \ > .he${lang}-f-erf.factored end foreach lang ( a b ) cat .he${lang}-f-erf.factored \ | grep -e '- -' \ | gawk '/./ {print $2}' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erf-midfs-all.frq cat .he${lang}-f-erf.factored \ | grep -e '- -' \ | gawk '/./ {print $3}' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erf-suffs-all.frq cat .he${lang}-f-erf.factored \ | grep -e '- -' \ | gawk '/./ {print ($2 $3)}' \ | sed -e 's/--//g' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erf-tails-all.frq end dicio-wc .he{a,b}-f-erf-{midf,suff,tail}s-all.frq lines words bytes file ------ ------- --------- ------------ 162 324 2666 .hea-f-erf-midfs-all.frq 85 170 1159 .hea-f-erf-suffs-all.frq 535 1070 8572 .hea-f-erf-tails-all.frq 125 250 2028 .heb-f-erf-midfs-all.frq 54 108 722 .heb-f-erf-suffs-all.frq 329 658 5186 .heb-f-erf-tails-all.frq foreach elem ( midf suff tail ) foreach lang ( A.a B.b ) set file = 
"he${lang:e}-f-erf-${elem}s-all" echo "${file}.frq -> ${file}.fmt" cat .${file}.frq \ | compute-freqs \ | gawk '\ BEGIN {\ printf "by Friedman\nlanguage '"${lang:r}"'\n"; \ printf "freq pc '"${elem}"'ix\n---- -- ------------------\n";} \ /./ {printf "%4d %2d %s\n",$1,int($2*100+0.5),$3; t+=$1;} \ END {printf "---- -- ------------------\n%4d 99 TOTAL\n",t;} \ ' \ > .${file}.fmt end end foreach elem ( midf suff tail ) set tfiles = ( ) foreach lang ( a b ) set file = "he${lang}-f-erf-${elem}s-all" set tfiles = ( ${tfiles} .${file}.fmt ) end pr -m -t -i' '1 -w 54 ${tfiles} \ | expand \ > .herbal-f-erf-${elem}-cmp.txt end dicio-wc .herbal-f-erf-{midf,suff,tail}-cmp.txt lines words bytes file ------ ------- --------- ------------ 168 893 6707 .herbal-f-erf-midf-cmp.txt 91 449 3316 .herbal-f-erf-suff-cmp.txt 541 2624 20105 .herbal-f-erf-tail-cmp.txt by Friedman by Friedman language A language B freq pc tailix freq pc tailix ---- -- ------------------ ---- -- ------------------ 611 10 -teeo 231 10 -teeo 422 7 -eeo 185 8 -teo 374 6 -eeol 172 7 -eeeo 338 6 -eeoe 153 6 -toe 228 4 -teeoe 131 5 -toin 218 4 -teeol 116 5 -eeo 216 4 -to 90 4 -tol 195 3 -tol 84 4 -teeeo 180 3 -toin 69 3 -to 172 3 -eeeo 59 2 -eeteeo 161 3 -teeeo 36 2 -eeeeo 120 2 -toe 34 1 -eeoe 98 2 -eeoin 34 1 -eeol 91 2 -eeeoe 30 1 -peeeo 79 1 -eeoteeo 30 1 -tom 76 1 -eeoo 29 1 -eeeto 70 1 -eeteeo 29 1 -eeoo 69 1 -teeoo 29 1 -peeo 64 1 -eeeeo 26 1 -eeeoe 62 1 -eeoto 26 1 -teoo Good news, at least we got the top entry to match. Now what else can we do? We could map "teeoe" and "teeol" to "teo", but that seems a bit ad-hoc... Let's try again. Let's compare the frequencies of "k" and "t", "sh" and "ch" in each language" foreach lang ( a b ) cat Note-009/he${lang}-f.factored \ | sed \ -e 's/[- .:]//g' \ -e 's/ch/C/' \ -e 's/sh/S/' \ -e 's/$/\./' \ | count-digraph-freqs \ -v pad="." \ -v showentropy=0 \ -v chars=".CSaoeilmnrchtpkfsqjdvxyg" end Language A: Digraph counts: TT . 
C S a o e i l m n r c h t p k f s q d y ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- n 1376 1341 1 . 1 7 . . 3 2 . 1 1 . 1 . . . 2 . 8 8 m 265 261 . . 1 3 . . . . . . . . . . . . . . . . r 1569 1302 32 6 59 59 1 5 4 3 . 2 4 . . . 1 . 2 . 16 73 l 1720 1310 35 14 11 88 7 . . 3 . 2 6 . 6 1 7 1 40 1 120 68 y 3189 2543 91 16 5 16 3 . 3 1 . 2 10 . 183 21 186 2 16 2 89 . s 669 316 51 5 104 107 16 . . 2 . . 4 12 . 1 3 . 1 . . 47 d 2234 160 145 51 1109 175 24 . 14 5 . 2 10 . 3 3 5 . 11 . 7 510 k 1650 24 377 75 223 252 258 . . 1 . . 43 226 . . . . 3 . 2 166 t 1790 17 423 57 161 273 124 . 1 . . . 28 522 1 1 . . 4 1 5 172 p 324 7 117 11 9 35 . . . . . . 16 101 1 . . . . . 3 24 f 106 9 28 5 6 14 . . . . . . 2 30 . . . . . . 4 8 . 7812 . 1507 745 79 1145 26 12 41 16 3 57 615 . 267 95 352 33 356 708 1266 489 c 1001 . . . . . . . . . . . . 122 522 101 226 30 . . . . o 5711 410 59 24 74 18 60 101 1325 91 7 993 141 . 726 83 742 35 117 4 641 60 a 2318 43 4 . 1 4 . 1305 311 131 54 412 3 . 4 2 7 2 10 . 19 6 i 2601 2 1 2 . 1 3 1173 4 6 1300 83 3 . 1 . 14 . 3 . 5 . e 1958 32 11 1 118 529 475 1 1 3 12 4 12 . 34 11 37 1 79 . 10 587 h 1013 10 4 1 93 335 177 1 2 . . 2 . . 1 . 1 . 9 . 4 373 S 1016 15 5 . 47 525 233 . 3 . . . 21 . 6 . 13 1 4 . 6 137 C 2892 10 . 3 217 1427 549 3 7 1 . 9 80 . 32 5 51 1 12 . 28 457 q 716 . 1 . . 698 2 . 1 . . . 2 . 2 . 5 . . . 1 4 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 41930 7812 2892 1016 2318 5711 1958 2601 1720 265 1376 1569 1001 1013 1790 324 1650 106 669 716 2234 3189 Next-symbol probability (× 99): TT . C S a o e i l m n r c h t p k f s q d y -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 99 . 19 9 1 15 . . 1 . . 1 8 . 3 1 4 . 5 9 16 6 C 99 . . . 7 49 19 . . . . . 3 . 1 . 2 . . . 1 16 S 99 1 . . 5 51 23 . . . . . 2 . 1 . 1 . . . 1 13 a 99 2 . . . . . 
56 13 6 2 18 . . . . . . . . 1 . o 99 7 1 . 1 . 1 2 23 2 . 17 2 . 13 1 13 1 2 . 11 1 e 99 2 1 . 6 27 24 . . . 1 . 1 . 2 1 2 . 4 . 1 30 i 99 . . . . . . 45 . . 49 3 . . . . 1 . . . . . l 99 75 2 1 1 5 . . . . . . . . . . . . 2 . 7 4 m 99 98 . . . 1 . . . . . . . . . . . . . . . . n 99 96 . . . 1 . . . . . . . . . . . . . . 1 1 r 99 82 2 . 4 4 . . . . . . . . . . . . . . 1 5 c 99 . . . . . . . . . . . . 12 52 10 22 3 . . . . h 99 1 . . 9 33 17 . . . . . . . . . . . 1 . . 36 t 99 1 23 3 9 15 7 . . . . . 2 29 . . . . . . . 10 p 99 2 36 3 3 11 . . . . . . 5 31 . . . . . . 1 7 k 99 1 23 5 13 15 15 . . . . . 3 14 . . . . . . . 10 f 99 8 26 5 6 13 . . . . . . 2 28 . . . . . . 4 7 s 99 47 8 1 15 16 2 . . . . . 1 2 . . . . . . . 7 q 99 . . . . 97 . . . . . . . . . . 1 . . . . 1 d 99 7 6 2 49 8 1 . 1 . . . . . . . . . . . . 23 y 99 79 3 . . . . . . . . . . . 6 1 6 . . . 3 . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 18 7 2 5 13 5 6 4 1 3 4 2 2 4 1 4 0 2 2 5 8 Previous-symbol probability (× 99): TT . C S a o e i l m n r c h t p k f s q d y -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 18 . 52 73 3 20 1 . 2 6 . 4 61 . 15 29 21 31 53 98 56 15 C 7 . . . 9 25 28 . . . . 1 8 . 2 2 3 1 2 . 1 14 S 2 . . . 2 9 12 . . . . . 2 . . . 1 1 1 . . 4 a 5 1 . . . . . 50 18 49 4 26 . . . 1 . 2 1 . 1 . o 13 5 2 2 3 . 3 4 76 34 1 63 14 . 40 25 45 33 17 1 28 2 e 5 . . . 5 9 24 . . 1 1 . 1 . 2 3 2 1 12 . . 18 i 6 . . . . . . 45 . 2 94 5 . . . . 1 . . . . . l 4 17 1 1 . 2 . . . 1 . . 1 . . . . 1 6 . 5 2 m 1 3 . . . . . . . . . . . . . . . . . . . . n 3 17 . . . . . . . 1 . . . . . . . . . . . . r 4 17 1 1 3 1 . . . 1 . . . . . . . . . . 1 2 c 2 . . . . . . . . . . . . 12 29 31 14 28 . . . . h 2 . . . 4 6 9 . . . . . . . . . . . 1 . . 12 t 4 . 14 6 7 5 6 . . . . . 3 51 . . . . 1 . . 5 p 1 . 4 1 . 1 . . . . . . 2 10 . . . . . . . 1 k 4 . 13 7 10 4 13 . . . . . 4 22 . . . . . . . 5 f 0 . 1 . . . . . . . . . . 3 . . . . . . . . s 2 4 2 . 4 2 1 . . 1 . 
. . 1 . . . . . . . 1 q 2 . . . . 12 . . . . . . . . . . . . . . . . d 5 2 5 5 47 3 1 . 1 2 . . 1 . . 1 . . 2 . . 16 y 8 32 3 2 . . . . . . . . 1 . 10 6 11 2 2 . 4 . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 Language B Digraph counts: TT . C S a o e i l m n r c h t p k f s q d x y ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- n 513 500 1 . . 1 . . . . . 4 . . . . . . 1 1 1 . 4 y 1754 1473 18 8 1 6 2 . 5 4 . 2 1 . 75 17 110 8 4 . 20 . . m 113 107 . . 2 . . . . 1 . . . . . . . . . . 3 . . r 670 532 6 2 77 16 2 4 1 . . . 1 . . . . . . . 8 . 21 l 612 315 35 15 36 34 2 . 1 . . 4 3 . 8 . 55 4 10 1 52 . 37 s 191 94 6 . 60 7 5 . 1 . . . . 6 . 2 1 . 1 . 2 . 6 d 1477 71 19 13 421 38 12 1 6 . . 1 2 . . 1 2 . 2 . . . 888 f 86 5 33 1 20 7 2 . . . . . 2 9 . . . . 1 . 1 . 5 p 142 5 65 9 22 12 1 . . . . . 5 12 . . . . . . 4 . 7 . 3223 . 540 256 171 760 14 7 49 5 . 21 42 . 137 53 163 21 75 330 341 2 236 x 4 1 . . . 3 . . . . . . . . . . . . . . . . . c 219 . . . . . . . . . . . . 23 63 12 112 9 . . . . . o 1695 51 5 2 16 8 21 4 297 3 1 174 35 . 216 35 517 28 40 . 230 1 11 a 1368 17 . 1 . 1 . 569 245 99 4 398 4 . 1 1 10 1 7 . 8 . 2 i 1051 . . . . 2 . 464 1 . 508 64 3 . . 1 6 . . . 2 . . k 1106 20 94 21 374 49 330 . 1 . . . 4 112 . . . . 4 . 3 . 94 t 530 5 73 18 128 53 145 . . . . . 5 63 . . . . . . . . 40 h 225 3 1 1 5 10 65 1 . . . . 1 . . . . 1 . . 26 1 110 S 350 2 . . 5 50 206 . . . . . 12 . . . 6 . 3 . 44 . 22 C 909 2 . 1 19 93 406 1 2 1 . 1 71 . 9 3 25 1 6 . 212 . 56 e 1497 20 13 2 11 219 279 . 3 . . 1 27 . 21 17 99 13 37 . 520 . 215 q 332 . . . . 326 5 . . . . . 1 . . . . . . . . . . 
----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 18067 3223 909 350 1368 1695 1497 1051 612 113 513 670 219 225 530 142 1106 86 191 332 1477 4 1754 Next-symbol probability (× 99): TT . C S a o e i l m n r c h t p k f s q d x y -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 99 . 17 8 5 23 . . 2 . . 1 1 . 4 2 5 1 2 10 10 . 7 C 99 . . . 2 10 44 . . . . . 8 . 1 . 3 . 1 . 23 . 6 S 99 1 . . 1 14 58 . . . . . 3 . . . 2 . 1 . 12 . 6 a 99 1 . . . . . 41 18 7 . 29 . . . . 1 . 1 . 1 . . o 99 3 . . 1 . 1 . 17 . . 10 2 . 13 2 30 2 2 . 13 . 1 e 99 1 1 . 1 14 18 . . . . . 2 . 1 1 7 1 2 . 34 . 14 i 99 . . . . . . 44 . . 48 6 . . . . 1 . . . . . . l 99 51 6 2 6 6 . . . . . 1 . . 1 . 9 1 2 . 8 . 6 m 99 94 . . 2 . . . . 1 . . . . . . . . . . 3 . . n 99 96 . . . . . . . . . 1 . . . . . . . . . . 1 r 99 79 1 . 11 2 . 1 . . . . . . . . . . . . 1 . 3 c 99 . . . . . . . . . . . . 10 28 5 51 4 . . . . . h 99 1 . . 2 4 29 . . . . . . . . . . . . . 11 . 48 t 99 1 14 3 24 10 27 . . . . . 1 12 . . . . . . . . 7 p 99 3 45 6 15 8 1 . . . . . 3 8 . . . . . . 3 . 5 k 99 2 8 2 33 4 30 . . . . . . 10 . . . . . . . . 8 f 99 6 38 1 23 8 2 . . . . . 2 10 . . . . 1 . 1 . 6 s 99 49 3 . 31 4 3 . 1 . . . . 3 . 1 1 . 1 . 1 . 3 q 99 . . . . 97 1 . . . . . . . . . . . . . . . . d 99 5 1 1 28 3 1 . . . . . . . . . . . . . . . 60 x 99 25 . . . 74 . . . . . . . . . . . . . . . . . y 99 83 1 . . . . . . . . . . . 4 1 6 . . . 1 . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 18 5 2 7 9 8 6 3 1 3 4 1 1 3 1 6 0 1 2 8 0 10 Previous-symbol probability (× 99): TT . C S a o e i l m n r c h t p k f s q d x y -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 18 . 59 72 12 44 1 1 8 4 . 3 19 . 26 37 15 24 39 98 23 50 13 C 5 . . . 1 5 27 . . 1 . . 32 . 2 2 2 1 3 . 14 . 3 S 2 . . . . 3 14 . . . . . 5 . . . 1 . 2 . 3 . 1 a 7 1 . . . . . 54 40 87 1 59 2 . . 
1 1 1 4 . 1 . . o 9 2 1 1 1 . 1 . 48 3 . 26 16 . 40 24 46 32 21 . 15 25 1 e 8 1 1 1 1 13 18 . . . . . 12 . 4 12 9 15 19 . 35 . 12 i 6 . . . . . . 44 . . 98 9 1 . . 1 1 . . . . . . l 3 10 4 4 3 2 . . . . . 1 1 . 1 . 5 5 5 . 3 . 2 m 1 3 . . . . . . . 1 . . . . . . . . . . . . . n 3 15 . . . . . . . . . 1 . . . . . . 1 . . . . r 4 16 1 1 6 1 . . . . . . . . . . . . . . 1 . 1 c 1 . . . . . . . . . . . . 10 12 8 10 10 . . . . . h 1 . . . . 1 4 . . . . . . . . . . 1 . . 2 25 6 t 3 . 8 5 9 3 10 . . . . . 2 28 . . . . . . . . 2 p 1 . 7 3 2 1 . . . . . . 2 5 . . . . . . . . . k 6 1 10 6 27 3 22 . . . . . 2 49 . . . . 2 . . . 5 f 0 . 4 . 1 . . . . . . . 1 4 . . . . 1 . . . . s 1 3 1 . 4 . . . . . . . . 3 . 1 . . 1 . . . . q 2 . . . . 19 . . . . . . . . . . . . . . . . . d 8 2 2 4 30 2 1 . 1 . . . 1 . . 1 . . 1 . . . 50 x 0 . . . . . . . . . . . . . . . . . . . . . . y 10 45 2 2 . . . . 1 4 . . . . 14 12 10 9 2 . 1 . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 The relative frequencies of "t" and "k", "sh" and "ch" are as follows: Language A: t = 1790 k = 1650 ratio t/k = 1.085 Language B: t = 530 k = 1106 ratio t/k = 0.479 Language A: S = 1016 C = 2892 ratio S/C = 0.351 Language B: S = 350 C = 909 ratio S/C = 0.385 So it seems we must collapse t and k, otherwise it will be very hard to find a correspondence between the two languages. We could keep ch and sh distinct, but their next-symbol probabilities are so similar that it seems silly to distinguish them. 
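The four ratios quoted above can be rechecked with a few lines. This is only a sketch in Python rather than the gawk used elsewhere in these notes; the counts are the ones quoted in the text, and the names `counts` and `ratio` are made up for the illustration.

```python
# Symbol counts as quoted in the text above (S = sh, C = ch).
counts = {
    "A": {"t": 1790, "k": 1650, "S": 1016, "C": 2892},
    "B": {"t": 530, "k": 1106, "S": 350, "C": 909},
}

def ratio(lang, num, den):
    """Return count(num) / count(den) for one language."""
    return counts[lang][num] / counts[lang][den]

for lang in ("A", "B"):
    print("Language %s: t/k = %.3f  S/C = %.3f"
          % (lang, ratio(lang, "t", "k"), ratio(lang, "S", "C")))
```

The S/C ratio barely moves between the two languages, while t/k changes by more than a factor of two, which is the asymmetry the paragraph above is reacting to.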
Just to be sure, let's compare the sh and ch contexts in the two languages: foreach lang ( a b ) foreach f ( sh.ch ch.sh ) cat Note-009/he${lang}-f.factored \ | sed \ -e 's/[- .:]//g' \ -e 's/k/t/' \ -e 's/p/f/' \ -e 's/ckh/K/' \ -e 's/cph/P/' \ -e 's/'"${f:r}"'/@/' \ -e 's/'"${f:e}"'/~/' \ | grep '@' \ | sort | uniq -c | expand \ | sort +0 -1nr \ | compute-freqs \ > .tmp-he${lang}-${f:r}.frq end end dicio-wc .tmp-he{a,b}-{sh,ch}.frq lines words bytes file ------ ------- --------- ------------ 322 966 6438 .tmp-hea-sh.frq 860 2580 17698 .tmp-hea-ch.frq 173 519 3466 .tmp-heb-sh.frq 389 1167 7964 .tmp-heb-ch.frq Language A Language B ---------------------------------- ---------------------------------- contexts of sh contexts of ch contexts of sh contexts of ch ----------------- ---------------- ---------------- ----------------- 98 0.096 @ol 221 0.076 @ol 35 0.100 @edy 60 0.066 @edy 92 0.091 @o 150 0.052 @or 16 0.046 @dy 51 0.056 @dy 63 0.062 @or 95 0.033 @y 11 0.031 @ol 43 0.047 @cthy 61 0.060 @y 88 0.030 qot@y 10 0.029 @ey 22 0.024 @ety 39 0.038 @ey 55 0.019 @ey 10 0.029 @y 22 0.024 t@dy 23 0.023 @ody 53 0.018 t@y 9 0.026 @ody 20 0.022 qot@dy 19 0.019 @eey 44 0.015 ot@y 8 0.023 @eedy 15 0.017 @ey 15 0.015 @eol 37 0.013 @oty 8 0.023 @eody 13 0.014 @ol 14 0.014 @aiin 36 0.012 t@or 7 0.020 @eo 12 0.013 @ecthy 14 0.014 @e 34 0.012 @aiin 6 0.017 @ety 12 0.013 @ody 14 0.014 @odaiin 33 0.011 @eor 6 0.017 @or 12 0.013 ot@dy 12 0.012 @eor 32 0.011 ot@ol 5 0.014 @ed 11 0.012 @eody 11 0.011 t@o 31 0.011 @ody 5 0.014 @eey 10 0.011 @y 10 0.010 @eo 30 0.010 t@ol 5 0.014 @eol 9 0.010 @ty 10 0.010 @octhy 30 0.010 yt@y 5 0.014 d@edy 9 0.010 t@edy 10 0.010 ot@y 29 0.010 @cthy 5 0.014 t@dy 7 0.008 @daiin 9 0.009 @cthy 29 0.010 @o 4 0.011 @cthey 7 0.008 @eol Obviously "sh" and "ch" are very different. 
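For readers who do not have the local tools, the context-counting step of the pipeline above can be sketched in Python. This is an approximation under stated assumptions, not the script that produced the tables: it takes a plain list of words (the layout of the ".factored" files is not reproduced here), it skips the k/t, p/f and gallows substitutions, and `context_freqs` is a hypothetical name. Like the sed commands without the "g" flag, it marks only the first occurrence of each digraph.

```python
from collections import Counter

def context_freqs(words, target, other):
    """Count the contexts of `target`, written '@', with `other` written '~'.

    Only the first occurrence of each digraph is marked, and only words
    that contain the target are kept (the "grep '@'" step).
    Returns (count, relative frequency, context) triples, most frequent first.
    """
    contexts = Counter()
    for word in words:
        marked = word.replace(target, "@", 1).replace(other, "~", 1)
        if "@" in marked:
            contexts[marked] += 1
    total = sum(contexts.values())
    return [(n, n / total, ctx) for ctx, n in contexts.most_common()]

# Made-up sample words, only to show the shape of the output.
sample = ["chol", "chol", "shol", "chor", "qotchy"]
for n, freq, ctx in context_freqs(sample, "ch", "sh"):
    print("%4d %5.3f %s" % (n, freq, ctx))
```

Swapping the last two arguments gives the contexts of "sh" instead, which corresponds to the "sh.ch" and "ch.sh" pairs of the inner loop.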
Just to make double sure, we can play the same game with t and k: foreach lang ( a b ) foreach f ( t.k k.t ) cat Note-009/he${lang}-f.factored \ | sed \ -e 's/[- .:]//g' \ -e 's/p/f/' \ -e 's/'"${f:r}"'/@/' \ -e 's/'"${f:e}"'/~/' \ | grep '@' \ | sort | uniq -c | expand \ | sort +0 -1nr \ | compute-freqs \ > .tmp-he${lang}-${f:r}.frq end end dicio-wc .tmp-he{a,b}-{t,k}.frq pr -m -t -i' '1 -w 104 .tmp-he{a,b}-{t,k}.frq \ | expand \ > .tmp-t-k-cmp.txt lines words bytes file ------ ------- --------- ------------ 642 1926 13627 .tmp-hea-t.frq 683 2049 14438 .tmp-hea-k.frq 271 813 5693 .tmp-heb-t.frq 437 1311 9223 .tmp-heb-k.frq Language A Language B ------------------------------------------------ --------------------------------------------- contexts of t contexts of k contexts of t contexts of k --------------------- ---------------------- -------------------- ------------------- 96 0.054 c@hy 39 0.024 qo@chy 18 0.034 o@edy 41 0.037 qo@edy 51 0.029 c@hol 33 0.020 o@y 16 0.030 o@ar 35 0.032 chc@hy 49 0.028 qo@chy 29 0.018 @chy 14 0.027 o@al 35 0.032 o@aiin 42 0.024 c@hor 28 0.017 c@hy 13 0.025 o@aiin 29 0.026 o@ar 38 0.021 o@y 27 0.017 o@aiin 12 0.023 o@y 27 0.025 qo@ar 34 0.019 c@hey 25 0.015 qo@y 12 0.023 qo@edy 25 0.023 o@edy 29 0.016 o@chy 22 0.014 qo@ol 11 0.021 y@edy 23 0.021 o@al 28 0.016 o@ol 21 0.013 o@ol 10 0.019 @edy 20 0.018 qo@aiin 28 0.016 qo@y 20 0.012 @chor 9 0.017 @ar 19 0.017 che@y 27 0.015 o@aiin 19 0.012 @chol 8 0.015 chc@hy 18 0.016 @ar 24 0.014 @chy 18 0.011 @aiin 7 0.013 @chdy 17 0.015 o@y 24 0.014 o@chol 18 0.011 @ol 7 0.013 o@am 16 0.015 o@eedy 20 0.011 c@ho 18 0.011 y@chy 7 0.013 o@chdy 15 0.014 @chdy 20 0.011 cho@y 17 0.010 cho@y 7 0.013 o@eol 15 0.014 qo@chdy 18 0.010 qo@ol 16 0.010 qo@aiin 7 0.013 qo@ar 15 0.014 y@ar 17 0.010 @ol 15 0.009 c@hol 7 0.013 y@eedy 14 0.013 @edy Hm, there is some resemblance, but not as much as I would like. 
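One possible way to put a number on "some resemblance" is the overlap between the top contexts of the two letters. This is not something done in these notes; the Jaccard measure and the cutoff at ten entries are assumptions chosen for the illustration. The context lists are the first ten rows of the language A columns in the table above.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections: shared items over all items."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Top-10 contexts of t and of k in language A, copied from the table above.
top_t_A = ["c@hy", "c@hol", "qo@chy", "c@hor", "o@y",
           "c@hey", "o@chy", "o@ol", "qo@y", "o@aiin"]
top_k_A = ["qo@chy", "o@y", "@chy", "c@hy", "o@aiin",
           "qo@y", "qo@ol", "o@ol", "@chor", "@chol"]

shared = sorted(set(top_t_A) & set(top_k_A))
print(len(shared), "shared contexts:", " ".join(shared))
print("Jaccard overlap: %.2f" % jaccard(top_t_A, top_k_A))
```

A measure like this ignores the frequencies themselves, so it is at best a rough check on the eyeball comparison.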
Perhaps it will get better if I delete the [oqy] prefixes and replace cth, ckh by tch, kch:

  foreach lang ( a b )
    foreach f ( t.k k.t )
      cat Note-009/he${lang}-f.factored \
        | sed \
            -e 's/[- .:]//g' \
            -e 's/p/f/' \
            -e 's/^[qoy]*//' \
            -e 's/c\([tkpf]\)h/\1ch/' \
            -e 's/'"${f:r}"'/@/' \
            -e 's/'"${f:e}"'/~/' \
        | grep '@' \
        | sort | uniq -c | expand \
        | sort +0 -1nr \
        | compute-freqs \
        > .tmp-he${lang}-${f:r}.frq
    end
  end
  dicio-wc .tmp-he{a,b}-{t,k}.frq
  pr -m -t -i' '1 -w 104 .tmp-he{a,b}-{t,k}.frq \
    | expand \
    > .tmp-t-k-cmp.txt

   lines   words     bytes file
  ------ ------- --------- ------------
     446    1338      9376 .tmp-hea-t.frq
     453    1359      9489 .tmp-hea-k.frq
     205     615      4279 .tmp-heb-t.frq
     318     954      6630 .tmp-heb-k.frq

  Language A                                      Language B
  ----------------------------------------------  ----------------------------------------------
  contexts of t           contexts of k           contexts of t           contexts of k
  ----------------------  ----------------------  ----------------------  ----------------------
   218 0.123 @chy          137 0.084 @chy           51 0.097 @edy           89 0.081 @ar
   109 0.061 @chol          80 0.049 @y             35 0.067 @ar            89 0.081 @edy
    99 0.056 @chor          75 0.046 @aiin          26 0.049 @chdy          77 0.070 @aiin
    82 0.046 @y             70 0.043 @ol            23 0.044 @y             44 0.040 @al
    73 0.041 @ol            62 0.038 @chol          20 0.038 @aiin          43 0.039 @eedy
    70 0.039 @chey          61 0.037 @chor          19 0.036 @al            41 0.037 @chdy
    63 0.036 @aiin          50 0.031 @chey          16 0.030 @chedy         35 0.032 ch@chy
    47 0.027 @or            49 0.030 @eey           13 0.025 @eedy          30 0.027 @y
    40 0.023 @cho           36 0.022 @or            12 0.023 @eey           25 0.023 @chy
    29 0.016 @chody         30 0.018 @cho           11 0.021 @am            25 0.023 @eey
    26 0.015 cho@chy        25 0.015 @eol           11 0.021 @chy           19 0.017 @eody
    23 0.013 @char          23 0.014 @al            10 0.019 @chey          19 0.017 che@y
    20 0.011 @eey           21 0.013 ch@chy         10 0.019 @eol           18 0.016 @ain
    20 0.011 cho@y          20 0.012 @ey            10 0.019 @ody           18 0.016 @am
    19 0.011 @chaiin        20 0.012 cho@chy         9 0.017 @or            14 0.013 @ol
    17 0.010 @al            19 0.012 @shy            8 0.015 @ol            13 0.012 @chedy
    17 0.010 ch@chy         18 0.011 @chody          8 0.015 ch@chy         11 0.010 @ey

Not perfect, but convincing enough... Ok.
let's try again to equalize the tail distributions: foreach lang ( a b ) cat Note-009/he${lang}-f.factored \ | sed \ -e 's/d//g' \ -e 's/k/t/g' \ -e 's/f/p/g' \ -e 's/cth/tch/g' \ -e 's/ckh/tch/g' \ -e 's/cph/pch/g' \ -e 's/cfh/pch/g' \ -e 's/ei/a/g' \ -e 's/a/o/g' \ > .he${lang}-f-erg.factored end foreach lang ( a b ) cat .he${lang}-f-erg.factored \ | gawk '/./ {print ($1 $2 $3)}' \ | sed -e 's/--//g' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erg-words-all.frq cat .he${lang}-f-erg.factored \ | grep -v -e '- -' \ | gawk '/./ {print $1}' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erg-unifs-all.frq cat .he${lang}-f-erg.factored \ | grep -e '- -' \ | gawk '/./ {print $1}' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erg-prefs-all.frq cat .he${lang}-f-erg.factored \ | grep -e '- -' \ | gawk '/./ {print $2}' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erg-midfs-all.frq cat .he${lang}-f-erg.factored \ | grep -e '- -' \ | gawk '/./ {print $3}' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erg-suffs-all.frq cat .he${lang}-f-erg.factored \ | grep -e '- -' \ | gawk '/./ {print ($2 $3)}' \ | sed -e 's/--//g' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erg-tails-all.frq end dicio-wc .he{a,b}-f-erg-{word,unif,pref,midf,suff,tail}s-all.frq lines words bytes file ------ ------- --------- ------------ 1563 3126 23155 .hea-f-erg-words-all.frq 193 386 2524 .hea-f-erg-unifs-all.frq 47 94 600 .hea-f-erg-prefs-all.frq 286 572 4647 .hea-f-erg-midfs-all.frq 126 252 1732 .hea-f-erg-suffs-all.frq 888 1776 14121 .hea-f-erg-tails-all.frq 880 1760 12786 .heb-f-erg-words-all.frq 132 264 1735 .heb-f-erg-unifs-all.frq 28 56 345 .heb-f-erg-prefs-all.frq 193 386 3105 .heb-f-erg-midfs-all.frq 76 152 1018 .heb-f-erg-suffs-all.frq 506 1012 7898 .heb-f-erg-tails-all.frq foreach elem ( word unif pref midf suff tail ) foreach lang ( A.a B.b ) set file = "he${lang:e}-f-erg-${elem}s-all" echo "${file}.frq -> 
${file}.fmt" cat .${file}.frq \ | compute-freqs \ | gawk '\ BEGIN {\ printf "by Friedman\nlanguage '"${lang:r}"'\n"; \ printf "freq pc '"${elem}"'ix\n---- -- ------------------\n";} \ /./ {printf "%4d %2d %s\n",$1,int($2*100+0.5),$3; t+=$1;} \ END {printf "---- -- ------------------\n%4d 99 TOTAL\n",t;} \ ' \ > .${file}.fmt end end foreach elem ( word unif pref midf suff tail ) set tfiles = ( ) foreach lang ( a b ) set file = "he${lang}-f-erg-${elem}s-all" set tfiles = ( ${tfiles} .${file}.fmt ) end pr -m -t -i' '1 -w 54 ${tfiles} \ | expand \ > .herbal-f-erg-${elem}-cmp.txt end dicio-wc .herbal-f-erg-{word,unif,pref,midf,suff,tail}-cmp.txt lines words bytes file ------ ------- --------- ------------ 1569 7361 55938 .herbal-f-erg-word-cmp.txt 199 1007 7275 .herbal-f-erg-unif-cmp.txt 53 257 1901 .herbal-f-erg-pref-cmp.txt 292 1469 11188 .herbal-f-erg-midf-cmp.txt 132 638 4738 .herbal-f-erg-suff-cmp.txt 894 4214 32524 .herbal-f-erg-tail-cmp.txt With these transformations, the prefixes are obviously still the same in both languages: by Friedman by Friedman language A language B freq pc prefix freq pc prefix ---- -- ------------------ ---- -- ------------------ 3857 65 - 1269 52 - 825 14 o- 504 21 o- 607 10 qo- 300 12 qo- 440 7 y- 227 9 y- 56 1 s- 66 3 ol- 42 1 ol- 26 1 l- 21 0 so- 6 0 s- 16 0 l- 5 0 o:i- 13 0 or- 4 0 or- 12 0 r- 3 0 lo- 11 0 oy- 2 0 olo- 7 0 o:i- 2 0 qol- 6 0 yo- 2 0 r- 5 0 os- 1 0 lol- 4 0 ro- 1 0 lqo- 4 0 sol- 1 0 o:ii- 4 0 sy- 1 0 o:n- 3 0 lo- 1 0 oo- 2 0 ls- 1 0 oro- 2 0 o:in- 1 0 orol- 2 0 oo- 1 0 oy- 2 0 oro- 1 0 so:i- 2 0 qoo:i- 1 0 sol- The suffixes are close enough: by Friedman by Friedman language A language B freq pc suffix freq pc suffix ---- -- ------------------ ---- -- ------------------ 1853 31 -y 1173 48 -y 1028 17 -ol 250 10 -or 881 15 -or 202 8 -ol 456 8 -o 179 7 -oiin 370 6 -oiin 122 5 -oy 266 5 -oy 96 4 - 136 2 - 73 3 -o 130 2 -om 47 2 -om 96 2 -ooiin 34 1 -os 84 1 -oly 33 1 -oin 77 1 -s 31 1 -oly 76 1 -os 31 1 -s 53 1 -oin 20 1 
-oir 44 1 -ory 11 1 -ooiin 40 1 -oor 10 0 -oor 35 1 -on 9 0 -ory 28 1 -ool 6 0 -orom 12 0 -n 6 0 -oror 12 0 -ols 6 0 -yy 12 0 -yy 5 0 -ool The midfixes are still very different: by Friedman by Friedman language A language B freq pc midfix freq pc midfix ---- -- ------------------ ---- -- ------------------ 1090 18 -tch- 590 24 -t- 1045 18 -ch- 274 11 -te- 913 15 -t- 172 7 -ch- 526 9 -sh- 163 7 -che- 251 4 -che- 141 6 -tch- 191 3 -tche- 135 6 -tee- 181 3 -pch- 110 5 -she- 155 3 -te- 79 3 -sh- 142 2 -she- 64 3 -tche- 131 2 -tee- 57 2 -chtch- 96 2 -chot- 48 2 -chet- 93 2 -tsh- 48 2 -p- 69 1 -chtch- 46 2 -pch- 61 1 -chotch- 39 2 -pche- 60 1 -p- 24 1 -shee- 58 1 -chee- 23 1 -chee- 50 1 -cht- 19 1 -cht- 43 1 -pche- 18 1 -ee- 36 1 -shee- 18 1 -tsh- 30 1 -tchee- 18 1 -tshe- 25 0 -eee- 16 1 -chetch- And the tails, oh my: by Friedman by Friedman language A language B freq pc tailix freq pc tailix ---- -- ------------------ ---- -- ------------------ 379 6 -tchy 171 7 -tey 269 5 -chol 149 6 -tor 232 4 -chor 113 5 -tchy 195 3 -tol 108 4 -toiin 192 3 -tchol 99 4 -teey 191 3 -tchor 97 4 -chey 182 3 -ty 89 4 -tol 163 3 -toiin 73 3 -chy 154 3 -chy 63 3 -ty 121 2 -tchey 58 2 -shey 116 2 -sho 52 2 -tchey 114 2 -tor 49 2 -chtchy 104 2 -shol 30 1 -shy 95 2 -tcho 30 1 -tom 88 2 -chey 28 1 -pchey 83 1 -shy 25 1 -pchy 82 1 -shor 25 1 -teoy 75 1 -teey 23 1 -chety 61 1 -cho 23 1 -toin 58 1 -cheor 22 1 -toly 58 1 -choiin 21 1 -teol 53 1 -tchoy 20 1 -chol 47 1 -chotchy 18 1 -toy 45 1 -choy 17 1 -sheey 45 1 -shey 16 1 -chetchy 44 1 -teol 16 1 -chor 43 1 -chtchy 16 1 -choy 42 1 -pchy 14 1 -teo The unifixes are rather OK, I think, except for the inversion between "oiin" and "or": by Friedman by Friedman language A language B freq pc unifix freq pc unifix ---- -- ------------------ ---- -- ------------------ 441 24 oiin 149 19 or 175 10 or 126 16 oiin 145 8 ol 75 10 ol 107 6 y 35 4 y 88 5 s 25 3 soiin 77 4 oin 22 3 om 55 3 om 18 2 oroiin 40 2 soiin 17 2 oly 31 2 ooiin 16 2 oy 30 2 sor 13 2 
oloiin 28 2 oir 12 2 oin 28 2 sol 12 2 olor 25 1 o 12 2 s 25 1 sy 10 1 ory 20 1 qooiin 9 1 ooiin The words as a whole are rather different: by Friedman by Friedman language A language B freq pc wordix freq pc wordix ---- -- ------------------ ---- -- ------------------ 441 6 oiin 149 5 or 247 3 chol 126 4 oiin 201 3 chor 80 3 chey 182 2 tchy 75 2 ol 175 2 or 64 2 chy 145 2 ol 56 2 qotey 126 2 chy 54 2 otey 108 1 sho 54 2 otor 107 1 tchol 53 2 shey 107 1 y 49 2 otoiin 104 1 tchor 48 2 chtchy 101 1 qotchy 44 1 otol 100 1 shol 43 1 tchy 88 1 s 35 1 qotor 79 1 otol 35 1 y 79 1 shor 33 1 tor 77 1 oin 31 1 oteey 77 1 oty 30 1 oty 97-11-25 stolfi =============== Checking the contexts of "daiin" cat hea-f-eva.wds \ | sed -e '/[-\/]/d' \ | enum-word-pairs \ | grep -w daiin \ | sort | uniq -c | expand \ | sort +0 -1nr \ > .foo 30 chol daiin 23 daiin = 13 daiin daiin 11 daiin cthy 10 shol daiin 9 chor daiin 8 daiin cthor 7 chy daiin 7 daiin dain 6 cthy daiin 6 daiin chol 6 daiin chor 6 daiin cthol 6 daiin sho 6 dain daiin Hmm, my guess that EVA "daiin" = Chinese "de" needs some improvement... cat chin-mch.txt \ | tr ' ' '\012' \ | egrep -e '.' \ | enum-word-pairs \ | grep -w de \ | sort | uniq -c | expand \ | sort +0 -1nr \ > .foo Denis Mardle posted counts of -iiin, -iin, -in, -n per page in the "stars" section. Here NL = num lines, NP = num paragraphs. 
  page   NL NP  i3  i2  i1  i0
  ------ -- -- --- --- --- ---
  f105v  38 10   0   4  83   5
  f105r  37 10   1   1  47   6
  f113v  49 15   0  20  83   5
  f114r  45 13   0  22  90   7
  f113r  51 17   0  21  75   4
  f104r  45 13   1  18  63   3
  f107r  51 15   1  31  93   4
  f106v  47 15   0  24  65   1
  f106r  47 15   0  24  65   1
  f114v  41 12   0  24  65   3
  f104v  44 13   0  26  59   2
  f107v  49 15   0  43  85   1
  f108r  50 16   1  22  39   0
  f112v  47 14   1  34  59   7
  f112r  45 12   0  19  31   3
  f108v  53 16   1  39  53   0
  f115r  45 13   1  26  34   2
  f111r  54 17   0  50  45   1
  f115v  45 13   0  38  28   2
  f103v  46 14   4  40  31   1
  f103r  54 19   2  46  27   0
  f111v  51 19   5 113  41   1
  f116r  30 10   2  54  13   1
  no st. 20  2   6  39   8   0

97-11-26 stolfi
===============

Using data posted by John Grove, I split several of my textual units (L16-eva/f*) into smaller units, distinguishing real "parags" from his so-called "titles" (which are actually short lines placed at the *end* of a parags block). The files affected were

  f1r.P    -> f1r.P1 f1r.T1 f1r.P2 f1r.T2 f1r.P3 f1r.T3 f1r.P4 f1r.T4
  f8r.P    -> f8r.P1 f8r.T1 f8r.P2 f8r.T2 f8r.P3 f8r.T3
  f9r.P    -> f9r.P f9r.T
  f16r.P   -> f16r.P1 f16r.T1 f16r.P2
  f18r.P   -> f18r.P f18r.T
  f19v.P   -> f19v.P f19v.T
  f22v.P   -> f22v.P f22v.T
  f24r.P   -> f24r.P f24r.T
  f25r.P   -> f25r.P f25r.T
  f27r.P   -> f27r.P f27r.T
  f28v.P   -> f28v.P1 f28v.T1 f28v.P2 f28v.T2
  f31r.P   -> f31r.P f31r.T
  f39r.P   -> f39r.P f39r.T
  f40v.P   -> f40v.P f40v.T
  f41v.P   -> f41v.P f41v.T
  f42r.P   -> f42r.P1 f42r.T1 f42r.P2 f42r.T2 f42r.P3 f42r.T3
  f42v.P   -> f42v.P f42v.T
  (new)    -> f57v.T
  (new)    -> f58v.T
  (new)    -> f65r.L
  (old)    -> f66r.W {entered months ago}
  f82r.P   -> f82r.P1 f82r.T1 f82r.P2
  (new)    -> f85r2.T
  f85r1.P  -> f85r1.P f85r1.T
  f86v5.P  -> f86v5.P f86v5.T
  f94r.P   -> f94r.P f94r.T
  f101v1.P -> f101v1.P f101v1.T
  f101v2.P -> f101v2.P f101v2.T
  f105r.P  -> f105r.P1 f105r.T1 f105r.P2 f105r.T2
  f108v.P  -> f108v.P f108v.T
  f114r.P  -> f114r.P1 f114r.T1 f114r.P2 f114r.T2

Validating it:

  pushd L16-eva
  rm -f .bugs
  foreach f ( f[0-9]* )
    echo '=== '$f' ===' >>& .bugs
    cat $f \
      | ../validate-new-evt-format \
          -v chars='aoeilmnrchtpkfsqgjdvxy' \
          -v location="$f" \
      >>& .bugs
  end
  popd
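The "validate-new-evt-format" script itself is not shown in these notes. As a rough idea of what the "chars" check amounts to, here is a minimal Python sketch. The allowed set is the one passed with "-v chars=..." above; the filler set and the name `invalid_chars` are assumptions made for the example, and the sketch looks only at the transcribed text, not at locators or "{}" comments.

```python
# Letters accepted by the validation run above (the -v chars=... argument).
ALLOWED = set("aoeilmnrchtpkfsqgjdvxy")
# Assumption: separators and fillers that should not be reported as errors.
FILLERS = set(".,-=!%* ")

def invalid_chars(text, allowed=ALLOWED, fillers=FILLERS):
    """Return (position, character) pairs for characters outside both sets."""
    return [(i, c) for i, c in enumerate(text)
            if c not in allowed and c not in fillers]

print(invalid_chars("daiin.chol"))   # clean text gives an empty list
print(invalid_chars("daZin"))        # a stray FSG-style capital is reported
```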
Must redo Note-010 from scratch.

Rene Zandbergen sent me corrected -I*D statistics for the "stars" section. (Although Denis says his statistics were already checked against the Yale copyflo). The format is

  - page code. The first T is the quire (T=20) and the second character the 'page in quire' (A=f103r, ..., X=f116v)
  - nr of words (not sure how commas were counted)
  - nr of words containing iiin
  - nr of words containing iin
  - nr of words containing in
  - nr of words containing n

His numbers were cumulative; I reduced them to exclusive counts by piping the table through

  gawk \
    ' /./ { \
        printf " %s %s %5d %5d %5d %5d %5d\n", \
          $1, $2, $3, $4, $5-$4, $6-$5, $7-$6; \
      } \
    '

Here is the result:

  page     words -iiin  -iin   -in    -n
  -------- ----- ----- ----- ----- -----
  f103r TA   526     0    33    41     2
  f103v TB   454     1    34    37     4
  f104r TC   448     1    66    17     1
  f104v TD   477     3    59    24     0
  f105r TE   379     6    48     1     1
  f105v TF   399     5    85     4     0
  f106r TG   432     1    65    24     0
  f106v TH   444     1    67    23     0
  f107r TI   487     4    93    30     1
  f107v TJ   462     1    84    43     0
  f108r TK   494     0    39    22     1
  f108v TL   581     0    52    39     1
  f111r TM   623     1    44    51     0
  f111v TN   568     1    41   113     6
  f112r TO   401     3    32    21     0
  f112v TP   420     7    60    33     1
  f113r TQ   528     4    79    21     0
  f113v TR   502     5    84    20     0
  f114r TS   460     5    91    23     0
  f114v TT   376     2    68    23     0
  f115r TU   461     1    40    21     2
  f115v TV   410     2    32    33     0
  f116r TW   554     1    25    90     8

Then I ran the table thrice through

  sort-distr -s 18 -n 4 -d

It converged to this stable order after the second iteration:

  page     words -iiin  -iin   -in    -n
  -------- ----- ----- ----- ----- -----
  f105v TF   399     5    85     4     0
  f105r TE   379     6    48     1     1
  f113v TR   502     5    84    20     0
  f114r TS   460     5    91    23     0
  f113r TQ   528     4    79    21     0
  f104r TC   448     1    66    17     1
  f107r TI   487     4    93    30     1
  f114v TT   376     2    68    23     0
  f106v TH   444     1    67    23     0
  f106r TG   432     1    65    24     0
  f104v TD   477     3    59    24     0
  f107v TJ   462     1    84    43     0
  f115r TU   461     1    40    21     2
  f108r TK   494     0    39    22     1
  f112v TP   420     7    60    33     1
  f112r TO   401     3    32    21     0
  f108v TL   581     0    52    39     1
  f115v TV   410     2    32    33     0
  f103v TB   454     1    34    37     4
  f111r TM   623     1    44    51     0
  f103r TA   526     0    33    41     2
  f111v TN   568     1    41   113     6
  f116r TW
554 1 25 90 8 f f f f f f f f f f f f f f f f f f f f f f f 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 0 1 0 0 0 0 1 0 1 1 0 1 0 1 0 1 1 5 5 3 4 3 4 7 4 6 6 4 7 5 8 2 2 8 5 3 1 3 1 6 v r v r r r r v v r v v r r v r v v v r r v r -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- f105v 0 6 13 14 14 14 17 18 18 20 22 26 27 29 28 32 35 42 43 45 46 63 67 f105r 6 0 13 13 14 14 17 18 18 20 21 25 26 28 26 30 34 40 42 43 45 61 65 f113v 13 13 0 1 2 3 5 5 6 7 9 13 15 16 16 19 22 29 30 32 33 50 54 f114r 14 13 1 0 1 2 4 5 5 6 8 12 14 15 15 18 21 28 30 31 33 49 53 f113r 14 14 2 1 0 2 3 4 4 5 7 12 13 14 14 18 20 27 29 30 32 49 52 f104r 14 14 3 2 2 0 4 4 4 6 8 12 14 15 15 19 21 28 29 31 32 49 53 f107r 17 17 5 4 3 4 0 1 2 3 4 9 10 11 11 15 17 24 26 27 29 46 49 f114v 18 18 5 5 4 4 1 0 1 2 4 8 10 10 11 14 16 24 25 27 28 45 49 f106v 18 18 6 5 4 4 2 1 0 1 4 8 10 10 11 15 16 24 25 27 28 45 49 f106r 20 20 7 6 5 6 3 2 1 0 3 6 8 9 10 13 15 22 24 25 27 43 47 f104v 22 21 9 8 7 8 4 4 4 3 0 5 6 7 7 10 13 20 22 23 24 41 45 f107v 26 25 13 12 12 12 9 8 8 6 5 0 4 3 6 8 9 16 18 19 21 37 41 f115r 27 26 15 14 13 14 10 10 10 8 6 4 0 2 4 6 8 15 16 18 19 36 39 f108r 29 28 16 15 14 15 11 10 10 9 7 3 2 0 5 6 6 14 15 16 18 35 38 f112v 28 26 16 15 14 15 11 11 11 10 7 6 4 5 0 4 8 14 15 17 19 35 38 f112r 32 30 19 18 18 19 15 14 15 13 10 8 6 6 4 0 5 10 12 13 15 31 35 f108v 35 34 22 21 20 21 17 16 16 15 13 9 8 6 8 5 0 8 10 10 12 29 32 f115v 42 40 29 28 27 28 24 24 24 22 20 16 15 14 14 10 8 0 4 3 5 21 25 f103v 43 42 30 30 29 29 26 25 25 24 22 18 16 15 15 12 10 4 0 5 4 20 23 f111r 45 43 32 31 30 31 27 27 27 25 23 19 18 16 17 13 10 3 5 0 3 18 22 f103r 46 45 33 33 32 32 29 28 28 27 24 21 19 18 19 15 12 5 4 3 0 17 20 f111v 63 61 50 49 49 49 46 45 45 43 41 37 36 35 35 31 29 21 20 18 17 0 4 f116r 67 65 54 53 52 53 49 49 49 47 45 41 39 38 38 35 32 25 23 22 20 4 0 I created a picture of the distance matrix by piping the data through sort-distr -s 18 -n 4 -d -p - | pnmscale 8 | ppmtogif 
> .stars-dist.gif The obvious groupings are: W = { f105v f105r } X = { f113v f114r f113r f104r f107r f114v f106v f106r f104v f107v f115r f108r f112v f112r f108v } Y = { f115v f103v f111r f103r } Z = { f111v f116r } The first row of group X is fairly homogeneous; the second row seems to be a gradient, in the order shown. The previous ordering was w = { f105r f105v } x = { f113v f114r f113r f104r f107r f106r f106v f114v f104v f107v f108r f112v f112r f108v f115r } y = { f111r f115v f103v f103r } z = { f111v f116r } So indeed the corrections had no significant effect. Denis later posted the per-paragraph counts: gawk \ ' \ /^#/ {p=$2;n=0;next;} \ /./{n++;printf "%s.P%02d %3d%3d%3d%3d%3d\n", p,n,$1,$2,$3,$4,$5;} \ ' Here is Denis's data, transposed and fitted with paragraph labels. Each row is one paragraph. The data columns are the number of lines in the paragraph and the counts of -D, -ID, -IID, and -IIID endings. The data for page 108v had a missing column, so I filled it with '99's. page ln i0 i1 i2 i3 --------- -- -- -- -- -- f103r.P01 4 0 3 0 0 f103r.P02 3 1 3 0 0 f103r.P03 1 0 1 0 0 f103r.P04 4 0 7 4 0 f103r.P05 3 0 5 1 0 f103r.P06 2 0 1 1 0 f103r.P07 3 0 6 3 0 f103r.P08 3 0 1 1 0 f103r.P09 3 0 6 0 0 f103r.P10 3 0 3 1 0 f103r.P11 3 0 0 6 0 f103r.P12 2 0 0 1 0 f103r.P13 2 0 4 0 0 f103r.P14 3 0 2 0 0 f103r.P15 3 0 0 1 0 f103r.P16 2 0 0 1 0 f103r.P17 3 1 0 2 0 f103r.P18 4 0 4 3 0 f103r.P19 3 0 0 2 0 f103v.P01 4 0 2 2 0 f103v.P02 4 0 3 5 0 f103v.P03 3 0 2 2 0 f103v.P04 2 1 2 0 0 f103v.P05 3 0 4 2 0 f103v.P06 3 0 1 3 0 f103v.P07 1 0 3 0 0 f103v.P08 6 3 7 3 1 f103v.P09 3 0 0 4 0 f103v.P10 3 0 2 2 0 f103v.P11 4 0 2 4 0 f103v.P12 2 0 1 0 0 f103v.P13 4 0 4 2 0 f103v.P14 4 0 7 2 0 f104r.P01 4 0 1 5 2 f104r.P02 5 0 5 4 0 f104r.P03 2 0 0 0 0 f104r.P04 4 0 2 6 0 f104r.P05 3 0 1 0 0 f104r.P06 3 1 1 4 0 f104r.P07 5 0 2 5 0 f104r.P08 3 0 0 7 0 f104r.P09 4 0 2 7 0 f104r.P10 2 0 0 5 0 f104r.P11 3 0 0 6 0 f104r.P12 4 0 1 6 1 f104r.P13 3 0 3 8 0 f104v.P01 5 0 2 9 0 f104v.P02 2 0 2 0 
0 f104v.P03 4 0 1 6 0 f104v.P04 3 0 1 5 0 f104v.P05 4 0 3 3 1 f104v.P06 3 0 1 2 0 f104v.P07 5 0 1 11 1 f104v.P08 2 0 0 4 0 f104v.P09 3 0 0 5 0 f104v.P10 3 0 5 4 0 f104v.P11 2 0 1 3 0 f104v.P12 4 0 6 3 0 f104v.P13 4 0 3 4 0 f105r.P01 5 1 0 8 0 f105r.P02 4 0 0 3 2 f105r.P03 4 0 0 4 1 f105r.P04 3 0 0 4 0 f105r.P05 7 0 0 15 0 f105r.P06 2 0 1 2 1 f105r.P07 2 0 0 0 0 f105r.P08 3 0 0 4 1 f105r.P09 4 0 0 4 1 f105r.P10 3 0 0 3 0 f105v.P01 4 0 0 8 0 f105v.P02 3 0 0 5 0 f105v.P03 6 0 0 9 2 f105v.P04 4 0 0 12 2 f105v.P05 2 0 0 5 0 f105v.P06 3 0 1 7 0 f105v.P07 2 0 0 5 1 f105v.P08 4 0 1 6 0 f105v.P09 3 0 0 8 0 f105v.P10 7 0 2 18 0 f106r.P01 3 0 0 0 0 f106r.P02 4 0 0 4 0 f106r.P03 2 0 1 1 0 f106r.P04 3 0 1 4 0 f106r.P05 2 0 0 5 0 f106r.P06 3 0 4 4 0 f106r.P07 4 0 2 8 0 f106r.P08 2 0 1 0 0 f106r.P09 3 0 3 3 0 f106r.P10 5 0 3 12 0 f106r.P11 2 0 1 2 0 f106r.P12 2 0 1 2 0 f106r.P13 2 0 0 3 0 f106r.P14 4 0 0 8 1 f106r.P15 6 0 7 9 0 f106v.P01 4 0 5 5 0 f106v.P02 4 0 4 9 0 f106v.P03 2 0 2 3 0 f106v.P04 4 0 1 5 1 f106v.P05 3 0 1 1 0 f106v.P06 2 0 2 3 0 f106v.P07 2 0 1 3 0 f106v.P08 4 0 2 5 0 f106v.P09 2 0 1 1 0 f106v.P10 4 0 1 7 0 f106v.P11 2 0 0 3 0 f106v.P12 3 0 0 5 0 f106v.P13 3 0 0 3 0 f106v.P14 2 0 0 3 0 f106v.P15 6 0 4 9 0 f107r.P01 3 0 3 4 0 f107r.P02 4 0 3 5 0 f107r.P03 5 1 3 9 0 f107r.P04 3 0 2 8 0 f107r.P05 2 0 0 3 0 f107r.P06 3 0 0 5 0 f107r.P07 3 0 1 8 2 f107r.P08 3 0 2 3 1 f107r.P09 3 0 0 6 0 f107r.P10 4 0 2 9 1 f107r.P11 4 0 2 8 0 f107r.P12 4 0 4 6 0 f107r.P13 3 0 2 6 0 f107r.P14 3 0 1 8 0 f107r.P15 4 0 6 5 0 f107v.P01 4 0 5 8 1 f107v.P02 3 0 2 7 0 f107v.P03 3 0 3 6 0 f107v.P04 5 0 5 9 0 f107v.P05 4 0 2 11 0 f107v.P06 3 0 1 5 0 f107v.P07 2 0 3 2 0 f107v.P08 4 0 3 4 0 f107v.P09 3 0 1 5 0 f107v.P10 3 0 3 9 0 f107v.P11 2 0 2 2 0 f107v.P12 4 0 4 10 0 f107v.P13 2 0 3 0 0 f107v.P14 2 0 0 1 0 f107v.P15 5 0 6 6 0 f108r.P01 4 0 0 2 0 f108r.P02 3 0 0 2 0 f108r.P03 3 0 1 4 0 f108r.P04 3 0 3 8 0 f108r.P05 3 0 3 4 0 f108r.P06 4 0 2 4 0 f108r.P07 3 0 2 2 0 f108r.P08 5 0 3 6 0 f108r.P09 
2 1 0 2 0 f108r.P10 4 0 1 0 0 f108r.P11 2 0 0 0 0 f108r.P12 2 0 0 0 0 f108r.P13 2 0 1 2 0 f108r.P14 4 0 3 1 0 f108r.P15 3 0 2 1 0 f108r.P16 3 0 1 1 0 f108v.P01 4 0 99 2 0 f108v.P02 2 0 99 2 0 f108v.P03 5 0 99 4 0 f108v.P04 3 0 99 1 0 f108v.P05 5 0 99 1 0 f108v.P06 3 0 99 3 0 f108v.P07 3 0 99 1 0 f108v.P08 4 1 99 2 0 f108v.P09 3 0 99 4 0 f108v.P10 3 0 99 2 0 f108v.P11 3 0 99 5 0 f108v.P12 3 0 99 4 0 f108v.P13 3 0 99 2 0 f108v.P14 3 0 99 3 0 f108v.P15 2 0 99 1 0 f108v.P16 4 0 99 2 0 f111r.P01 5 0 1 4 0 f111r.P02 2 0 1 2 0 f111r.P03 2 0 3 2 0 f111r.P04 3 0 2 4 1 f111r.P05 4 0 3 4 0 f111r.P06 3 0 0 3 0 f111r.P07 3 0 4 2 0 f111r.P08 3 0 4 2 0 f111r.P09 3 0 2 4 0 f111r.P10 3 0 2 2 0 f111r.P11 2 0 2 0 0 f111r.P12 2 0 3 1 0 f111r.P13 3 0 4 1 0 f111r.P14 5 0 5 4 0 f111r.P15 4 0 4 1 0 f111r.P16 3 0 4 1 0 f111r.P17 4 0 6 8 0 f111v.P01 3 0 5 0 0 f111v.P02 2 0 0 2 0 f111v.P03 2 0 3 4 0 f111v.P04 2 0 3 0 0 f111v.P05 3 0 6 4 0 f111v.P06 3 0 7 5 0 f111v.P07 2 0 6 3 0 f111v.P08 2 0 7 4 0 f111v.P09 4 1 14 4 0 f111v.P10 2 0 3 1 0 f111v.P11 3 0 5 0 0 f111v.P12 2 0 8 2 0 f111v.P13 1 0 5 0 0 f111v.P14 2 4 4 0 0 f111v.P15 3 0 7 2 0 f111v.P16 2 0 3 1 0 f111v.P17 6 0 9 4 0 f111v.P18 3 0 8 1 0 f111v.P19 4 0 10 4 1 f112r.P01 4 0 2 5 0 f112r.P02 6 0 2 2 0 f112r.P03 4 0 1 3 0 f112r.P04 4 0 2 2 0 f112r.P05 4 0 1 5 0 f112r.P06 4 0 2 2 0 f112r.P07 4 0 1 2 1 f112r.P08 3 0 2 1 1 f112r.P09 3 0 2 1 0 f112r.P10 2 0 1 2 0 f112r.P11 3 0 2 2 1 f112r.P12 4 0 1 4 0 f112v.P01 6 0 2 5 2 f112v.P02 4 0 2 6 2 f112v.P03 4 0 4 7 0 f112v.P04 5 0 5 7 1 f112v.P05 3 0 4 4 0 f112v.P06 2 0 3 3 0 f112v.P07 2 0 0 2 0 f112v.P08 3 0 3 3 0 f112v.P09 2 0 0 2 0 f112v.P10 4 0 1 6 0 f112v.P11 3 0 3 5 1 f112v.P12 3 1 3 4 0 f112v.P13 3 0 2 1 0 f112v.P14 3 0 2 4 1 f113r.P01 3 0 1 2 0 f113r.P02 4 0 2 8 0 f113r.P03 2 0 0 5 0 f113r.P04 3 0 1 6 0 f113r.P05 3 0 0 1 2 f113r.P06 4 0 1 6 1 f113r.P07 2 0 2 1 1 f113r.P08 3 0 0 4 0 f113r.P09 2 0 0 2 0 f113r.P10 3 0 2 6 0 f113r.P11 4 0 4 6 0 f113r.P12 3 0 2 4 0 f113r.P13 2 0 0 4 0 f113r.P14 3 
0 2 4 0 f113r.P15 3 0 0 3 0 f113r.P16 3 0 1 6 0 f113r.P17 4 0 3 7 0 f113v.P01 3 0 2 3 0 f113v.P02 3 0 1 4 0 f113v.P03 3 0 0 4 0 f113v.P04 3 0 0 2 0 f113v.P05 5 0 2 11 0 f113v.P06 3 0 0 5 0 f113v.P07 4 0 2 10 0 f113v.P08 3 0 0 4 0 f113v.P09 5 0 1 8 4 f113v.P10 3 0 2 6 0 f113v.P11 2 0 0 4 0 f113v.P12 4 0 4 6 1 f113v.P13 3 0 2 8 0 f113v.P14 2 0 1 2 0 f113v.P15 3 0 3 6 0 f114r.P01 3 0 0 3 0 f114r.P02 4 0 1 11 1 f114r.P03 3 0 0 7 0 f114r.P04 3 0 0 3 1 f114r.P05 5 0 3 7 0 f114r.P06 3 0 0 5 1 f114r.P07 2 0 2 4 0 f114r.P08 4 0 0 9 0 f114r.P09 4 0 5 7 1 f114r.P10 3 0 2 11 1 f114r.P11 4 0 3 8 1 f114r.P12 3 0 2 7 1 f114r.P13 4 0 4 8 0 f114v.P01 5 0 8 2 0 f114v.P02 2 0 0 3 1 f114v.P03 3 0 3 7 0 f114v.P04 3 0 3 5 1 f114v.P05 4 0 1 5 0 f114v.P06 2 0 1 3 0 f114v.P07 3 0 2 4 0 f114v.P08 3 0 1 6 0 f114v.P09 3 0 1 4 0 f114v.P10 4 0 2 7 0 f114v.P11 3 0 0 6 0 f114v.P12 6 0 2 13 1 f115r.P01 3 0 1 0 0 f115r.P02 4 0 4 0 0 f115r.P03 2 0 1 2 0 f115r.P04 3 0 3 4 0 f115r.P05 6 0 6 4 0 f115r.P06 3 0 1 2 0 f115r.P07 4 0 1 2 0 f115r.P08 3 0 1 1 1 f115r.P09 6 0 0 10 1 f115r.P10 2 1 2 2 0 f115r.P11 3 0 1 4 0 f115r.P12 2 0 1 1 0 f115r.P13 4 0 4 2 0 f115v.P01 5 0 4 4 0 f115v.P02 2 0 2 3 0 f115v.P03 3 0 2 3 0 f115v.P04 2 0 0 2 0 f115v.P05 5 0 7 1 0 f115v.P06 3 0 5 2 0 f115v.P07 3 0 3 2 0 f115v.P08 5 0 5 2 0 f115v.P09 2 0 0 1 0 f115v.P10 3 0 2 1 0 f115v.P11 3 0 2 2 0 f115v.P12 4 0 5 2 2 f115v.P13 5 0 1 3 0 f116r.P01 3 0 6 0 1 f116r.P02 3 0 4 2 0 f116r.P03 3 0 5 0 0 f116r.P04 3 0 7 3 0 f116r.P05 2 0 5 0 0 f116r.P06 3 1 2 1 0 f116r.P07 3 0 7 2 0 f116r.P08 3 0 8 0 0 f116r.P09 3 0 5 1 0 f116r.P10 4 1 5 4 0 f116r.P11 15 3 31 8 0 f116r.P12 5 3 8 0 0 Here is the same data, minus the line counts and the "f1" prefix, with identical entries fused together: page i0 i1 i2 i3 pages with same counts ------- -- -- -- -- ------------------------- 03r.P01 0 3 0 0 03r.P01 03v.P07 07v.P13 11v.P04 03r.P02 1 3 0 0 03r.P02 03r.P03 0 1 0 0 03r.P03 03v.P12 04r.P05 06r.P08 08r.P10 15r.P01 03r.P04 0 7 4 0 03r.P04 11v.P08 
03r.P05 0 5 1 0 03r.P05 08v.P04 16r.P09 03r.P06 0 1 1 0 03r.P06,08 06r.P03 06v.P05,09 08r.P16 15r.P12 03r.P07 0 6 3 0 03r.P07 04v.P12 11v.P07 03r.P09 0 6 0 0 03r.P09 03r.P10 0 3 1 0 03r.P10 08r.P14 11r.P12 11v.P10 11v.P16 03r.P11 0 0 6 0 03r.P11 04r.P11 07r.P09 14v.P11 03r.P12 0 0 1 0 03r.P12,15,16 07v.P14 15v.P09 03r.P13 0 4 0 0 03r.P13 15r.P02 03r.P14 0 2 0 0 03r.P14 04v.P02 11r.P11 03r.P17 1 0 2 0 03r.P17 08r.P09 03r.P18 0 4 3 0 03r.P18 03r.P19 0 0 2 0 03r.P19 08r.P01,02 08v.P13 11v.P02 12v.P07,09 13r.P09 13v.P04 15v.P04 03v.P01 0 2 2 0 03v.P01,03,10 07v.P11 08r.P07 08v.P02 11r.P10 12r.P02,04,06 15v.P11 03v.P02 0 3 5 0 03v.P02 07r.P02 03v.P04 1 2 0 0 03v.P04 03v.P05 0 4 2 0 03v.P05 03v.P13 08v.P01 11r.P07 11r.P08 15r.P13 16r.P02 03v.P06 0 1 3 0 03v.P06 04v.P11 06v.P07 08v.P14 12r.P03 14v.P06 15v.P13 03v.P08 3 7 3 1 03v.P08 03v.P09 0 0 4 0 03v.P09 04v.P08 05r.P04 06r.P02 13r.P08,13 13v.P03,08,11 03v.P11 0 2 4 0 03v.P11 08r.P06 08v.P09 11r.P09 13r.P12,14 14r.P07 14v.P07 03v.P14 0 7 2 0 03v.P14 11v.P15 16r.P07 04r.P01 0 1 5 2 04r.P01 04r.P02 0 5 4 0 04r.P02 04v.P10 11r.P14 04r.P03 0 0 0 0 04r.P03 05r.P07 06r.P01 08r.P11,12 04r.P04 0 2 6 0 04r.P04 07r.P13 13r.P10 13v.P10 04r.P06 1 1 4 0 04r.P06 04r.P07 0 2 5 0 04r.P07 06v.P08 12r.P01 04r.P08 0 0 7 0 04r.P08 14r.P03 04r.P09 0 2 7 0 04r.P09 07v.P02 14v.P10 04r.P10 0 0 5 0 04r.P10 04v.P09 05v.P02,05 06r.P05 06v.P12 07r.P06 13r.P03 13v.P06 04r.P12 0 1 6 1 04r.P12 13r.P06 04r.P13 0 3 8 0 04r.P13 08r.P04 04v.P01 0 2 9 0 04v.P01 04v.P03 0 1 6 0 04v.P03 05v.P08 12v.P10 13r.P04,16 14v.P08 04v.P04 0 1 5 0 04v.P04 07v.P06 07v.P09 08v.P11 12r.P05 14v.P05 04v.P05 0 3 3 1 04v.P05 04v.P06 0 1 2 0 04v.P06 06r.P11,12 08r.P13 08v.P10 11r.P02 12r.P10 13r.P01 13v.P14 15r.P03,06,07 04v.P07 0 1 11 1 04v.P07 14r.P02 04v.P13 0 3 4 0 04v.P13 07r.P01 07v.P08 08r.P05 11r.P05 11v.P03 15r.P04 05r.P01 1 0 8 0 05r.P01 05r.P02 0 0 3 2 05r.P02 05r.P03 0 0 4 1 05r.P03 05r.P08 05r.P09 05r.P05 0 0 15 0 05r.P05 05r.P06 0 1 2 1 05r.P06 12r.P07 05r.P10 0 
0 3 0 05r.P10 06r.P13 06v.P11,13,14 07r.P05 13r.P15 14r.P01 05v.P01 0 0 8 0 05v.P01 05v.P09 05v.P03 0 0 9 2 05v.P03 05v.P04 0 0 12 2 05v.P04 05v.P06 0 1 7 0 05v.P06 06v.P10 05v.P07 0 0 5 1 05v.P07 14r.P06 05v.P10 0 2 18 0 05v.P10 06r.P04 0 1 4 0 06r.P04 08r.P03 11r.P01 12r.P12 13v.P02 14v.P09 15r.P11 06r.P06 0 4 4 0 06r.P06 08v.P03,12 12v.P05 15v.P01 06r.P07 0 2 8 0 06r.P07 07r.P04,11 13r.P02 13v.P13 06r.P09 0 3 3 0 06r.P09 12v.P06 12v.P08 06r.P10 0 3 12 0 06r.P10 06r.P14 0 0 8 1 06r.P14 06r.P15 0 7 9 0 06r.P15 06v.P01 0 5 5 0 06v.P01 06v.P02 0 4 9 0 06v.P02 06v.P15 06v.P03 0 2 3 0 06v.P03,06 13v.P01 15v.P02,03 06v.P04 0 1 5 1 06v.P04 07r.P03 1 3 9 0 07r.P03 07r.P07 0 1 8 2 07r.P07 07r.P08 0 2 3 1 07r.P08 07r.P10 0 2 9 1 07r.P10 07r.P12 0 4 6 0 07r.P12 13r.P11 07r.P14 0 1 8 0 07r.P14 07r.P15 0 6 5 0 07r.P15 07v.P01 0 5 8 1 07v.P01 07v.P03 0 3 6 0 07v.P03 08r.P08 13v.P15 07v.P04 0 5 9 0 07v.P04 07v.P05 0 2 11 0 07v.P05 13v.P05 07v.P07 0 3 2 0 07v.P07 08v.P16 11r.P03 15v.P07 07v.P10 0 3 9 0 07v.P10 07v.P12 0 4 10 0 07v.P12 07v.P15 0 6 6 0 07v.P15 08r.P15 0 2 1 0 08r.P15 08v.P05,15 12r.P09 12v.P13 15v.P10 08v.P07 0 9 1 0 08v.P07 08v.P08 1 6 2 0 08v.P08 11r.P04 0 2 4 1 11r.P04 12v.P14 11r.P13 0 4 1 0 11r.P13 11r.P15 11r.P16 11r.P17 0 6 8 0 11r.P17 11v.P01 0 5 0 0 11v.P01 11v.P11 11v.P13 16r.P03 16r.P05 11v.P05 0 6 4 0 11v.P05 15r.P05 11v.P06 0 7 5 0 11v.P06 11v.P09 1 14 4 0 11v.P09 11v.P12 0 8 2 0 11v.P12 14v.P01 11v.P14 4 4 0 0 11v.P14 11v.P17 0 9 4 0 11v.P17 11v.P18 0 8 1 0 11v.P18 11v.P19 0 10 4 1 11v.P19 12r.P08 0 2 1 1 12r.P08 13r.P07 12r.P11 0 2 2 1 12r.P11 12v.P01 0 2 5 2 12v.P01 12v.P02 0 2 6 2 12v.P02 12v.P03 0 4 7 0 12v.P03 12v.P04 0 5 7 1 12v.P04 14r.P09 12v.P11 0 3 5 1 12v.P11 14v.P04 12v.P12 1 3 4 0 12v.P12 13r.P05 0 0 1 2 13r.P05 13r.P17 0 3 7 0 13r.P17 14r.P05 14v.P03 13v.P07 0 2 10 0 13v.P07 13v.P09 0 1 8 4 13v.P09 13v.P12 0 4 6 1 13v.P12 14r.P04 0 0 3 1 14r.P04 14v.P02 14r.P08 0 0 9 0 14r.P08 14r.P10 0 2 11 1 14r.P10 14r.P11 0 3 8 1 14r.P11 14r.P12 0 2 
7 1 14r.P12 14r.P13 0 4 8 0 14r.P13 14v.P12 0 2 13 1 14v.P12 15r.P08 0 1 1 1 15r.P08 15r.P09 0 0 10 1 15r.P09 15r.P10 1 2 2 0 15r.P10 15v.P05 0 7 1 0 15v.P05 15v.P06 0 5 2 0 15v.P06 15v.P08 15v.P12 16r.P01 0 6 0 1 16r.P01 16r.P04 0 7 3 0 16r.P04 08v.P06 16r.P06 1 2 1 0 16r.P06 16r.P08 0 8 0 0 16r.P08 16r.P10 1 5 4 0 16r.P10 16r.P11 3 31 8 0 16r.P11 16r.P12 3 8 0 0 16r.P12 Here it is again, piped through sort-distr -s 12 -n 4 -d -g -fs -bs -r 1 -p .stars-para-dist.ppm The justification for "-g" is that there may be confusion between -i^nd and -i^(n+1)d, so the suffixes are in a sense arranged in a line. page i0 i1 i2 i3 pages with same counts ------- -- -- -- -- ------------------------- 03r.P01 0 3 0 0 03r.P01 03v.P07 07v.P13 11v.P04 03r.P02 1 3 0 0 03r.P02 03r.P03 0 1 0 0 03r.P03 03v.P12 04r.P05 06r.P08 08r.P10 15r.P01 03r.P04 0 7 4 0 03r.P04 11v.P08 03r.P05 0 5 1 0 03r.P05 08v.P04 16r.P09 03r.P06 0 1 1 0 03r.P06,08 06r.P03 06v.P05,09 08r.P16 15r.P12 03r.P07 0 6 3 0 03r.P07 04v.P12 11v.P07 03r.P09 0 6 0 0 03r.P09 03r.P10 0 3 1 0 03r.P10 08r.P14 11r.P12 11v.P10 11v.P16 03r.P11 0 0 6 0 03r.P11 04r.P11 07r.P09 14v.P11 03r.P12 0 0 1 0 03r.P12,15,16 07v.P14 15v.P09 03r.P13 0 4 0 0 03r.P13 15r.P02 03r.P14 0 2 0 0 03r.P14 04v.P02 11r.P11 03r.P17 1 0 2 0 03r.P17 08r.P09 03r.P18 0 4 3 0 03r.P18 03r.P19 0 0 2 0 03r.P19 08r.P01,02 08v.P13 11v.P02 12v.P07,09 13r.P09 13v.P04 15v.P04 03v.P01 0 2 2 0 03v.P01,03,10 07v.P11 08r.P07 08v.P02 11r.P10 12r.P02,04,06 15v.P11 03v.P02 0 3 5 0 03v.P02 07r.P02 03v.P04 1 2 0 0 03v.P04 03v.P05 0 4 2 0 03v.P05 03v.P13 08v.P01 11r.P07 11r.P08 15r.P13 16r.P02 03v.P06 0 1 3 0 03v.P06 04v.P11 06v.P07 08v.P14 12r.P03 14v.P06 15v.P13 03v.P08 3 7 3 1 03v.P08 03v.P09 0 0 4 0 03v.P09 04v.P08 05r.P04 06r.P02 13r.P08,13 13v.P03,08,11 03v.P11 0 2 4 0 03v.P11 08r.P06 08v.P09 11r.P09 13r.P12,14 14r.P07 14v.P07 03v.P14 0 7 2 0 03v.P14 11v.P15 16r.P07 04r.P01 0 1 5 2 04r.P01 04r.P02 0 5 4 0 04r.P02 04v.P10 11r.P14 04r.P03 0 0 0 0 04r.P03 05r.P07 06r.P01 
08r.P11,12 04r.P04 0 2 6 0 04r.P04 07r.P13 13r.P10 13v.P10 04r.P06 1 1 4 0 04r.P06 04r.P07 0 2 5 0 04r.P07 06v.P08 12r.P01 04r.P08 0 0 7 0 04r.P08 14r.P03 04r.P09 0 2 7 0 04r.P09 07v.P02 14v.P10 04r.P10 0 0 5 0 04r.P10 04v.P09 05v.P02,05 06r.P05 06v.P12 07r.P06 13r.P03 13v.P06 04r.P12 0 1 6 1 04r.P12 13r.P06 04r.P13 0 3 8 0 04r.P13 08r.P04 04v.P01 0 2 9 0 04v.P01 04v.P03 0 1 6 0 04v.P03 05v.P08 12v.P10 13r.P04,16 14v.P08 04v.P04 0 1 5 0 04v.P04 07v.P06 07v.P09 08v.P11 12r.P05 14v.P05 04v.P05 0 3 3 1 04v.P05 04v.P06 0 1 2 0 04v.P06 06r.P11,12 08r.P13 08v.P10 11r.P02 12r.P10 13r.P01 13v.P14 15r.P03,06,07 04v.P07 0 1 11 1 04v.P07 14r.P02 04v.P13 0 3 4 0 04v.P13 07r.P01 07v.P08 08r.P05 11r.P05 11v.P03 15r.P04 05r.P01 1 0 8 0 05r.P01 05r.P02 0 0 3 2 05r.P02 05r.P03 0 0 4 1 05r.P03 05r.P08 05r.P09 05r.P05 0 0 15 0 05r.P05 05r.P06 0 1 2 1 05r.P06 12r.P07 05r.P10 0 0 3 0 05r.P10 06r.P13 06v.P11,13,14 07r.P05 13r.P15 14r.P01 05v.P01 0 0 8 0 05v.P01 05v.P09 05v.P03 0 0 9 2 05v.P03 05v.P04 0 0 12 2 05v.P04 05v.P06 0 1 7 0 05v.P06 06v.P10 05v.P07 0 0 5 1 05v.P07 14r.P06 05v.P10 0 2 18 0 05v.P10 06r.P04 0 1 4 0 06r.P04 08r.P03 11r.P01 12r.P12 13v.P02 14v.P09 15r.P11 06r.P06 0 4 4 0 06r.P06 08v.P03,12 12v.P05 15v.P01 06r.P07 0 2 8 0 06r.P07 07r.P04,11 13r.P02 13v.P13 06r.P09 0 3 3 0 06r.P09 12v.P06 12v.P08 06r.P10 0 3 12 0 06r.P10 06r.P14 0 0 8 1 06r.P14 06r.P15 0 7 9 0 06r.P15 06v.P01 0 5 5 0 06v.P01 06v.P02 0 4 9 0 06v.P02 06v.P15 06v.P03 0 2 3 0 06v.P03,06 13v.P01 15v.P02,03 06v.P04 0 1 5 1 06v.P04 07r.P03 1 3 9 0 07r.P03 07r.P07 0 1 8 2 07r.P07 07r.P08 0 2 3 1 07r.P08 07r.P10 0 2 9 1 07r.P10 07r.P12 0 4 6 0 07r.P12 13r.P11 07r.P14 0 1 8 0 07r.P14 07r.P15 0 6 5 0 07r.P15 07v.P01 0 5 8 1 07v.P01 07v.P03 0 3 6 0 07v.P03 08r.P08 13v.P15 07v.P04 0 5 9 0 07v.P04 07v.P05 0 2 11 0 07v.P05 13v.P05 07v.P07 0 3 2 0 07v.P07 08v.P16 11r.P03 15v.P07 07v.P10 0 3 9 0 07v.P10 07v.P12 0 4 10 0 07v.P12 07v.P15 0 6 6 0 07v.P15 08r.P15 0 2 1 0 08r.P15 08v.P05,15 12r.P09 12v.P13 15v.P10 08v.P07 0 
9 1 0 08v.P07 08v.P08 1 6 2 0 08v.P08 11r.P04 0 2 4 1 11r.P04 12v.P14 11r.P13 0 4 1 0 11r.P13 11r.P15 11r.P16 11r.P17 0 6 8 0 11r.P17 11v.P01 0 5 0 0 11v.P01 11v.P11 11v.P13 16r.P03 16r.P05 11v.P05 0 6 4 0 11v.P05 15r.P05 11v.P06 0 7 5 0 11v.P06 11v.P09 1 14 4 0 11v.P09 11v.P12 0 8 2 0 11v.P12 14v.P01 11v.P14 4 4 0 0 11v.P14 11v.P17 0 9 4 0 11v.P17 11v.P18 0 8 1 0 11v.P18 11v.P19 0 10 4 1 11v.P19 12r.P08 0 2 1 1 12r.P08 13r.P07 12r.P11 0 2 2 1 12r.P11 12v.P01 0 2 5 2 12v.P01 12v.P02 0 2 6 2 12v.P02 12v.P03 0 4 7 0 12v.P03 12v.P04 0 5 7 1 12v.P04 14r.P09 12v.P11 0 3 5 1 12v.P11 14v.P04 12v.P12 1 3 4 0 12v.P12 13r.P05 0 0 1 2 13r.P05 13r.P17 0 3 7 0 13r.P17 14r.P05 14v.P03 13v.P07 0 2 10 0 13v.P07 13v.P09 0 1 8 4 13v.P09 13v.P12 0 4 6 1 13v.P12 14r.P04 0 0 3 1 14r.P04 14v.P02 14r.P08 0 0 9 0 14r.P08 14r.P10 0 2 11 1 14r.P10 14r.P11 0 3 8 1 14r.P11 14r.P12 0 2 7 1 14r.P12 14r.P13 0 4 8 0 14r.P13 14v.P12 0 2 13 1 14v.P12 15r.P08 0 1 1 1 15r.P08 15r.P09 0 0 10 1 15r.P09 15r.P10 1 2 2 0 15r.P10 15v.P05 0 7 1 0 15v.P05 15v.P06 0 5 2 0 15v.P06 15v.P08 15v.P12 16r.P01 0 6 0 1 16r.P01 16r.P04 0 7 3 0 16r.P04 08v.P06 16r.P06 1 2 1 0 16r.P06 16r.P08 0 8 0 0 16r.P08 16r.P10 1 5 4 0 16r.P10 16r.P11 3 31 8 0 16r.P11 16r.P12 3 8 0 0 16r.P12 totals 54 2 46 27 0 46 4 40 31 1 45 1 18 63 3 44 0 26 59 2 37 1 1 47 6 38 0 4 83 5 47 0 24 65 1 47 0 24 65 1 51 1 31 93 4 49 0 43 85 1 50 1 22 39 0 53 1 ? 39 0 54 0 50 45 1 51 5 b3 41 1 45 0 19 31 3 47 1 34 59 7 51 0 21 75 4 49 0 20 83 5 45 0 22 90 7 41 0 24 65 3 45 1 26 34 2 45 0 38 28 2 30 2 54 13 1 20 6 39 8 0 97-11-29 stolfi =============== Created a sed script "fnum-to-pnum" that maps "f" page numbers (like f66r2) to sequential numbers 001-266. Note that missing pages are included too. 
    gawk '/@/{n++; printf "%s p%03d\n", $1, n; next} /-/{print; next}'

97-11-30 stolfi
===============

  Discovered that the smooth gradient in Denis's page counts is not
  surprising: since two of the counts dominate, and my routine
  normalizes them to unit sum, the data is essentially one-dimensional.

  Here is an attempt to reorder the stars pages by hand so as to make
  the balance between the -iin and -in counts more uniform. The "ratio"
  column is count(-iin)/(count(-iin) + count(-in)):

  page     words -iiin  -iin   -in    -n ratio
  -------- ----- ----- ----- ----- ----- -----
  f103r TA   526     0    33    41     2 0.446
  f103v TB   454     1    34    37     4 0.479
  f108r TK   494     0    39    22     1 0.639
  f108v TL   581     0    52    39     1 0.571
  f104r TC   448     1    66    17     1 0.795
  f104v TD   477     3    59    24     0 0.711
  f107r TI   487     4    93    30     1 0.756
  f107v TJ   462     1    84    43     0 0.661
  f114r TS   460     5    91    23     0 0.798
  f114v TT   376     2    68    23     0 0.747
  f106r TG   432     1    65    24     0 0.730
  f106v TH   444     1    67    23     0 0.744
  f113r TQ   528     4    79    21     0 0.790
  f113v TR   502     5    84    20     0 0.808
  f105r TE   379     6    48     1     1 0.980
  f105v TF   399     5    85     4     0 0.955
  f112r TO   401     3    32    21     0 0.604
  f112v TP   420     7    60    33     1 0.645
  f115r TU   461     1    40    21     2 0.656
  f115v TV   410     2    32    33     0 0.492
  f111r TM   623     1    44    51     0 0.463
  f111v TN   568     1    41   113     6 0.266
  f116r TW   554     1    25    90     8 0.217
  f116v TW     0     0     0     0     0 0.000

  Creating a picture of this sorted data:

    sort-distr -s 18 -n 4 -d -p - -r 0 | pnmscale 8 | ppmtogif > .stars-bh-dist.gif
    xv .stars-bh-dist.gif

  Another attempt:

  page     words -iiin  -iin   -in    -n ratio
  -------- ----- ----- ----- ----- ----- -----
  f103r TA   526     0    33    41     2 0.446
  f103v TB   454     1    34    37     4 0.479
  f108r TK   494     0    39    22     1 0.639
  f108v TL   581     0    52    39     1 0.571
  f104r TC   448     1    66    17     1 0.795
  f104v TD   477     3    59    24     0 0.711
  f107r TI   487     4    93    30     1 0.756
  f107v TJ   462     1    84    43     0 0.661
  f113r TQ   528     4    79    21     0 0.790
  f113v TR   502     5    84    20     0 0.808
  f105r TE   379     6    48     1     1 0.980
  f105v TF   399     5    85     4     0 0.955
  f114r TS   460     5    91    23     0 0.798
  f114v TT   376     2    68    23     0 0.747
  f106r TG   432     1    65    24     0 0.730
  f106v TH   444     1    67    23     0 0.744
  f112r TO   401     3    32    21     0 0.604
  f112v TP   420     7    60    33     1 0.645
  f115r TU   461     1    40    21     2 0.656
  f115v TV   410     2    32    33     0 0.492
  f111r TM   623     1    44    51     0 0.463
  f111v TN   568     1    41   113     6 0.266
  f116r TW   554     1    25    90     8 0.217
  f116v TW     0     0     0     0     0 0.000

    sort-distr -s 18 -n 4 -d -p - -r 0 | pnmscale 8 | ppmtogif > .stars-h2-dist.gif
    xv .stars-h2-dist.gif

  Yet another attempt:

  page     words -iiin  -iin   -in    -n ratio
  -------- ----- ----- ----- ----- ----- -----
  f103r TA   526     0    33    41     2 0.446
  f103v TB   454     1    34    37     4 0.479
  f108r TK   494     0    39    22     1 0.639
  f108v TL   581     0    52    39     1 0.571
  f104r TC   448     1    66    17     1 0.795
  f104v TD   477     3    59    24     0 0.711
  f112r TO   401     3    32    21     0 0.604
  f112v TP   420     7    60    33     1 0.645
  f113r TQ   528     4    79    21     0 0.790
  f113v TR   502     5    84    20     0 0.808
  f105r TE   379     6    48     1     1 0.980
  f105v TF   399     5    85     4     0 0.955
  f114r TS   460     5    91    23     0 0.798
  f114v TT   376     2    68    23     0 0.747
  f106r TG   432     1    65    24     0 0.730
  f106v TH   444     1    67    23     0 0.744
  f107r TI   487     4    93    30     1 0.756
  f107v TJ   462     1    84    43     0 0.661
  f115r TU   461     1    40    21     2 0.656
  f115v TV   410     2    32    33     0 0.492
  f111r TM   623     1    44    51     0 0.463
  f111v TN   568     1    41   113     6 0.266
  f116r TW   554     1    25    90     8 0.217
  f116v TW     0     0     0     0     0 0.000

    sort-distr -s 18 -n 4 -d -p - -r 0 | pnmscale 8 | ppmtogif > .stars-h3-dist.gif
    xv .stars-h3-dist.gif

  Let's look at f58r/f58v too:

    foreach s ( n in iin iiin )
      cat L16-eva/f58r.P | egrep '[^i]'"$s"'[-,. =]' > .f58r-$s.evt
    end

  page     words -iiin  -iin   -in    -n ratio
  -------- ----- ----- ----- ----- ----- -----
  f058r HB   362     0    29     1     0 0.967

  Let's have a closer look at the occurrences of "daiin" in the stars
  section:

    rm -f .daiin-stars.occs
    foreach f ( L16-eva/f{103,104,105,106,107,108,111,112,113,114,115,116}{r,v}.P* )
      echo $f
      echo '# '$f >> .daiin-stars.occs
      cat $f | egrep '[-= ,.]daiin|^#' >> .daiin-stars.occs
    end

  Edited .daiin-stars.occs by hand, removing/adding adjacent words
  until each occurrence of "daiin" is on a separate line with 2 words
  on either side.

  Result: 208 occurrences of "daiin" in the stars section.
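
  The hand edit above could probably be approximated mechanically.
  Here is a minimal sketch in Python, assuming the text has already
  been split into a flat list of words. The function name and the
  sample word sequence are made up for illustration; this is not one
  of the scripts actually used for these notes.

```python
def occurrences_with_context(words, target="daiin", width=2):
    """Return one (left, word, right) triple per occurrence of `target`.

    `left` and `right` hold up to `width` neighbouring words; they are
    shorter near the start or end of the word list.
    """
    rows = []
    for i, word in enumerate(words):
        if word == target:
            left = words[max(0, i - width):i]
            right = words[i + 1:i + 1 + width]
            rows.append((left, word, right))
    return rows


if __name__ == "__main__":
    # Made-up word sequence, only to show the output layout.
    sample = "okeey chedy daiin chey qokeey daiin ol".split()
    for left, word, right in occurrences_with_context(sample):
        print(" ".join(left), word, " ".join(right))
```

  Each printed line then has the target word with its neighbours on
  either side, which is the layout the hand-edited file was given.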
Let's look also at "saiin": rm -f .saiin-stars.occs foreach f ( L16-eva/f{103,104,105,106,107,108,111,112,113,114,115,116}{r,v}.P* ) echo $f echo '# '$f >> .saiin-stars.occs cat $f | egrep '[-= ,.]saiin|^#' >> .saiin-stars.occs end Many of the "daiin" and "saiin" are at the beginning of a line (but not the first of the paragraph). Some of them are at the end of paragraph. These are the words that occur near "daiin": ct rfreq cfreq word ct rfreq cfreq word ct rfreq cfreq word -- ----- ----- ----------- -- ----- ----- ----------- -- ----- ----- ----------- 7 0.034 0.034 cheo 8 0.040 0.040 chedy 6 0.029 0.029 chedy 5 0.024 0.058 chedy 7 0.035 0.075 chey 5 0.024 0.053 okeey 5 0.024 0.082 qokeey 5 0.025 0.100 cheey 5 0.024 0.077 qokeey 3 0.014 0.096 oteo 5 0.025 0.125 shey 4 0.019 0.096 daiin 3 0.014 0.111 sheeo 3 0.015 0.140 al 3 0.014 0.111 ar 2 0.010 0.120 chckhy 3 0.015 0.155 ar 3 0.014 0.125 chol 2 0.010 0.130 chdy 3 0.015 0.170 daiin 3 0.014 0.139 lchedy 2 0.010 0.139 cheeo 3 0.015 0.185 sheol 3 0.014 0.154 lshey 2 0.010 0.149 chey 2 0.010 0.195 char 3 0.014 0.168 okaiin 2 0.010 0.159 daiin 2 0.010 0.205 chedar 3 0.014 0.183 qokaiin 2 0.010 0.168 dal 2 0.010 0.215 cheol 3 0.014 0.197 qokal 2 0.010 0.178 dalam 2 0.010 0.225 chl 3 0.014 0.212 qokeedy 2 0.010 0.188 keeo 2 0.010 0.235 okar 3 0.014 0.226 qotchedy 2 0.010 0.197 llchey 2 0.010 0.245 okeey 2 0.010 0.236 aiin 2 0.010 0.207 okal 2 0.010 0.255 ol 2 0.010 0.245 chedal 2 0.010 0.216 ol 2 0.010 0.265 otal 2 0.010 0.255 chodaiin 2 0.010 0.226 otal 2 0.010 0.275 otaral 2 0.010 0.264 dal 2 0.010 0.236 qokchedy 2 0.010 0.285 otedy 2 0.010 0.274 lkeey 2 0.010 0.245 qokeeal 2 0.010 0.295 oteey 2 0.010 0.284 lshedy 2 0.010 0.255 qokeeo 2 0.010 0.305 qokchdy 2 0.010 0.293 oky 2 0.010 0.264 qopchedy 2 0.010 0.315 qotalal 2 0.010 0.303 otar 2 0.010 0.274 sheey 2 0.010 0.325 shaiin 2 0.010 0.312 otedy 2 0.010 0.284 shockhy 2 0.010 0.335 shedy 2 0.010 0.322 oteey 2 0.010 0.293 ycheo 2 0.010 0.345 sheed 2 0.010 0.332 oteody 
1 0.005 0.298 acthy 2 0.010 0.355 sheey 2 0.010 0.341 qodaiin 1 0.005 0.303 aiin 2 0.010 0.365 shek 2 0.010 0.351 qokar 1 0.005 0.308 ainkam 2 0.010 0.375 sheody 2 0.010 0.361 qokedy 1 0.005 0.312 al 2 0.010 0.385 shody 2 0.010 0.370 qokeol 1 0.005 0.317 alky 2 0.010 0.395 shol 2 0.010 0.380 qoky 1 0.005 0.322 alol 1 0.005 0.400 aiin 2 0.010 0.389 qoty 1 0.005 0.327 am 1 0.005 0.405 airols 2 0.010 0.399 saiin 1 0.005 0.332 ar 1 0.005 0.410 aky 2 0.010 0.409 sheey 1 0.005 0.337 aralary 1 0.005 0.415 alaiin 2 0.010 0.418 tchedy 1 0.005 0.341 archcthy 1 0.005 0.420 alal 2 0.010 0.428 teeedy 1 0.005 0.346 chcphydy 1 0.005 0.425 aldair 1 0.005 0.433 *asor 1 0.005 0.351 chdaly 1 0.005 0.430 alsar 1 0.005 0.438 akaiin 1 0.005 0.356 chea 1 0.005 0.435 aral 1 0.005 0.442 chckhaiin 1 0.005 0.361 chedaiin 1 0.005 0.440 aroteey 1 0.005 0.447 chcphedy 1 0.005 0.365 chedal 1 0.005 0.445 chckhy 1 0.005 0.452 chdar 1 0.005 0.370 chedyrl 1 0.005 0.450 chcthar 1 0.005 0.457 chdor 1 0.005 0.375 cheeey 1 0.005 0.455 chcthdy 1 0.005 0.462 chdy 1 0.005 0.380 cheey 1 0.005 0.460 chcthed 1 0.005 0.466 cheal 1 0.005 0.385 cheky 1 0.005 0.465 chcthy 1 0.005 0.471 chear 1 0.005 0.389 cheoda* 1 0.005 0.470 cheaiin 1 0.005 0.476 checkhey 1 0.005 0.394 cheody 1 0.005 0.475 cheal 1 0.005 0.481 checkhy 1 0.005 0.399 cheol 1 0.005 0.480 checkhy 1 0.005 0.486 cheeal 1 0.005 0.404 cheot 1 0.005 0.485 checthal 1 0.005 0.490 cheedy 1 0.005 0.409 chllkeey 1 0.005 0.490 ched 1 0.005 0.495 cheeky 1 0.005 0.413 cho 1 0.005 0.495 chedaiin 1 0.005 0.500 cheocthy 1 0.005 0.418 chockhey 1 0.005 0.500 chedal 1 0.005 0.505 cheodaiin 1 0.005 0.423 chodeeal 1 0.005 0.505 cheedy 1 0.005 0.510 chey 1 0.005 0.428 chody 1 0.005 0.510 cheeeo 1 0.005 0.514 chocthy 1 0.005 0.433 chotam 1 0.005 0.515 cheeir 1 0.005 0.519 chody 1 0.005 0.438 chotchedy 1 0.005 0.520 cheeteey 1 0.005 0.524 chokedair 1 0.005 0.442 chy 1 0.005 0.525 chekeek 1 0.005 0.529 choty 1 0.005 0.447 cphaiin 1 0.005 0.530 cheo 1 0.005 0.534 dchedy 1 
0.005 0.452 dail 1 0.005 0.535 cheocthy 1 0.005 0.538 deeedy 1 0.005 0.457 dala 1 0.005 0.540 cheodaiin 1 0.005 0.543 dol 1 0.005 0.462 dched 1 0.005 0.545 cheodar 1 0.005 0.548 dsheeo 1 0.005 0.466 dcheo 1 0.005 0.550 cheolor 1 0.005 0.553 eedol 1 0.005 0.471 dchol 1 0.005 0.555 chkaiin 1 0.005 0.558 eeykeody 1 0.005 0.476 decthdy 1 0.005 0.560 choaiin 1 0.005 0.562 kair 1 0.005 0.481 eedy 1 0.005 0.565 chocfhdy 1 0.005 0.567 kal 1 0.005 0.486 kar 1 0.005 0.570 chody 1 0.005 0.572 kchdy 1 0.005 0.490 kchedy 1 0.005 0.575 chol 1 0.005 0.577 kchedy 1 0.005 0.495 kcheo 1 0.005 0.580 cholchey 1 0.005 0.582 keedal 1 0.005 0.500 keesho 1 0.005 0.585 chopchy 1 0.005 0.587 keeo 1 0.005 0.505 keol 1 0.005 0.590 chotaiin 1 0.005 0.591 kolkair 1 0.005 0.510 ky 1 0.005 0.595 chsd 1 0.005 0.596 lcheeol 1 0.005 0.514 l 1 0.005 0.600 ckheol 1 0.005 0.601 lechody 1 0.005 0.519 larorol 1 0.005 0.605 dal 1 0.005 0.606 lkar 1 0.005 0.524 lchedam 1 0.005 0.610 dam 1 0.005 0.611 lkeeol 1 0.005 0.529 lkaiiir 1 0.005 0.615 dar 1 0.005 0.615 lkol 1 0.005 0.534 lkal 1 0.005 0.620 daram 1 0.005 0.620 lky 1 0.005 0.538 lkam 1 0.005 0.625 daryom 1 0.005 0.625 oain 1 0.005 0.543 lkeeeady 1 0.005 0.630 dchdos 1 0.005 0.630 oar 1 0.005 0.548 lkeo 1 0.005 0.635 dchedar 1 0.005 0.635 ocheey 1 0.005 0.553 lkeol 1 0.005 0.640 dckhy 1 0.005 0.639 octhd 1 0.005 0.558 lklor 1 0.005 0.645 dshedal 1 0.005 0.644 odair 1 0.005 0.562 llod 1 0.005 0.650 lkchedy 1 0.005 0.649 okchey 1 0.005 0.567 lm 1 0.005 0.655 lor 1 0.005 0.654 okechey 1 0.005 0.572 lteedy 1 0.005 0.660 ochedaiin 1 0.005 0.659 okedy 1 0.005 0.577 ochedaiin 1 0.005 0.665 ockhedy 1 0.005 0.663 okeedaiin 1 0.005 0.582 ochedal 1 0.005 0.670 octhd 1 0.005 0.668 okeedy 1 0.005 0.587 ocheey 1 0.005 0.675 octhdy 1 0.005 0.673 okeeedy 1 0.005 0.591 ofam 1 0.005 0.680 octhy 1 0.005 0.678 okeeshy 1 0.005 0.596 ofar 1 0.005 0.685 ofchedaiin 1 0.005 0.683 okol 1 0.005 0.601 okaiin 1 0.005 0.690 okaiin 1 0.005 0.688 ol 1 0.005 0.606 okchedy 1 0.005 
0.695 okairdy 1 0.005 0.692 oldaiin 1 0.005 0.611 okchey 1 0.005 0.700 okal 1 0.005 0.697 olkchey 1 0.005 0.615 okchy 1 0.005 0.705 okchey 1 0.005 0.702 olkeedaiin 1 0.005 0.620 okeedy 1 0.005 0.710 okedal 1 0.005 0.707 olkeeey 1 0.005 0.625 okey 1 0.005 0.715 okedy 1 0.005 0.712 olshy 1 0.005 0.630 oleedy 1 0.005 0.720 okeedaky 1 0.005 0.716 opailo 1 0.005 0.635 olkaey 1 0.005 0.725 okeedy 1 0.005 0.721 opchedaiin 1 0.005 0.639 olky 1 0.005 0.730 okey 1 0.005 0.726 opcheed 1 0.005 0.644 olr 1 0.005 0.735 olaiin 1 0.005 0.731 oraiin 1 0.005 0.649 oly 1 0.005 0.740 olam 1 0.005 0.736 otaiin 1 0.005 0.654 om 1 0.005 0.745 olkaiin 1 0.005 0.740 otair 1 0.005 0.659 opaiin 1 0.005 0.750 olkaiir 1 0.005 0.745 otarar 1 0.005 0.663 opaik 1 0.005 0.755 oly 1 0.005 0.750 otchedy 1 0.005 0.668 opalam 1 0.005 0.760 opairam 1 0.005 0.755 otchod 1 0.005 0.673 opam 1 0.005 0.765 opal 1 0.005 0.760 otechdy 1 0.005 0.678 opchdy 1 0.005 0.770 or 1 0.005 0.764 otedal 1 0.005 0.683 opchy 1 0.005 0.775 oraiin 1 0.005 0.769 oteor 1 0.005 0.688 or 1 0.005 0.780 otar 1 0.005 0.774 oteoy 1 0.005 0.692 oram 1 0.005 0.785 oteedaiin 1 0.005 0.779 pcheol 1 0.005 0.697 ore 1 0.005 0.790 oteedo 1 0.005 0.784 pchor 1 0.005 0.702 orkchdy 1 0.005 0.795 oteol 1 0.005 0.788 pdal 1 0.005 0.707 os 1 0.005 0.800 por 1 0.005 0.793 pdaro 1 0.005 0.712 osh*o 1 0.005 0.805 qkair 1 0.005 0.798 qckheey 1 0.005 0.716 oshey 1 0.005 0.810 qkeodaiin 1 0.005 0.803 qlky 1 0.005 0.721 otaiin 1 0.005 0.815 qoair 1 0.005 0.808 qoeedaiin 1 0.005 0.726 otaiinodaly 1 0.005 0.820 qoeedaiin 1 0.005 0.812 qoek 1 0.005 0.731 otaik 1 0.005 0.825 qoek 1 0.005 0.817 qokairar 1 0.005 0.736 otam 1 0.005 0.830 qofchdar 1 0.005 0.822 qokchdy 1 0.005 0.740 otar 1 0.005 0.835 qokaiin 1 0.005 0.827 qokchey 1 0.005 0.745 otary 1 0.005 0.840 qokchedy 1 0.005 0.832 qokechy 1 0.005 0.750 otaryly 1 0.005 0.845 qokchey 1 0.005 0.837 qokedar 1 0.005 0.755 otcham 1 0.005 0.850 qokeeo 1 0.005 0.841 qokeeey 1 0.005 0.760 otcheo 1 0.005 0.855 
qokeeody 1 0.005 0.846 qokeeo 1 0.005 0.764 otchey 1 0.005 0.860 qokeey 1 0.005 0.851 qotaiin 1 0.005 0.769 oteeey 1 0.005 0.865 qopol 1 0.005 0.856 qotal 1 0.005 0.774 oteey 1 0.005 0.870 saiin 1 0.005 0.861 qotar 1 0.005 0.779 oteol 1 0.005 0.875 shal 1 0.005 0.865 qotchy 1 0.005 0.784 otey 1 0.005 0.880 shechy 1 0.005 0.870 qotear 1 0.005 0.788 oto 1 0.005 0.885 sheckhy 1 0.005 0.875 qoteey 1 0.005 0.793 pcha 1 0.005 0.890 shecthey 1 0.005 0.880 qoteody 1 0.005 0.798 pchal 1 0.005 0.895 shecthy 1 0.005 0.885 qoteol 1 0.005 0.803 pcheo 1 0.005 0.900 shedaiin 1 0.005 0.889 r 1 0.005 0.808 qckhey 1 0.005 0.905 sheeal 1 0.005 0.894 raiin 1 0.005 0.812 qekor 1 0.005 0.910 sheedy 1 0.005 0.899 rain 1 0.005 0.817 qodaiin 1 0.005 0.915 sheekchy 1 0.005 0.904 ralom 1 0.005 0.822 qokairy 1 0.005 0.920 sheeky 1 0.005 0.909 sair 1 0.005 0.827 qokam 1 0.005 0.925 sheet 1 0.005 0.913 sar 1 0.005 0.832 qokaram 1 0.005 0.930 sheor 1 0.005 0.918 saraiin 1 0.005 0.837 qokchy 1 0.005 0.935 shl 1 0.005 0.923 sheckhy 1 0.005 0.841 qokeedaram 1 0.005 0.940 tair 1 0.005 0.928 shedar 1 0.005 0.846 qokol 1 0.005 0.945 tchar 1 0.005 0.933 sheed 1 0.005 0.851 qopchdy 1 0.005 0.950 teodaiin 1 0.005 0.938 sheedy 1 0.005 0.856 qotaiin 1 0.005 0.955 ychedal 1 0.005 0.942 sheeky 1 0.005 0.861 qotam 1 0.005 0.960 ycheeo 1 0.005 0.947 sheeodar 1 0.005 0.865 qotar 1 0.005 0.965 ydaiin 1 0.005 0.952 sheeol 1 0.005 0.870 qotchy 1 0.005 0.970 ykchedy 1 0.005 0.957 solpchd 1 0.005 0.875 qotedar 1 0.005 0.975 ykeedan 1 0.005 0.962 tchar 1 0.005 0.880 qoteody 1 0.005 0.980 ykeedy 1 0.005 0.966 teeoar 1 0.005 0.885 qotey 1 0.005 0.985 yokoey 1 0.005 0.971 teody 1 0.005 0.889 qoty 1 0.005 0.990 ytam 1 0.005 0.976 ty 1 0.005 0.894 r 1 0.005 0.995 ytar 1 0.005 0.981 ykchedy 1 0.005 0.899 raiin 1 0.005 1.000 yteedy 1 0.005 0.986 ykeey 1 0.005 0.904 rodam 1 0.005 0.990 yteedy 1 0.005 0.909 rol 1 0.005 0.995 yteeody 1 0.005 0.913 ry 1 0.005 1.000 yteody 1 0.005 0.918 sham 1 0.005 0.923 shchy 1 0.005 0.928 
sheal 1 0.005 0.933 sheoked 1 0.005 0.938 shey 1 0.005 0.942 shod 1 0.005 0.947 ssheo 1 0.005 0.952 tchedaiin 1 0.005 0.957 tedam 1 0.005 0.962 teeo 1 0.005 0.966 tolpchy 1 0.005 0.971 tsho 1 0.005 0.976 ycheeo 1 0.005 0.981 yka*om 1 0.005 0.986 ykcheo 1 0.005 0.990 ykeeo 1 0.005 0.995 ykeo 1 0.005 1.000 ysheo

  Second word before "daiin", sorted by "shape":

    -- ----------------------------------------------------------------
    7 aiin 6 okaiin 6 qokaiin 2 lkaiin 6 chedy 5 lchedy 3 cheey 2 lfchedy 4 otar 3 qotar 4 oteedy 3 qokey 2 okeey 3 daiin 2 dair 3 dar 3 otchedy

    first 10 words account for 23% of all "daiin"s
    first 20 words account for 34% of all "daiin"s

  First word before "daiin", sorted by "shape":

    -- ----------------------------------------------------------------
    23 cheo sheeo cheeo chedy chdy chey llchey
     9 qokeey qokeeo keeo
     3 oteo
     2 daiin
     2 qokeeal
     2 okal
     2 otal
     2 qokchedy

    first 10 words account for 15% of all "daiin"s
    first 20 words account for 25% of all "daiin"s

  First word after "daiin", sorted by "shape":

    -- ----------------------------------------------------------------
    40 chedy chey cheey shey shedy sheed sheey sheody shody sheol cheol
     8 al ar ol
     3 daiin

    first 10 words account for 20% of all "daiin"s
    first 20 words account for 30% of all "daiin"s

  Second word after "daiin", sorted by "shape":

    -- ----------------------------------------------------------------
    30 okeey qokeey qokeedy qotchedy qokedy qokeol lkeey otedy oteey oteody teeedy
    16 chedy chedal lchedy lshey lshedy
     7 qokal qokar otar
     6 daiin qodaiin
     6 okaiin qokaiin
     6 oky qoky qoty
     3 ar
     3 chol
     2 saiin

    first 10 words account for 18% of all "daiin"s
    first 20 words account for 30% of all "daiin"s

  These words are tentative members of the "daiin" constellation:

    chedy chey shey cheo cheey okeey oteey qokeey qokeedy otedy
    chedar sheol chol okaiin qokaiin

  And these may be associates:

    al ar ol

97-12-01 stolfi
===============

  While trying to redo the label occurrence maps, I noticed this strong correlation between
"shedy" and "okal/qokal": shedy B p150 427 . . . 4 7 6 10 4 2 15 44 46 25 32 44 59 5 1 3 1 31 7 4 14 12 17 9 3 9 13 f77v.L.4;U [q]okal Z p138 314 2 2 4 1 5 5 4 7 9 15 21 22 11 27 28 21 9 4 15 7 13 7 4 17 20 7 10 8 6 3 f72r2.S.5;K Now "shedy" is part of a label on f77v (the bottom left tube), and "okal" is Rene/Robert's conjecture for the name of the Sun Here are some families of labels with clearly similar patterns of reference: 1 otaiin Z p137 346 7 4 8 5 3 8 1 7 1 1 4 10 19 23 10 7 18 4 6 5 11 22 15 32 10 11 28 27 28 11 f72r1.S.13;K 1 aiin A p127 378 1 3 1 3 11 13 5 4 12 6 4 7 12 6 5 5 37 16 6 9 7 16 32 25 20 22 22 23 19 26 f68v2.R.9;C 1 oteey A p119 133 1 2 3 . 1 2 2 1 2 2 3 2 3 4 5 3 1 . 6 6 13 6 3 9 11 11 11 7 5 8 f67r1.S.6;C 3 okaiin P p181 758 4 2 9 7 12 24 5 6 12 9 28 33 41 69 42 31 24 6 17 4 28 34 35 44 50 23 54 48 21 36 f89r1.m.3;K 3 otar Z p136 184 . 2 . 1 4 3 3 6 5 4 8 5 7 7 5 8 19 2 9 . 6 11 10 9 4 3 5 15 13 10 f71v.S2.4;K 3 otal Z p138 198 1 2 . 3 2 4 3 6 4 9 3 9 9 12 14 6 19 2 5 2 6 9 5 15 15 1 9 7 7 9 f72r2.S.1;K 9 tar Z p134 47 1 . . . 1 2 . 1 3 2 . . 3 1 4 1 7 1 5 . 1 2 . 2 2 3 3 1 1 . f70v1.S.6;K 9 okar T p114 255 . 1 1 1 9 16 6 9 11 9 19 1 10 12 9 6 29 5 17 2 2 11 3 7 14 2 11 13 7 12 f58v.T.1;U 9 saiin T p042 238 4 3 5 4 7 8 5 3 5 6 17 15 12 2 11 10 5 19 9 7 8 6 6 4 5 15 20 4 7 6 f22v.T.16;F 9 orar P p206 13 . . . . . . . . 1 . 1 . . 2 . 1 . . . 3 . 1 . 1 . . 1 . . 2 f101v1.R1.2;C 9 otchdy A p127 47 . . . 1 1 . 2 5 1 2 2 . . . 1 6 3 . 3 . . 6 3 2 . 1 . 2 4 2 f68v2.R.8;C 2 cheody P p182 76 2 . 2 2 2 . 3 1 3 4 1 . . . . . 2 11 4 2 4 7 4 2 6 1 . 4 6 3 f89r2.m2.4;L 2 arody T p078 9 . . . . . . . . . . . . . . . . 1 . . 1 . . 3 . 1 . 1 1 1 . f40v.T.19;F 2 okair Z p138 36 . . . . . 3 . 1 2 1 . . . 2 2 . . 1 2 . . 3 5 4 4 1 . 1 3 1 f72r2.S.18;K 4 okam Z p138 48 1 . 1 2 3 5 2 3 2 1 1 1 . . . . 1 1 2 1 2 2 3 . 4 1 1 1 1 6 f72r2.S.3;K 4 oaiin Z p137 65 3 2 1 2 2 3 3 4 4 2 1 . 1 . . . 1 5 . 1 4 2 1 1 2 3 6 4 5 2 f72r1.S.8;K 6 otarar P p206 6 . . . . . . . 
. . . . . . . . . 1 . . . . 1 . 1 1 . . 1 . 1 f101v1.R1.3;C 6 oram P p179 9 . . . . . . . . . 1 . . 1 . . 1 . . . . . 1 1 . . 1 2 1 . . f88r.m.2;L 6 otam Z p138 53 1 . 1 2 3 . 1 4 2 2 2 . . 1 . 1 6 1 2 . 2 3 2 3 2 2 3 5 1 1 f72r2.S.7;K 6 opchey Z p134 35 . . 3 1 . . 1 2 1 1 2 . . 1 . 2 4 . . 1 1 2 3 . 3 1 2 3 1 . f70v1.S.7;K 7 otody Z p133 19 . . 1 . 2 . 1 1 . . . . . . . . 1 2 3 1 . 1 . . . . 1 2 2 1 f70v2.S1.4;K 7 chos P p202 27 . . 1 2 2 . 1 . 1 2 . . . . . . . 3 2 3 2 . . 1 . . . 2 4 1 f100v.B.13;K 7 oteol T p075 42 3 . 1 . 2 1 5 2 . . . 1 . 1 . 1 . 2 2 6 3 1 1 1 1 2 . 5 . 1 f39r.T.16;F 7 chodar S p124 11 1 2 . . . . . . . . . . . . . . . . 3 1 . 1 1 1 . . . . 1 . f68r2.S.22;R 7 odaiin P p184 113 7 4 4 5 1 . 2 6 9 1 1 . 1 1 . 1 7 5 6 3 2 11 9 6 5 1 1 . 9 5 f89v1.b.3;L a tol A p121 53 . 1 2 3 1 . 3 4 3 2 3 3 . 5 2 1 2 2 2 3 2 1 . 2 2 1 . 1 . 2 f67v2.C1.1;C a or A p121 296 9 6 12 10 14 17 5 8 17 5 12 8 20 9 2 13 26 9 13 15 8 6 13 1 5 4 10 4 8 7 f67v2.C1.2;C a qor I p117 296 9 6 12 10 14 17 5 8 17 5 12 8 20 9 2 13 26 9 13 15 8 6 13 1 5 4 10 4 8 7 f66r.L.3;F d otain Z p137 21 1 . 1 . 1 1 1 . . 1 5 2 1 . 3 . 1 . . . . . . . . . 2 . 1 . f72r1.S.13;K d chety A p127 20 1 . . . . 1 2 2 2 2 1 . 1 2 1 . . 1 . . 1 1 . . . . . . . 2 f68v2.R.4;C e daly I p117 20 . 1 . . 1 1 3 1 . 4 1 2 . 4 . . . 1 . . . . . 1 . . . . . . f66r.L.6;C e oky Z p133 227 3 10 8 11 7 13 9 7 6 13 12 9 7 10 21 8 8 4 13 3 12 2 2 4 4 2 4 4 4 7 f70v2.S1.1;K e dar T p166 291 8 10 13 17 13 14 15 8 7 10 17 7 11 4 11 18 22 12 14 7 7 2 9 7 1 5 . 4 10 8 f85r2.T.1;U e oty Z p133 180 5 13 10 10 3 6 6 10 8 5 7 5 8 15 7 3 5 3 7 2 7 9 1 6 1 2 4 3 3 6 f70v2.S1.2;K e otol T p165 104 2 6 9 6 5 3 7 5 5 2 1 3 2 7 1 3 7 4 2 4 1 5 2 1 2 . . 3 5 1 f85r1.T.34;F g ody Z p133 56 2 2 3 . 3 4 5 3 3 . . . 3 . 2 . 6 3 4 2 2 . 2 3 3 . 1 . . . f70v2.S1.1;K g okol P p182 137 8 5 6 10 . 4 7 6 3 4 . 8 7 4 2 5 5 11 9 12 2 5 2 1 6 . . . 
4 1 f89r2.m1.2;Q
h dal Z p133 252 8 3 7 12 7 6 5 3 15 20 14 18 9 6 15 14 10 27 9 8 6 4 4 2 8 3 1 1 5 2 f70v2.S2.10;K
h okain T p159 99 . . 1 3 2 3 . 7 2 9 16 16 11 4 6 2 1 . 1 1 2 2 1 4 1 . . 3 . 1 f82r.T1.18;F
i y T p229 41 4 . 3 3 3 2 3 2 4 . 3 1 6 2 1 . . 2 1 . . . . . . . . 1 . . f114r.T1.34;G
i cham Z p138 20 4 1 . 1 1 . 2 1 3 2 . 1 1 . . . . . . . . 1 . . . . 1 . . 1 f72r2.S.18;K
i cphy A p121 13 2 2 2 . . . 1 2 1 . . . . . . . 1 . 1 . . . . . . . . . 1 . f67v2.C2.2;C
i dan P p181 10 4 . 1 1 . 2 1 . . . . . . . . . . . 1 . . . . . . . . . . . f89r1.m.3;K
i daiin T p117 980 41 69 54 60 51 70 34 45 46 20 17 19 14 10 24 31 17 63 29 52 22 15 24 13 21 31 11 17 39 21 f66r.W.1;U
i dchol S p124 21 2 3 2 2 1 1 . 4 . 1 . . 1 1 . . 1 1 . . . . . . . . . . . 1 f68r2.S.5;R
i otor P p184 50 2 7 7 3 1 6 2 3 . 2 . . 1 . . . 1 . 3 . . 1 3 3 . . . 2 1 2 f89v1.b.2;L
i chol T p206 390 40 30 42 20 12 11 20 21 27 7 2 2 4 2 3 1 8 15 14 33 6 12 3 8 9 6 6 9 11 6 f101v1.T.10;F
i okchor S p124 26 2 7 6 . 2 2 2 . 1 1 . . . . . . . . 1 . . . . . . 1 . . . 1 f68r2.S.8;R
i shol T p081 180 17 18 12 9 1 7 25 11 6 9 2 1 6 3 5 2 2 4 7 7 4 4 3 3 5 1 1 4 1 . f42r.T3.23;F
i shy A p127 105 6 10 11 14 7 5 8 3 3 5 3 . 3 4 2 1 2 5 3 1 . . . 4 1 . 1 . . 3 f68v2.R.3;C
i shor Z p136 93 10 15 6 6 8 5 7 7 2 2 . 1 . 1 . 1 6 1 3 4 . 2 2 . 1 . . 1 . 2 f71v.S2.4;C
i dy Z p138 207 6 32 12 15 9 11 13 10 10 7 10 10 6 6 9 6 7 5 3 9 4 . . 3 1 . 2 1 . . f72r2.S.5;K

  I decided to improve my find-occurrences script so that it reports
  the actual string matched, as well as the pattern. Then we can
  capture all variants of interesting labels, such as "otolor"...

97-12-05 stolfi
===============

  Still working on the new label reference maps.

  Rene sent me his VTX text-extraction tool, and a set of page-header
  lines of the form

    {$I=T $Q=A $P=A $L=A $H=1}
    {$I=H $Q=A $P=B $L=A $H=1}
    {$I=H $Q=A $P=C $L=A $H=1}
    {$I=H $Q=A $P=D $L=A $H=1}

  that are used by VTX to find the requested pages.
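
  For checking such header lines against an index, it may help to turn
  each one into a small table of fields. Here is a minimal sketch in
  Python. It assumes only the surface syntax seen in the samples above
  (braces around space-separated "$X=value" fields); the meaning of
  each field is not recorded here, so the values are kept as opaque
  strings. The function name is hypothetical and is not part of VTX.

```python
import re

# One "$X=value" field: a single-letter key and a word-like value.
# This pattern is inferred from the sample lines only.
_FIELD = re.compile(r"\$(\w)=(\w+)")


def parse_page_header(line):
    """Parse a line like '{$I=H $Q=A $P=B $L=A $H=1}' into a dict."""
    text = line.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("not a page-header line: %r" % line)
    return dict(_FIELD.findall(text))


if __name__ == "__main__":
    print(parse_page_header("{$I=H $Q=A $P=B $L=A $H=1}"))
```

  Two headers can then be compared field by field instead of as raw
  strings.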
  I added those lines in front of all the relevant files in L16-eva
  (the "page comments" files such as "f1r", not the text unit files
  such as "f1r.T").

  I also compared his data against my own index (L16-eva/INDEX), fixed
  some errors in the latter, and noted some discrepancies in the
  section codes. (Basically, some of his sections were assigned on the
  basis of the page's location in the bound book, rather than its
  contents.)

97-12-21 stolfi
===============

  While preparing the new label location maps (Note-010.html), I got
  curious about the colocates of some words.

  Let's start with "daiin", which is very common and about equally
  frequent in both languages:

    compare-word-colocates \
      '\bdaiin\b' \
      hea-f-eva.wds heb-f-eva.wds bio-f-eva.wds

  count hea-f-eva.wds          count heb-f-eva.wds          count bio-f-eva.wds
  ----- ---------------------- ----- ---------------------- ----- ----------------------
     84 daiin /                   19 / daiin                   19 / daiin
     46 / daiin                    5 daiin /                    9 daiin /
     30 chol daiin                 4 - daiin                    7 daiin chey
     23 daiin =                    4 daiin chedy                6 daiin ol
     12 - daiin                    3 chckhy daiin               5 daiin chedy
     11 daiin cthy                 3 chedy daiin                5 shey daiin
     10 daiin daiin                3 daiin or                   4 daiin daiin
     10 shol daiin                 3 daiin otal                 4 daiin shedy
      9 chor daiin                 2 ar daiin                   4 daiin shey
      8 daiin -                    2 daiin chcthy               4 qokal daiin

  It is reassuring that in both languages "daiin" likes line-start and
  line-end positions. However, it is curious that in language B "daiin"
  favors line-starts, while in language A it prefers line-ends.

  Let's modify the code so that it ignores line breaks.
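
  A rough sketch of what such a pair count might look like with line
  breaks ignored, in Python. This is not the actual
  compare-word-colocates script. It assumes the input is a flat list
  of tokens in which "/" and "-" mark line breaks (a guess based on
  the table above), and it matches the pattern against whole tokens
  rather than using "\b" anchors. The function name is made up.

```python
import re
from collections import Counter

# Assumption: these tokens stand for line breaks in the word stream.
LINE_BREAK_MARKERS = {"/", "-"}


def count_colocates(tokens, pattern, ignore_line_breaks=True):
    """Count adjacent token pairs where either member matches `pattern`."""
    if ignore_line_breaks:
        tokens = [t for t in tokens if t not in LINE_BREAK_MARKERS]
    target = re.compile(pattern)
    pairs = Counter()
    for left, right in zip(tokens, tokens[1:]):
        if target.fullmatch(left) or target.fullmatch(right):
            pairs[(left, right)] += 1
    return pairs


if __name__ == "__main__":
    # Made-up token stream, only to exercise the function.
    sample = "chol daiin / daiin chey = shol daiin".split()
    for (left, right), n in count_colocates(sample, "daiin").most_common():
        print(n, left, right)
```

  Dropping the markers before pairing makes the last word of one line
  and the first word of the next count as neighbours, which is the
  whole point of the change.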
  Let's also map 't' to 'k', final [ao] to y, initial y or qy to o or qo:

    compare-word-colocates \
      '\bd[ao]iin\b' \
      hea-f-eva.wds heb-f-eva.wds bio-f-eva.wds

  count hea-f-eva.wds          count heb-f-eva.wds          count bio-f-eva.wds
  ----- ---------------------- ----- ---------------------- ----- ----------------------
     30 chol daiin                 4 chckhy daiin               7 daiin chey
     23 daiin =                    4 daiin chedy                6 daiin ol
     15 daiin ckhy                 4 daiin okal                 5 daiin chedy
     13 daiin daiin                3 ar daiin                   5 daiin okedy
     11 daiin qokchy               3 chedy daiin                5 qokal daiin
     10 shol daiin                 3 daiin okaiin               5 qoky daiin
      9 chor daiin                 3 daiin or                   5 shey daiin
      9 okol daiin                 3 daiin shody                4 daiin daiin
      8 chy daiin                  3 okaiin daiin               4 daiin okaiin
      8 ckhy daiin                 2 daiin chckhy               4 daiin shedy

  Perhaps some of A's chol, shol, ckhy correspond to B's chey, shey,
  chedy, shedy.

  Let's try with "okaiin":

    compare-word-colocates \
      '\b[q]*[oy]k[ao]iin\b' \
      hea-f-eva.wds heb-f-eva.wds bio-f-eva.wds

  count hea-f-eva.wds          count heb-f-eva.wds          count bio-f-eva.wds
  ----- ---------------------- ----- ---------------------- ----- ----------------------
      7 daiin okaiin               5 okaiin okaiin             22 shedy qokaiin
      5 chol okaiin                4 okaiin =                  17 chedy qokaiin
      4 okaiin =                   3 daiin okaiin              17 qokaiin chedy
      4 okaiin ckhy                3 okaiin chckhy             13 qokaiin shedy
      4 okaiin okaiin              3 okaiin daiin              12 qokaiin ol
      4 or okaiin                  3 okaiin okar               10 shey qokaiin
      3 okaiin daiin               2 aiin okaiin                9 chey qokaiin
      3 okaiin s                   2 chckhy okaiin              9 okaiin shedy
      2 ckhor okaiin               2 chdy qokaiin               9 qokaiin checkhy
      2 ckhy qokaiin               2 dain okaiin                8 qokaiin chckhy

  It may be that A's chol is B's chedy/shedy.
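
  For the record, here is one possible reading of the letter mappings
  used for these tables ('t' to 'k', final [ao] to y, initial y or qy
  to o or qo), as a minimal Python sketch. The order in which the
  three rules are applied is an assumption, and the function name is
  made up; the actual script may well differ.

```python
def normalize_word(word):
    """Collapse some spelling variants before counting colocates.

    Rules, applied in this (assumed) order:
      1. every 't' becomes 'k';
      2. a final 'a' or 'o' becomes 'y';
      3. an initial 'qy' becomes 'qo', and an initial 'y' becomes 'o'.
    """
    word = word.replace("t", "k")
    if word.endswith(("a", "o")):
        word = word[:-1] + "y"
    if word.startswith("qy"):
        word = "qo" + word[2:]
    elif word.startswith("y"):
        word = "o" + word[1:]
    return word


if __name__ == "__main__":
    for w in ["otal", "ykal", "cthy", "cheo", "daiin"]:
        print(w, "->", normalize_word(w))
```

  With this reading "otal" and "ykal" both become "okal", and "cthy"
  becomes "ckhy", which agrees with the word forms that show up in the
  tables above.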
  Another word that is common in both languages is "okal":

    compare-word-colocates \
      '\b[q]*[oy][tk][oa][l]\b' \
      hea-f-eva.wds heb-f-eva.wds bio-f-eva.wds

  count hea-f-eva.wds          count heb-f-eva.wds          count bio-f-eva.wds
  ----- ---------------------- ----- ---------------------- ----- ----------------------
      9 okol daiin                 4 daiin okal                15 qokal chedy
      8 qokol daiin                3 chdy okal                 12 qokal shedy
      7 okol chol                  3 okal dar                   9 chedy qokal
      5 daiin qokol                2 aiin okal                  9 qokeedy qokal
      3 ckhor okol                 2 chckhy okal                9 shedy qokal
      3 dain okol                  2 okaiin okal                7 qokal dar
      3 okal chol                  2 okal chedy                 7 qokedy qokal
      3 okol dol                   2 okal chody                 6 okal chedy
      3 shor okol                  2 okal dam                   5 qokal daiin
      2 chody okol                 2 okal okair                 5 qokal dy

  Here are the counts without the k/t and o/y fixes:

  count hea-f-eva.wds          count heb-f-eva.wds          count bio-f-eva.wds
  ----- ---------------------- ----- ---------------------- ----- ----------------------
      6 otol chol                  3 daiin otal                11 qokal chedy
      5 qokol daiin                3 okal dar                   9 qokal shedy
      4 daiin qotol                2 aiin okal                  8 shedy qokal
      4 otol daiin                 2 chckhy okal                6 chedy qokal
      3 okol daiin                 2 chdy ykal                  6 qokeedy qokal
      3 qotol daiin                2 okal chedy                 5 qokal daiin
      2 cho qokol                  2 okal okair                 4 okal chedy
      2 cthor otol                 2 okal shdy                  4 qokal dar
      2 daiin otal                 2 qokol chedy                4 qotal chedy
      2 odaiin okal                1 chcfhol okal               3 chedy qotal

  So it seems that A uses otol/okol where B uses okal/otal.

  It is tempting to identify A's chol with B's chedy/shedy.
Let's try with "okar", which is also distributed fairly uniformly: compare-word-colocates \ '\b[q]*[oy][tk][oa]r\b' \ hea-f-eva.wds heb-f-eva.wds bio-f-eva.wds count hea-f-eva.wds count heb-f-eva.wds count bio-f-eva.wds ----- ---------------------- ----- ---------------------- ----- ---------------------- 3 daiin qokor 6 qokar okar 6 qokar shedy 3 okor chor 5 okar chdy 6 qokeedy qokar 3 qokor chor 4 okar ar 5 chedy qokar 2 dain qokor 4 okar or 5 qokar ol 2 dy qokor 3 ar okar 4 chckhy okar 2 okor chey 3 okaiin okar 4 okar okedy 2 oky okor 3 okar chedy 4 okar ol 2 qokchy qokor 3 okar okedy 4 okar shedy 2 qokor chol 3 okar ol 4 shey qokar 2 qokor daiin 3 qokar chckhy 3 okar chedy Again, where A uses "or", B uses "ar". Perhaps A's qokchy is B's qokeedy ? Another fairly uniform word is "qokeey": compare-word-colocates \ '\b[q]*[oy][kt][cse][eh][yo]\b' \ hea-f-eva.wds heb-f-eva.wds bio-f-eva.wds count hea-f-eva.wds count heb-f-eva.wds count bio-f-eva.wds ----- ---------------------- ----- ---------------------- ----- ---------------------- 11 daiin qokchy 3 chedy okeey 8 qokeey qokedy 8 okchy kchy 2 keedy okeey 6 qokeedy qokeey 7 daiin okchy 2 okchy okar 6 shedy qokeey 7 qokchy qokchy 2 okeey daiin 4 chedy qokeey 5 ckhy okchy 2 okeey dar 4 qokeey okeey 5 okchy daiin 2 r okeey 4 qokeey raiin 5 okeey daiin 1 alfshe? okshy 3 dar qokeey 5 qokchy daiin 1 arar okeey 3 okeey qol 5 qokchy kchy 1 chees okeey 3 qokeey daiin 4 qokchy qoky 1 chek qokchy 3 qokeey qokaiin Here are the counts without the k/t and y/o fixes: count hea-f-eva.wds count heb-f-eva.wds count bio-f-eva.wds ----- ---------------------- ----- ---------------------- ----- ---------------------- 4 daiin qokchy 2 chedy okeey 6 qokeey qokedy 4 daiin qotchy 2 r yteey 6 shedy qokeey 4 qotchy qokchy 1 alfshe? 
okshy 4 chedy qokeey 3 cthy otchy 1 arar oteey 3 dar qokeey 3 okeey daiin 1 chedy ykeey 3 qokeey daiin 3 qotchy daiin 1 chees oteey 3 qokeey qokaiin 3 qoteey daiin 1 chek qokchy 3 qokeey raiin 2 aiin qotchy 1 cheody okeey 2 oteey qol 2 choty qokchy 1 chfalas qokeey 2 pchedy qokeey 2 daiin otchy 1 cthy qokeey 2 qokedy qokeey Note the near-repetition "qotchy qokchy" in A, and "qokeey qokedy" or "qokedy qokeey" in B. Now "otam", also fairly uniform: compare-word-colocates \ '\b[q]*[oy][tk][ao][mjg]\b' \ hea-f-eva.wds heb-f-eva.wds bio-f-eva.wds count hea-f-eva.wds count heb-f-eva.wds count bio-f-eva.wds ----- ---------------------- ----- ---------------------- ----- ---------------------- 1 char okam 2 chdam qokam 1 chcphey qokam 1 chol qokom 2 daiin okam 1 chedy qokam 1 ckham okom 2 qokar okam 1 lchey qokam 1 ckhor okam 1 aiin okam 1 okam olaiin 1 dar okom 1 akedy okam 1 okar okam 1 kal okam 1 ar okam 1 qokam chedy 1 kchody qokam 1 chdar okam 1 qokam okal 1 okam = 1 chdy okam 1 qokam qokaiin 1 okam chckh 1 checkhy okam 1 qokam s 1 okam chol 1 chekeedy okam 1 qokam sol Can't say much... Next is "chey" also very uniform: compare-word-colocates \ '\b[q]*[cse][eh]ey\b' \ hea-f-eva.wds heb-f-eva.wds bio-f-eva.wds count hea-f-eva.wds count heb-f-eva.wds count bio-f-eva.wds ----- ---------------------- ----- ---------------------- ----- ---------------------- 5 chey kchy 3 chey = 10 shey qokaiin 3 cheor chey 2 dar chey 9 chey qokaiin 3 chey keey 2 qoky chey 7 daiin chey 3 dar shey 2 shey daiin 7 qol chey 3 dy shey 2 shey qokaiin 6 qokaiin shey 2 chey dam 1 ar shey 5 chey qokeedy 2 chey kchol 1 chdain shey 5 ol shey 2 chey keor 1 chdar shey 5 shey daiin 2 chey kor 1 che?dy chey 5 shey qokedy 2 chey kshey 1 chedy chey 5 shey qoky Not clear... 
The word "chckhey" is also fairly uniform: compare-word-colocates \ '\b[cse][eh][ce][kt][he]ey\b' \ hea-f-eva.wds heb-f-eva.wds bio-f-eva.wds count hea-f-eva.wds count heb-f-eva.wds count bio-f-eva.wds ----- ---------------------- ----- ---------------------- ----- ---------------------- 1 chain chckhey 1 ???in shckhey 2 daiin chckhey 1 chckhey chor 1 chckhey = 2 qokaiin chckhey 1 chckhey daiin 1 chckhey choky 1 chckhey cheor 1 chckhey okaiin 1 chckhey okchdy 1 chckhey dar 1 chckhey okshy 1 chckhey or 1 chckhey kedy 1 chckhey ol 1 dair shckhey 1 chckhey lchey 1 chckhey qod 1 kodaiin shckhey 1 chckhey ldy 1 chkaiin shckhey 1 odain chckhey 1 chckhey qokeedy 1 ckhol chckhey 1 okain chckhey 1 chckhey qokeeol 1 ckhy chckhey 1 okam chckhey 1 chckhey saiin The uses of this word are too scattered for us to say anything useful. Another uniform word is "yshey": compare-word-colocates \ '\b[q]*[oy][cse][eh]ey\b' \ hea-f-eva.wds heb-f-eva.wds bio-f-eva.wds count hea-f-eva.wds count heb-f-eva.wds count bio-f-eva.wds ----- ---------------------- ----- ---------------------- ----- ---------------------- 1 chey qoeeey 1 chdy ochey 1 chealy oshey 1 chydaiin ochey 1 dy ochey 1 cheedy oshey 1 ckhar ochey 1 kchodain oeeey 1 dy ochey 1 ckhy ochey 1 lor ochey 1 lcheey qochey 1 daiin ochey 1 ochey dar 1 lor oshey 1 dy ochey 1 ochey kamar 1 ochey kal 1 ochey chol 1 ochey oly 1 ochey qokain 1 ochey ckhos 1 oeeey okaiin 1 okar oshey 1 ochey kchokchy 1 ols oshey 1 ochey kchos 1 oroly ochey Finally, let's try "or": compare-word-colocates \ '\b[oya]r\b' \ hea-f-eva.wds heb-f-eva.wds bio-f-eva.wds count hea-f-eva.wds count heb-f-eva.wds count bio-f-eva.wds ----- ---------------------- ----- ---------------------- ----- ---------------------- 5 ckhy or 6 or aiin 8 or shedy 4 or okaiin 4 okar ar 4 or aiin 3 ar al 4 okar or 3 or al 3 chol or 3 ar daiin 2 chedy or 3 or chol 3 ar okar 2 chekar or 3 or chor 3 daiin or 2 dal or 2 daiin or 3 dar ar 2 dar ar 2 dol or 3 kor or 2 or chey 2 okaiin or 3 or 
ar 2 or sheey 2 or aiin 2 ar aiin 2 or shey Again, it seems that A's chol is B's chedy. Also A's okaiin seems to be B's aiin. Now a few random words: compare-word-colocates \ '\b[cs]ho[rl]\b' \ hea-f-eva.wds heb-f-eva.wds bio-f-eva.wds count hea-f-eva.wds count heb-f-eva.wds count bio-f-eva.wds ----- ---------------------- ----- ---------------------- ----- ---------------------- 30 chol daiin 3 chol kar 2 qokaiin chol 20 chol chol 2 or chol 2 qokol chol 10 shol daiin 1 arakaiin shol 2 shol kedy 9 chor daiin 1 chdaiin chol 1 ?chor or 8 chol ckhol 1 ches chol 1 chcphey chol 8 chol shol 1 chkaiin chol 1 chey chol 8 chor chol 1 chol alaiin 1 chol ar 7 chol ckhy 1 chol chckhy 1 chol chedcheydaiin 7 okol chol 1 chol chky 1 chol cheky 6 chol chor 1 chol dar 1 chol chy Note the curious numeric coincidence in the first file. compare-word-colocates \ '\b[cse][he]edy\b' \ hea-f-eva.wds heb-f-eva.wds bio-f-eva.wds count hea-f-eva.wds count heb-f-eva.wds count bio-f-eva.wds ----- ---------------------- ----- ---------------------- ----- ---------------------- 4 daiin chedy 22 shedy qokaiin 3 chedy daiin 18 qol chedy 3 chedy okedy 17 chedy qokaiin 3 chedy okeey 17 qokaiin chedy 3 okar chedy 17 shedy qokedy 3 shedy qokedy 15 chedy qol 2 chedy chckhy 15 ol shedy 2 chedy dal 15 qokal chedy 2 chedy dar 15 shedy qokeedy 2 chedy kedy 13 ol chedy compare-word-colocates \ '\b[ce][tk][eh][ao][rl]\b' \ hea-f-eva.wds heb-f-eva.wds bio-f-eva.wds count hea-f-eva.wds count heb-f-eva.wds count bio-f-eva.wds ----- ---------------------- ----- ---------------------- ----- ---------------------- 8 chol ckhol 1 aiin ckhar 1 ckhal saiin 8 daiin ckhor 1 ckhar od 1 ckhol chedy 7 daiin ckhol 1 ckhol ol 1 ckhol skar 6 ckhol chol 1 okaiin ckhol 1 ckhor chey 5 ckhol daiin 1 ckhor olchdy 3 chor ckhol 1 daiin ckhal 3 chor ckhor 1 iin ckhor 3 ckhol dy 1 olshey ckhor 3 ckhor chol 1 qokal ckhol 3 ckhor okol 1 rkaiin ckhol 97-12-22 stolfi =============== It occurred to me that the labels should tell us a lot 
about valid word prefixes and suffixes, since their word boundaries shoudl be more reliable than those defined by spaces in the manuscript. Another possible source for that information is the words in line-initial and line-final position. Quick test: cat Note-010/labtit.evt \ | egrep -v '\.T[0-9]*\.[0-9].*>' \ > .labels-eva.evt Edited .labels-eva.evt, removing some non-labels and garbage, producing .labels-s-eva.evt. extract-words-from-interlin \ -chars 'aoeilmnrchtpkfsqjdvxyg' \ .labels-s-eva.evt \ .labels-s-eva lines words bytes file ------ ------- --------- ------------ 327 745 3485 .labels-s-eva.txt 754 754 3503 .labels-s-eva.wds 357 357 2500 .labels-s-eva.dic 410 410 2760 .labels-s-eva-gut.wds 344 344 2419 .labels-s-eva-gut.dic 334 334 668 .labels-s-eva-fun.wds 3 3 6 .labels-s-eva-fun.dic 10 10 75 .labels-s-eva-bad.wds 10 10 75 .labels-s-eva-bad.dic Sample from .labels-s-eva.txt: otaik dak alak = otaldy = otoky = seeyar = ykas asy = sosainr = oteey dar = ytodaiir = Digraph counts: TT . / = a o e i l m n r c h t p k f s q j d y g ? - ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . 86 . . . 32 13 . . . . . . 11 . 2 . . . 9 . . 12 4 2 1 . / 3 . . . . 2 . . . . . . . . . . . . . . . . 1 . . . = 324 . . . 4 217 2 1 . . . 1 26 . 1 . 2 . 27 2 . 22 18 . 1 . a 327 . . 4 . 1 1 51 105 28 5 107 . . 1 . 6 . 5 . 4 4 2 1 2 . o 424 2 . 5 10 1 8 4 74 2 1 46 13 . 105 14 69 11 21 . 2 32 2 . 2 . e 130 . . . 8 39 32 1 . 1 . 2 1 . 5 3 4 2 6 . . 5 19 1 1 . i 107 . . . . . . 39 1 . 42 20 . . 1 . 2 1 . . . . . . 1 . l 184 14 . 45 31 9 6 3 . . . 1 10 . . . 5 . 14 . . 19 22 4 . 1 m 32 2 . 28 . . . . . . . . 1 . . . . . . . . . . 1 . . n 48 10 1 27 3 . . . . . . 1 . . . . . . . . 1 1 3 . . 1 r 180 27 . 56 47 11 2 2 . . . . 7 . . . . . 2 . . 1 21 2 1 1 c 109 . . . . . . . . . . . . 95 4 4 4 2 . . . . . . . . h 138 . . 1 17 41 36 . . . . 1 2 . 1 . . 1 3 . 1 18 15 . 1 . t 129 1 . . 
51 31 21 . . . . 1 9 4 . . . . 1 . . 1 8 . 1 . p 23 . . . 5 4 1 . . . . . 8 4 . . . . 1 . . . . . . . k 104 2 . 1 37 22 17 . 2 . . . 8 4 . . . . 1 . . . 9 . 1 . f 18 1 . 1 8 2 . . . . . . 3 2 . . . . . . . . 1 . . . s 102 8 . 12 21 14 3 3 . . . . 1 29 . . 1 . . . . 1 7 . . 2 q 2 . . . . 1 . . . . . . . . . . 1 . . . . . . . . . j 8 1 . 6 . . . . . . . . . . . . . . . . . . 1 . . . d 130 1 . 7 49 13 . 3 . 1 . . 7 . . . . . 1 . . . 48 . . . y 192 17 1 127 1 1 . . . . . . 1 . 8 2 10 1 7 . . 13 1 . . 2 g 11 . . 3 2 . . . . . . . . . . . . . . . . . 6 . . . ? 17 . . 2 1 1 1 . 2 . . . . . 1 . . . 1 . . . 3 . 5 . - 7 . . . . 1 . . . . . . 1 . . . . . 3 . . 1 1 . . . ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 2835 86 3 324 327 424 130 107 184 32 48 180 109 138 129 23 104 18 102 2 8 130 192 11 17 7 Next-symbol probability (× 99): TT . / = a o e i l m n r c h t p k f s q j d y g ? - -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 99 . . . 37 15 . . . . . . 13 . 2 . . . 10 . . 14 5 2 1 . / 99 . . . . 66 . . . . . . . . . . . . . . . . 33 . . . = 99 . . . 1 66 1 . . . . . 8 . . . 1 . 8 1 . 7 6 . . . a 99 . . 1 . . . 15 32 8 2 32 . . . . 2 . 2 . 1 1 1 . 1 . o 99 . . 1 2 . 2 1 17 . . 11 3 . 25 3 16 3 5 . . 7 . . . . e 99 . . . 6 30 24 1 . 1 . 2 1 . 4 2 3 2 5 . . 4 14 1 1 . i 99 . . . . . . 36 1 . 39 19 . . 1 . 2 1 . . . . . . 1 . l 99 8 . 24 17 5 3 2 . . . 1 5 . . . 3 . 8 . . 10 12 2 . 1 m 99 6 . 87 . . . . . . . . 3 . . . . . . . . . . 3 . . n 99 21 2 56 6 . . . . . . 2 . . . . . . . . 2 2 6 . . 2 r 99 15 . 31 26 6 1 1 . . . . 4 . . . . . 1 . . 1 12 1 1 1 c 99 . . . . . . . . . . . . 86 4 4 4 2 . . . . . . . . h 99 . . 1 12 29 26 . . . . 1 1 . 1 . . 1 2 . 1 13 11 . 1 . t 99 1 . . 39 24 16 . . . . 1 7 3 . . . . 1 . . 1 6 . 1 . p 99 . . . 22 17 4 . . . . . 34 17 . . . . 4 . . . . . . . k 99 2 . 1 35 21 16 . 2 . . . 8 4 . . . . 1 . . 
. 9 . 1 . f 99 6 . 6 44 11 . . . . . . 17 11 . . . . . . . . 6 . . . s 99 8 . 12 20 14 3 3 . . . . 1 28 . . 1 . . . . 1 7 . . 2 q 99 . . . . 50 . . . . . . . . . . 50 . . . . . . . . . j 99 12 . 74 . . . . . . . . . . . . . . . . . . 12 . . . d 99 1 . 5 37 10 . 2 . 1 . . 5 . . . . . 1 . . . 37 . . . y 99 9 1 65 1 1 . . . . . . 1 . 4 1 5 1 4 . . 7 1 . . 1 g 99 . . 27 18 . . . . . . . . . . . . . . . . . 54 . . . ? 99 . . 12 6 6 6 . 12 . . . . . 6 . . . 6 . . . 17 . 29 . - 99 . . . . 14 . . . . . . 14 . . . . . 42 . . 14 14 . . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 3 0 11 11 15 5 4 6 1 2 6 4 5 5 1 4 1 4 0 0 5 7 0 1 0 Previous-symbol probability (× 99): TT . / = a o e i l m n r c h t p k f s q j d y g ? - -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 3 . . . 10 3 . . . . . . 10 . 2 . . . 9 . . 9 2 18 6 . / 0 . . . . . . . . . . . . . . . . . . . . . 1 . . . = 11 . . . 1 51 2 1 . . . 1 24 . 1 . 2 . 26 99 . 17 9 . 6 . a 11 . . 1 . . 1 47 56 87 10 59 . . 1 . 6 . 5 . 50 3 1 9 12 . o 15 2 . 2 3 . 6 4 40 6 2 25 12 . 81 60 66 61 20 . 25 24 1 . 12 . e 5 . . . 2 9 24 1 . 3 . 1 1 . 4 13 4 11 6 . . 4 10 9 6 . i 4 . . . . . . 36 1 . 87 11 . . 1 . 2 6 . . . . . . 6 . l 6 16 . 14 9 2 5 3 . . . 1 9 . . . 5 . 14 . . 14 11 36 . 14 m 1 2 . 9 . . . . . . . . 1 . . . . . . . . . . 9 . . n 2 12 33 8 1 . . . . . . 1 . . . . . . . . 12 1 2 . . 14 r 6 31 . 17 14 3 2 2 . . . . 6 . . . . . 2 . . 1 11 18 6 14 c 4 . . . . . . . . . . . . 68 3 17 4 11 . . . . . . . . h 5 . . . 5 10 27 . . . . 1 2 . 1 . . 6 3 . 12 14 8 . 6 . t 5 1 . . 15 7 16 . . . . 1 8 3 . . . . 1 . . 1 4 . 6 . p 1 . . . 2 1 1 . . . . . 7 3 . . . . 1 . . . . . . . k 4 2 . . 11 5 13 . 1 . . . 7 3 . . . . 1 . . . 5 . 6 . f 1 1 . . 2 . . . . . . . 3 1 . . . . . . . . 1 . . . s 4 9 . 4 6 3 2 3 . . . . 1 21 . . 1 . . . . 1 4 . . 28 q 0 . . . . . . . . . . . . . . . 1 . . . . . . . . . j 0 1 . 2 . . . . . . . . . . . . . . . . . . 1 . . . d 5 1 . 
2 15 3 . 3 . 3 . . 6 . . . . . 1 . . . 25 . . . y 7 20 33 39 . . . . . . . . 1 . 6 9 10 6 7 . . 10 1 . . 28 g 0 . . 1 1 . . . . . . . . . . . . . . . . . 3 . . . ? 1 . . 1 . . 1 . 1 . . . . . 1 . . . 1 . . . 2 . 29 . - 0 . . . . . . . . . . . 1 . . . . . 3 . . 1 1 . . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 Symbol entropy: 3.995 Next-symbol entropy: 2.452 Splitting prefix/midfix/suffix: cat .labels-s-eva-gut.wds \ | sed \ -e 's/sh/X/g' \ -e 's/$/}/' \ -e 's/^/{/' \ -e 's/{\([qoaydirslmngj][qoaydirslmngj]*\)/\1{/' \ -e 's/\([qoaydirslmngj][qoaydirslmngj]*\)}/}\1/' \ -e 's/X/sh/g' \ -e 's/{}/\./' \ -e 's/\.//g' \ -e 's/{/- -/' \ -e 's/}/- -/' \ > .labels-s-eva.fwd cat .labels-s-eva.fwd \ | grep -v -e '- -' \ > .labels-s-unifs-all.wds cat .labels-s-eva.fwd \ | grep -e '- -' \ | gawk '/./ {print $1}' \ > .labels-s-prefs-all.wds cat .labels-s-eva.fwd \ | grep -e '- -' \ | gawk '/./ {print $2}' \ > .labels-s-midfs-all.wds cat .labels-s-eva.fwd \ | grep -e '- -' \ | gawk '/./ {print $3}' \ > .labels-s-suffs-all.wds dicio-wc .labels-s-{prefs,midfs,suffs,unifs}-all.wds lines words bytes file ------ ------- --------- ------------ 312 312 940 .labels-s-prefs-all.wds 312 312 1673 .labels-s-midfs-all.wds 312 312 1488 .labels-s-suffs-all.wds 98 98 531 .labels-s-unifs-all.wds foreach f ( prefs midfs suffs unifs ) cat .labels-s-${f}-all.wds \ | sort | uniq -c | expand | sort +0 -1nr \ > .labels-s-${f}-all.frq end dicio-wc .labels-s-{prefs,midfs,suffs,unifs}-all.frq lines words bytes file ------ ------- --------- ------------ 32 64 396 .labels-s-prefs-all.frq 87 174 1317 .labels-s-midfs-all.frq 118 236 1621 .labels-s-suffs-all.frq 79 158 1095 .labels-s-unifs-all.frq pr -m -w 64 -e -t \ .labels-s-{prefs,midfs,suffs,unifs}-all.frq \ | expand \ > .labels-s-joint-all.frq freq prefix freq midfix freq suffix freq unifix ---- -------- ---- -------- ---- -------- ---- 
-------- 194 o- 66 -t- 41 -y 6 am 54 - 55 -k- 19 -ol 6 ar 19 y- 25 -ch- 17 -ar 3 ary 8 ol- 13 -che- 16 -al 2 dy 4 d- 13 -te- 14 -or 2 gy 3 dy- 8 -sh- 11 -ody 2 odor 2 a- 7 -f- 10 -dy 2 sal 2 da- 7 -tch- 9 -aly 2 sar 2 dar- 6 -ke- 7 - 2 sary 2 so- 6 -kee- 6 -os 2 siiir 1 adair- 6 -pch- 5 -alar 1 aiin 1 al- 5 -p- 4 -aiin 1 ainaly 1 ala- 5 -she- 4 -ain 1 ainam 1 alam- 4 -kch- 4 -air 1 airar 1 ali- 3 -tok- 4 -aram 1 al 1 arar- 3 -tolch- 4 -ary 1 alols 1 aro- 2 -chet- 4 -dal 1 aly 1 do- 2 -cph- 4 -oldy 1 araly 1 dol- 2 -cth- 4 -orain 1 arar 1 il- 2 -ee- 3 -am 1 araydy 1 oal- 2 -fch- 3 -o 1 arody 1 oar- 2 -pche- 3 -odar 1 asy 1 or- 2 -talsh- 3 -oly 1 daiin 1 oyd- 2 -tare- 3 -r 1 daiindy 1 q- 2 -tee- 3 -s 1 dainy 1 qo- 1 -cfh- 2 -alaiin 1 dal 1 s- 1 -chckhe 2 -aldy 1 dalary 1 siiir- 1 -chee- 2 -alody 1 daliir 1 soi- 1 -cheeee 2 -an 1 dalsy 1 sol- 1 -chek- 2 -araiin 1 dan 1 yd- 1 -cheoct 2 -aral 1 dar 1 yy- 1 -chep- 2 -as 1 daramga 1 -chete- 2 -d 1 dararai 1 -chf- 2 -oaiin 1 dariiir 1 -choee- 2 -oaly 1 dary 1 -chof- 2 -olar 1 diin 1 -chok- 2 -ols 1 dolaj 1 -cholsh 2 -om 1 dolaram 1 -chosar 2 -yd 1 dolary 1 -chotee 1 -aday 1 dolory 1 -ckhe- 1 -ainy 1 doly 1 -cphe- 1 -airdy 1 dydarii 1 -e- 1 -airy 1 oaiin 1 -eep- 1 -aj 1 odaiin 1 -ekeee- 1 -ala 1 odaiir 1 -eoe- 1 -alain 1 odiiir 1 -eolale 1 -alal 1 odory 1 -et- 1 -alalg 1 ody 1 -faef- 1 -alaly 1 oin 1 -fche- 1 -alam 1 olaran 1 -fysk- 1 -ald 1 olaras 1 -karch- 1 -aldar 1 oldam 1 -kche- 1 -aldm 1 oldar 1 -kchoch 1 -aldo 1 onary 1 -kchsh- 1 -algar 1 oral 1 -keech- 1 -aloiir 1 orald 1 -keee- 1 -alrar 1 oram 1 -keeep- 1 -alsain 1 orar 1 -kocfh- 1 -alsy 1 oraraly 1 -kocth- 1 -alyd 1 oroj 1 -koee- 1 -any 1 orol 1 -kolsh- 1 -ao 1 osaro 1 -kshdch 1 -aralar 1 salal 1 -kydse- 1 -araldy 1 saldam 1 -pee- 1 -aralgy 1 saloiin 1 -pocph- 1 -arar 1 salols 1 -psh- 1 -aro 1 soaiin 1 -shch- 1 -dagy 1 sodar 1 -shockh 1 -daiir 1 solsy 1 -sholsh 1 -dajy 1 soly 1 -taik- 1 -dar 1 sorala 1 -tak- 1 -din 1 sororal 1 -takaik 1 -dorgy 1 sorory 1 -talch- 1 
-g 1 sosainr 1 -talef- 1 -iir 1 sydarar 1 -talek- 1 -lairgy 1 sysam 1 -tche- 1 -ldam 1 y 1 -tchosh 1 -m 1 yorain 1 -teee- 1 -oaldy 1 ys 1 -tockh- 1 -odady 1 -toee- 1 -odaiin 1 -tolcht 1 -odaiir 1 -tooee- 1 -odals 1 -torche 1 -odol 1 -tose- 1 -oj 1 -tosh- 1 -olaiin 1 -tshsh- 1 -olam 1 -olarol 1 -oldain 1 -olg 1 -olinj 1 -oloara 1 -olor 1 -ora 1 -orad 1 -oraj 1 -oraldy 1 -oram 1 -orol 1 -ory 1 -osal 1 -osam 1 -osar 1 -osarar 1 -osdy 1 -oys 1 -ral 1 -sas 1 -sody 1 -sos 1 -sy 1 -yar 1 -yda 1 -ydal 1 -ydary 1 -ydy 1 -ys 1 -ysam

Summary of the main prefix classes:

                      labels        herbal-A      herbal-B
                    -----------   -----------   -----------
    {o-,y-,a-}       215 (69%)    1234 (21%)     715 (29%)
    {-}               54 (17%)    3656 (61%)    1234 (50%)
    {qo-}              1 (0.3%)    603 (10%)     300 (12%)
    {ol-,al-}          9 (2.8%)     35 (0.6%)     62 (2.5%)
    {dy-,da-,do-}      6 (1.9%)     22 (0.4%)     13 (0.5%)
    {d-}               4 (1.3%)    201 (3.3%)     35 (1.4%)

There are also a few "micro-complex" prefixes with 1-2 occurrences each.

The frequencies are roughly similar to the text, except that

  * the empty prefix got supplanted by {o-,y-,a-}: the frequencies of {-} and {o-,y-,a-} are 50% and 29% in B text, 61% and 21% in A text, but 17% and 69% in labels.

  * On the other hand, the qo- prefix is practically non-existent in labels: 1 occurrence. (There is also an occurrence of "q-" alone, perhaps a transcription error?)

      qokoaiin.ockhey={Label on line West from central square}
      qkol={pharma label}

    In contrast, qo- occurs on 10% of herbal-A words, and 12% of herbal-B words. This is all the more remarkable given the increased frequency of the o- prefix in labels.

The low frequency of "qo-"s in labels supports the thesis that "q" is not part of the word, but a prefixed particle (article, conjunction, preposition). The enhanced frequency of {o-,y-,a-} suggests that words with that prefix are nouns, or that the prefix is an article.

We should compare the occurrences of a tailfix with and without the q-, o-, and qo- prefixes...

The midfixes are dominated by {-t-,-k-} which again is a characteristic of herbal-B.
(In herbal-A the midfixes {-ch-,-sh-} are twice as common as {-k-,-t-}.) Compare:

    by many       by Friedman   by Currier    by Friedman   by Currier
    Labels        language B    language B    language A    language A
    freq midfix   freq midfix   freq midfix   freq midfix   freq pc midfix
    ---- ------   ---- ------   ---- ------   ---- ------   ---- -- ------
      66 -t-       407 -k-       288 -k-      1045 -ch-      985 19 -ch-
      55 -k-       183 -t-       155 -ke-      526 -sh-      470  9 -sh-
      25 -ch-      179 -ke-      138 -che-     469 -k-       438  8 -k-
      13 -che-     172 -ch-      127 -ch-      444 -t-       427  8 -t-
      13 -te-      163 -che-     116 -t-       353 -cth-     298  6 -cth-
       8 -sh-      110 -she-      88 -kee-     335 -tch-     280  5 -tch-
       7 -f-       101 -kee-      85 -te-      297 -kch-     260  5 -kch-
       7 -tch-      95 -te-       75 -she-     251 -che-     201  4 -che-
     ... ...       ... ...       ... ...       ... ...       ... ...

Note however the difference in the relative frequencies of -k- versus -t-: 10:12 in labels, 10:5 in B text, 10:9 in A text. Moreover, -te- replaces -ke- as the most common "e"-modified midfix. These numbers support the thesis that -t- and -k- are merely variant shapes of the same letter, with -t- being more formal and -k- more cursive.

As for suffixes, here is a summary:

                            labels        A-text        B-text
                          -----------   -----------   -----------
    {-y,-o}                 44 (14%)    2200 (37%)     583 (23%)
    {-ol,-al,-or,-ar}       67 (21%)    1886 (32%)     400 (16%)
    -                        7 (2.2%)    124 (2.1%)     63 (2.5%)
    -ody                    11 (3.5%)    218 (3.7%)    111 (4.5%)
    -dy                     10 (3.2%)     23 (0.4%)    639 (26%)
    -aiin                    4 (1.2%)    316 (5.3%)    143 (5.8%)

The frequencies of {-ol,-al,-or,-ar}, {-}, and {-dy} seem roughly consistent with a mixture of A and B text. In particular the -y:-dy ratio is 90:21, which lies between the ratios for A text (90:1) and B text (90:105).

However, the frequencies of {-y,-o} and {-ody} are a bit too low, and that of {-aiin} is significantly lower. In fact, in labels the tail of the suffix distribution is longer than that of the midfixes, whereas in the text the midfixes have a much longer tail.

Perhaps these observations can be explained by selective omission or insertion of spaces in the text vs. labels.
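The "90:x" form of the -y:-dy comparison is just a rescaling of two raw counts, and can be rechecked with a few lines of Python. The counts below are the -y and -dy frequencies from the label and herbal suffix listings in this notebook; the helper name scaled_ratio is made up for this sketch, and the printed values may differ by a unit or so from the rounded figures quoted in the text.

```python
def scaled_ratio(y_count, dy_count, scale=90):
    """Return x such that  -y : -dy  equals  scale : x."""
    return scale * dy_count / y_count


# (count of -y, count of -dy) per sample, read off the frequency listings.
samples = {
    "labels": (41, 10),
    "A text": (1816, 23),
    "B text": (533, 639),
}

for name, (y_count, dy_count) in samples.items():
    print(f"{name}: {90}:{scaled_ratio(y_count, dy_count):.1f}")
```

The only point of the exercise is the ordering: the label ratio falls between the A and B values, which is what a mixture of the two would give.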
97-12-23 stolfi =============== Now extract line-initial and line-final words: foreach lang ( a b ) cat he${lang}-f-eva.wds \ | gawk 'BEGIN {s=1} /[\/=]/ {s=1;next}; /./ {if(s)print; s=0}' \ > he${lang}-f-bol.wds cat he${lang}-f-eva.wds \ | gawk 'BEGIN {w=""} /[\/=]/ {if(w!=""){print w;w=""};next}; /./ {w=$0}' \ > he${lang}-f-eol.wds end dicio-wc he{a,b}-f-{bol,eol}.wds lines words bytes file ------ ------- --------- ------------ 1216 1216 7610 hea-f-bol.wds 1216 1216 7010 hea-f-eol.wds 362 362 2353 heb-f-bol.wds 362 362 2039 heb-f-eol.wds foreach lang ( a b ) foreach ext ( bol eol ) cat he${lang}-f-${ext}.wds \ | sed \ -e 's/sh/X/g' \ -e 's/$/}/' \ -e 's/^/{/' \ -e 's/{\([qoaydirslmngj][qoaydirslmngj]*\)/\1{/' \ -e 's/\([qoaydirslmngj][qoaydirslmngj]*\)}/}\1/' \ -e 's/X/sh/g' \ -e 's/{}/\./' \ -e 's/\.//g' \ -e 's/{/- -/' \ -e 's/}/- -/' \ > .he${lang}-f-${ext}.fwd end end lines words bytes file ------ ------- --------- ------------ 1216 3152 13418 .hea-f-bol.fwd 1216 2668 11366 .hea-f-eol.fwd 362 926 4045 .heb-f-bol.fwd 362 792 3329 .heb-f-eol.fwd foreach lang ( a b ) foreach ext ( bol eol ) cat .he${lang}-f-${ext}.fwd \ | grep -v -e '- -' \ > .he${lang}-f-${ext}-unifs-all.wds cat .he${lang}-f-${ext}.fwd \ | grep -e '- -' \ | gawk '/./ {print $2}' \ > .he${lang}-f-${ext}-midfs-all.wds end cat .he${lang}-f-bol.fwd \ | grep -e '- -' \ | gawk '/./ {print $1}' \ > .he${lang}-f-bol-prefs-all.wds cat .he${lang}-f-eol.fwd \ | grep -e '- -' \ | gawk '/./ {print $3}' \ > .he${lang}-f-eol-suffs-all.wds end dicio-wc .he{a,b}-f-[be]ol-{prefs,midfs,suffs,unifs}-all.wds lines words bytes file ------ ------- --------- ------------ 968 968 2737 .hea-f-bol-prefs-all.wds 282 282 754 .heb-f-bol-prefs-all.wds 968 968 5645 .hea-f-bol-midfs-all.wds 282 282 1673 .heb-f-bol-midfs-all.wds 726 726 4085 .hea-f-eol-midfs-all.wds 215 215 1166 .heb-f-eol-midfs-all.wds 726 726 3128 .hea-f-eol-suffs-all.wds 215 215 909 .heb-f-eol-suffs-all.wds 248 248 1176 .hea-f-bol-unifs-all.wds 490 490 2302 
.hea-f-eol-unifs-all.wds 80 80 400 .heb-f-bol-unifs-all.wds 147 147 665 .heb-f-eol-unifs-all.wds foreach f ( .he[ab]-f-[eb]ol-{prefs,midfs,suffs,unifs}-all.wds ) cat ${f} \ | sort | uniq -c | expand | sort +0 -1nr \ > ${f:r}.frq end dicio-wc .he[ab]-f-[eb]ol-{prefs,midfs,suffs,unifs}-all.frq lines words bytes file ------ ------- --------- ------------ 169 338 2661 .hea-f-bol-midfs-all.frq 34 68 419 .hea-f-bol-prefs-all.frq 119 238 1852 .hea-f-eol-midfs-all.frq 110 220 1486 .hea-f-eol-suffs-all.frq 77 154 1180 .heb-f-bol-midfs-all.frq 22 44 259 .heb-f-bol-prefs-all.frq 59 118 880 .heb-f-eol-midfs-all.frq 54 108 722 .heb-f-eol-suffs-all.frq 93 186 1203 .hea-f-bol-unifs-all.frq 149 298 1986 .hea-f-eol-unifs-all.frq 35 70 464 .heb-f-bol-unifs-all.frq 82 164 1061 .heb-f-eol-unifs-all.frq pr -m -w 80 -e -t \ .labels-s-prefs-all.frq \ .he{a,b}-f-bol-prefs-all.frq \ Note-009/he{a,b}-f-prefs-all.frq \ | expand \ > .prefs-joint.frq all labels herbal-A bol herbal-B bol herbal-A all herbal-B all freq prefix freq prefix freq prefix freq prefix freq prefix ---- --------- ---- --------- ---- --------- ---- --------- ---- --------- 194 o- 384 - 136 - 3656 - 1234 - 54 - 154 o- 58 y- 807 o- 490 o- 19 y- 141 qo- 22 qo- 603 qo- 300 qo- 8 ol- 138 y- 19 d- 424 y- 216 y- 4 d- 71 d- 17 o- 201 d- 57 ol- 3 dy- 17 s- 9 l- 55 s- 35 d- 2 a- 11 so- 6 ol- 33 ol- 26 l- 2 da- 7 oy- 1 a- 20 so- 10 dy- 2 dar- 6 ol- 1 al- 15 l- 9 a- 2 so- 4 l- 1 ara- 13 dy- 6 s- 1 adair- 4 yo- 1 dy- 12 r- 5 al- 1 al- 3 dy- 1 lo- 10 oy- 3 a:i- 1 ala- 3 or- 1 lqo- 9 or- 3 dal- 1 alam- 2 od- 1 q- 6 da:i- 3 lo- 1 ali- 2 os- 1 qol- 6 do- 2 da:i- 1 arar- 2 q- 1 r- 6 od- 2 do- 1 aro- 2 yd- 1 s- 5 os- 2 or- 1 do- 1 dain- 1 sol- 5 yo- 2 qol- 1 dol- 1 dls- 1 ss- 4 qod- 2 r- 1 il- 1 dor- 1 yd- 4 ro- 1 a:ii- 1 oal- 1 i- 1 yo- 4 sol- 1 ad- 1 oar- 1 lor- 1 yol- 3 a- 1 ao- 1 or- 1 oldai- 3 da- 1 ar- 1 oyd- 1 ols- 3 dol- 1 ara- 1 q- 1 oor- 3 lo- 1 da- 1 qo- 1 oqo- 3 sy- 1 dalo- 1 s- 1 oso- 3 yd- 1 dol- 1 siiir- 1 qod- 2 al- 1 dor- 1 
soi- 1 qoo- 2 da:in- 1 lol- 1 sol- 1 qy- 2 dal- 1 lqo- 1 yd- 1 ro- 2 dor- 1 o:n- 1 yy- 1 syd- 2 old- 1 od- 1 ydarai- 2 qoda:i- 1 olo- 1 yol- 1 :i- 1 orol- 1 :iiin- 1 oy- 1 a:i- 1 sa:i- 1 ar- 1 say- 1 da:iinr 1 sol- 1 dao- 1 ss- 1 dar- 1 sy- 1 darod- 1 yd- 1 day- 1 yo- 1 dl- 1 yol- 1 dls- 1 ds- 1 dyo- 1 lol- 1 lor- 1 ls- 1 oda:i- 1 olda:i- 1 ols- 1 oly- 1 oo- 1 oor- 1 oqo- 1 ora- 1 ory- 1 oso- 1 qol- 1 qoo- 1 qor- 1 qos- 1 qoy- 1 rolo- 1 sa- 1 sa:i- 1 soo- 1 syd- 1 ydara:i 1 yol- 1 yr- Comparing the beginning-of-line statistics with those of all words, we can see that: * In language A, the ratio y-/o- changes from 0.90 to 0.53; whereas in language B the ratio changes from 3.41 to 0.44. Yet another argument for the thesis that y- is merely a more ornate form of o-. * Otherwise, the major prefix frequencies seem roughly the same. Which is encouraging, since it says that line breaks and word spaces are similar. If line breaks are true word boundaries, then the same is true of most spaces. * The bol sample has a smaller set of prefixes, but that seems to be just about the expected number given the ratio of sample sizes (1:6). * The prefix frequencies in labels are significantly different from those in any of the four word samples. 
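For readers who do not want to parse the gawk one-liners used for this entry, here is a rough Python equivalent of the line-initial / line-final extraction. It assumes, as the gawk patterns suggest, that the .wds files hold one token per line and that tokens containing "/" or "=" mark line and paragraph breaks; the function name is invented for this sketch.

```python
def line_initial_and_final(tokens):
    """Split a token stream into line-initial and line-final words.

    Tokens containing '/' or '=' are treated as line breaks (an assumption
    read off the gawk patterns). As in the gawk version, a final word that
    is not followed by a break is not reported as line-final.
    """
    initial, final = [], []
    at_line_start = True
    last_word = None
    for tok in tokens:
        if "/" in tok or "=" in tok:
            if last_word is not None:
                final.append(last_word)
                last_word = None
            at_line_start = True
            continue
        if not tok:
            continue  # skip empty lines
        if at_line_start:
            initial.append(tok)
            at_line_start = False
        last_word = tok
    return initial, final


# Toy stream: two complete lines and one unterminated line.
bol, eol = line_initial_and_final(["daiin", "chol", "/", "okal", "=", "shedy"])
print(bol, eol)
```

A one-word line contributes the same word to both lists, which is also what the two gawk filters do.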
Now for the suffixes: pr -m -w 80 -e -t \ .labels-s-suffs-all.frq \ .he{a,b}-f-eol-suffs-all.frq \ Note-009/he{a,b}-f-suffs-all.frq \ | expand \ > .suffs-joint.frq all labels herbal-A eol herbal-B eol herbal-A all herbal-B all freq suffix freq suffix freq suffix freq suffix freq suffix ---- --------- ---- --------- ---- --------- ---- --------- ---- --------- 41 -y 190 -y 53 -y 1816 -y 639 -dy 19 -ol 58 -aiin 31 -am 903 -ol 533 -y 17 -ar 41 -ody 30 -dy 705 -or 168 -ar 16 -al 29 -am 9 -aiin 360 -o 143 -aiin 14 -or 29 -ol 8 - 316 -aiin 111 -ody 11 -ody 28 -om 8 -ar 218 -ody 97 -ol 10 -dy 28 -or 6 -ain 174 -ar 84 -al 9 -aly 20 - 6 -al 124 - 63 - 7 - 20 -al 6 -ody 104 -al 51 -or 6 -os 18 -ar 5 -ary 76 -odaiin 44 -o 5 -alar 14 -oldy 3 -daiin 76 -s 40 -am 4 -aiin 13 -ory 3 -dam 71 -os 34 -daiin 4 -ain 12 -dy 2 -ald 69 -od 33 -d 4 -air 11 -ain 2 -ardam 66 -om 31 -os 4 -aram 11 -an 2 -d 61 -am 30 -s 4 -ary 11 -oly 2 -ol 49 -oiin 28 -dar 4 -dal 9 -od 2 -or 48 -ain 26 -ain 4 -oldy 8 -odaiin 1 -a 40 -oldy 19 -od 4 -orain 8 -os 1 -aiily 35 -oy 15 -air 3 -am 8 -s 1 -ainqod 29 -an 14 -dal 3 -o 6 -a 1 -alaiin 29 -oly 11 -aly 3 -odar 6 -o 1 -alam 27 -ory 10 -aldy 3 -oly 5 -aldy 1 -alas 26 -odar 9 -odaiin 3 -r 5 -old 1 -aldy 24 -a 7 -dain 3 -s 5 -ordy 1 -aly 23 -dy 7 -dam 2 -alaiin 5 -yd 1 -amdy 15 -odal 6 -a 2 -aldy 5 -ydy 1 -ara 14 -oaiin 6 -ary 2 -alody 4 -ald 1 -aram 14 -yd 6 -odar 2 -an 4 -ary 1 -arar 12 -d 6 -oy 2 -araiin 4 -d 1 -ardy 12 -n 5 -dair 2 -aral 4 -m 1 -aro 12 -ydy 4 -dol 2 -as 4 -oaiin 1 -aros 10 -air 4 -oar 2 -d 4 -oy 1 -da 10 -l 4 -odal 2 -oaiin 4 -ys 1 -daly 10 -odain 4 -oldy 2 -oaly 3 -odain 1 -dol 10 -ols 4 -sy 2 -olar 3 -odal 1 -dolaii 9 -ordy 3 -araiin 2 -ols 3 -odar 1 -dydy 9 -sy 3 -aral 2 -om 3 -oiin 1 -dym 8 -aldy 3 -arar 2 -yd 3 -olo 1 -m 8 -odol 3 -ardy 1 -aday 3 -ols 1 -o 8 -olol 3 -as 1 -ainy 3 -yds 1 -oam 8 -r 3 -daly 1 -airdy 2 -dm 1 -odydy 7 -ady 3 -dor 1 -airy 2 -n 1 -odys 7 -ald 3 -dydy 1 -aj 2 -odody 1 -old 7 -old 3 -oly 1 -ala 2 -olm 1 -olody 
7 -olo 3 -ydy 1 -alain 2 -olol 1 -ols 6 -aiir 2 -adaiin 1 -alal 1 -adaiin 1 -oram 6 -aly 2 -ady 1 -alalg 1 -aiiin 1 -orar 6 -ary 2 -ainr 1 -alaly 1 -aiim 1 -rodal 6 -oar 2 -ald 1 -alam 1 -aiind 1 -saiin 6 -odam 2 -amdy 1 -ald 1 -aiiny 1 -san 6 -olor 2 -an 1 -aldar 1 -air 1 -ym 6 -ydaiin 2 -aram 1 -aldm 1 -alod 1 -yom 5 -as 2 -ardam 1 -aldo 1 -alody 1 -yoram 5 -m 2 -da 1 -algar 1 -arar 5 -odaly 2 -dody 1 -aloiir 1 -ardl 5 -odan 2 -l 1 -alrar 1 -ariin 5 -on 2 -oiin 1 -alsain 1 -arm 5 -oror 2 -ols 1 -alsy 1 -aro 5 -ys 2 -oody 1 -alyd 1 -aroiin 4 -ay 2 -oram 1 -any 1 -arom 4 -daiin 2 -orar 1 -ao 1 -aryd 4 -dal 2 -so 1 -aralar 1 -as 4 -oal 2 -yl 1 -araldy 1 -da 4 -oiiin 1 -ad 1 -aralgy 1 -daiin 4 -oraiin 1 -ai:dy 1 -arar 1 -dal 4 -osy 1 -aiiin 1 -aro 1 -dam 3 -adaiin 1 -aiily 1 -dagy 1 -ds 3 -dain 1 -aiiny 1 -daiir 1 -l 3 -iin 1 -aiir 1 -dajy 1 -ld 3 -odl 1 -airaii 1 -dar 1 -oas 3 -olody 1 -airy 1 -din 1 -odaiin 3 -ooiin 1 -alaiin 1 -dorgy 1 -odam 3 -oor 1 -alaiin 1 -g 1 -odan 3 -orody 1 -alam 1 -iir 1 -odary 3 -orory 1 -alas 1 -lairgy 1 -odd 3 -osaiin 1 -aldaii 1 -ldam 1 -oddal 3 -ydal 1 -aldar 1 -m 1 -odoldy 3 -yds 1 -alody 1 -oaldy 1 -oiiin 2 -aiiin 1 -als 1 -odady 1 -olal 2 -alod 1 -amar 1 -odaiin 1 -oldam 2 -als 1 -ara 1 -odaiir 1 -oldar 2 -dam 1 -aro 1 -odals 1 -oloaii 2 -dm 1 -arodai 1 -odol 1 -olodal 2 -oary 1 -aror 1 -oj 1 -olom 2 -odody 1 -aros 1 -olaiin 1 -olsy 2 -odor 1 -asal 1 -olam 1 -on 2 -odys 1 -ay 1 -olarol 1 -ora 2 -olaiin 1 -daiiin 1 -oldain 1 -oraiin 2 -oldam 1 -dalo 1 -olg 1 -orar 2 -oldar 1 -dalor 1 -olinj 1 -ord 2 -olm 1 -dly 1 -oloara 1 -ordm 2 -olom 1 -do 1 -olor 1 -orly 2 -orar 1 -dolaii 1 -ora 1 -orm 2 -orol 1 -ds 1 -orad 1 -orods 2 -orom 1 -dsairy 1 -oraj 1 -oroiin 2 -osar 1 -dyd 1 -oraldy 1 -orom 1 -aar 1 -dyldy 1 -oram 1 -oror 1 -ad 1 -dym 1 -orol 1 -orory 1 -adam 1 -i:dy 1 -ory 1 -oross 1 -adar 1 -lain 1 -osal 1 -oryd 1 -aii:dy 1 -lal 1 -osam 1 -osory 1 -aii:m 1 -ls 1 -osar 1 -osy 1 -aii:od 1 -ly 1 -osarar 1 -oyd 1 -aii:s 1 -m 1 -osdy 1 
-r 1 -aiilm 1 -oal 1 -oys 1 -sm 1 -aiind 1 -oam 1 -ral 1 -sordy 1 -aiinda 1 -odain 1 -sas 1 -sy 1 -aiiny 1 -odair 1 -sody 1 -yddor 1 -aind 1 -odalai 1 -sos 1 -yqoldy 1 -ainos 1 -odaly 1 -sy 1 -airin 1 -odam 1 -yar 1 -alody 1 -odody 1 -yda 1 -aor 1 -odydy 1 -ydal 1 -arar 1 -odys 1 -ydary 1 -arasy 1 -olal 1 -ydy 1 -ardl 1 -olar 1 -ys 1 -ariin 1 -old 1 -ysam 1 -arm 1 -olody 1 -aro 1 -olol 1 -aroiin 1 -olor 1 -arom 1 -oraiin 1 -aryd 1 -oyl 1 -da 1 -riin 1 -dan 1 -rodal 1 -doly 1 -saiin 1 -dom 1 -san 1 -draird 1 -sar 1 -ds 1 -sdy 1 -i:s 1 -ym 1 -ir 1 -yom 1 -ld 1 -yoram 1 -lol 1 -yr 1 -lor 1 -lsy 1 -ly 1 -oain 1 -oair 1 -oan 1 -oarom 1 -oas 1 -oda 1 -odaiin 1 -odaiir 1 -odair 1 -odairo 1 -odals 1 -odary 1 -odd 1 -oddal 1 -oddy 1 -odo 1 -odoaly 1 -odoldy 1 -odoral 1 -odr 1 -oii:s 1 -oiir 1 -oin 1 -olal 1 -olda 1 -oldain 1 -oldal 1 -oldm 1 -oldom 1 -oloaii 1 -olodal 1 -oloiin 1 -ololor 1 -olols 1 -ololy 1 -olr 1 -olraii 1 -olsy 1 -oo 1 -ooaiin 1 -ora 1 -orain 1 -oral 1 -oraly 1 -orari: 1 -orary 1 -ord 1 -ordaii 1 -ordm 1 -orl 1 -orly 1 -orm 1 -orodo 1 -orods 1 -oroiin 1 -oross 1 -ors 1 -oryd 1 -osory 1 -oyd 1 -ra 1 -raiin 1 -rrr 1 -ry 1 -saiin 1 -sal 1 -sm 1 -so 1 -sody 1 -sor 1 -sordy 1 -soy 1 -yaiin 1 -yays 1 -ydain 1 -ydainy 1 -yddor 1 -ydlo 1 -ydm 1 -yl 1 -yly 1 -yoar 1 -yol 1 -ysaiin 1 -ysaiin

This is intriguing: the ratio -dy/-y, which is the clearest indicator of language B, separates the two languages more sharply on general words than on line-final words. The ratios are 0.012 (A) and 1.199 (B) for all text, but 0.063 (A) and 0.566 (B) for line-final words.

The frequencies of {-al,-ol,-ar,-or} at end-of-line are about half of their overall frequencies, for both languages. The frequency of these suffixes in labels is intermediate between the general frequencies, but higher than the end-of-line frequency.

Of the occurrences of -am in language A, 48% are at end-of-line; in language B, 78% are at end-of-line.
Presumably -am is an abbreviation (used at e-o-l to avoid a line break), or something that occurs mostly at end-of-sentence. The {-o,-a}/-y ratio is 0.073 for labels; 0.063 and 0.037 for end-of-line (A and B, respectively); and 0.211 and 0.093 for all herbal words (A and B). Yet one more argument that -y is a fancy version of -o or -a. Here is a summary of the most important suffix classes: labels herbal-A eol herbal-B eol herbal-A all herbal-B all ---- ------ --- -------- ---- ------- ---- ------- ---- ------- -[yoa] 44 (14%) 202 (28%) 55 (26%) 2200 (37%) 583 (24%) -[oa][lr] 66 (21%) 95 (13%) 18 (8.4%) 1886 (32%) 400 (16%) -[aoy]d[yoa] 12 (3.8%) 41 (5.6%) 6 (2.8%) 232 (3.9%) 114 (4.7%) -d[yoa] 10 (3.2%) 13 (1.8%) 31 (14%) 24 (0.4%) 642 (26%) - 7 (2.2%) 20 (2.8%) 8 (3.7%) 124 (2.1%) 63 (2.6%) -aiin 4 (1.2%) 58 (8.0%) 9 (4.2%) 316 (5.3%) 143 (5.9%) Labels seem to use generally less -aiin, -dy, -y, -o Now let's look at unifixes at either extremity: pr -m -w 64 -e -t \ .labels-s-unifs-all.frq \ .hea-f-bol-unifs-all.frq \ .hea-f-eol-unifs-all.frq \ Note-009/hea-f-unifs-all.frq \ | expand \ > .unifs-a-joint.frq all labels herbal-A bol herbal-A eol herbal-A all freq unifix freq unifix freq unifix freq unifix ---- --------- ---- --------- ---- --------- ---- --------- 6 am 46 daiin 107 daiin 412 daiin 6 ar 16 m 30 dy 88 dy 3 ary 12 or 20 dam 87 s 2 dy 11 dain 19 dar 74 dain 2 gy 7 dar 18 dal 71 dar 2 odor 7 dor 18 s 52 or 2 sal 7 sor 12 d 47 dal 2 sar 6 oaiin 11 dain 45 ol 2 sary 6 saiin 11 sy 40 dol 2 siiir 5 dol 8 saiin 34 dam 1 aiin 5 iin 7 am 34 dor 1 ainaly 5 ol 6 da 33 saiin 1 ainam 5 sol 6 dan 26 dair 1 airar 4 doiin 6 dom 24 sy 1 al 4 soiin 6 or 19 odaiin 1 alols 3 in 6 sal 18 ar 1 aly 3 l 5 aiin 17 d 1 araly 3 odaiin 5 ar 16 m 1 arar 3 olor 5 dary 16 sor 1 araydy 3 sar 5 ody 16 y 1 arody 3 y 5 ol 15 aiin 1 asy 3 ydaiin 4 dair 15 r 1 daiin 2 dair 4 r 15 sol 1 daiindy 2 lor 4 raiin 14 sar 1 dainy 2 oain 4 sos 13 qodaiin 1 dal 2 odar 4 y 12 sal 1 dalary 2 qo 3 
daiiin 10 al 1 daliir 2 qoaiin 3 dol 10 oaiin 1 dalsy 2 qody 3 n 10 ody 1 dan 2 qor 3 sar 9 am 1 dar 2 soy 3 sol 9 dan 1 daramgal 2 yol 3 ydaiin 9 do

Label unifixes and text unifixes seem largely disjoint, except for {am,al,ary,dy,sal,sar}. The longer label unifixes occur once each; exceptions are sary and siiir.

The unifix "iin" occurs with nonzero frequency at b-o-l but is almost absent elsewhere. Checking the text one can see that almost all occurrences are due to a common word (usually "daiin") being split across a line break.

The "m" unifix which is common at beginning of line is a Friedman transcription bug: in the Currier transcription most of those "m" are actually "g"s attached to the end of the previous line.

    pr -m -w 64 -e -t \
      .labels-s-unifs-all.frq \
      .heb-f-bol-unifs-all.frq \
      .heb-f-eol-unifs-all.frq \
      Note-009/heb-f-unifs-all.frq \
      | expand \
      > .unifs-b-joint.frq

    all labels       herbal-B bol     herbal-B eol     herbal-B all
    freq unifix      freq unifix      freq unifix      freq unifix
    ---- ---------   ---- ---------   ---- ---------   ---- ---------
       6 am            19 daiin         11 dam           86 daiin
       6 ar             7 saiin          7 dy            55 or
       3 ary            6 dar            6 daiin         50 dar
       2 dy             4 iin            6 dal           40 aiin
       2 gy             4 m              4 am            39 ar
       2 odor           3 dair           4 dar           31 ol
       2 sal            3 ol             3 aiin          28 dy
       2 sar            3 or             3 ain           25 dal
       2 sary           3 r              3 ar            25 saiin
       2 siiir          2 dor            3 daly          17 dam
       1 aiin           2 sor            3 ldy           14 ody
       1 ainaly         1 aiin           3 ody           12 oraiin
       1 ainam          1 ain            3 ol            12 s
       1 airar          1 ar             3 or             9 al
       1 al             1 daiir          3 s              8 olaiin
       1 alols          1 darar          3 saiin          8 r
       1 aly            1 iir            2 al             7 ain
       1 araly          1 laiin          2 aram           7 dair
       1 arar           1 lodaiin        2 d              7 odaiin
       1 araydy         1 oaiin          2 da             7 y
       1 arody          1 odair          2 daram          6 dol
       1 asy            1 olair          2 dary           6 raiin
       1 daiin          1 qoaiin         2 od             5 am
       1 daiindy        1 rodaiin        2 oldam          5 dain
       1 dainy          1 saiir          2 oraiin         5 dor
       1 dal            1 sair           2 r              5 m
       1 dalary         1 sairain        2 raiin          5 sair
       1 daliir         1 sar            2 sa             5 sar
       1 dalsy          1 saraiir        2 y              4 araiin
       1 dan            1 sol            1 a              4 daly
       1 dar            1 solaiin        1 aldy           4 iin
       1 daramgal       1 y              1 alod           4 ldy

The disjointness of label and text unifixes apparently holds for language B too.
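As a reading aid for the prefix/midfix/suffix/unifix terminology used throughout this entry, here is a small Python rendering of what the sed pipeline appears to do: peel off the longest leading and trailing runs of the letters [qoaydirslmngj], with "sh" protected so that its "s" is not treated as one of them; a word made up entirely of such letters is a "unifix". This is my reading of the sed script rather than a tested replacement for it, and the names SOFT and split_word are invented here.

```python
import re

# Letters that may form prefixes and suffixes (same class as in the sed script).
SOFT = "qoaydirslmngj"
_SPLIT = re.compile(rf"([{SOFT}]*)(.*?)([{SOFT}]*)")


def split_word(word):
    """Return (prefix, midfix, suffix), or None if the word is a unifix."""
    protected = word.replace("sh", "X")  # keep the 's' of 'sh' out of SOFT runs
    prefix, mid, suffix = _SPLIT.fullmatch(protected).groups()
    if mid == "":
        return None  # the whole word consists of SOFT letters
    return prefix, mid.replace("X", "sh"), suffix


for w in ["otaldy", "qokal", "shedy", "daiin"]:
    print(w, split_word(w))
```

The greedy first group takes the longest possible prefix and the lazy middle group leaves the longest possible suffix, which mirrors the order of the two sed substitutions.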