Hacking at the Voynich manuscript - Side notes
022 Analyzing QOKOKOKO element frequencies per section

Last edited on 1999-02-01 16:28:09 by stolfi

1998-06-20 stolfi
=================

  [ Originally part of Notes/023 ]
  [ First version done on 1998-05-04, now redone with fresher data. ]
  [ Split off from Notes/023 as Notes/022 on 1999-01-31. ]

The purpose of this note is to compare the frequencies of the
QOKOKOKO elements (see Notes/017 and Notes/018) among the various
sections of the VMS.

Since I am still not clear on how to group the O's with the K's (with
the following K, with the preceding K, with both, with neither), I
will leave them as separate elements.  Also, for simplicity (without
any conviction at all), I will split every double-letter O into two
elements.

Also, given Grove's observations on anomalous "p" and "t"
distributions at beginning-of-line, and the well-known attraction of
certain elements for end-of-line, it seems advisable to discard the
first few and the last few elements of every line.

I. EXTRACTING AND COUNTING ELEMENTS

We will prepare two sets of statistics, one using raw words ("-r")
and one using word equivalence classes ("-c").

  elem-to-class -describe

    element equivalence:
      map_ee_to_ch
      ignore_gallows_eyes
      join_ei
      equate_aoy
      collapse_ii
      equate_eights
      equate_pt
      erase_word_spaces
      crush_invalid_words
      append_tilde

This mapping will hopefully reduce transcription and sampling noise.
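To make the reduction concrete, here is a rough sh/sed stand-in for a few of the rules. This is my reconstruction, not the actual elem-to-class script: equate_aoy folds "a" and "y" into "o", and I assume the gallows letters "k", "p", "f" are folded into "t" (inferred from the EQV tables further below, which list no separate k/p/f classes), with append_tilde tagging each reduced class.

```shell
#!/bin/sh
# Rough stand-in for part of elem-to-class (hypothetical reconstruction).
# One element per input line:
#   equate_aoy   : fold "a" and "y" into "o"
#   gallows fold : fold "k", "p", "f" into "t" (assumed; the EQV tables
#                  list no separate k/p/f classes)
#   append_tilde : mark every reduced class with a trailing "~"
elem_to_class() {
  sed -e 's/[ay]/o/g' -e 's/[kpf]/t/g' -e 's/$/~/'
}
```

For example, `printf 'y\nke\nckh\n' | elem_to_class` yields "o~", "te~", "cth~", which are classes that do appear in the EQV tables below.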
The source data will be the majority edition derived from interlinear
release 1.6e6, already chopped into pages and sections:

  ln -s ../045/pages-m text-pages
  ln -s ../045/subsecs-m text-subsecs

Creating a combined file of the source text for archiving:

  ( cd text-pages && cat `cat all.names | sed -e 's/$/.evt/'` ) \
    > all.evt

  cat all.evt \
    | validate-new-evt-format \
        -v chars='abcdefghijklmnopqrstuvxyz?*' \
        -v requireUnitHeaders=0 \
        -v requirePageHeaders=0

Selecting the plain text and factoring it into elements:

  foreach utype ( pages subsecs )
    foreach f ( `cat text-${utype}/all.names` )
      set ofile = "/tmp/${utype}-${f}.etx"
      echo ${ofile}
      cat text-${utype}/${f}.evt \
        | select-units \
            -v types='parags,starred-parags,circular-lines,circular-text,radial-lines,titles' \
            -v table=unit-to-type.tbl \
        | lines-from-evt | egrep '.' \
        | factor-line-OK | egrep '.' \
        > ${ofile}
    end
  end

Separating the elements and mapping them to classes:

  foreach ep ( cat.RAW elem-to-class.EQV )
    set etag = ${ep:e}; set ecmd = ${ep:r}
    foreach utype ( pages subsecs )
      foreach f ( `cat text-${utype}/all.names` )
        set ofile = "/tmp/${utype}-${f}-${etag}.els"
        echo ${ofile}
        cat /tmp/${utype}-${f}.etx \
          | trim-three-from-ends \
          | tr '{}._' '\012\012\012\012' \
          | ${ecmd} | egrep '.' \
          > ${ofile}
      end
    end
  end

Counting elements and computing relative frequencies:

  mkdir -p RAW EQV

  foreach ep ( cat.RAW elem-to-class.EQV )
    set etag = ${ep:e}; set ecmd = ${ep:r}
    /bin/rm -rf ${etag}/efreqs
    mkdir -p ${etag}/efreqs
    foreach utype ( pages subsecs )
      set frdir = "${etag}/efreqs/${utype}"
      mkdir -p ${frdir}
      cp -p text-${utype}/all.names ${frdir}/
      foreach f ( `cat text-${utype}/all.names` )
        set ofile = "${frdir}/$f.frq"
        echo ${ofile}
        cat /tmp/${utype}-${f}-${etag}.els \
          | sort | uniq -c | expand \
          | sort -b +0 -1nr \
          | compute-freqs \
          > ${ofile}
      end
    end
  end

  /bin/rm /tmp/{pages,subsecs}-*{,-RAW,-EQV}.{etx,els}

Computing total frequencies:

  foreach ep ( cat.RAW elem-to-class.EQV )
    set etag = ${ep:e}; set ecmd = ${ep:r}
    foreach utype ( pages subsecs )
      set fmt = "${etag}/efreqs/${utype}/%s.frq"
      set frfiles = ( \
        `cat text-${utype}/all.names | gawk '/./{printf "'"${fmt}"'\n",$0;}'` \
      )
      echo ${frfiles}
      cat ${frfiles} \
        | gawk '/./{print $1, $3;}' \
        | combine-counts \
        | sort -b +0 -1nr \
        | compute-freqs \
        > ${etag}/efreqs/${utype}/tot.frq
    end
  end

II. TABULATING ELEMENT FREQUENCIES PER SUBSECTION

  set sectags = ( `cat text-subsecs/all.names` )
  echo $sectags

  foreach etag ( RAW EQV )
    tabulate-frequencies \
      -dir ${etag}/efreqs/subsecs \
      -title "elem" \
      tot ${sectags}
  end

Elements sorted by frequency (× 99), per subsection:

  tot     unk     pha     str     hea     heb     bio     ast     cos     zod
  ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
  17 o    16 o    23 o    15 o    20 o    14 y    15 o    18 o    16 o    20 o
  12 y    12 a     9 y    11 y    11 y    14 o    15 y    13 y    13 y    13 a
   9 a    11 y     8 l    11 a     9 ch   11 d    10 d     9 a    11 a     8 y
   8 d     8 d     6 a     8 d     7 a    10 a     8 l     7 d     9 l     8 l
   7 l     7 l     6 d     6 l     7 d     6 k     7 a     5 ch    6 d     7 t
   5 k     5 r     4 r     6 k     6 l     5 l     6 q     5 ee    4 ch    5 r
   4 ch    5 k     4 k     4 q     5 r     4 ch    6 k     5 r     4 k     5 d
   4 r     4 ch    4 ch    4 ee    4 k     4 r     3 ee    4 k     4 r     5 ee
   4 q     3 t     3 q     4 r     3 t     3 iin   3 che   3 s     4 t     3 ch
   3 ee    3 iin   3 ee    3 iin   3 iin   3 che   3 r     3 l     3 iin   3 k
   3 iin   3 q     3 che   3 ch    2 sh    2 q     2 she   3 t     2 sh    2 te
   3 t     2 che   3 iin   3 che   2 q     2 t     2 t     2 che   2 q     2 iin
   3 che   2 sh    2 ke    3 t     2 s     2 ee    2 ch    2 iin   2 ee    1 s
   1 sh    1 ee    1 s     1 she   1 che   1 ke    1 in    2 ke    2 che   1 che
   1 she   1 she   1 ?     1 sh    1 cth   1 sh    1 ke    1 q     1 s     1 sh
   1 ke    1 p     1 sh    1 ke    1 ee    1 she   1 iin   1 ?     1 she   1 she
   1 s     1 s     1 she   1 in    0 she   1 s     1 sh    1 e?    0 ke    0 p
   0 in    0 ke    1 t     0 p     0 p     0 te    1 s     1 she   0 e?    0 ?
   0 p     0 te    0 ckh   0 s     0 ckh   0 p     0 te    1 te    0 p     0 in
   0 te    0 ir    0 ckhe  0 te    0 in    0 ckh   0 ckh   1 sh    0 te    0 ke
   0 cth   0 cth   0 te    0 ir    0 m     0 f     0 p     0 p     0 ?     0 eee
   0 ckh   0 f     0 p     0 eee   0 ke    0 in    0 cth   0 ir    0 cth   0 e?
   0 ir    0 in    0 e?    0 ckh   0 cph   0 ir    0 ckhe  0 cth   0 in    0 m
   0 ?     0 m     0 iiir  0 cth   0 te    0 m     0 f     0 ckh   0 m     0 cth
   0 eee   0 ckh   0 in    0 f     0 cthe  0 cth   0 eee   0 eee   0 ckh   0 cthe
   0 f     0 cph   0 cth   0 e?    0 f     0 e?    0 e?    0 cthe  0 eee   0 iir
   0 m     0 eee   0 cthe  0 ckhe  0 ?     0 cthe  0 ir    0 in    0 cthe  0 q
   0 e?    0 e?    0 ir    0 ?     0 ckhe  0 ?     0 cthe  0 iir   0 ir    0 ckhe
   0 ?     0 f     0 m     0 e?    0 ckhe  0 iiin  0 i?    0 iir   0 cthe  0 cfh
   0 m     0 iir   0 eee   0 eee   0 h?    0 il    0 cfh   0 cph   0 cthe  0 eee
   0 i?    0 ir    0 iir   0 cph   0 j     0 ckhe  0 iir   0 cphe  0 j     0 cthe
   0 n     0 iiin  0 cphe  0 ckhe  0 ij    0 iiin  0 ckhe  0 cphe  0 iiin  0 cfh
   0 cphe  0 ?     0 cph   0 n     0 iir   0 cph   0 cph   0 cphe  0 cph   0 ck
   0 f     0 i?    0 im    0 iiin  0 il    0 i?    0 cfhe  0 ikh   0 im    0 cphe
   0 n     0 iir   0 n     0 iiin  0 i?    0 il    0 m     0 cfh   0 i?    0 i?
   0 cphe  0 iir   0 im    0 m     0 iiir  0 x     0 cfhe  0 de    0 ct    0 cfh
   0 n     0 de    0 iil   0 de    0 x     0 de    0 n     0 cfh   0 il    0 il
   0 is    0 is    0 im    0 x     0 de    0 im    0 n     0 im    0 cfhe  0 de
   0 i?    0 cfhe  0 id    0 cfhe  0 b     0 id    0 is    0 x     0 pe    0 cfh
   0 ck    0 ith   0 is    0 iil   0 id    0 c?    0 j     0 iid   0 iil   0 iir
   0 h?    0 id    0 cf    0 b     0 ck    0 iiid  0 g     0 cp    0 ct    0 iiil
   0 h?    0 ct    0 iil   0 iim   0 iiil  0 g     0 id    0 iis   0 iiir  0 iil

I have compared these counts with those obtained by removing two,
one, or zero elements from each line end.  The conclusion is that the
ordering of the first six entries in each column is quite stable; it
is probably not an artifact.

Some quick observations: there seem to be three "extremal" samples:
hea ("ch" abundant), bio ("q" important), and zod ("t" important).

There are too many "e?" elements; I must check where they come from,
and perhaps modify the set of elements to account for them.

  [ It seems that many came from groups of the form "e[ktpf]e" and
    "e[ktpf]ee", which could be "c[ktpf]h" and "c[ktpf]he" without
    ligatures.  Most of the remaining ones come from Friedman's
    transcription; there are practically none in the more careful
    transcriptions. ]

All valid elements that occur at least 10 times in the text:

  o y a q n in iin iiin r ir iir iiir d s is l il m im j de k t ke te
  p f cth ckh cthe ckhe cph cfh cphe cfhe ch che sh she ee eee x

Valid elements that occur less than 10 times in the whole text:

  iil ij pe ct ck id

Created a file "RAW/plots/vald/keys.dic" with all the valid elements.
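The .frq files read back above have a three-column format (count, relative frequency, element), since the totals step extracts fields $1 and $3. A minimal stand-in for the compute-freqs filter, assuming that format (this is my sketch, not Stolfi's actual script), could be:

```shell
#!/bin/sh
# Hypothetical stand-in for compute-freqs: read "count element" pairs
# (as produced by "sort | uniq -c"), append each element's relative
# frequency over the grand total, and emit "count freq element" lines.
compute_freqs() {
  awk '{ n[NR] = $1; e[NR] = $2; tot += $1 }
       END { for (i = 1; i <= NR; i++)
               printf "%7d %7.5f %s\n", n[i], n[i]/tot, e[i] }'
}
```

For example, `printf '3 o\n1 y\n' | compute_freqs` assigns "o" a relative frequency of 0.75 and "y" 0.25.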
Equiv-reduced elements sorted by frequency (× 99), per subsection:

  tot      unk      pha      str      hea      heb      bio      ast      cos      zod
  -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
  38 o~    40 o~    40 o~    38 o~    39 o~    40 o~    37 o~    40 o~    41 o~    42 o~
  10 t~    10 t~     8 l~    11 t~    11 ch~   12 d~    10 d~    10 ch~    9 t~    11 t~
   8 d~     8 d~     7 ch~    8 d~     9 t~    10 t~     9 t~     8 t~     9 l~     8 ch~
   8 ch~    7 l~     7 d~     8 ch~    7 d~     6 ch~    8 l~     7 d~     7 ch~    8 l~
   7 l~     6 ch~    6 t~     6 l~     6 l~     5 l~     6 q~     5 r~     6 d~     5 d~
   4 r~     5 r~     4 r~     4 q~     5 r~     4 r~     5 ch~    3 s~     4 r~     5 r~
   4 q~     3 in~    3 q~     4 in~    4 in~    3 in~    3 che~   3 l~     3 in~    3 te~
   4 in~    3 q~     3 in~    4 r~     2 sh~    3 che~   3 in~    3 te~    2 sh~    3 in~
   3 che~   2 che~   3 che~   4 che~   2 cth~   2 q~     3 r~     3 che~   2 q~     2 che~
   1 te~    2 sh~    2 te~    1 te~    2 q~     2 te~    2 she~   2 in~    2 che~   1 s~
   1 sh~    1 te~    1 s~     1 she~   2 s~     1 sh~    2 te~    1 q~     1 te~    1 she~
   1 she~   1 she~   1 ?~     1 sh~    1 che~   1 she~   1 sh~    1 ?~     1 s~     1 sh~
   1 cth~   1 cth~   1 cth~   0 s~     0 te~    1 s~     1 cth~   1 e?~    1 she~   0 ?~
   1 s~     1 s~     1 sh~    0 cth~   0 she~   1 cth~   1 s~     1 she~   0 cth~   0 e?~
   0 ir~    0 ir~    1 she~   0 ir~    0 cthe~  0 ir~    0 cthe~  1 cth~   0 e?~    0 cthe~
   0 cthe~  0 cthe~  1 cthe~  0 cthe~  0 ir~    0 cthe~  0 e?~    1 sh~    0 ?~     0 cth~
   0 ?~     0 e?~    0 ir~    0 e?~    0 ?~     0 e?~    0 ir~    0 ir~    0 ir~    0 ir~
   0 e?~    0 ?~     0 e?~    0 ?~     0 e?~    0 ?~     0 h?~    0 cthe~  0 cthe~  0 q~
   0 n~     0 id~    0 i?~    0 i?~    0 n~     0 id~    0 ith~   0 i?~    0 id~    0 i?~
   0 n~     0 de~    0 il~    0 i?~    0 i?~    0 ct~    0 il~    0 il~    0 i?~    0 is~
   0 n~     0 ct~    0 n~     0 ?~     0 id~    0 id~    0 il~    0 n~     0 de~    0 id~
   0 x~     0 il~    0 de~    0 x~     0 id~    0 id~    0 de~    0 de~    0 n~     0 ct~
   0 x~     0 il~    0 de~    0 is~    0 is~    0 b~     0 i?~    0 x~     0 is~    0 is~
   0 h?~    0 h?~    0 c?~    0 ith~   0 b~     0 b~     0 c?~

There are 23 valid elements with frequency > 20 under the equivalence:

  o t te cth cthe ch che sh she d de id l r q s m n in ir im il

Valid elements with frequency below 20:

  ct is g b x

Created a file "EQV/plots/vald/keys.dic" with all the valid elements,
collapsed by the above equivalence.

IV. "ED"'S STORY

Rene observed that the EVA digraph "ed" is a marker for the A/B
language split.
He produced some plots where the horizontal axis is page number, with
subsections distinguished by colors.

Let's count the word frequencies per page:

  zcat ../037/vms-17-ok.soc.gz \
    | tr '/' '-' \
    | gawk \
        ' \
          (($2 ~ /[A]/) && ($6 \!~ /[-=., ]/)){ \
            gsub(/[.].*$/,"",$1); print $9, substr($10,2), $1, $6; \
          } \
        ' \
    | sort | uniq -c | expand \
    | sort -b +1 -2 +2 -3 +0 -1nr \
    > .all.fpw

  cat .all.fpw \
    | list-page-champs -v maxChamps=4 \
    > .all.chpw

Let's count the total word occurrences per page:

  cat .all.fpw \
    | gawk \
        ' /./{ k = ($2 " " $3 " " $4); ct[k] += $1; } \
          END { for(w in ct) { print ct[w], w; } } \
        ' \
    | sort -b +2 -3 +0 -1nr \
    > .all.tpw

Let's now count the "ed"-containing words per page:

  cat .all.fpw \
    | gawk \
        ' ($5 ~ /ed/){ print; } ' \
    > .ed.fpw

  cat .ed.fpw \
    | list-page-champs -v maxChamps=6 \
    > .ed.chpw

  cat .ed.fpw \
    | gawk '//{ print $1,$2,$3,$4; }' \
    | combine-counts \
    | sort -b +2 -3 \
    > .ed.tpw

Let's plot the ratio of "ed"-words to total words per page:

  plot-freqs .ed.tpw .all.tpw

The plots of the "ed"-ratio R show that "hea" and "pha" are virtually
"ed"-free (R < 0.03, below the error level); "cos-1" (the part before
the "zod") and "zod" begin with slightly higher ratios than
"hea"/"pha" (R ~ 0.04); R then increases sharply along "zod", from
R ~ 0.03 to R ~ 0.11, and jumps to R ~ 0.20 in "cos-2" (the part
after "zod").  "heb-2" (after "zod") has R ~ 0.17, just below that of
"cos-2".  "heb-1" (before "zod") has widely variable R, with mean
R ~ 0.20.  "str" has R ~ 0.20 like "heb", but more uniform (except
for the two pages before "zod", which have R ~ 0.02).  "bio" has
R ~ 0.20 in the middle, and R ~ 0.32 at both ends.

So, based only on these plots, the writing sequence would be

  hea + pha (no obvious order)
  cos-1 + zod
  heb-2
  str + heb-1 + cos-2
  bio

V. ABOUT "ED" AND LADY "DY"

It seems that most of the "ed" words in language B are actually words
that end with "dy".  In fact there seems to be a very small number of
words involved.
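The R values quoted in these sections are ratios of two ".tpw" tables, each holding one "count key1 key2 key3" line per page. As a rough, hypothetical stand-in for just the arithmetic done by plot-freqs (the real script also draws the plot; the `ratio` helper and its field naming are my assumptions):

```shell
#!/bin/sh
# Hypothetical helper mirroring the ratio part of plot-freqs.
# Both input files hold "count sec subsec page" lines; for each page,
# print R = count(subset)/count(all).
ratio() {
  awk 'NR == FNR { tot[$2" "$3" "$4] = $1; next }
       tot[$2" "$3" "$4] > 0 {
         k = $2" "$3" "$4
         printf "%s %6.4f\n", k, $1/tot[k]
       }' "$1" "$2"
}
# usage sketch:  ratio .all.tpw .ed.tpw
```

So a page with 100 words, 20 of them containing "ed", gets R = 0.2000.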
Let's plot the per-page frequencies of the "dy" ending:

  cat .all.fpw \
    | gawk ' ($5 ~ /dy$/){ print $1,$2,$3,$4; } ' \
    | combine-counts \
    | sort -b +2 -3 \
    > .dy.tpw

  plot-freqs .dy.tpw .all.tpw

This plot shows the same trends as the "ed" frequency, except that
the data for language A is noisier, and the distinction between
languages A and B is less marked (because the counts for language A
are no longer zero).  Here "cos-1" and "zod" are practically equal.

Curiously, pharma has a slightly higher R than herbal-A, and R
actually decreases as we go down herbal-A.  This decrease is strange,
since the trend in the zodiac pages establishes that "dy" increases
from older to newer, hence language A should be earlier than
language B.

Let's try again with the "edy" ending proper:

  cat .all.fpw \
    | gawk ' ($5 ~ /edy$/){ print $1,$2,$3,$4; } ' \
    | combine-counts \
    | sort -b +2 -3 \
    > .edy.tpw

  plot-freqs .edy.tpw .all.tpw

These plots are like the "dy" plots, but cleaner.  The "bio"
subsection has R ~ 0.25, with a dip in the middle.  Subsections
"str-2" and "heb-1" have almost the same R ~ 0.15.  Subsections
"cos-2" and "heb-2" have R ~ 0.10.  Subsections "cos-1" and "zod"
have R ~ 0.03 (barely significant), and the trend in "zod" is not so
clear.  Finally, subsections "hea-1", "hea-2", "pha", and "str-1"
have hardly any "edy".

VI. THE "EDY" WORDS

Let's compute the overall frequency of each word per subsection,
removing the "q" prefix and mapping [ktpf] to "k":

  cat .all.fpw \
    | gawk \
        ' /./{ \
            gsub(/^q/,"",$5); gsub(/[ktpf]/,"k",$5); \
            print $1,$2,"000","000",$5 \
          } \
        ' \
    | combine-counts \
    | sort -b +1 -2 +0 -1nr \
    > .all.ftw

  cat .all.ftw \
    | gawk '/./{print $1,$2,$3,$4 } ' \
    | combine-counts \
    | sort -b +2 -3 \
    > .all.ttw

Now let's look at the "edy" words specifically:

  cat .all.ftw \
    | gawk '($5 ~ /edy$/){ print; }' \
    > .edy.ftw

  cat .edy.ftw \
    | list-page-champs -v maxChamps=6 \
    > .edy.chtw

Here are the six most common "edy" words in each subsection, manually
sorted:

  sec  totwd  champions
  ---  -----  ----------------------------------------------------------------------
  str  10783  okedy(180) okeedy(271) chedy(193) shedy(119) lchedy(56) okchedy(131)
  bio   6716  okedy(310) okeedy(252) chedy(218) shedy(252) lchedy(59) okchedy(44)
  heb   3337  okedy(101) okeedy(31) chedy(67) shedy(36) kedy(26) ykedy(25)
  cos   2590  okeedy(11) chedy(12) okedy(23) shedy(11) okchedy(10) kchedy(5)
  zod    997  okeedy(5) chedy(4) okedy(3) shedy(5) okshedy(2) eeedy(2)
  hea   7553  chedy(1) ykchedy(1) okeedy(2) esedy(1)
  pha   2401  chedy(1) ockhedy(1) cheedy(2) cholkeedy(1) ckhedy(1) okedy(1)
  unk   1847  okedy(21) chedy(19) okchedy(14) shedy(14) okeedy(7) olkeedy(7)
  ---  -----  ----------------------------------------------------------------------

As can be seen, the "edy" words (there are many of them!) are
characteristic of language B ("bio", "heb", "str"), and also a bit of
"cos" and "zod".  The frequency of "okedy" (and its k/t/q variants)
is

  1:22   in "bio"
  1:33   in "heb"
  1:55   in "str"
  1:220  in "cos"
  1:200  in "zod"

and practically nil in "hea" and "pha".
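The word normalization used for these counts can be checked in isolation. The following sh function reproduces the two gsub calls of the gawk step with plain sed (the wrapper is mine; only the substitutions come from the pipeline above):

```shell
#!/bin/sh
# Reproduce the word normalization used for the .ftw counts: strip a
# leading "q" and fold the four gallows letters [ktpf] into "k", so
# that "qokedy", "otedy", "okedy", etc. all land in the same bucket.
normalize() {
  sed -e 's/^q//' -e 's/[ktpf]/k/g'
}
```

For example, `printf 'qokedy\notedy\nokedy\n' | normalize` prints "okedy" three times, which is why the table counts "okedy" together with its k/t/q variants.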
Let's look at the words that DON'T end in "edy":

  cat .all.ftw \
    | gawk ' ($5 \!~ /edy$/){ print; } ' \
    > .not-edy.ftw

  cat .not-edy.ftw \
    | list-page-champs -v maxChamps=6 \
    > .not-edy.chtw

These are the six most common non-"edy" words in each subsection,
also manually sorted:

  sec  totwd  champions
  ---  -----  ----------------------------------------------------------------------
  str  10783  okaiin(350) okal(198) okeey(341) aiin(199) okain(173) okar(184)
  bio   6716  okaiin(145) okal(185) okeey(128) okain(240) ol(363) oky(124)
  heb   3337  okaiin(67) okal(56) aiin(68) okar(92) daiin(79) or(68)
  cos   2590  aiin(44) ar(57) okeey(45) or(43) dar(40) daiin(39)
  zod    997  aiin(29) ar(28) okeey(24) al(30) okaiin(21) okal(17)
  hea   7553  daiin(393) chol(215) chor(144) okchy(142) ckhy(138) oky(131)
  pha   2401  daiin(105) chol(47) okeol(62) okol(52) okeey(51) ol(41)
  unk   1847  okar(58) daiin(42) okaiin(40) okal(32) aiin(31) or(31)
  ---  -----  ----------------------------------------------------------------------

Note that "daiin" is the most common word in herbal-A and pharma, but
it shows up also in the other subsections, at 1/2 to 1/4 the
frequency:

  "hea"  1:18
  "pha"  1:24
  "heb"  1:40
  "str"  1:75
  "bio"  1:80
  "cos"  1:60
  "zod"  1:80

So perhaps "daiin" is a function word that got less and less used as
the author's vocabulary expanded.

The most popular non-"edy" words in language B are

  okaiin okal okeey aiin okain okar

They are fairly uniform across subsections, except perhaps "okar",
which is more concentrated in herbal-B.  It is hard to draw any
conclusion from these lists (other than `it strongly suggests
Chinese' 8-).

Let's try the words "chedy"/"shedy":

  cat .all.fpw \
    | gawk \
        ' ($5 ~ /^[cs]hedy$/){ \
            print $1,$2,$3,$4 \
          } \
        ' \
    | combine-counts \
    | sort -b +2 -3 \
    > .chedy.tpw

  plot-freqs .chedy.tpw .all.tpw

Predictably, the R values are smaller overall, and only "str-2",
"heb-1", and "bio" are significantly greater than 0.  The "bio" pages
still show the dip in the middle.
Let's try "daiin"/"dain", which should show the reverse trend:

  cat .all.fpw \
    | gawk \
        ' ($5 ~ /^d[ao]i+n$/){ \
            print $1,$2,$3,$4 \
          } \
        ' \
    | combine-counts \
    | sort -b +2 -3 \
    > .dain.tpw

  plot-freqs .dain.tpw .all.tpw

Predictably again, these pages show the opposite trends.  In "hea", R
is large and decreasing ("hea-1" has R ~ 0.07, "hea-2" has R ~ 0.04).
Next is "pha" at R ~ 0.04, then "heb-2" and "heb-1" at R ~ 0.03, then
"str", "cos", "zod", and "bio", all at R ~ 0.02.

The "unk" pages f1r and f49v have R ~ 0.08, which is right in the
middle of the herbal-A range.  The others have lower R, which is
consistent with language-B material.

Let's compare the frequencies of "Ke" elements relative to total
non-[aoy] (mostly "K" and "Ke") elements:

  cat .all.fpw \
    | ../017/factor-field-OK \
        -v inField=5 \
        -v outField=6 \
    | gawk \
        ' /./{ \
            f = $6; \
            gsub(/^[^{}]*/,"",f); gsub(/[^{}]*$/,"",f); \
            gsub(/}[^{}]*{/,"} {",f); n = split(f, ff); \
            for(i=1;i<=n;i++){ print $1,$2,$3,$4,ff[i]; } \
          } \
        ' \
    | grep -v '{_}' \
    | combine-counts \
    | sort -b +1 -2 +2 -3 +0 -1nr \
    > .all.fpe

  dicio-wc .all.fpw .all.fpe

      lines   words     bytes  file
    ------- ------- --------- ------------
      24921  124605    688935  .all.fpw
       5632   28160    147899  .all.fpe

And let's compute the total non-[aoy] elements per page:

  cat .all.fpe \
    | gawk \
        ' /./{ k = ($2 " " $3 " " $4); ct[k] += $1; } \
          END { for(w in ct) { print ct[w], w; } } \
        ' \
    | sort -b +2 -3 +0 -1nr \
    > .all.tpe

  dicio-wc .all.tpw .all.tpe

      lines   words     bytes  file
    ------- ------- --------- ------------
        227     908      3798  .all.tpw
        227     908      3924  .all.tpe

Let's now count the "Ke" elements only:

  cat .all.fpe \
    | gawk \
        ' ($5 ~ /{([ice][ktpf]?[he]|[ktpf])e}/){ print; } ' \
    > .Ke.fpe

  cat .Ke.fpe \
    | gawk '//{print $1,$2,$3,$4; }' \
    | combine-counts \
    | sort -b +2 -3 \
    > .Ke.tpe

  plot-freqs .Ke.tpe .all.tpe

Strangely, the plots show little change from language A to
language B, less than the variation within the same subsection.
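The element extraction above can be seen on a single word. The gsubs come straight from the pipeline; the input "o{k}e{ch}y" is a made-up example of the factored format, where braces mark the non-[aoy] elements:

```shell
#!/bin/sh
# Reproduce the element-extraction gsubs of the .fpe pipeline on one
# factored word (hypothetical example of the factored format).
extract() {
  awk '{ f = $0
         gsub(/^[^{}]*/,"",f); gsub(/[^{}]*$/,"",f)   # drop text outside braces
         gsub(/}[^{}]*{/,"} {",f)                     # drop [aoy] between elements
         n = split(f, ff)
         for (i = 1; i <= n; i++) print ff[i] }'
}
```

For example, `printf 'o{k}e{ch}y\n' | extract` prints "{k}" and "{ch}", i.e. the two non-[aoy] elements of the word.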
The ratio for "hea-1" is lowest (R ~ 0.03) and is minimum around page
p025.  Curiously, it seems to oscillate with a period of 1-2 pages.
The ratios for all other subsections are about the same, around 0.10.
The "zod" pages show again a sharp increasing trend, except for the
first couple of pages.

Observations:

  If languages A and B are indeed different languages, it is hard to
  explain why some letter-group statistics are so uniform, and why
  some are so variable.

  If languages A and B are different spellings of the same language,
  then the spelling change must not have affected the use of "Ke"
  elements relative to the "K" elements.

  If the difference between languages A and B is merely due to
  vocabulary (including tense/person/etc.), then the difference again
  must not favor "Ke" words over "K" words.

Let's try the gallows elements only:

  cat .all.fpe \
    | gawk \
        ' ($5 ~ /{.*[ktpf].*}/){ print; } ' \
    > .ktpf.fpe

  cat .ktpf.fpe \
    | gawk '//{print $1,$2,$3,$4; }' \
    | combine-counts \
    | sort -b +2 -3 \
    > .ktpf.tpe

  plot-freqs .ktpf.tpe .all.tpe

These plots are even more uniform than the previous ones.  The ratio
of gallows elements to non-gallows elements is amazingly constant
(R ~ 0.22) for all subsections and languages.  One cannot even see
the "zod" trend.

Let's look at the "skeletons" of the words, obtained by deleting the
[aoy] inserts and the [i] and [e] modifiers:

  cat .all.fpw \
    | ../017/factor-field-OK \
        -v inField=5 \
        -v outField=6 \
    | gawk \
        ' /./{ \
            f = $6; \
            gsub(/^[^{}]*/,"",f); gsub(/[^{}]*$/,"",f); \
            if (match(f,/{([ice][ktpf]?[eh]|[ktpf])e}/)) \
              { gsub(/e}/,"}",f); } \
            gsub(/{i+/,"{",f); gsub(/{_}/,"",f); \
            gsub(/}[^{}]*{/,"",f); \
            print $1,$2,$3,$4,f; \
          } \
        ' \
    | combine-counts \
    | sort -b +1 -2 +2 -3 +0 -1nr \
    > .all.fps

And let's compute the total non-[aoy] elements per page:

  cat .all.fps \
    | gawk \
        ' /./{ k = ($2 " " $3 " " $4); ct[k] += $1; } \
          END { for(w in ct) { print ct[w], w; } } \
        ' \
    | sort -b +2 -3 +0 -1nr \
    > .all.tps

  dicio-wc .all.tpw .all.tps
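The skeleton computation can likewise be traced on one word. The gsubs are taken verbatim from the pipeline above; "o{k}e{che}y" is again a made-up example of the factored format (an [aoy]-framed word with a "che" element carrying an "e" modifier):

```shell
#!/bin/sh
# Reproduce the skeleton gsubs of the .fps pipeline on one factored
# word: strip the [aoy] inserts, drop the "e" modifier, and fuse the
# remaining element cores into a single braced skeleton.
skeleton() {
  awk '{ f = $0
         gsub(/^[^{}]*/,"",f); gsub(/[^{}]*$/,"",f)
         if (match(f, /{([ice][ktpf]?[eh]|[ktpf])e}/)) gsub(/e}/,"}",f)
         gsub(/{i+/,"{",f); gsub(/{_}/,"",f)
         gsub(/}[^{}]*{/,"",f)
         print f }'
}
```

For example, `printf 'o{k}e{che}y\n' | skeleton` prints "{kch}": the "o", "e", "y" inserts and the final "e" of "che" are gone, and the cores "k" and "ch" are fused into one skeleton.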