# Hacking at the Voynich manuscript - Side notes
# 093 Parsing words into elements and counting them

Last edited on 2026-01-17 23:35:56 by stolfi

This note is a remake of note 622. Its purpose is to parse the
transcription into "elements" according to my new word paradigm, and to
compute statistics thereof.

The main elements comprise the single glyphs @[aoyqdlrs]; the benches
@Ch, @Sh, @Ih, the "topless bench" @'ee', and the platform gallows
(@CTh, @CPHh, @IKh, etc.), both optionally modified by a single @e; and
the codas @n, @m, @in, @im, @ir, ... @iiin, @iiim, @iiir. Other rare
combinations will be flagged and handled appropriately later.

SETUP

  ln -s ../.. work

  ln -s work/compute_freqs.gawk
  ln -s work/combine_counts.gawk
  ln -s work/error_funcs.gawk
  ln -s work/validate_25e1_ivt_format.gawk
  ln -s work/error_funcs.py
  ln -s work/process_funcs.py
  ln -s work/factor_field_general.gawk
  ln -s work/factor_text_25e1_eva_to_elems.gawk

SOURCE TRANSCRIPTION

The source transcription will be my own 2025 transcription (code ";U")
completed with Rene's IVT (code ";Z"), as prepared in Note 074 and split
by section and type as per Note 092:

  ln -s ../092/st_words

The "*.eva" files in this folder should have all weirdos &NNN mapped to
"?". All comments "" and alignment markers "<%>" "<$>" [«=»] should have
been removed. The files "*.wff" should have fractional counts that take
into account the dubious spaces ','.

Also, all ligature braces '{...}' should have been removed, and all EVA
characters should have been lowercased. This loses information about
weird ligatures like {Cto} or {Qy}. We will fix this detail later.

We consider only "parags" type text.

SAVING

  now="$( yyyy-mm-dd-hhmmss )"
  echo now=${now} 1>&2
  # now=2026-01-15-110038

  mkdir -p SAVE/${now}
  cp -av \
    Note-093.txt \
    do_093*.sh \
    elem_parse_funcs.gawk \
    parse_ivt_file_into_elements.gawk \
    parse_wff_file_into_elements.gawk \
    parse_ivt_files_into_elements.sh \
    parse_weff_file_into_okoko_pats.gawk \
    parse_oko_file_into_cmc_pats.gawk \
    SAVE/${now}

LEVEL 0 - EVA CHARS

Let's first count how many words are composed of the standard EVA chars,
as opposed to rare chars (@b @g @j @u @v @x), weirdos @&NNN, and
unreadable glyphs @?.

  do_093_char_stats.sh

           all           gud           bad  % gud  sec-type
  ------------  ------------  ------------  -----  ----------
   6210.000000   6172.000000     38.000000  99.39  bio-parags
   1008.500000    984.500000     24.000000  97.62  cos-parags
   7749.000000   7569.000000    180.000000  97.68  hea-parags
   3364.500000   3310.500000     54.000000  98.40  heb-parags
   2291.500000   2194.500000     97.000000  95.77  pha-parags
  10714.000000  10581.750000    132.250000  98.77  str-parags
   3001.500000   2944.500000     57.000000  98.10  unk-parags
  34339.000000  33756.750000    582.250000  98.30  tot-parags
  34339.000000  33616.750000    722.250000  97.90  tot-parags

So at least 97% of the words in the main sections ("str", "bio", "hea",
"heb") use only the "main" chars. The vast majority of the "bad" words
have "?". The largest counts of words that have no "?" or weirdos and
are rejected only because of rare chars are 51 in "hea-parags" and 40.25
in "str-parags".
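Just to document the test concretely: a minimal gawk sketch of the
char-level check (this is NOT do_093_char_stats.sh, which also handles
the fractional word counts; the file name "words.txt" is a placeholder,
and the "main" char set below is simply the 18 letters that occur in the
element set):

  # char_sketch.gawk -- classify one word per line as "gud" or "bad".
  # "gud" = only standard EVA chars; "bad" = anything else (rare chars,
  # weirdos already mapped to "?", or unreadable "?").
  /^[acdefhiklmnopqrsty]+$/ { gud++; next }
                            { bad++ }
  END {
    if (gud + bad > 0) {
      printf "gud = %d  bad = %d  (%.2f%% gud)\n", gud, bad, 100*gud/(gud+bad);
    }
  }

It would be run as, say, "gawk -f char_sketch.gawk words.txt"; the real
script works per section and type and weights the dubious-space tokens.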
From the words with valid chars only, per section and type, we extracted
the lexicons, consisting of the words with a certain minimum number of
occurrences that depends on the size of the subset:

LEXICON SIZES AMONG WORDS WITH VALID CHARS

  bio-parags lexicon size =  300  last = 3.000000 yteey
  cos-parags lexicon size =   57  last = 3.000000 shodaiin
  hea-parags lexicon size =  415  last = 3.000000 ytoldy
  heb-parags lexicon size =  208  last = 3.000000 ypchdy
  pha-parags lexicon size =  152  last = 3.000000 ykeor
  str-parags lexicon size =  554  last = 3.000000 ykedy
  unk-parags lexicon size =  206  last = 3.000000 ytody
  tot-parags lexicon size = 1376  last = 3.000000 ytodaiin

Note that the "tot-parags" lexicon is larger than the union of the other
lexicons, because it includes some words that occur sufficiently often
in two or more sections combined, but not often enough in any single one
of them.

LEVEL 1 - ELEMENTS

We now parse the words into the elements of the word paradigm. Valid
elements are surrounded by "{}". Glyphs that cannot be parsed as valid
elements (including "?" and rare chars) are surrounded by "[]". We
exclude words that have the rare chars above or the invalid glyph '?'.

The element set is

  {q} {o} {a} {y} {d} {r} {l} {s}
  {ch} {che} {sh} {she} {ee} {eee}
  {k} {ke} {t} {te} {p} {pe} {f} {fe}
  {ckh} {ckhe} {cth} {cthe} {cph} {cphe} {cfh} {cfhe}
  {n} {in} {iin} {iiin} {m} {im} {iim} {ir} {iir}

For a justification of those elements, see the file "Note-093-extra.txt".

  do_093_elem_stats.sh

           all           gud           bad  % gud  sec-type
  ------------  ------------  ------------  -----  ----------
   6172.000000   6123.500000     48.500000  99.21  bio-parags
    984.500000    949.500000     35.000000  96.44  cos-parags
   7569.000000   7434.500000    134.500000  98.22  hea-parags
   3310.500000   3245.500000     65.000000  98.04  heb-parags
   2194.500000   2137.500000     57.000000  97.40  pha-parags
  10581.750000  10416.750000    165.000000  98.44  str-parags
   2944.500000   2907.500000     37.000000  98.74  unk-parags
  33756.750000  33214.750000    542.000000  98.39  tot-parags

So at least 98% of all words in the main sections that have only the
"valid" chars can be parsed into valid elements of the model. Here are
the element frequencies:

  12564.750000 0.09922 {a}
     19.000000 0.00015 {cfhe}
     49.000000 0.00039 {cfh}
   3919.250000 0.03095 {che}
   5980.500000 0.04722 {ch}
    212.500000 0.00168 {ckhe}
    634.000000 0.00501 {ckh}
     57.000000 0.00045 {cphe}
    129.500000 0.00102 {cph}
    169.000000 0.00133 {cthe}
    709.500000 0.00560 {cth}
  11548.250000 0.09119 {d}
    322.000000 0.00254 {eee}
   3864.750000 0.03052 {ee}
    324.000000 0.00256 {f}
    158.000000 0.00125 {iiin}
     16.000000 0.00013 {iim}
   3780.000000 0.02985 {iin}
    131.500000 0.00104 {iir}
     41.000000 0.00032 {im}
   1674.000000 0.01322 {in}
    490.250000 0.00387 {ir}
   1526.500000 0.01205 {ke}
   7320.250000 0.05780 {k}
   9278.500000 0.07327 {l}
    875.000000 0.00691 {m}
    115.500000 0.00091 {n}
  21616.500000 0.17069 {o}
   1204.750000 0.00951 {p}
   5186.000000 0.04095 {q}
   5820.250000 0.04596 {r}
   1962.875000 0.01550 {she}
   2251.000000 0.01777 {sh}
   2093.000000 0.01653 {s}
    787.500000 0.00622 {te}
   4215.250000 0.03328 {t}
  15594.875000 0.12314 {y}
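For concreteness, here is a rough sketch of this factorization as a
small standalone gawk program (this is NOT the actual
elem_parse_funcs.gawk / parse_*_into_elements.gawk scripts; it just
picks, for each input word, a factorization into the elements above that
leaves the fewest bracketed glyphs, breaking ties arbitrarily; it would
be run as, say, "gawk -f elem_sketch.gawk words.txt", both names being
placeholders):

  # elem_sketch.gawk -- factor each word (one per line) into elements.
  BEGIN {
    nel = split("q o a y d r l s k t p f n m " \
                "ch che sh she ee eee ke te pe fe " \
                "cth cthe ckh ckhe cph cphe cfh cfhe " \
                "in iin iiin im iim ir iir", el, " ");
  }
  {
    w = $1; n = length(w);
    # best[i] = minimum number of unparsable glyphs in the first i chars
    best[0] = 0;
    for (i = 1; i <= n; i++) {
      # default: treat glyph i as unparsable, wrapped in "[]"
      best[i] = best[i-1] + 1; how[i] = "[" substr(w,i,1) "]"; back[i] = i-1;
      # or end a valid element at position i, if one matches there
      for (j = 1; j <= nel; j++) {
        e = el[j]; len = length(e);
        if (len <= i && substr(w, i-len+1, len) == e && best[i-len] <= best[i]) {
          best[i] = best[i-len]; how[i] = "{" e "}"; back[i] = i-len;
        }
      }
    }
    # rebuild the factorization from the back pointers
    out = "";
    for (i = n; i > 0; i = back[i]) { out = how[i] out; }
    print $1, out, (best[n] == 0 ? "ELEM-OK" : "ELEM-BAD");
  }

For example, "okeedy" comes out as "{o}{k}{ee}{d}{y} ELEM-OK", while
"cthhy" comes out as "{cth}[h]{y} ELEM-BAD" (compare the "cth!hy"
entries in the list of rejects below).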
Most of the lexemes that could not be parsed into elements occur only
once or twice in all the parags text. Here are those that occur more
than twice (with '!' showing where the parsing into elements failed):

  9.000000 0.016610 chckh!hy
  9.000000 0.016610 cth!hy
  7.000000 0.012920 da!i!idy
  5.000000 0.009230 chcth!hy
  5.000000 0.009230 ckh!hy
  4.500000 0.008300 a!il
  4.000000 0.007380 !ety
  4.000000 0.007380 qo!edy
  4.000000 0.007380 qo!eol
  3.000000 0.005540 chs!ey
  3.000000 0.005540 !cty
  3.000000 0.005540 q!ekchdy
  3.000000 0.005540 qo!edaiin
  3.000000 0.005540 shcph!hy
  3.000000 0.005540 shcth!hy
  2.500000 0.004610 a!is
  2.500000 0.004610 o!edy

As discussed in "Note-093-extra.txt", it may be worth "fixing" the
transcription by mapping all @'hh' to @'he', instead of rejecting those
words in the ELEM model.

A relatively common pattern in the rejected words is an initial @'qoe'
(78 tokens) or @'qe' (66 tokens). Maybe we should include those two
strings as elements. There are 63 tokens that start with @'oe' and 63
that start with @e; but maybe those are parts of words after dubious or
wrong spaces.

There are 90 tokens where the @c (@e with lig) is used in combinations
other than @'ch' or the platform gallows, like @'cty' (14 tokens),
@'cky' (12 tokens), @'cke' (6 tokens), etc. There are a few more like
those but with @i instead of @c. Those may be mistakes, like platform
gallows with a missed, miswritten, or misread @h.

There are a couple hundred tokens with one or more @'i' not followed by
[nmr]. The most common letters after those rejected @i runs are gallows
that are not followed by @h or @hh.

LEVEL 2 - OKOKO MODEL

We now consider the words that consist only of valid elements. We map
the elements to "O" = { @a, @o, @y } and "K" = all the others, and parse
the resulting string as a sequence of zero or more "K" with at most one
"O" after each "K" and an optional "O" prefix, with at most three "O"s
in total:

  do_093_okoko_stats.sh

           all           gud           bad  % gud  sec-type
  ------------  ------------  ------------  -----  ----------
   6123.500000   6104.750000     18.750000  99.69  bio-parags
    949.500000    925.500000     24.000000  97.47  cos-parags
   7434.500000   7265.125000    169.375000  97.72  hea-parags
   3245.500000   3206.875000     38.625000  98.81  heb-parags
   2137.500000   2070.000000     67.500000  96.84  pha-parags
  10416.750000  10242.156250    174.593750  98.32  str-parags
   2907.500000   2851.125000     56.375000  98.06  unk-parags
  33214.750000  32665.531250    549.218750  98.35  tot-parags

Thus, in the main sections, at least 97.7% of all words that consist of
valid elements also fit the OKOKO model. Many of the bad words are
rejected because of two or more "O" elements in a row.
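As a cross-check, the OKOKO test itself is tiny; here is a sketch in the
same style as above (again my own illustration, not do_093_okoko_stats.sh
or parse_weff_file_into_okoko_pats.gawk; it expects the braced
factorization of each word in field 2, as produced by the earlier
sketch):

  # okoko_sketch.gawk -- test factored words against the OKOKO model.
  function okoko_ok(fact,   n, i, parts, cls, s, ocount) {
    gsub(/^\{|\}$/, "", fact); n = split(fact, parts, /\}\{/);
    s = ""; ocount = 0;
    for (i = 1; i <= n; i++) {
      # "O" = a, o, y; "K" = every other element
      cls = (parts[i] ~ /^[aoy]$/) ? "O" : "K";
      if (cls == "O") { ocount++; }
      s = s cls;
    }
    # optional "O" prefix, then each "K" followed by at most one "O",
    # with at most three "O" in total
    return (s ~ /^O?(KO?)*$/) && (ocount <= 3);
  }
  { print $0, (okoko_ok($2) ? "OKOKO-OK" : "OKOKO-BAD"); }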
If we allow up to two consecutive "O" (but still no more than three
overall), the acceptance becomes almost total:

           all           gud           bad  % gud  sec-type
  ------------  ------------  ------------  -----  ----------
   6123.500000   6116.750000      6.750000  99.89  bio-parags
    949.500000    944.500000      5.000000  99.47  cos-parags
   7434.500000   7403.250000     31.250000  99.58  hea-parags
   3245.500000   3237.125000      8.375000  99.74  heb-parags
   2137.500000   2121.750000     15.750000  99.26  pha-parags
  10416.750000  10379.156250     37.593750  99.64  str-parags
   2907.500000   2886.375000     21.125000  99.27  unk-parags
  33214.750000  33088.906250    125.843750  99.62  tot-parags

Most of those 125 tokens (248 lexemes) with valid elements that still
fail the OKOKO model now look like two or more words stuck together:

  1.500000 0.011920 OKOKOK!O:okalod!y
  1.000000 0.007950 OKOKO!O:araro!y
  1.000000 0.007950 KOKOKOK!OK:chalykor!ain
  1.000000 0.007950 KOKOOK!O:cthodoal!y
  1.000000 0.007950 KOKOKKOK!OK:dolarshyd!or
  1.000000 0.007950 KOKOKOK!OK:folorar!om
  1.000000 0.007950 KOKOKKOK!OKO:kodamchocth!ody
  1.000000 0.007950 OOKKOK!O:oaldar!y
  1.000000 0.007950 OO!OKOK:oa!orar
  1.000000 0.007950 OKOKOK!O:octhodal!y
  1.000000 0.007950 OKOKOK!O:odalal!y
  1.000000 0.007950 OKOKOK!O:okairod!y
  1.000000 0.007950 OKOKKOKK!O:okalchold!y
  1.000000 0.007950 OKOKKOK!O:okoldal!y

LEVEL 2 - CORE-MANTLE-CRUST MODEL

In this model we ignore all "O" elements and map the others to the
specific classes

  "Q" = { @q },
  "D" = { @d, @l, @r, @s } (the /dealers/),
  "X" = { @ch, @sh, @ee } with an optional @e suffix (the /benches/),
  "G" = { @k, @t, @p, @f } with an optional @e suffix (the /simple gallows/),
  "H" = { @ckh, @cth, @cph, @cfh } with an optional @e or @h suffix
        (the /platform gallows/),
  "N" = { @n, @m } after zero or more @i, or @r after one or more @i
        (the /codas/).

In my original nomenclature, the "D" elements are called the /crust/,
the "X" elements are the /mantle/, and the "G" and "H" elements are the
/core/. The "Q" and "N" elements then could be the /seas/.

The CMC model says that a word is valid if it fits the pattern

  Q^q D^d X^x G^g H^h X^y D^e N^n

where q,n may be 0 or 1; g+h may be 0 or 1; q+d+e+n may be at most 3;
and x+h+y may be at most 2. That is, there can be at most one gallows
(G or H), at most three of Q, D, and N, and at most two benches,
counting a platform gallows as one implicit bench. With these rules we
get:

  do_093_cmc_stats.sh

           all           gud           bad  % gud  sec-type
  ------------  ------------  ------------  -----  ----------
   6116.750000   5970.187500    146.562500  97.60  bio-parags
    944.500000    913.750000     30.750000  96.74  cos-parags
   7403.250000   7156.375000    246.875000  96.67  hea-parags
   3237.125000   3106.875000    130.250000  95.98  heb-parags
   2121.750000   2052.625000     69.125000  96.74  pha-parags
  10379.156250   9952.687500    426.468750  95.89  str-parags
   2886.375000   2771.125000    115.250000  96.01  unk-parags
  33088.906250  31923.625000   1165.281250  96.48  tot-parags

There are ~1165 tokens (1555 lexemes) that satisfy the OKOKO model but
fail the CMC model.
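The CMC test can be sketched the same way (my own illustration, not the
actual parse_oko_file_into_cmc_pats.gawk; same assumed input format as
the sketches above):

  # cmc_sketch.gawk -- test factored words against the CMC model.
  function cmc_class(e) {
    if (e ~ /^[aoy]$/)                 return "";   # "O" elements are ignored
    if (e == "q")                      return "Q";
    if (e ~ /^[dlrs]$/)                return "D";
    if (e ~ /^(ch|sh|ee)e?$/)          return "X";
    if (e ~ /^[ktpf]e?$/)              return "G";
    if (e ~ /^c[ktpf]he?$/)            return "H";
    if (e ~ /^i*[nm]$/ || e ~ /^i+r$/) return "N";
    return "?";  # should not happen for words made of valid elements
  }
  function cmc_ok(fact,   n, i, parts, s, t) {
    gsub(/^\{|\}$/, "", fact); n = split(fact, parts, /\}\{/);
    s = "";
    for (i = 1; i <= n; i++) { s = s cmc_class(parts[i]); }
    if (s !~ /^Q?D*X*[GH]?X*D*N?$/) return 0;       # wrong overall shape
    t = s; if (gsub(/[QDN]/, "", t) > 3) return 0;  # at most 3 of Q, D, N
    t = s; if (gsub(/[XH]/,  "", t) > 2) return 0;  # at most 2 benches (H counts as one)
    return 1;
  }
  { print $0, (cmc_ok($2) ? "CMC-OK" : "CMC-BAD"); }

For example, "chol" maps to the class string "XD" and passes, while
"cholkaiin" maps to "XDGN" and fails the shape test, as in the list of
rejects below.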
Here are the OKOKO-but-not-CMC words that occur more than twice in all
the parags text:

  7.000000 0.006010 GD!XD:pol!chedy
  5.500000 0.004720 XD!G:chol!ky
  4.500000 0.003860 XD!GN:chol!kaiin
  4.000000 0.003430 XXG!X:cheet!eey
  4.000000 0.003430 XD!X:chod!chy
  4.000000 0.003430 XD!GD:chol!kar
  4.000000 0.003430 XD!G!XD:chol!k!eedy
  4.000000 0.003430 DN!D:daira!l
  4.000000 0.003430 DN!N:dair!in
  3.000000 0.002570 N!D:aiina!l
  3.000000 0.002570 N!D:airo!dy
  3.000000 0.002570 N!D:airo!l
  3.000000 0.002570 XD!GN:cheol!kain
  3.000000 0.002570 XD!XD:chol!chedy
  3.000000 0.002570 DN!D:dairo!dy
  3.000000 0.002570 DDD!N:dalda!iin
  3.000000 0.002570 GX!H:pcho!cthy
  3.000000 0.002570 GD!X:pol!shy
  3.000000 0.002570 XXG!X:sheek!chy
  3.000000 0.002570 G!H:to!ckhy
  2.750000 0.002360 N!N:aira!m
  2.500000 0.002150 XD!X:chol!chey
  2.500000 0.002150 GX!GX:kcho!kchy
  2.500000 0.002150 GD!X:kor!chy
  2.500000 0.002150 GD!XD:okal!chedy
  2.500000 0.002150 GD!X:okal!chy
  2.500000 0.002150 GD!GN:opal!kaiin
  2.375000 0.002040 N!D:aira!l
  2.250000 0.001930 GD!XD:tol!chedy

Most of them seem to be two or more fairly common words run together.

To measure the CMC compliance among lexemes (as opposed to tokens), we
extracted all words with valid EVA chars and at least 3 occurrences, and
then verified how many of those passed through all levels and had a
valid CMC structure. The results are tabulated below.

LEXICON SIZES WITH VALID CMC STRUCTURE

  sec-type     vlex   vgud   vbad    %gud  least common
  ----------  -----  -----  -----  ------  --------------------
  bio-parags    300    300      0  100.00  3.0000 yteey
  cos-parags     57     57      0  100.00  3.0000 shodaiin
  hea-parags    415    409      6   98.55  3.0000 ytoldy
  heb-parags    208    206      2   99.04  3.0000 ypchdy
  pha-parags    152    151      1   99.34  3.0000 ykeor
  str-parags    554    548      6   98.92  3.0000 ykedy
  unk-parags    206    206      0  100.00  3.0000 ytody
  tot-parags   1376   1341     35   97.46  3.0000 ytodaiin

>>> REVISE <<<

II. TABULATING ELEMENT FREQUENCIES PER SUBSECTION

  set sectags = ( `cat text-subsecs/all.names` )
  echo $sectags

  foreach etag ( RAW EQV )
    tabulate-frequencies \
      -dir ${etag}/efreqs/subsecs \
      -title "elem" \
      tot ${sectags}
  end

Elements sorted by frequency (× 99), per subsection:

  tot      unk      pha      str      hea      heb      bio      ast      cos      zod
  -------  -------  -------  -------  -------  -------  -------  -------  -------  -------
  17 o 16 o 23 o 15 o 20 o 14 y 15 o 18 o 16 o 20 o 12 y 12 a 9 y 11 y 11 y 14 o 15 y 13 y 13 y 13 a 9 a 11 y 8 l 11 a 9 ch 11 d 10 d 9 a 11 a 8 y 8 d 8 d 6 a 8 d 7 a 10 a 8 l 7 d 9 l 8 l 7 l 7 l 6 d 6 l 7 d 6 k 7 a 5 ch 6 d 7 t 5 k 5 r 4 r 6 k 6 l 5 l 6 q 5 ee 4 ch 5 r 4 ch 5 k 4 k 4 q 5 r 4 ch 6 k 5 r 4 k 5 d 4 r 4 ch 4 ch 4 ee 4 k 4 r 3 ee 4 k 4 r 5 ee 4 q 3 t 3 q 4 r 3 t 3 iin 3 che 3 s 4 t 3 ch 3 ee 3 iin 3 ee 3 iin 3 iin 3 che 3 r 3 l 3 iin 3 k 3 iin 3 q 3 che 3 ch 2 sh 2 q 2 she 3 t 2 sh 2 te 3 t 2 che 3 iin 3 che 2 q 2 t 2 t 2 che 2 q 2 iin 3 che 2 sh 2 ke 3 t 2 s 2 ee 2 ch 2 iin 2 ee 1 s 1 sh 1 ee 1 s 1 she 1 che 1 ke 1 in 2 ke 2 che 1 che 1 she 1 she 1 ? 1 sh 1 cth 1 sh 1 ke 1 q 1 s 1 sh 1 ke 1 p 1 sh 1 ke 1 ee 1 she 1 iin 1 ? 1 she 1 she 1 s 1 s 1 she 1 in 0 she 1 s 1 sh 1 e? 0 ke 0 p 0 in 0 ke 1 t 0 p 0 p 0 te 1 s 1 she 0 e? 0 ? 0 p 0 te 0 ckh 0 s 0 ckh 0 p 0 te 1 te 0 p 0 in 0 te 0 ir 0 ckhe 0 te 0 in 0 ckh 0 ckh 1 sh 0 te 0 ke 0 cth 0 cth 0 te 0 ir 0 m 0 f 0 p 0 p 0 ? 0 eee 0 ckh 0 f 0 p 0 eee 0 ke 0 in 0 cth 0 ir 0 cth 0 e? 0 ir 0 in 0 e? 0 ckh 0 cph 0 ir 0 ckhe 0 cth 0 in 0 m 0 ? 0 m 0 iiir 0 cth 0 te 0 m 0 f 0 ckh 0 m 0 cth 0 eee 0 ckh 0 in 0 f 0 cthe 0 cth 0 eee 0 eee 0 ckh 0 cthe 0 f 0 cph 0 cth 0 e? 0 f 0 e? 0 e? 0 cthe 0 eee 0 iir 0 m 0 eee 0 cthe 0 ckhe 0 ? 0 cthe 0 ir 0 in 0 cthe 0 q 0 e? 0 e? 0 ir 0 ?
  0 ckhe 0 ? 0 cthe 0 iir 0 ir 0 ckhe 0 ? 0 f 0 m 0 e? 0 ckhe 0 iiin 0 i? 0 iir 0 cthe 0 cfh 0 m 0 iir 0 eee 0 eee 0 h? 0 il 0 cfh 0 cph 0 cthe 0 eee 0 i? 0 ir 0 iir 0 cph 0 j 0 ckhe 0 iir 0 cphe 0 j 0 cthe 0 n 0 iiin 0 cphe 0 ckhe 0 ij 0 iiin 0 ckhe 0 cphe 0 iiin 0 cfh 0 cphe 0 ? 0 cph 0 n 0 iir 0 cph 0 cph 0 cphe 0 cph 0 ck 0 f 0 i? 0 im 0 iiin 0 il 0 i? 0 cfhe 0 ikh 0 im 0 cphe 0 n 0 iir 0 n 0 iiin 0 i? 0 il 0 m 0 cfh 0 i? 0 i? 0 cphe 0 iir 0 im 0 m 0 iiir 0 x 0 cfhe 0 de 0 ct 0 cfh 0 n 0 de 0 iil 0 de 0 x 0 de 0 n 0 cfh 0 il 0 il 0 is 0 is 0 im 0 x 0 de 0 im 0 n 0 im 0 cfhe 0 de 0 i? 0 cfhe 0 id 0 cfhe 0 b 0 id 0 is 0 x 0 pe 0 cfh 0 ck 0 ith 0 is 0 iil 0 id 0 c? 0 j 0 iid 0 iil 0 iir 0 h? 0 id 0 cf 0 b 0 ck 0 iiid 0 g 0 cp 0 ct 0 iiil 0 h? 0 ct 0 iil 0 iim 0 iiil 0 g 0 id 0 iis 0 iiir 0 iil

I have compared these counts with those obtained by removing two, one,
or zero elements from each line end. The conclusion is that the ordering
of the first six entries in each column is quite stable; it is probably
not an artifact.

Some quick observations: there seem to be three "extremal" samples: hea
("ch" abundant), bio ("q" important), and zod ("t" important).

There are too many "e?" elements; I must check where they come from and
perhaps modify the set of elements to account for them.

[ It seems that many came from groups of the form "e[ktpf]e",
"e[ktpf]ee", which could be "c[ktpf]h" and "c[ktpf]he" without ligatures
(see the small remapping illustration at the end of this subsection).
Most of the rest come from Friedman's transcription; there are
practically none in the more careful transcriptions. ]

All valid elements that occur at least 10 times in the text:

  o y a q n in iin iiin r ir iir iiir d s is l il m im j de
  k t ke te p f cth ckh cthe ckhe cph cfh cphe cfhe ch che sh she ee eee x

Valid elements that occur less than 10 times in the whole text:

  iil ij pe ct ck id

Created a file "RAW/plots/vald/keys.dic" with all the valid elements.
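About the "e?" elements mentioned above: if those groups really are
platform gallows that lost their ligature, the fix-up would be a simple
remapping. A throwaway gawk illustration (mine only, not applied
anywhere yet; "words.txt" is a placeholder):

  # fixup_sketch.gawk -- illustration only: remap "e[ktpf]ee" to
  # "c[ktpf]he" and "e[ktpf]e" to "c[ktpf]h" (longer pattern first).
  { w = $1;
    w = gensub(/e([ktpf])ee/, "c\\1he", "g", w);
    w = gensub(/e([ktpf])e/,  "c\\1h",  "g", w);
    print $1, w;
  }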
Equiv-reduced elements sorted by frequency (× 99), per subsection:

  tot      unk      pha      str      hea      heb      bio      ast      cos      zod
  -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
  38 o~ 40 o~ 40 o~ 38 o~ 39 o~ 40 o~ 37 o~ 40 o~ 41 o~ 42 o~ 10 t~ 10 t~ 8 l~ 11 t~ 11 ch~ 12 d~ 10 d~ 10 ch~ 9 t~ 11 t~ 8 d~ 8 d~ 7 ch~ 8 d~ 9 t~ 10 t~ 9 t~ 8 t~ 9 l~ 8 ch~ 8 ch~ 7 l~ 7 d~ 8 ch~ 7 d~ 6 ch~ 8 l~ 7 d~ 7 ch~ 8 l~ 7 l~ 6 ch~ 6 t~ 6 l~ 6 l~ 5 l~ 6 q~ 5 r~ 6 d~ 5 d~ 4 r~ 5 r~ 4 r~ 4 q~ 5 r~ 4 r~ 5 ch~ 3 s~ 4 r~ 5 r~ 4 q~ 3 in~ 3 q~ 4 in~ 4 in~ 3 in~ 3 che~ 3 l~ 3 in~ 3 te~ 4 in~ 3 q~ 3 in~ 4 r~ 2 sh~ 3 che~ 3 in~ 3 te~ 2 sh~ 3 in~ 3 che~ 2 che~ 3 che~ 4 che~ 2 cth~ 2 q~ 3 r~ 3 che~ 2 q~ 2 che~ 1 te~ 2 sh~ 2 te~ 1 te~ 2 q~ 2 te~ 2 she~ 2 in~ 2 che~ 1 s~ 1 sh~ 1 te~ 1 s~ 1 she~ 2 s~ 1 sh~ 2 te~ 1 q~ 1 te~ 1 she~ 1 she~ 1 she~ 1 ?~ 1 sh~ 1 che~ 1 she~ 1 sh~ 1 ?~ 1 s~ 1 sh~ 1 cth~ 1 cth~ 1 cth~ 0 s~ 0 te~ 1 s~ 1 cth~ 1 e?~ 1 she~ 0 ?~ 1 s~ 1 s~ 1 sh~ 0 cth~ 0 she~ 1 cth~ 1 s~ 1 she~ 0 cth~ 0 e?~ 0 ir~ 0 ir~ 1 she~ 0 ir~ 0 cthe~ 0 ir~ 0 cthe~ 1 cth~ 0 e?~ 0 cthe~ 0 cthe~ 0 cthe~ 1 cthe~ 0 cthe~ 0 ir~ 0 cthe~ 0 e?~ 1 sh~ 0 ?~ 0 cth~ 0 ?~ 0 e?~ 0 ir~ 0 e?~ 0 ?~ 0 e?~ 0 ir~ 0 ir~ 0 ir~ 0 ir~ 0 e?~ 0 ?~ 0 e?~ 0 ?~ 0 e?~ 0 ?~ 0 h?~ 0 cthe~ 0 cthe~ 0 q~ 0 n~ 0 id~ 0 i?~ 0 i?~ 0 n~ 0 id~ 0 ith~ 0 i?~ 0 id~ 0 i?~ 0 n~ 0 de~ 0 il~ 0 i?~ 0 i?~ 0 ct~ 0 il~ 0 il~ 0 i?~ 0 is~ 0 n~ 0 ct~ 0 n~ 0 ?~ 0 id~ 0 id~ 0 il~ 0 n~ 0 de~ 0 id~ 0 x~ 0 il~ 0 de~ 0 x~ 0 id~ 0 id~ 0 de~ 0 de~ 0 n~ 0 ct~ 0 x~ 0 il~ 0 de~ 0 is~ 0 is~ 0 b~ 0 i?~ 0 x~ 0 is~ 0 is~ 0 h?~ 0 h?~ 0 c?~ 0 ith~ 0 b~ 0 b~ 0 c?~

There are 23 valid elements with frequency > 20 under the equivalence:

  o t te cth cthe ch che sh she d de id l r q s m n in ir im il

Valid elements with frequency below 20:

  ct is g b x

Created a file "EQV/plots/vald/keys.dic" with all the valid elements,
collapsed by the above equivalence.

IV. "ED"'S STORY

Rene observed that the EVA digraph "ed" is a marker for the A/B language
split. He produced some plots where the horizontal axis is the page
number, with subsections distinguished by colors.

Let's count the word frequencies per page:

  zcat ../037/vms-17-ok.soc.gz \
    | tr '/' '-' \
    | gawk \
        ' \
          (($2 ~ /[A]/) && ($6 \!~ /[-=., ]/)){ \
            gsub(/[.].*$/,"",$1); print $9, substr($10,2), $1, $6; \
          } \
        ' \
    | sort | uniq -c | expand \
    | sort -b +1 -2 +2 -3 +0 -1nr \
    > .all.fpw

  cat .all.fpw \
    | list-page-champs -v maxChamps=4 \
    > .all.chpw

Let's count the total word occurrences per page:

  cat .all.fpw \
    | gawk \
        ' /./{ k = ($2 " " $3 " " $4); ct[k] += $1; } \
          END { for(w in ct) { print ct[w], w; } } \
        ' \
    | sort -b +2 -3 +0 -1nr \
    > .all.tpw

Let's now count the "ed"-containing words per page:

  cat .all.fpw \
    | gawk \
        ' ($5 ~ /ed/){ print; } ' \
    > .ed.fpw

  cat .ed.fpw \
    | list-page-champs -v maxChamps=6 \
    > .ed.chpw

  cat .ed.fpw \
    | gawk '//{ print $1,$2,$3,$4; }' \
    | combine-counts \
    | sort -b +2 -3 \
    > .ed.tpw

Let's plot the ratio of "ed"-words to total words per page:

  plot-freqs .ed.tpw .all.tpw

The plots of the "ed"-ratio R show that "hea" and "pha" are virtually
"ed"-free (R < 0.03, below the error level); "cos-1" (the part before
the "zod") and "zod" begin with slightly higher ratios than "hea"/"pha"
(R ~ 0.04); R then increases sharply along "zod", from R ~ 0.03 to
R ~ 0.11, and jumps to R ~ 0.20 in "cos-2" (the part after "zod").
"heb-2" (after "zod") has R ~ 0.17, just below that of "cos-2". "heb-1"
(before "zod") has widely variable R, with mean R ~ 0.20.
"str" has R ~ 0.20 like "heb", but more uniform (except for the two pages before "zod", which have R ~ 0.02). "bio" has R ~ 0.20 in the middle, R ~ 0.32 at both ends So, based only on these plots, the writing sequence would be hea + pha (no obvious order) cos-1 + zod heb-2 str + heb-1 + cos-2 bio V. ABOUT "ED" AND LADY "DY" It seems that most of the words in language B are actually words that end with . In fact there seems to be a very small number of words involved. Let's plot the per-page frequencies of the ending: cat .all.fpw \ | gawk ' ($5 ~ /dy$/){ print $1,$2,$3,$4; } ' \ | combine-counts \ | sort -b +2 -3 \ > .dy.tpw plot-freqs .dy.tpw .all.tpw This plot shows the same trends as the frequency, except that the data for language-A is noisier and the distinction between languages A and B is less marked (because the counts for language A are no longer zero). Here "cos-1" and "zod" are practically equal. Curiously, pharma has a slightly higher R than herbal-A; and R actually decreases as we go down herbal-A. This decrease is strange since the trend in the zodiac pages establishes that increases frm older to newer, hence language A should be earlier than language B. Let's try again with the ending proper: cat .all.fpw \ | gawk ' ($5 ~ /edy$/){ print $1,$2,$3,$4; } ' \ | combine-counts \ | sort -b +2 -3 \ > .edy.tpw plot-freqs .edy.tpw .all.tpw These plots are like the plots but cleaner. The "bio" subsection has R ~ 0.25 with a dip in the middle. Subsections "str-2" and "heb-1" have almost the same R ~ 0.15. Subsections "cos-2" and "heb-2" have R ~ 0.10. Subsections "cos-1" and "zod" have R ~ 0.03 (barely significant) and the trend in "zod" is not so clear. Finally subsections "hea-1", "hea-2", "pha", and "str-1" have hardly any "edy". VI. THE "EDY" WORDS Let's compute the overall frequency of each word per subsection, removing the prefix and mapping [ktpf] to : cat .all.fpw \ | gawk \ ' /./{ \ gsub(/^q/,"",$5); gsub(/[ktpf]/,"k",$5); \ print $1,$2,"000","000",$5 \ } \ ' \ | combine-counts \ | sort -b +1 -2 +0 -1nr \ > .all.ftw cat .all.ftw \ | gawk '/./{print $1,$2,$3,$4 } ' \ | combine-counts \ | sort -b +2 -3 \ > .all.ttw Now let's look at the words specifically: cat .all.ftw \ | gawk '($5 ~ /edy$/){ print; }' \ > .edy.ftw cat .edy.ftw \ | list-page-champs -v maxChamps=6 \ > .edy.chtw Here are the six most common words in each subsection, manually sorted: sec totwd champions --- ----- ---------------------------------------------------------------------- str 10783 okedy(180) okeedy(271) chedy(193) shedy(119) lchedy(56) okchedy(131) bio 6716 okedy(310) okeedy(252) chedy(218) shedy(252) lchedy(59) okchedy(44) heb 3337 okedy(101) okeedy(31) chedy(67) shedy(36) kedy(26) ykedy(25) cos 2590 okeedy(11) chedy(12) okedy(23) shedy(11) okchedy(10) kchedy(5) zod 997 okeedy(5) chedy(4) okedy(3) shedy(5) okshedy(2) eeedy(2) hea 7553 chedy(1) ykchedy(1) okeedy(2) esedy(1) pha 2401 chedy(1) ockhedy(1) cheedy(2) cholkeedy(1) ckhedy(1) okedy(1) unk 1847 okedy(21) chedy(19) okchedy(14) shedy(14) okeedy(7) olkeedy(7) --- ----- ---------------------------------------------------------------------- As it can be seen, the words (there are many of them!) are characteristic of language B ("bio", "heb", "str"), and also a bit of "cos" and "zod". The frequency of "okedy" (and its k/t/q variants) is 1: 22 "bio" 1: 33 "heb" 1: 55 "str" 1: 220 "cos" 1: 200 "zod" and practically nil in "hea", "pha". 
Let's look at the words that DON'T end in "edy":

  cat .all.ftw \
    | gawk ' ($5 \!~ /edy$/){ print; } ' \
    > .not-edy.ftw

  cat .not-edy.ftw \
    | list-page-champs -v maxChamps=6 \
    > .not-edy.chtw

These are the six most common non-"edy" words in each subsection, also
manually sorted:

  sec  totwd  champions
  ---  -----  ----------------------------------------------------------------------
  str  10783  okaiin(350) okal(198) okeey(341) aiin(199) okain(173) okar(184)
  bio   6716  okaiin(145) okal(185) okeey(128) okain(240) ol(363) oky(124)
  heb   3337  okaiin(67) okal(56) aiin(68) okar(92) daiin(79) or(68)
  cos   2590  aiin(44) ar(57) okeey(45) or(43) dar(40) daiin(39)
  zod    997  aiin(29) ar(28) okeey(24) al(30) okaiin(21) okal(17)
  hea   7553  daiin(393) chol(215) chor(144) okchy(142) ckhy(138) oky(131)
  pha   2401  daiin(105) chol(47) okeol(62) okol(52) okeey(51) ol(41)
  unk   1847  okar(58) daiin(42) okaiin(40) okal(32) aiin(31) or(31)
  ---  -----  ----------------------------------------------------------------------

Note that "daiin" is the most common word in herbal-A and pharma, but it
shows up also in the other subsections, at 1/2 to 1/4 the frequency:

  "hea" 1: 18
  "pha" 1: 24
  "heb" 1: 40
  "str" 1: 75
  "bio" 1: 80
  "cos" 1: 60
  "zod" 1: 80

So perhaps "daiin" is a function word that got less and less used as the
author's vocabulary expanded.

The most popular non-"edy" words in language B are

  okaiin okal okeey aiin okain okar

They are fairly uniform across subsections, except perhaps "okar", which
is more concentrated in herbal-B. It is hard to get any conclusion from
these lists (other than "it strongly suggests Chinese" 8-).

Let's try with the words "chedy"/"shedy":

  cat .all.fpw \
    | gawk \
        ' ($5 ~ /^[cs]hedy$/){ \
            print $1,$2,$3,$4 \
          } \
        ' \
    | combine-counts \
    | sort -b +2 -3 \
    > .chedy.tpw

  plot-freqs .chedy.tpw .all.tpw

Predictably, the R values are smaller overall, and only those of
"str-2", "heb-1", and "bio" are significantly greater than 0. The "bio"
pages still show the dip in the middle.

Let's try "dain"/"daiin", which should show the reverse trend:

  cat .all.fpw \
    | gawk \
        ' ($5 ~ /^d[ao]i+n$/){ \
            print $1,$2,$3,$4 \
          } \
        ' \
    | combine-counts \
    | sort -b +2 -3 \
    > .dain.tpw

  plot-freqs .dain.tpw .all.tpw

Predictably again, these plots show the opposite trends. In "hea" R is
large and decreasing ("hea-1" has R ~ 0.07, "hea-2" has R ~ 0.04). Next
is "pha" at R ~ 0.04, then "heb-2" and "heb-1" at R ~ 0.03, then "str",
"cos", "zod", and "bio" all at R ~ 0.02. The "unk" pages f1r and f49v
have R ~ 0.08, which is right in the middle of the herbal-A range; the
others have lower R, which is consistent with language B material.

Let's compare the frequencies of "Ke" elements relative to the total of
non-[aoy] elements (mostly "K" and "Ke").
  cat .all.fpw \
    | ../017/factor-field-OK \
        -v inField=5 \
        -v outField=6 \
    | gawk \
        ' /./{ \
            f = $6; \
            gsub(/^[^{}]*/,"",f); gsub(/[^{}]*$/,"",f); \
            gsub(/}[^{}]*{/,"} {",f); n = split(f, ff); \
            for(i=1;i<=n;i++){ print $1,$2,$3,$4,ff[i]; } \
          } \
        ' \
    | grep -v '{_}' \
    | combine-counts \
    | sort -b +1 -2 +2 -3 +0 -1nr \
    > .all.fpe

  dicio-wc .all.fpw .all.fpe

      lines    words      bytes  file
    -------  -------  ---------  ------------
      24921   124605     688935  .all.fpw
       5632    28160     147899  .all.fpe

And let's compute the total non-[aoy] elements per page:

  cat .all.fpe \
    | gawk \
        ' /./{ k = ($2 " " $3 " " $4); ct[k] += $1; } \
          END { for(w in ct) { print ct[w], w; } } \
        ' \
    | sort -b +2 -3 +0 -1nr \
    > .all.tpe

  dicio-wc .all.tpw .all.tpe

      lines    words      bytes  file
    -------  -------  ---------  ------------
        227      908       3798  .all.tpw
        227      908       3924  .all.tpe

Let's now count the "Ke" elements only:

  cat .all.fpe \
    | gawk \
        ' ($5 ~ /{([ice][ktpf]?[he]|[ktpf])e}/){ print; } ' \
    > .Ke.fpe

  cat .Ke.fpe \
    | gawk '//{print $1,$2,$3,$4; }' \
    | combine-counts \
    | sort -b +2 -3 \
    > .Ke.tpe

  plot-freqs .Ke.tpe .all.tpe

Strangely, the plots show little change from language A to language B,
less than the variation within the same subsection. The ratio for
"hea-1" is lowest (R ~ 0.03) and reaches its minimum around page p025;
curiously, it seems to oscillate with a period of 1-2 pages. The ratios
for all other subsections are about the same, around 0.10. The "zod"
pages again show a sharply increasing trend, except for the first couple
of pages.

Observations:

  If languages A and B are indeed different languages, it is hard to
  explain why some letter-group statistics are so uniform, and why some
  are so variable.

  If languages A and B are different spellings of the same language,
  then the spelling change must not have affected the use of "Ke"
  elements relative to the "K" elements.

  If the difference between languages A and B is merely due to
  vocabulary (including tense/person/etc.), then the difference again
  must not favor "Ke" words over "K" words.

Let's try the gallows elements only:

  cat .all.fpe \
    | gawk \
        ' ($5 ~ /{.*[ktpf].*}/){ print; } ' \
    > .ktpf.fpe

  cat .ktpf.fpe \
    | gawk '//{print $1,$2,$3,$4; }' \
    | combine-counts \
    | sort -b +2 -3 \
    > .ktpf.tpe

  plot-freqs .ktpf.tpe .all.tpe

These plots are even more uniform than the previous ones. The ratio of
gallows elements to non-gallows elements is amazingly constant
(R ~ 0.22) for all subsections and languages. One cannot even see the
"zod" trend.

Let's look at the "skeletons" of the words, obtained by deleting the
[aoy] inserts and the [i] and [e] modifiers:

  cat .all.fpw \
    | ../017/factor-field-OK \
        -v inField=5 \
        -v outField=6 \
    | gawk \
        ' /./{ \
            f = $6; \
            gsub(/^[^{}]*/,"",f); gsub(/[^{}]*$/,"",f); \
            if (match(f,/{([ice][ktpf]?[eh]|[ktpf])e}/)) \
              { gsub(/e}/,"}",f); } \
            gsub(/{i+/,"{",f); gsub(/{_}/,"",f); \
            gsub(/}[^{}]*{/,"",f); \
            print $1,$2,$3,$4,f; \
          } \
        ' \
    | combine-counts \
    | sort -b +1 -2 +2 -3 +0 -1nr \
    > .all.fps

And let's compute the total skeleton occurrences per page:

  cat .all.fps \
    | gawk \
        ' /./{ k = ($2 " " $3 " " $4); ct[k] += $1; } \
          END { for(w in ct) { print ct[w], w; } } \
        ' \
    | sort -b +2 -3 +0 -1nr \
    > .all.tps

  dicio-wc .all.tpw .all.tps
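For the record, the R values quoted in sections IV-VI are just the
per-page ratios of the two ".tp*" files given to plot-freqs (which is a
separate script that also produces the actual plots). A minimal gawk
sketch of the ratio computation alone, assuming only that each ".tp*"
line is a count followed by a page key, as produced by the pipelines
above:

  gawk \
    ' FNR == NR { \
        key = $2; for (i = 3; i <= NF; i++) { key = key " " $i; } \
        num[key] = $1; next; \
      } \
      { key = $2; for (i = 3; i <= NF; i++) { key = key " " $i; } \
        if ($1 > 0) { printf "%8.4f  %s\n", num[key]/$1, key; } \
      } \
    ' .ed.tpw .all.tpw

This prints one "R page-key" line per page; pages absent from the first
file come out as R = 0.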