Hacking at the Voynich manuscript - Side notes
504 Attempt at factoring the 'edy' and 'air' word classes

Last edited on 1999-01-31 06:20:15 by stolfi

OBSOLETE

  This is partly a remake of work from Notebook-1.txt, originally
  done around 97-07-05.

Summary of previous relevant tasks:

  I obtained Landini's interlinear transcription of the VMs, version 1.6
  (landini-interln16.evt) from
  http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]

  Around 97-11-01 I split landini-interln16.evt into many files,
  with one text unit per page. [Notebook-12.txt]

  On 97-11-05 I mapped those files from FSG and other ad-hoc
  alphabets to EVA. [Notebook-12.txt] The files are L16-eva/fNNxx.YY,
  and a machine-readable description of their contents and logical
  order is in L16-eva/INDEX.

  Then I started redoing some of the previous tasks using the new
  encoding.

  I extracted the Currier (;C>) and Friedman-I (;F>) versions of the
  "bio" section, in EVA alphabet, as files bio-{c,f}-eva.evt. I also
  built the associated text files and word lists
  bio-{c,f}-eva{-{gut,fun,bad},{wds,dic,frq},.txt}. [Note-001.txt]

  I mapped the word files bio-{c,f}-eva-gut.{wds,dic} to a reduced
  alphabet (ERA), obtaining files bio-{c,f}-era-gut.{wds,dic}.

  I built finite-state automata for these two files, and looked for
  productive states. The Friedman version seems cleaner. [Note-003.txt]

97-11-10 stolfi
===============

  Poring over the productive states of the automata and playing around
  with counting programs, I tentatively identified two major classes
  of morphologically similar words: one inflecting as -e*d?y or -e*d
  (the 'edy' class) and one inflecting as -e*d?[ao]i*[rlnm] (the 'air'
  class).

  Let's collect their "roots". To keep the tables manageable, and
  reduce noise, let's work with ERA-mapped files. Recall that "a" and
  "y" become "o" in ERA.
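To make the two suffix patterns concrete, here is a minimal sketch that applies them, in ERA form (with "o" standing for "a" and "y"), to a handful of made-up words. The sample words are invented for illustration and are not taken from the transcription; the function name factor_words is likewise an assumption, not one of the tools used in these notes.

```shell
#!/bin/sh
# Sketch only: split ERA-mapped words into "prefix- -CLASS" pairs.
# factor_words is a hypothetical helper name.
# Note: d* is slightly more permissive than the d? of the prose description.
factor_words() {
  grep -E -e '[orlmnd]$' \
    | sed -e 's:e*d*o$:- -EDY:' \
          -e 's:e*d$:- -EDY:' \
          -e 's:e*d*oi*[rlmn]$:- -AIR:' \
    | grep -e '- -'
}

# Made-up sample words, one per line.
# "chos" ends in none of [orlmnd], so it is dropped by the first filter.
printf '%s\n' chedo olched okoiin or chos | factor_words
# prints:
#   ch- -EDY
#   olch- -EDY
#   ok- -AIR
#   - -AIR
```

Note that a word consisting only of a suffix, such as "or", yields the empty prefix, written "-".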
    cat bio-f-era-gut.wds \
      | egrep -e '[orlmnd]$' \
      | sed -e 's:e*d*o$:- -EDY:' \
      | sed -e 's:e*d$:- -EDY:' \
      | sed -e 's:e*d*oi*[rlmn]$:- -AIR:' \
      | egrep -e '- -' \
      > Note-004/.factored

    cat Note-004/.factored \
      | gawk '/./ {print $1}' \
      | revbytes | sort | uniq | revbytes \
      > Note-004/.prefs-all.dic

    dicio-wc Note-004/.prefs-all.dic

       lines   words     bytes file
      ------ ------- --------- ------------
         266     266      1704 Note-004/.prefs-all.dic

  I built a file "Note-004/suffs.cls" containing only the lines "-EDY"
  and "-AIR".

    cat Note-004/.factored \
      | count-diword-freqs \
          -v rows=Note-004/.prefs-all.dic \
          -v cols=Note-004/suffs.cls \
          -v digits=4 \
      > Note-004/.tbl-1

  The raw count table was then extracted by hand from Note-004/.tbl-1
  and saved to Note-004/.tbl-2.

  These are the most popular prefixes from that table (10 or more
  occurrences), sorted by reverse prefix:

         prefix    TOT  EDY  AIR
    ----------- ------ ---- ----
              -    846   81  765
            ch-    971  771  200
           dch-     58   47   11
         okech-     12   11    1
           kch-     54   44   10
          okch-    104  101    3
           lch-    139  125   14
          olch-    111  105    6
         polch-     11   10    1
         rolch-     13   12    1
           och-     46   39    7
           pch-     43   26   17
          opch-     39   34    5
           rch-     47   43    4
             k-    208   97  111
            ek-     34   24   10
         cheek-    110  108    2
          chek-    189  172   17
           oek-     17   14    3
           chk-     29   12   17
            lk-     32   20   12
           olk-    148   86   62
          rolk-     17   13    4
            ok-   1652  873  779
           dok-     12    9    3
             l-     70   24   46
            ol-    109   58   51
           dol-     19   15    4
          okol-     31   22    9
             o-     32   18   14
             p-     19    1   18
            op-     10    4    6
             r-    253   18  235
            or-     48   15   33
           dor-     10    7    3
            ...    ...  ...  ...
    ----------- ------ ---- ----
            TOT   5992 3395 2597

  Let's do a scatter plot of the prefixes, using the EDY/AIR counts:

    gnuplot < Note-004/pref-suff-class.txt

  Here are the significant entries (type > 0), sorted by reverse
  prefix.

         prefix    TOT  EDY  AIR R
    ----------- ------ ---- ---- -
              -    846   81  765 9
    ----------- ------ ---- ---- -
             d-      6    6    . 1
    ----------- ------ ---- ---- -
            ch-    971  771  200 2
           dch-     58   47   11 2
        olkech-      3    3    . 1
         okech-     12   11    1 1
           fch-      4    3    1 3
          ofch-      9    8    1 1
           kch-     54   44   10 2
          lkch-      3    3    . 1
         olkch-      7    7    . 1
          okch-    104  101    3 1
           lch-    139  125   14 1
          olch-    111  105    6 1
         dolch-      6    6    . 1
         kolch-      3    3    . 1
        okolch-      4    4    . 1
         polch-     11   10    1 1
         rolch-     13   12    1 1
           och-     46   39    7 2
           pch-     43   26   17 5
        chepch-      5    5    . 1
          lpch-      3    2    1 4
          opch-     39   34    5 2
           rch-     47   43    4 1
          orch-      6    6    . 1
    ----------- ------ ---- ---- -
             k-    208   97  111 6
            dk-      4    3    1 3
            ek-     34   24   10 3
         cheek-    110  108    2 1
        lcheek-      9    8    1 1
          chek-    189  172   17 1
         lchek-      4    4    . 1
         rchek-      6    6    . 1
           oek-     17   14    3 2
           chk-     29   12   17 7
            lk-     32   20   12 4
           olk-    148   86   62 5
          dolk-      3    3    . 1
         opolk-      2    .    2 9
          rolk-     17   13    4 3
            ok-   1652  873  779 5
           dok-     12    9    3 3
         cheok-      4    4    . 1
          chok-      8    4    4 6
           kok-      2    .    2 9
          olok-      5    4    1 2
           rok-      6    3    3 6
            rk-      5    1    4 8
    ----------- ------ ---- ---- -
             l-     70   24   46 7
           chl-      3    1    2 7
           okl-      3    .    3 9
            ol-    109   58   51 5
           dol-     19   15    4 2
         cheol-      8    5    3 4
          chol-      4    4    . 1
           kol-      6    5    1 2
          okol-     31   22    9 3
           lol-      8    6    2 3
          opol-      3    3    . 1
           rol-      7    6    1 2
          orol-      4    4    . 1
    ----------- ------ ---- ---- -
             o-     32   18   14 5
           cho-      3    3    . 1
           oko-      7    6    1 2
    ----------- ------ ---- ---- -
             p-     19    1   18 9
            ep-      9    5    4 5
         cheep-      4    4    . 1
          chep-      4    3    1 3
            op-     10    4    6 7
    ----------- ------ ---- ---- -
             r-    253   18  235 9
            lr-      2    .    2 9
            or-     48   15   33 8
           dor-     10    7    3 3
           kor-      8    4    4 6
          okor-      5    4    1 2
           lor-      3    .    3 9
          olor-      5    2    3 7
           por-      2    .    2 9
           ror-      6    1    5 9
         doror-      3    2    1 4
    ----------- ------ ---- ---- -

  Note that the EDY/AIR ratio depends rather strongly on the last
  letter of the prefix.

  These were the counts for prefixes containing 'c[ktpf]h' before we
  decided to map those compounds to 'e[ktpf]e':

  Counts for 'c[ktpf]h' before:

    ----------- ------ ---- ---- -
           cph-      9    5    4 5
        checph-      4    4    . 1
         chcph-      3    3    . 1
    ----------- ------ ---- ---- -
           ckh-     26   19    7 3
        checkh-     82   80    2 1
       lcheckh-      8    8    . 1
         chckh-    133  128    5 1
          ockh-     16   13    3 2
    ----------- ------ ---- ---- -

  Counts for 'e[ktpf]-' before:

    ----------- ------ ---- ---- -
            ep-      .    .    . 0
         cheep-      .    .    . 0
          chep-      .    .    . 0
    ----------- ------ ---- ---- -
            ek-      8    5    3 4
         cheek-     28   28    . 1
        lcheek-      .    .    . 0
          chek-     56   44   12 2
           oek-      .    .    . 0
    ----------- ------ ---- ---- -

  Counts for 'e[ktpf]-' after:

    ----------- ------ ---- ---- -
            ep-      9    5    4 5
         cheep-      4    4    . 1
          chep-      4    3    1 3
    ----------- ------ ---- ---- -
            ek-     34   24   10 3
         cheek-    110  108    2 1
        lcheek-      9    8    1 1
          chek-    189  172   17 1
           oek-     17   14    3 2
    ----------- ------ ---- ---- -
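The mapping of 'c[ktpf]h' compounds to 'e[ktpf]e' mentioned above can be sketched as a single sed substitution. This is only an illustration under assumptions: the function name map_compounds is hypothetical, the sample words are made up, and the actual mapping step used for these notes is not shown in this file.

```shell
#!/bin/sh
# Sketch only: rewrite each compound c[ktpf]h as e[ktpf]e.
# map_compounds is a hypothetical helper name.
map_compounds() {
  sed -e 's:c\([ktpf]\)h:e\1e:g'
}

# Made-up sample words, one per line.
printf '%s\n' ckhedo checkhedo chckhoin ockhedo cphor | map_compounds
# prints:
#   ekeedo
#   cheekeedo
#   chekeoin
#   oekeedo
#   epeor
```

After this mapping, the trailing "e" of 'e[ktpf]e' is absorbed by the e* of the suffix patterns, so a word that used to factor with prefix 'ckh-' now factors with prefix 'ek-'. That is consistent with the 'ckh-' and 'cph-' counts moving into the 'ek-' and 'ep-' rows in the "after" table.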