Hacking at the Voynich manuscript - Side notes
503 Automaton-based analysis with reduced alphabet

Last edited on 1999-01-31 06:18:58 by stolfi

OBSOLETE

  This is partly a remake of work from Notebook-1.txt,
  originally done around 97-07-05.

Summary of previous relevant tasks:

  I obtained Landini's interlinear transcription of the VMs, version 1.6
  (landini-interln16.evt) from
  http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]

  Around 97-11-01 I split landini-interln16.evt into many files, with
  one text unit per page. [Notebook-12.txt]

  On 97-11-05 I mapped those files from FSG and other ad-hoc alphabets
  to EVA. [Notebook-12.txt] The files are L16-eva/fNNxx.YY, and a
  machine-readable description of their contents and logical order is
  in L16-eva/INDEX.

  Then I started going back to redoing some of the previous tasks
  using the new encoding.

  I extracted the Currier (;C>) and Friedman-I (;F>) versions of the
  "bio" section, in EVA alphabet, as files bio-{c,f}-eva.evt. I also
  built the associated text files and word lists
  bio-{c,f}-eva{-{gut,fun,bad},{wds,dic,frq},.txt}. [Note-001.txt]

  I built finite automata for the two versions, and looked for strange
  words and inflection classes. Friedman's version looked a bit more
  regular than Currier's. [Note-002.txt]

97-11-10 stolfi
===============

  After a couple of days of playing around with prefixes and suffixes,
  I concluded that, at this stage in the analysis, it was necessary to
  collapse similar letters: not only to reduce transcription and
  sampling noise, but also to reduce the output to a manageable size.
  Therefore, I wrote a filter "eva2era" that maps strings of EVA
  letters to a reduced alphabet ERA, as follows:

    function eva_to_era(txt)
      {
        # Converts a chunk of comment-free EVA to ERA
        gsub(/sh/, "ch", txt);
        gsub(/s/, "r", txt);
        gsub(/t/, "k", txt);
        gsub(/ckh/, "eke", txt);
        gsub(/cph/, "epe", txt);
        gsub(/cfh/, "efe", txt);
        gsub(/ei/, "o", txt);
        gsub(/a/, "o", txt);
        gsub(/y/, "o", txt);
        gsub(/iii*/, "i", txt);
        gsub(/q/, "", txt);
        return txt
      }

  I created files of good words from the two transcriptions:

    foreach f ( f c )
      cat bio-${f}-eva-gut.wds \
        | egrep -v '.q' \
        | eva2era \
        > bio-${f}-era-gut.wds
      cat bio-${f}-era-gut.wds \
        | sort | uniq \
        > bio-${f}-era-gut.dic
      dicio-wc bio-${f}-era-gut.{wds,dic}
    end

     lines   words     bytes file
    ------ ------- --------- ------------
      6166    6166     34893 bio-f-era-gut.wds
       763     763      5164 bio-f-era-gut.dic
      5864    5861     34612 bio-c-era-gut.wds
       940     939      6720 bio-c-era-gut.dic

  Note that I removed words with embedded "q"s. The reason is that
  eva2era deletes "q"s, so those words might have ended up in
  confusing places.

  There were 16 such words in the Friedman version:

    lshdyqo qoqokeey oqoky qokeedyqokar oqokaiin cheyqy yqol olqo
    tyqoky oqol oqo qoqokal yqokaiin oqofchedy oqol yqor

  and 20 in the Currier version:

    oqokain lshdyqo qoqokeey oqoky sheq oqokain dyqokol dqokedy
    qolqol ysheeyqo cheyqytaiin olqo tyqoky oqol oqo chedyqoty
    teyqokedy yqokain oqofchedy yqolyqor

  The ones present in both are

    lshdyqo qoqokeey oqoky oqokaiin olqo tyqoky oqol oqo yqokain

  but I haven't checked whether they are really the same words of the
  VMs.

  The two ".dic" files differ in 459 words (even after eva2era
  collapse!). To be precise, Friedman\Currier is 134 words,
  Currier\Friedman is 313 words.
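  For readers without awk at hand, the mapping and the filtering step
  above can be sketched in Python. This is only a port for
  illustration, not the eva2era script itself: the helper names
  has_embedded_q, era_dictionary and dic_differences are made up here,
  and the sample words are illustrative rather than read from the
  bio-*-eva-gut.wds files. The substitutions are applied in the same
  order as in the awk function, which matters (for instance "t" becomes
  "k" before the "ckh" rule is tried).

```python
import re

# Substitution rules copied from the awk function, in the same order.
ERA_RULES = [
    (r"sh", "ch"), (r"s", "r"), (r"t", "k"),
    (r"ckh", "eke"), (r"cph", "epe"), (r"cfh", "efe"),
    (r"ei", "o"), (r"a", "o"), (r"y", "o"),
    (r"iii*", "i"), (r"q", ""),
]

def eva_to_era(txt):
    """Map a chunk of comment-free EVA text to the reduced ERA alphabet."""
    for pattern, repl in ERA_RULES:
        txt = re.sub(pattern, repl, txt)
    return txt

def has_embedded_q(word):
    """Same test as egrep '.q': a "q" preceded by at least one character."""
    return re.search(r".q", word) is not None

def era_dictionary(eva_words):
    """Drop words with embedded "q", map the rest, return the sorted distinct words."""
    return sorted({eva_to_era(w) for w in eva_words if not has_embedded_q(w)})

def dic_differences(dic_a, dic_b):
    """Return (a minus b, b minus a) for two word lists, each sorted."""
    a, b = set(dic_a), set(dic_b)
    return sorted(a - b), sorted(b - a)

# Illustrative sample only (hypothetical input, not taken from the files).
sample = ["qokeedy", "daiin", "shedy", "otaiin", "oqokain", "okeedy"]
print(era_dictionary(sample))
```

  In this sample "oqokain" is discarded by the embedded-"q" test, and
  "qokeedy" and "okeedy" collapse to the same ERA word, which is the
  kind of merging the reduced alphabet is meant to produce.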
  I built automata for the two files:

    foreach f ( f c )
      cat bio-${f}-era-gut.dic \
        | nice MaintainAutomaton \
            -add - \
            -dump Note-003/${f}-era.dmp
      nice AutoAnalysis \
        -load Note-003/${f}-era.dmp \
        -unprod Note-003/${f}-era-1-unp.sts \
        -maxUnprod 1 \
        -unprodSugg Note-003/${f}-era-1-unp.sugg
      nice AutoAnalysis \
        -load Note-003/${f}-era.dmp \
        -prod Note-003/${f}-era-1-prd.pst \
        -minProductivity 1 \
        -maxPrefSize 30 \
        -maxSuffSize 30
    end

  Automata:

     strings  letters   states   finals     arcs  sub-sts lets/arc
    -------- -------- -------- -------- -------- -------- --------
         763     4401      360      122      857      710    5.135 Friedman
         940     5780      486      162     1119      940    5.165 Currier

  Strange words:

    Friedman: 40 unproductive states, 26 strange words:

      chlchpcheeo cholkeeeo efedorol epoloir kchdolkdo keoeoloin
      keokem koekeeo korolchrdo lkeoldo lroiror ochepolr oddcheo
      okino olkeeocheol olkeeoolor oloefeo olpoekeo pdolchor poldoko
      polkechol prchedol rchkchdo reokoin rokcheed rpcho

    Currier: 59 unproductive states, 32 strange words:

      chedkeolo chkolcheedo chlchpcheeo doleerolcheo dolkeeeoro
      ekokedor epoloir erer kchdolkdo kchololkee kedopocho keokeg
      koekeeo korolchrdo lolkedokoinol nlkeedo ochepolr ochokeed
      ofolcheko okchorolko okedoepeo okorlche pchchedol pcheokeeor
      pcholpchefedo pdolchor poldoroirol polkechol porolcho
      rcholcheekeo rofchom rolchkedokor

    The 8 words that are strange in both automata:

      chlchpcheeo epoloir kchdolkdo koekeeo korolchrdo ochepolr
      pdolchor polkechol

  Productive states in Friedman's automaton:

      state  nprefs  nsuffs  nwords  prodty prefs/suffs
    ------- ------- ------- ------- ------- -----------------
         11      22       2      44      21 { n, rolcheo, rcho, opcheo, ... }:{ (), l }
         60      22       2      44      21 { rolched, roin, oror, orol, ... }:{ (), o }
         33      14       2      28      13 { rolo, oloko, olchedo, okecho, ... }:{ (), r }
          8       4       4      16       9 { ekeo, cheeo, roro, chdo }:{ (), in, l, r }
         86      10       2      20       9 { olkche, okold, okeee, epee, ... }:{ do, o }
         34       5       3      15       8 { opchol, okchd, oched, chor, ... }:{ (), o, or }
         26       7       2      14       6 { olkol, okeedol, pcheol, ocheol, ... }:{ (), do }
         37       7       2      14       6 { rchek, orok, lked, lchek, ... }:{ eo, o }
         80       4       3      12       6 { pchedo, oekeo, kcho, cheolo }:{ (), l, r }
         83       4       3      12       6 { polchd, opched, olkeed, cheor }:{ (), o, ol }
        306       4       3      12       6 { polche, okche, lkee, kolch }:{ d, do, o }
        107       2       6      12       5 { okedo, chko }:{ (), in, ir, l, m, r }
          7       3       3       9       4 { oloro, eko, choko }:{ (), in, l }
          9       2       5      10       4 { ror, chd }:{ (), o, oin, ol, or }
         12       5       2      10       4 { opchd, okd, kold, dcheed, ... }:{ o, ol }
        103       5       2      10       4 { oroi, olkoi, koi, okedoi, ... }:{ n, r }
        168       3       3       9       4 { okolch, dolche, dee }:{ do, edo, o }
        188       2       4       8       3 { rolke, doke }:{ do, edo, eo, o }
        217       4       2       8       3 { okorolo, loin, keedo, dororo }:{ (), m }
        335       2       4       8       3 { okolo, lcheo }:{ (), l, m, r }
          6       3       2       6       2 { doko, okolko, chldo }:{ (), in }
         38       3       2       6       2 { ocheek, cheeek, roek }:{ eeo, eo }
         55       3       2       6       2 { orche, epch, cheepe }:{ edo, o }
         88       2       3       6       2 { lcheeke, chepch }:{ edo, eo, o }
        167       3       2       6       2 { oloke, lchepe, kech }:{ do, edo }
        218       2       3       6       2 { keed, doror }:{ (), o, om }
        528       2       3       6       2 { rched, old }:{ (), o, oir }

  Productive states in Currier's automaton:

      state  nprefs  nsuffs  nwords  prodty prefs/suffs
    ------- ------- ------- ------- ------- -----------------
         80      23       2      46      22 { rolched, oror, oloin, olkor, ... }:{ (), o }
         17      20       2      40      19 { rolkeo, rolcheo, opcheo, olkeo, ... }:{ (), l }
         35      17       2      34      16 { opolo, oloko, okecho, lcheeo, ... }:{ (), r }
         66      17       2      34      16 { rolkee, olkche, okeee, loee, ... }:{ do, o }
         21      12       2      24      11 { rchek, olkchd, lked, lchek, ... }:{ eo, o }
        186       5       3      15       8 { opchol, okchd, oched, kolchd, ... }:{ (), o, or }
        406       5       3      15       8 { pchedo, okolo, oekeo, lcho, ... }:{ (), l, r }
         14       3       4      12       6 { cheoko, cheeo, chdo }:{ (), in, l, r }
         12       5       2      10       4 { rolko, ololo, okolko, ooko, ... }:{ (), in }
         18       5       2      10       4 { olold, ofchd, kold, dcheed, ... }:{ o, ol }
        291       5       2      10       4 { roror, rolor, roir, pchor, ... }:{ (), ol }
        473       3       3       9       4 { polche, odche, lkee }:{ d, do, o }
        696       3       3       9       4 { polchd, olor, olched }:{ (), o, ol }
         49       4       2       8       3 { rek, roek, ocheek, cholke }:{ eeo, eo }
         67       4       2       8       3 { ode, rorch, lkch, cheepch }:{ edo, eo }
        145       4       2       8       3 { olkoi, okedoi, ki, chkoi }:{ n, r }
        238       2       4       8       3 { orche, doke }:{ do, edo, eo, o }
        308       4       2       8       3 { rcheed, okechd, kched, eked }:{ o, or }
        382       4       2       8       3 { okolr, lolr, lcheer, keeol }:{ (), oin }
        461       4       2       8       3 { okedol, oind, lkol, ldol }:{ (), or }
        508       2       4       8       3 { olcheo, lolo }:{ (), l, m, r }
         38       3       2       6       2 { oker, chr, chedol }:{ (), do }
         68       3       2       6       2 { rorc, lkc, cheepc }:{ hedo, heo }
        108       3       2       6       2 { okede, dcheoke, cheolch }:{ do, eo }
        112       3       2       6       2 { oroiro, olchol, cheolkoin }:{ (), ro }
        140       2       3       6       2 { ore, kech }:{ do, edo, eo }
        167       3       2       6       2 { rolkch, chokch, lpch }:{ do, edo }
        168       3       2       6       2 { rolkc, chokc, lpc }:{ hdo, hedo }
        514       2       3       6       2 { olr, lor }:{ (), oin, olo }
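
  A note on reading the derived columns. This note does not define
  "prodty" or "lets/arc", so what follows is inferred from the numbers
  alone and is not taken from the AutoAnalysis documentation. Every row
  I checked is consistent with nwords = nprefs * nsuffs and
  prodty = (nprefs - 1) * (nsuffs - 1), and lets/arc matches
  letters / arcs to three decimals. A minimal Python check, with a few
  rows copied from the tables above (the function names are made up for
  this sketch):

```python
def nwords(nprefs, nsuffs):
    """Words through a state: every prefix combined with every suffix."""
    return nprefs * nsuffs

def productivity(nprefs, nsuffs):
    """Inferred formula (an assumption): (nprefs - 1) * (nsuffs - 1)."""
    return (nprefs - 1) * (nsuffs - 1)

# (state, nprefs, nsuffs, nwords, prodty), copied from the tables above.
rows = [
    (11, 22, 2, 44, 21),   # Friedman
    (8, 4, 4, 16, 9),      # Friedman
    (107, 2, 6, 12, 5),    # Friedman
    (80, 23, 2, 46, 22),   # Currier
    (14, 3, 4, 12, 6),     # Currier
]
rows_ok = all(
    nwords(p, s) == w and productivity(p, s) == prod
    for _state, p, s, w, prod in rows
)

# (letters, arcs, lets/arc) from the "Automata" table.
automata = [(4401, 857, 5.135), (5780, 1119, 5.165)]
ratios_ok = all(round(l / a, 3) == r for l, a, r in automata)

print(rows_ok, ratios_ok)
```

  Under this reading a state with a single prefix or a single suffix
  has productivity 0, which would explain why no such state appears in
  tables produced with -minProductivity 1.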