Hacking at the Voynich manuscript - Side notes
502 Studying the Voynich "words" with finite automata

Last edited on 1999-01-31 06:19:08 by stolfi

OBSOLETE

This is a remake of work from Notebook-1.txt, originally done around 97-07-05.

Summary of previous relevant tasks:

  I obtained Landini's interlinear transcription of the VMs, version 1.6
  (landini-interln16.evt) from
  http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]

  Around 97-11-01 I split landini-interln16.evt into many files, with
  one text unit per page. [Notebook-12.txt]

  On 97-11-05 I mapped those files from FSG and other ad-hoc alphabets
  to EVA. [Notebook-12.txt] The files are L16-eva/fNNxx.YY, and a
  machine-readable description of their contents and logical order
  is in L16-eva/INDEX.

  Then I started redoing some of the previous tasks using the new
  encoding.

  I extracted the Currier (;C>) and Friedman-I (;F>) versions of the
  "bio" section, in EVA alphabet, as files bio-{c,f}-eva.evt. I also
  built the associated text files and word lists
  bio-{c,f}-eva{-{gut,fun,bad},{wds,dic,frq},.txt}. [Note-001.txt]

97-11-06 stolfi
===============

  I created automata for the two versions:

    cat bio-c-eva-gut.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump Note-002/c.dmp

    cat bio-f-eva-gut.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump Note-002/f.dmp

     strings  letters   states   finals     arcs  sub-sts lets/arc
    -------- -------- -------- -------- -------- -------- --------
        1556     9448      861      162     2067     1757    4.571  Currier
        1342     7656      621      146     1627     1385    4.706  Friedman

  Hmmm... Perhaps Friedman is the best version after all. (Or maybe
  the one most affected by premature conclusions about the vocabulary...)

  I ran AutoAnalysis, looking for unproductive states (1 word or fewer)
  and strange words.
  First, the Currier file:

    nice AutoAnalysis \
      -load Note-002/c.dmp \
      -unprod Note-002/c-1-unp.sts \
      -maxUnprod 1 \
      -unprodSugg Note-002/c-1-unp.sugg

      167 unproductive states
      167 strange words (with repetitions) listed
      85 strange words (without repetitions) listed

  Here are the "strangest" words in the Currier file:

    ashyteed chedyqoty chedytchd cheiror cheoltainsy cheyqytaiin
    chlchpsheey cholkeeey chsheckhy cthykorol daiilaldy daiinaryly
    dalkeeeyry dchckhedy dlchery doleesolchey dorchedchey dylches
    dyqokol eckhal elshedy epaloir fchedykary kedyposhy koroiiny
    kotaly lchediim lchedyrpchedy lfcheal lldyr lolkedykainol
    ofalsheky okalchedytory olshalsy orairaro oshepols oyshey
    paldarairal pcheykeear pdalshor polteshol poralshy pshchedal
    psholpchcfhdy qekytedar qodched qokarchckhy qokylddy qokyrlshe
    qookaiin qoqokeey qotlolkal qotoralom qotytytor qpolkain
    qsolkeedy qyrchs ralchl riil salchtedytar salshcthdy shedea
    shedkeoly sheorolkchdy shtalcheedy siiraiin sockhey sofcham
    ssholshecthy stolpchy tchalolkee tchdoltdy teealain teolsheol
    teyteg torolshsdy tyqoky ydolsheey yepaiin ykchdar yqolyqor
    yrotey ytchorolky yteesoldy

  Now Friedman's:

    nice AutoAnalysis \
      -load Note-002/f.dmp \
      -unprod Note-002/f-1-unp.sts \
      -maxUnprod 1 \
      -unprodSugg Note-002/f-1-unp.sugg

      72 unproductive states
      72 strange words (with repetitions) listed
      47 strange words (without repetitions) listed

  Here are the "strangest" words in Friedman's version:

    alocfhy alshees cfhdarol chlaiiin chlchpsheey cholkeeey
    darcheedal dchckhedy epaloir kotaly loiinm lshedary oddchey
    okarolom okiiny olkeeyolor olkeeysheol olpockhy oroka orsheedy
    oshepols otarodl palchd pchcfhdy polteshol pschedal pykedy
    qcthdys qodched qokeedyqokar qokeeylshedy qokylddy qopalo
    qsolkeedy ralchl rcheald reataiin sockhey sokcheed tchdoltdy
    teaeolain teytem tocthey torolshsdy tyqoky yepaiin ykchdar

  Indeed, Friedman's seems more regular...
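  The two lists of strange words above can be compared with a plain
  set intersection. The sketch below is only an illustration: the
  function name common_words is made up here, and the two sample lists
  are short excerpts copied from the lists above, not the full output
  of AutoAnalysis.

```python
def common_words(list_a, list_b):
    """Return the words present in both lists, sorted alphabetically."""
    return sorted(set(list_a) & set(list_b))


# Short excerpts of the Currier and Friedman lists (not the full lists).
currier_sample = ["ashyteed", "cholkeeey", "epaloir", "kotaly",
                  "pshchedal", "teyteg"]
friedman_sample = ["alocfhy", "cholkeeey", "epaloir", "kotaly",
                   "pschedal", "teytem"]

shared = common_words(currier_sample, friedman_sample)
print(len(shared), "words are strange in both samples:", " ".join(shared))
```

  Near misses such as pshchedal/pschedal and teyteg/teytem do not
  count as shared, since the comparison is on exact spelling.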
  Here are the 17 words that are "strange" in both versions:

    chlchpsheey cholkeeey dchckhedy epaloir kotaly oshepols
    polteshol qodched qokylddy qsolkeedy ralchl sockhey tchdoltdy
    torolshsdy tyqoky yepaiin ykchdar

  Let's look for popular states and radicals, first in Currier's version:

    nice AutoAnalysis \
      -load Note-002/c.dmp \
      -classes Note-002/c-1.cls \
      -minClassPrefs 5 \
      -radicals Note-002/c-1.rad

      51 lexical classes found

  Then in Friedman's:

    nice AutoAnalysis \
      -load Note-002/f.dmp \
      -classes Note-002/f-1.cls \
      -minClassPrefs 5 \
      -radicals Note-002/f-1.rad

      43 lexical classes found

  The output was not illuminating.

  I then ran AutoAnalysis, to which I had previously added an option
  "-prod" to list productive states.

    AutoAnalysis \
      -load Note-002/c.dmp \
      -prod Note-002/c-1-prd.pst \
      -minProductivity 1 \
      -maxPrefSize 30 \
      -maxSuffSize 30

      38 productive states

  Here are the most productive states in the Currier automaton:

      state  nprefs  nsuffs  nwords  prodty  prefs/suffs
    ------- ------- ------- ------- -------  -----------------
         55      32       2      64      31  { rched, rar, qotees, qokeol, ... }:{ (), y }
         97      30       2      60      29  { solkee, sheckhe, qotshe, ... }:{ dy, y }
        129      25       2      50      24  { ytai, ydai, shai, rai, pai, ... }:{ in, n }
        264      12       2      24      11  { tolsheo, qa, qoa, qokea, ... }:{ l, r }
        320      10       2      20       9  { qolte, qoshe, orshe, lte, ... }:{ dy, edy }
         93       9       2      18       8  { sheet, sheeke, qeke, lked, ... }:{ ey, y }
        132       3       4      12       6  { olta, oda, cheda }:{ iin, in, l, r }
        136       7       2      14       6  { yteed, tched, rsheed, qotald, ... }:{ ar, y }
        130       6       2      12       5  { solka, shekai, roi, oloi, dyka, ... }:{ iin, in }
         56       3       3       9       4  { lcheda, dala, aro }:{ l, r, ry }
         96       5       2      10       4  { shl, shedal, qokain, oteol, ... }:{ (), dy }
        104       5       2      10       4  { otedai, seii, dyta, cheta, ... }:{ in, r }
        118       5       2      10       4  { polke, olksh, kolch, kech, ... }:{ dy, ey }
        124       5       2      10       4  { qode, yfch, oksh, odch, lkch }:{ edy, ey }
        125       3       3       9       4  { orche, lcheckh, checth }:{ edy, ey, y }
        361       3       3       9       4  { oee, dee, dolche }:{ dy, edy, y }
        436       3       3       9       4  { sheke, okesh, dyte }:{ dy, ey, y }
        475       3       3       9       4  { otai, olkai, kai }:{ in, n, r }
        484       3       3       9       4  { qore, qoksh, otsh }:{ dy, edy, ey }
         94       4       2       8       3  { ock, cholc, chedc, chcp }:{ hey, hy }
        174       4       2       8       3  { olks, kolc, kec, cheolc }:{ hdy, hey }
        408       4       2       8       3  { told, shd, polshd, dsheed }:{ al, y }
        580       4       2       8       3  { yfc, oks, odc, lkc }:{ hedy, hey }
        625       4       2       8       3  { solkc, qops, otals, lpc }:{ hdy, hedy }
         33       2       3       6       2  { chckhe, alche }:{ dy, s, y }
        116       3       2       6       2  { octhe, dchee, chech }:{ ol, y }
        126       2       3       6       2  { lcheck, chect }:{ hedy, hey, hy }
        164       2       3       6       2  { lolo, cheka }:{ l, m, r }
        166       3       2       6       2  { shepch, lsheckh, cheke }:{ edy, y }
        389       2       3       6       2  { salche, dolshe }:{ d, dy, ol }
        522       2       3       6       2  { qotche, kshe }:{ dy, ol, y }
        535       3       2       6       2  { otii, okeda, lchea }:{ m, r }
        583       2       3       6       2  { qokche, lkee }:{ d, dy, y }
        760       3       2       6       2  { tedy, qotain, olor }:{ (), ol }
        854       3       2       6       2  { shckhy, otoldy, opchedy }:{ (), lchey }
        880       2       3       6       2  { tai, soi }:{ iin, in, n }
        900       3       2       6       2  { qolt, qosh, orsh }:{ edy, eedy }
        958       2       3       6       2  { qoks, ots }:{ hdy, hedy, hey }

  Here is Friedman's:

    AutoAnalysis \
      -load Note-002/f.dmp \
      -prod Note-002/f-1-prd.pst \
      -minProductivity 1 \
      -maxPrefSize 30 \
      -maxSuffSize 30

      41 productive states

      state  nprefs  nsuffs  nwords  prodty  prefs/suffs
    ------- ------- ------- ------- -------  -----------------
         74      32       2      64      31  { sheckhe, qotche, qokes, qokeee, ... }:{ dy, y }
         69      27       2      54      26  { ytar, sheed, rshed, rched, rar, ... }:{ (), y }
        203      14       2      28      13  { shl, shol, shedal, rshee, ... }:{ (), dy }
         46      13       2      26      12  { tcha, qoa, qokea, qokeda, pcho, ... }:{ l, r }
          8      10       2      20       9  { ykai, rai, qolkai, otai, taii, ... }:{ in, n }
         84       4       4      16       9  { solke, qotsh, ckh, chcth }:{ dy, edy, ey, y }
        107       9       2      18       8  { salke, qode, qch, yfch, qoksh, ... }:{ edy, ey }
        232       9       2      18       8  { n, yty, sheolo, shcthey, qoty, ... }:{ (), l }
         85       3       4      12       6  { qots, ck, chct }:{ hdy, hedy, hey, hy }
        113       4       3      12       6  { qola, olta, oda, cheda }:{ iin, l, r }
        129       7       2      14       6  { sheet, sheeke, sheecth, olkesh, ... }:{ ey, y }
        293       7       2      14       6  { olte, qopsh, otalsh, ofch, ... }:{ dy, edy }
        370       4       3      12       6  { qotai, orai, olkai, kai }:{ in, n, r }
        452       4       3      12       6  { salche, qokche, polche, lkee }:{ d, dy, y }
        244       6       2      12       5  { yfc, qoks, olks, oks, lkc, ... }:{ hedy, hey }
        882       6       2      12       5  { solka, ska, qoteda, qora, sora, ... }:{ iin, r }
        127       5       2      10       4  { yteed, tshed, tched, oched, ... }:{ ar, y }
        154       5       2      10       4  { tedy, sheal, qotol, pshar, lar }:{ (), ol }
        323       5       2      10       4  { shd, qopchd, polshd, otd, ... }:{ al, y }
        388       3       3       9       4  { qokeo, opo, keo }:{ l, ly, r }
        492       5       2      10       4  { shepch, shecph, rolch, qckh, ... }:{ edy, y }
        493       5       2      10       4  { shepc, shecp, rolc, qck, ... }:{ hedy, hy }
        583       3       3       9       4  { shckhe, opche, okesh }:{ dy, ey, y }
        816       2       5      10       4  { qokch, polch }:{ dy, ed, edy, ey, y }
        817       2       5      10       4  { qokc, polc }:{ hdy, hed, hedy, hey, hy }
        112       4       2       8       3  { shka, qeta, olora, cheea }:{ iin, l }
        411       4       2       8       3  { qops, otals, ofc, lchcp }:{ hdy, hedy }
         83       2       3       6       2  { otsh, lte }:{ dy, edy, ey }
        101       3       2       6       2  { sheke, dke, chke }:{ dy, ey }
        108       2       3       6       2  { lcheckh, checth }:{ edy, ey, y }
        109       2       3       6       2  { lcheck, chect }:{ hedy, hey, hy }
        142       3       2       6       2  { okeda, lolo, cheka }:{ m, r }
        237       3       2       6       2  { oteo, olko, dairo }:{ l, ldy }
        267       2       3       6       2  { teed, daror }:{ (), am, y }
        294       2       3       6       2  { qolshe, dee }:{ dy, edy, y }
        403       2       3       6       2  { solche, kshe }:{ dy, ol, y }
        417       2       3       6       2  { lchea, lcheda }:{ l, m, r }
        529       3       2       6       2  { sheect, olkes, ock }:{ hey, hy }
        548       2       3       6       2  { tai, oii }:{ iin, in, n }
        705       2       3       6       2  { tar, opshed }:{ (), al, y }
       1026       2       3       6       2  { shea, qoto }:{ l, lol, r }

  Note that the top productivities are similar (31, 29, 24, 11 for
  Currier; 31, 26, 13, 12 for Friedman); however, the corresponding
  states are quite different!
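  For readers without AutoAnalysis at hand, the numbers in the tables
  above can be reproduced in spirit with a small sketch. In every row,
  nwords equals nprefs * nsuffs and prodty equals
  (nprefs - 1) * (nsuffs - 1); that reading is inferred from the
  printed figures, not taken from the AutoAnalysis source. The sketch
  also assumes that the automaton is the minimal deterministic one, so
  that two prefixes reach the same state exactly when they accept the
  same set of suffixes. The names productivity and productive_states
  are made up for this illustration, and state numbers are not
  reproduced.

```python
from collections import defaultdict


def productivity(nprefs, nsuffs):
    """Productivity as it appears in the tables (an inferred formula)."""
    return (nprefs - 1) * (nsuffs - 1)


def productive_states(words, min_productivity=1):
    """Group prefixes by their suffix sets and rank the groups.

    Each group stands for one state of the minimal automaton
    (an assumption about how the real tool defines its states).
    Returns a list of (productivity, prefixes, suffixes) tuples.
    """
    # Map each prefix to the set of suffixes that complete it to a word.
    suffixes_of = defaultdict(set)
    for w in set(words):
        for i in range(len(w) + 1):
            suffixes_of[w[:i]].add(w[i:])

    # Prefixes sharing the same suffix set are merged into one state.
    prefixes_of = defaultdict(set)
    for prefix, suffs in suffixes_of.items():
        prefixes_of[frozenset(suffs)].add(prefix)

    rows = []
    for suffs, prefs in prefixes_of.items():
        prod = productivity(len(prefs), len(suffs))
        if prod >= min_productivity:
            rows.append((prod, sorted(prefs), sorted(suffs)))
    # Most productive first; ties broken by the prefix list.
    rows.sort(key=lambda row: (-row[0], row[1]))
    return rows


# Six words implied by the first row of the Currier table:
# prefixes { rched, rar, qotees } with suffixes { (), y }.
sample = ["rar", "rary", "rched", "rchedy", "qotees", "qoteesy"]
for prod, prefs, suffs in productive_states(sample):
    shown = ["()" if s == "" else s for s in suffs]
    print(prod, "{ " + ", ".join(prefs) + " }:{ " + ", ".join(shown) + " }")
```

  On this tiny sample the only productive state groups the three
  prefixes with the suffixes () and y, with productivity 2. On a full
  word list the same grouping should give counts comparable to the
  nprefs and nsuffs columns, although the -maxPrefSize and
  -maxSuffSize limits are not modelled here.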
  Just for comparison, here is the old table for the Currier version,
  still in the FSG encoding:

      state  nprefs  nsuffs  nwords  prodty  prefs/suffs
    ------- ------- ------- ------- -------  -----------------
        180      35       2      70      34  { TDZ8, TC2, TAR, RTC8, RSC8, ... }:{ (), G }
         60      30       2      60      29  { THZC, TDCC, TCHC, SCDZC, OHTC, ... }:{ 8G, G }
        186      15       2      30      14  { TC8O, PZA, PTO, PTC8A, OESCO, ... }:{ E, R }
        681       4       4      16       9  { TC8A, OEHA, O8A, GHA }:{ E, M, N, R }
         71       5       3      15       8  { 8TCC, 4OPTC, 4OHTC, DSC, 2OETC }:{ 8G, G, OE }
         85       8       2      16       7  { OHAES, OFT, EPT, EHC, 8GDC, ... }:{ 8G, C8G }
         58       7       2      14       6  { TR, SOE, SC8AE, OHOE, ODOE, ... }:{ (), 8G }
        206       4       3      12       6  { OHS, AET, 4ORC, 4ODS }:{ 8G, C8G, CG }
        233       4       3      12       6  { TCHZ, ETCDZ, 4OT, 4ODZ }:{ C8G, CG, G }
        255       4       3      12       6  { OCC, 8OETC, 8CC, 4OESC }:{ 8G, C8G, G }
        123       6       2      12       5  { TCCD, TC8T, SCCDC, ODZ, EDC8, ... }:{ CG, G }
        189       5       2      10       4  { SHZCG, SCOEO, OHC8G, 4ODT8G, ... }:{ (), E }
        193       3       3       9       4  { OPSC8, ODCC8, 4ODCC8 }:{ (), AE, G }
        301       5       2      10       4  { TDAR, ROR, OEOR, HC8G, 4OHOE }:{ (), OE }
        463       5       2      10       4  { TCC8, OET8, HTC8, GHCC8, CC8 }:{ AR, G }
         30       4       2       8       3  { O2, HAR, CCC2, 2AR }:{ (), AE }
        151       4       2       8       3  { ODS, GFT, EDT, 4O8C }:{ C8G, CG }
        304       2       4       8       3  { TPZ, 4OHS }:{ 8G, C8G, CG, G }
        229       2       3       6       2  { EDCC, 4ODTC }:{ 8, 8G, G }
        368       3       2       6       2  { ODAI, AI, 8CI }:{ IIL, R }
        372       3       2       6       2  { TCOET, OTC, DCT }:{ 8G, CG }
        408       3       2       6       2  { S8, POES8, 8SCC8 }:{ AE, G }
        588       2       3       6       2  { SODA, EORA }:{ E, M, N }
        662       2       3       6       2  { SCDC, ODCS }:{ 8G, CG, G }
        819       3       2       6       2  { SCAE, OEDCCG, ODC8G }:{ (), R }