Hacking at the Voynich manuscript - Side notes
502 Studying the Voynich "words" with finite automata

Last edited on 1999-01-31 06:19:08 by stolfi

OBSOLETE

  This is a remake of work from Notebook-1.txt,
  originally done around 97-07-05.

  Summary of previous relevant tasks:

    I obtained Landini's interlinear transcription of the VMs, version
    1.6 (landini-interln16.evt) from
    http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]
    
    Around 97-11-01 I split landini-interln16.evt into many files, with one
    text unit per page. [Notebook-12.txt]
    
    On 97-11-05 I mapped those files from FSG and other ad-hoc
    alphabets to EVA.  [Notebook-12.txt] The files are
    L16-eva/fNNxx.YY, and a machine-readable description of their
    contents and logical order is in L16-eva/INDEX.
    
    Then I went back and started redoing some of the previous tasks
    using the new encoding.
    
    I extracted the Currier (;C>) and Friedman-I (;F>)
    versions of the "bio" section, in EVA alphabet, as files
    bio-{c,f}-eva.evt. I also built the associated text files and word lists
    bio-{c,f}-eva{-{gut,fun,bad},{wds,dic,frq},.txt}. [Note-001.txt]

97-11-06 stolfi
===============

  I created automata for the two versions:
  
    cat bio-c-eva-gut.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump Note-002/c.dmp

    cat bio-f-eva-gut.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump Note-002/f.dmp

     strings  letters    states   finals     arcs  sub-sts lets/arc
    -------- --------  -------- -------- -------- -------- --------
        1556     9448       861      162     2067     1757    4.571  Currier
        1342     7656       621      146     1627     1385    4.706  Friedman
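
  A minimal sketch of how counts like these could be obtained, assuming
  the automaton is the minimal deterministic acyclic automaton of the
  word list. MaintainAutomaton's own conventions (for instance whether
  a null state is counted, or what "sub-sts" measures) are not stated
  in this note, so the sketch need not reproduce the table exactly. The
  function names and the toy word list are made up for illustration.
  The last column of the table is consistent with letters divided by
  arcs (9448/2067 = 4.571, 7656/1627 = 4.706), and the sketch computes
  it the same way.

```python
def build_trie(words):
    """Return a trie as a list of [is_final, {letter: child_index}] nodes."""
    nodes = [[False, {}]]
    for w in words:
        cur = 0
        for ch in w:
            nxt = nodes[cur][1].get(ch)
            if nxt is None:
                nodes.append([False, {}])
                nxt = len(nodes) - 1
                nodes[cur][1][ch] = nxt
            cur = nxt
        nodes[cur][0] = True
    return nodes


def minimize(nodes):
    """Merge trie nodes that accept the same set of suffixes."""
    registry = {}   # signature -> merged state id
    states = []     # merged state id -> (is_final, {letter: state id})
    state_of = {}   # trie node index -> merged state id
    # A child always has a larger index than its parent, so going through
    # the nodes in reverse index order visits children before parents.
    for i in range(len(nodes) - 1, -1, -1):
        final, arcs = nodes[i]
        sig = (final, tuple(sorted((ch, state_of[c]) for ch, c in arcs.items())))
        if sig not in registry:
            registry[sig] = len(states)
            states.append((final, dict(sig[1])))
        state_of[i] = registry[sig]
    return states, state_of[0]


def automaton_stats(words):
    """Count strings, letters, states, final states and arcs."""
    words = sorted(set(words))
    states, _root = minimize(build_trie(words))
    letters = sum(len(w) for w in words)
    arcs = sum(len(a) for _, a in states)
    return {
        "strings": len(words),
        "letters": letters,
        "states": len(states),
        "finals": sum(1 for final, _ in states if final),
        "arcs": arcs,
        "lets/arc": round(letters / arcs, 3),
    }


# Toy input (not taken from the transcription files).
print(automaton_stats(["qokedy", "qokey", "okedy", "okey"]))
```

  To try it on a real word list, one would read the words from a file
  such as bio-c-eva-gut.dic, assuming it holds one word per line.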
        
  Hmmm... Perhaps Friedman's is the better version after all. (Or maybe the
  one most affected by premature conclusions about the vocabulary...)


  I ran AutoAnalysis, looking for unproductive states (1 word or fewer) and
  strange words.  First, the Currier file:
  
    nice AutoAnalysis \
      -load Note-002/c.dmp \
      -unprod Note-002/c-1-unp.sts \
        -maxUnprod 1 \
        -unprodSugg Note-002/c-1-unp.sugg

        167 unproductive states
        167 strange words (with repetitions) listed
         85 strange words (without repetitions) listed

  Here are the "strangest" words in the Currier file:

    ashyteed chedyqoty chedytchd cheiror cheoltainsy cheyqytaiin
    chlchpsheey cholkeeey chsheckhy cthykorol daiilaldy daiinaryly
    dalkeeeyry dchckhedy dlchery doleesolchey dorchedchey dylches
    dyqokol eckhal elshedy epaloir fchedykary kedyposhy koroiiny
    kotaly lchediim lchedyrpchedy lfcheal lldyr lolkedykainol
    ofalsheky okalchedytory olshalsy orairaro oshepols oyshey
    paldarairal pcheykeear pdalshor polteshol poralshy pshchedal
    psholpchcfhdy qekytedar qodched qokarchckhy qokylddy qokyrlshe
    qookaiin qoqokeey qotlolkal qotoralom qotytytor qpolkain qsolkeedy
    qyrchs ralchl riil salchtedytar salshcthdy shedea shedkeoly
    sheorolkchdy shtalcheedy siiraiin sockhey sofcham ssholshecthy
    stolpchy tchalolkee tchdoltdy teealain teolsheol teyteg torolshsdy
    tyqoky ydolsheey yepaiin ykchdar ykedacphy yqolyqor yrotey
    ytchorolky yteesoldy
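
  One possible reading of "unproductive state", sketched below. It
  assumes that a state of the minimal automaton corresponds to a set
  of suffixes, that the words passing through it are all its
  prefix+suffix combinations, and that a state is unproductive when
  that count is at most the -maxUnprod limit. This is a guess at what
  AutoAnalysis does, not a description of its code; the function names
  and the toy words are hypothetical.

```python
from collections import defaultdict


def states_by_suffix_set(words):
    """Group every prefix of the word list by its set of completing suffixes.

    Two prefixes reach the same state of the minimal automaton exactly
    when they can be completed by the same set of suffixes.
    """
    right = defaultdict(set)            # prefix -> suffixes that complete it
    for w in set(words):
        for i in range(len(w) + 1):
            right[w[:i]].add(w[i:])
    states = defaultdict(list)          # suffix set -> prefixes reaching it
    for prefix, sufs in right.items():
        states[frozenset(sufs)].append(prefix)
    return states


def strange_words(words, max_unprod=1):
    """List the words passing through states used by at most max_unprod words."""
    listed = []
    for sufs, prefs in states_by_suffix_set(words).items():
        if len(prefs) * len(sufs) <= max_unprod:
            listed.extend(p + s for p in prefs for s in sufs)
    return listed


# Toy input: eight regular words plus one word that shares nothing with them.
regular = [p + s for p in ("ok", "qok", "ch", "sh") for s in ("edy", "ey")]
listed = strange_words(regular + ["ralchl"])
print(len(listed), "strange words (with repetitions) listed")
print(len(set(listed)), "strange words (without repetitions) listed")
print(sorted(set(listed)))
```

  Under this reading a word is listed once for each unproductive state
  on its path, which would explain why the count with repetitions
  equals the number of unproductive states.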

  Now Friedman's:

    nice AutoAnalysis \
      -load Note-002/f.dmp \
      -unprod Note-002/f-1-unp.sts \
        -maxUnprod 1 \
        -unprodSugg Note-002/f-1-unp.sugg

       72 unproductive states
       72 strange words (with repetitions) listed
       47 strange words (without repetitions) listed

  Here are the "strangest" words in Friedman's version:

    alocfhy alshees cfhdarol chlaiiin chlchpsheey cholkeeey darcheedal
    dchckhedy epaloir kotaly loiinm lshedary oddchey okarolom okiiny
    olkeeyolor olkeeysheol olpockhy oroka orsheedy oshepols otarodl
    palchd pchcfhdy polteshol pschedal pykedy qcthdys qodched
    qokeedyqokar qokeeylshedy qokylddy qopalo qsolkeedy ralchl rcheald
    reataiin sockhey sokcheed tchdoltdy teaeolain teytem tocthey
    torolshsdy tyqoky yepaiin ykchdar

  Indeed, Friedman's seems more regular...
  
  Here are the 17 words that are "strange" in both versions:
  
    chlchpsheey cholkeeey dchckhedy epaloir kotaly oshepols polteshol
    qodched qokylddy qsolkeedy ralchl sockhey tchdoltdy torolshsdy
    tyqoky yepaiin ykchdar
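
  The shared words can be rechecked with a plain set intersection.
  The sketch below copies the two "strangest" lists from this note;
  the variable names are arbitrary.

```python
currier = """
ashyteed chedyqoty chedytchd cheiror cheoltainsy cheyqytaiin
chlchpsheey cholkeeey chsheckhy cthykorol daiilaldy daiinaryly
dalkeeeyry dchckhedy dlchery doleesolchey dorchedchey dylches
dyqokol eckhal elshedy epaloir fchedykary kedyposhy koroiiny
kotaly lchediim lchedyrpchedy lfcheal lldyr lolkedykainol
ofalsheky okalchedytory olshalsy orairaro oshepols oyshey
paldarairal pcheykeear pdalshor polteshol poralshy pshchedal
psholpchcfhdy qekytedar qodched qokarchckhy qokylddy qokyrlshe
qookaiin qoqokeey qotlolkal qotoralom qotytytor qpolkain qsolkeedy
qyrchs ralchl riil salchtedytar salshcthdy shedea shedkeoly
sheorolkchdy shtalcheedy siiraiin sockhey sofcham ssholshecthy
stolpchy tchalolkee tchdoltdy teealain teolsheol teyteg torolshsdy
tyqoky ydolsheey yepaiin ykchdar ykedacphy yqolyqor yrotey
ytchorolky yteesoldy
""".split()

friedman = """
alocfhy alshees cfhdarol chlaiiin chlchpsheey cholkeeey darcheedal
dchckhedy epaloir kotaly loiinm lshedary oddchey okarolom okiiny
olkeeyolor olkeeysheol olpockhy oroka orsheedy oshepols otarodl
palchd pchcfhdy polteshol pschedal pykedy qcthdys qodched
qokeedyqokar qokeeylshedy qokylddy qopalo qsolkeedy ralchl rcheald
reataiin sockhey sokcheed tchdoltdy teaeolain teytem tocthey
torolshsdy tyqoky yepaiin ykchdar
""".split()

# Words flagged as strange in both versions, in alphabetical order.
common = sorted(set(currier) & set(friedman))
print(len(common), "words strange in both versions:")
print(" ".join(common))
```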
  

  Let's look for popular states and radicals, first in Currier's version:
  
    nice AutoAnalysis \
      -load Note-002/c.dmp \
      -classes Note-002/c-1.cls \
        -minClassPrefs 5 \
        -radicals Note-002/c-1.rad
        
     51 lexical classes found
    
    nice AutoAnalysis \
      -load Note-002/f.dmp \
      -classes Note-002/f-1.cls \
        -minClassPrefs 5 \
        -radicals Note-002/f-1.rad
        
     43 lexical classes found
    
  The output was not illuminating. I then ran AutoAnalysis with the
  "-prod" option, which I had previously added to list productive states.
  
    AutoAnalysis \
      -load Note-002/c.dmp \
      -prod Note-002/c-1-prd.pst \
      -minProductivity 1 \
      -maxPrefSize 30 \
      -maxSuffSize 30 
      
       38 productive states
      
  Here are the most productive states in the Currier automaton:

      state  nprefs  nsuffs  nwords  prodty  prefs/suffs
    ------- ------- ------- ------- -------  -----------------
         55      32       2      64      31  { rched, rar, qotees, qokeol, ... }:{ (), y }
         97      30       2      60      29  { solkee, sheckhe, qotshe, ... }:{ dy, y }
        129      25       2      50      24  { ytai, ydai, shai, rai, pai, ... }:{ in, n }
        264      12       2      24      11  { tolsheo, qa, qoa, qokea, ... }:{ l, r }
        320      10       2      20       9  { qolte, qoshe, orshe, lte, ... }:{ dy, edy }
         93       9       2      18       8  { sheet, sheeke, qeke, lked, ... }:{ ey, y }
        132       3       4      12       6  { olta, oda, cheda }:{ iin, in, l, r }
        136       7       2      14       6  { yteed, tched, rsheed, qotald, ... }:{ ar, y }
        130       6       2      12       5  { solka, shekai, roi, oloi, dyka, ... }:{ iin, in }
         56       3       3       9       4  { lcheda, dala, aro }:{ l, r, ry }
         96       5       2      10       4  { shl, shedal, qokain, oteol, ... }:{ (), dy }
        104       5       2      10       4  { otedai, seii, dyta, cheta, ... }:{ in, r }
        118       5       2      10       4  { polke, olksh, kolch, kech, ... }:{ dy, ey }
        124       5       2      10       4  { qode, yfch, oksh, odch, lkch }:{ edy, ey }
        125       3       3       9       4  { orche, lcheckh, checth }:{ edy, ey, y }
        361       3       3       9       4  { oee, dee, dolche }:{ dy, edy, y }
        436       3       3       9       4  { sheke, okesh, dyte }:{ dy, ey, y }
        475       3       3       9       4  { otai, olkai, kai }:{ in, n, r }
        484       3       3       9       4  { qore, qoksh, otsh }:{ dy, edy, ey }
         94       4       2       8       3  { ock, cholc, chedc, chcp }:{ hey, hy }
        174       4       2       8       3  { olks, kolc, kec, cheolc }:{ hdy, hey }
        408       4       2       8       3  { told, shd, polshd, dsheed }:{ al, y }
        580       4       2       8       3  { yfc, oks, odc, lkc }:{ hedy, hey }
        625       4       2       8       3  { solkc, qops, otals, lpc }:{ hdy, hedy }
         33       2       3       6       2  { chckhe, alche }:{ dy, s, y }
        116       3       2       6       2  { octhe, dchee, chech }:{ ol, y }
        126       2       3       6       2  { lcheck, chect }:{ hedy, hey, hy }
        164       2       3       6       2  { lolo, cheka }:{ l, m, r }
        166       3       2       6       2  { shepch, lsheckh, cheke }:{ edy, y }
        389       2       3       6       2  { salche, dolshe }:{ d, dy, ol }
        522       2       3       6       2  { qotche, kshe }:{ dy, ol, y }
        535       3       2       6       2  { otii, okeda, lchea }:{ m, r }
        583       2       3       6       2  { qokche, lkee }:{ d, dy, y }
        760       3       2       6       2  { tedy, qotain, olor }:{ (), ol }
        854       3       2       6       2  { shckhy, otoldy, opchedy }:{ (), lchey }
        880       2       3       6       2  { tai, soi }:{ iin, in, n }
        900       3       2       6       2  { qolt, qosh, orsh }:{ edy, eedy }
        958       2       3       6       2  { qoks, ots }:{ hdy, hedy, hey }
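
  The columns of this table are related in a simple way: in every row
  shown, nwords = nprefs * nsuffs and prodty = (nprefs-1) * (nsuffs-1).
  That formula is inferred from the numbers, not taken from the
  AutoAnalysis source. The sketch below rebuilds such rows from a word
  list under that assumption, identifying each state with its set of
  suffixes; state numbers are not reproduced, the function names are
  hypothetical, and the cutoff used here is only an illustration.

```python
from collections import defaultdict


def productive_states(words, min_productivity=1):
    """Return rows (prodty, nprefs, nsuffs, nwords, prefs, suffs), best first."""
    right = defaultdict(set)            # prefix -> suffixes that complete it
    for w in set(words):
        for i in range(len(w) + 1):
            right[w[:i]].add(w[i:])
    by_state = defaultdict(list)        # suffix set -> prefixes reaching it
    for prefix, sufs in right.items():
        by_state[frozenset(sufs)].append(prefix)
    rows = []
    for sufs, prefs in by_state.items():
        nprefs, nsuffs = len(prefs), len(sufs)
        prodty = (nprefs - 1) * (nsuffs - 1)   # assumed definition
        if prodty >= min_productivity:
            rows.append((prodty, nprefs, nsuffs, nprefs * nsuffs,
                         sorted(prefs), sorted(sufs)))
    rows.sort(key=lambda r: (-r[0], r[4]))
    return rows


def show(rows):
    """Print rows roughly in the layout of the table, without state numbers."""
    print(" nprefs  nsuffs  nwords  prodty  prefs/suffs")
    for prodty, nprefs, nsuffs, nwords, prefs, sufs in rows:
        p = ", ".join(x or "()" for x in prefs)   # "()" marks the empty string
        s = ", ".join(x or "()" for x in sufs)
        print(f"{nprefs:7d} {nsuffs:7d} {nwords:7d} {prodty:7d}  {{ {p} }}:{{ {s} }}")


# Toy input (not taken from the transcription files).
toy = [p + s for p in ("ok", "qok", "ch", "sh") for s in ("edy", "ey")]
show(productive_states(toy, min_productivity=2))
```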

  Here is Friedman's:

    AutoAnalysis \
      -load Note-002/f.dmp \
      -prod Note-002/f-1-prd.pst \
      -minProductivity 1 \
      -maxPrefSize 30 \
      -maxSuffSize 30 
      
       41 productive states

      state  nprefs  nsuffs  nwords  prodty  prefs/suffs
    ------- ------- ------- ------- -------  -----------------
         74      32       2      64      31  { sheckhe, qotche, qokes, qokeee, ... }:{ dy, y }
         69      27       2      54      26  { ytar, sheed, rshed, rched, rar, ... }:{ (), y }
        203      14       2      28      13  { shl, shol, shedal, rshee, ... }:{ (), dy }
         46      13       2      26      12  { tcha, qoa, qokea, qokeda, pcho, ... }:{ l, r }
          8      10       2      20       9  { ykai, rai, qolkai, otai, taii, ... }:{ in, n }
         84       4       4      16       9  { solke, qotsh, ckh, chcth }:{ dy, edy, ey, y }
        107       9       2      18       8  { salke, qode, qch, yfch, qoksh, ... }:{ edy, ey }
        232       9       2      18       8  { n, yty, sheolo, shcthey, qoty, ... }:{ (), l }
         85       3       4      12       6  { qots, ck, chct }:{ hdy, hedy, hey, hy }
        113       4       3      12       6  { qola, olta, oda, cheda }:{ iin, l, r }
        129       7       2      14       6  { sheet, sheeke, sheecth, olkesh, ... }:{ ey, y }
        293       7       2      14       6  { olte, qopsh, otalsh, ofch, ... }:{ dy, edy }
        370       4       3      12       6  { qotai, orai, olkai, kai }:{ in, n, r }
        452       4       3      12       6  { salche, qokche, polche, lkee }:{ d, dy, y }
        244       6       2      12       5  { yfc, qoks, olks, oks, lkc, ... }:{ hedy, hey }
        882       6       2      12       5  { solka, ska, qoteda, qora, sora, ... }:{ iin, r }
        127       5       2      10       4  { yteed, tshed, tched, oched, ... }:{ ar, y }
        154       5       2      10       4  { tedy, sheal, qotol, pshar, lar }:{ (), ol }
        323       5       2      10       4  { shd, qopchd, polshd, otd, ... }:{ al, y }
        388       3       3       9       4  { qokeo, opo, keo }:{ l, ly, r }
        492       5       2      10       4  { shepch, shecph, rolch, qckh, ... }:{ edy, y }
        493       5       2      10       4  { shepc, shecp, rolc, qck, ... }:{ hedy, hy }
        583       3       3       9       4  { shckhe, opche, okesh }:{ dy, ey, y }
        816       2       5      10       4  { qokch, polch }:{ dy, ed, edy, ey, y }
        817       2       5      10       4  { qokc, polc }:{ hdy, hed, hedy, hey, hy }
        112       4       2       8       3  { shka, qeta, olora, cheea }:{ iin, l }
        411       4       2       8       3  { qops, otals, ofc, lchcp }:{ hdy, hedy }
         83       2       3       6       2  { otsh, lte }:{ dy, edy, ey }
        101       3       2       6       2  { sheke, dke, chke }:{ dy, ey }
        108       2       3       6       2  { lcheckh, checth }:{ edy, ey, y }
        109       2       3       6       2  { lcheck, chect }:{ hedy, hey, hy }
        142       3       2       6       2  { okeda, lolo, cheka }:{ m, r }
        237       3       2       6       2  { oteo, olko, dairo }:{ l, ldy }
        267       2       3       6       2  { teed, daror }:{ (), am, y }
        294       2       3       6       2  { qolshe, dee }:{ dy, edy, y }
        403       2       3       6       2  { solche, kshe }:{ dy, ol, y }
        417       2       3       6       2  { lchea, lcheda }:{ l, m, r }
        529       3       2       6       2  { sheect, olkes, ock }:{ hey, hy }
        548       2       3       6       2  { tai, oii }:{ iin, in, n }
        705       2       3       6       2  { tar, opshed }:{ (), al, y }
       1026       2       3       6       2  { shea, qoto }:{ l, lol, r }

  Note that the top productivities are similar 
  (31, 29, 24, 11 for Currier; 31, 26, 13, 12 for Friedman);
  however, the corresponding states are quite different!

  Just for comparison, this is the old version of the Currier table,
  still in the FSG encoding:

      state  nprefs  nsuffs  nwords  prodty  prefs/suffs
    ------- ------- ------- ------- -------  -----------------
        180      35       2      70      34  { TDZ8, TC2, TAR, RTC8, RSC8, ... }:{ (), G }
         60      30       2      60      29  { THZC, TDCC, TCHC, SCDZC, OHTC, ... }:{ 8G, G }
        186      15       2      30      14  { TC8O, PZA, PTO, PTC8A, OESCO, ... }:{ E, R }
        681       4       4      16       9  { TC8A, OEHA, O8A, GHA }:{ E, M, N, R }
         71       5       3      15       8  { 8TCC, 4OPTC, 4OHTC, DSC, 2OETC }:{ 8G, G, OE }
         85       8       2      16       7  { OHAES, OFT, EPT, EHC, 8GDC, ... }:{ 8G, C8G }
         58       7       2      14       6  { TR, SOE, SC8AE, OHOE, ODOE, ... }:{ (), 8G }
        206       4       3      12       6  { OHS, AET, 4ORC, 4ODS }:{ 8G, C8G, CG }
        233       4       3      12       6  { TCHZ, ETCDZ, 4OT, 4ODZ }:{ C8G, CG, G }
        255       4       3      12       6  { OCC, 8OETC, 8CC, 4OESC }:{ 8G, C8G, G }
        123       6       2      12       5  { TCCD, TC8T, SCCDC, ODZ, EDC8, ... }:{ CG, G }
        189       5       2      10       4  { SHZCG, SCOEO, OHC8G, 4ODT8G, ... }:{ (), E }
        193       3       3       9       4  { OPSC8, ODCC8, 4ODCC8 }:{ (), AE, G }
        301       5       2      10       4  { TDAR, ROR, OEOR, HC8G, 4OHOE }:{ (), OE }
        463       5       2      10       4  { TCC8, OET8, HTC8, GHCC8, CC8 }:{ AR, G }
         30       4       2       8       3  { O2, HAR, CCC2, 2AR }:{ (), AE }
        151       4       2       8       3  { ODS, GFT, EDT, 4O8C }:{ C8G, CG }
        304       2       4       8       3  { TPZ, 4OHS }:{ 8G, C8G, CG, G }
        229       2       3       6       2  { EDCC, 4ODTC }:{ 8, 8G, G }
        368       3       2       6       2  { ODAI, AI, 8CI }:{ IIL, R }
        372       3       2       6       2  { TCOET, OTC, DCT }:{ 8G, CG }
        408       3       2       6       2  { S8, POES8, 8SCC8 }:{ AE, G }
        588       2       3       6       2  { SODA, EORA }:{ E, M, N }
        662       2       3       6       2  { SCDC, ODCS }:{ 8G, CG, G }
        819       3       2       6       2  { SCAE, OEDCCG, ODC8G }:{ (), R }