Hacking at the Voynich manuscript
Notebook - volume 1

Warning: these notebooks aren't strictly chronological logs.
  Sometimes I go back and redo things, clarify comments,
  delete garbage, etc.

97-07-05 stolfi
===============

  I obtained Landini's interlinear transcription of the VMs, version 1.6
  (landini-interln16.evt) from
  http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip
  
  I extracted manually from it a homogeneous, full-text sample
  bio-m-evt.evt, consisting of pages 147-166 of the "biological"
  section, in Currier's Language B, hand 2.  For this section the file has
  both Currier's and Friedman's transcriptions; Currier's seems to be the
  more complete of the two.

  I reduced this sample to a file bio-c-evt.txt in "plain" text format as follows:
  
    * I eliminated comments and Friedman's lines:
    
       cat bio-m-evt.evt \
         | egrep -v '^#' \
         | grep ';C>' \
         > bio-c-evt.evt

    * I eliminated (with Emacs) the line numbers <...> 
      at the beginning of each line;
      
    * I replaced 3 or more consecutive occurrences of "%" 
      by "%%"
      
    * I eliminated the strings of "!", present at end of some lines.
    
    * I replaced the end-of-line "-" by ".//."
    
    * I replaced the end-of-paragraph "=" by ".//.=."
      (Note that not all pages end in "=" in the original Currier file.)
      
    * I replaced "." by " " (with tr).
    
    * I removed " " at end of all lines.
    
     lines   words     bytes file        
    ------ ------- --------- ------------
       765    7230     38906 bio-c-evt.txt
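
  The manual clean-up steps listed above could also be scripted.  What
  follows is only a sketch: the function name clean_evt is made up here,
  and it assumes that every line begins with a locator of the form <...>
  and that the "!" fillers occur only at the end of a line, as described
  above.  The sample lines are invented, not real transcription data.

```shell
# Hypothetical helper (not part of the original notebook).
# Reads EVT-style lines on stdin and applies, in order:
#   1. drop the leading <...> locator
#   2. squeeze runs of three or more "%" down to "%%"
#   3. drop a trailing run of "!"
#   4. end-of-line "-"      ->  ".//."
#   5. end-of-paragraph "=" ->  ".//.=."
#   6. "." -> " "
#   7. drop trailing blanks
clean_evt() {
  sed \
    -e 's/^<[^>]*>//' \
    -e 's/%%%%*/%%/g' \
    -e 's/!!*$//' \
    -e 's/-$/.\/\/./' \
    -e 's/=$/.\/\/.=./' \
  | tr '.' ' ' \
  | sed -e 's/ *$//'
}

# Two invented lines, for illustration only:
printf '%s\n' '<f75r.P.1;C>PSC8G.OE%%%%TC8G-!!' '<f75r.P.2;C>4ODCC8G.8AM=' \
  | clean_evt
# prints:
#   PSC8G OE%%TC8G //
#   4ODCC8G 8AM // =
```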

  Next I computed the word frequencies:
  
    cat bio-c-evt.txt \
      | tr ' ' '\012' \
      | sort \
      | uniq -c \
      | sort -k1,1nr \
      > bio-c-evt.frq

    cat bio-c-evt.txt \
      | tr ' ' '\012' \
      | sort \
      | uniq \
      > bio-c-evt.dic

  I removed all "bad" words (with "%", "/", "=", "*"):
  
    cat bio-c-evt.dic \
      | grep -v '[%*/=]' \
      | egrep '.' \
      > bio-c-evt-gut.dic
      
     lines   words     bytes file        
    ------ ------- --------- ------------
      1851    1850     12117 bio-c-evt.dic
      1381    1381      8277 bio-c-evt-gut.dic

  I created an automaton for bio-c-evt-gut.dic:
  
    cat bio-c-evt-gut.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump bio-c-evt-gut.dmp

     strings  letters    states   finals     arcs  sub-sts lets/arc
    -------- --------  -------- -------- -------- -------- --------
        1381     6896       535      114     1633     1341    4.223

  I ran AutoAnalysis, looking for unproductive states (those used by 2 words
  or fewer) and strange words:
  
    nice AutoAnalysis \
      -load bio-c-evt-gut.dmp \
      -unprod bio-c-evt-gut-1-unp.sts \
        -maxUnprod 2 \
        -unprodSugg bio-c-evt-gut-1-unp.sugg

        161 unproductive states
        255 strange words (with repetitions) listed
        200 strange words (without repetitions) listed

  I repeated the analysis, redefining a state as unproductive if it is used
  by only one word:
   
   
      nice AutoAnalysis \
        -load bio-c-evt-gut.dmp \
        -unprod bio-c-evt-gut-2-unp.sts \
          -maxUnprod 1 \
          -unprodSugg bio-c-evt-gut-2-unp.sugg
        
       67 unproductive states
       67 strange words (with repetitions) listed
       45 strange words (without repetitions) listed

  Here are the strange words:

      2AESHZ8G 2AIIRAE 42OEDCC8G 4GDAM 4O4ODCCG 4ODGE88G 8ETCRG C2C2
      CESC8G CPAEOIR DOHAEG DOROER EE8GR EEAIIRG EFTCAE EODAK EOIIIK
      FZ8OROE G4ODAN GCPAM GDT8AR GROHCG HCGHC6 HG4ODG HOROES28G
      HT8OEH8G IEPT8G ODAROEOK OGSCG OHTOHAR OPTOEOR OSCPOE2 P8AESOR
      PGDC8G POECC8ARAL POEHCSOE POHTODAR PSTC8AE SC8CA SETCR TCIROR
      TEAIIIL TETPSCCG TOEDCCCG TSCDZG
  
  Looked for popular states and radicals:
  
    nice AutoAnalysis \
      -load bio-c-evt-gut.dmp \
      -classes bio-c-evt-gut-1.cls \
        -minClassPrefs 5 \
        -radicals bio-c-evt-gut-1.rad
        
     37 lexical classes found
    
  Modified AutoAnalysis, adding a new command "-prod" to list
  productive states.
  
    AutoAnalysis \
      -load bio-c-evt-gut.dmp \
      -prod bio-c-evt-gut-1-prd.pst \
      -minProductivity 1 \
      -maxPrefSize 30 \
      -maxSuffSize 30 
      
       25 productive states
      
  Here they are:

      state  nprefs  nsuffs  nwords  prodty  prefs/suffs
    ------- ------- ------- ------- -------  -----------------
        180      35       2      70      34  { TDZ8, TC2, TAR, RTC8, RSC8, ... }:{ (), G }
         60      30       2      60      29  { THZC, TDCC, TCHC, SCDZC, OHTC, ... }:{ 8G, G }
        186      15       2      30      14  { TC8O, PZA, PTO, PTC8A, OESCO, ... }:{ E, R }
        681       4       4      16       9  { TC8A, OEHA, O8A, GHA }:{ E, M, N, R }
         71       5       3      15       8  { 8TCC, 4OPTC, 4OHTC, DSC, 2OETC }:{ 8G, G, OE }
         85       8       2      16       7  { OHAES, OFT, EPT, EHC, 8GDC, ... }:{ 8G, C8G }
         58       7       2      14       6  { TR, SOE, SC8AE, OHOE, ODOE, ... }:{ (), 8G }
        206       4       3      12       6  { OHS, AET, 4ORC, 4ODS }:{ 8G, C8G, CG }
        233       4       3      12       6  { TCHZ, ETCDZ, 4OT, 4ODZ }:{ C8G, CG, G }
        255       4       3      12       6  { OCC, 8OETC, 8CC, 4OESC }:{ 8G, C8G, G }
        123       6       2      12       5  { TCCD, TC8T, SCCDC, ODZ, EDC8, ... }:{ CG, G }
        189       5       2      10       4  { SHZCG, SCOEO, OHC8G, 4ODT8G, ... }:{ (), E }
        193       3       3       9       4  { OPSC8, ODCC8, 4ODCC8 }:{ (), AE, G }
        301       5       2      10       4  { TDAR, ROR, OEOR, HC8G, 4OHOE }:{ (), OE }
        463       5       2      10       4  { TCC8, OET8, HTC8, GHCC8, CC8 }:{ AR, G }
         30       4       2       8       3  { O2, HAR, CCC2, 2AR }:{ (), AE }
        151       4       2       8       3  { ODS, GFT, EDT, 4O8C }:{ C8G, CG }
        304       2       4       8       3  { TPZ, 4OHS }:{ 8G, C8G, CG, G }
        229       2       3       6       2  { EDCC, 4ODTC }:{ 8, 8G, G }
        368       3       2       6       2  { ODAI, AI, 8CI }:{ IIL, R }
        372       3       2       6       2  { TCOET, OTC, DCT }:{ 8G, CG }
        408       3       2       6       2  { S8, POES8, 8SCC8 }:{ AE, G }
        588       2       3       6       2  { SODA, EORA }:{ E, M, N }
        662       2       3       6       2  { SCDC, ODCS }:{ 8G, CG, G }
        819       3       2       6       2  { SCAE, OEDCCG, ODC8G }:{ (), R }
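
  Every row of this table is consistent with nwords = P * S and
  prodty = (P - 1) * (S - 1), where P is the number of prefixes and S the
  number of suffixes in the last column.  This relation is inferred from
  the numbers listed here, not taken from the AutoAnalysis source, and the
  function name below is hypothetical.  A small check:

```shell
# Hypothetical helper: given the sizes of a state's prefix set and
# suffix set, print the word count and the productivity that the
# table appears to use (an inferred relation, not a documented one).
productivity() {
  awk -v p="$1" -v s="$2" 'BEGIN { print p * s, (p - 1) * (s - 1) }'
}

productivity 35 2   # state 180 in the table; prints: 70 34
productivity 4 4    # state 681 in the table; prints: 16 9
```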

  There appear to be two major kinds of words: those that inflect
  with [M,E,N,R] and those that inflect with [(),G,8G,CG,C8G], or
  [G] for short.
  
  I collected the radicals of the class [M,N,E,R]:
  
    cat bio-c-evt.frq \
      | egrep -v '[%*]' \
      | egrep '[MENR]$' \
      | sort -k2,2 \
      > foo.frq

  Here are some popular radicals of the [M,E,N,R] class, with approximate
  counts of the four inflections:
  
    4ODA(383) O(238) 8A(225) ODA(111) 2A(94) 4O(93) OHA(79) A(75) 4OHA(73) 2O(49)
    OEDA(41) RA(35) SCO(31) EO(31) TCO(30) OEO(28) DA(24) 4ODO(23) RO(22)  ''(20)
    HA(19) EDA(15) SO(14) SA(13) TC8A(12) OEA(11) PO(11) SC8A(10) TA(10)
    EA(9) HO(9) ORA(9) SCA(9) 
    
  Note that, apart from the empty radical '', these radicals all end in "A" or "O".
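
  One possible way to tally such radical counts is sketched below.  It
  assumes input lines in the "count word" format written by uniq -c; the
  function name radical_counts is hypothetical, and the sample counts are
  invented.

```shell
# Hypothetical helper: reads "count word" lines (as written by uniq -c),
# strips a final M, E, N, or R from each matching word, and sums the
# counts of the resulting radicals.  Words with other endings are
# ignored.  Output is "total radical", largest total first.
radical_counts() {
  awk '$2 ~ /[MENR]$/ {
         radical = substr($2, 1, length($2) - 1)
         total[radical] += $1
       }
       END { for (r in total) print total[r], r }' \
  | sort -k1,1nr -k2,2
}

# Invented counts, for illustration only:
printf '%s\n' ' 10 4ODAM' '  7 4ODAE' '  3 OR' '  5 TCG' | radical_counts
# prints:
#   17 4ODA
#   3 O
```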
  
  I collected the radicals of the class [G]:
  
    cat bio-c-evt.frq \
      | egrep -v '[%*]' \
      | egrep '[G]$' \
      | sort -k2,2 \
      > foo.frq

  Looking at the most frequent words in this file, it seems that the
  main suffixes of this class are actually, in decreasing frequency
  order for each radical:
    
    4OD{C8G,CC8G,CCG,CG,G,SC8G,T8G,TC8G,TCG,TG,ZCG,ZG}
    4OH{C8G,CC8G,G,CG,CCG,TC8G,TG}
    8{G,SC8G,TC8G,C8G,CC8G,...} 
    D{C8G,CC8G,TCG,CCC8G,CG}
    E{TC8G,SC8G,TCG,8G,DC8G,G,DCC8G,SCG,TG,SCC8G,OEG,SCCG,...}
    GD{CC8G,CCG,C8G,}
    GH{C8G,CCG,CC8G}
    H{C8G,CC8G,CCG}
    OD{C8G,CC8G,CCG,G,CG,TCG,SCG,T8G,CSG,SC8G,TG,ZG,CCC8G,...}
    OE{G,TC8G,SC8G,TCG,8G,...}
    OED{C8G,CCG,CC8G,...}
    OH{C8G,CC8G,CCG,G,CG,C8CCG,CCC8G,...}
    OP{TC8G,SC8G,TCG,TCCG,...}
    S{C8G,CG,CC8G,CCG,G,8G,...}
    T{C8G,CG,CCG,CC8G,8G,G,...}
    
  So it seems that the [G] class is actually [C8G,CC8G,CCG,CG,G].
  However the preceding letter is quite often [S,T,C,Z,D], so the 
  class may include [SC8G,TC8G,...].

  I extracted the words with those popular radicals combined with 
  the selected endings, and counted their frequencies:
  
    /bin/rm .foo
    foreach f ( '' 4OD E OH 4OH OD 8 OE H D GD 4OED )
      echo 'Prefix "'"${f}"'":' >> .foo
      echo ' ' >> .foo
      cat bio-c-evt.frq \
        | egrep -v '[%*]' \
        | egrep '	'"${f}"'[STZ]?[C]*[8]?[G]?$' \
        | sed -e 's@[	]'"${f}"'@ @' \
        | sort -b -k1,1nr \
        | format-counts \
        | sed -e 's@^@  @' \
        >> .foo
      echo ' ' >> .foo
    end

    Prefix "":

      SC8G(210) TC8G(193) SCG(78) TCG(74) 8G(41) TCCG(34) SCC8G(33)
      SCCG(30) TCC8G(21) T8G(19) SG(9) G(5) S8G(5) TC8(5) TG(5)
      CC8G(3) S(2) SC8(2) (1) 8(1) CCC8G(1) CCG(1) SCC8(1) TCCCG(1)

    Prefix "4OD":

      C8G(159) CC8G(149) CCG(83) G(58) CG(39) T8G(10) TC8G(7) SC8G(6)
      TG(5) C8(4) CC8(4) ZCG(4) (3) CCCG(3) TCG(3) CCC8G(2) 8G(1)
      S8G(1) SCG(1) TC8(1) ZC8G(1) ZG(1)

    Prefix "E":

      TC8G(55) SC8G(23) TCG(18) 8G(9) (7) G(7) SCG(6) TG(6) SCC8G(5)
      SCCG(4) T8G(4) TCCG(4) TC8(3) S8G(2) SC8(2) TCC8G(2) 8(1) C(1)
      CC8G(1) S(1) T(1)

    Prefix "OH":

      C8G(48) CC8G(25) CCG(18) G(17) CG(13) SC8G(4) S8G(3) TC8G(3)
      TCG(3) (2) SCG(2) TG(2) C8(1) CCC8G(1) T8G(1) ZCG(1) ZG(1)

    Prefix "4OH":

      C8G(46) CC8G(39) G(26) CG(8) CCG(7) TC8G(3) TG(3) CC8(2) SC8G(2)
      TCG(2) (1) CCC8G(1) S8G(1) SCG(1) SG(1) T8G(1) ZCG(1)

    Prefix "OD":

      C8G(43) CC8G(32) CCG(16) G(12) CG(11) TCG(6) SCG(3) T8G(3)
      CC8(2) SC8G(2) TG(2) ZG(2) (1) CCC8G(1) T8(1) TC8G(1) ZCG(1)

    Prefix "8":

      G(41) SC8G(17) TC8G(9) C8G(2) CC8G(2) SCC8G(2) SCCG(2) SCG(2)
      T8G(2) TCG(2) (1) CCC8G(1) CCG(1) S8G(1) SC8(1) TCC8G(1) TCCG(1)
      TG(1)

    Prefix "OE":

      (176) G(26) TC8G(24) SC8G(14) TCG(13) 8G(12) SCG(8) TG(5) S8G(4)
      T8G(3) TCCG(3) SC8(2) SCCG(2) TCC8G(2) CCC8(1) CCC8G(1) SCC8(1)
      SCC8G(1) TC8(1)

    Prefix "H":

      C8G(20) TC8G(10) CC8G(4) CCG(4) SC8G(3) SCG(3) ZC8G(3) ZCG(3)
      G(2) TCG(2) Z8G(2) T8G(1) TCCG(1) TG(1)

    Prefix "D":

      C8G(11) CC8G(10) TCG(4) CC8(3) (2) CCC8G(2) CG(2) SCG(2) T8G(2)
      ZCG(2) S8G(1) SC8G(1) TC8G(1) Z8G(1) ZC8G(1) ZG(1)

    Prefix "GD":

      CC8G(10) CCG(7) C8G(5) CCCG(1) SC8G(1) ZCG(1)

    Prefix "4OED":

      CC8G(4) CCG(4) G(4) CG(1)

  It seems that the class consists of the suffixes [S,T,Z,][C]*[8,][G]
  (that is, an optional S, T, or Z, then any number of C's, an optional 8,
  and a final G).  I collected all prefixes with these suffixes:

    /bin/rm .bar
    cat bio-c-evt.frq \
      | egrep -v '[%*]' \
      | egrep 'G$' \
      | /bin/sed \
          -e 's@[	]@	:@' \
          -e 's@[	]\(.*\)\([STZ][C]*[8]\{0,1\}G\)$@ \1- \2@' \
          -e 's@[	]\(.*[^STZC]\)\(C[C]*[8]\{0,1\}G\)$@ \1- \2@' \
          -e 's@[	]\(.*[^STZC]\)\(8G\)$@ \1- \2@' \
          -e 's@[	]\(.*[^STZC8]\)\(G\)$@ \1- \2@' \
          -e 's@ :@ @' \
      | sort -b -k3,3 -k1,1nr \
      > .bar

    Suffix "8G":
    
      (41) OE(12) E(9) 8AE(4) 4O(3) DOE(3) O(3) SOE(3)
      4ODG(2) 4ODO(2) 4OHAE(2) PSCOE(2) 2OE(1) 4OD(1) 4ODA(1)
      4ODAE(1) 4ODC2(1) 4ODGE8(1) 4ODOE(1) 4ODT2(1) 4OE(1)
      4OHAR(1) 4OP(1) 8AIRG(1) 8AR(1) A(1) AEAE(1) EDCOE(1)
      EO(1) EOE(1) G(1) GSCAE(1) HOROES2(1) HT8OEH(1) ODA(1)
      ODAE(1) ODOE(1) OEAO(1) OEHG(1) OHAE(1) OHCCO(1) OHCOE(1)
      OHOE(1) OHTR(1) ORO(1) PZO(1) SC8AE(1) SCCOE(1) SCOE(1)
      SE(1) TO(1) TR(1)

    Suffix "C8G":
    
      4OD(159) OH(48) 4OH(46) OD(43) H(20) OED(18) D(11) ED(7)
      GH(7) GD(5) OEH(3) TD(3) 8(2) 8D(2) EH(2) SCD(2) TCH(2)
      2OED(1) 4(1) 4CH(1) 4ODC8(1) 4ODO(1) 4OR(1) 8GD(1)
      8GH(1) 8OD(1) 8OE(1) 8OED(1) AED(1) CD(1) OR(1) PGD(1)
      TH(1)
    
    Suffix "CC8G":
    
      4OD(149) 4OH(39) OD(32) OH(25) OED(13) D(10) GD(10)
      4O(6) ED(6) 4OED(4) H(4) (3) 2OED(3) GH(3) 2OD(2) 4(2)
      8(2) 8GD(2) EH(2) OEH(2) 2OEH(1) 42OED(1) 4O8(1)
      4OCCD(1) 4OEH(1) 4OR(1) E(1) EO(1) G8(1) HD(1) HOEOD(1)
      HSCOD(1) LED(1) O(1) OCD(1) PO(1) TCD(1) TD(1)
    
    Suffix "CCC8G":
    
      4O(2) 4OD(2) D(2) (1) 4(1) 4OH(1) 8(1) O(1) OD(1)
      OE(1) OH(1)
    
    Suffix "CCCG": 
    
      4OD(3) OED(2) GD(1) TOED(1)
    
    Suffix "CCG": 
    
      4OD(83) OED(18) OH(18) OD(16) 4OH(7) GD(7) 4OED(4) GH(4)
      H(4) 2OED(3) ED(3) 8D(2) (1) 2AED(1) 2D(1) 4(1) 4CC8(1)
      4CD(1) 4O(1) 4O4OD(1) 4O8(1) 4OR(1) 8(1) 8OH(1) EO(1)
      EP(1) O(1) OEOED(1) OHC8(1) POED(1) SCCD(1) SCD(1) TD(1)
      TGH(1)
    
    Suffix "CG":
    
      4OD(39) OH(13) OD(11) 4OH(8) OED(5) TCH(3) D(2) SCD(2)
      TCCD(2) 2OED(1) 4(1) 4CD(1) 4O8GD(1) 4OED(1) 8GH(1)
      AH(1) CC2(1) ED(1) EDC8(1) GH(1) GROH(1) HOED(1)
      OEHS8(1) SCCD(1) SCCH(1) SCH(1) TCD(1)
    
    Suffix "G":
    
      4OD(58) 4OH(26) OE(26) OH(17) OD(12) TCD(12) OED(11)
      SCD(11) TCCD(11) SCCD(8) 4OE(7) 8AE(7) E(7) SCCH(7)
      4ODAE(6) OR(6) SCH(6) (5) 4OED(4) 8AR(4) AE(4) EOE(4)
      TCH(4) AR(3) ED(3) OHAE(3) 4ODCO(2) AM(2) ESCH(2) H(2)
      ODAE(2) OEOD(2) OROE(2) R(2) RTCD(2) SCOD(2) SD(2)
      TCAE(2) TCCH(2) TH(2) 2(1) 2AE(1) 2OE(1) 2OED(1) 2SCD(1)
      2TCH(1) 4CH(1) 4ODAO(1) 4ODCOE(1) 4ODO(1) 4ODOP(1)
      4OEDAR(1) 4OEDCCOE(1) 4OGD(1) 4OHAE(1) 4OHAR(1) 4OHCC2(1)
      4OP(1) 8AEAR(1) 8AED(1) 8AROR(1) 8ETCR(1) 8GD(1) 8GR(1)
      8OE(1) A(1) A2(1) AEOR(1) CH(1) DAR(1) DOEP(1) DOHAE(1)
      EAE(1) EEAIIR(1) ETC8AR(1) GH(1) GO(1) GOD(1) GR(1)
      HAE(1) HG4OD(1) O4OD(1) ODAN(1) ODGED(1) ODO(1) OE2AE(1)
      OEAM(1) OEDOR(1) OESH(1) OFOE(1) OHAR(1) OHOR(1) ONOE(1)
      OOR(1) OP(1) OPAE(1) OPOE(1) OPTAE(1) OPTCCD(1) OROIR(1)
      POE8AD(1) POETO(1) RAE(1) RAR(1) ROE(1) SCGD(1) SCO(1)
      SH(1) SOD(1) TAE(1) TAEOE(1) TAR(1) TC2(1) TCCCH(1)
      TCO(1) TCOD(1) TCOE(1) TCP(1) TD(1) TDCA(1) THZO(1)
      TOE(1) TOH(1)
    
    Suffix "S8G":
    
      (5) OE(4) OH(3) 8AE(2) E(2) 4OD(1) 4OE(1) 4OH(1) 4OP(1)
      8(1) D(1) ODC(1) OF(1) OHAE(1) OP(1) P(1) POE(1) R(1)
      SOD(1)
    
    Suffix "SC8G":
    
      (210) E(23) 8(17) OE(14) G(8) 4OD(6) R(6) 2(5) OH(4) 4OP(3)
      H(3) OP(3) 4OH(2) OD(2) OEH(2) 4(1) 4O(1) 4OE(1) 4OHC8(1)
      8AE(1) 8E(1) AE(1) CE(1) D(1) GD(1) O(1) OHAE(1) OPAE(1)
      P(1) POE8(1)
    
    Suffix "SCC8G":
    
      (33) E(5) 4OE(2) 8(2) R(2) 2(1) 4O(1) G(1) O(1) OE(1)
      OR(1) TP(1)
    
    Suffix "SCCG":
    
      (30) E(4) G(3) 8(2) OE(2) TETP(1)
    
    Suffix "SCG": 
    
      (78) OE(8) E(6) G(4) H(3) O(3) OD(3) 2(2) 8(2) D(2)
      OH(2) 2OE(1) 4OD(1) 4OE(1) 4OH(1) 8D(1) G8AR(1) GDC(1)
      ODC(1) OED(1) OEDC(1) OG(1) OP(1) POE(1) R(1) SCCP(1)
    
    Suffix "SG":
    
      (9) ODC(2) 4OE(1) 4OF(1) 4OH(1) 8GD(1) AE(1) HOE(1)
      POE(1)
    
    Suffix "T8G":
    
      (19) 4OD(10) E(4) OD(3) OE(3) 8(2) D(2) P(2) 2(1)
      2SD(1) 4CD(1) 4ODC(1) 4ODG(1) 4OE(1) 4OH(1) 4OP(1) AE(1)
      DC(1) DOE(1) EP(1) ETP(1) GE(1) H(1) IEP(1) OF(1) OH(1)
      OP(1) POE(1) POEAE(1) S(1) TCOE(1)
    
    Suffix "TC8G":
    
      (193) E(55) OE(24) OP(13) P(13) H(10) 4OP(9) 8(9)
      4OD(7) 4OE(6) R(6) 2(5) 2OE(4) F(4) G(4) 4OH(3) OH(3)
      4ODC(2) ED(2) OEF(2) OEP(2) OF(2) POE(2) 4O(1) 4ODOE(1)
      4OF(1) 4OR(1) 4P(1) 8AE(1) 8OE(1) 8OEF(1) AE(1) CF(1)
      D(1) EO(1) EP(1) GF(1) O(1) O4OF(1) O8(1) OD(1) OEOH(1)
      SCCP(1) SCP(1) TCP(1) TP(1)
    
    Suffix "TCC8G":
    
      (21) E(2) OE(2) 4OE(1) 8(1) 8OE(1) G(1)
    
    Suffix "TCCCG":
    
      (1)

    Suffix "TCCG":
    
      (34) E(4) OE(3) OP(2) 2OD(1) 8(1) G(1) H(1) O(1) R(1)
    
    Suffix "TCG":
    
      (74) E(18) OE(13) 4OE(8) OD(6) R(5) D(4) P(4) 2OE(3)
      4OD(3) OH(3) OP(3) 4OH(2) 8(2) AE(2) H(2) 2(1) 4O(1)
      4ODAE(1) 4ODC(1) 4OF(1) 4OP(1) 8AR(1) 8OE(1) DC(1) DZ(1)
      ED(1) EDE(1) G(1) GF(1) GS(1) OEH(1) OEOE(1) OR(1)
      TC8(1) TCOE(1) TOE(1)
    
    Suffix "TG":
    
      E(6) (5) 4OD(5) OE(5) 4OH(3) OD(2) OEE(2) OH(2) SCP(2)
      2OE(1) 2P(1) 4CD(1) 4O(1) 4OE(1) 8(1) CP(1) DOR(1)
      GDCC(1) H(1) ODAE(1) ODC(1) OED(1) OPC(1) POE(1) R(1)
      SCCD(1) SCHZC8(1) TC(1) TC8(1)
    
    Suffix "Z8G":
    
      TD(5) H(2) SCD(2) TP(2) 2AESH(1) D(1) TCD(1) TH(1)
    
    Suffix "ZC8G":

      SD(6) TD(5) H(3) SCD(3) TH(2) 4D(1) 4OD(1) 8TD(1) D(1)
      ESCD(1) ETCD(1) ETP(1) SCH(1) SH(1) SOP(1) TCH(1) TCP(1)
      TP(1)
    
    Suffix "ZCCG":
    
      F(1)

    Suffix "ZCG": 
    
      TD(9) SD(5) 4OD(4) H(3) SCD(3) SH(3) D(2) P(2) TCD(2)
      TCH(2) TP(2) 2H(1) 2OD(1) 4H(1) 4OH(1) AH(1) CP(1)
      ETCD(1) GD(1) GTCD(1) HOH(1) OD(1) OH(1) SCCH(1) TCCH(1)
      TF(1) TH(1)
    
    Suffix "ZG":
    
      TD(31) SD(23) TCD(21) TH(20) SCD(19) TCH(17) SCH(12)
      SH(12) ETCD(3) ESCD(2) OD(2) OETH(2) SCP(2) TP(2) 2D(1)
      2SCD(1) 2TD(1) 2TH(1) 4H(1) 4OD(1) 4ODCTD(1) 8AD(1)
      8SCD(1) D(1) ESD(1) ESH(1) ETH(1) GTCH(1) HTH(1)
      OEPOD(1) OETCD(1) OH(1) RAH(1) SCCD(1) SOD(1) TSCD(1)
    
  From looking at these distributions, it seems that the "S", "T",
  and "Z" are actually part of the roots;  and that either the "G" 
  suffix also occurs in words of unrelated classes, or there
  are more suffixes that end in "G" within this class.
  
  Here again are the prefixes of "G" that are not in the other classes,
  this time sorted by last letter of prefix:
  
      (5) 
      
      2(1) A2(1) 4OHCC2(1) TC2(1) 
      
      A(1) TDCA(1) 
      
      4OD(58) OD(12) TCCD(11) SCD(11) OED(11) SCCD(8) POE8AD(1)
      OPTCCD(1) 2SCD(1) TCD(12) 4OED(4) RTCD(2) ED(3) 8AED(1) ODGED(1)
      2OED(1) 8GD(1) SCGD(1) 4OGD(1) HG4OD(1) O4OD(1) SCOD(2) TCOD(1)
      OEOD(2) GOD(1) SOD(1) SD(2) TD(1)
      
      OE(26) 8AE(7) E(7) AE(4) 2AE(1) OE2AE(1) TCAE(2) ODAE(2)
      4ODAE(6) EAE(1) HAE(1) OHAE(3) 4OHAE(1) DOHAE(1) OPAE(1) RAE(1)
      TAE(1) OPTAE(1) 2OE(1) 4OE(7) 8OE(1) 4OEDCCOE(1) 4ODCOE(1)
      TCOE(1) EOE(4) TAEOE(1) OFOE(1) ONOE(1) OPOE(1) ROE(1) OROE(2)
      TOE(1)
      
      4OH(26) OH(17) SCCH(7) SCH(6) TCH(4) H(2) CH(1) 4CH(1) TCCCH(1)
      TCCH(2) ESCH(2) 2TCH(1) GH(1) TOH(1) SH(1) OESH(1) TH(2)
      
      AM(2) OEAM(1)
      
      ODAN(1)
      
      4ODCO(2) 4ODAO(1) SCO(1) TCO(1) ODO(1) 4ODO(1) GO(1)
      POETO(1) THZO(1)
      
      TCP(1) DOEP(1) OP(1) 4OP(1) 4ODOP(1)
      
      OR(6) 8AR(4) R(2) AR(3) ETC8AR(1) DAR(1) 4OEDAR(1) 8AEAR(1)
      OHAR(1) 4OHAR(1) RAR(1) TAR(1) 8ETCR(1) GR(1) 8GR(1)
      EEAIIR(1) OROIR(1) OEDOR(1) AEOR(1) OHOR(1) OOR(1)
      8AROR(1)
  
  There is a hint that "DG" and "HG" may be related suffixes.
  Indeed, there is a hint that the final "D" and "H" in most of these
  stems may be part of the suffix. 

  So here is the new guess about the main class of words that
  end in "G":
  
    {DAE8G,HAE8G,
     8G,
     DC8G,HC8G,
     DCC8G,HCC8G,
     DCCC8G,HCCC8G,CCC8G,
     DCCCG,
     EDCCG,DCCG,HCCG,
     EDCG,DCG,HCG,
     DG,HG,EG,EDG,CCDG,CCHG,
     ES8G,DS8G,HS8G,
     ESC8G,DSC8G,HSC8G,SC8G,
     ESCC8G,SCC8G,TCC8G,CC8G,
     ESCCG,CCG,
     DSCG,HSCG,ESCG,
     ESG,HSG,
     ET8G,DT8G,HT8G,PT8G,
     ETC8G,DTC8G,HTC8G,PTC8G,TC8G,
     ETCC8G,
     ETCCG,PTCCG,DTCCG,
     ETCG,DTCG,HTCG,PTCG,DAETCG,DCTCG,
     ETG,DTG,HTG,DCTG,DAETG,
     DZ8G,HZ8G,
     DZC8G,HZC8G,
     DZCG,HZCG,PZCG,
     DZG,CDZG,CHZG
   }

  I tried to separate these suffixes with a giant "sed":
  
    /bin/rm .bax
    cat bio-c-evt.frq \
      | egrep -v '[%*]' \
      | egrep 'G$' \
      | /bin/sed -f split-G-suffs.sed \
      | /bin/sed -e 's/:$//' \
      | sort -b -k3,3 -k1,1nr \
      > .bax

  The results were not very good, so I decided to look closely at the
  most popular prefixes among the [G] words, which are [4O,O,G,8,2]:

    /bin/rm .bax
    cat bio-c-evt.frq \
      | egrep -v '[%*]' \
      | egrep '[	](4O|O|G|8|2).*G$' \
      | /bin/sed \
        -e 's@[	]4O@	4O- @' \
        -e 's@[	]O@	O- @' \
        -e 's@[	]G@	G- @' \
        -e 's@[	]8@	8- @' \
        -e 's@[	]2@	2- @' \
      | sort -b -k2,2 -k1,1nr \
      > .bax

    Suffixes of "2": 
  
      SC8G(5) TC8G(5) OETC8G(4) OEDCC8G(3) OEDCCG(3) OETCG(3)
      ODCC8G(2) SCG(2) AEDCCG(1) AEG(1) AESHZ8G(1) DCCG(1) DZG(1) G(1)
      HZCG(1) ODTCCG(1) ODZCG(1) OE8G(1) OEDC8G(1) OEDCG(1) OEDG(1)
      OEG(1) OEHCC8G(1) OESCG(1) OETG(1) PTG(1) SCC8G(1) SCDG(1)
      SCDZG(1) SDT8G(1) T8G(1) TCG(1) TCHG(1) TDZG(1) THZG(1) (0)

    Suffixes of "8":
    
      G(41) SC8G(17) TC8G(9) AEG(7) AE8G(4) ARG(4) AES8G(2) C8G(2)
      CC8G(2) DC8G(2) DCCG(2) GDCC8G(2) SCC8G(2) SCCG(2) SCG(2) T8G(2)
      TCG(2) ADZG(1) AEARG(1) AEDG(1) AESC8G(1) AETC8G(1) AIRG8G(1)
      AR8G(1) ARORG(1) ARTCG(1) CCC8G(1) CCG(1) DSCG(1) ESC8G(1)
      ETCRG(1) GDC8G(1) GDG(1) GDSG(1) GHC8G(1) GHCG(1) GRG(1)
      ODC8G(1) OEC8G(1) OEDC8G(1) OEFTC8G(1) OEG(1) OETC8G(1)
      OETCC8G(1) OETCG(1) OHCCG(1) S8G(1) SCDZG(1) TCC8G(1) TCCG(1)
      TDZC8G(1) TG(1) (0)

    Suffixes of "4O":
    
      DC8G(159) DCC8G(149) DCCG(83) DG(58) HC8G(46) DCG(39) HCC8G(39)
      HG(26) DT8G(10) PTC8G(9) ETCG(8) HCG(8) DTC8G(7) EG(7) HCCG(7)
      CC8G(6) DAEG(6) DSC8G(6) ETC8G(6) DTG(5) DZCG(4) EDCC8G(4)
      EDCCG(4) EDG(4) 8G(3) DCCCG(3) DTCG(3) HTC8G(3) HTG(3) PSC8G(3)
      CCC8G(2) DCCC8G(2) DCOG(2) DCTC8G(2) DG8G(2) DO8G(2) ESCC8G(2)
      HAE8G(2) HSC8G(2) HTCG(2) 4ODCCG(1) 8CC8G(1) 8CCG(1) 8GDCG(1)
      CCDCC8G(1) CCG(1) D8G(1) DA8G(1) DAE8G(1) DAETCG(1) DAOG(1)
      DC28G(1) DC8C8G(1) DCOEG(1) DCT8G(1) DCTCG(1) DCTDZG(1)
      DGE88G(1) DGT8G(1) DOC8G(1) DOE8G(1) DOETC8G(1) DOG(1) DOPG(1)
      DS8G(1) DSCG(1) DT28G(1) DZC8G(1) DZG(1) E8G(1) EDARG(1)
      EDCCOEG(1) EDCG(1) EHCC8G(1) ES8G(1) ESC8G(1) ESCG(1) ESG(1)
      ET8G(1) ETCC8G(1) ETG(1) FSG(1) FTC8G(1) FTCG(1) GDG(1) HAEG(1)
      HAR8G(1) HARG(1) HC8SC8G(1) HCC2G(1) HCCC8G(1) HS8G(1) HSCG(1)
      HSG(1) HT8G(1) HZCG(1) P8G(1) PG(1) PS8G(1) PT8G(1) PTCG(1)
      RC8G(1) RCC8G(1) RCCG(1) RTC8G(1) SC8G(1) SCC8G(1) TC8G(1)
      TCG(1) TG(1) (0) (0)

    Suffixes of "G":
    
      DCC8G(10) SC8G(8) DCCG(7) HC8G(7) DC8G(5) HCCG(4) SCG(4) TC8G(4)
      HCC8G(3) SCCG(3) 8ARSCG(1) 8CC8G(1) 8G(1) DCCCG(1) DCCTG(1)
      DCSCG(1) DSC8G(1) DZCG(1) ET8G(1) FTC8G(1) FTCG(1) HCG(1) HG(1)
      ODG(1) OG(1) RG(1) ROHCG(1) SCAE8G(1) SCC8G(1) STCG(1) TCC8G(1)
      TCCG(1) TCDZCG(1) TCG(1) TCHZG(1)
    
    Suffixes of "O":

      HC8G(48) DC8G(43) DCC8G(32) EG(26) HCC8G(25) ETC8G(24) EDC8G(18)
      EDCCG(18) HCCG(18) HG(17) DCCG(16) ESC8G(14) EDCC8G(13) ETCG(13)
      HCG(13) PTC8G(13) DG(12) E8G(12) DCG(11) EDG(11) ESCG(8) DTCG(6)
      RG(6) EDCG(5) ETG(5) ES8G(4) HSC8G(4) 8G(3) DSCG(3) DT8G(3)
      EHC8G(3) ET8G(3) ETCCG(3) HAEG(3) HS8G(3) HTC8G(3) HTCG(3)
      PSC8G(3) PTCG(3) SCG(3) DAEG(2) DCSG(2) DSC8G(2) DTG(2) DZG(2)
      EDCCCG(2) EETG(2) EFTC8G(2) EHCC8G(2) EHSC8G(2) EODG(2)
      EPTC8G(2) ESCCG(2) ETCC8G(2) ETHZG(2) FTC8G(2) HSCG(2) HTG(2)
      PTCCG(2) ROEG(2) 4ODG(1) 4OFTC8G(1) 8TC8G(1) CC8G(1) CCC8G(1)
      CCG(1) CDCC8G(1) DA8G(1) DAE8G(1) DAETG(1) DANG(1) DCCC8G(1)
      DCS8G(1) DCSCG(1) DCTG(1) DGEDG(1) DOE8G(1) DOG(1) DTC8G(1)
      DZCG(1) E2AEG(1) EAMG(1) EAO8G(1) ECCC8G(1) EDCSCG(1) EDORG(1)
      EDSCG(1) EDTG(1) EHG8G(1) EHS8CG(1) EHTCG(1) EOEDCCG(1)
      EOETCG(1) EOHTC8G(1) EPODZG(1) ESCC8G(1) ESHG(1) ETCDZG(1)
      FOEG(1) FS8G(1) FT8G(1) GSCG(1) HAE8G(1) HAES8G(1) HAESC8G(1)
      HARG(1) HC8CCG(1) HCCC8G(1) HCCO8G(1) HCOE8G(1) HOE8G(1) HORG(1)
      HT8G(1) HTR8G(1) HZCG(1) HZG(1) NOEG(1) ORG(1) PAEG(1)
      PAESC8G(1) PCTG(1) PG(1) POEG(1) PS8G(1) PSCG(1) PT8G(1)
      PTAEG(1) PTCCDG(1) RC8G(1) RO8G(1) ROIRG(1) RSCC8G(1) RTCG(1)
      SC8G(1) SCC8G(1) TC8G(1) TCCG(1)

  Common suffixes:
  
    cat bio-c-evt.txt \
      | tr ' ' '\012' \
      | egrep -v '[%*/=]' \
      | egrep '.' \
      > bio-c-evt.wds
    
     lines   words     bytes file        
    ------ ------- --------- ------------
      5928    5928     32257 bio-c-evt.wds

    /bin/rm .sf-join.frq
    /bin/touch .sf-join.frq
    set ofmt = "0"
    @ i = 1
    set noglob
    set prefs = ( \
      4OD 4OE 4OH 4O'[^DEH]' \
      OD OED OE8 OE'[^D8]' OH OP OR O'[^DEHPR]' \
      S T \
      '[^4OST]' \
    ) 
    foreach f ( ${prefs} ) 
      echo "${i} ${f}"
      /bin/rm .sf-part-${i}.frq
      cat bio-c-evt.wds \
        | egrep '^'"${f}" \
        | /bin/sed \
          -e 's@^'"${f}"'\(.*\)$@-\1@' \
        | sort \
        | uniq -c \
        > .sf-part-${i}.frq
      /n/gnu/bin/join -a 1 -a 2 -j1 2 -e 0 -o "${ofmt},1.1" \
        .sf-part-${i}.frq .sf-join.frq > .sf-tmp.frq
      @ i = ${i} + 1
      set ofmt = "${ofmt},2.${i}"
      mv .sf-tmp.frq .sf-join.frq
    end
    unset ofmt i noglob

    ( echo "# ${prefs}"; cat .sf-join.frq ) \
      | add-counts \
      > .sf-suffs.frq
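
  The same tabulation can also be sketched as a single awk pass.  This is
  an alternative, not the script above: it prints one "suffix prefix-class
  count" line per combination instead of a joined wide table, and it
  assigns each word to the first prefix pattern that matches.  The
  function name suffix_table and the sample words are hypothetical.

```shell
# Hypothetical helper: $1 is a blank-separated list of prefix patterns
# (regular expressions, tried in the order given).  Words are read one
# per line from stdin.  For each word, the first matching prefix is cut
# off and the (suffix, prefix pattern) pair is counted.
suffix_table() {
  awk -v classes="$1" '
    BEGIN { n = split(classes, pat, " ") }
    {
      for (i = 1; i <= n; i++)
        if (match($0, "^" pat[i])) {
          count["-" substr($0, RLENGTH + 1) " " pat[i]]++
          break
        }
    }
    END { for (k in count) print k, count[k] }' \
  | LC_ALL=C sort
}

# Invented word list, for illustration only:
printf '%s\n' 4ODC8G 4ODCC8G ODC8G 4ODC8G | suffix_table '4OD OD'
# prints:
#   -C8G 4OD 2
#   -C8G OD 1
#   -CC8G 4OD 1
```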

  From the output, it seems indeed there are two classes
  of words, [A,O][EKLMNR] and [G].
  
  Split bio-c-evt.wds into the two classes plus "misc":
  
    egrep '[AO][EKLMNR]$' bio-c-evt.wds > bio-c-evt-eklmnr.wds
    egrep 'G$' bio-c-evt.wds > bio-c-evt-g.wds
    egrep -v '([AO][EKLMNR]|G)$' bio-c-evt.wds > bio-c-evt-other.wds
    
     lines   words     bytes file        
    ------ ------- --------- ------------
      3215    3215     19119 bio-c-evt-g.wds
      2347    2347     11361 bio-c-evt-eklmnr.wds
       366     366      1777 bio-c-evt-other.wds
      5928    5928     32257 bio-c-evt.wds

  Next step: AutoAnalysis on each category.

97-07-06 stolfi
===============

  Reading again the description of Landini's file, I found out that
  the `%' and `!' marks should have been handled differently from the way
  I did. 
  
  Also, looking at the actual shape of the characters, I realized that
  the FSG encoding was not very good for my purposes, since it assigns
  completely different codes to glyphs which may be just calligraphic
  variations of the same grapheme.
  
  Thus I decided to redo everything from the beginning, using a 
  more analytical encoding.  
  
  I considered using Jacques Guy's "Neo-Frogguy" or "Gui2" encoding, but even
  that is a bit too synthetic --- for example, his <2> should be "i'",
  and his <9> should be `c)', for consistency. (The statistics on the
  occurrence of repeated <i>s apparently confirm this choice).
  
  Thus I decided to define my own "super-analytic" or "SA" encoding.
  The idea is to break all characters down to individual "logical"
  strokes, and use one (computer) character to encode each stroke.
  
  There is some question as to what is a logical stroke, and when two
  strokes are different.  Obviously, the definition of a stroke must
  include not only its shape but also the way it connects to the
  neighboring strokes; and, given the irregularity of handwritten
  glyphs, that may be hard to decide.
  
  For instance, FSG's [A] character can be broken down into two
  strokes, shaped like the [C] and [I] glyphs.  Supposedly, the
  difference between an [A] and a [CI] is that in the former the
  strokes are connected into a closed shape.  Is this difference
  significant?
  
  I checked the occurrences of [CI], [CM], and [CN] in the interlinear
  file.  Two things are curious. First, these combinations are
  extremely rare.  Second, a good many of them are transcribed
  differently by Currier and the FSG: where one has [CIIR] the other
  often has [AIR], and vice-versa.  Same for [CM] versus [AN], etc.
  
  In light of these observations, I have decided to treat all
  occurrences of [A] as [CI]. If the two are indeed different, that
  will be just one more ambiguity added to the inherent ambiguity of
  natural language; so it cannot make the decipherment task more
  difficult.  Confusing the two will change the letter frequencies, it
  is true; but, since the language does not appear to be a
  standardized one, there is not much information we can extract from
  absolute letter frequencies.  The methods we hope to use --- such as
  automaton analysis --- are not significantly disturbed by collapsing
  letters.
  
  On the other hand, if [A] and [CI] are the same grapheme, using
  different encodings will seriously confuse statistics --- especially
  if the spacing depends on the immediate context.
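
  The effect of this merge can be previewed directly on FSG-coded words.
  The sketch below is not the fsg2jsa filter used later; it only rewrites
  [A] as [CI], and it assumes one word per line on stdin.  The function
  name is hypothetical.

```shell
# Hypothetical helper: rewrite every FSG [A] as the stroke pair [CI],
# then list the distinct resulting words.
merge_a_into_ci() {
  sed -e 's/A/CI/g' | sort -u
}

# The pair of readings mentioned above collapses to one spelling:
printf '%s\n' 'AIR' 'CIIR' | merge_a_into_ci
# prints:
#   CIIR
```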
 
  I took the file bio-c-evt.txt and removed all `%' from it.  Many of them
  should have been spaces, but so what.
  
  I also added a space before and after each line; that may be helpful
  when doing greps.
  
       lines   words     bytes file        
      ------ ------- --------- ------------
         765    7227     39823 bio-c-evt.txt
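
  A sketch of these two edits as a filter, assuming the text arrives on
  stdin; the function name pad_lines is hypothetical and the sample line
  is invented.  Note that deleting "%" outright joins the words on either
  side of it, which is the trade-off accepted above.

```shell
# Hypothetical helper: delete every "%" and add one blank at the
# start and at the end of each line.
pad_lines() {
  tr -d '%' | sed -e 's/^/ /' -e 's/$/ /'
}

printf '%s\n' 'SC8G%%OE //' | pad_lines
# prints " SC8GOE // " (with one leading and one trailing blank)
```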

  Next I applied a recoding:
  
    cat bio-c-evt.txt \
      | fsg2jsa \
      > bio-c-jsa.txt

  Next I extracted words:
  
    cat bio-c-jsa.txt \
      | tr ' ' '\012' \
      | egrep '.' \
      > bio-c-jsa.wds
      
    cat bio-c-jsa.wds \
      | sort \
      | uniq \
      > bio-c-jsa.dic
      
    cat bio-c-jsa.wds \
      | sort \
      | uniq -c \
      | sort -k1,1nr \
      > bio-c-jsa.frq

     lines   words     bytes file        
    ------ ------- --------- ------------
       765    7227     61678 bio-c-jsa.txt
      7227    7227     60144 bio-c-jsa.wds
      1687    1687     17398 bio-c-jsa.dic
      1687    3374     30894 bio-c-jsa.frq

  Next I separated the good words:
  
    cat bio-c-jsa.wds \
      | egrep '^[a-z+^]*$' \
      > bio-c-jsa-gut.wds
  
    cat bio-c-jsa.dic \
      | egrep '^[a-z+^]*$' \
      > bio-c-jsa-gut.dic
      
    bool 1-2 bio-c-jsa.dic bio-c-jsa-gut.dic \
      > bio-c-jsa-bad.dic

     lines   words     bytes file        
    ------ ------- --------- ------------
        34      34       270 bio-c-jsa-bad.dic
      6420    6420     57560 bio-c-jsa-gut.wds
      1653    1653     17128 bio-c-jsa-gut.dic
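
  The bool command is a local set-operation tool.  Assuming that "1-2"
  means "lines in the first file but not in the second", the standard
  comm utility gives the same result on sorted input.  A sketch with
  invented words and a hypothetical function name:

```shell
# Set difference of two sorted word lists: print the lines that occur
# only in the first file.  Both files must already be sorted.
only_in_first() {
  comm -23 "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' 'cgcy' 'o*x' 'qoljcy' > "$tmp/all.dic"   # sorted, invented
printf '%s\n' 'cgcy' 'qoljcy'       > "$tmp/good.dic"
only_in_first "$tmp/all.dic" "$tmp/good.dic"           # prints: o*x
rm -r "$tmp"
```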
 
  Next I built the automaton:
  
    cat bio-c-jsa-gut.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump bio-c-jsa-gut.dmp

     strings  letters    states   finals     arcs  sub-sts lets/arc
    -------- --------  -------- -------- -------- -------- --------
        1653    15475      1373      178     2609     2234    5.931
        
  Note that the efficiency (letters per arc) increased from 4.223 to 5.931,
  even though the words got considerably longer!
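
  The "lets/arc" column agrees with the letter count divided by the arc
  count, which is presumably the efficiency meant here.  That reading is
  an inference from the two summary tables, not from the tool's
  documentation.  A quick check, with a hypothetical function name:

```shell
# Ratio of total letters to automaton arcs, to three decimals.
lets_per_arc() {
  awk -v letters="$1" -v arcs="$2" 'BEGIN { printf "%.3f\n", letters / arcs }'
}

lets_per_arc 6896 1633    # FSG-encoded dictionary; prints: 4.223
lets_per_arc 15475 2609   # SA-encoded dictionary;  prints: 5.931
```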
  
  I ran AutoAnalysis, looking for unproductive states (those used by 2 words
  or fewer) and strange words:
  
    nice AutoAnalysis \
      -load bio-c-jsa-gut.dmp \
      -unprod bio-c-jsa-gut-1-unp.sts \
        -maxUnprod 2 \
        -unprodSugg bio-c-jsa-gut-1-unp.sugg

      546 unproductive states
      826 strange words (with repetitions) listed
      389 strange words (without repetitions) listed
      
   I repeated the analysis, considering a state unproductive if it is used
   by only one word:
   
      nice AutoAnalysis \
        -load bio-c-jsa-gut.dmp \
        -unprod bio-c-jsa-gut-2-unp.sts \
          -maxUnprod 1 \
          -unprodSugg bio-c-jsa-gut-2-unp.sugg
        
       266 unproductive states
       266 strange words (with repetitions) listed
        96 strange words (without repetitions) listed

   Here are the strange words:
   
      ccljtciix ccqgtccy cgciiiivciiscyixcy cgciiiixciixcgcy
      cgciixcyiscg cgciixljccccyiscy cgcstoixcycg cgctcljtccgcy
      cgcyixctccs cgcyqoljoix cgixctciscy cgoisctccgctccy
      cgoixcccsoixctccy cicstcyqjcccg ciixcstclgccy cqgciixoiis
      cqjtcyljoisoix csciixcstcqjtcgcy csciixctqjccgcyqjciis
      cscstoixcstccqjtcy csolgctciij csqjoixqgctcy cstccgljcoixcy
      cstcoisoixljctcgcy cstixctcis cstocqgtccgcy cstqjciixctcccgcy
      ctccgciiivcgci ctccgcyqjctcg ctccicgis ctcciixisois ctccycsctcy
      ctccyqcyqjciiiiv ctcoixqjciiivcscy ctcstccljtcy ctixctqgcstcccy
      ctoixljccccy cycgoixcstcccy cycqgciiiiv cyisoqjccy
      cyljccgcicqgtcy cyqjcccsoixcgcy cyqjctixctccgcy cyqjctoisoixljcy
      cyqoixcyqois iixqgctcgcy isoisciisoix ixcsciiis
      ixctccgcyisqgctccgcy ixixcgcyis ixixciiiiscy ixlgctcciix
      ixoixljccgcyljciiivoix lgctccgcyljciiscy ljccgcyqgocstcy
      ljciixcyctccy ljoisoiiivcy ljoqjciixcy ocstcqgoixcs ocycstccy
      oisciiisciiso oixcstciixcscy olgciixcstcljcy olgctcivcycs
      oljciixctccgcyqjoiscy oqjciisqgcccgcy qcljcyqjccgciis
      qcsoixljcccgcy qcyisctcs qgcgciixcstois qgciixcgciisciiisciix
      qgcstctccgciix qgcstoixqgctclgtcgcy qgctccyljccciis
      qgoisciixcstcy qgoixcccgciisciiv qgoixqjccstoix qgoqjctoljciis
      qjccciixciiiv qjccyqjccj qjcoixcstcoix qjctcgoixqjcgcy
      qjctciixoixljcc qjcyqoljcy qjocqjtccy qjoisoixcstcscgcy
      qjoixoiscstcccy qoljciixljcoix qoljcyisixcstc qoljcyixcgcgcy
      qoqjcgcgcyciis qoqjcyqjcyqjois qoqjixoixljciix qoqjoisciixoij
      qoqoljcccy qqgoixljciiiv

  I looked for popular states and radicals:
  
    nice AutoAnalysis \
      -load bio-c-jsa-gut.dmp \
      -classes bio-c-jsa-gut-1.cls \
        -minClassPrefs 5 \
        -radicals bio-c-jsa-gut-1.rad

     77 lexical classes found

  The output wasn't very illuminating, though.  I looked for productive
  states instead:

    AutoAnalysis \
      -load bio-c-jsa-gut.dmp \
      -prod bio-c-jsa-gut-1-prd.pst \
      -minProductivity 1 \
      -maxPrefSize 30 \
      -maxSuffSize 30 

  Here is the result:
  
     nprefs  nsuffs  nwords  prodty  prefs/suffs
    ------- ------- ------- -------  -----------------
         40       2      80      39  { qoqjciisc, qoljcctcc, qoljcccc, ... }:{ gcy, y }
         35       2      70      34  { qoqjcccs, qoljcio, qoixoix, ... }:{ (), cy }
         24       2      48      23  { qgciii, oljciiii, oisciiii, ... }:{ iv, v }
         15       2      30      14  { qoljccii, ctccgcyi, cgctoi, ... }:{ s, x }
          7       3      21      12  { qoixctcc, qoccc, oqjccc, ... }:{ cgcy, gcy, y }
         12       2      24      11  { qoljccgci, qljo, qjoixcstco, ... }:{ is, ix }
         11       2      22      10  { oixqjcc, ixqjcc, cgcyljcc, ... }:{ cgcy, gcy }
         10       2      20       9  { cyctccgc, qoqjciixcgc, ... }:{ iis, y }
          4       4      16       9  { qoixcstcc, ctcqjtc, qoqjcstc, ... }:{ cgcy, cy, gcy, y }
          9       2      18       8  { qoqjciixcg, qoljciixcg, ... }:{ ciis, cy }
          5       3      15       8  { ctccljtc, cstcljcc, csoixctcc, ... }:{ cy, gcy, y }
          5       3      15       8  { qocljtc, qoctc, oisctcc, ... }:{ cgcy, cy, y }
          8       2      16       7  { qocgcc, qcljct, ixocc, ... }:{ cgcy, cy }
          8       2      16       7  { qcljcc, ctccljc, cstccqjc, ... }:{ cy, y }
          7       2      14       6  { qoqgcst, oqjciixcst, qoixqjc, ... }:{ ccgcy, cgcy }
          4       3      12       6  { qoiscc, ljctc, qoljcstc, ... }:{ cgcy, cy, gcy }
          4       3      12       6  { oljcic, ixctccc, cqjtcc, ... }:{ gcy, s, y }
          4       3      12       6  { qjciii, isoii, isciii, csoii }:{ iiv, iv, v }
          4       3      12       6  { qoct, oisctc, ixctccljt, ... }:{ ccgcy, ccy, cy }
          6       2      12       5  { oqgcstccgc, oljcccgc, qjoixcgc, ... }:{ iix, y }
          6       2      12       5  { oixoii, cstcljciii, qoiscii, ... }:{ iiv, iv }
          6       2      12       5  { cgljcc, qgoixljcc, oixljcstc, ... }:{ cy, gcy }
          5       2      10       4  { qgoixctcc, csciixctcc, ... }:{ g, gcy }
          2       5      10       4  { cgctcc, qoqjctc }:{ cgcy, cy, gcy, oix, y }
          5       2      10       4  { qgoixljc, oixljcst, ljoixct, ... }:{ ccy, cgcy }
          5       2      10       4  { qoljciiiv, ctis, oqjcoix, ... }:{ (), cgcy }
          5       2      10       4  { qoljciixct, ocgct, ixljct, ... }:{ ccgcy, ccy }
          3       3       9       4  { qoixciii, oqjciii, oixljciii }:{ iv, s, v }
          2       4       8       3  { cgcc, cgoixctc }:{ ccgcy, cgcy, cy, gcy }
          4       2       8       3  { qjoixcg, qgoixcstcg, cstcg, ... }:{ ciix, cy }
          2       4       8       3  { qoqgctc, ljcstc }:{ cgcy, cy, gcy, oix }
          4       2       8       3  { qoisci, ctcoixljci, csoixljci, ... }:{ iiiv, iiv }
          4       2       8       3  { oljccgcy, oixljcccy, ctoixo, ... }:{ (), is }
          4       2       8       3  { ocljt, ixljccg, ctoixct, ... }:{ ccy, cy }
          4       2       8       3  { qoljctcgcy, oqjccgcy, ... }:{ (), ix }
          2       4       8       3  { qoqjcst, ctcqgt }:{ ccgcy, ccy, cgcy, cy }
          4       2       8       3  { qoqjciiiv, qjccgcy, oixois, ... }:{ (), oix }
          2       4       8       3  { oixqjcii, ocgcii }:{ iiv, iv, s, x }
          2       4       8       3  { oixqjci, ocgci }:{ iiiv, iiv, is, ix }
          3       2       6       2  { oqjccgciii, ctcqjcii, ... }:{ iv, s }
          2       3       6       2  { csciixctc, cgoixcstc }:{ cg, cgcy, oix }
          2       3       6       2  { ixctccgcii, ciisoi }:{ s, scy, x }
          2       3       6       2  { ixctccgci, ciiso }:{ is, iscy, ix }
          3       2       6       2  { oqjoixcgcy, oqgctccgcy, ... }:{ (), ixctccy }
          3       2       6       2  { oqoljc, ctcoljc, ctccyljc }:{ iiiv, y }
          3       2       6       2  { oqolj, ctcolj, ctccylj }:{ ciiiv, cy }
          2       3       6       2  { qoljctcc, ixljccc }:{ g, gcy, y }
          2       3       6       2  { qoljcst, oqjcst }:{ ccgcy, ccy, cgcy }
          3       2       6       2  { qoixqj, qocst, oiscst }:{ cccgcy, ccgcy }
          2       3       6       2  { oqgcstccg, oljcccg }:{ (), ciix, cy }
          2       3       6       2  { qoljcoi, oqgoi }:{ s, x, xcy }
          2       3       6       2  { qoljcs, oqjcs }:{ tccgcy, tccy, tcgcy }
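
  The `prodty' column is not defined in these notes.  Every row of the table
  is, however, consistent with prodty = (nprefs - 1) * (nsuffs - 1), which
  equals nwords - nprefs - nsuffs + 1.  The sketch below checks that reading
  on a few rows copied from the table; the formula is an inference from the
  numbers, not something taken from AutoAnalysis itself.

```python
# Rows (nprefs, nsuffs, nwords, prodty) copied from the table above.
ROWS = [
    (40, 2, 80, 39),
    (7, 3, 21, 12),
    (4, 4, 16, 9),
    (2, 5, 10, 4),
    (3, 3, 9, 4),
]


def productivity(nprefs, nsuffs):
    """Inferred formula (an assumption): (nprefs - 1) * (nsuffs - 1)."""
    return (nprefs - 1) * (nsuffs - 1)


def consistent(rows):
    """True if every row has nwords = nprefs * nsuffs and the inferred prodty."""
    return all(
        nwords == nprefs * nsuffs and prodty == productivity(nprefs, nsuffs)
        for nprefs, nsuffs, nwords, prodty in rows
    )


print(consistent(ROWS))
```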
  
97-07-07 stolfi
===============

  Some interesting patterns are apparent above, but things may become clearer
  if we remove the garbage.
  
  Meanwhile, here is a count of the digraphs in the "good" words (counting
  repeated words):

                    o     c     t     i     q     l |     v     x     y     j     g     s   TOT
          ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
              0  1363  2574     0   502  1874   107 |     0     0     0     0     0     0  6420
        o    34     4   112     0  1680   635  1430 |     0     0     0     0     0     0  3895
        c     7   172  3922  1447  2002   197   291 |     0     0  3764    12  2728  1445 15987
        t     7    96  2696     0    23    18    24 |     0     0     0     0     0     0  2864
        i     6     5    31     0  3395     3     3 |   943  2349     0    57     0   913  7705
        q     5  1622    35     0     0     2     4 |     0     0     0   969   215     0  2852
        l     0     0     0     0     0     0     0 |     0     0     0  2185    36     0  2221
          ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
        v   912    10    18     0     3     0     0 |     0     0     0     0     0     0   943
        x  1085   159   757     0    22    50   276 |     0     0     0     0     0     0  2349
        y  3510     7    72     0    37    66    72 |     0     0     0     0     0     0  3764
        j    84   138  2658   319    23     0     1 |     0     0     0     0     0     0  3223
        g    78   123  2729    25    14     2     8 |     0     0     0     0     0     0  2979
        s   692   196   383  1073     4     5     5 |     0     0     0     0     0     0  2358
          ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
      TOT  6420  3895 15987  2864  7705  2852  2221 |   943  2349  3764  3223  2979  2358 57560

      Next-symbol probabilities (× 99):

                    o     c     t     i     q     l |     v     x     y     j     g     s   TOT
          ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
              .    21    40     .     8    29     2 |     .     .     .     .     .     .    99
        o     1     .     3     .    43    16    36 |     .     .     .     .     .     .    99
        c     .     1    24     9    12     1     2 |     .     .    23     .    17     9    99
        t     .     3    93     .     1     1     1 |     .     .     .     .     .     .    99
        i     .     .     .     .    44     .     . |    12    30     .     1     .    12    99
        q     .    56     1     .     .     .     . |     .     .     .    34     7     .    99
        l     .     .     .     .     .     .     . |     .     .     .    97     2     .    99
          ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
        v    96     1     2     .     .     .     . |     .     .     .     .     .     .    99
        x    46     7    32     .     1     2    12 |     .     .     .     .     .     .    99
        y    92     .     2     .     1     2     2 |     .     .     .     .     .     .    99
        j     3     4    82    10     1     .     . |     .     .     .     .     .     .    99
        g     3     4    91     1     .     .     . |     .     .     .     .     .     .    99
        s    29     8    16    45     .     .     . |     .     .     .     .     .     .    99
          ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
      TOT    11     7    27     5    13     5     4 |     2     4     6     6     5     4 57560

      Previous-symbol probabilities (× 99):

                    o     c     t     i     q     l |     v     x     y     j     g     s   TOT
          ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
              .    35    16     .     6    65     5 |     .     .     .     .     .     .    11
        o     1     .     1     .    22    22    64 |     .     .     .     .     .     .     7
        c     .     4    24    50    26     7    13 |     .     .    99     .    91    61    27
        t     .     2    17     .     .     1     1 |     .     .     .     .     .     .     5
        i     .     .     .     .    44     .     . |    99    99     .     2     .    38    13
        q     .    41     .     .     .     .     . |     .     .     .    30     7     .     5
        l     .     .     .     .     .     .     . |     .     .     .    67     1     .     4
          ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
        v    14     .     .     .     .     .     . |     .     .     .     .     .     .     2
        x    17     4     5     .     .     2    12 |     .     .     .     .     .     .     4
        y    54     .     .     .     .     2     3 |     .     .     .     .     .     .     6
        j     1     4    16    11     .     .     . |     .     .     .     .     .     .     6
        g     1     3    17     1     .     .     . |     .     .     .     .     .     .     5
        s    11     5     2    37     .     .     . |     .     .     .     .     .     .     4
          ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
      TOT    99    99    99    99    99    99    99 |    99    99    99    99    99    99 57560
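
  The script count-digraph-freqs is not reproduced in these notes.  As a
  rough stand-in, the Python sketch below counts digraphs with a word break
  on each side of every word and scales the conditional frequencies to 99,
  which appears to be how the tables above were made (for instance,
  99 * 1363 / 6420 rounds to 21).  The function names and the toy word list
  are illustrative only.

```python
from collections import Counter

BREAK = " "  # stands for the word break (the blank row/column in the tables)


def digraph_counts(words):
    """Count adjacent symbol pairs, with a word break before and after each word."""
    counts = Counter()
    for word in words:
        padded = BREAK + word + BREAK
        counts.update(zip(padded, padded[1:]))
    return counts


def scaled(counts, by_first=True, scale=99):
    """Scale each row (by_first=True) or each column (by_first=False) to `scale`."""
    totals = Counter()
    for (a, b), n in counts.items():
        totals[a if by_first else b] += n
    return {
        (a, b): round(scale * n / totals[a if by_first else b])
        for (a, b), n in counts.items()
    }


words = ["qoljcy", "cgcy", "oix", "cgcy"]  # toy sample; repeated words count twice
counts = digraph_counts(words)
next_sym = scaled(counts, by_first=True)    # next-symbol table
prev_sym = scaled(counts, by_first=False)   # previous-symbol table
```

  Each word of n strokes contributes n + 1 digraphs, so the grand total is
  the number of strokes plus the number of words.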

  Note that the stroke `l' is always followed by either `j' or `g', hence `lj'
  and `lg' should be single letters.
      
  Note also that there are two clearly different kinds of strokes, "body" B =
  {`c',`o',`t',`i',`q',`l'} and "limb" L = {`v',`x',`y',`j',`g',`s'}.  If we
  reduce the digraph count matrix to these two classes, plus word break W, we
  get

    cat bio-c-jsa-gut.wds \
      | tr 'cotiqlvxyjgs' 'BBBBBBLLLLLL' \
      | count-digraph-freqs

    Digraph counts:

                  B     L
        ----- ----- -----
            .  6420     .
      B    59 19849 15616
      L  6361  9255     .
        ----- ----- -----

    Next-symbol probabilities (× 99):

                  B     L
        ----- ----- -----
            .    99     .
      B     .    55    44
      L    40    59     .
        ----- ----- -----

    Previous-symbol probabilities (× 99):

                  B     L
        ----- ----- -----
            .    18     .
      B     1    55    99
      L    98    26     .
        ----- ----- -----

  Note that every word begins with a body stroke; this was expected from the
  definition of the limb strokes (they can be recognized only by their
  relationship to a previous stroke).  Note also that a limb stroke cannot be
  followed by another limb stroke; this too is not wholly unexpected.
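
  The class reduction and the two observations above can be checked with a
  few lines of Python.  This is only a sketch on a toy word list; the body
  and limb sets are the ones defined above, and `W' stands for the word
  break.

```python
from collections import Counter

BODY = "cotiql"
LIMB = "vxyjgs"
# Same mapping as the tr command: body strokes -> B, limb strokes -> L.
TO_CLASS = str.maketrans(BODY + LIMB, "B" * len(BODY) + "L" * len(LIMB))


def class_digraphs(words):
    """Count digraphs over the classes B, L and the word break W."""
    counts = Counter()
    for word in words:
        padded = "W" + word.translate(TO_CLASS) + "W"
        counts.update(a + b for a, b in zip(padded, padded[1:]))
    return counts


words = ["qoljcy", "cgcy", "oix"]  # toy sample
counts = class_digraphs(words)

# The two observations: no word starts with a limb, no limb follows a limb.
print(counts["WL"], counts["LL"])
```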

  The surprise is that almost no words *end* in a body stroke.  The least rare
  body stroke in word-final position is `o'.  Here are all the words that end
  in body strokes, in context.  (The "<<" marks the end of the offending word.)

                ctix ois cstcoixo << //
         ciix ciix ctccgciiivcgci << //
       qoljciiiv qjcy ixcstcgcyqo << //
                   qoqjcccgcy ixo << // 
                  oljccgciis cgci << // 
                  qoqjciix ctoixo << // 
                 cgciiscy cgciixo << // 
             cgcy cgciiiiv ctcqjt << // 
             oljcccg qoljciixoiso << // 
           qoljciix qoljccgcy ixo << // 
          cstcccgcy ixctccgcy ixc << // 
          cstccgcy qoljcyisixcstc << // 
          cstcljtcy oisciiisciiso << // 
         isctccgcy qoqjcccgcy ixo << // 
         qoljcccy qoqjcccgcy ixoc << // 

          ljctoix qjctciixoixljcc << cylgctccy 
                      isctcs ctcc << oixcs ciiiiv csljciix
               oixcstccy oixcstcc << qoixcstcccy qoqjcccy 
      cyctcccgcy csciiivo oix ctc << cgcy qoljciis 
               csciiiiv cstccgcci << cstccqjtcy ctccy
           ixctcgcy cgctcgcy cgci << qciis 
                ixljocgciix oqgci << ljoisoixis // 
            qcqjtcccg cstcgcy qci << oixljcccy cgciiiv  
                             // o << cgctcccgcy qoixctccy
                     cyctccciis o << oiiiv occcgcy 
                ctccy qoljciiiv o << isoiiiiv // 
                  ciiiv isciiiv o << ljciiv ctixciiiiiv // 
               cycstccis ciiiiv o << cyljcccgcy qoljcccgcy
               // csciiiv oljci*o << ctccgcy ixljctccgcy 
               cstccqjtcy qoljcio << ixois // 
             cgcicljtcy ixljciijo << cyljcccy ixcstccy
             cstccyljcccgcy ixljo << oqgctccgcy
         qoix oixciiiv cstcoix qo << qoljciix cstcivix // 
                     qoixctccy qo << ixctccg qoixljcy
                      ctcccgcy qo << ctciis ciiiiv // 
             oljcccy ixctccgcy qo << oixciiiv oqj 
                 // cycstccgcy qo << ois oljciiiv 
              cstcqjtcy ctcocy qo << qoljccy cgciixciiiiv 
                ctcccgcy qoix oqo << qoljciiiv oixcstcciij // 
                  qocgctccg oixqo << cgciis ctccljto ixoixoix 
                      cycstcccyqo << isciis oix ctcccy
            oixqo cgciis ctccljto << ixoixoix
                        // ixcsto << qoljccy ixcstccgcy
              cyctcccgcy csciiivo << oix ctc cgcy qoljciis 
          qoljctcgcy ctcqjccy ixo << qoljccgcy qoljciiv     
             qoljctcgcy ctccy ixo << ctcljcy oix 
           qois *cccy ixctccy ixo << cycgciiiv cstccy 
      csciiiisciix csciix cgciixo << qjciiiv cgciiscy cgciixo // 
            oisoix ccccsciix oixo << qjcoix oiscy // 
                             // q << ljciiiv cstccqjcy qoljciiiv 
                             // q << ljccccy cstccgcy qoljcccgcy
                             // q << qoqjccgcstccgcy
           ctqjciiis oqgctccgcy q << qjciix ctccgcy ctcqgctccgcy
             csciix ctccgcy cstcq << lgctois qoqjcicsoixljcy 
          qoljciis cstcccgcy ixct << cstoljciiiv ct*            
                  // ctccy ctcljt << cstcoqccccy ctoix
                  // cgciiiiv cst << qoixctccy 
           oqjciixcy qoljciix cst << oixcstciixcscy // 
                      // qoljccst << qoljccgcy oqjccgcy
           // cgciiiv ctccy ixcst << cgciiiiv ctccy //          

  Those of the first group appear to be interference by the line break.  (Note
  that the manuscript does not appear to use any hyphenation mark.  Either
  words are not broken across lines, which would be unusual, or they are broken
  without any extra marks, which would produce just such fragments.)  Those of
  the second group appear to be due to bogus word breaks in the transcription
  (e.g. between the `q' and `l') or to transcription errors.
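
  A listing like the one above can be produced along the following lines.
  This is a sketch, not the command actually used; the token line is a toy
  example stitched together from two entries of the listing, and the context
  width is an arbitrary choice.

```python
BODY = set("cotiql")


def body_final_words(tokens, context=3):
    """Return (left context, word, right context) for words ending in a body stroke."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok[-1] not in BODY:
            continue  # ends in a limb stroke, or is a marker such as "//"
        left = " ".join(tokens[max(0, i - context):i])
        right = " ".join(tokens[i + 1:i + 1 + context])
        hits.append((left, tok, right))
    return hits


tokens = "ctix ois cstcoixo // qoixctccy qo ixctccg qoixljcy".split()
hits = body_final_words(tokens)
for left, word, right in hits:
    print(f"{left:>30} {word} << {right}")
```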

  An interesting observation from the body/limb frequency tables above
  is that the transition probabilities from body stroke to body and
  limb are roughly 56% and 44%, respectively.  Thus, if the limb strokes
  mark the end of a syllable (or letter?), then the average number of body
  strokes in a syllable is slightly over 2.  (Considering that we are
  counting each "i" as a body stroke, the correct number may well be
  precisely 2.)
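
  The arithmetic behind this estimate, using the class digraph counts from
  the body/limb table above.  Treating runs of body strokes as geometric is
  a simplifying assumption made only for this check.

```python
# Counts copied from the body/limb digraph table.
BB, BL, BW = 19849, 15616, 59  # body -> body, body -> limb, body -> word break

body_total = BB + BL + BW  # each body stroke is followed by exactly one of these
p_bb = BB / body_total     # probability that a body stroke follows a body stroke
p_bl = BL / body_total     # probability that a limb stroke follows a body stroke

# Mean number of body strokes per limb stroke, counted directly.
per_limb = body_total / BL

# The same quantity estimated as the mean length of a geometric run of body strokes.
run_length = 1 / (1 - p_bb)

print(round(p_bb, 3), round(p_bl, 3), round(per_limb, 2), round(run_length, 2))
```

  Both estimates come out a little above 2.2 body strokes per limb stroke.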

  I decided that, before spending more time in the analysis, I must first
  prepare a "corrected" interlinear where discrepancies between FSG and Currier
  are resolved taking into account the probabilities above.

  The idea is to make a dictionary of 5-tuples, and try to use it 
  to decide on the corrections.  Namely, define the context of a letter
  occurrence in a text as its four nearest letter occurrences. 
  We can represent a context in sed-like notation as wx.yz where
  the "." is the position of the central letter.
  
  We scan some training text, collecting for each possible context
  the frequency distribution of the middle letter.  At the end, if,
  for a given context wx.yz there is some central letter t which 
  is more likely than all the others combined, we output a correction
  rule of the form wx?yz -> wxtyz.
  
  So, here is the work.  First, I generated a training data set:

    cat bio-m-evt.evt \
      | egrep '^<.*;[FC]> ' \
      | grep -v '[][%*_]' \
      | sed \
          -e 's/<.*;[FC]> */  /g' \
          -e 's/{[^}]*}//g' \
          -e 's/\!//g'\
      > .train.txt

     lines   words     bytes file        
    ------ ------- --------- ------------
       858     858     42866 .train.txt

  Next, I generated the correction patterns from it:
  
    cat .train.txt \
      | generate-fix-patterns -vMINOCC=10 \
      > .fixit.sed
       
     lines   words     bytes file        
    ------ ------- --------- ------------
       592     688     10219 .fixit.sed
       
  The parameter MINOCC is the minimum number of times a context must occur before
  we try to generate a correction rule for it. 
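
  The script generate-fix-patterns is not reproduced in these notes.  The
  Python sketch below shows one way the scheme described above could work:
  tabulate the middle letter for each four-letter context, keep the contexts
  seen at least `min_occ' times in which one letter outnumbers all the others
  combined, and use them to fill in "?" marks.  Treating each line as a flat
  string (word separators included) is an assumption, and all the names are
  illustrative.

```python
from collections import Counter, defaultdict


def fix_rules(lines, min_occ=10):
    """Map (two letters before, two letters after) -> majority middle letter."""
    by_context = defaultdict(Counter)
    for line in lines:
        for i in range(2, len(line) - 2):
            context = (line[i - 2:i], line[i + 1:i + 3])
            by_context[context][line[i]] += 1
    rules = {}
    for context, dist in by_context.items():
        total = sum(dist.values())
        letter, n = dist.most_common(1)[0]
        # "More likely than all the others combined" means n > total - n.
        if total >= min_occ and 2 * n > total:
            rules[context] = letter
    return rules


def apply_rules(line, rules):
    """Replace each '?' whose context has a rule (the wx?yz -> wxtyz step)."""
    chars = list(line)
    for i in range(2, len(line) - 2):
        if line[i] == "?":
            context = (line[i - 2:i], line[i + 1:i + 3])
            if context in rules:
                chars[i] = rules[context]
    return "".join(chars)


training = ["abcde"] * 6 + ["abxde"] * 3  # toy training text
rules = fix_rules(training, min_occ=5)
print(apply_rules("ab?de", rules))
```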

  Next, I generated a "consensus" interlinear file:
  
    cat bio-m-evt.evt \
      | make-consensus-interlin \
      > bio-x-evt.evt
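
  The script make-consensus-interlin is not reproduced in these notes.
  Judging from the rule format wx?yz -> wxtyz and from the later use of
  grep -v '?', it presumably keeps the characters on which the two
  transcriptions agree and writes "?" elsewhere.  A minimal sketch under that
  assumption, for two readings already aligned to the same length:

```python
def consensus(reading_a, reading_b, unknown="?"):
    """Keep characters on which both readings agree; mark the rest as unknown."""
    if len(reading_a) != len(reading_b):
        raise ValueError("readings must be aligned to the same length")
    return "".join(
        a if a == b else unknown for a, b in zip(reading_a, reading_b)
    )


print(consensus("4ODAM", "4ODAN"))
```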

  I extracted the consensus text from it:
  
    cat bio-x-evt.evt \
      | egrep '^<.*;J> ' \
      | sed \
          -e 's/{[^}]*}//g' \
          -e 's/[\!]//g' \
      > bio-j-evt-raw.evt

  I applied the corrections:
  
    cat bio-j-evt-raw.evt \
      | sed -f .fixit.sed \
      > bio-j-evt.evt
      
  Now let's extract the words and check how many good ones we got:
  
    cat bio-j-evt.evt \
      | sed \
          -e 's/<.*;[A-Z]> *//g' \
          -e 's/- *$/.\/\//g' \
          -e 's/= *$/.\/\/.=/g' \
      | tr '.' '\012' \
      | egrep '.' \
      > bio-j-evt.wds
         
    cat bio-j-evt.wds | sort | uniq -c | sort +0 -1nr > bio-j-evt.frq
    
    cat bio-j-evt.wds | sort | uniq > bio-j-evt.dic
    
     lines   words     bytes file        
    ------ ------- --------- ------------
      7216    7216     39223 bio-j-evt.wds
      1761    1761     12154 bio-j-evt.dic
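
  For reference, the sed/tr/egrep word-extraction step above translates to
  Python roughly as follows.  The locator in the sample lines is made up for
  illustration.

```python
import re


def line_to_words(line):
    """Split one interlinear line into words, as the sed/tr/egrep pipeline does."""
    line = line.rstrip("\n")
    line = re.sub(r"<.*;[A-Z]> *", "", line)   # drop the line locator
    line = re.sub(r"- *$", ".//", line)        # end of line       -> "//"
    line = re.sub(r"= *$", ".//.=", line)      # end of paragraph  -> "//" then "="
    return [word for word in line.split(".") if word]


print(line_to_words("<f75r.1;J> 4ODCC8G.OE.SCG-"))
```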
    
  I extracted the good words:
  
    cat bio-j-evt.wds | grep -v '?' > bio-j-evt-gut.wds
    
    cat bio-j-evt-gut.wds | sort | uniq > bio-j-evt-gut.dic

    cat bio-j-evt-gut.wds | sort | uniq -c | sort +0 -1nr > bio-j-evt-gut.frq
    
     lines   words     bytes file        
    ------ ------- --------- ------------
      6188    6188     31705 bio-j-evt-gut.wds
      1085    1085      6532 bio-j-evt-gut.dic
    
  I created an automaton for bio-j-evt-gut.dic:
  
    cat bio-j-evt-gut.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump bio-j-evt-gut.dmp

     strings  letters    states   finals     arcs  sub-sts lets/arc
    -------- --------  -------- -------- -------- -------- --------
        1085     5447       422       90     1263     1027    4.313

  I looked for unproductive states:

    nice AutoAnalysis \
      -load bio-j-evt-gut.dmp \
      -unprod bio-j-evt-gut-1-unp.sts \
        -maxUnprod 1 \
        -unprodSugg bio-j-evt-gut-1-unp.sugg
        
       46 unproductive states
       46 strange words (with repetitions) listed
       31 strange words (without repetitions) listed

    // 2PTG 42OEDCC8G 4GDAM 4O4ODCCG 4ODGE88G 8ARTCC8AE 8OEDCC2OE
    CPAEOIR DOHAEG EODAK GCPAM GDT8AR HG4ODG HOROES28G HT8OEH8G
    ODAROEOK OEPODZG OGSCG OHTOHAR OSCPOE2 P8AESOR PGDC8G PODAN
    POECC8ARAL POEHCSOE PSAROE RTCAE8 TCIROR TETPSCCG TOEDCCCG

  I removed these words and tried again:
  
    cat bio-j-evt-gut-1-unp.sugg \
      | sort -u \
      | bool 1-2 bio-j-evt-gut.dic - \
      > bio-j-evt-cln-1.dic
      
    cat bio-j-evt-cln-1.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump bio-j-evt-cln-1.dmp  

     strings  letters    states   finals     arcs  sub-sts lets/arc
    -------- --------  -------- -------- -------- -------- --------
        1054     5237       365       87     1176      950    4.453

    nice AutoAnalysis \
      -load bio-j-evt-cln-1.dmp \
      -unprod bio-j-evt-cln-1-unp.sts \
        -maxUnprod 1 \
        -unprodSugg bio-j-evt-cln-1-unp.sugg

        4 unproductive states
        4 strange words (with repetitions) listed
        3 strange words (without repetitions) listed

      42AN HSCODCC8G CPTG

  I removed these and tried again:

    cat bio-j-evt-cln-1-unp.sugg \
      | sort -u \
      | bool 1-2 bio-j-evt-cln-1.dic - \
      > bio-j-evt-cln-2.dic
      
    cat bio-j-evt-cln-2.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump bio-j-evt-cln-2.dmp  

     strings  letters    states   finals     arcs  sub-sts lets/arc
    -------- --------  -------- -------- -------- -------- --------
        1051     5220       360       87     1167      944    4.473

    nice AutoAnalysis \
      -load bio-j-evt-cln-2.dmp \
      -unprod bio-j-evt-cln-2-unp.sts \
        -maxUnprod 1 \
        -unprodSugg bio-j-evt-cln-2-unp.sugg

        0 unproductive states
        0 strange words (with repetitions) listed
        0 strange words (without repetitions) listed
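
  The `bool 1-2' step used in each round is presumably a set difference
  between two sorted word lists (first minus second); that reading agrees
  with the dictionary sizes, 1085 - 31 = 1054 and 1054 - 3 = 1051.  A minimal
  Python equivalent under that assumption, with a toy dictionary:

```python
def remove_words(dictionary, strange):
    """Return the words of `dictionary` not listed in `strange`, sorted and unique."""
    return sorted(set(dictionary) - set(strange))


dictionary = ["42AN", "CPTG", "OE", "SCG"]   # toy dictionary
strange = ["CPTG", "42AN", "42AN"]           # suggestions may repeat
print(remove_words(dictionary, strange))
```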

  I recoded it into the "super-analytic" encoding,
  but this time treating `qj', `qg', `lj', `lg' as single letters
  (`h', `k', `f', `p' respectively):
  
    cat bio-j-evt.wds \
      | fsg2jsa \
      | jsa2hoc \
      > bio-j-hoc.wds

    cat bio-j-hoc.wds | sort | uniq > bio-j-hoc.dic
      
    cat bio-j-hoc.wds | sort | uniq -c | sort +0 -1nr > bio-j-hoc.frq

     lines   words     bytes file        
    ------ ------- --------- ------------
      7216    7216     56394 bio-j-hoc.wds
      1761    1761     17523 bio-j-hoc.dic
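
  The filters fsg2jsa and jsa2hoc are not reproduced in these notes.  The
  sketch below covers only the four digraph substitutions named above, applied
  to a word already in the stroke encoding; everything else the filters may
  do is left out.

```python
# Digraph -> single letter, as listed above.
DIGRAPHS = {"qj": "h", "qg": "k", "lj": "f", "lg": "p"}


def join_digraphs(word, table=DIGRAPHS):
    """Scan left to right, replacing each listed digraph by its single letter."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in table:
            out.append(table[pair])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return "".join(out)


print(join_digraphs("qoljcy"))
```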

  Next I separated the good words:
  
    cat bio-j-hoc.wds \
      | egrep '^[a-z+^]*$' \
      > bio-j-hoc-gut.wds
  
    cat bio-j-hoc.dic \
      | egrep '^[a-z+^]*$' \
      > bio-j-hoc-gut.dic
      
    bool 1-2 bio-j-hoc.dic bio-j-hoc-gut.dic \
      > bio-j-hoc-bad.dic

     lines   words     bytes file        
    ------ ------- --------- ------------
      5427    5427     44172 bio-j-hoc-gut.wds
      1083    1083      9958 bio-j-hoc-gut.dic
       678     678      7565 bio-j-hoc-bad.dic
 
  Next I built the automaton:
  
    cat bio-j-hoc-gut.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump bio-j-hoc-gut.dmp

     strings  letters    states   finals     arcs  sub-sts lets/arc
    -------- --------  -------- -------- -------- -------- --------
        1083     8875       701       91     1492     1258    5.948
        
  Digraph statistics:
  
    cat bio-j-hoc-gut.wds \
      | count-digraph-freqs 
      
    Digraph counts:

              |     i     o     c     t     q     f     p     h     k |     v     j     x     s     y     g   TOT
        ----- + ----- ----- ----- ----- ----- ----- ----- ----- ----- + ----- ----- ----- ----- ----- ----- -----
            . |   396  1146  2210     .  1398    94     1   112    70 |     .     .     .     .     .     .  5427
        ----- + ----- ----- ----- ----- ----- ----- ----- ----- ----- + ----- ----- ----- ----- ----- ----- -----
      i     4 |  2248     2     8     .     .     2     .     .     . |   497    40  1979   650     .     .  5430
      o    19 |  1371     1    69     .     5  1190     8   455    60 |     .     .     .     .     .     .  3178
      c     1 |  1367   150  3487  1201     .   245     2   134    25 |     .     .     .  1187  3301  2408 13508
      t     4 |    17    73  2320     .     .    14     1    10     3 |     .     .     .     .     .     .  2442
      q     1 |     .  1383    21     .     .     1     .     .     . |     .     .     .     .     .     .  1406
      f     6 |     5    47  1543   180     .     .     .     .     . |     .     .     .     .     .     .  1781
      p     . |     .     2    15     1     .     .     .     .     . |     .     .     .     .     .     .    18
      h     3 |     2    41   606   103     .     .     .     .     . |     .     .     .     .     .     .   755
      k     2 |     .    38   111    14     .     .     .     .     . |     .     .     .     .     .     .   165
        ----- + ----- ----- ----- ----- ----- ----- ----- ----- ----- + ----- ----- ----- ----- ----- ----- -----
      v   493 |     .     1     3     .     .     .     .     .     . |     .     .     .     .     .     .   497
      j    40 |     .     .     .     .     .     .     .     .     . |     .     .     .     .     .     .    40
      x  1101 |     5   116   545     .     1   183     4    18     6 |     .     .     .     .     .     .  1979
      s   540 |     1   128   222   943     .     2     .     .     1 |     .     .     .     .     .     .  1837
      y  3161 |    12     3    49     .     2    46     2    26     . |     .     .     .     .     .     .  3301
      g    52 |     6    47  2299     .     .     4     .     .     . |     .     .     .     .     .     .  2408
        ----- + ----- ----- ----- ----- ----- ----- ----- ----- ----- + ----- ----- ----- ----- ----- ----- -----
    TOT  5427 |  5430  3178 13508  2442  1406  1781    18   755   165 |   497    40  1979  1837  3301  2408 44172

  Again, it is obvious that `ix', `ij', and `iv' are single letters; we will
  drop the `i' from them.  The same goes for `cy' and `cg'; we will drop the
  `c'.
  
  We may also let `cs' and `is' be single letters.  This is the right
  thing to do if the distribution of the letter after the `s' depends
  on the letter before the s:
  
                   o     c     t     i     k     f   TOT
         ----- ----- ----- ----- ----- ----- ----- -----
      cs    45    86   109   943     1     1     2  1187
      is   495    42   113     .     .     .     .   650

      Next-symbol probabilities (× 99):

                   o     c     t     i     k     f   TOT
         ----- ----- ----- ----- ----- ----- ----- -----
      cs     4     7     9    79     .     .     .    99
      is    75     6    17     .     .     .     .    99

  They are similar, except that `cs' is often followed by `t', whereas
  `is' is often terminal and is never followed by `t'.  (Not surprising,
  since `t' only appears after `c' in this corpus.)
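
  One way to put a number on how much the letter before the `s' matters is
  the total variation distance between the two next-symbol distributions.
  This measure is not used elsewhere in these notes; it is offered only as an
  illustration.  The counts are copied from the table above, with the first
  column taken to be the word break.

```python
# What follows `cs' and `is'.  Columns: word break, o, c, t, i, k, f.
AFTER_CS = [45, 86, 109, 943, 1, 1, 2]
AFTER_IS = [495, 42, 113, 0, 0, 0, 0]


def normalize(counts):
    """Turn counts into a probability distribution."""
    total = sum(counts)
    return [n / total for n in counts]


def total_variation(p, q):
    """Half the L1 distance: 0 for identical distributions, 1 for disjoint ones."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


tv = total_variation(normalize(AFTER_CS), normalize(AFTER_IS))
print(round(tv, 2))
```

  The distance comes out at about 0.8, most of it from the `t' and word-break
  columns.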
  
  But OK, let's replace `cs' by `s' and `is' by `r':

    cat j.wds \
      | sed -f fsg2jsa.sed \
      > bio-j-hoc.wds

    cat bio-j-hoc.wds | sort | uniq > bio-j-hoc.dic
      
    cat bio-j-hoc.wds | sort | uniq -c | sort +0 -1nr > bio-j-hoc.frq
    
     lines   words     bytes file        
    ------ ------- --------- ------------
      7216    7216     44840 bio-j-hoc.wds
      1761    1761     13898 bio-j-hoc.dic
    
    cat bio-j-hoc.wds \
      | egrep '^[a-z67+^]*$' \
      > bio-j-hoc-gut.wds
  
    cat bio-j-hoc.dic \
      | egrep '^[a-z67+^]*$' \
      > bio-j-hoc-gut.dic
      
    bool 1-2 bio-j-hoc.dic bio-j-hoc-gut.dic \
      > bio-j-hoc-bad.dic
      
     lines   words     bytes file        
    ------ ------- --------- ------------
      5427    5427     34110 bio-j-hoc-gut.wds
      1083    1083      7646 bio-j-hoc-gut.dic
       678     678      6252 bio-j-hoc-bad.dic

  Digraph statistics:
  
    cat bio-j-hoc-gut.wds \
      | count-digraph-freqs 

    Digraph counts:

                  q     o     c     s     y     g     x     t     i     r     f     h     p     k     v     j   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .  1398  1146   810   865   104   431   310     .     .    86    94   112     1    70     .     .  5427
      q     1     .  1383    18     2     1     .     .     .     .     .     1     .     .     .     .     .  1406
      o    19     5     1    40     8     3    18  1139     .    12   215  1190   455     8    60     .     5  3178
      c     1     .   150   964    36   731  1756     .  1201  1366     1   245   134     2    25     .     .  6612
      s    45     .    86    92    10     4     3     1   943     .     .     2     .     .     1     .     .  1187
      y  3161     2     3    17    23     .     9     7     .     .     4    46    26     2     .     .     1  3301
      g    52     .    47   403    35  1860     1     5     .     .     1     4     .     .     .     .     .  2408
      x  1101     1   116   262   126    98    59     3     .     .     2   183    18     4     6     .     .  1979
      t     4     .    73  1953     4   243   120    14     .     .     3    14    10     1     3     .     .  2442
      i     4     .     2     2     4     .     2   493     .   886   338     2     .     .     .   497    34  2264
      r   495     .    42    69    14    27     3     .     .     .     .     .     .     .     .     .     .   650
      f     6     .    47  1370    21   151     1     5   180     .     .     .     .     .     .     .     .  1781
      h     3     .    41   513    21    70     2     2   103     .     .     .     .     .     .     .     .   755
      p     .     .     2    14     1     .     .     .     1     .     .     .     .     .     .     .     .    18
      k     2     .    38    85    17     6     3     .    14     .     .     .     .     .     .     .     .   165
      v   493     .     1     .     .     3     .     .     .     .     .     .     .     .     .     .     .   497
      j    40     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    40
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  5427  1406  3178  6612  1187  3301  2408  1979  2442  2264   650  1781   755    18   165   497    40 34110

  There is something funny about the `t'.  I must try to either (a) identify it with `c',
  or (b) join it with the preceding `c' or `s' as a single letter.  Since most `t's
  have been misidentified as `c's, it is safer to do (a).
  
97-07-08 stolfi
===============

  I decided to join `iv' to make `w' and identify `t' with `c':
  
    cat j.wds \
      | sed -f fsg2jsa.sed \
      > bio-j-hoc.wds

    cat bio-j-hoc.wds | sort | uniq > bio-j-hoc.dic
      
    cat bio-j-hoc.wds | sort | uniq -c | sort +0 -1nr > bio-j-hoc.frq
    
    cat bio-j-hoc.wds \
      | egrep '^[a-z67+^]*$' \
      > bio-j-hoc-gut.wds
  
    cat bio-j-hoc.dic \
      | egrep '^[a-z67+^]*$' \
      > bio-j-hoc-gut.dic
      
    cat bio-j-hoc-gut.wds | sort | uniq -c | sort +0 -1nr > bio-j-hoc-gut.frq
    
    bool 1-2 bio-j-hoc.dic bio-j-hoc-gut.dic \
      > bio-j-hoc-bad.dic
      
     lines   words     bytes file        
    ------ ------- --------- ------------
      7216    7216     44287 bio-j-hoc.wds
      1712    1712     13418 bio-j-hoc.dic
      5427    5427     33613 bio-j-hoc-gut.wds
      1035    1035      7223 bio-j-hoc-gut.dic
       677     677      6195 bio-j-hoc-bad.dic

  Digraph statistics:
  
    cat bio-j-hoc-gut.wds \
      | count-digraph-freqs 

    Digraph counts:

                  o     s     y     c     g     x     r     f     p     h     k     q     j     w     i   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .  1146   865   104   810   431   310    86    94     1   112    70  1398     .     .     .  5427
      o    19     1     8     3    40    18  1139   215  1190     8   455    60     5     5     2    10  3178
      s    45    86    10     4  1035     3     1     .     2     .     .     1     .     .     .     .  1187
      y  3161     3    23     .    17     9     7     4    46     2    26     .     2     1     .     .  3301
      c     5   223    40   974  4118  1876    14     4   259     3   144    28     .     .     4  1362  9054
      g    52    47    35  1860   403     1     5     1     4     .     .     .     .     .     .     .  2408
      x  1101   116   126    98   262    59     3     2   183     4    18     6     1     .     .     .  1979
      r   495    42    14    27    69     3     .     .     .     .     .     .     .     .     .     .   650
      f     6    47    21   151  1550     1     5     .     .     .     .     .     .     .     .     .  1781
      p     .     2     1     .    15     .     .     .     .     .     .     .     .     .     .     .    18
      h     3    41    21    70   616     2     2     .     .     .     .     .     .     .     .     .   755
      k     2    38    17     6    99     3     .     .     .     .     .     .     .     .     .     .   165
      q     1  1383     2     1    18     .     .     .     1     .     .     .     .     .     .     .  1406
      j    40     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    40
      w   493     1     .     3     .     .     .     .     .     .     .     .     .     .     .     .   497
      i     4     2     4     .     2     2   493   338     2     .     .     .     .    34   491   395  1767
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  5427  3178  1187  3301  9054  2408  1979   650  1781    18   755   165  1406    40   497  1767 33613
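
  The count-digraph-freqs script is not listed here. Judging from the
  table, each word seems to be padded with a blank on both sides, so that
  the blank row counts word-initial letters, the blank column counts
  word-final letters, and the grand total equals letters plus words
  (33613, which is also the byte count of bio-j-hoc-gut.wds). A minimal
  Python sketch under that assumption (the function name is hypothetical):

```python
from collections import Counter

def count_digraphs(words):
    """Count adjacent character pairs, using ' ' as the word boundary."""
    counts = Counter()
    for word in words:
        padded = " " + word + " "
        # Each word of length k contributes k + 1 pairs.
        for left, right in zip(padded, padded[1:]):
            counts[left, right] += 1
    return counts

# Toy input, not taken from the real word file.
counts = count_digraphs(["qo", "qoy"])
print(counts[" ", "q"], counts["q", "o"], counts["o", " "], sum(counts.values()))
```

  Row and column totals are then just sums over the second or first member
  of each pair.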

  I computed a "strangeness number" by the formula
  
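      function strangeness(n, xk, yk, xyk,    fx, fy, fxy, fmax, fexp, fmin, tmax, tmin, tsum, texp)
      {
        # The extra parameters after the gap are awk's idiom for local variables.
        if ((xk == 0) || (yk == 0)) 
          { return 0 }
        else
          { fx = xk/n;                    # frequency of "x"
            fy = yk/n;                    # frequency of "y"
            fxy = xyk/n;                  # observed frequency of "xy"
            fmax = (fx < fy ? fx : fy);   # largest possible "xy" frequency
            fexp = fx*fy;                 # "xy" frequency expected if independent
            fmin = 0;                     # smallest possible "xy" frequency
            if (fxy <= fmin)
              { return -1 }
            else if (fxy >= fmax)
              { return +1 }
            else
              { tmax = (fmax - fxy)/(fmax - fexp);
                tmin = (fxy - fmin)/(fexp - fmin);
                tsum = (log(tmin) - log(tmax))/log(2.0);
                # Both branches compute tanh(tsum) without overflowing exp().
                if ( tsum > 0 )
                  { texp = exp(-2*tsum); return (1 - texp)/(1 + texp) }
                else
                  { texp = exp( 2*tsum); return (texp - 1)/(texp + 1) }
              }
          }
      }
      
      function normalness(n, xk, yk, xyk,    str)
      { 
        # 1 when the pair count is exactly as expected, 0 at either extreme.
        str = strangeness(n, xk, yk, xyk);
        return 1 - str*str
      }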
  
  where n is the total number of pairs tested, xk the number of "x" occurrences,
  yk the number of "y" occurrences, and xyk the number of "xy" pairs.
  The result is 0 if xyk equals the expected number (xk*yk/n), +1 if it is the
  maximum possible = min(xk,yk), and -1 if it is the minimum possible (0). 
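
  As a cross-check, here is a Python port of the two functions, with one
  entry of the digraph table worked through. The port relies on the fact
  that the two exp() branches of the awk code both evaluate tanh(tsum).
  Taking xk as the row total and yk as the column total is an assumption
  about how the tables were produced.

```python
import math

def strangeness(n, xk, yk, xyk):
    """-1 if "xy" never occurs, 0 if it occurs as expected, +1 if it is maximal."""
    if xk == 0 or yk == 0:
        return 0.0
    fx, fy, fxy = xk / n, yk / n, xyk / n
    fmax = min(fx, fy)      # largest possible pair frequency
    fexp = fx * fy          # pair frequency expected under independence
    if fxy <= 0:
        return -1.0
    if fxy >= fmax:
        return 1.0
    tmax = (fmax - fxy) / (fmax - fexp)
    tmin = fxy / fexp
    # log2(tmin) - log2(tmax), squashed into (-1, +1) by tanh
    return math.tanh(math.log2(tmin / tmax))

def normalness(n, xk, yk, xyk):
    s = strangeness(n, xk, yk, xyk)
    return 1 - s * s

# `q' followed by `o': n = 33613 pairs, 1406 `q', 3178 `o', 1383 `qo'.
print(strangeness(33613, 1406, 3178, 1383))
# A pair that occurs exactly as often as expected (toy numbers).
print(strangeness(100, 50, 50, 25))
```

  The first value comes out very close to +1, in line with the 99 shown
  for `q' followed by `o' in the strangeness table; the second is 0.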

  Here is the table, scaled from [-1..+1] to [01..99]:

    Strangeness (× 99):

            s     y     c     o           g     f     p     h     k     q     x     r     j     w     i   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      s     1     .    99    30     1     .     .     .     .     1     .     .     .     .     .     .    50
      y     1     .     .     .    99     .     2    59     4     .     .     .     .     2     .     .    50
      c     .    58    90     1     .    99    10    14    21    15     .     .     .     .     .    99    50
      o     .     .     .     .     .     .    99    99    99    98     .    99    98    70     .     .    50
           99     1    10    95     .    58     3     3    42    97    99    47    33     .     .     .    50
      g     6    99    15     1     .     .     .     .     .     .     .     .     .     .     .     .    50
      f     4    38    99     2     .     .     .     .     .     .     .     .     .     .     .     .    50
      p    79     .    99    62     .     .     .     .     .     .     .     .     .     .     .     .    50
      h    33    45    99    15     .     .     .     .     .     .     .     .     .     .     .     .    50
      k    95     4    97    94     .     2     .     .     .     .     .     .     .     .     .     .    50
      q     .     .     .    99     .     .     .     .     .     .     .     .     .     .     .     .    50
      x    86    11     7    18    99     6    84    98     6    19     .     .     .     .     .     .    50
      r    19     6     4    23    99     .     .     .     .     .     .     .     .     .     .     .    50
      j     .     .     .     .    99     .     .     .     .     .     .     .     .     .     .     .    50
      w     .     .     .     .    99     .     .     .     .     .     .     .     .     .     .     .    50
      i     .     .     .     .     .     .     .     .     .     .     .    98    99    99    99    98    50
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT    50    50    50    50    50    50    50    50    50    50    50    50    50    50    50    50 33613

    Normalness (× 99):

                  x     y     o     s     c     g     k     p     f     h     r     j     i     w     q   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .    99     2    16     .    37    96     8    12    10    97    89     .     .     .     .    99
      x     2     .    38    59    47    27    24    61     5    50    23     .     .     .     .     .    99
      y     .     .     .     .     3     .     .     .    95     6    15     .     6     .     .     .    99
      o     .     .     .     .     .     .     .     3     1     .     .     4    81     .     .     .    99
      s     4     .     .    83     6     .     .     2     .     .     .     .     .     .     .     .    99
      c     .     .    96     4     .    31     1    52    49    35    67     .     .     1     .     .    99
      g     1     .     .     3    24    50     .     .     .     .     .     .     .     .     .     .    99
      k     .     .    17    17    14     7     6     .     .     .     .     .     .     .     .     .    99
      p     .     .     .    93    64     .     .     .     .     .     .     .     .     .     .     .    99
      f     .     .    94     8    14     .     .     .     .     .     .     .     .     .     .     .    99
      h     .     .    98    51    87     .     .     .     .     .     .     .     .     .     .     .    99
      r     .     .    24    71    60    14     .     .     .     .     .     .     .     .     .     .    99
      j     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    99
      i     .     2     .     .     .     .     .     .     .     .     .     .     .     3     .     .    99
      w     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    99
      q     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    99
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT    99    99    99    99    99    99    99    99    99    99    99    99    99    99    99    99 33613

  Below are all the `qc' words in the file.  They look like misreadings of popular `qo' words.
  
      egrep 'qc' bio-j-hoc-gut.frq

        qccgy qccgy qcfccgy qcfccgy qcccgy qccgccy qccy qcgy qchcccg
        qchccy qchcgy qchcgys qchcix qchcy qchy qci qcixox qcy
  
97-07-09 stolfi
===============

  Summarizing, so far it seems that breaking down all characters into strokes was a very good idea.
  It led (somewhat indirectly) to two discoveries: that the difference between Guy2 <t> and <c>/<e> is not 
  important, and highly contaminated by error; and that Guy2 `a' is probably not a letter --- it is
  a `c' stroke (possibly half of the preceding letter) accidentally connected to an `i' stroke 
  (probably the beginning of the next letter). 

  Looking at the above tables, it is now almost certain that `sc' and `qo' are
  letters on their own.  (Note that `sc' is represented as [2C], [2A], [S],
  [2T] in the interlinear file.)
  
  In other words, the plume on the <c'> is not really attached to the <c> but to
  the following letter, which is always a `c' stroke.  This may be an
  explanation for the ligature in [S] = <c'-t>, and the reported <c'-a>
  ligature.

  Summarizing, I am now going to use the following FSG -> JSA pre-encoding
    
      IIIK -> iiiij   IE -> iix   A -> ci   N -> iiu  
      IIIL -> iiiiu   IR -> iis   C -> c    O -> o   
      IIIR -> iiiis   IK -> iij   D -> lj   P -> ag   
      IIIE -> iiiix   2 -> cs     E -> ix   R -> is   
      IIE -> iiix     4 -> a      F -> lg   S -> csc  
      IIR -> iiis     6 -> cj     G -> cy   T -> cc  
      IIK -> iiij     7 -> ig     H -> aj   V -> ^   
      HZ -> cajc      8 -> cg     I -> i    Y -> +   
      PZ -> cagc                  K -> ij         
      DZ -> cljc                  L -> iu   
      FZ -> clgc                  M -> iiiu 
   
  followed by the JSA -> ad-hoc post-encoding:
  
      sc -> s    ij -> 7    ig -> 8    aj -> H    a -> 4 (if unpaired)
      ao -> A    ix -> e    cg -> 8    ag -> H
                 iu -> v    cy -> 9    lj -> H
                 is -> r               lg -> H
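
  To make the pre-encoding concrete, here is a minimal Python sketch that
  applies the FSG -> JSA table above. It is not the fsg2jsa script. The
  longest-match-first rule and the pass-through of characters missing from
  the table are assumptions, and the function name is hypothetical.

```python
# FSG -> JSA table, copied from the pre-encoding listed above.
FSG_TO_JSA = {
    "IIIK": "iiiij", "IIIL": "iiiiu", "IIIR": "iiiis", "IIIE": "iiiix",
    "IIE": "iiix", "IIR": "iiis", "IIK": "iiij",
    "HZ": "cajc", "PZ": "cagc", "DZ": "cljc", "FZ": "clgc",
    "IE": "iix", "IR": "iis", "IK": "iij",
    "2": "cs", "4": "a", "6": "cj", "7": "ig", "8": "cg",
    "A": "ci", "C": "c", "D": "lj", "E": "ix", "F": "lg", "G": "cy",
    "H": "aj", "I": "i", "K": "ij", "L": "iu", "M": "iiiu",
    "N": "iiu", "O": "o", "P": "ag", "R": "is", "S": "csc", "T": "cc",
    "V": "^", "Y": "+",
}
MAX_LEN = max(len(code) for code in FSG_TO_JSA)

def fsg_to_jsa(text):
    """Recode FSG text into JSA strokes, trying the longest FSG code first."""
    out = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in FSG_TO_JSA:
                out.append(FSG_TO_JSA[chunk])
                i += size
                break
        else:
            # Assumption: characters not in the table are copied unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)

print(fsg_to_jsa("OPZ8G"))   # o + cagc + cg + cy
print(fsg_to_jsa("IIIK"))    # iiiij
```

  Longest match matters only for the codes ending in Z; the I-groups give
  the same strokes whether they are recoded as a group or letter by letter.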

  Moreover, I am going to use this encoding before preparing the consensus transcription.
  The consensus-maker will have to be sort of a dynamic programming algorithm...
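
  The kind of dynamic programming meant here can be sketched in Python as
  follows. This is only an illustration for two readings of the same line,
  not the make-consensus-interlin script: it aligns the readings by
  minimum edit distance and keeps the characters on which they agree. The
  use of "!" as alignment filler and "*" for disagreements is an
  assumption, and the function names are hypothetical.

```python
def align(a, b, gap="!"):
    """Align strings a and b by minimum edit distance, padding with gap."""
    n, m = len(a), len(b)
    # cost[i][j] = edit distance between a[:i] and b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append(gap); i -= 1
        else:
            out_a.append(gap); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def consensus(a, b, unknown="*"):
    """Keep characters where the aligned readings agree, else mark unknown."""
    xa, xb = align(a, b)
    return "".join(p if p == q else unknown for p, q in zip(xa, xb))

print(align("cccgy", "ccgy"))
print(consensus("cccgy", "ccgy"))
```

  A real consensus-maker would also have to handle more than two
  transcribers and the markup inside each line.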

  OK, I coded the dynamic consensus-maker, and modified the script 
  fsg2jsa to work on the interlinear file.  So:
  
    cat bio-m-evt.evt \
      | fsg2jsa \
      > bio-m-jsa-bug.evt
      
  Next I extracted the training dataset, and generated a new
  set of correction patterns from it:

    cat bio-m-jsa-bug.evt \
      | egrep '^<.*;[FC]> ' \
      | sed \
          -e 's/<.*;[FC]> */  /g' \
          -e 's/{[^}]*}//g' \
      | grep -v '[*]' \
      > .train.txt

     lines   words     bytes file        
    ------ ------- --------- ------------
      1470    1470    115821 .train.txt

    cat .train.txt \
      | generate-fix-patterns -vMINOCC=10 \
      > .fixit.sed
       
      lines   words     bytes file        
    ------ ------- --------- ------------
       596     716      9932 .fixit.sed
       
  Next I generated the consensus interlinear:
  
    cat bio-m-jsa-bug.evt \
      | make-consensus-interlin \
      > bio-m-jsa.evt
      
  I extracted the consensus text from it, and applied the 
  automatic corrector:
  
    cat bio-m-jsa.evt \
      | egrep '^<.*;J> ' \
      | sed \
          -e 's/{[^}]*}//g' \
          -e 's/[\!]//g' \
      > bio-j-jsa-raw.evt

    cat bio-j-jsa-raw.evt \
      | sed -f .fixit.sed \
      > bio-j-jsa-fix.evt
      
  I wrote a script "extract-words-from-interlin" that extracts the words from
  the consensus file, remaps them through an arbitrary encoding,
  extracts the dictionary, and runs the digraph statistics: 
  
    ------------------------------
    ------------------------------
  
    extract-words-from-interlin \
        -recode jsa2hoc \
        bio-j-jsa-fix.evt \
        jh-1
        
    cat bio-j-hoc-1-gut.wds \
      | count-digraph-freqs \
          -vchars=' c9po8idervqs74gy'
    
     lines   words     bytes file        
    ------ ------- --------- ------------
      7358    7358     46402 bio-j-hoc-1.wds
      1553    1553     14124 bio-j-hoc-1.dic
      5873    5873     36448 bio-j-hoc-1-gut.wds
      1001    1001      7199 bio-j-hoc-1-gut.dic
       552     552      6925 bio-j-hoc-1-bad.dic
     16337   16337    111098 total
    
    Digraph counts:

                  c     9     p     o     8     i     d     e     r     v     q     s     7     4     g     y   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .  1780   128   359  1206   467    68     .   322    28     .  1493     .     .    22     .     .  5873
      c     4  3528  1003   473   187  1875  1548  1129    11     4     .     .   159     .     .     .     .  9921
      9  3238    45     .    80     3    11     2     .    10     2     .     4     .     1     1     .     .  3397
      p    14  2629   245     .   142     7     .     .     8     .     .     .     .     .     .     .     .  3045
      o    15    26     1   605     .    12    44     .   972   195     .     5     .     6     .     .     .  1881
      8    58   475  1888     5    48     1     .     .     6     1     .     .     .     .     .     .     .  2482
      i     5     8     .     3     1     3  1558   130   482   326   828     .     .    40     .     .     .  3384
      d     2   937    24    10    34    36   160    27     4     1     .     .     .     .     .     8    43  1286
      e  1035   452    94   230   121    61     .     .     5     2     .     1     .     .     .     .     .  2001
      r   519     .     .     1    46     .     .     .     .     .     .     .     .     .     .     .     .   566
      v   824     .     3     .     1     .     .     .     .     .     .     .     .     .     .     .     .   828
      q     7    23     1  1273     1     8     4     .   179     7     .     1     .     .     .     .     .  1504
      s    63     .     .     5    90     .     .     .     1     .     .     .     .     .     .     .     .   159
      7    46     .     .     .     1     .     .     .     .     .     .     .     .     .     .     .     .    47
      4     1    18     3     1     .     .     .     .     .     .     .     .     .     .     .     .     .    23
      g     1     .     7     .     .     .     .     .     .     .     .     .     .     .     .     .     .     8
      y    41     .     .     .     .     1     .     .     1     .     .     .     .     .     .     .     .    43
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  5873  9921  3397  3045  1881  2482  3384  1286  2001   566   828  1504   159    47    23     8    43 36448