I decided to join `iv' to make `w' and identify `t' with `c':
  
    cat j.wds \
      | sed -f fsg2jsa.sed \
      > bio-j-hoc.wds

    cat bio-j-hoc.wds | sort | uniq > bio-j-hoc.dic
      
    cat bio-j-hoc.wds | sort | uniq -c | sort +0 -1nr > bio-j-hoc.frq
    
    cat bio-j-hoc.wds \
      | egrep '^[a-z67+^]*$' \
      > bio-j-hoc-gut.wds
  
    cat bio-j-hoc.dic \
      | egrep '^[a-z67+^]*$' \
      > bio-j-hoc-gut.dic
      
    cat bio-j-hoc-gut.wds | sort | uniq -c | sort +0 -1nr > bio-j-hoc-gut.frq
    
    bool 1-2 bio-j-hoc.dic bio-j-hoc-gut.dic \
      > bio-j-hoc-bad.dic
      
     lines   words     bytes file        
    ------ ------- --------- ------------
      7216    7216     44287 bio-j-hoc.wds
      1712    1712     13418 bio-j-hoc.dic
      5427    5427     33613 bio-j-hoc-gut.wds
      1035    1035      7223 bio-j-hoc-gut.dic
       677     677      6195 bio-j-hoc-bad.dic

  Digraph statistics:
  
    cat bio-j-hoc-gut.wds \
      | count-digraph-freqs 

    Digraph counts:

                  o     s     y     c     g     x     r     f     p     h     k     q     j     w     i   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .  1146   865   104   810   431   310    86    94     1   112    70  1398     .     .     .  5427
      o    19     1     8     3    40    18  1139   215  1190     8   455    60     5     5     2    10  3178
      s    45    86    10     4  1035     3     1     .     2     .     .     1     .     .     .     .  1187
      y  3161     3    23     .    17     9     7     4    46     2    26     .     2     1     .     .  3301
      c     5   223    40   974  4118  1876    14     4   259     3   144    28     .     .     4  1362  9054
      g    52    47    35  1860   403     1     5     1     4     .     .     .     .     .     .     .  2408
      x  1101   116   126    98   262    59     3     2   183     4    18     6     1     .     .     .  1979
      r   495    42    14    27    69     3     .     .     .     .     .     .     .     .     .     .   650
      f     6    47    21   151  1550     1     5     .     .     .     .     .     .     .     .     .  1781
      p     .     2     1     .    15     .     .     .     .     .     .     .     .     .     .     .    18
      h     3    41    21    70   616     2     2     .     .     .     .     .     .     .     .     .   755
      k     2    38    17     6    99     3     .     .     .     .     .     .     .     .     .     .   165
      q     1  1383     2     1    18     .     .     .     1     .     .     .     .     .     .     .  1406
      j    40     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    40
      w   493     1     .     3     .     .     .     .     .     .     .     .     .     .     .     .   497
      i     4     2     4     .     2     2   493   338     2     .     .     .     .    34   491   395  1767
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  5427  3178  1187  3301  9054  2408  1979   650  1781    18   755   165  1406    40   497  1767 33613
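  The boundary pseudo-character (the `.' row and column) is counted once before
  and once after each word, so the grand total is sum(len(w)+1) over all words,
  which is why it equals the 33613-byte size of bio-j-hoc-gut.wds.  A minimal
  Python sketch of such a counter (my reconstruction, not the actual
  count-digraph-freqs script):

```python
from collections import Counter

def digraph_counts(words, boundary='.'):
    """Count adjacent character pairs in each word, with a boundary
    pseudo-character before the first and after the last letter."""
    pairs = Counter()
    for w in words:
        padded = boundary + w + boundary
        for a, b in zip(padded, padded[1:]):
            pairs[(a, b)] += 1
    return pairs

# Example: two of the `qc' words from the frequency list below.
counts = digraph_counts(['qccgy', 'qcy'])
```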

  I computed a "strangeness number" by the formula
  
      function strangeness(n, xk, yk, xyk)
      {
        if ((xk == 0) || (yk == 0)) 
          { return 0 }
        else
          { fx = xk/n;
            fy = yk/n;
            fxy = xyk/n;
            fmax = (fx < fy ? fx : fy);
            fexp = fx*fy;
            fmin = 0;
            if (fxy <= fmin)
              { return -1 }
            else if (fxy >= fmax)
              { return +1 }
            else
              { tmax = (fmax - fxy)/(fmax - fexp);
                tmin = (fxy - fmin)/(fexp - fmin);
                tsum = (log(tmin) - log(tmax))/log(2.0);
                if ( tsum > 0 )
                  { texp = exp(-2*tsum); return (1 - texp)/(1 + texp) }
                else
                  { texp = exp( 2*tsum); return (texp - 1)/(texp + 1) }
              }
          }
      }
      
      function normalness(n, xk, yk, xyk)
      { 
        str = strangeness(n, xk, yk, xyk);
        return 1 - str*str
      }
  
  where n is the total number of pairs tested, xk the number of "x" occurrences,
  yk the number of "y" occurrences, and xyk the number of "xy" pairs.
  The result is 0 if xyk is the expected number, +1 if it is the maximum 
  possible = min(xk,yk), and -1 if it is the minimum possible (0). 

  Here are the tables, with strangeness scaled from [-1..+1] and normalness
  from [0..1] to [01..99]:

    Strangeness (× 99):

            s     y     c     o           g     f     p     h     k     q     x     r     j     w     i   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      s     1     .    99    30     1     .     .     .     .     1     .     .     .     .     .     .    50
      y     1     .     .     .    99     .     2    59     4     .     .     .     .     2     .     .    50
      c     .    58    90     1     .    99    10    14    21    15     .     .     .     .     .    99    50
      o     .     .     .     .     .     .    99    99    99    98     .    99    98    70     .     .    50
           99     1    10    95     .    58     3     3    42    97    99    47    33     .     .     .    50
      g     6    99    15     1     .     .     .     .     .     .     .     .     .     .     .     .    50
      f     4    38    99     2     .     .     .     .     .     .     .     .     .     .     .     .    50
      p    79     .    99    62     .     .     .     .     .     .     .     .     .     .     .     .    50
      h    33    45    99    15     .     .     .     .     .     .     .     .     .     .     .     .    50
      k    95     4    97    94     .     2     .     .     .     .     .     .     .     .     .     .    50
      q     .     .     .    99     .     .     .     .     .     .     .     .     .     .     .     .    50
      x    86    11     7    18    99     6    84    98     6    19     .     .     .     .     .     .    50
      r    19     6     4    23    99     .     .     .     .     .     .     .     .     .     .     .    50
      j     .     .     .     .    99     .     .     .     .     .     .     .     .     .     .     .    50
      w     .     .     .     .    99     .     .     .     .     .     .     .     .     .     .     .    50
      i     .     .     .     .     .     .     .     .     .     .     .    98    99    99    99    98    50
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT    50    50    50    50    50    50    50    50    50    50    50    50    50    50    50    50 33613

    Normalness (× 99):

                  x     y     o     s     c     g     k     p     f     h     r     j     i     w     q   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .    99     2    16     .    37    96     8    12    10    97    89     .     .     .     .    99
      x     2     .    38    59    47    27    24    61     5    50    23     .     .     .     .     .    99
      y     .     .     .     .     3     .     .     .    95     6    15     .     6     .     .     .    99
      o     .     .     .     .     .     .     .     3     1     .     .     4    81     .     .     .    99
      s     4     .     .    83     6     .     .     2     .     .     .     .     .     .     .     .    99
      c     .     .    96     4     .    31     1    52    49    35    67     .     .     1     .     .    99
      g     1     .     .     3    24    50     .     .     .     .     .     .     .     .     .     .    99
      k     .     .    17    17    14     7     6     .     .     .     .     .     .     .     .     .    99
      p     .     .     .    93    64     .     .     .     .     .     .     .     .     .     .     .    99
      f     .     .    94     8    14     .     .     .     .     .     .     .     .     .     .     .    99
      h     .     .    98    51    87     .     .     .     .     .     .     .     .     .     .     .    99
      r     .     .    24    71    60    14     .     .     .     .     .     .     .     .     .     .    99
      j     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    99
      i     .     2     .     .     .     .     .     .     .     .     .     .     .     3     .     .    99
      w     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    99
      q     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    99
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT    99    99    99    99    99    99    99    99    99    99    99    99    99    99    99    99 33613

  Listed below are all the `qc' words in the file.  They look like misreadings of popular `qo' words.
  
      egrep 'qc' bio-j-hoc-gut.frq

        qccgy qccgy qcfccgy qcfccgy qcccgy qccgccy qccy qcgy qchcccg
        qchccy qchcgy qchcgys qchcix qchcy qchy qci qcixox qcy
  
07-07-09 stolfi
===============

  Summarizing, so far it seems that breaking down all characters into strokes was a very good idea.
  It led (somewhat indirectly) to two discoveries: that the difference between Guy2 <t> and <c>/<e> is not 
  important, and highly contaminated by error; and that Guy2 `a' is probably not a letter --- it is
  a `c' stroke (possibly half of the preceding letter) accidentally connected to an `i' stroke 
  (probably the beginning of the next letter). 

  Looking at the above tables, it is now almost certain that `sc' and `qo' are
  letters on their own.  (Note that `sc' is represented as [2C], [2A], [S],
  [2T] in the interlinear file.)
  
  In other words, the plume on the <c'> is not really attached to the <c> but to
  the following letter, which is always a `c' stroke.  This may be an
  explanation for the ligature in [S] = <c'-t>, and the reported <c'-a>
  ligature.

  Summarizing, I am now going to use the following FSG -> JSA pre-encoding
    
      IIIK -> iiiij   IE -> iix   A -> ci   N -> iiu  
      IIIL -> iiiiu   IR -> iis   C -> c    O -> o   
      IIIR -> iiiis   IK -> iij   D -> lj   P -> ag   
      IIIE -> iiiix   2 -> cs     E -> ix   R -> is   
      IIE -> iiix     4 -> a      F -> lg   S -> csc  
      IIR -> iiis     6 -> cj     G -> cy   T -> cc  
      IIK -> iiij     7 -> ig     H -> aj   V -> ^   
      HZ -> cajc      8 -> cg     I -> i    Y -> +   
      PZ -> cagc                  K -> ij         
      DZ -> cljc                  L -> iu   
      FZ -> clgc                  M -> iiiu 
   
  followed by the JSA -> ad-hoc post-encoding:
  
      sc -> s    ij -> 7    ig -> 8    aj -> H    a -> 4 (if unpaired)
      ao -> A    ix -> e    cg -> 8    ag -> H
                 iu -> v    cy -> 9    lj -> H
                 is -> r               lg -> H
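  Both stages are plain string rewriting, but the multi-letter codes must be
  matched longest-first (IIIK before IK before I).  A Python sketch of that
  matching discipline, using only an excerpt of the pre-encoding table above:

```python
import re

PRE = {  # FSG -> JSA pre-encoding (excerpt of the table above)
    'IIIK': 'iiiij', 'IIE': 'iiix', 'IE': 'iix', 'IK': 'ij',
    'A': 'ci', 'C': 'c', 'O': 'o', 'T': 'cc', 'G': 'cy', '8': 'cg',
}

def recode(text, table):
    # Sort codes by length, longest first, so that e.g. IIIK is
    # matched as one code and not read as I+I+I+K.
    pat = re.compile('|'.join(
        sorted(map(re.escape, table), key=len, reverse=True)))
    return pat.sub(lambda m: table[m.group(0)], text)
```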

  Moreover, I am going to use this encoding before preparing the consensus
  transcription.  The consensus-maker will have to be a sort of dynamic
  programming algorithm...

  OK, I coded the dynamic consensus-maker, and modified the script 
  fsg2jsa to work on the interlinear file.  So:
  
    cat bio-m-evt.evt \
      | fsg2jsa \
      > bio-m-jsa-bug.evt
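  The "dynamic programming" part is sequence alignment: to merge two versions
  of a line, align them by edit distance and keep the characters they agree on.
  A rough two-version sketch of that core idea (not the actual
  make-consensus-interlin, which I assume merges all versions at once):

```python
def align(a, b):
    """Classic edit-distance DP with traceback; returns the two
    strings aligned, padding gaps with '-'."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): d[i][0] = i
    for j in range(m + 1): d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (a[i-1] != b[j-1]))
    ra, rb = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (a[i-1] != b[j-1]):
            ra.append(a[i-1]); rb.append(b[j-1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            ra.append(a[i-1]); rb.append('-'); i -= 1
        else:
            ra.append('-'); rb.append(b[j-1]); j -= 1
    return ''.join(reversed(ra)), ''.join(reversed(rb))

def consensus2(a, b):
    """Keep characters the two versions agree on; mark the rest '*'."""
    xa, xb = align(a, b)
    return ''.join(p if p == q else '*' for p, q in zip(xa, xb))
```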
      
  Now I extracted the training dataset, and generated a new
  set of correction patterns from it:

    cat bio-m-jsa-bug.evt \
      | egrep '^<.*;[FC]> ' \
      | sed \
          -e 's/<.*;[FC]> */  /g' \
          -e 's/{[^}]*}//g' \
      | grep -v '[*]' \
      > .train.txt

     lines   words     bytes file        
    ------ ------- --------- ------------
      1470    1470    115821 .train.txt

    cat .train.txt \
      | generate-fix-patterns -vMINOCC=10 \
      > .fixit.sed
       
      lines   words     bytes file        
    ------ ------- --------- ------------
       596     716      9932 .fixit.sed
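  I am not reproducing generate-fix-patterns here; one plausible shape for
  such a pattern generator, purely illustrative and not the actual script,
  is to rewrite each rare trigram into the unique frequent trigram that
  differs from it only in the middle letter:

```python
from collections import Counter

def fix_patterns(lines, minocc=10, rare=1):
    """Illustrative only: emit sed substitutions mapping each rare
    trigram to the one frequent trigram a middle-letter change away."""
    tri = Counter(ln[i:i+3] for ln in lines for i in range(len(ln) - 2))
    common = {t for t, k in tri.items() if k >= minocc}
    rules = []
    for t, k in tri.items():
        if k > rare:
            continue
        # candidates: same first and last letter, different middle
        cands = [c for c in common if c[0] == t[0] and c[2] == t[2] and c != t]
        if len(cands) == 1:        # only act when the fix is unambiguous
            rules.append(f's/{t}/{cands[0]}/g')
    return rules
```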
       
  Next I generated the consensus interlinear, and ran the automatic 
  context-fixer above:
  
    cat bio-m-jsa-bug.evt \
      | make-consensus-interlin \
      > bio-m-jsa.evt
      
  I extracted the consensus text from it, and applied the 
  automatic corrector:
  
    cat bio-m-jsa.evt \
      | egrep '^<.*;J> ' \
      | sed \
          -e 's/{[^}]*}//g' \
          -e 's/[\!]//g' \
      > bio-j-jsa-raw.evt

    cat bio-j-jsa-raw.evt \
      | sed -f .fixit.sed \
      > bio-j-jsa-fix.evt
      
  I wrote a script "extract-words-from-interlin" that extracts the words 
  from the consensus file, remaps them through an arbitrary encoding,
  extracts the dictionary, and runs the digraph statistics: 
  
  
    extract-words-from-interlin \
        -recode jsa2hoc \
        bio-j-jsa-fix.evt \
        jh-1
        
    cat bio-j-hoc-1-gut.wds \
      | count-digraph-freqs \
          -vchars=' c9po8idervqs74gy'
    
     lines   words     bytes file        
    ------ ------- --------- ------------
      7358    7358     46402 bio-j-hoc-1.wds
      1553    1553     14124 bio-j-hoc-1.dic
      5873    5873     36448 bio-j-hoc-1-gut.wds
      1001    1001      7199 bio-j-hoc-1-gut.dic
       552     552      6925 bio-j-hoc-1-bad.dic
     16337   16337    111098 total
    
    Digraph counts:

                  c     9     p     o     8     i     d     e     r     v     q     s     7     4     g     y   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .  1780   128   359  1206   467    68     .   322    28     .  1493     .     .    22     .     .  5873
      c     4  3528  1003   473   187  1875  1548  1129    11     4     .     .   159     .     .     .     .  9921
      9  3238    45     .    80     3    11     2     .    10     2     .     4     .     1     1     .     .  3397
      p    14  2629   245     .   142     7     .     .     8     .     .     .     .     .     .     .     .  3045
      o    15    26     1   605     .    12    44     .   972   195     .     5     .     6     .     .     .  1881
      8    58   475  1888     5    48     1     .     .     6     1     .     .     .     .     .     .     .  2482
      i     5     8     .     3     1     3  1558   130   482   326   828     .     .    40     .     .     .  3384
      d     2   937    24    10    34    36   160    27     4     1     .     .     .     .     .     8    43  1286
      e  1035   452    94   230   121    61     .     .     5     2     .     1     .     .     .     .     .  2001
      r   519     .     .     1    46     .     .     .     .     .     .     .     .     .     .     .     .   566
      v   824     .     3     .     1     .     .     .     .     .     .     .     .     .     .     .     .   828
      q     7    23     1  1273     1     8     4     .   179     7     .     1     .     .     .     .     .  1504
      s    63     .     .     5    90     .     .     .     1     .     .     .     .     .     .     .     .   159
      7    46     .     .     .     1     .     .     .     .     .     .     .     .     .     .     .     .    47
      4     1    18     3     1     .     .     .     .     .     .     .     .     .     .     .     .     .    23
      g     1     .     7     .     .     .     .     .     .     .     .     .     .     .     .     .     .     8
      y    41     .     .     .     .     1     .     .     1     .     .     .     .     .     .     .     .    43
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  5873  9921  3397  3045  1881  2482  3384  1286  2001   566   828  1504   159    47    23     8    43 36448