Hacking at the Voynich manuscript - Side notes
019 Analyzing word frequencies per section

Last edited on 1999-07-28 13:12:14 by stolfi

[ Originally part of Notes/021; 
  First version done on 1998-04-28.
  Redone 1998-06-20 with fresher data.
  Redone 1998-07-02 with different dictionaries and axes.
  Redone 1999-01-30 with 1.6e6 majority transcription (Notes/045).
  Split off from Notes/021 on 1999-01-31.
]

1998-07-02 stolfi
=================

  The goal of this note is to compare the word distributions 
  among the various sections of the Voynich manuscript.
  
  [ NEEDS TO BE REDONE. A BUG IN lines-from-evt WOULD SPLIT WORDS AT "!"
    THUS GENERATING MANY FALSE n, y, and words ending in "ai". ]

I. EXTRACTING AND COUNTING WORDS

  The source file will be the majority version, with weirdos 
  mapped to "*" and other basic EVA chars, and chopped into 
  pages and subsections:

    ln -s ../045/subsecs-m text-subsecs
    ln -s ../045/pages-m text-pages
    
  However, I must filter those files, since they contain labels and
  other material besides plain text.  Labels are useful, but here
  they would distort the analysis: pages with more labels would
  seem to be in a different language than pages with few labels.

    ln -s ../../L16+H-eva
    
    cat L16+H-eva/unit16e6.txt \
      | gawk -v FS=":" '/./{print $2,$6}' \
      > unit-to-type.tbl
      
    cat unit-to-type.tbl | gawk '//{print $2}' | sort | uniq
    
      -
      circular-lines
      circular-text
      labels
      letters
      parags
      radial-lines
      starred-parags
      titles
      words
  
  We will prepare two sets of statistics, one using raw words ("-RAW")
  and one using word equivalence classes ("-EQV").

    word-to-class -describe

      word equivalence:
        map_sh_to_ch
        ignore_gallows_eyes
        join_ei
        equate_aoy
        collapse_ii
        equate_eights
        equate_pt
        erase_q
        crush_invalid_words
        append_tilde

  This mapping will hopefully reduce noise in several ways.  For one thing,
  it merges characters that look alike and are easily misread or
  confused because of image noise.  It also reduces the sampling
  error, by increasing the number of occurrences of each keyword in a
  page.  Finally, it neutralizes certain transcriber biases, such as
  the frequent confusion between "daiin" and "dain" in Friedman's
  transcription of Bio (as pointed out by John Grove on 1998-05-02).

  One thing it does not fix is inconsistent transcription of spaces. 
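
  To make the effect of the mapping concrete, here is a minimal Python
  sketch of a few of the rules listed above.  It is NOT the actual
  word-to-class script: the function name word_to_class and the reading
  of each rule are assumptions, inferred from the rule names and from
  the class names that appear in the tables of section II.  The rules
  join_ei, equate_eights, and crush_invalid_words are left out because
  their behavior cannot be inferred from this note.

```python
import re

def word_to_class(word: str) -> str:
    """Map an EVA word to a rough equivalence class (partial, assumed rules)."""
    w = word.replace("sh", "ch")                  # map_sh_to_ch
    w = w.translate(str.maketrans("kf", "tp"))    # ignore_gallows_eyes (assumed: k->t, f->p)
    w = w.replace("p", "t")                       # equate_pt
    w = w.translate(str.maketrans("ay", "oo"))    # equate_aoy (assumed: a, y -> o)
    w = re.sub(r"i+", "i", w)                     # collapse_ii: runs of i become one i
    w = w.replace("q", "")                        # erase_q
    return w + "~"                                # append_tilde

if __name__ == "__main__":
    for w in ["daiin", "dain", "shedy", "qokeedy", "chckhy"]:
        print(w, "->", word_to_class(w))
```

  Under these assumptions "daiin" and "dain" both map to the class
  doin~, and "chedy" and "shedy" both map to chedo~, which agrees with
  the class names listed in the EQV tables.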

  Creating a combined file of the source text for archiving:
  
     ( cd text-pages && cat `cat all.names | sed -e 's/$/.evt/'` ) \
       | select-units \
           -v types='parags,starred-parags,circular-lines,circular-text,radial-lines,titles' \
           -v table=unit-to-type.tbl \
       > all.evt

  Selecting the plain text: 

    foreach utype ( pages subsecs )
      foreach f ( `cat text-${utype}/all.names` )
        set ofile = "/tmp/${utype}-${f}.txt"
        echo ${ofile}
        cat text-${utype}/${f}.evt \
          | select-units \
              -v types='parags,starred-parags,circular-lines,circular-text,radial-lines,titles' \
              -v table=unit-to-type.tbl \
          | lines-from-evt | egrep '.' \
          > ${ofile}
      end
    end

  Extracting words and mapping them to classes:

    foreach ep ( word-to-clean.RAW word-to-class.EQV )
      set etag = ${ep:e}; set ecmd = ${ep:r}
      foreach utype ( pages subsecs )
        foreach f ( `cat text-${utype}/all.names` )
          set ofile = "/tmp/${utype}-${f}-${etag}.wds"
          echo ${ofile}
          cat /tmp/${utype}-${f}.txt \
            | words-from-evt | egrep '.' \
            | tr '*%' '??' \
            | ${ecmd} | egrep '.' \
            > ${ofile}
        end
      end
    end

  Counting words and computing relative frequencies:

    # mkdir -p RAW EQV
    # foreach etag ( RAW EQV )
    #   /bin/rm -rf ${etag}/wfreqs
    #   mkdir -p ${etag}/wfreqs
    #   foreach utype ( pages subsecs )
    #     set frdir = "${etag}/wfreqs/${utype}"
    #     mkdir -p ${frdir}
    #   end
    # end
    
    foreach etag ( RAW EQV )
      foreach utype ( pages subsecs )
        set frdir = "${etag}/wfreqs/${utype}"
        cp -p text-${utype}/all.names ${frdir}/
        foreach f ( `cat text-${utype}/all.names` )
          set ofile = "${frdir}/$f.frq"
          echo ${ofile}
          if ( -e ${ofile} ) mv ${ofile} ${ofile}~
          cat /tmp/${utype}-${f}-${etag}.wds \
            | sort | uniq -c | expand \
            | sort -b -k1,1nr \
            | compute-freqs \
            > ${ofile}
        end
      end
    end

    /bin/rm /tmp/{pages,subsecs}-*{,-RAW,-EQV}.{txt,wds}
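
  For readers without the helper scripts, the counting step can be
  sketched in Python as follows.  This is only an illustration: the
  function name word_freqs is hypothetical, and the "count, frequency,
  word" layout is an assumption, inferred from the later gawk commands
  that pick fields 1 and 3 of each .frq file.  Ties in count are broken
  alphabetically here, which may differ from the original ordering.

```python
from collections import Counter

def word_freqs(words):
    """Return (count, relative frequency, word) rows, most frequent first."""
    counts = Counter(words)
    total = sum(counts.values())
    # Sort by decreasing count; break ties alphabetically (an arbitrary choice).
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(n, n / total, w) for w, n in rows]

if __name__ == "__main__":
    sample = ["daiin", "ol", "daiin", "chol"]
    for n, f, w in word_freqs(sample):
        print(f"{n:7d} {f:7.5f} {w}")
```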

  Combining data by section instead of subsection:
  
    foreach etag ( RAW EQV )
      foreach sec ( `cat ${etag}/wfreqs/secs/all.names` )
        set ofile = "${etag}/wfreqs/secs/${sec}.frq"
        set ifiles = ( `cd ${etag}/wfreqs/subsecs/ && ls ${sec}.*.frq` )
        echo "$ifiles"
        if ( -e ${ofile} ) mv ${ofile} ${ofile}~
        (cd ${etag}/wfreqs/subsecs/ && cat ${ifiles} ) \
          | gawk '/./{print $1, $3;}' \
          | combine-counts \
          | sort -b -k1,1nr \
          | compute-freqs \
          > ${ofile}
      end
    end
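
  The merging step amounts to adding up the counts of each word over
  all the input files.  A minimal Python sketch, assuming that the
  input consists of "count word" lines (as printed by the gawk command
  above); the name combine_counts is borrowed from the script, but this
  is not its actual code:

```python
from collections import Counter

def combine_counts(lines):
    """Sum the counts of each word over lines of the form 'count word'."""
    totals = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        totals[fields[1]] += int(fields[0])
    return totals

if __name__ == "__main__":
    merged = combine_counts(["3 daiin", "1 ol", "", "2 daiin"])
    for word, n in merged.most_common():
        print(n, word)
```

  The merged counts must then be turned back into relative frequencies,
  since the frequencies of the individual subsections cannot simply be
  added or averaged when the subsections have different sizes.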

  Compute total frequencies:

    foreach etag ( RAW EQV )
      foreach utype ( pages subsecs secs )
        set fmt = "${etag}/wfreqs/${utype}/%s.frq"
        set frfiles = ( \
          `cat ${etag}/wfreqs/${utype}/all.names | gawk '/./{printf "'"${fmt}"'\n",$0;}'` \
        )
        echo ${frfiles}
        cat ${frfiles} \
          | gawk '/./{print $1, $3;}' \
          | combine-counts \
          | sort -b -k1,1nr \
          | compute-freqs \
          > ${etag}/wfreqs/${utype}/tot.frq
      end
    end  

II. TABULATING WORD FREQUENCIES PER SUBSECTION

    cat text-subsecs/all.names
    
  (Edited ${subsectags} manually to match presumed writing order.)
  
    set subsectags = ( \
      pha.1 pha.2 \
      hea.1 hea.2 \
      unk.1 unk.2 \
      str.1 \
      cos.1 cos.2 \
      zod.1 \
      cos.3 \
      str.2 \
      heb.2 heb.1 \
      bio.1 \
      unk.3 unk.4 unk.5 unk.6 unk.7 unk.8 \
    )
    
    echo $subsectags | tr ' ' '\012' | sort > .foo
    diff text-subsecs/all.names .foo

    foreach etag ( RAW EQV )
      tabulate-frequencies \
        -dir ${etag}/wfreqs/subsecs \
        -title "word" \
        tot ${subsectags}
    end
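
  The layout of the resulting tables can be imitated with the Python
  sketch below.  It is a guess at what tabulate-frequencies does, not
  its actual code: the function name tabulate, the scale factor of 9999
  taken from the table captions, the truncation to an integer, and the
  use of "." for zero entries are all assumptions based on the tables
  shown in this note.

```python
def tabulate(freqs, tags, words, scale=9999):
    """Format relative frequencies as a table, one row per word.

    freqs maps each tag (e.g. 'hea.1') to a dict {word: relative frequency}.
    """
    header = "".join(f"{t.replace('.', ''):>6}" for t in tags) + "  word"
    rule = "".join("  ----" for _ in tags) + "  ----"
    lines = [header, rule]
    for w in words:
        cells = []
        for t in tags:
            v = int(scale * freqs[t].get(w, 0.0))  # truncation is an assumption
            cells.append(f"{v if v > 0 else '.':>6}")
        lines.append("".join(cells) + "  " + w)
    return lines

if __name__ == "__main__":
    # Made-up frequencies, for illustration only.
    freqs = {
        "tot":   {"daiin": 0.0237, "ol": 0.0146},
        "hea.1": {"daiin": 0.0544},
    }
    print("\n".join(tabulate(freqs, ["tot", "hea.1"], ["daiin", "ol"])))
```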

  Frequencies of raw words (RAW/wfreqs/subsecs/all.cmp-frq)
  in each subsection (× 9999), minus the "unk"s:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
     236   572   364   543   230   119   108   140    89   158   120   215   230   120  daiin
     146   129   175    65   126    52     .    80    29   260   101   197    75   352  ol
     137    10     .     1     .     .     .    26    39    90   179   143   203   319  chedy
     123   172   119    29    46   132   108   120   277   271   173   179   199    54  aiin
     116    10     .     .     .    13     .     6    49   113   109    71   110   367  shedy
     107   140   231   297   126   105     .    93    19    33    63    71    30    21  chol
      97    97   175    66   115    52     .    53    19   361    59   215   192    98  or
      96    10    13    21    22   291   108   221   217   203   136    71   151    35  ar
      93   107    84    75    22    13     .    80    59    11   115    53    58   137  chey
      90   140    63    90    69    66    54   154    79   135    46   107   155   105  dar
      81     .     .     .     .     .     .     .     .    11   126    17    34   224  qokeedy
      81   107    77    14    22    79     .    33     9    22   146    53    20   125  qokeey
      76    21   126    45    57    13     .    80    49     .    80    17    24   144  shey
      74    53    84   148    11    66     .   127    89    45    11    71   124    79  dy
      72    32    35     8    34   132     .   154   287   113   127    17    37    35  al
      71    10     6    23     .    79     .     .     .    45   109   125    44   125  qokaiin
      71    10     .     .     .     .     .     .     .    22    54    17   134   238  qokedy
      70   280    49    58    34   158   108   140    79    22    27    35    82    99  dal
      64   183    84   133   357    79   162    67    89    56    10    53    72    21  s
      58     .     .     1     .    13     .     .     .    11    56    17    10   219  qokain
      56   118    77   199    80    26     .    33     .     .    16    17    20     1  chor
      56    32     6    37    11    13     .    20    39    22    88   107    93    48  okaiin
      52    10    13     2     .   132     .    46     .    33    38    89    17   163  qokal
      50    21    49   133    80    66     .    33    39    33    23    17    37    26  shol
      48    86    28   104    22     .     .    20     .    11    27    17    27    65  dain
      46    53    84    20    22     .     .    40    49    11    69    17    13    61  cheey
      45   151   161    26    80    39     .    26    29     .    51     .    20    45  cheol
      44    43   140    10    11    13    54    60    39     .    87     .    27    24  okeey
      42    32     6     2     .     .     .     .     .     .    28     .     3   175  qol
      40    43     6   138    11    26     .    60    39    22    10    53    24    13  chy
      40    10    13    39     .    26     .     6    89    45    69    53    20    17  otaiin
      40     .     .     .     .     .     .     6     9   124    56    35    68    71  otedy
      40    32    13     4     .    92    54     .     .    11    38   107    75    65  qokar
      39    10     .     8     .    13     .    26    19    67    37   323   120    33  chdy
      39    21    28    34    22    52     .     .     .    11    25   107    17    89  qoky
      39    86    13    30   103    39    54     6     9    11    38    17    58    54  saiin
      39    64    63    26    22    39     .    40     .    56    46     .    20    55  sheey
      37    21    13    21     .     .     .     6     .    11    36   107   103    61  chckhy
      37    21    35    20    11   132   108    26    39    67    38   125    61    32  okal
      37    21     .     7     .    52     .    33    59    79    50   107    51    35  otar
      37    10     .    49    11    79   486    80    69    33    19    71    48    33  y
      36    21    13    13    11   119    54    20    39     .    51    71    41    30  otal
      35     .    35     5    46     .     .    87   148    67    56    71     6    23  oteey
      34    43    13     5    11    52     .    60    19    33    37   125    86    27  okar
      33     .    49   126    34    26     .    33    19    11    11    17     3     .  sho
      31     .     .     .     .    13     .     .     .     .    51    17     3    86  lchedy
      30    10    20   136    46     .     .    20     .     .     .     .    10     1  cthy
      30    53    91    55    57     .     .    20     .    11     7    35    13    45  dol
      30     .     .     5     .    26     .     .     .    11    48    35    34    59  okain
       .     .     .     .     .     .     .     .     .     .     .     .     .     .  ...
     603   734   799   203   483  1085   918   851  3049  1888   611   843   255   393  ?

  Unknown sections:
  
     tot  unk1  unk2  unk3  unk4  unk5  unk6  unk7  unk8  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
     236   328   642   212   264   175   102   154     .  daiin
     146    46     .     .   231   204   163   180     .  ol
     137     .     .   212   165   116   143    51     .  chedy
     123     .     .     .    33   204   143   413     .  aiin
     116     .     .   212   198    58    40    77     .  shedy
     107   375   214     .     .    58   102    25     .  chol
      97   140     .     .     .   146   265   232     .  or
      96    46     .     .    99     .   204   258     .  ar
      93    93    71     .    33    29   102   103     .  chey
      90    93    71     .   132   175   224    51     .  dar
      81     .     .     .    33    29    40     .     .  qokeedy
      81     .     .     .    33    29    40    25     .  qokeey
      76   140     .     .    66   146    61   103     .  shey
      74     .   285   425    33    58    61    77     .  dy
      72     .     .     .    33    58    20   103     .  al
      71     .     .     .     .    29   245    25     .  qokaiin
      71     .     .     .    33    87     .     .     .  qokedy
      70    93    71     .   165   116   102    51     .  dal
      64    93    71     .     .    58     .     .     .  s
      58     .     .     .     .     .    20     .     .  qokain
      56   140   357     .     .    87    20    25     .  chor
      56    46     .     .    33    87    61    25     .  okaiin
      52     .     .     .    66    29    61    25     .  qokal
      50   140   214     .    66     .    40     .     .  shol
      48   281   142     .     .     .     .     .     .  dain
      46     .    71     .     .    87     .    51     .  cheey
      45     .    71     .    33    29    20     .     .  cheol
      44     .     .     .     .     .     .     .     .  okeey
      42     .     .     .     .     .     .     .     .  qol
      40    46    71     .     .    29    20     .     .  chy
      40     .     .     .     .    29    40   154     .  otaiin
      40     .     .     .     .   175     .     .     .  otedy
      40     .     .     .    66    58   224   154     .  qokar
      39    46     .     .    99    58    61    51     .  chdy
      39     .     .     .   165     .    61    25     .  qoky
      39     .     .     .     .    58    20     .     .  saiin
      39     .     .     .     .     .    81     .     .  sheey
      37     .     .     .    33     .     .    25     .  chckhy
      37     .     .     .    33    29    20    25     .  okal
      37     .     .     .     .   116   143    77     .  otar
      37   187     .   212     .     .     .    25     .  y
      36     .     .     .     .    58   102   180     .  otal
      35    46     .     .    33     .     .    25     .  oteey
      34     .     .     .    33    58   102    51     .  okar
      33    93   214     .     .     .     .     .     .  sho
      31     .     .     .    33     .     .     .     .  lchedy
      30   140    71     .     .     .     .    25     .  cthy
      30     .    71     .    66    58     .     .     .  dol
      30    46     .     .     .     .     .     .     .  okain
       .     .     .     .     .     .     .     .     .  ...
     603   516   285   638   198   964  1165   775     .  ?


  Frequencies of word classes (EQV/wfreqs/subsecs/all.cmp-frq) 
  in each subsection (× 9999), minus the "unk"s:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
     353    97    35   190    34   317    54    46   158   339   494   502   255   574  otoin~
     294   669   406   675   253   119   108   167    89   169   153   233   261   188  doin~
     261   194   217    77   161   185     .   234   316   373   258   215   120   563  ol~
     255    21     .     1     .    13     .    33    89   203   289   215   317   686  chedo~
     242   345   231   203   126   529   216   214   128   124   201   359   213   342  otol~
     212   205   350    55   161   105    54   248   267   192   367   125    99   200  oteeo~
     201   215   280   139   172    39     .   227   148    22   244    71   106   284  cheo~
     200   107   189    97   138   357   108   274   237   588   202   287   344   143  or~
     194   151    91   116   115   317    54   194   118   226   192   466   330   172  otor~
     179    10     .     .     .     .     .    20     9   180   170   125   410   464  otedo~
     175   161   280   441   207   264     .   147    79    67   114    89    68    68  chol~
     168     .     6     1    11     .     .    26    59    79   269    89   141   386  oteedo~
     160   237   147    40    92   145   108   140   277   328   228   179   241   103  oin~
     142    64    63   221   115   132   108   120   118   113    85   341   134   181  oto~
     120    97    91   399   103    66     .   160    79    45    38    89    65    32  cho~
     110   151    98   336    92   145     .    46     9    56    61    89    48    21  chor~
     106   151   119   142    69    66    54   154    79   135    49   125   168   112  dor~
     101   334   140   115    92   158   108   160    79    33    35    71    96   144  dol~
      99    21    28   329   115    13     .    80    19    45    49   125    48    21  otcho~
      91   118   154    46    46    39     .    87    49    67   133    17    41   117  cheeo~
      87   237   217    39   195   158     .    60    79     .   102     .    37    98  cheol~
      85    32    13    66    11     .     .    20     9    33    87   197   130   155  chctho~
      83   107   168    24    34   145   162   140   178    45    88     .    79   115  oteo~
      79    53    91   165    11    66     .   134    89    56    12    71   130    81  do~
      74     .     .     1     .    13     .     6    19   101   142    71    82    95  otchedo~
      64    53    20    69    34   105  1189   140   148   101    37    71    55    55  o~
      64   183    84   133   357    79   162    67    89    56    10    53    72    21  s~
      60   107    49    49   115    39    54    20     9    33    60    17    58   101  soin~
      59   237   112    75   115    66     .    60    49    11    48    17    24    48  cheor~
      59    10    42    80    46    13    54    53    49   113    68    53    37    39  otcheo~
      54    43    35   214   138    39     .    20     9     .     4     .    34     4  ctho~
      54   140   385    33    80    66     .    67    49    45    52    17    44    13  oteol~
      52    10     .    11     .    13     .    46    19    90    49   376   168    43  chdo~
      51    10     6     4     .     .     .    40     .    79    62   161   155    42  otchdo~
      49    21     6    49    11    39     .     .    19    11    66   161    41    45  toin~
      46    32    70    53    57   132     .    46    29    22    29     .    34    62  tol~
      43    32    56    27    46    79     .     6     .    11    38   161   110    33  tor~
      42     .     .     .     .    13     .     .     .     .    64    35     3   121  lchedo~
      42    53    49    65    11    39     .    26     9   169    44    71    41     4  odoin~
      42    43     .    49    46    52     .     6    19    56    49    17    99    13  otod~
      40    75    42    56   161    13     .    33     .    22    34    89    68     1  chodo~
      38    21     6     .     .     .     .     .     9    22    49    17    30    99  cheedo~
      37    32    28    20    22    52   108    46    19     .    21    35   113    51  cheto~
      36   172    70    11    57    13     .    67    39    56    44    35    65     .  cheodo~
      36    97    35    45    46    39    54   154    29    33    29     .    30     5  doir~
      35   151   126     4    57    13     .   120   128    33    24     .    61     .  oteodo~
      34    10     .     4     .     .     .     .     .    11    34     .    48   106  chectho~
      34    43    28   131    80    13     .    33     9    22     8    17     3     2  otchol~
      34    53    70    24   103     .     .    46     9     .    12    17     3    93  sol~
       .     .     .     .     .     .     .     .     .     .     .     .     .     .  ...
     603   734   799   203   483  1085   918   851  3049  1888   611   843   255   393  ?~

  Unknown sections:
  
     tot  unk1  unk2  unk3  unk4  unk5  unk6  unk7  unk8  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
     353   234     .     .    66   175   613   620     .  otoin~
     294   657   785   212   264   175   102   154     .  doin~
     261    46     .     .   264   263   183   284     .  ol~
     255     .     .   425   364   175   183   129     .  chedo~
     242   140    71   212   231   263   388   490     .  otol~
     212    46     .   212    66    58    61   180     .  oteeo~
     201   281   285     .   132   175   163   206     .  cheo~
     200   187     .     .    99   146   470   490     .  or~
     194     .     .   212   165   467   633   671     .  otor~
     179     .     .     .   198   380    61    51     .  otedo~
     175   516   428     .    66    58   163    25     .  chol~
     168     .     .     .   132    58    81    25     .  oteedo~
     160    46     .     .    33   204   163   413     .  oin~
     142    93     .   212   231    29   183   258     .  oto~
     120   187   571     .    66    29    20    51     .  cho~
     110   187   357     .    66   116    81   103     .  chor~
     106   140    71     .   231   175   224    51     .  dor~
     101    93   142     .   231   175   102    51     .  dol~
      99    46   285     .   132   116    81   103     .  otcho~
      91     .    71     .    33    87    81    51     .  cheeo~
      87     .    71     .   198    58    81    25     .  cheol~
      85     .   214   212   132    29     .    51     .  chctho~
      83     .     .   425    99     .     .    25     .  oteo~
      79     .   285   425    33    58    61    77     .  do~
      74     .     .   212   165   204   102     .     .  otchedo~
      64   234     .   425     .     .     .    25     .  o~
      64    93    71     .     .    58     .     .     .  s~
      60     .     .     .     .    58    20     .     .  soin~
      59    93    71     .    99     .    61     .     .  cheor~
      59    93     .   425    66    87    81    77     .  otcheo~
      54   234    71   212     .     .    40    25     .  ctho~
      54    46    71     .     .    29     .     .     .  oteol~
      52    46     .     .   165    87    81    77     .  chdo~
      51     .     .     .    66   438   122    25     .  otchdo~
      49    93     .     .     .     .   183   154     .  toin~
      46    46    71   212    33     .    61   103     .  tol~
      43   140     .     .     .    58   102   154     .  tor~
      42     .     .     .    99     .     .     .     .  lchedo~
      42   140     .     .    33    58    40   103     .  odoin~
      42     .     .     .    66    29    40   180     .  otod~
      40   140     .     .   132    29    20   154     .  chodo~
      38     .     .     .    66   116     .     .     .  cheedo~
      37    46    71   212   132     .     .    51     .  cheto~
      36     .     .   425    99    29    20    77     .  cheodo~
      36     .     .     .   132    87    40     .     .  doir~
      35     .     .   212   231   116    20    51     .  oteodo~
      34     .     .     .    33     .    20     .     .  chectho~
      34    46     .     .     .    29     .    25     .  otchol~
      34     .    71     .     .     .     .    51     .  sol~
       .     .     .     .     .     .     .     .     .  ...
     603   516   285   638   198   964  1165   775     .  ?~


III. CLASSIFYING THE WORDS

  Let's manually sort the words according to their relative 
  frequencies over the subsections.  We will exclude the "unk"
  subsections for clarity.
  
    Words with fairly uniform frequencies:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
     236   572   364   543   230   119   108   140    89   158   120   215   230   120  daiin
     146   129   175    65   126    52     .    80    29   260   101   197    75   352  ol
     123   172   119    29    46   132   108   120   277   271   173   179   199    54  aiin
      97    97   175    66   115    52     .    53    19   361    59   215   192    98  or
      93   107    84    75    22    13     .    80    59    11   115    53    58   137  chey
      90   140    63    90    69    66    54   154    79   135    46   107   155   105  dar
      76    21   126    45    57    13     .    80    49     .    80    17    24   144  shey

    Words that are almost specific to herbal-A:
    
     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
      50    21    49   133    80    66     .    33    39    33    23    17    37    26  shol
      48    86    28   104    22     .     .    20     .    11    27    17    27    65  dain
      30    10    20   136    46     .     .    20     .     .     .     .    10     1  cthy
      40    43     6   138    11    26     .    60    39    22    10    53    24    13  chy
      33     .    49   126    34    26     .    33    19    11    11    17     3     .  sho

    Words almost specific to Pharma:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
      45   151   161    26    80    39     .    26    29     .    51     .    20    45  cheol
 
    Words almost specific to language A:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
     107   140   231   297   126   105     .    93    19    33    63    71    30    21  chol
      64   183    84   133   357    79   162    67    89    56    10    53    72    21  s
      56   118    77   199    80    26     .    33     .     .    16    17    20     1  chor

    Words more common in language A:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----


    Words almost specific to herbal-B:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
      39    10     .     8     .    13     .    26    19    67    37   323   120    33  chdy
      40    32    13     4     .    92    54     .     .    11    38   107    75    65  qokar
      34    43    13     5    11    52     .    60    19    33    37   125    86    27  okar
      39    21    28    34    22    52     .     .     .    11    25   107    17    89  qoky
      37    21     .     7     .    52     .    33    59    79    50   107    51    35  otar
   
    Words almost specific to the Biological section:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
      42    32     6     2     .     .     .     .     .     .    28     .     3   175  qol
   
    Words almost specific to language B:
   
     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
     137    10     .     1     .     .     .    26    39    90   179   143   203   319  chedy
     116    10     .     .     .    13     .     6    49   113   109    71   110   367  shedy
      81     .     .     .     .     .     .     .     .    11   126    17    34   224  qokeedy
      71    10     .     .     .     .     .     .     .    22    54    17   134   238  qokedy
      58     .     .     1     .    13     .     .     .    11    56    17    10   219  qokain
      31     .     .     .     .    13     .     .     .     .    51    17     3    86  lchedy
      40     .     .     .     .     .     .     6     9   124    56    35    68    71  otedy


    Words more common in language B:
   
     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
      71    10     6    23     .    79     .     .     .    45   109   125    44   125  qokaiin
      56    32     6    37    11    13     .    20    39    22    88   107    93    48  okaiin
      37    21    13    21     .     .     .     6     .    11    36   107   103    61  chckhy
      30     .     .     5     .    26     .     .     .    11    48    35    34    59  okain

    Words more common in the Cosmo section:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
      37    10     .    49    11    79   486    80    69    33    19    71    48    33  y

    Words more common in the Stars section:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
      96    10    13    21    22   291   108   221   217   203   136    71   151    35  ar


    Words more common in the Zodiac section:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
      72    32    35     8    34   132     .   154   287   113   127    17    37    35  al
      35     .    35     5    46     .     .    87   148    67    56    71     6    23  oteey

    Words more common in the Stars-1 section:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
      36    21    13    13    11   119    54    20    39     .    51    71    41    30  otal


    Words with peculiar distributions:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
      74    53    84   148    11    66     .   127    89    45    11    71   124    79  dy
      70   280    49    58    34   158   108   140    79    22    27    35    82    99  dal
      81   107    77    14    22    79     .    33     9    22   146    53    20   125  qokeey
      52    10    13     2     .   132     .    46     .    33    38    89    17   163  qokal
      46    53    84    20    22     .     .    40    49    11    69    17    13    61  cheey
      44    43   140    10    11    13    54    60    39     .    87     .    27    24  okeey
      40    10    13    39     .    26     .     6    89    45    69    53    20    17  otaiin
      39    86    13    30   103    39    54     6     9    11    38    17    58    54  saiin
      39    64    63    26    22    39     .    40     .    56    46     .    20    55  sheey
      37    21    35    20    11   132   108    26    39    67    38   125    61    32  okal
      30    53    91    55    57     .     .    20     .    11     7    35    13    45  dol

  There seems to be no simple pattern for the differences, except that
  words ending with <edy> are almost specific to language B.
  
  The word <chdy> seems specific to herbal-B (it is much rarer in the
  other language-B subsections), and <chy>, <sho>, <cthy> to herbal-A
  (they are much rarer in Pharma).
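
  The manual sorting above could be partly automated by measuring how
  much of a word's frequency falls within a chosen group of
  subsections.  The Python sketch below is only a suggestion, not a
  method used in this note: the names parse_row and concentration are
  hypothetical, and any cutoff for calling a word "almost specific"
  would have to be chosen by hand.  It sums the scaled relative
  frequencies of the table rows, so every subsection carries the same
  weight regardless of its size.

```python
COLUMNS = ["pha1", "pha2", "hea1", "hea2", "str1", "cos1", "cos2",
           "zod1", "cos3", "str2", "heb2", "heb1", "bio1"]

def parse_row(line):
    """Parse one table row into (word, {column: scaled frequency})."""
    fields = line.split()
    word = fields[-1]
    # fields[0] is the "tot" column; "." stands for zero.
    values = [0 if f == "." else int(f) for f in fields[1:-1]]
    return word, dict(zip(COLUMNS, values))

def concentration(row, group):
    """Fraction of the row's summed frequency that lies in the given columns."""
    total = sum(row.values())
    return sum(row[c] for c in group) / total if total else 0.0

if __name__ == "__main__":
    rows = [
        "42    32     6     2     .     .     .     .     .     .    28     .     3   175  qol",
        "39    10     .     8     .    13     .    26    19    67    37   323   120    33  chdy",
    ]
    for line in rows:
        word, row = parse_row(line)
        print(word,
              "bio: %.2f" % concentration(row, ["bio1"]),
              "heb: %.2f" % concentration(row, ["heb1", "heb2"]))
```

  With the two rows shown, about 71% of the summed frequency of <qol>
  falls in bio1, and about 68% of that of <chdy> falls in heb1 and heb2.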

  Let's now do the same with the word classes:

    Classes with fairly uniform distribution:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----

    Classes almost exclusive of language A:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----

    Classes more common in language A:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----

    Classes almost exclusive of language B:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----

    Classes more common in language B:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----

    Classes more common in the Cosmo section:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----

    Classes more common in the Zodiac section:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----

    Classes to sort:

     tot  pha1  pha2  hea1  hea2  str1  cos1  cos2  zod1  cos3  str2  heb2  heb1  bio1  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
     353    97    35   190    34   317    54    46   158   339   494   502   255   574  otoin~
     294   669   406   675   253   119   108   167    89   169   153   233   261   188  doin~
     261   194   217    77   161   185     .   234   316   373   258   215   120   563  ol~
     255    21     .     1     .    13     .    33    89   203   289   215   317   686  chedo~
     242   345   231   203   126   529   216   214   128   124   201   359   213   342  otol~
     212   205   350    55   161   105    54   248   267   192   367   125    99   200  oteeo~
     201   215   280   139   172    39     .   227   148    22   244    71   106   284  cheo~
     200   107   189    97   138   357   108   274   237   588   202   287   344   143  or~
     194   151    91   116   115   317    54   194   118   226   192   466   330   172  otor~
     179    10     .     .     .     .     .    20     9   180   170   125   410   464  otedo~
     175   161   280   441   207   264     .   147    79    67   114    89    68    68  chol~
     168     .     6     1    11     .     .    26    59    79   269    89   141   386  oteedo~
     160   237   147    40    92   145   108   140   277   328   228   179   241   103  oin~
     142    64    63   221   115   132   108   120   118   113    85   341   134   181  oto~
     120    97    91   399   103    66     .   160    79    45    38    89    65    32  cho~
     110   151    98   336    92   145     .    46     9    56    61    89    48    21  chor~
     106   151   119   142    69    66    54   154    79   135    49   125   168   112  dor~
     101   334   140   115    92   158   108   160    79    33    35    71    96   144  dol~
      99    21    28   329   115    13     .    80    19    45    49   125    48    21  otcho~
      91   118   154    46    46    39     .    87    49    67   133    17    41   117  cheeo~
      87   237   217    39   195   158     .    60    79     .   102     .    37    98  cheol~
      85    32    13    66    11     .     .    20     9    33    87   197   130   155  chctho~
      83   107   168    24    34   145   162   140   178    45    88     .    79   115  oteo~
      79    53    91   165    11    66     .   134    89    56    12    71   130    81  do~
      74     .     .     1     .    13     .     6    19   101   142    71    82    95  otchedo~
      64    53    20    69    34   105  1189   140   148   101    37    71    55    55  o~
      64   183    84   133   357    79   162    67    89    56    10    53    72    21  s~
      60   107    49    49   115    39    54    20     9    33    60    17    58   101  soin~
      59   237   112    75   115    66     .    60    49    11    48    17    24    48  cheor~
      59    10    42    80    46    13    54    53    49   113    68    53    37    39  otcheo~
      54    43    35   214   138    39     .    20     9     .     4     .    34     4  ctho~
      54   140   385    33    80    66     .    67    49    45    52    17    44    13  oteol~
      52    10     .    11     .    13     .    46    19    90    49   376   168    43  chdo~
      51    10     6     4     .     .     .    40     .    79    62   161   155    42  otchdo~
      49    21     6    49    11    39     .     .    19    11    66   161    41    45  toin~
      46    32    70    53    57   132     .    46    29    22    29     .    34    62  tol~
      43    32    56    27    46    79     .     6     .    11    38   161   110    33  tor~
      42     .     .     .     .    13     .     .     .     .    64    35     3   121  lchedo~
      42    53    49    65    11    39     .    26     9   169    44    71    41     4  odoin~
      42    43     .    49    46    52     .     6    19    56    49    17    99    13  otod~
      40    75    42    56   161    13     .    33     .    22    34    89    68     1  chodo~
      38    21     6     .     .     .     .     .     9    22    49    17    30    99  cheedo~
      37    32    28    20    22    52   108    46    19     .    21    35   113    51  cheto~
      36   172    70    11    57    13     .    67    39    56    44    35    65     .  cheodo~
      36    97    35    45    46    39    54   154    29    33    29     .    30     5  doir~
      35   151   126     4    57    13     .   120   128    33    24     .    61     .  oteodo~
      34    10     .     4     .     .     .     .     .    11    34     .    48   106  chectho~
      34    43    28   131    80    13     .    33     9    22     8    17     3     2  otchol~
      34    53    70    24   103     .     .    46     9     .    12    17     3    93  sol~
       .     .     .     .     .     .     .     .     .     .     .     .     .     .  ...
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
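
  A first rough pass at sorting the classes above into the categories
  listed earlier could be automated. The sketch below is only an
  illustration: the function name classify_classes, the label names and
  the thresholds (10 and 2) are all arbitrary choices of mine, and the
  caller must say which columns count as language A and which as
  language B. Treating pha*/hea* as A and heb*/bio1 as B in the usage
  line is my reading of the column names, not something the table
  states.

```shell
# classify_classes A_COLS B_COLS < table
# A_COLS and B_COLS are comma-separated column names taken from the
# header line (the line that starts with "tot").
# For each data row, averages the A and B columns ("." counts as 0) and
# prints: LABEL MEAN_A MEAN_B WORD.
classify_classes() {
  awk -v acols="$1" -v bcols="$2" '
    BEGIN { na = split(acols, A, ","); nb = split(bcols, B, ",") }
    # Header row: remember the position of each named column.
    $1 == "tot" { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Skip rulers, blank lines, and anything seen before a header.
    $1 !~ /^([0-9]+|\.)$/ || !("tot" in col) { next }
    {
      a = 0; for (i = 1; i <= na; i++) a += $(col[A[i]]) + 0
      b = 0; for (i = 1; i <= nb; i++) b += $(col[B[i]]) + 0
      a /= na; b /= nb
      if (a == 0 && b == 0) next          # e.g. the "..." filler row
      if      (a >= 10 * b) label = "A-exclusive"
      else if (b >= 10 * a) label = "B-exclusive"
      else if (a >= 2 * b)  label = "A-leaning"
      else if (b >= 2 * a)  label = "B-leaning"
      else                  label = "mixed"
      printf "%-12s %6.1f %6.1f  %s\n", label, a, b, $NF
    }
  '
}
```

  Usage, with table.txt a hypothetical saved copy of the table:
  "classify_classes pha1,pha2,hea1,hea2 heb1,heb2,bio1 < table.txt".
  The output is meant as a starting point for the manual sorting, not
  as a replacement for it.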

1999-07-26 stolfi
=================

IV. TABULATING WORD FREQUENCIES PER SECTION

    cat text-secs/all.names
    
  (Edited ${sectags} manually to match presumed writing order.)
  
    set sectags = ( \
      pha \
      hea \
      str \
      cos \
      zod \
      heb \
      bio \
      unk \
    )
    
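  Check that the hand-edited list still names the same sections as
  the data directory. Assuming all.names is sorted, diff prints
  nothing when the two lists agree:

    echo $sectags | tr ' ' '\012' | sort > .foo
    diff RAW/wfreqs/secs/all.names .foo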

    foreach etag ( RAW EQV )
      tabulate-frequencies \
        -dir ${etag}/wfreqs/secs \
        -title "word" \
        tot ${sectags}
    end
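
  The tabulate-frequencies script itself is not listed in this note.
  As a rough idea of what such a tabulation involves, here is a minimal
  sketch under stated assumptions. The function name tabulate_sketch,
  the per-section input files DIR/TAG.cnt (one "COUNT WORD" pair per
  line) and the scaling to occurrences per 10000 tokens are all
  hypothetical choices for illustration. They are not taken from the
  real script, and its file layout and units may differ.

```shell
# tabulate_sketch DIR TAG...
# Reads DIR/TAG.cnt for each TAG (hypothetical format: "COUNT WORD")
# and prints one row per word with its frequency per 10000 tokens in
# each section, "." where the word is absent, sorted by decreasing
# frequency in the first TAG.
tabulate_sketch() {
  dir="$1"; shift
  for tag in "$@"; do
    # Pass 1: turn raw counts into "TAG WORD FREQ" triples.
    awk -v tag="$tag" '
      { cnt[$2] = $1; tot += $1 }
      END {
        for (w in cnt)
          printf "%s %s %d\n", tag, w, int(10000 * cnt[w] / tot + 0.5)
      }
    ' "$dir/$tag.cnt"
  done | awk -v tags="$*" '
    # Pass 2: pivot the triples into one row per word.
    BEGIN { n = split(tags, t, " ") }
    { f[$1 SUBSEP $2] = $3; words[$2] = 1 }
    END {
      for (w in words) {
        line = ""
        for (i = 1; i <= n; i++) {
          key = t[i] SUBSEP w
          v = (key in f) ? f[key] : "."
          line = line sprintf("%6s", v)
        }
        print line "  " w
      }
    }
  ' | sort -k1,1nr
}
```

  A call such as "tabulate_sketch some-dir tot pha hea" would print
  rows in the same general shape as the table below, without the
  header lines.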

     tot   pha   hea   str   cos   zod   heb   bio   unk  word
    ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
     603   773   235   642  1214  3049   349   393   749  ?
     236   446   508   120   144    89   228   120   218  daiin
     146   157    72    98   136    29    95   352   156  ol
     137     4     1   167    46    39   193   319    98  chedy
     123   140    30   170   171   277   196    54   161  aiin
     116     4     .   103    42    49   104   367    72  shedy
     107   195   277    65    66    19    37    21    98  chol
      97   144    72    58   156    19   196    98   156  or
      96    12    21   146   206   217   138    35   124  ar
      93    93    69   108    50    59    57   137    72  chey
      90    93    87    47   140    79   147   105   135  dar
      81     .     .   117     3     .    31   224    20  qokeedy
      81    89    15   142    27     9    25   125    25  qokeey
      76    84    46    76    46    49    23   144    88  shey
      74    72   133    14    89    89   115    79    78  dy
      72    33    11   127   128   287    34    35    41  al
      71     8    20   107    15     .    57   125    72  qokaiin
      71     4     .    51     7     .   115   238    20  qokedy
      70   140    55    36    97    79    75    99    98  dal
      64   123   158    14    70    89    69    21    25  s
      58     .     1    53     3     .    11   219     5  qokain
      56    93   186    17    19     .    20     1    67  chor
      56    16    34    83    19    39    95    48    46  okaiin
      52    12     2    45    39     .    28   163    36  qokal
      50    38   127    25    31    39    34    26    52  shol
      48    50    95    25    15     .    25    65    41  dain
      46    72    20    65    27    49    14    61    31  cheey
      45   157    32    51    15    29    17    45    20  cheol
      44   101    10    82    39    39    23    24     .  okeey
      42    16     2    26     .     .     2   175     .  qol
      40    21   124    11    42    39    28    13    20  chy
      40    12    34    66    19    89    25    17    46  otaiin
      40     .     .    52    46     9    63    71    31  otedy
      40    21     3    42     7     .    80    65   109  qokar
      39     4     7    35    39    19   153    33    57  chdy
      39    25    33    27     3     .    31    89    46  qoky
      39    42    38    38    11     9    52    54    15  saiin
      39    63    25    45    42     .    17    55    20  sheey
      37    16    19    33     7     .   104    61    10  chckhy
      37    29    19    45    46    39    72    32    20  okal
      37     8     6    50    46    59    60    35    72  otar
      37     4    45    23    93    69    52    33    31  y
      36    16    12    55    15    39    46    30    72  otal
      35    21    10    52    74   148    17    23    15  oteey
      34    25     6    38    46    19    92    27    52  okar
      33    29   116    12    23    19     5     .    25  sho
      31     .     .    48     .     .     5    86     5  lchedy
      30    16   126     .    11     .     8     1    25  cthy
      30    76    55     6    15     .    17    45    25  dol
      30     .     5    46     3     .    34    59     5  okain
       .     .     .     .     .     .     .     .     .  ...