Last edited on 1999-01-16 00:54:08 by stolfi

98-01-25 stolfi
===============
  
  I will start from the raw concordance I created before for the word
  occurrence map (Notes/010).  Recall that its format was
  
    PNUM FNUM UNIT LINE TRANS START LENGTH POS STRING OBS
    1    2    3    4    5     6     7      8   9      10 

  where

    PNUM       is a sequential page number, "001" to "234".
    
    FNUM       is the corresponding folio-based page number, 
               "f1r" thru "f86v5" to "f106v"

    UNIT       is the code of a text unit within the page, e.g. "P" or "R1"
    
    LINE       is the code of a line within that unit, e.g. "27" or "10a"

    TRANS      is a letter identifying a transcriber, e.g. "F" for Friedman

    START      is the index of the first byte of the occurrence
               in the text line (counting from 1).

    LENGTH     is the original length of the occurrence in the text, 
               including fillers, comments, spaces, etc.

    POS        is a number giving the approximate position
               of the occurrence within the whole text; used
               for sorting, etc.

    STRING     is the non-empty string in question, without any fillers,
               comments, non-significant spaces, line breaks, etc.
    
    OBS        is an arbitrary non-empty string, without embedded blanks.
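
  As a sanity check on this format, here is a minimal sketch of a
  filter that verifies each record has the ten blank-separated fields
  listed above, a three-digit PNUM, a START of at least 1, and a
  numeric LENGTH.  The name "roc_check" is hypothetical (it is not one
  of the existing scripts), and plain awk is used so that gawk runs it
  unchanged.

```shell
# Hypothetical helper: validate concordance records read from stdin.
# Fields: PNUM FNUM UNIT LINE TRANS START LENGTH POS STRING OBS
roc_check() {
  awk '
    NF != 10 {
      printf "line %d: expected 10 fields, got %d\n", NR, NF
      bad++; next
    }
    $1 !~ /^[0-9][0-9][0-9]$/ {
      printf "line %d: PNUM \"%s\" is not three digits\n", NR, $1
      bad++
    }
    $6 !~ /^[0-9]+$/ || $6 + 0 < 1 {
      printf "line %d: START \"%s\" should be an integer >= 1\n", NR, $6
      bad++
    }
    $7 !~ /^[0-9]+$/ {
      printf "line %d: LENGTH \"%s\" should be an integer\n", NR, $7
      bad++
    }
    END { exit (bad > 0) }   # non-zero status if any record failed
  '
}
```

  It would be run as "cat ../010/vtx-f-eva-15.roc | roc_check"; it
  prints one message per problem and is silent on a clean file.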
    
  First, let's make a list of the herbal text units and pages:
  
    cat L16-eva/INDEX \
      | gawk -v FS=':' '($2=="herbal" && $5=="parags"){print}' \
      > her.index
      
    cat her.index \
      | gawk -v FS=':' '/./{print $1;}' \
      > her.units

    cat her.index \
      | gawk -v FS=':' '/./{print substr($6,2,3);}' \
      | sort | uniq \
      > her.pages
      
  
  Next, we select the records from herbal pages,
  and do some cleanup on the STRINGs: 
  
    remove every word-initial "q"
    change word-initial "y" to "o"
    change word-final "y" to "o"
    change "eeee" to "chch"
    change "eee" to "che"
    change "ee" to "ch"
    (optionally, if equatekt is set) change "t" into "k"
    (optionally, if equatepf is set) change "f" into "p"
    delete any interword spaces.
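
  The rules above can be sketched as a small shell function.  This is
  only an illustration, not the contents of word-equiv.gawk: the name
  "fix_word" is hypothetical, the rules are applied in the order
  listed, only one leading "q" is removed per word, and words inside a
  STRING are assumed to be separated by "." (the separator can be
  overridden with the fourth argument).

```shell
# Hypothetical sketch of the word cleanup.
# usage: fix_word STRING [equatekt] [equatepf] [separator]
fix_word() {
  # Put each word on its own line, clean it, then glue the words together.
  printf '%s\n' "$1" | tr "${4:-.}" '\n' \
    | awk -v equatekt="${2:-1}" -v equatepf="${3:-1}" '
        {
          w = $0
          sub(/^q/, "", w)          # remove word-initial "q"
          sub(/^y/, "o", w)         # word-initial "y" -> "o"
          sub(/y$/, "o", w)         # word-final "y" -> "o"
          gsub(/eeee/, "chch", w)   # longest runs of "e" first
          gsub(/eee/, "che", w)
          gsub(/ee/, "ch", w)
          if (equatekt) gsub(/t/, "k", w)   # optional t/k equivalence
          if (equatepf) gsub(/f/, "p", w)   # optional f/p equivalence
          out = out w               # concatenating drops interword separators
        }
        END { print out }
      '
}
```

  For example, "fix_word qokeey" prints "okcho", and "fix_word ytedy 0"
  (with the t/k equivalence turned off) prints "otedo".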
    
  Here it is:
  
    set equatekt = 1
    set equatepf = 1

    cat ../010/vtx-f-eva-15.roc \
      | gawk '/./ { print $1, $2, $9; }' \
      | select-herbal-pages \
      | fix-words -f word-equiv.gawk \
          -v field=3 \
          -v stripq=1 \
          -v equatekt=${equatekt} \
          -v equatepf=${equatepf} \
      > her-f-eva-15.roc
    dicio-wc her-f-eva-15.roc
      
     lines   words     bytes file        
    ------ ------- --------- ------------
     29506   88518    547377 her-f-eva-15.roc

  Next we reduce the data to occurrence counts per page:

    FREQ PNUM FNUM STRING 
    1    2    3    4

  where FREQ is the count of occurrences of STRING on page PNUM/FNUM.
  
    cat her-f-eva-15.roc \
      | sort +2 -3 +0 -1 | uniq -c | expand \
      > her-f-eva-15.frq
    dicio-wc her-f-eva-15.frq
     
     lines   words     bytes file        
    ------ ------- --------- ------------
     26397  105588    713624 her-f-eva-15.frq
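
  For reference, "sort +2 -3 +0 -1" is the old zero-based key syntax;
  it sorts by field 3 (STRING) and then by field 1 (PNUM).  Below is a
  sketch of the same counting step with the "-k" form, on a few
  made-up records.  The name "count_per_page" is hypothetical, and the
  final awk (used here in place of "expand") only normalizes the
  leading blanks that "uniq -c" emits.

```shell
# Hypothetical sketch: read "PNUM FNUM STRING" records on stdin,
# write "FREQ PNUM FNUM STRING" records on stdout.
count_per_page() {
  LC_ALL=C sort -k3,3 -k1,1 \
    | uniq -c \
    | awk '{ $1 = $1; print }'   # squeeze the padding added by uniq -c
}

# Made-up sample records, for illustration only.
printf '%s\n' \
  '001 f1r daiin' \
  '001 f1r okcho' \
  '001 f1r daiin' \
  '002 f1v daiin' \
  | count_per_page
```

  On the sample this prints "2 001 f1r daiin", "1 002 f1v daiin" and
  "1 001 f1r okcho", one per line.  Since each PNUM corresponds to a
  single FNUM, identical PNUM/STRING pairs always end up on adjacent
  identical lines, which is what "uniq -c" needs.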