Last edited on 1999-01-16 07:06:57 by stolfi

Must get this to work:

  We assume that the only "possible" word patterns are those that occur
  somewhere in the whole herbal file.  Let N be the number of distinct
  such patterns.
  Next we estimate the probability of occurrence of each word pattern
  in each "extrapolated" page, i.e. assuming that the page's text is a
  random sample of some infinite language peculiar to that page. 
  If a page p contains
  M words and a certain word w occurs m times, we estimate its
  frequency on that page as the ratio
  
    Prob(w|p) = (m + 1)/(M + N)
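
  A minimal sketch of this estimate in Python, for illustration only
  (it is not one of the original scripts; the function name and the
  toy word list are made up).  It counts the words of one page and
  applies the (m + 1)/(M + N) formula to each of the N possible
  patterns:

```python
from collections import Counter

def page_word_probs(page_words, patterns):
    """Estimate Prob(w|p) = (m + 1)/(M + N) for every pattern w.

    page_words: list of the M words found on page p.
    patterns:   the N distinct patterns occurring in the whole file
                (assumed to include every word of page_words).
    """
    counts = Counter(page_words)   # m for each word on the page
    M = len(page_words)
    N = len(patterns)
    return {w: (counts[w] + 1) / (M + N) for w in patterns}

# Toy data (hypothetical, not taken from the herbal file).
patterns = ["aaa", "bbb", "ccc", "ddd"]
probs = page_word_probs(["aaa", "bbb", "aaa"], patterns)
```

  Note that a pattern which never occurs on the page still gets the
  nonzero estimate 1/(M + N), and that the estimates for all N
  patterns add up to 1.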
  
  From this we can compute how much the occurrence of each word pattern
  tells us about the page number.  That is, suppose we pick a word WRD at
  random in an extrapolated page, and it comes out to be "w"; how much 
  uncertainty remains about the page number PNUM?  The answer is the
  entropy, in bits, of the distribution of PNUM given WRD = w:
  
    H(w) = sum { - G_{p,w} log_2(G_{p,w}) : p in PNUMS }
    
  where 
  
    G_{p,w} = Prob(PNUM=p | WRD = w)
    
                       Prob(WRD = w | PNUM = p) Prob(PNUM = p)
            =  ---------------------------------------------------------------
                 sum { Prob(WRD = w | PNUM = q) Prob(PNUM = q) : q in PNUMS }
                 
  Assuming all 127 pages are equally likely, the factor Prob(PNUM = p) = 1/127
  cancels, leaving

                       Prob(WRD = w | PNUM = p)
    G_{p,w} =  ------------------------------------------------
                 sum { Prob(WRD = w | PNUM = q) : q in PNUMS }
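
  The two formulas above can be sketched in Python as follows.  This
  is only an illustration under the equal-prior assumption; the
  function name, the data layout (a dictionary of per-page probability
  tables) and the toy numbers are all hypothetical, and it is not
  meant to reproduce the compute-entropies script:

```python
from math import log2

def posterior_and_entropy(probs_by_page, w):
    """Return (G, H) for word w, assuming all pages are equally likely.

    probs_by_page: dict mapping page id p -> dict of Prob(WRD = w | PNUM = p).
    G: dict mapping page id p -> G_{p,w}.
    H: entropy of G in bits.
    """
    total = sum(probs[w] for probs in probs_by_page.values())
    G = {p: probs[w] / total for p, probs in probs_by_page.items()}
    # Terms with g == 0 contribute nothing to the sum.
    H = sum(-g * log2(g) for g in G.values() if g > 0)
    return G, H

# Toy example with two pages (hypothetical numbers).
probs_by_page = {
    "p1": {"aaa": 0.75, "bbb": 0.25},
    "p2": {"aaa": 0.25, "bbb": 0.75},
}
G, H = posterior_and_entropy(probs_by_page, "aaa")
best_page = max(G, key=G.get)   # most likely page given WRD = "aaa"
```

  H(w) is 0 when the word pins down the page exactly, and reaches its
  maximum log_2(127), just under 7 bits, when the word is equally
  probable on all 127 pages.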
  
  Now let's list for each word w the entropy H(w) (the lower it is, the
  more w tells us about the page) and the most likely page p given that
  a randomly picked word is w:
  
    cat her-17.wfr \
      | compute-entropies \
      | sort -k1,1gr \
      > her-17.wen