Last edited on 1999-01-16 07:06:57 by stolfi

Must get this to work:

  We assume that the only "possible" patterns are those that occur
  in the whole herbal file. Let N be their number.

  Next we estimate the probability of occurrence of each word pattern
  in each "extrapolated" page, i.e. assuming that the page's text
  is a random sample of some infinite language peculiar to that page.
  If a page p contains M words and a certain word w occurs m times,
  we estimate its frequency on that page as the ratio

    Prob(w|p) = (m + 1)/(M + N)

  so that a pattern that does not occur on the page (m = 0) still
  gets a small nonzero probability, and the N estimates for a page
  add up to 1.

  From this we can compute the information about the page number
  given by the occurrences of each word pattern. That is, suppose we
  pick a word WRD at random in an extrapolated page, and it comes out
  to be "w"; how much information do we have about the page number PNUM?
  We measure this by the entropy of PNUM given that WRD = w,

    H(w) = sum { - G_{p,w} log_2(G_{p,w}) : p in PNUMS }

  where

    G_{p,w} = Prob(PNUM = p | WRD = w)

                    Prob(WRD = w | PNUM = p) Prob(PNUM = p)
            = ---------------------------------------------------------------
              sum { Prob(WRD = w | PNUM = q) Prob(PNUM = q) : q in PNUMS }

  and Prob(WRD = w | PNUM = p) is the estimate Prob(w|p) above.
  Assuming all 127 pages are equally likely,

                         Prob(WRD = w | PNUM = p)
    G_{p,w} = ------------------------------------------------
               sum { Prob(WRD = w | PNUM = q) : q in PNUMS }

  Note that H(w) is the uncertainty about PNUM that remains after
  seeing w: it ranges from 0 (w pins down the page) to log_2(127),
  about 6.99 bits (w says nothing about the page), so a smaller H(w)
  means a more informative word.

  Now let's list for each word w the information H(w) and
  the most likely page p given that a randomly picked word is w:

    cat her-17.wfr \
      | compute-entropies \
      | sort +0 -1gr \
      > her-17.wen
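
  The estimate Prob(w|p) and the entropy H(w) above can be sketched in
  Python as follows. This is only an illustration of the formulas, not
  the compute-entropies script itself, whose source and input format
  (the .wfr file) are not shown here. The function names, the input
  layout (a dict from page name to list of words), and the toy pages
  and words are all assumptions made for the example.

```python
import math
from collections import Counter


def page_word_probs(pages, vocab):
    """Return probs[p][w] = (m + 1)/(M + N), the smoothed Prob(w|p).

    pages: dict mapping page name -> list of words on that page (assumed layout).
    vocab: the N word patterns that occur anywhere in the file.
    """
    n = len(vocab)
    probs = {}
    for p, words in pages.items():
        counts = Counter(words)   # m for each word on this page
        total = len(words)        # M for this page
        probs[p] = {w: (counts[w] + 1) / (total + n) for w in vocab}
    return probs


def word_entropies(pages):
    """Return {w: (H(w), most likely page)} assuming equally likely pages."""
    vocab = sorted({w for words in pages.values() for w in words})
    probs = page_word_probs(pages, vocab)
    result = {}
    for w in vocab:
        denom = sum(probs[q][w] for q in pages)
        # G_{p,w}: posterior probability of page p given the word w
        post = {p: probs[p][w] / denom for p in pages}
        h = -sum(g * math.log2(g) for g in post.values())
        best = max(post, key=post.get)
        result[w] = (h, best)
    return result


if __name__ == "__main__":
    # Hypothetical toy data: two pages, three word patterns.
    toy = {"p1": ["aaa", "bbb", "aaa"], "p2": ["bbb", "ccc"]}
    table = word_entropies(toy)
    # One line per word, in decreasing order of H(w).
    for w, (h, best) in sorted(table.items(), key=lambda kv: -kv[1][0]):
        print(f"{h:8.5f} {w} {best}")
```

  With only two pages H(w) cannot exceed log_2(2) = 1 bit; with the
  127 pages of the herbal file the upper bound would be log_2(127).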