Hacking at the Voynich manuscript - Side notes
067 Looking for words with lumpy distribution 

Last edited on 2004-09-26 21:50:36 by stolfi

  Many words in the VMS have a lumpy distribution.
  Let's identify them and see how they correlate.

LINK SETUP

  ln -s /home/staff/stolfi/voynich/work
  
VOYNICHESE SAMPLE

  Got some copy of the VMS text, majority version, one word per line.
  We map weirdos to "*", remove all #-comments and {}-comments,
  and clean up special characters:
  
    "." assumed to be a word break
    "," deleted
    "!" deleted
    "%" mapped to "*"
    "*" retained as letter
    "-" retained as separate word (line break or plant intrusion)
    "=" retained as separate word (parag break, end of label)
    
  The file is "voy-n.wds".
  
  (Someday should redo this with a more official version.) 
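
  The cleanup rules above could be sketched roughly as follows. This is
  only an illustration in Python, not the script actually used: the
  function name clean_line is made up, line locators are assumed to have
  been stripped already, and a "weirdo" is assumed to be any character
  other than a-z and the special characters listed above.

```python
import re

def clean_line(line):
    """Turn one transcription line into a list of word tokens (hypothetical helper)."""
    line = re.sub(r"\{[^}]*\}", "", line)          # drop inline {}-comments
    line = line.split("#", 1)[0]                   # drop #-comments
    line = line.replace(",", "").replace("!", "")  # "," and "!" are deleted
    line = line.replace("%", "*")                  # "%" is mapped to "*"
    # Assumption: anything outside a-z and the listed specials is a weirdo.
    line = re.sub(r"[^a-z*.\-=\s]", "*", line)
    # "-" and "=" become words of their own; "." is a plain word break.
    line = re.sub(r"([-=])", r" \1 ", line).replace(".", " ")
    return line.split()

if __name__ == "__main__":
    # One word per line, as in "voy-n.wds".
    for word in clean_line("daiin.chol,dy-qo%ky{plant}= # note"):
        print(word)
```

  The {}-comments are removed before the #-comments so that a "#" inside
  braces does not truncate the line.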
  
ENGLISH SAMPLE

  For controls, also got an English sample: H. G. Wells's "War of the
  Worlds", from my langbank project, directory "engl/wow". Mapped
  upper to lower case, removed all chapter and book titles and Kepler's
  quote. Mapped hyphens "-" to "~", dashes to "--" (as single word),
  line breaks to "-", and parag breaks to "=". Removed all punctuation
  except:
  
    "'" (possessive and elision marker) retained as letter
    "~" (hyphen) retained as letter
    "-" (line break) retained as word
    "=" (parag break) retained as word
    
  The file is "wow-n.wds".
  
SHELL VARIABLES

    set books = ( voy wow )
    set fmts = ( png )
    set kmax = 9
  
SCRAMBLED WORD LISTS

  For control experiments, we prepare another version of the word lists,
  where the words have been permuted in random order:
    
    foreach book ( ${books} )
      set nfile = "${book}-n.wds"
      set rfile = "${book}-r.wds"
      echo "${nfile} -> ${rfile}" 
      cat ${nfile} \
        | gawk \
            ' BEGIN{ srand(12345678); } \
              //{ printf "%17.15f %s\n", rand(), $1; } \
            ' \
        | sort -b -k1,1g \
        | gawk '//{ print $2; }' \
        > ${rfile}
    end
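
  The same decorate-sort-undecorate idea, sketched in Python for
  reference. The function name scramble is made up for this note, and
  Python's generator differs from gawk's, so the same seed will not
  reproduce the permutation stored in the "-r.wds" files.

```python
import random

def scramble(words, seed=12345678):
    """Return a random permutation of words, reproducible for a given seed."""
    rng = random.Random(seed)
    keyed = [(rng.random(), w) for w in words]  # attach a random key to each token
    keyed.sort(key=lambda kw: kw[0])            # sort on the key only
    return [w for _, w in keyed]                # strip the key again

if __name__ == "__main__":
    print(scramble(["a", "b", "c", "a", "-", "="]))
```

  Token counts are unchanged by the scrambling; only the positions are
  randomized, which is what the control needs.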
    
GATHERING WORD OCCURRENCE RANGES

  Let's make a table with the number of occurrences, index of first
  occurrence, index of last occurrence, and range width (last minus
  first) for all words in each book:
  
    foreach book ( ${books} )
      foreach ord ( n r )
        set wdsfile = "${book}-${ord}.wds"
        set flrfile = "${book}-${ord}.flr"
        echo "${wdsfile} -> ${flrfile}"
        cat ${wdsfile} \
          | find-first-last-occs \
          | sort -b -k5,5 \
          > ${flrfile}
      end
    end
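
  The "find-first-last-occs" script is not reproduced in this note. As a
  rough sketch of what such a pass has to compute, assuming 1-based token
  indices (the column order of the real ".flr" files is not shown here,
  and the function name is made up):

```python
def first_last_occs(words):
    """Map each word to (count, first index, last index, range width)."""
    stats = {}
    for i, w in enumerate(words, start=1):  # 1-based index: an assumption
        if w in stats:
            k, first, _ = stats[w]
            stats[w] = (k + 1, first, i)    # bump count, move last occurrence
        else:
            stats[w] = (1, i, i)            # first sighting: first = last = i
    return {w: (k, f, l, l - f) for w, (k, f, l) in stats.items()}

if __name__ == "__main__":
    table = first_last_occs(["a", "b", "a", "c", "a", "b"])
    for w, (k, f, l, s) in sorted(table.items()):
        print(k, f, l, s, w)
```

  A word that occurs only once gets range width 0.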

SCATTERPLOTS OF WORD FREQUENCY VERSUS WORD RANGE WIDTH
    
  Let's make a scatterplot of this data, where each point is a
  word, and the axes are the number of occurrences {K} and
  the range width {S}:
    
    make -f range-scatterplots.make \
      BOOKS="${books}" FMTS="${fmts}" \
      all
    
HISTOGRAMS OF RANGE WIDTHS FOR CERTAIN FREQS
    
  Let's make histograms of range widths for selected values of K:

    make -f range-hists.make \
        BOOKS="${books}" FMTS="${fmts}" \
        KMIN=03 KMAX=10 \
        all

THEORETICAL ANALYSIS

  If the K occurrences of a certain word were distributed randomly
  among N tokens, the probability of the first occurrence being at
  position x, and the last one being at position y (with y > x), is
  
               K(K-1)  (y-x-1)(y-x-2)...(y-x-K+2)
    P(K,x,y) = ------  --------------------------
               N(N-1)     (N-2)(N-3)...(N-K+1)   
           
               K(K-1)  choose(y-x-1,K-2)
             = ------  -----------------
               N(N-1)   choose(N-2,K-2)                     
             
  The probability that max-min = R is therefore obtained by summing
  over the N-R possible values of x:

    Q(K,R) = (N-R) P(K,x,x+R) 
           
                   K(K-1)  choose(R-1,K-2)
           = (N-R) ------  ---------------
                   N(N-1)  choose(N-2,K-2)                     
                                     
  The probability that max-min \leq S is
  
    F(K,S) = SUM{ Q(K,R) : R \in 1..S }
    
  We now compute the value of -log F(K,S) for every word:
  
    foreach book ( ${books} )
      foreach ord ( n r )
        set smp = "${book}-${ord}"
        set N = `cat ${smp}.wds | wc -l`
        echo "${smp}.flr -> ${smp}.prr N = ${N}"
        cat ${smp}.flr \
          | compute-range-prob -v N=${N} \
          | sort -b -k4,4gr \
          > ${smp}.prr
      end
    end
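
  The "compute-range-prob" script is not reproduced in this note. The
  quantity it is meant to produce can be sketched in Python as below,
  under two assumptions: the function names are made up, and the number
  of admissible first positions for a given R is taken to be N-R, which
  is the count that makes Q(K,R) sum to 1 over R. The code uses the
  equivalent form Q(K,R) = (N-R) choose(R-1,K-2) / choose(N,K), valid
  for K >= 2.

```python
from math import comb, log

def q(N, K, R):
    """Probability that last - first == R for K random positions among N."""
    if K < 2 or R < K - 1 or R > N - 1:
        return 0.0                      # impossible range width (or K too small)
    return (N - R) * comb(R - 1, K - 2) / comb(N, K)

def neg_log_f(N, K, S):
    """-log of the probability that last - first <= S."""
    F = sum(q(N, K, R) for R in range(K - 1, S + 1))
    return -log(F) if F > 0 else float("inf")

if __name__ == "__main__":
    N, K = 20, 4
    print(sum(q(N, K, R) for R in range(1, N)))  # sanity check: should be close to 1
    print(neg_log_f(N, K, 5))
```

  Large values of -log F(K,S) mark words whose occurrences are packed
  into a much narrower range than chance would suggest. The direct sum
  is slow for large N and K; it is meant as a check, not as a
  replacement for the real script.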
      
    
  Let's determine the value of N (including parag breaks but not 
  line breaks or figure intrusions):
  
    set N=`cat voy-n.wds | wc -l`
    echo $N
    
      39475
  
   
WORD-WORD GAPS

  Let N be the number of tokens. If a word occurs K times, 
  the distances between successive occurrences (in cyclic fashion)
  will have mean N/K. If the words were distributed at random,
  distance d would be expected to occur N*