Hacking at the Voynich manuscript - Side notes
067 Looking for words with lumpy distribution

Last edited on 2004-09-26 21:50:36 by stolfi

  Many words in the VMS have a lumpy distribution: their occurrences
  are bunched in a few places instead of being spread over the
  whole text. Let's identify them and see how they correlate.

LINK SETUP

  ln -s /home/staff/stolfi/voynich/work

VOYNICHESE SAMPLE

  Got some copy of the VMS text, majority version, one word per line.
  We map weirdos to "*", remove all #-comments and {}-comments,
  and clean up the special characters:

    "."  assumed to be a word break
    ","  deleted
    "!"  deleted
    "%"  mapped to "*"
    "*"  retained as letter
    "-"  retained as separate word (line break or plant intrusion)
    "="  retained as separate word (parag break, end of label)

  The file is "voy-n.wds". (Someday should redo this with a more
  official version.)

ENGLISH SAMPLE

  For controls, got also an English sample: H. G. Wells's "War of the
  Worlds", from my langbank project, directory "engl/wow". Mapped
  upper to lower case, removed all chapter and book titles and
  Kepler's quote. Mapped hyphens "-" to "~", dashes to "--" (as a
  single word), line breaks to "-", and parag breaks to "=".
  Removed all punctuation except:

    "'"  (possessive and elision marker) retained as letter
    "~"  (hyphen) retained as letter
    "-"  (line break) retained as word
    "="  (parag break) retained as word

  The file is "wow-n.wds".

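  The character rules listed for the Voynichese sample can be
  sketched as follows. This is only an illustration in Python, not
  the filter actually used to build "voy-n.wds". The function name
  {clean_line} is hypothetical, and the input is assumed to be one
  transcription line with "." between words. A "#" anywhere on the
  line is assumed to start a comment. The mapping of weirdos to "*"
  is left out, since the set of weirdo codes is not given here.

```python
import re

def clean_line(line: str) -> list[str]:
    """Split one transcription line into words (hypothetical helper).

    Assumptions: "." separates words, "#" starts a comment that runs
    to the end of the line, and {...} is an inline comment.
    """
    line = line.split("#", 1)[0]            # remove #-comments
    line = re.sub(r"\{[^}]*\}", "", line)   # remove {}-comments
    line = line.replace(",", "")            # "," deleted
    line = line.replace("!", "")            # "!" deleted
    line = line.replace("%", "*")           # "%" mapped to "*"
    # "-" and "=" become separate words: surround them with word breaks.
    line = re.sub(r"([-=])", r".\1.", line)
    # "." is a word break; drop the empty pieces it leaves behind.
    return [w for w in line.strip().split(".") if w]

if __name__ == "__main__":
    # One word per line, as in the ".wds" files.
    for word in clean_line("daiin.{plant}chol%y=  # end of parag"):
        print(word)
```

  Each returned word would go on its own line of the ".wds" file;
  "*" needs no special handling because it is already a letter.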
SHELL VARIABLES

  set books = ( voy wow )
  set fmts = ( png )
  set kmax = 9

SCRAMBLED WORD LISTS

  For control experiments, we prepare another version of the word
  lists, where the words have been permuted in random order:

    foreach book ( ${books} )
      set nfile = "${book}-n.wds"
      set rfile = "${book}-r.wds"
      echo "${nfile} -> ${rfile}"
      cat ${nfile} \
        | gawk \
            ' BEGIN{ srand(12345678); } \
              //{ printf "%17.15f %s\n", rand(), $1; } \
            ' \
        | sort -b +0 -1g \
        | gawk '//{ print $2; }' \
        > ${rfile}
    end

GATHERING WORD OCCURRENCE RANGES

  Let's make a table with the number of occurrences, index of first
  occurrence, index of last occurrence, and range width (last minus
  first) for all words in each book:

    foreach book ( ${books} )
      foreach ord ( n r )
        set wdsfile = "${book}-${ord}.wds"
        set flrfile = "${book}-${ord}.flr"
        echo "${wdsfile} -> ${flrfile}"
        cat ${wdsfile} \
          | find-first-last-occs \
          | sort -b +4 -5 \
          > ${flrfile}
      end
    end

SCATTERPLOTS OF WORD FREQUENCY VERSUS WORD RANGE WIDTH

  Let's make a scatterplot of this data, where each point is a word,
  and the axes are the number of occurrences {K} and the range
  width {S}:

    make -f range-scatterplots.make \
      BOOKS="${books}" FMTS="${fmts}" \
      all

HISTOGRAMS OF RANGE WIDTHS FOR CERTAIN FREQS

  Let's make histograms of range widths for selected values of K:

    make -f range-hists.make \
      BOOKS="${books}" FMTS="${fmts}" \
      KMIN=03 KMAX=10 \
      all

THEORETICAL ANALYSIS

  If the K occurrences of a certain word were distributed randomly
  among N tokens, the probability of the first occurrence being at
  position x, and the last one being at position y (with y > x), is

               K(K-1)  (y-x-1)(y-x-2)···(y-x-K+2)
    P(K,x,y) = ------  --------------------------
               N(N-1)     (N-2)(N-3)···(N-K+1)

               K(K-1)  choose(y-x-1,K-2)
             = ------  -----------------
               N(N-1)   choose(N-2,K-2)

  since the other K-2 occurrences must fall in the y-x-1 positions
  strictly between x and y. There are N-R positions x for which
  both x and x+R lie in the text, so the probability that
  max-min = R is

    Q(K,R) = (N-R) P(K,x,x+R)

                   K(K-1)  choose(R-1,K-2)
           = (N-R) ------  ---------------
                   N(N-1)  choose(N-2,K-2)

  The probability that max-min \leq S is

    F(K,S) = SUM{ Q(K,R) : R \in 1..S }

  We now
  compute the value of -log F(K,S) for every word:

    # ${samples} is not set above; presumably voy-n voy-r wow-n wow-r.
    foreach smp ( ${samples} )
      set N=`cat ${smp}.wds | wc -l`
      echo "${smp}.flr -> ${smp}.prr N = ${N}"
      cat ${smp}.flr \
        | compute-range-prob -v N=${N} \
        | sort -b +3 -4gr \
        > ${smp}.prr
    end

  Let's determine the value of N (including parag breaks but not
  line breaks or figure intrusions):

    set N=`cat voy-n.wds | wc -l`
    echo $N

      39475

WORD-WORD GAPS

  Let N be the number of tokens. If a word occurs K times, the
  distances between successive occurrences (in cyclic fashion) will
  have mean N/K. If the words were distributed at random, distance d
  would be expected to occur N*