Hacking at the Voynich manuscript - Side notes
064 Looking for word quadruples with skew frequencies

Last edited on 2012-05-03 20:41:08 by stolfilocal

INTRODUCTION

  Schemas for generating pseudo-Voynichese, like Gordon Rugg's
  grille and my Voynichese probabilistic word grammar, generally
  produce word frequency distributions that are rather regular,
  because each choice made by the model generally depends on only a
  small subset of the preceding choices.

  In particular, in my Voynichese grammar and in low-order Markov
  chains, the choice of suffix is generally independent of the choice
  of prefix. Thus, if A,B are two possible prefixes that can be
  attached to a midfix M, and X,Y are two possible suffixes, then the
  four words AMX, AMY, BMX, BMY will be generated with probabilities
  of the form

    p(AMX) = p(A)·p(M)·p(X)
    p(AMY) = p(A)·p(M)·p(Y)
    p(BMX) = p(B)·p(M)·p(X)                                    (1)
    p(BMY) = p(B)·p(M)·p(Y)

  Here p(M) is the total probability of the four words; p(A),p(B)
  are the relative probabilities of A and B, normalized so that
  p(A)+p(B) = 1; and ditto for p(X) and p(Y).

  Note that those four numbers are not independent, because they
  satisfy the equation

    p(AMX)/p(AMY) = p(BMX)/p(BMY)                              (2)

  Thus, one way of discrediting the mechanical gibberish theory is
  to find quadruples of words with long M and very short A,B,X,Y,
  for which equation (2) is grossly violated.

  If we let z(E) = lg p(E) (lg = logarithm to base 2), then the
  equations (1) become z(AMX) = z(A) + z(M) + z(X), etc., and
  equation (2) becomes

    z(AMX) - z(AMY) = z(BMX) - z(BMY)                          (3)

  This is a linear equation, so it is natural to measure the
  "skewness" of the four probabilities by the square of the
  mismatch, i.e.

    S(M,A,B,X,Y) = (z(AMX) - z(AMY) - z(BMX) + z(BMY))^2       (4)

  Note that p(M) cancels out in formulas (2)--(4). I.e. the skewness
  S does not depend on the frequency of the four words as a group,
  but only on their relative frequencies within the group.
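  As a quick check of formulas (1)--(4), the following minimal Python
  sketch builds four probabilities from an independent prefix/suffix
  model and confirms that their skewness vanishes, while a
  hand-picked skewed quadruple gives a large S that is unchanged when
  all four probabilities are scaled together. The function names and
  all numeric values are illustrative assumptions, not taken from any
  corpus or from the scripts used in these notes.

```python
from math import log2


def skewness(p_amx, p_amy, p_bmx, p_bmy):
    """Formula (4): squared mismatch of the base-2 log probabilities."""
    return (log2(p_amx) - log2(p_amy) - log2(p_bmx) + log2(p_bmy)) ** 2


def independent_quad(p_m, p_a, p_x):
    """Formula (1), with p(B) = 1 - p(A) and p(Y) = 1 - p(X)."""
    p_b, p_y = 1.0 - p_a, 1.0 - p_x
    return (p_a * p_m * p_x, p_a * p_m * p_y,
            p_b * p_m * p_x, p_b * p_m * p_y)


# Made-up probabilities: suffix chosen independently of prefix.
s_independent = skewness(*independent_quad(p_m=0.01, p_a=0.7, p_x=0.2))

# Made-up quadruple where A prefers X and B prefers Y (violates (2)).
skewed = (0.004, 0.001, 0.001, 0.004)
s_skewed = skewness(*skewed)

# Scaling the whole group by a common factor must not change S.
s_scaled = skewness(*(10 * p for p in skewed))
```

  Here s_independent is zero up to rounding error, and s_skewed
  equals s_scaled, which illustrates that only the relative
  frequencies within the group matter.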

  Some things that we must take care of:

    * Some letter pairs or groups may be meaningless calligraphic
      variants, e.g. {k,t}, {ch,ee}.

    * There may be many transcription errors between pairs of
      similar letters, e.g. {r,s}, {o,a}.

    * If the count n(AMX) of AMX is too small, the logarithm z(AMX)
      is tricky to estimate. In particular, if the count is zero, we
      cannot let p(AMX) = 0 because then z(AMX) = -oo.

  To alleviate the last problem, we will use the estimate
  p(word) = (n(word)+b)/(N+b), where N is the total number of tokens
  and b is a positive parameter, say b = 1. I.e.
  z(word) = lg(n(word)+b) - lg(N+b). The lg(N+b) terms cancel in
  formula (4), so that

    S(M,A,B,X,Y) = ( k(AMX) - k(AMY) - k(BMX) + k(BMY) )^2     (5)

  where k(word) = lg(n(word)+b). For b = 1 we have

    n(W)    0    1    2    3    4    5    7
    k(W)  0.0  1.0  1.6  2.0  2.3  2.6  3.0

  The following table shows the value of S(M,A,B,X,Y) as a function
  of the counts of the four words AMX, AMY, BMX, BMY. The columns
  correspond to bias b=1,2,4,8.

    test-skewness < test-skewness-in.txt

      (AMX,AMY,BMX,BMY)     b=1    b=2    b=4    b=8
      ( 7, 0, 0, 7)    =  36.00  18.83   8.52   3.29
      ( 7, 0, 0, 0)    =   9.00   4.71   2.13   0.82
      ( 8, 4, 2, 5)    =   3.42   2.38   1.37   0.63
      ( 2, 0, 0, 2)    =  10.05   4.00   1.37   0.41
      ( 2, 0, 0, 1)    =   6.68   2.51   0.82   0.24
      ( 8, 4, 8, 1)    =   1.75   1.00   0.46   0.17
      ( 1, 0, 0, 1)    =   4.00   1.37   0.41   0.12
      ( 2, 0, 0, 0)    =   2.51   1.00   0.34   0.10
      ( 2, 1, 0, 1)    =   2.51   1.00   0.34   0.10
      ( 8, 4, 8, 2)    =   0.54   0.34   0.17   0.07
      ( 1, 0, 0, 0)    =   1.00   0.34   0.10   0.03
      ( 1, 1, 1, 0)    =   1.00   0.34   0.10   0.03
      ( 2, 1, 0, 0)    =   0.34   0.17   0.07   0.02
      ( 2, 1, 1, 1)    =   0.34   0.17   0.07   0.02
      ( 3, 3, 3, 4)    =   0.10   0.07   0.04   0.02
      ( 2, 1, 1, 0)    =   0.17   0.03   0.00   0.00
      ( 8, 4, 5, 2)    =   0.02   0.00   0.00   0.00

LINK SETUP

  We need the good word frequencies per section:

    ln -s ../tr-stats/dat
    ln -s ../.. work
    ln -s work/factor-field-general
    ln -s work/factor-text-trivial.gawk
    ln -s work/factor-text-viqr-to-phon.gawk
    ln -s work/factor-text-eva-to-basic.gawk

COMPUTING SKEWNESS OF QUADRUPLES

  Define the sample texts to analyze, the orthographical elems
  (glyphs, letters, etc.), and the limits on the number of elems in
  each part: maximum prefix length, minimum midfix length, maximum
  suffix length (the three dot-separated numbers at the end of each
  entry):

  For tests:

    set sampelems = ( \
      voyn/maj/hea.1/bgly/2.2.4 \
    )

  For real:

    set sampelems = ( \
      voyn/maj/{hea.1,heb.1,bio.1}/bgly/2.2.4 \
      voyp/{grs,grm}/tot.1/bgly/2.2.4 \
      engl/wow/tot.1/lets/1.5.3 \
      viet/ptt/tot.1/phon/1.1.1 \
    )

  Create the output directories, and check for necessary files:

    foreach sun ( ${sampelems} )
      set su = "${sun:h}"
      set sample = "${su:h}"
      set elem = "${su:t}"
      set opus = "${sample:h}"
      mkdir -p res/${sample}
      echo " "
      ls -lL dat/${opus}/factor-text-to-${elem}.gawk
      ls -lL dat/${sample}/gud.wfr
    end

  Generate all prefix/midfix/suffix combinations from the good words:

    set bias = 8
    foreach sun ( ${sampelems} )
      set su = "${sun:h}"
      set sample = "${su:h}"
      set elem = "${su:t}"
      set opus = "${sample:h}"
      set nums = ( `echo ${sun:t} | tr '.' ' '` )
      cat dat/${sample}/gud.wfr \
        | factor-field-general \
            -f dat/${opus}/factor-text-to-${elem}.gawk \
            -v inField=3 -v outField=4 \
        | gawk '//{ print $1, $4; }' \
        | generate-all-mid-pre-suf-combs \
            -v bias=${bias} \
            -v maxpre=${nums[1]} -v minmid=${nums[2]} -v maxsuf=${nums[3]} \
        | sort -b -k2,2 -k1,1nr \
        > res/${sample}/gud.mps
    end

  Enumerate all quadruples:

    foreach sun ( ${sampelems} )
      set su = "${sun:h}"
      set sample = "${su:h}"
      set elem = "${su:t}"
      set opus = "${sample:h}"
      cat res/${sample}/gud.mps \
        | evaluate-all-word-quads \
            -v bias=${bias} \
            -v minskew=0.10 -v minhits=7 \
        | sort -b -k1,1gr \
        > res/${sample}/gud.skq
    end
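
  The scripts generate-all-mid-pre-suf-combs and
  evaluate-all-word-quads are not reproduced in these notes. The
  Python sketch below is only a guess at the kind of computation
  involved: it implements formula (5) and scans a table of word
  counts for quadruples whose skewness is at least minskew. It
  assumes that each word is a string with one character per elem and
  that empty prefixes and suffixes are allowed; the function names
  are hypothetical. The minhits parameter is omitted because its
  meaning is not stated here.

```python
from itertools import combinations
from math import log2


def smoothed_skewness(n_amx, n_amy, n_bmx, n_bmy, bias=1.0):
    """Formula (5): S from raw counts, with k(word) = lg(n(word) + bias)."""
    def k(n):
        return log2(n + bias)
    return (k(n_amx) - k(n_amy) - k(n_bmx) + k(n_bmy)) ** 2


def skewed_quads(counts, maxpre, minmid, maxsuf, bias=8.0, minskew=0.10):
    """Return [(S, M, A, B, X, Y), ...] sorted by decreasing S.

    ASSUMPTIONS (not from the notes): words are strings with one
    character per elem, and empty prefixes and suffixes are allowed.
    """
    # Index every admissible split word = prefix + midfix + suffix by midfix.
    by_mid = {}
    for word, n in counts.items():
        for i in range(min(maxpre, len(word)) + 1):
            for j in range(min(maxsuf, len(word) - i) + 1):
                mid = word[i:len(word) - j]
                if len(mid) >= minmid:
                    key = (word[:i], word[len(word) - j:])
                    by_mid.setdefault(mid, {})[key] = n

    quads = []
    for mid, table in by_mid.items():
        prefixes = sorted({pre for pre, _ in table})
        suffixes = sorted({suf for _, suf in table})
        # Swapping A with B (or X with Y) only flips the sign of the
        # mismatch, so unordered pairs are enough.
        for a, b in combinations(prefixes, 2):
            for x, y in combinations(suffixes, 2):
                # Unattested words count as zero; the bias keeps lg finite.
                s = smoothed_skewness(
                    table.get((a, x), 0), table.get((a, y), 0),
                    table.get((b, x), 0), table.get((b, y), 0), bias)
                if s >= minskew:
                    quads.append((s, mid, a, b, x, y))
    return sorted(quads, reverse=True)


# Toy counts (made up): only "ammx" and "bmmy" are attested, 7 tokens each.
toy = {"ammx": 7, "bmmy": 7}
top = skewed_quads(toy, maxpre=1, minmid=2, maxsuf=1, bias=8.0)
```

  With these toy counts the only quadruple found has M = "mm",
  A,B = "a","b" and X,Y = "x","y", and its counts (7, 0, 0, 7) match
  the first row of the skewness table for b=8. On a real word list
  the nested loops may be slow; this sketch favours clarity over
  speed.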