Hacking at the Voynich manuscript - Side notes
055 Occurrences of the OL/OR words

Last edited on 1999-11-25 17:11:21 by stolfi

INTRODUCTION

  The topic of this note is the study of occurrences and contexts
  of the short words "ol", "or" and look-alikes.

DATA COLLECTION

  The starting data was the machine-readable concordance (see Notes/037),
  which has one entry for each word occurrence (and many short phrases)
  in the format

    LOC TRANS START LENGTH LCTX STRING RCTX PATT STAG PNUM HNUM
    1   2     3     4      5    6      7    8    9    10   11

  We extracted all occurrences of "al", "ol", "ar", "or", "as", "os"
  in the majority version ("A"):

    ln -s ../../Notes/037/vms-17-ok.hoc.gz

    zcat vms-17-ok.hoc.gz \
      | gawk '((substr($2,1,1)=="A")&&($6 ~ /^[ao][lrs]$/)){print;}' \
      | sort -b +5 -6 +6 -7 +4 -5 +9 -10n +0 -1 +2 -3n \
      > orol-sep.hoc

    dicio-wc orol-sep.hoc

      lines   words     bytes file
    ------- ------- --------- ------------
       1589   17479    122346 orol-sep.hoc

  We also extracted all words that begin or end with those target words:

    zcat vms-17-ok.hoc.gz \
      | gawk '((substr($2,1,1)=="A")&&($6 ~ /^[ao][lrs][*a-z]+$/)){print;}' \
      | sort -b +5 -6 +6 -7 +4 -5 +9 -10n +0 -1 +2 -3n \
      > orol-wds-r.hoc

    zcat vms-17-ok.hoc.gz \
      | gawk '((substr($2,1,1)=="A")&&($6 ~ /^[*a-z]+[ao][lrs]$/)){print;}' \
      | sort -b +5 -6 +6 -7 +4 -5 +9 -10n +0 -1 +2 -3n \
      > orol-wds-l.hoc

    dicio-wc orol-wds-{l,r}.hoc

      lines   words     bytes file
    ------- ------- --------- ------------
       9234  101574    746851 orol-wds-l.hoc
       1684   18524    137013 orol-wds-r.hoc

  We then extracted the forward and backward 1-word contexts from the
  separated-words file. To allow a fair comparison with the joined-words
  set, we deleted any leading and trailing [qaoy] strings from the context.
  Left contexts of the separated words (last word of LCTX):

    cat orol-sep.hoc \
      | gawk \
          ' /./{ \
              str = $6; \
              lnw = split($5, lw, /[-=/.,]/); \
              lc = lw[lnw-1]; \
              gsub(/^[qaoy]*/, "", lc); gsub(/[qaoy]*$/, "", lc); \
              if (lc == "") { lc = "_"; } \
              printf "%s %s\n", str, lc; \
            } \
          ' \
      | sort | uniq -c | expand \
      | sort -b +1 -2 +0 -1nr +2 -3 \
      > orol-sep.lfr

  Right contexts of the separated words (first word of RCTX):

    cat orol-sep.hoc \
      | gawk \
          ' /./{ \
              str = $6; \
              rnw = split($7, rw, /[-=/.,]/); \
              rc = rw[2]; \
              gsub(/^[qaoy]*/, "", rc); gsub(/[qaoy]*$/, "", rc); \
              if (rc == "") { rc = "_"; } \
              printf "%s %s\n", str, rc; \
            } \
          ' \
      | sort | uniq -c | expand \
      | sort -b +1 -2 +0 -1nr +2 -3 \
      > orol-sep.rfr

  Left contexts of the joined words (the part of the word before the
  final target string):

    cat orol-wds-l.hoc \
      | gawk \
          ' /./{ \
              str = substr($6,length($6)-1,2); \
              lc = substr($6,1,length($6)-2); \
              gsub(/^[qaoy]*/, "", lc); gsub(/[qaoy]*$/, "", lc); \
              if (lc == "") { lc = "_"; } \
              printf "%s %s\n", str, lc; \
            } \
          ' \
      | sort | uniq -c | expand \
      | sort -b +1 -2 +0 -1nr +2 -3 \
      > orol-wds.lfr

  Right contexts of the joined words (the part of the word after the
  initial target string):

    cat orol-wds-r.hoc \
      | gawk \
          ' /./{ \
              str = substr($6,1,2); \
              rc = substr($6,3,length($6)-2); \
              gsub(/^[qaoy]*/, "", rc); gsub(/[qaoy]*$/, "", rc); \
              if (rc == "") { rc = "_"; } \
              printf "%s %s\n", str, rc; \
            } \
          ' \
      | sort | uniq -c | expand \
      | sort -b +1 -2 +0 -1nr +2 -3 \
      > orol-wds.rfr

STATISTICS

  Next we tabulated the counts, frequencies, and anomalies. The anomaly
  for a string S and a context C is the ratio of the observed frequency
  of C next to S to the expected value of that frequency (the product of
  the frequencies of C and of S). The frequencies are computed by the
  "stabilized" estimator

    freq[case] = (count[case] + 1)/(totalCount + nCases).

  We omitted the strings "as" and "os" because they are too rare and
  generated mostly noise.
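  As a rough illustration of the stabilized estimator and the anomaly
  ratio described above, here is a minimal sketch in plain awk. It is not
  the "format-context-anomalies" script used in this note, whose source is
  not shown here. The function name "anomaly_table" is hypothetical, and
  the choice of nCases (number of distinct strings, number of distinct
  contexts, and their product for string/context pairs) is an assumption.
  Input lines are "COUNT STRING CONTEXT", as in the .lfr and .rfr files;
  the printed anomaly is the plain ratio, so 1.00 means "as expected"
  (multiply by 10 to get the scale where 10 = normal).

```shell
# Sketch only: reads "COUNT STRING CONTEXT" lines on stdin and prints
# "STRING CONTEXT COUNT ANOMALY" for every observed pair, sorted.
anomaly_table() {
  awk '
    NF >= 3 {
      n[$2 SUBSEP $3] += $1            # count of the pair (string, context)
      if (!($2 in ns)) { nstr++ }      # number of distinct strings
      if (!($3 in nc)) { nctx++ }      # number of distinct contexts
      ns[$2] += $1; nc[$3] += $1; tot += $1
    }
    END {
      for (k in n) {
        split(k, p, SUBSEP)
        # Stabilized estimator: (count + 1) / (totalCount + nCases).
        # ASSUMPTION: nCases is nstr, nctx, and nstr*nctx respectively.
        fs = (ns[p[1]] + 1) / (tot + nstr)
        fc = (nc[p[2]] + 1) / (tot + nctx)
        fp = (n[k] + 1) / (tot + nstr * nctx)
        # Anomaly: observed pair frequency over the product of marginals.
        printf "%s %s %d %.2f\n", p[1], p[2], n[k], fp / (fs * fc)
      }
    }
  ' | LC_ALL=C sort
}
```

  It could be run as "anomaly_table < orol-sep.lfr", after dropping the
  "as" and "os" rows as is done in the loops of this section. The sketch
  ignores the "ncmin" threshold that the real scripts accept.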
    foreach rw ( freestanding.sep attached.wds )
      set r = "${rw:r}"
      set w = "${rw:e}"
      foreach tx ( left.l right.r )
        set t = "${tx:r}"
        set x = "${tx:e}"
        foreach sy ( counts.c freqs.f types.t anomalies.a )
          set s = "${sy:r}"
          set y = "${sy:e}"
          echo "${s} of ${t} contexts - ${r}"
          cat orol-${w}.${x}fr \
            | gawk '($2 ~ /s$/){next;} /./{print;}' \
            | format-context-${s} -v ncmin=12 \
            > orol-${w}.${x}t${y}
        end
      end
    end

  Computing the same tables with equivalence mappings:

    foreach rw ( freestanding.sep attached.wds )
      set r = "${rw:r}"
      set w = "${rw:e}"
      foreach tx ( left.l right.r )
        set t = "${tx:r}"
        set x = "${tx:e}"
        foreach sy ( counts.c freqs.f types.t anomalies.a )
          set s = "${sy:r}"
          set y = "${sy:e}"
          echo "${s} of ${t} contexts - ${r}"
          /bin/rm -f orol-${w}-eqv.${x}t${y}
          cat orol-${w}.${x}fr \
            | map-field \
                -v forgiving=1 \
                -v inField=3 \
                -v outField=3 \
                -v table=eqv-${x}.tbl \
            | gawk '($2 ~ /s$/){next;} /./{print $1, $2, $3;}' \
            | format-context-${s} -v ncmin=8 \
            > orol-${w}-eqv.${x}t${y}
        end
      end
    end

  Computing the distance matrix between the four target words:

    foreach rw ( freestanding.sep attached.wds )
      set r = "${rw:r}"
      set w = "${rw:e}"
      foreach tx ( left.l right.r )
        set t = "${tx:r}"
        set x = "${tx:e}"
        echo "distance matrix based on ${t} contexts - ${r}"
        cat orol-${w}.${x}fr \
          | gawk '($2 ~ /s$/){next;} /./{print $1, $2, $3;}' \
          | compute-distance-matrix \
          > orol-${w}.${x}dm
      end
    end

  Comparing the context frequencies of joined vs. separated words.

    foreach tx ( left.l right.r )
      set t = "${tx:r}"
      set x = "${tx:e}"
      foreach rw ( freestanding.sep attached.wds )
        set r = "${rw:r}"
        set w = "${rw:e}"
        cat orol-${w}.${x}tf \
          | egrep '[0-9] *$' \
          | sort +0 -1 \
          > .${w}
      end
      /n/gnu/bin/join \
          -1 1 -2 1 \
          -a 1 -a 2 \
          -e '.' \
          -o'0,1.2,1.3,1.4,1.5,1.6,2.2,2.3,2.4,2.5,2.6' \
          .sep .wds \
        | gawk \
            ' /./{ \
                printf "%-20s %4s %3s %3s %3s %3s %4s %3s %3s %3s %3s %4d\n", \
                  $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$2+$7; \
              } \
            ' \
        | sort -b +11 -12nr \
        > orol-cmp.${x}tf
    end

ANALYSIS

  The most common words before each string are:

    al: ar(29) aiin(9) or(9) s(8) dair(7) al(6) daiin(6) dar(6) sar(6)
    ol: qokain(11) aiin(10) or(10) qokaiin(10) ar(9) daiin(9) dar(9) ol(9)
    ar: okar(12) otar(10) al(9) ar(9) dar(9) ol(8) or(8) s(8)
    or: ar(9) daiin(9) s(7) chol(5) dal(5) or(5) aiin(4) dar(4) ol(4)
    as: chtar(1) otalef(1)
    os: aiin(2) cheor(2)

  The most common words after each string are:

    al: ar(9) al(6) chedy(6) s(6) aiin(5) dar(4) ol(4)
    ol: chedy(23) shedy(23) daiin(17) aiin(13) cheey(11) sheey(10)
    ar: al(29) aiin(23) ar(9) ol(9) or(9) =(7) am(7) y(6)
    or: aiin(56) ol(10) al(9) ar(9) shedy(9) air(7) cheey(7) chey(7) chol(7)
    as: ainam(1) kai*n(1)
    os: al(3) aiin(2)

  Anomalous frequencies (10 = normal) before the target strings:

    al: os~(27) dair(21) sar(21) ched(20) otaiin(20) ar(19) cheor(14) aiin(13) s(13)
    ol: shedy(22) chey(18) qokal(18) sain(16) qokedy(15) dain(14) okain~(14)
    ar: tar(24) otain(19) al(16) okar~(16) chear(15) dain(14)
    or: dal(22) chear(21) qokal(21) saiin(20) daiin(17) ched(16) otal(16) qokedy(16)

    al: shedy(4) qokal(4) qokain(4) chear(4) shey(5) sain(5) okain~(5) ol(6) okar~(6)
    ol: s(2) ched(3) ar(5) dal(6) dair(6) al(6) otain(7) okar~(7)
    ar: qokal(4) daiin(5) aiin(5) sar(6) otal(6) ar(6)
    or: tar(5) otaiin(5) al(5) okar~(6)

  Anomalous frequencies after the target strings:

    al: s(33) ar(22) shey(18) chedy(15) shedy(5) aiin(5) cheey(7)
    ol: daiin(19) chedy(17) shedy(17) s(15) al(2) aiin(4) or(7) ar(7)
    ar: al(25) or(20) ol(12) shedy(3) chedy(3) s(5)
    or: aiin(18) chey(12) s(2) shey(5) daiin(5) chedy(5)

  Among the 273 occurrences of "al", only 11 (4%) are preceded by a
  word that ends with "y":

    airody aldy chedacphy chey okchey oty qokedy qoteedy shefeeedy
    shkchody ykeedy

  These words occur once each, and do
  not occur before the other four target words.

  Among the 370 occurrences of "ar", only 32 (8.6%) are preceded by
  words that end with "y":

    otedy(2) shey(2) *aly chckhy chedy cheeoy cheody chey ckheeody daly
    dshdy kedy o*eey ody okcheey ota*ky oteey otoy oty qokedy qokeey
    qokey shedy sheedy tchdy teodeey typchey ykeshy ytedy yteeey

DISCUSSION

  The most common pairs, such as "ol chedy" and variants, also
  correspond to common words (namely "olchedy"). Thus it is possible
  that the words "al/ar/ol/or" are not isolated words but part of the
  following word.