Hacking at the Voynich manuscript - Side notes
023 Characterizing Voynichese sub-languages by QOKOKOKO element freqs
Last edited on 1998-07-04 10:25:19 by stolfi
1998-06-20 stolfi
=================
[ First version done on 1998-05-04, now redone with fresher data. ]
In Note 021, I tried to classify pages according to the
frequencies of certain keywords. John Grove pointed out that the
transcription which I used (Friedman's) has inconsistencies which may
masquerade as language differences, e.g. "dain" in place of "daiin" or
vice-versa. Also, it seems that spacing (word division) is quite
inconsistent.
So, in an attempt to avoid those problems, I thought of using, instead
of words, the "elements" of the QOKOKOKO paradigm. See Notes 017 and 018.
Since I am still not clear on how to group the O's with the K's (with the
following K, with the preceding K, with both, or with neither), I will
leave them as separate elements. Also, for simplicity (and without any
real conviction), I will split every double-letter O into two
elements.
Also, given Grove's observations on anomalous "p" and "t"
distributions at beginning-of-line, and the well-known attraction
of certain elements for end-of-line, it seems advisable to discard
the first few and last few elements of every line.
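Concretely, the trimming is done by the trim-three-from-ends filter in
the pipelines of section I below. I do not reproduce that script here,
but a minimal stand-in (my sketch only, assuming one text line per input
line with elements separated by blanks) would be something like

  #! /usr/bin/gawk -f
  # Hypothetical stand-in for trim-three-from-ends: drop the first three
  # and the last three whitespace-separated tokens of every line.
  # (Empty results are weeded out later by the egrep '.' in the pipeline.)
  {
    out = "";
    for (i = 4; i <= NF - 3; i++) { out = (out == "" ? $i : out " " $i); }
    print out;
  }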
I. EXTRACTING AND COUNTING ELEMENTS
We will prepare two sets of statistics, one using the raw elements ("RAW")
and one using element equivalence classes ("EQV").
elem-to-class -describe
element equivalence:
map_ee_to_ch
ignore_gallows_eyes
join_ei
equate_aoy
collapse_ii
equate_eights
equate_pt
erase_word_spaces
append_tilde
This mapping will hopefully reduce transcription and sampling noise.
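I will not list elem-to-class here, but just to fix ideas, a few of the
rules above could be realized as literal substitutions, as in the gawk
fragment below. This is only my guess at the spirit of equate_aoy,
equate_pt, collapse_ii, and append_tilde; the remaining rules, and the
real script, are certainly more subtle.

  #! /usr/bin/gawk -f
  # Partial, hypothetical illustration of the element equivalence
  # (NOT the real elem-to-class).  Reads one element per line.
  {
    e = $0;
    gsub(/[ay]/, "o", e);   # equate_aoy: merge the circles a, o, y
    gsub(/p/,    "t", e);   # equate_pt:  merge p into t
    gsub(/ii+/,  "i", e);   # collapse_ii: a run of i's becomes a single i
    print e "~";            # append_tilde: mark the element as class-mapped
  }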
Factoring the text into elements:
mkdir -p RAW EQV
/bin/rm -rf {RAW,EQV}/efreqs
mkdir -p {RAW,EQV}/efreqs
foreach utype ( pages sections )
foreach f ( `cat text-${utype}/all.names` )
cat text-${utype}/${f}.evt \
| lines-from-evt | egrep '.' \
| factor-OK | egrep '.' \
> /tmp/${utype}-${f}.els
end
end
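The lines-from-evt filter extracts the text proper from the .evt files.
A rough stand-in, assuming the usual format (a "<...>" locator prefix on
each text line, "#" comment lines), and not necessarily the script I
actually use, would be

  #! /usr/bin/gawk -f
  # Rough stand-in for lines-from-evt (assumed .evt format): drop "#"
  # comment lines, keep only lines with a "<locator>" prefix, and
  # strip that prefix, leaving just the transcribed text.
  /^#/ { next; }
  /^<[^>]*>/ { sub(/^<[^>]*>[ \t]*/, ""); print; }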
Counting elements and computing relative frequencies:
foreach ep ( cat.RAW elem-to-class.EQV )
set etag = ${ep:e}; set ecmd = ${ep:r}
/bin/rm -rf ${etag}/efreqs
mkdir -p ${etag}/efreqs
foreach utype ( pages sections )
set frdir = "${etag}/efreqs/${utype}"
mkdir -p ${frdir}
cp -p text-${utype}/all.names ${frdir}/
foreach f ( `cat text-${utype}/all.names` )
echo ${frdir}/$f.frq
cat /tmp/${utype}-${f}.els \
| trim-three-from-ends \
| tr '{}' '\012\012' \
| ${ecmd} | egrep '.' \
| sort | uniq -c | expand \
| sort -b +0 -1nr \
| compute_freqs.gawk \
> ${frdir}/${f}.frq
end
end
end
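For the record, compute_freqs.gawk takes the "COUNT ELEM" pairs produced
by uniq -c and adds a relative-frequency column. A minimal stand-in,
assuming that output format (count, frequency, element, which is how the
.frq files are read back in the next step):

  #! /usr/bin/gawk -f
  # Minimal stand-in for compute_freqs.gawk (assumed behavior): read
  # "COUNT ELEM" pairs, write "COUNT FREQ ELEM" where FREQ is COUNT
  # divided by the total number of element occurrences.
  /./ { n++; ct[n] = $1; el[n] = $2; tot += $1; }
  END {
    for (i = 1; i <= n; i++) {
      printf "%7d %8.6f %s\n", ct[i], (tot > 0 ? ct[i]/tot : 0), el[i];
    }
  }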
Computing total frequencies:
foreach ep ( cat.RAW elem-to-class.EQV )
set etag = ${ep:e}; set ecmd = ${ep:r}
foreach utype ( pages sections )
set fmt = "${etag}/efreqs/${utype}/%s.frq"
set frfiles = ( \
`cat text-${utype}/all.names | gawk '/./{printf "'"${fmt}"'\n",$0;}'` \
)
echo ${frfiles}
cat ${frfiles} \
| gawk '/./{print $1, $3;}' \
| combine-counts \
| sort -b +0 -1nr \
| compute_freqs.gawk \
> ${etag}/efreqs/${utype}/tot.frq
end
end
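Here combine-counts just merges the per-section counts of identical
elements; a trivial stand-in under that reading (output order does not
matter, since the pipeline re-sorts it anyway):

  #! /usr/bin/gawk -f
  # Trivial stand-in for combine-counts (assumed behavior): sum the
  # counts of identical elements coming from the per-section files.
  /./ { tot[$2] += $1; }
  END { for (e in tot) printf "%7d %s\n", tot[e], e; }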
II. TABULATING ELEMENT FREQUENCIES PER SECTION
set sectags = ( `cat text-sections/all.names` )
echo $sectags
foreach etag ( RAW EQV )
tabulate-frequencies \
-dir ${etag}/efreqs/sections \
-title "elem" \
tot ${sectags}
end
Elements sorted by frequency (× 99), per section:
tot unk pha str hea heb bio ast cos zod
------- ------- ------- ------- ------- ------- ------- ------- ------- -------
17 o 16 o 23 o 15 o 20 o 14 y 15 o 18 o 16 o 20 o
12 y 12 a 9 y 11 y 11 y 14 o 15 y 13 y 13 y 13 a
9 a 11 y 8 l 11 a 9 ch 11 d 10 d 9 a 11 a 8 y
8 d 8 d 6 a 8 d 7 a 10 a 8 l 7 d 9 l 8 l
7 l 7 l 6 d 6 l 7 d 6 k 7 a 5 ch 6 d 7 t
5 k 5 r 4 r 6 k 6 l 5 l 6 q 5 ee 4 ch 5 r
4 ch 5 k 4 k 4 q 5 r 4 ch 6 k 5 r 4 k 5 d
4 r 4 ch 4 ch 4 ee 4 k 4 r 3 ee 4 k 4 r 5 ee
4 q 3 t 3 q 4 r 3 t 3 iin 3 che 3 s 4 t 3 ch
3 ee 3 iin 3 ee 3 iin 3 iin 3 che 3 r 3 l 3 iin 3 k
3 iin 3 q 3 che 3 ch 2 sh 2 q 2 she 3 t 2 sh 2 te
3 t 2 che 3 iin 3 che 2 q 2 t 2 t 2 che 2 q 2 iin
3 che 2 sh 2 ke 3 t 2 s 2 ee 2 ch 2 iin 2 ee 1 s
1 sh 1 ee 1 s 1 she 1 che 1 ke 1 in 2 ke 2 che 1 che
1 she 1 she 1 ? 1 sh 1 cth 1 sh 1 ke 1 q 1 s 1 sh
1 ke 1 p 1 sh 1 ke 1 ee 1 she 1 iin 1 ? 1 she 1 she
1 s 1 s 1 she 1 in 0 she 1 s 1 sh 1 e? 0 ke 0 p
0 in 0 ke 1 t 0 p 0 p 0 te 1 s 1 she 0 e? 0 ?
0 p 0 te 0 ckh 0 s 0 ckh 0 p 0 te 1 te 0 p 0 in
0 te 0 ir 0 ckhe 0 te 0 in 0 ckh 0 ckh 1 sh 0 te 0 ke
0 cth 0 cth 0 te 0 ir 0 m 0 f 0 p 0 p 0 ? 0 eee
0 ckh 0 f 0 p 0 eee 0 ke 0 in 0 cth 0 ir 0 cth 0 e?
0 ir 0 in 0 e? 0 ckh 0 cph 0 ir 0 ckhe 0 cth 0 in 0 m
0 ? 0 m 0 iiir 0 cth 0 te 0 m 0 f 0 ckh 0 m 0 cth
0 eee 0 ckh 0 in 0 f 0 cthe 0 cth 0 eee 0 eee 0 ckh 0 cthe
0 f 0 cph 0 cth 0 e? 0 f 0 e? 0 e? 0 cthe 0 eee 0 iir
0 m 0 eee 0 cthe 0 ckhe 0 ? 0 cthe 0 ir 0 in 0 cthe 0 q
0 e? 0 e? 0 ir 0 ? 0 ckhe 0 ? 0 cthe 0 iir 0 ir
0 ckhe 0 ? 0 f 0 m 0 e? 0 ckhe 0 iiin 0 i? 0 iir
0 cthe 0 cfh 0 m 0 iir 0 eee 0 eee 0 h? 0 il 0 cfh
0 cph 0 cthe 0 eee 0 i? 0 ir 0 iir 0 cph 0 j 0 ckhe
0 iir 0 cphe 0 j 0 cthe 0 n 0 iiin 0 cphe 0 ckhe 0 ij
0 iiin 0 ckhe 0 cphe 0 iiin 0 cfh 0 cphe 0 ? 0 cph
0 n 0 iir 0 cph 0 cph 0 cphe 0 cph 0 ck 0 f
0 i? 0 im 0 iiin 0 il 0 i? 0 cfhe 0 ikh 0 im
0 cphe 0 n 0 iir 0 n 0 iiin 0 i? 0 il 0 m
0 cfh 0 i? 0 i? 0 cphe 0 iir 0 im 0 m
0 iiir 0 x 0 cfhe 0 de 0 ct 0 cfh 0 n
0 de 0 iil 0 de 0 x 0 de 0 n 0 cfh
0 il 0 il 0 is 0 is 0 im 0 x 0 de
0 im 0 n 0 im 0 cfhe 0 de 0 i?
0 cfhe 0 id 0 cfhe 0 b 0 id 0 is
0 x 0 pe 0 cfh 0 ck 0 ith
0 is 0 iil 0 id 0 c?
0 j 0 iid 0 iil 0 iir
0 h? 0 id 0 cf 0 b
0 ck 0 iiid 0 g 0 cp
0 ct 0 iiil 0 h? 0 ct
0 iil 0 iim 0 iiil 0 g
0 id 0 iis 0 iiir 0 iil
I have compared these counts with those obtained by removing two, one, or zero
elements from each line end. The ordering of the first six entries in each
column is quite stable, so it is probably not an artifact of the trimming.
Some quick observations: there seem to be three "extremal" samples:
hea ("ch" abundant), bio ("q" important), and zod ("t" important).
There are too many "e?" elements; I must check where they come from
and perhaps modify the set of elements to account for them.
[ It seems that many came from groups of the form "e[ktpf]e" and
"e[ktpf]ee", which could be "c[ktpf]h" and "c[ktpf]he" written
without the ligatures. Most of the remaining ones come from
Friedman's transcription; there are practically none in the more
careful transcriptions. ]
All valid elements that occur at least 10 times in the text:
o y a
q
n in iin iiin
r ir iir iiir
d
s is
l il
m im
j
de
k t ke te
p f
cth ckh cthe ckhe
cph cfh cphe cfhe
ch che
sh she
ee eee
x
Valid elements that occur less than 10 times in the whole text:
iil
ij
pe
ct ck
id
Created a file "RAW/plots/vald/keys.dic" with all the valid elements.
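(For reference, such a dictionary could be regenerated from the totals
with something along these lines; this is just a sketch of the selection
rule, elements with at least 10 occurrences and no "?", and it assumes the
"COUNT FREQ ELEM" layout of the .frq files, not the command actually used.)

  mkdir -p RAW/plots/vald
  cat RAW/efreqs/sections/tot.frq \
    | gawk '/./{ if ($1 >= 10) print $3; }' \
    | egrep -v '[?]' \
    > RAW/plots/vald/keys.dic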
Equiv-reduced elements sorted by frequency (× 99), per section:
tot unk pha str hea heb bio ast cos zod
-------- -------- -------- -------- -------- -------- -------- -------- -------- --------
38 o~ 40 o~ 40 o~ 38 o~ 39 o~ 40 o~ 37 o~ 40 o~ 41 o~ 42 o~
10 t~ 10 t~ 8 l~ 11 t~ 11 ch~ 12 d~ 10 d~ 10 ch~ 9 t~ 11 t~
8 d~ 8 d~ 7 ch~ 8 d~ 9 t~ 10 t~ 9 t~ 8 t~ 9 l~ 8 ch~
8 ch~ 7 l~ 7 d~ 8 ch~ 7 d~ 6 ch~ 8 l~ 7 d~ 7 ch~ 8 l~
7 l~ 6 ch~ 6 t~ 6 l~ 6 l~ 5 l~ 6 q~ 5 r~ 6 d~ 5 d~
4 r~ 5 r~ 4 r~ 4 q~ 5 r~ 4 r~ 5 ch~ 3 s~ 4 r~ 5 r~
4 q~ 3 in~ 3 q~ 4 in~ 4 in~ 3 in~ 3 che~ 3 l~ 3 in~ 3 te~
4 in~ 3 q~ 3 in~ 4 r~ 2 sh~ 3 che~ 3 in~ 3 te~ 2 sh~ 3 in~
3 che~ 2 che~ 3 che~ 4 che~ 2 cth~ 2 q~ 3 r~ 3 che~ 2 q~ 2 che~
1 te~ 2 sh~ 2 te~ 1 te~ 2 q~ 2 te~ 2 she~ 2 in~ 2 che~ 1 s~
1 sh~ 1 te~ 1 s~ 1 she~ 2 s~ 1 sh~ 2 te~ 1 q~ 1 te~ 1 she~
1 she~ 1 she~ 1 ?~ 1 sh~ 1 che~ 1 she~ 1 sh~ 1 ?~ 1 s~ 1 sh~
1 cth~ 1 cth~ 1 cth~ 0 s~ 0 te~ 1 s~ 1 cth~ 1 e?~ 1 she~ 0 ?~
1 s~ 1 s~ 1 sh~ 0 cth~ 0 she~ 1 cth~ 1 s~ 1 she~ 0 cth~ 0 e?~
0 ir~ 0 ir~ 1 she~ 0 ir~ 0 cthe~ 0 ir~ 0 cthe~ 1 cth~ 0 e?~ 0 cthe~
0 cthe~ 0 cthe~ 1 cthe~ 0 cthe~ 0 ir~ 0 cthe~ 0 e?~ 1 sh~ 0 ?~ 0 cth~
0 ?~ 0 e?~ 0 ir~ 0 e?~ 0 ?~ 0 e?~ 0 ir~ 0 ir~ 0 ir~ 0 ir~
0 e?~ 0 ?~ 0 e?~ 0 ?~ 0 e?~ 0 ?~ 0 h?~ 0 cthe~ 0 cthe~ 0 q~
0 n~ 0 id~ 0 i?~ 0 i?~ 0 n~ 0 id~ 0 ith~ 0 i?~ 0 id~
0 i?~ 0 n~ 0 de~ 0 il~ 0 i?~ 0 i?~ 0 ct~ 0 il~
0 il~ 0 i?~ 0 is~ 0 n~ 0 ct~ 0 n~ 0 ?~ 0 id~
0 id~ 0 il~ 0 n~ 0 de~ 0 id~ 0 x~ 0 il~
0 de~ 0 x~ 0 id~ 0 id~ 0 de~ 0 de~ 0 n~
0 ct~ 0 x~ 0 il~ 0 de~
0 is~ 0 is~ 0 b~ 0 i?~
0 x~ 0 is~ 0 is~
0 h?~ 0 h?~ 0 c?~
0 ith~ 0 b~
0 b~
0 c?~
There are 23 valid elements with frequency > 20 under the equivalence:
o
t te
cth cthe
ch che
sh she
d de
id
l r q s m n
in ir im il
Valid elements with frequency below 20:
ct is g b x
Created a file "EQV/plots/vald/keys.dic" with all the valid elements, collapsed by the
above equivalence.
III. PAGE SCATTER-PLOTS
See Note 021 for an explanation of these plots.
Let's now compute the frequencies of these key elements in each page and section:
foreach dic ( vald )
foreach etag ( RAW EQV )
foreach utype ( pages sections )
set frdir = "${etag}/efreqs/${utype}"
set ptdir = "${etag}/plots/${dic}/${utype}"
echo "${frdir}" "${ptdir}"
/bin/rm -rf ${ptdir}
mkdir -p ${ptdir}
cp -p ${frdir}/all.names ${ptdir}
foreach fnum ( tot `cat ${frdir}/all.names` )
printf "%30s/%-7s " "${ptdir}" "${fnum}:"
cat ${frdir}/${fnum}.frq \
| gawk '/./{print $1, $3;}' \
| est-dic-probs -v dic=${etag}/plots/${dic}/keys.dic \
> ${ptdir}/${fnum}.pos
end
end
end
end
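Since est-dic-probs is invoked with gawk's "-v dic=..." syntax, it is
presumably a gawk script that reads the "COUNT ELEM" pairs and estimates
a probability for each key in the dictionary. A stand-in with plain
add-one smoothing (the estimator described in Note 021 is likely
different, and I do not reproduce the exact .pos output format):

  #! /usr/bin/gawk -f
  # Stand-in for est-dic-probs (assumed behavior, not the original):
  # reads "COUNT ELEM" pairs from stdin and estimates a probability
  # for every key listed in the file named by -v dic=FILE, using
  # plain add-one smoothing so that unseen keys get a nonzero value.
  BEGIN {
    nk = 0;
    while ((getline key < dic) > 0) { if (key != "") keys[++nk] = key; }
    close(dic);
  }
  /./ { ct[$2] += $1; tot += $1; }
  END {
    for (i = 1; i <= nk; i++) {
      k = keys[i];
      printf "%8.6f %s\n", (ct[k] + 1) / (tot + nk), k;
    }
  }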
IV. SCATTER PLOTS
Let's plot them:
set sys = "tot-hea"
foreach dic ( vald )
foreach etag ( RAW EQV )
set ptdir = "${etag}/plots/${dic}/pages"
set scdir = "${etag}/plots/${dic}/sections"
set fgdir = "${etag}/plots/${dic}/${sys}"
/bin/rm -rf ${fgdir}
mkdir -p ${fgdir}
cp -p ${ptdir}/all.names ${fgdir}/all.names
make-3d-scatter-plots \
${ptdir} \
${fgdir} \
${scdir}/{tot,hea,heb,bio}.pos
end
end
Again, trying to separate Herbal-A from Pharma:
set sys = "hea-pha"
foreach dic ( vald )
foreach etag ( RAW EQV )
set ptdir = "${etag}/plots/${dic}/pages"
set scdir = "${etag}/plots/${dic}/sections"
set fgdir = "${etag}/plots/${dic}/${sys}"
/bin/rm -rf ${fgdir}
mkdir -p ${fgdir}
cp -p ${ptdir}/all.names ${fgdir}/all.names
make-3d-scatter-plots \
${ptdir} \
${fgdir} \
${scdir}/{hea,pha,heb,bio}.pos
end
end
The scatter plots made with the collapsed (EQV) elements still show the main
sections as separate clusters, but the clusters touch each other.