Hacking at the Voynich manuscript - Side notes
057 Statistics of crust-mantle-core components

Last edited on 2000-10-22 12:22:12 by stolfi

INTRODUCTION

  This note analyzes the statistics of the `multiple layer'
  (crust/mantle/core) components of Voynichese words.

  [ Should be merged with some previous notes. ]

  [ Redone on 2000-10-05 to exclude the letter (which occurs only in
  cos.1, mostly alone), and weird combinations such as , , , etc.
  (which are very few, and just as weird as other weirdos). ]

THE DATA

  The source data will be the majority edition derived from
  interlinear release 1.6e6, already chopped into pages and sections:

    ln -s ../045/subsecs-m text-subsecs
    ln -s ../019/unit-to-type.tbl

    ln -s ../../columnate
    ln -s ../../combine-counts
    ln -s ../../compare-counts
    ln -s ../../compute-cum-freqs
    ln -s ../../compute-freqs
    ln -s ../../count-digraph-freqs
    ln -s ../../count-diword-freqs
    ln -s ../../count-word-lengths
    ln -s ../../select-units
    ln -s ../../tabulate-frequencies
    ln -s ../../words-from-evt

  Get (sub)section names:

    set secs = ( `cat text-subsecs/all.names` )
    set secscm = `echo ${secs} | tr ' ' ','`

  Paper directory:

    set trdir = "/home/staff/stolfi/papers/voynich-words/techrep"

  Extract text words and label words, separately, for each section:

    mkdir data-raw
    mkdir data-raw/words data-raw/labels

    foreach sec ( ${secs} )
      echo ${sec}
      cat text-subsecs/${sec}.evt \
        | select-units \
            -v types='parags,starred-parags,circular-lines,circular-text,radial-lines,titles' \
            -v table=unit-to-type.tbl \
        | words-from-evt \
        > data-raw/words/${sec}.wds
    end

    foreach sec ( ${secs} )
      echo ${sec}
      cat text-subsecs/${sec}.evt \
        | select-units \
            -v types='labels,words' \
            -v table=unit-to-type.tbl \
        | words-from-evt \
        > data-raw/labels/${sec}.wds
    end

  Separate the good and bad words, for each good section:

    mkdir data
    mkdir data/{words,labels}
    mkdir data-bad
    mkdir data-bad/{words,labels}

    foreach f ( words labels )
      foreach sec ( ${secs} )
        echo ${sec}
        cat data-raw/$f/${sec}.wds \
          | condense-valid-groups \
          | egrep -v '[^CESTKPFadefiklmnopqrsty]' \
          | expand-valid-groups \
          > data/$f/${sec}.wds
        cat data-raw/$f/${sec}.wds \
          | condense-valid-groups \
          | egrep '[^CESTKPFadefiklmnopqrsty]' \
          | expand-valid-groups \
          > data-bad/$f/${sec}.wds
      end
    end

  Copy section names to handy places:

    foreach dir ( data-raw data data-bad )
      foreach f ( words labels )
        cp -av text-subsecs/all.names ${dir}/${f}/all.names
        cat text-subsecs/all.names \
          | grep -v 'unk' \
          > ${dir}/${f}/.ok.names
      end
    end

BASIC CHARACTER PAIR FREQUENCIES

    foreach f ( words labels )
      cat data/$f/*.wds \
        | condense-valid-groups \
        | sed -e 's/^/_/g' -e 's/$/_/g' \
        | count-digraph-freqs \
            -v pad='_' \
            -v chars='_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ' \
            -v showentropy=1 \
            -v showstrangeness=1 \
        > data/$f/.digraphs.tbl
    end

ANALYZING CORRELATIONS BETWEEN VARIOUS PARTS OF THE WORD

  In what follows we analyze the frequencies of, and the correlations
  between, the various word components (precrust, premantle, core,
  sufmantle, sufcrust).

PREPARING A TEST SAMPLE

  Generate a word sample:

    set sec = "bio.1"
    echo "data/*/*.evt -> .sample.wds"
    /bin/rm -f .sample.wds
    foreach f ( words labels )
      foreach sec ( `cat data/${f}/all.names` )
        cat data/${f}/${sec}.wds \
          | head -100 \
          >> .sample.wds
      end
    end
    cat .sample.wds \
      | sort | uniq \
      | gawk '/./{printf "%7.5f %s\n", rand(), $0;}' \
      | sort -b +0 -1g \
      > .xxx
    cat .xxx | gawk '/./{print $2;}' > .sample.wds

TESTING THE WORD ANALYSIS SCRIPTS

  Testing the element-factoring script:

    echo ".sample.wds -> .sample.els"
    cat .sample.wds \
      | factor-words -f factor-text.gawk \
          -v hicsmash=1 -v esplit=1 \
      > .sample.els
    head -10 .sample.els

      {_}{a}{d}{_}{e}{_}{e}{o}{d}{y}
      {_}{_}{k}{_}{e}{_}{sh}{_}{e}{y}
      {_}{_}{p}{_}{ch}{_}{e}{_}{d}{y}
      {_}{o}{f}{a}{in}{_}
      {_}{y}{t}{_}{e}{_}{d}{y}
      {_}{_}{ch}{_}{cth}{o}{d}{y}
      {_}{_}{sh}{_}{e}{_}{e}{o}{r}{_}
      {_}{y}{t}{a}{in}{_}
      {_}{_}{s}{a}{in}{o}
      {_}{_}{ch}{_}{e}{_}{t}{y}
      ...
    cat .sample.wds \
      | factor-words -f factor-text.gawk \
          -v eelump=1 \
      > .sample.els
    head -10 .sample.els

      {_}{a}{d}{_}{ee}{o}{d}{y}
      {_}{_}{ke}{_}{she}{y}
      {_}{_}{p}{_}{che}{_}{d}{y}
      {_}{o}{f}{a}{in}{_}
      {_}{y}{te}{_}{d}{y}
      {_}{_}{ch}{_}{cth}{o}{d}{y}
      {_}{_}{sh}{_}{ee}{o}{r}{_}
      {_}{y}{t}{a}{in}{_}
      {_}{_}{s}{a}{in}{o}
      {_}{_}{che}{_}{t}{y}
      ...

  Testing the component-splitting script:

    echo ".sample.wds -> .sample.wsp"
    mv .sample.wsp .sample.wsp~
    cat .sample.wds \
      | factor-words -f factor-text.gawk -v eelump=1 \
      | split-words -v omods=0 \
      > .sample.wsp
    diff .sample.wsp~ .sample.wsp
    head -10 .sample.wsp

      {a}{d}({ee}){o}{d}{y}
      (<{ke}>{she}){y}
      (<{p}>{che}){d}{y}
      {o}(<{f}>){a}{in}
      {y}(<{te}>){d}{y}
      ({ch}<{cth}>){o}{d}{y}
      ({sh}{ee}){o}{r}
      {y}(<{t}>){a}{in}
      {s}{a}{in}{o}
      ({che}<{t}>){y}
      ...

  Testing effect of "omods=1":

    echo ".sample.wds -> .sample.wsp"
    mv .sample-o.wsp .sample-o.wsp~
    cat .sample.wds \
      | factor-words -f factor-text.gawk -v eelump=1 \
      | split-words -v omods=1 \
      > .sample-o.wsp
    diff .sample-o.wsp~ .sample-o.wsp
    head -10 .sample.wsp
    diff .sample{,-o}.wsp | prettify-diff-output

      --- 14c14 -----------------------
      < ({ch}){o}(<{t}>){o}{l}
      > ({ch}{o}<{t}>){o}{l}
      --- 35c35 -----------------------
      < ({ch}){o}(<{k}>{ch}){y}
      > ({ch}{o}<{k}>{ch}){y}
      --- 71c71 -----------------------
      < {o}(<{t}>){o}(<{k}>){o}{l}
      > {o}(<{t}{o}><{k}>){o}{l}
      --- 94c94 -----------------------
      < ({che}){o}(<{k}>{che}<{t}>)
      > ({che}{o}<{k}>{che}<{t}>)
      --- 102c102 -----------------------
      < ({ch}){o}(<{t}>{sh}){o}{l}
      > ({ch}{o}<{t}>{sh}){o}{l}
      --- 117c117 -----------------------
      < ({sh}){o}({sh}){y}
      > ({sh}{o}{sh}){y}
      ...

  OK, let's create the split-word statistics per section. The option
  settings below are appropriate for model building (see Note 058).
    foreach f ( words labels )
      foreach sec ( `cat data/${f}/all.names` )
        set ifile = "data/${f}/${sec}.wds"
        set ofile = "data/${f}/${sec}.wsp"
        echo "${ifile} -> ${ofile}"
        mv ${ofile} ${ofile}~
        cat ${ifile} \
          | factor-words -f factor-text.gawk -v eelump=1 \
          | split-words -v omods=1 \
          > ${ofile}
        diff ${ofile}~ ${ofile} \
          | prettify-diff-output \
          > data/${f}/${sec}.diffs
      end
      (cd data/${f} && dicio-wc *.wsp )
      (cd data/${f} && dicio-wc *.diffs )
    end

      lines   words     bytes file
    ------- ------- --------- ------------
       6555    6555    104974 bio.1.wsp
        146     146      1295 cos.1.wsp
       1353    1353     21389 cos.2.wsp
        713     713     10885 cos.3.wsp
       6703    6703    103030 hea.1.wsp
        823     823     13194 hea.2.wsp
       2820    2820     44853 heb.1.wsp
        510     510      8163 heb.2.wsp
        858     858     12969 pha.1.wsp
       1309    1309     20578 pha.2.wsp
        670     670     10955 str.1.wsp
      10097   10097    170197 str.2.wsp
        202     202      3012 unk.1.wsp
        134     134      2216 unk.2.wsp
         44      44       687 unk.3.wsp
        292     292      4927 unk.4.wsp
        309     309      5147 unk.5.wsp
        431     431      7067 unk.6.wsp
        357     357      5533 unk.7.wsp
          0       0         0 unk.8.wsp
        701     701     10843 zod.1.wsp

      lines   words     bytes file
    ------- ------- --------- ------------
        142     142      2530 bio.1.wsp
          9       9       223 cos.1.wsp
        237     237      4164 cos.2.wsp
         82      82      1551 cos.3.wsp
          1       1        21 hea.1.wsp
          0       0         0 hea.2.wsp
          0       0         0 heb.1.wsp
          0       0         0 heb.2.wsp
         86      86      1696 pha.1.wsp
        143     143      2665 pha.2.wsp
          0       0         0 str.1.wsp
          0       0         0 str.2.wsp
          0       0         0 unk.1.wsp
          0       0         0 unk.2.wsp
          0       0         0 unk.3.wsp
         14      14       219 unk.4.wsp
          0       0         0 unk.5.wsp
          0       0         0 unk.6.wsp
          0       0         0 unk.7.wsp
          2       2        27 unk.8.wsp
        287     287      5389 zod.1.wsp

  Consistency check:

    foreach f ( words labels )
      foreach sec ( ${secs} )
        echo ${f}/${sec}
        cat data/${f}/${sec}.wsp | tr -d '()<>{}' > .tmp
        diff .tmp data/${f}/${sec}.wds
      end
    end
    /bin/rm -f data/{words,labels}/*.diffs

EXTRACTING COMPONENT STATISTICS

  Usually we consider only words that have a simple structure (at most
  one core component and at most one core-mantle component). The
  non-simple words are tabulated separately.
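  The simplicity criterion just stated can be sketched in a few lines
  of Python. This is only an illustration, not the actual
  `select-simple-words' script (whose source is not shown in this
  note). It assumes that, in the `.wsp' format shown earlier, each
  `(...)' group marks one core-mantle stretch and each `<...>' group
  marks one core; the function names are hypothetical.

```python
def count_groups(wsp_word):
    """Return (core-mantle groups, core groups) of a split word.

    ASSUMPTION: '(' opens a core-mantle stretch and '<' opens a core,
    as suggested by the sample split-words output in this note.
    """
    return wsp_word.count("("), wsp_word.count("<")


def is_simple(wsp_word):
    """True if the word has at most one core-mantle group and one core."""
    n_mantle, n_core = count_groups(wsp_word)
    return n_mantle <= 1 and n_core <= 1


if __name__ == "__main__":
    # Words copied from the sample outputs earlier in this note.
    samples = [
        "{a}{d}({ee}){o}{d}{y}",
        "(<{ke}>{she}){y}",
        "({ch}){o}(<{t}>){o}{l}",
        "{o}(<{t}{o}><{k}>){o}{l}",
    ]
    for word in samples:
        print(word, "simple" if is_simple(word) else "non-simple")
```

  Under these assumptions the first two sample words count as simple
  and the last two as non-simple (two core-mantle groups and two
  cores, respectively).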
  Here are the components considered, and their brief explanations:

    set comps = ( pmcns mcn c p s m n pm ns )

    tag     word subset counted        component(s)
    ------- -------------------------  ---------------------------------
    pmcns   simple, all words          entire word
    mcn     simple, all words          core+mantle
    c       simple, all words          core
    p       simple, w/ core or mantle  crust prefix
    s       simple, w/ core or mantle  crust suffix
    m       simple, w/ core            mantle prefix
    n       simple, w/ core            mantle suffix
    pm      simple, w/ core            crust+mantle prefix
    ns      simple, w/ core            crust+mantle suffix

  Testing the component extractor:

    echo ".sample.wds -> .sample-o.wsp"
    cat .sample.wds \
      | factor-words -f factor-text.gawk -v eelump=1 \
      | split-words -v omods=1 \
      > .sample-o.wsp

    foreach item ( ${comps} )
      echo ".sample.wsp -> .sample-${item}.pct"
      cat .sample-o.wsp \
        | extract-components \
            -f get-components.gawk \
            -v select=${item} \
        > .sample-${item}.pct
    end

  Testing the simple/nonsimple separation:

    cat .sample-o.wsp \
      | select-simple-words -v complex=0 \
      > .simple.wsp
    cat .sample-o.wsp \
      | select-simple-words -v complex=1 \
      > .nonsimple.wsp
    diff .sample-o.wsp .simple.wsp \
      | sed -e '/^[^<]/d' -e 's/< //' \
      > .notsimple.wsp
    diff .nonsimple.wsp .notsimple.wsp
    dicio-wc .sample-o.wsp .simple.wsp .nonsimple.wsp .notsimple.wsp

  Let's create the component item statistics per section:

    mkdir stats/words stats/labels
    foreach ctag ( ${comps} )
      analyze-components ${ctag}
    end

EXTRACTING COMPONENT PAIR STATISTICS

  We next gather statistics about pairs of components. Here are the
  pairs we consider:

    set pairs = ( k-w tc-y tf-z tw-w pm-c p-mcn c-ns mcn-s p-s pm-ns m-n )

    tag     word subset                left component        right component
    ------- -------------------------  --------------------  --------------------
    pm-c    simple, w/ core            crust+mantle prefix   core
    c-ns    simple, w/ core            core                  mantle+crust suffix
    p-mcn   simple, w/ core or mantle  crust prefix          complete mantle+core
    mcn-s   simple, w/ core or mantle  complete mantle+core  crust suffix
    pm-ns   simple, w/ core            crust+mantle prefix   mantle+crust suffix
    m-n     simple, w/ core            mantle prefix         mantle suffix
    p-s     simple, w/ core or mantle  crust prefix          crust suffix

  The following pairs are a bit special:

    tag     word subset                left component        right component
    ------- -------------------------  --------------------  --------------------
    tc-y    all simple                 type of component     coarse component
    tf-z    all simple                 type of component     fine component
    tw-w    all simple                 type of word          the word
    k-w     all non-simple             number of "peaks"     the word

  The `fine components' of a word are the crust prefix, mantle prefix,
  core, mantle suffix, and crust suffix, whose `types' are the letters
  "p", "m", "c", "n", "s", respectively. When the core is empty, the
  mantle components "m" and "n" cannot be separated, so they are
  extracted as a single component "mn". Similarly, when the core and
  mantle are both empty, the crust is a single component "ps".

  The `coarse components' are the crust+mantle prefix (type "pm"), the
  core ("c"), and the mantle+crust suffix ("ns"). In words without
  core, there is only one coarse component (type "pmns").

  The `type' of a word is the string of types of its non-empty
  components; thus "qoteedy" has fine type "pcns" (assuming that the
  "ee" are treated as mantle, not crust).

  In all cases, the output word has the element brace delimiters `{}'
  deleted. Empty elements are omitted. On the other hand, the words
  listed under the pair tags `tw-w' and `k-w' have their contiguous
  crust/mantle/core components marked off with `()' and `<>'.
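  The rules above for fine components and word types can be sketched
  in Python as follows. This is a hedged illustration, not the actual
  `get-components.gawk' logic, which is not shown here. It assumes the
  `.wsp' conventions seen in the samples (`()' around the core-mantle
  stretch, `<>' around the core) and handles simple words only. The
  function names and the split `{q}{o}(<{t}>{ee}){d}{y}' assumed for
  "qoteedy" are illustrative.

```python
import re

# ASSUMED layout of a simple split word, after deleting the braces:
#   crust-prefix ( mantle-prefix < core > mantle-suffix ) crust-suffix
_WSP = re.compile(
    r"^(?P<p>[^()<>]*)"
    r"(?:\((?P<m>[^()<>]*)(?:<(?P<c>[^()<>]*)>)?(?P<n>[^()<>]*)\))?"
    r"(?P<s>[^()<>]*)$"
)


def fine_components(wsp_word):
    """Return the non-empty fine components as (type, text) pairs."""
    word = wsp_word.replace("{", "").replace("}", "")
    match = _WSP.match(word)
    if match is None:
        raise ValueError("not a simple word: " + wsp_word)
    p, m, c, n, s = (match.group(k) or "" for k in "pmcns")
    if c:
        parts = [("p", p), ("m", m), ("c", c), ("n", n), ("s", s)]
    elif m or n:
        # Empty core: the two mantle halves cannot be told apart.
        parts = [("p", p), ("mn", m + n), ("s", s)]
    else:
        # Empty core and mantle: the crust is a single component.
        parts = [("ps", p + s)]
    return [(tag, text) for tag, text in parts if text]


def fine_type(wsp_word):
    """Concatenate the types of the non-empty fine components."""
    return "".join(tag for tag, _ in fine_components(wsp_word))


if __name__ == "__main__":
    print(fine_type("{q}{o}(<{t}>{ee}){d}{y}"))      # pcns
    print(fine_components("{a}{d}({ee}){o}{d}{y}"))
```

  A word with more than one `()' or `<>' group does not match the
  pattern and raises ValueError, mirroring the separate treatment of
  non-simple words.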
  Testing the pairs extractor:

    foreach item ( ${pairs} )
      echo ".sample.wsp -> .sample-${item}.pct"
      cat .sample-o.wsp \
        | extract-components \
            -f get-components.gawk \
            -v select=${item} \
        > .sample-${item}.pct
    end

  Let's create the pair counts:

    foreach ptag ( ${pairs} )
      analyze-pairs ${ptag}
    end

TESTS WITH "DISASSEMBLED" PLATFORM GALLOWS

  For the tests in this section, we considered "c", "h" (except in
  "ch" and "sh"), "i" (before gallows), and "e" (anywhere) as separate
  mantle elements. Moreover, we mapped those letters to "e". So a
  platform gallows letter like "cth" or "ith" was turned into three
  elements "{e}{t}{e}".

  NOTE: The tables in this section are still based on the word files
  generated prior to the 2000-10-05 remake. That version of the text
  allowed a few more words, e.g. words with "hh" or isolated "c". It
  is probably not worth redoing this analysis just because of a few
  tokens.

    foreach sec ( ${secs} )
      echo ${sec}
      cat data/words/${sec}.wds \
        | select-simple-words \
        | factor-words -f factor-text.gawk \
            -v hicsmash=1 -v esplit=1 \
        | split-words \
            -v omods=0 \
        | tr -d '{}' \
        > data/words/${sec}.wsp
    end

  Let's create the component statistics per section:

    foreach ctag ( ${comps} )
      analyze-components ${ctag}
    end

    foreach ptag ( ${pairs} )
      analyze-pairs ${ptag}
    end

  The following two tables may tell us whether the mantle is more
  attached to the prefix or to the core:

  Anomalies for p-mcn (crust prefix)×(mantle+core):

---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- - - - - - - - c - - k k - - t - - - p - - c s h T - k k c e - t t c e e p c - - c s h h e - O - k e c h e - t e c h k t - c h c s h h e e k e T k e e h e e t e e h e e e p h e h h e e e e e e ---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- o- +14 +5 +6 +4 +3 +2 . 
+10 +8 +8 +5 +3 -1 -1 +4 +2 +4 -6 -8 -9 -9 -10 -9 -12 -1 qo- +12 +8 +9 +8 +6 +4 +1 +9 +6 +6 +6 +4 +2 -2 +1 +2 +2 -9 -12 -10 -14 -11 -12 -10 +2 ---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- dy- +0 . +1 -1 +1 . . +3 +2 +2 +4 . . . . +1 . -4 -2 -2 -4 -1 . +2 -1 y- +10 +1 +2 +3 +3 . -2 +6 +4 +5 +4 +1 -6 -6 -1 . . . -1 . . +3 . -9 -9 ---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- q- -1 . . -1 -2 . +1 -2 -2 -1 -1 . +7 +6 +1 . . . -2 -7 -4 -1 . +2 +8 a- -1 . -2 -5 . . . +1 -1 . . . +6 +5 +5 . . -5 -1 -6 -1 -1 . +3 . so- -1 -2 -3 . +2 . . +2 +2 -1 +4 . +1 +3 . . . -5 +2 -6 -4 -1 . +2 +2 lo- -1 -2 -3 . -1 . . +4 . . . +1 +1 +1 . . . -4 -1 -3 -3 . . +3 +5 ---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- l- +7 +4 +4 +5 . +5 +3 -1 -1 . -6 -4 -7 -8 -3 -4 . +4 +2 +7 +5 +4 +3 -4 -6 ol- +6 +4 +5 +6 . . +3 -2 -1 -2 -3 -3 -6 -5 -5 -7 . +2 +1 +4 +5 +3 +2 -3 +1 al- +0 +3 . +1 -1 -1 -1 -4 -1 -2 -2 . . . . +2 -1 +2 +2 +1 +2 . . +1 +1 qol- +0 +1 -3 +3 -1 -1 . -4 -1 . -1 . . . . . -1 -2 +2 +4 +1 +2 +2 +1 -1 sol- -1 . +2 +1 . +1 . -3 -1 . -1 . . . . +1 . -2 -1 +1 . -1 . +2 . ---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- - +19 -3 -3 -4 -1 -3 -7 . -2 -3 . . . +5 . . . +6 +6 +4 +5 +2 +3 +4 -8 d- +3 -5 -2 -7 -5 -4 -3 -3 -6 -5 -4 -4 . . -4 . -3 +10 +10 +6 +9 +8 +7 +3 +6 s- +0 . -4 -3 -1 -1 -1 -4 -3 -2 -2 . . . . . -1 +6 . +5 +6 . +1 +3 +1 r- +0 -6 -3 -3 -1 -1 . -4 -1 -1 -1 . . . . . . +4 -1 +6 +5 +3 +4 +2 -1 or- -1 -4 -3 -5 +1 . +1 -3 -1 . . . +1 . . . . +1 . +2 . +2 +1 +3 . dal- -1 -3 . -5 -1 . +1 -3 . . . . +1 . . . . +4 +4 . -1 -1 . +3 . 
---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- TOT +43 +35 +31 +32 +27 +24 +21 +32 +28 +27 +27 +23 +23 +26 +24 +23 +24 +33 +29 +33 +31 +26 +25 +23 +22 Anomalies for pm-c (crust+mantle prefix)×(core): ----- ----- --- --- --- --- T O - - - - T k t p f ----- ----- --- --- --- --- o- +20 . +2 . -2 qo- +18 +2 +1 . -3 y- +13 . +2 . -2 - +19 -3 . +3 . ----- ----- --- --- --- --- e- +14 -5 +2 +2 . ----- ----- --- --- --- --- che- +12 . . . . cho- +9 -1 +1 . . ch- +7 . . +2 -2 she- +6 +1 . . -1 chee- +5 . +1 -3 +1 choe- +4 -2 . . +1 cheo- +1 . . +1 -1 ----- ----- --- --- --- --- shee- +4 . +1 . -2 sho- +2 . +1 . -1 sh- +1 +1 . -4 +2 shoe- +2 -3 . . +4 ----- ----- --- --- --- --- qoe- +3 +1 . . -2 oe- +4 -1 . +1 . qe- +0 -1 . . +1 ----- ----- --- --- --- --- l- +8 +3 -3 . . ol- +8 +2 -3 -1 +1 al- -1 +2 -7 . +4 qol- -2 +2 -4 . +1 ----- ----- --- --- --- --- dy- +0 -1 +1 -2 +1 ----- ----- --- --- --- --- TOT +42 +39 +37 +30 +25 Note the much lower anomalies for pairs "pm-c" than for "p-mcn". Same for suffixes: Anomalies for mcn-s (core+mantle)×(crust suffix): ------ ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- - - o d - d - - - a - o a a - - - T - d d i - - - - - o l i i a a - - - a O - d a a i - - - o o o o o d d i i i i a a a l T d y r l n s y o l r s d m y y n - n n r r l m y ------ ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- k- +9 -12 -16 -11 -10 -10 -12 -1 -3 +1 . -3 . -1 -2 +3 -3 +3 +14 +16 +10 +10 +11 +8 +8 t- +8 -10 -18 -9 -8 -8 -11 -1 -3 +2 +1 -2 -2 -4 -1 +4 +2 +4 +12 +12 +9 +10 +9 +8 +5 p- +1 -5 -12 . +1 +1 -4 -8 -9 . . -5 . +1 -6 +2 +3 +5 +8 -1 +9 +5 +4 +3 +3 ------ ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- chee- +2 . +2 +2 -2 +3 +10 +4 +4 -3 . +3 . -2 +1 . -2 . -4 -3 . -4 -2 -3 -1 shee- +1 +4 +4 . +3 -2 +7 +3 +1 -2 . -1 -2 -1 -3 . 
-1 +8 -5 . . -3 -4 -2 . ee- +1 . +1 +2 -2 +4 +8 -1 . -3 . . -2 . -3 . -2 . -3 -1 . . . +1 +1 che- +9 +2 +4 +4 +3 +4 +3 . +2 +1 +1 +2 -1 +1 +1 -3 . -10 -5 -2 -7 +2 . -1 -3 she- +7 +3 +6 -1 +2 +3 . . +1 +1 . +1 +2 -2 +1 -1 -2 +5 -2 -6 -5 . . -5 -4 keee- -1 +2 +1 -1 +1 . +6 +3 . -5 -8 +2 . +1 -1 +2 . . -2 . +2 -4 -3 . +2 kee- +5 +4 +7 +1 +3 -2 +4 +5 +4 . . +2 . -2 +3 -3 -1 -6 -3 -6 -4 . -2 -1 -4 tee- +3 +4 +6 +3 +2 . +5 +3 +2 -2 -6 +5 . -2 +3 . . -4 -5 -3 -1 -4 -1 . -1 ------ ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- te- +5 . +5 +5 +1 +4 -1 -2 . +1 -1 +5 . -1 +3 . +3 -5 -3 -2 -3 -1 . -3 -3 ke- +7 +1 +6 +3 +3 +1 -3 -1 . +2 +1 +1 +1 +2 +4 -1 -2 -7 -4 . -5 -2 . -1 -2 ------ ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ch- +11 -3 -2 . . . . -4 . +3 +3 +1 -1 +1 . . +4 -5 +2 +1 -5 +2 +1 +1 -2 sh- +7 -2 -3 -2 -2 -1 -4 -3 +6 +3 +3 -1 +1 -1 +1 +2 +4 +3 +1 . -4 +2 . . -2 kch- +4 +3 +1 . +1 -3 . . +3 +1 +3 -2 +4 +2 . -3 -5 . +1 . -2 . -1 -3 -1 tch- +4 . . -3 . . -1 . +1 +1 +3 . +1 . . -3 . +1 . -1 -2 . -1 +2 -1 pch- +1 . +1 +4 . +2 -4 -3 -3 . +1 -5 . . . . +4 . . -2 +1 +2 +2 -1 +1 pche- +1 . +5 +7 +5 +7 . -1 -2 . -3 . -2 -1 . . -1 -1 -3 -2 . . -6 -1 . kche- +0 +1 +6 -2 -1 . . +2 +2 -1 . -3 . . +1 . +3 +1 -3 . +1 -5 -5 -1 +1 tche- +0 +4 +5 +4 +1 . +1 +1 +2 -3 -3 . . . . . . -2 . . +1 -5 -5 -1 . ------ ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- cheke- +0 +4 -1 -1 . . . +5 -7 -4 -3 -2 . +2 -6 +1 . +4 +2 . +2 -3 +3 . +3 ------ ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ete- +3 -2 -7 -5 -4 -4 -2 . +1 +2 +4 -3 +2 +6 +2 . . +2 +3 +3 -1 +3 . . -1 eke- +0 -3 -5 -2 -1 -1 -2 +1 . +2 +1 +3 . +1 +1 +1 -1 +1 . . +1 -4 +1 +4 . 
------ ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- TOT +43 +23 +36 +22 +21 +21 +25 +38 +29 +33 +31 +24 +21 +19 +29 +18 +21 +24 +30 +28 +21 +30 +29 +24 +19 Anomalies for c-ns (core)×(mantle+crust suffix): --- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- - - - - - c - - - - e - e - e a - - - - - h c c c c T o e e e e i a a - - - - - - e e c e h h h h O d d d e e i i i a a a o o - e o o h d e d o o T y y y y y n n r r l m l r y y l r y y y y l r - --- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- k- +15 +2 +4 +4 +3 +2 +1 +6 -1 . +2 +1 -1 -1 +1 . . . -2 -5 -3 -3 -3 -3 -3 t- +13 +2 +4 +2 +1 . . +3 -1 . +1 +2 . . +1 . . . -1 -4 -2 -3 -2 -1 -3 p- +4 -1 -6 -1 -3 -1 . -7 . -1 -1 . +1 +1 -1 . . . +2 +6 +4 +3 +3 +2 +2 f- +0 -3 -2 -6 -1 . -1 -2 +2 . -1 -3 . +1 . . -1 . +1 +4 +1 +3 +2 +2 +4 --- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- TOT +42 +23 +29 +29 +30 +21 +30 +28 +21 +28 +28 +22 +26 +23 +30 +30 +25 +22 +26 +25 +24 +23 +22 +21 +22 Note again that the anomalies are much lower for "c-ns" than for "mcn-s". Considering the anomalies of "mcn-s" above, it seems that "ke-" and "te-" look quite unlike "k-" and "t-" to the following suffix. That may be a sign that "e" is an independent letter, akin to the tables. Also "ete-" ("cth-" or "ith-" in the original) looks quite different from "te-" and very similar to "ch-". Looking at the "p-s" and "pm-ns" tables: Anomalies (crust prefix)×(crust suffix): ---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- - - o - d d a - - - a a - T - - - i - - o a - - d i i a - O - d o o i a a d - i - o a - a i i i o T y y l r n r l y o n s - s m d r n n r d ---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- - +18 . -1 +2 +2 -1 . . +2 +1 -3 . +1 . 
-2 -1 . . +2 -3 -1 o- +15 -1 -1 . . +2 +2 +3 . . +1 . -1 . . -3 -1 -1 -1 . -1 qo- +13 . +1 +1 . +3 +2 +4 . . +4 -3 -1 -2 . . -2 -1 -5 . . y- +9 . -2 . +2 +1 +1 . +3 +2 -2 . -1 . . -2 -1 -1 . . . l- +6 . +3 . -1 +2 +1 +1 +1 +1 +1 +1 . -4 +1 +2 -1 -3 -5 -1 -1 ol- +5 +1 +2 . . +2 +1 +1 . . +3 +1 -2 -3 +1 +1 -3 . -4 +1 -2 d- +3 +1 . +3 +6 -4 -5 -1 . +3 -4 +1 . +3 -3 . +1 -2 +1 -3 . s- +0 . -2 +1 +1 +1 -2 . . +1 -2 . +1 +3 -2 -2 -1 +1 +2 . . q- -1 -1 -2 . +1 . +1 -1 +1 -3 -1 -2 . . +2 . +2 . +1 . . al- -1 -1 -1 -3 -4 . +1 -3 -4 -3 +3 . . . +3 +3 +3 +1 +2 +2 . r- -2 . . -2 -4 -5 -1 . -2 . -4 +3 . +1 . +2 +3 +1 +2 +1 +1 qol- -3 +1 +2 -4 -3 -1 -3 -3 -2 -3 +4 . +1 . . . +1 +2 +3 +2 +2 ---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- TOT +44 +38 +36 +33 +31 +31 +30 +29 +29 +29 +28 +25 +25 +24 +24 +24 +23 +22 +21 +21 +21 Anomalies (crust+mantle prefix)×(mantle+crust suffix): ----- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- - - - c - - - - a - e - h c c c - - e - - T i a - - - - - e c e h h h - e e o e e O i i a a a o o - d h d e d o e o o d e d T n n r l m l r y y y y y y l y l r y y y ----- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- l- +5 +5 +4 +3 . +1 . -6 -2 +5 -4 +5 +2 +3 -4 -8 -2 -6 . . +4 ol- +5 +4 +5 +1 +1 +1 -3 . . +4 -1 +4 . -1 -5 -5 -1 -5 -2 . +3 ----- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- o- +17 +2 +2 +2 +3 . +1 . -2 . -1 +1 . . . -6 . -2 -1 -2 +1 qo- +16 +2 +3 . +3 -1 +1 . -2 +3 . +1 -1 +1 -1 -5 -2 -3 -3 -1 +3 - +13 +1 . +1 . -2 +4 +2 -3 . . +4 +2 +2 +3 -7 . -3 -1 -4 . y- +10 +2 -2 +1 . . . +1 -2 . +1 +2 +1 +1 +1 -7 -1 -1 +1 -1 . ----- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- cho- +5 +2 . +1 +2 +2 +2 +3 +3 -6 +4 -1 +2 -1 +2 -4 . -5 -1 -4 -3 ch- +3 +4 +3 +3 +5 +1 +3 +1 +2 -7 +2 -3 -1 -4 +4 -3 -6 -1 -2 -1 . sh- +0 +1 -1 . +2 . +2 . +3 -1 +3 +2 . . . 
-6 -1 . +3 -4 -4 sho- +0 . -1 . +1 . +5 +1 +2 -5 +7 -1 +3 . . -2 -1 . -2 -4 -3 cheo- +0 . +3 -1 -1 +4 +1 +3 +2 -5 . -1 +2 . . -3 . . -1 -1 -1 ----- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- shee- +0 -4 -1 . -1 . . . +5 +1 +1 . -1 . . +7 -2 . -1 . -1 chee- +0 . . . -3 . -2 . +4 . -4 -1 . . . +8 . -1 . +2 . ----- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- e- +3 -5 -6 -3 -7 -4 -4 -4 -2 +4 -2 -3 -5 -2 -2 +8 +13 +13 +7 +6 +2 che- +5 . +1 +1 +1 -3 -3 -6 +5 +2 . -3 . -1 -4 +7 -1 . -2 +1 +3 qoe- +0 -4 -3 -1 -2 . . . -5 +2 -2 -1 -1 . . +4 +6 +3 +3 +4 . oe- +0 . -1 -3 -3 . -2 +1 -7 +3 -4 -1 -1 . +1 +4 +2 +5 +6 +1 -1 choe- -1 -4 -2 -4 -2 . -1 . . . -1 . . . +1 +7 +1 +5 . +2 . shoe- -2 -3 -1 -1 -1 +1 . +1 -5 -2 -2 . . +1 +2 +5 +2 +2 +1 +1 -3 she- +1 -5 -3 -1 +1 +1 -4 . +5 +1 +4 -1 -1 +1 . +7 -3 . -2 +2 -1 ----- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- TOT +42 +30 +28 +28 +28 +22 +26 +23 +30 +29 +26 +25 +24 +24 +22 +31 +25 +22 +23 +31 +29 We see that the anomalies for "p-s" are quite small, meaning that the core+mantle essentially decouples the crust prefix from the crust suffix. The anomalies for "pm-ns" are larger but still relatively modest, meaning that the core alone isolates the prefix from the suffix fairly effectively. Actually the main anomaly seems to be a general attraction between those prefixes that end with single "e-" and those suffixes that begin with "-e". Most of this effect is probably due to the splitting of "cth" and "ith" as "e-t-e". TESTS WITH INTACT PLATFORM GALLOWS Considering the observation about correlation between "e" in prefix and in suffix, we repeated the tests above with the gallows kept intact. Still, the "e"s were treated as separate elements. Also, preliminary tests showed that "k" and "t" behave pretty much alike, and the same mostly holds for "p" and "f", "ch" and "sh". 
So we map those pairs to the most popular member ("k", "p", and "ch", respectively), to save space in tables.

  Note: these tests were done WITH labels included in the database.

  Testing the factoring script:

    echo ".sample.wds -> .sample.els"
    cat .sample.wds \
      | factor-words -f factor-text.gawk \
          -v ktsmash=1 -v chshsmash=1 -v esplit=1 \
      | sort | uniq \
      > .sample.els

  OK, let's do it:

    mkdir stats/words
    foreach ptag ( ${pairs} )
      analyze-pairs -eelump ${ptag}
    end

ANALYSIS [ old data ]

  Anomalies for p-mc (crust prefix)×(mantle+core):

---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- - - - c c - - - - - - c - - h h - k - k k p - - c - - - c h c c - e c T k - c k e e c p c k c - e c - h c h h c c k O - e k h c e c - h c k h p e e h c e k o e h k h T k e e e h e h p e h h e h e e e h e h k k k h e ---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- qo- +10 +10 +10 +10 +8 +9 +4 +3 +3 +5 +4 +4 +4 -3 +4 . -9 -8 -9 -10 -8 -9 -8 -7 -8 o- +11 +9 +8 +10 +7 +8 +4 +3 +7 +7 +5 +2 +1 -2 +2 . -6 -4 -7 -9 -10 -9 -10 -10 -9 ---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- y- +7 +5 +6 +6 +5 +7 +2 . +3 +3 +3 -3 -3 -7 -6 -8 +2 +1 +5 -2 -6 -5 +1 -5 -5 l- +5 +4 +6 +4 +5 . +4 . -2 +1 -1 -8 -6 -5 -4 -5 +7 +4 +5 . -4 . -4 +2 -3 ol- +5 +3 +6 +4 +2 . +3 -2 -1 +3 -4 -6 -6 -1 +1 +2 +4 +2 +4 . -2 -3 -2 -3 -3 ---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- - +19 -3 -4 -3 . . -7 -9 . . . +4 +1 . -8 -7 +4 +5 +3 +5 +5 +4 +3 +2 +1 d- +2 -5 -7 -4 -2 -2 -3 -2 -5 -3 -1 +1 -3 -1 +7 +3 +8 +10 +9 +3 +2 . -1 -1 . s- +0 -3 -3 -6 -3 -4 -2 . . -2 -1 -1 . . +2 +2 +5 +4 +1 +3 +2 +3 . +2 +1 al- +0 +1 . -1 -3 -4 -2 . -1 . +1 -2 . +2 . . +1 +1 . . +1 +2 +2 . +1 r- +0 -6 -4 -3 -2 -3 -1 . -1 . -1 -1 . +1 -1 +1 +5 +3 +4 . +1 +4 +1 +1 +2 qol- +0 . +3 -3 -1 -3 . +1 . -2 -1 -1 . +1 -2 . +3 -1 +2 . 
+1 +1 +1 +1 +2 or- -1 -7 -7 -6 -2 . . +1 -1 -1 . . . +2 . +1 +1 . +1 +2 +2 +2 +2 +3 +3 dal- -1 -4 -6 -1 -1 -2 . +1 -1 . . . . +2 -1 . . +3 -3 +1 +2 +2 +2 +2 +3 ---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- q- +0 -1 -2 -3 -3 -5 . . +1 -2 -1 +6 +6 +1 +7 +6 -10 -3 -4 . +1 +1 +1 +1 +2 so- +0 -1 . . -2 +3 -1 . -1 -1 -1 +3 +3 +1 +1 +1 -8 -1 -4 +2 +2 +1 +1 +1 +2 dy- +0 . . +1 -2 +4 -1 . -1 -1 +2 -1 . +1 -1 +1 -4 -6 -3 . +2 +1 +2 +3 +2 sol- -1 -2 +1 . -1 -3 -1 . -1 -1 . -1 . +1 +1 . +1 -4 -3 . +2 +1 +3 +1 +2 a- -1 . -5 -4 -2 -2 . +1 +3 -1 . +6 +3 +7 -1 . -7 -7 -3 +1 +2 +2 +2 +2 +3 ---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- TOT +44 +37 +33 +33 +28 +30 +22 +18 +25 +25 +25 +27 +23 +20 +22 +19 +35 +35 +29 +25 +23 +23 +22 +21 +19

  We see strong dependencies between crust prefix and core+mantle.
  Essentially, the proper prefixes "o-", "qo-", "y-", "l-", and "ol-"
  are attracted to cores that begin with naked gallows, and repelled
  by cores that begin with "ch" and "sh".

  The strong attraction of "o-" and "qo-" for naked gallows suggests
  that "a" and "o" are modifiers for the following letters. However,
  the prefixes "sho-" and "cho-" do not seem particularly attracted to
  gallows (see below), and prefixes like "lo-" are fairly rare. These
  facts argue against "o" being a pre-modifier.

  On the other hand, there seems to be little dependency between the
  prefix and the tail of the core (minus the first letter).

  Interestingly, the platform gallows seem to behave intermediately
  between the naked gallows and the "ch"/"sh" elements. More
  curiously, the isolated "ee" and "eee" elements resemble platform
  gallows in that they are intermediate between naked gallows and
  "ch"/"sh".

  Anomalies for pm-c (crust+mantle prefix)×(core):

----- ----- --- --- --- --- --- --- --- - k - - - - c k T c c k h e O - - k p o o o T k p h h k k k ----- ----- --- --- --- --- --- --- --- ch- +8 . . 
+13 +6 -6 -5 -5 che- +7 +1 +3 +10 . -5 -4 -4 cho- +8 +1 . +6 +3 -2 -4 -4 cheo- +3 -1 -2 +5 +3 -1 -1 -1 ----- ----- --- --- --- --- --- --- --- - +15 . +4 +7 +4 -3 -3 -10 ----- ----- --- --- --- --- --- --- --- o- +13 +5 +5 . -3 -1 -5 . qo- +10 +8 +4 +3 -3 -2 -6 -4 y- +5 +7 +6 -2 -6 -2 . -2 ----- ----- --- --- --- --- --- --- --- l- +2 +7 +4 -7 -3 . . . ol- +2 +5 +3 -6 . -1 . . ----- ----- --- --- --- --- --- --- --- chee- +1 . . . -2 . . . chy- +0 -1 . -5 . +1 +2 +2 al- +0 -1 . -5 . +1 +2 +2 dy- +0 . -2 -4 . +2 +2 +2 chol- +0 -1 . -4 . +1 +2 +2 qol- +0 . -4 -4 . +2 +3 +3 d- +0 -3 -1 . -1 +1 +1 +2 e- +0 -3 . -4 . +3 +2 +2 qe- +0 -2 -2 -4 . +2 +3 +3 oe- +0 -2 -4 -2 . +2 +3 +3 sol- -1 -2 -4 -3 . +2 +3 +3 ----- ----- --- --- --- --- --- --- --- a- +1 -6 -2 +2 +4 . +1 +1 q- +0 -5 -3 +5 . +1 +1 +1 so- +0 -2 -6 +2 . +2 +2 +2 ----- ----- --- --- --- --- --- --- --- TOT +42 +41 +31 +32 +23 +15 +13 +13 The double-gallows cores are quite rare - 32, 20 and 20 cases, respectively, against ~17,000 for the single-gallows ones. While the "ch*-" prefixes show a moderate preference for platform gallows, overall the prefix and core seem quite independent. Comparing the tables for pm-c and p-mcn pairs, we can say that the main discontinuity lies at the mantle-core boundary. Anomalies for c-ns (core)×(mantle+crust suffix): ------ ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- - - - c - - - - - - a - - e - - h c c - e c c - - e T i e e e - - - a - c e h - h e o - h h a o e O - i e d d a a e i o h d e o d o d a o o i d e T y n y y y r l y n l y y y r y l y m l - r r y y ------ ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- k- +21 -3 +3 +4 +3 +7 . +2 -1 +4 -2 . . . -4 . . -1 -1 . -3 -1 -1 -4 . p- +6 -3 +4 -11 -11 -7 +2 +2 -12 -3 +1 +7 +13 +9 . 
+9 -7 -8 -2 +8 +5 +6 +4 -4 -5 ------ ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ckh- +6 +9 . +4 +7 -5 . +2 +11 -1 +5 -7 -9 -8 +5 -8 +4 +3 +1 -7 +2 -7 -7 +6 -4 cph- +1 +4 +2 +1 +5 -1 . . +6 -3 +4 -5 -3 -3 +3 -3 +2 +3 . -2 -4 -2 -1 . . ------ ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- kok- -2 -1 -3 . -2 +3 +1 -2 -3 . . +2 . . -2 . . . +1 . -1 . +1 +2 +4 kchok- -2 -1 -3 -1 . +2 -2 -2 -3 . -3 +3 . +1 -3 . . +3 . +1 . +2 +1 . +3 keok- -2 -3 -3 +3 . +1 -2 -2 +2 +3 -5 -1 . . . . . . . +1 . +1 +1 . +3 ------ ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- TOT +42 +32 +30 +30 +29 +29 +29 +28 +28 +28 +27 +27 +26 +25 +25 +24 +24 +23 +23 +22 +22 +22 +21 +21 +20 Apparently, cores with naked "k" and "t" gallows are fairly similar to cores with platform gallows, while the "p" and "f" gallows are quite aberrant. Except for "p" and "f" cores, there is a fairly good independence between the core and suffix. Anomalies for mcn-s (core+mantle)×(crust suffix): ------- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- - o - - d - d a - - - a - o - - - a T i a a - - a - - - i o - - l o d d i - O i i i a a l a o o i d o - o d l - a a i d - - T n n r r l y m l r n y d o - s y y y l r n y d s ------- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- k- +12 +13 +13 +9 +10 +10 +7 +7 +3 +3 . . . -2 +2 -2 +3 . -2 -11 -10 -11 -15 -13 -15 p- +3 +8 . +10 +6 +4 +1 +1 +2 +1 +2 -5 -1 -7 +5 -5 +2 +1 -9 +1 . +1 -11 -5 -6 chok- +2 +8 +6 +4 +5 +5 . +5 +1 . -2 -2 -3 . +6 -4 . . . -2 -3 -2 -13 -4 -3 chek- +0 +5 +8 . +5 +5 +1 +3 -5 -6 -1 . -2 -6 +9 -3 . . +3 -1 -2 -1 -8 -2 -2 chk- +1 +9 +8 +5 +6 +7 +5 +3 . -4 . -6 . 
-3 -2 -3 +2 -1 -1 -1 -2 -1 -12 -3 -4 ------- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ckh- +4 +2 +1 -3 +2 . -3 +1 +5 +5 -1 +3 +1 +3 +1 +1 . +3 +1 -5 -6 -5 -4 -3 -2 chckh- +0 +2 -1 . -4 +3 +2 +2 -4 -2 -1 -1 +3 -3 +5 -2 . . +6 -1 -2 -1 . +3 -2 checkh- -1 -2 . +3 -3 +1 +2 +3 -4 -5 . -3 . -2 . . +2 +1 +5 +1 . +1 -2 +1 -1 cph- +0 +5 . +2 +2 . +1 . +3 +1 . -1 . -3 -2 -1 +4 +1 -2 . -1 . -7 -1 -1 ------- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ch- +13 +1 . -6 +2 . -3 . +4 +5 +4 +1 -1 +4 -2 +1 . . -5 . . . . -2 -2 kch- +8 . -2 -6 . -1 -1 . +3 +5 -1 +1 +2 +4 +1 . -6 -3 . . . -1 +3 +3 -1 pch- +4 -1 -3 . +1 . -1 . +2 +3 +4 . . -1 -3 -6 +3 +3 -4 . +5 +2 +2 -1 -5 kech- -1 -2 . +2 -2 -3 +2 . -4 -4 . -4 +3 +1 +1 . +1 +2 . . . . +2 . +1 ------- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- che- +12 -5 -5 -10 +1 . -5 -3 +3 +2 -1 +2 . +3 . +3 -2 -5 . +3 +2 +4 +7 +3 +1 ke- +9 -5 -2 -8 -1 . -5 -3 +4 +2 . +5 +1 +1 -9 +4 -1 . -2 +3 +4 +3 +7 +1 -2 kche- +3 -4 -3 -2 -8 -8 -2 -5 . +1 +1 +4 +2 +5 -1 +1 -3 . +2 . +3 . +8 +5 +2 pche- +2 -5 -3 . . -7 -1 -2 +1 . -1 +3 -3 -1 -3 . . -1 -1 +5 +7 +7 +8 +2 . ckhe- +0 -4 -1 +1 -2 -5 . -2 +3 +1 -1 +4 +1 +1 -3 +3 . . +3 -1 -2 -1 +3 -1 +1 chckhe- -1 -2 . +3 -3 -1 +3 . -6 -5 +1 -1 . -2 . . +2 +1 +2 +1 . +1 +4 +1 -1 ------- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- kee- +7 -4 -9 -6 -1 -2 -6 -1 +1 . . +4 . +5 -9 +4 -3 -1 +4 +3 +3 . +9 +5 +4 chee- +4 -7 -4 -3 -4 -3 -3 -6 . +2 -2 +1 -2 +5 +4 +3 -2 -3 +3 +1 +2 +1 +6 +3 +9 ee- +1 -4 -2 . . . . . -2 +1 -1 -1 -3 +1 -1 . -1 . -2 -2 +2 +4 +3 +1 +8 keee- +0 -4 -1 +1 -5 -4 +1 -1 -4 -3 . . -1 +1 . +2 . . +2 +1 -1 . +4 +4 +7 eee- -1 -2 . +2 -4 -3 +2 . -8 -5 +2 -4 +1 -3 -1 +2 +1 +2 -1 . +1 . 
+2 +2 +12 ------- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- TOT +44 +31 +28 +21 +30 +29 +20 +24 +33 +31 +21 +29 +21 +29 +25 +24 +19 +19 +38 +21 +23 +22 +36 +24 +25 Apparently, core-mantles ending in "e" are all fairly similar, and distinct from core-mantles that end with naked gallows. Core-mantles ending with "ch" and "sh", as well as platform gallows, are similar to each other, and roughly halfway between "e"-terminated core-mantles and naked gallows (actually closer to the former than to the latter). Core-mantles ending with "e" are fairly independent of the suffix. Core-mantles ending with naked gallows are more selective - they strongly attract suffixes starting with "a", and strongly repel suffixes starting with a dealer. The platform gallows, too, seem intermediate between the "ch" elements and naked gallows. Comparing the c-ns and mcn-s tables, we again conclude that the main break is at the core-mantle boundary. Anomalies for pm-ns (crust+mantle prefix)×(mantle+crust suffix): ----- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- - - - - - - c - - - a - - - - - - - e c c c h e e c T - i a a - - - - - o e e c e o h h h e e e h O - e i i i a a a o o d e d h o d o o d d d e e T - y y n n r r l m l r y y y y l y l r y y y y y ----- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- o- +17 -4 -2 -1 +2 . . +2 +3 . +1 . -2 +1 +2 -1 +1 . . . . . +1 -2 . qo- +16 -4 -2 . +2 +2 -3 +1 +3 -2 +2 . -3 +3 +3 . . -2 -1 -2 . +1 +4 -2 . - +15 -2 . . . -2 -1 +1 . -4 +4 +3 . -1 . . +1 . +1 +2 . +2 . -4 . y- +10 -5 -3 -1 +1 -3 . +1 . -1 . . -1 +2 . +1 . +2 . +1 . +1 +1 . . l- +5 -6 -2 -3 +4 +3 . +3 . . . -7 -4 +4 +4 -5 -1 . -5 -4 +2 +5 +6 +2 +2 ol- +5 -6 -1 . +3 +4 +1 +1 +1 . -3 -1 -6 +4 +4 -1 .
-2 -6 -4 -1 +4 +5 +2 +1 ----- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ch- +6 . +9 +6 +2 . -1 +1 +4 . +2 +1 . . +3 +2 -4 . . -3 -8 -4 -5 -5 -4 che- +5 +7 +9 +5 . . -5 +2 +3 . . -5 +2 . +2 +3 -5 -6 . -5 . -1 -3 -5 +1 cheo- +1 +4 +5 +4 -1 +1 -1 . -1 +2 . . -1 +2 -1 . +2 -2 -1 -1 -2 -1 -3 -1 . chee- +0 +7 +6 +4 -1 . . +1 -2 . -1 -1 . +3 -3 +1 -1 . -1 . -2 -2 -4 . . cho- +7 +3 +4 +3 . -2 -3 +1 +1 . +3 +2 . . -2 +4 . -1 . -1 -2 -3 -6 -1 . a- -1 +1 +4 . -2 -2 +2 -1 . +1 -1 . +1 . . -1 . . . +1 . . -2 +1 . ----- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- al- -1 . -3 -2 +1 +4 +1 +1 -1 +3 -1 . . -3 -2 -3 +1 . . . +2 -1 +2 . -1 dy- -1 . . . . . . -3 -1 . -1 . . -1 +2 +4 . . . +2 +2 -1 . . -1 chy- -1 +1 +2 . -1 -3 . . -2 . -1 . . -2 -3 +5 . +3 . +1 . . -1 . . chol- -1 . -1 -4 +1 . +1 . +2 . -1 . . -4 -2 . . . . +2 . . +2 +4 -1 q- -1 . -3 +1 -3 -1 . -1 -2 +2 +5 +1 . -2 . -3 +1 +3 . +1 . . -2 +2 . qol- -1 +1 -2 -2 -1 +3 . -3 -2 . -1 . . +3 -2 -1 . . . +1 . -1 +5 +2 . d- -1 . -1 -3 -4 -1 +2 -3 . . . +1 +2 -2 . +2 . . +2 +1 . . -2 +1 +2 e- -1 +1 . -1 . -2 +1 . -1 . -1 . +1 -2 -1 . . . +1 +1 . +1 . +1 . qe- -1 +1 -1 -3 -1 -1 . . +1 . -1 +3 . -2 -1 -3 +1 . . +1 +5 . -2 +1 . ----- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- so- -1 . -4 +2 -1 -3 . -1 . . . . +4 . -2 . +1 . +2 +2 +1 -1 +2 . . oe- -1 . -4 . . -1 +1 -1 -1 . -1 +1 +1 -2 -3 -1 . +5 +2 +1 +1 . . +1 . sol- -1 . -6 -1 . +2 +1 -3 -1 +1 -1 . +1 . +1 -1 . . . +1 . . +4 +1 . ----- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- TOT +42 +22 +32 +28 +30 +28 +21 +29 +28 +23 +27 +25 +21 +30 +29 +27 +24 +23 +22 +22 +24 +26 +29 +20 +25 There are only weak dependencies between the crust+mantle prefixes and suffixes.
There is a slight attraction between "ch"-containing prefixes and the short suffixes "-", "-y", "-ey"; and a slight repulsion between the `standard' prefixes "o-", "-", "qo-", "l-", "ol-" and "y-" and the empty suffix "-". Anomalies for p-s (crust prefix)×(crust suffix): ---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- - - o - d d - a - - - a a - - - o - T - - - i - - o a - - d i i a - d a l o O - d o o i a a d - i - o a - a i i i o a l d l T y y l r n r l y o n s - s m d r n n r d l y y y ---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- - +19 +1 . +5 +4 . +1 . +3 +3 -2 . +1 . -2 -1 . . +1 -4 -1 -1 -4 -2 -1 o- +16 . . +2 +1 +3 +3 +4 +1 . +2 -1 -1 . . -3 -1 -2 -2 . -2 -1 . -1 -1 qo- +13 +1 +3 +3 +2 +5 +3 +5 +1 . +5 -2 . -2 . . -1 -1 -5 . . -3 . -8 -6 y- +10 . . +2 +3 +2 +2 . +4 +3 -1 -1 -1 . . -2 -1 -1 . . . -2 -1 . -3 l- +6 +1 +5 +1 . +4 +3 +2 +2 +2 +2 +1 . -4 +2 +2 -1 -3 -6 -1 -2 . -3 -4 -6 ol- +5 +2 +3 +1 +1 +3 +2 +2 . +1 +4 +1 -2 -3 +1 +1 -3 . -5 +1 -2 -2 -2 -3 -1 d- +4 +1 +2 +4 +7 -3 -5 . +1 +4 -4 . . +3 -3 . . -3 . -3 . . -3 . . s- +0 . . +3 +2 +1 -2 . . +2 -1 . . +2 -2 -3 -1 . +1 -1 -1 . . . . q- +0 -1 -1 +2 +1 . +2 -1 +2 -3 -1 -3 . . +2 -1 +2 . . . . . . +1 . al- +0 . . -1 -3 . +2 -2 -3 -2 +4 -1 . -1 +3 +2 +2 . . +1 . . . . . r- +0 . +1 . -4 -5 -1 +1 -1 +1 -3 +2 . . -1 +1 +2 . +1 . . . . +1 +1 qol- -1 +1 +2 -3 -3 . -3 -2 -2 -2 +4 -1 . . . -1 . +1 +2 . +1 +1 +1 +2 +2 so- -1 -1 -4 . -2 . -2 -1 . -1 -3 +4 +2 . -1 . . +2 +1 . +2 . +1 +1 +3 dy- -1 . -1 -5 . . -2 . -2 -2 . . . . . -1 . +1 +1 . +1 +1 +3 +2 +3 sol- -1 -3 . -3 -3 . -3 -1 -2 -2 +2 . . . . +4 . +1 +2 +1 +1 +1 +1 +2 +2 a- -2 +1 -7 -4 -3 -2 -1 . -1 -2 -2 . +3 . +1 . . +1 +2 +2 +1 +1 +2 +2 +2 or- -1 -1 -4 -4 . -2 . -2 -2 . -2 . +1 +1 . . +1 +1 +2 +1 +1 +2 +1 +3 +2 dal- -2 -2 . -4 -1 -4 +1 -2 -2 -2 -2 . . . . +2 .
+1 +2 +1 +1 +1 +2 +2 +2 ---- ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- TOT +44 +38 +36 +33 +31 +31 +30 +29 +29 +29 +28 +25 +25 +24 +24 +24 +23 +22 +21 +21 +21 +21 +20 +19 +19 These anomalies are all very low, confirming the independence between the prefixes and suffixes. ANALYSIS OF UNCOLLAPSED COMPONENTS Now let's analyze the various pieces together, without modifications. The goal now is to obtain component lists for the probabilistic grammar. Testing the factoring script: echo ".sample.wds -> .sample.els" cat .sample.wds \ | factor-words -f factor-text.gawk \ -v esplit=1 \ > .sample.els OK, let's do it: mkdir stats/words foreach ptag ( ${pairs} ) analyze-pairs -esplit ${ptag} end STATISTICS OF WORDS BY TYPE The pairs tw-w are the word itself preceded by its `type' and a dash. The `type' contains one letter among "p", "m", "c", "n", "s" for each of the crust prefix, mantle prefix, core, mantle suffix, and crust suffix, respectively, that is non-empty (or could be, if ambiguous). foreach ptag ( tw-w ) analyze-pairs -esplit ${ptag} end CHECKING THE EQUIVALENCE OF GALLOWS Gather counts for words containing each gallows letter, with that gallows replaced by "-". (Exclude words with two or more gallows.) foreach f ( t p k f ) echo .gal-${f}.cts cat stats/words/pmcns/tot.frq \ | egrep -v '[ktpf].*[ktpf]' \ | gawk ' \ ($3 ~/['"$f"']/){ \ gsub(/[ktpf]/,"-",$3); \ gsub(/[{}]/,"",$3); \ print $1,$3; \ } ' \ > .gal-${f}.cts end Gather counts for words with no gallows, and insert "-" in all possible positions. (Caution: the total count of this file will be wrong --- add only lines with "-" in first position.)
cat stats/words/pmcns/tot.frq \ | egrep -v '[ktpf]' \ | gawk ' \ /./{ \ gsub(/[{}]/,"",$3); \ w = $3; n = length(w); \ for (i=0; i <= n; i++) \ { print $1, (substr(w,1,i) "-" substr(w,i+1)); } \ } ' \ > .gal-z.cts Join the files and plot them: join-counts .gal-{t,p,k,f,z}.cts > .gal.mct plot-gallows-freqs .gal .gal Count consistency: cat stats/words/pmcns/tot.frq \ | egrep -v '[ktpf]' \ | totalize-fields 16548 0 cat stats/words/pmcns/tot.frq \ | totalize-fields 33352 1 Listing the words with most essential gallows: foreach which ( tk pf ) cat .gal.mct \ | gawk ' \ /./{ \ tk = $1+$3; pf = $2+$4; z = $5; \ s = '"${which}"' \ if (s < 5){next} \ printf "%+7.4f %s\n", (log(1+z)-log(1+s))/log(10), $0; \ } ' \ | sort -b +0 -1g \ > .gal-${which}.mcl end
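The ranking step above can be illustrated with a small stand-alone sketch in plain sh and awk. It is only an approximation of the csh loop: the function name gallows_score, the column layout "t p k f z pattern" (inferred from the order of the files given to join-counts), and the sample rows are all assumptions made for illustration, not output of the real scripts. The score is log10((1+z)/(1+s)), where s is the combined count for the chosen gallows pair and z is the count of the gallows-free pattern, so strongly negative scores mark patterns that occur mostly with a gallows.

```sh
#!/bin/sh
# Sketch only. Assumed input columns: t p k f z pattern
# (counts for each gallows letter, count without gallows, word pattern).

# gallows_score WHICH: WHICH is "tk" or "pf"; rows are read from stdin.
gallows_score() {
  awk -v which="$1" '
    NF >= 6 {
      tk = $1 + $3; pf = $2 + $4; z = $5;
      s = (which == "tk") ? tk : pf;
      if (s < 5) next;   # too few gallows cases to say anything
      # log10 of (1+z)/(1+s); the +1 avoids taking log of zero
      printf "%+7.4f %s\n", (log(1 + z) - log(1 + s)) / log(10), $0;
    }
  ' | sort -k1,1g        # most gallows-dependent patterns first
}

# Hypothetical rows, not taken from the real .gal.mct file.
gallows_score tk <<'EOF'
99 0 0 0 9 o-edy
2 0 1 0 50 ch-y
4 0 5 0 99 qo-aiin
EOF
```

In this sketch the second row is dropped because its t+k total is below 5, and the remaining rows are listed in ascending score order. The key option `-k1,1g` is used in place of the older `+0 -1g` syntax, which recent versions of sort may reject.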