It occurred to me that the labels should tell us a lot about valid
word prefixes and suffixes, since their word boundaries should be more
reliable than those defined by spaces in the manuscript. Another
possible source for that information is the words in line-initial and
line-final position.

Quick test:

  cat Note-010/labtit.evt \
    | egrep -v '\.T[0-9]*\.[0-9].*>' \
    > .labels-eva.evt

Edited .labels-eva.evt, removing some non-labels and garbage,
producing .labels-s-eva.evt.

  extract-words-from-interlin \
    -chars 'aoeilmnrchtpkfsqjdvxyg' \
    .labels-s-eva.evt \
    .labels-s-eva

   lines   words     bytes file
  ------ ------- --------- ------------
     327     745      3485 .labels-s-eva.txt
     754     754      3503 .labels-s-eva.wds
     357     357      2500 .labels-s-eva.dic
     410     410      2760 .labels-s-eva-gut.wds
     344     344      2419 .labels-s-eva-gut.dic
     334     334       668 .labels-s-eva-fun.wds
       3       3         6 .labels-s-eva-fun.dic
      10      10        75 .labels-s-eva-bad.wds
      10      10        75 .labels-s-eva-bad.dic

Sample from .labels-s-eva.txt:

  otaik dak alak =
  otaldy =
  otoky =
  seeyar =
  ykas asy =
  sosainr =
  oteey dar =
  ytodaiir =

Digraph counts:

        TT   .   /   =   a   o   e   i   l   m   n   r   c   h   t   p   k   f   s   q   j   d   y   g   ?   -
      ---- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
  .     86   .   .   .  32  13   .   .   .   .   .   .  11   .   2   .   .   .   9   .   .  12   4   2   1   .
  /      3   .   .   .   .   2   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   1   .   .   .
  =    324   .   .   .   4 217   2   1   .   .   .   1  26   .   1   .   2   .  27   2   .  22  18   .   1   .
  a    327   .   .   4   .   1   1  51 105  28   5 107   .   .   1   .   6   .   5   .   4   4   2   1   2   .
  o    424   2   .   5  10   1   8   4  74   2   1  46  13   . 105  14  69  11  21   .   2  32   2   .   2   .
  e    130   .   .   .   8  39  32   1   .   1   .   2   1   .   5   3   4   2   6   .   .   5  19   1   1   .
  i    107   .   .   .   .   .   .  39   1   .  42  20   .   .   1   .   2   1   .   .   .   .   .   .   1   .
  l    184  14   .  45  31   9   6   3   .   .   .   1  10   .   .   .   5   .  14   .   .  19  22   4   .   1
  m     32   2   .  28   .   .   .   .   .   .   .   .   1   .   .   .   .   .   .   .   .   .   .   1   .   .
  n     48  10   1  27   3   .   .   .   .   .   .   1   .   .   .   .   .   .   .   .   1   1   3   .   .   1
  r    180  27   .  56  47  11   2   2   .   .   .   .   7   .   .   .   .   .   2   .   .   1  21   2   1   1
  c    109   .   .   .   .   .   .   .   .   .   .   .   .  95   4   4   4   2   .   .   .   .   .   .   .   .
  h    138   .   .   1  17  41  36   .   .   .   .   1   2   .   1   .   .   1   3   .   1  18  15   .   1   .
  t    129   1   .   .  51  31  21   .   .   .   .   1   9   4   .   .   .   .   1   .   .   1   8   .   1   .
  p     23   .   .   .   5   4   1   .   .   .   .   .   8   4   .   .   .   .   1   .   .   .   .   .   .   .
  k    104   2   .   1  37  22  17   .   2   .   .   .   8   4   .   .   .   .   1   .   .   .   9   .   1   .
  f     18   1   .   1   8   2   .   .   .   .   .   .   3   2   .   .   .   .   .   .   .   .   1   .   .   .
  s    102   8   .  12  21  14   3   3   .   .   .   .   1  29   .   .   1   .   .   .   .   1   7   .   .   2
  q      2   .   .   .   .   1   .   .   .   .   .   .   .   .   .   .   1   .   .   .   .   .   .   .   .   .
  j      8   1   .   6   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   1   .   .   .
  d    130   1   .   7  49  13   .   3   .   1   .   .   7   .   .   .   .   .   1   .   .   .  48   .   .   .
  y    192  17   1 127   1   1   .   .   .   .   .   .   1   .   8   2  10   1   7   .   .  13   1   .   .   2
  g     11   .   .   3   2   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   6   .   .   .
  ?     17   .   .   2   1   1   1   .   2   .   .   .   .   .   1   .   .   .   1   .   .   .   3   .   5   .
  -      7   .   .   .   .   1   .   .   .   .   .   .   1   .   .   .   .   .   3   .   .   1   1   .   .   .
      ---- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
  TOT 2835  86   3 324 327 424 130 107 184  32  48 180 109 138 129  23 104  18 102   2   8 130 192  11  17   7

Next-symbol probability (× 99):

      TT  .  /  =  a  o  e  i  l  m  n  r  c  h  t  p  k  f  s  q  j  d  y  g  ?  -
      -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
  .   99  .  .  . 37 15  .  .  .  .  .  . 13  .  2  .  .  . 10  .  . 14  5  2  1  .
  /   99  .  .  .  . 66  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 33  .  .  .
  =   99  .  .  .  1 66  1  .  .  .  .  .  8  .  .  .  1  .  8  1  .  7  6  .  .  .
  a   99  .  .  1  .  .  . 15 32  8  2 32  .  .  .  .  2  .  2  .  1  1  1  .  1  .
  o   99  .  .  1  2  .  2  1 17  .  . 11  3  . 25  3 16  3  5  .  .  7  .  .  .  .
  e   99  .  .  .  6 30 24  1  .  1  .  2  1  .  4  2  3  2  5  .  .  4 14  1  1  .
  i   99  .  .  .  .  .  . 36  1  . 39 19  .  .  1  .  2  1  .  .  .  .  .  .  1  .
  l   99  8  . 24 17  5  3  2  .  .  .  1  5  .  .  .  3  .  8  .  . 10 12  2  .  1
  m   99  6  . 87  .  .  .  .  .  .  .  .  3  .  .  .  .  .  .  .  .  .  .  3  .  .
  n   99 21  2 56  6  .  .  .  .  .  .  2  .  .  .  .  .  .  .  .  2  2  6  .  .  2
  r   99 15  . 31 26  6  1  1  .  .  .  .  4  .  .  .  .  .  1  .  .  1 12  1  1  1
  c   99  .  .  .  .  .  .  .  .  .  .  .  . 86  4  4  4  2  .  .  .  .  .  .  .  .
  h   99  .  .  1 12 29 26  .  .  .  .  1  1  .  1  .  .  1  2  .  1 13 11  .  1  .
  t   99  1  .  . 39 24 16  .  .  .  .  1  7  3  .  .  .  .  1  .  .  1  6  .  1  .
  p   99  .  .  . 22 17  4  .  .  .  .  . 34 17  .  .  .  .  4  .  .  .  .  .  .  .
  k   99  2  .  1 35 21 16  .  2  .  .  .  8  4  .  .  .  .  1  .  .  .  9  .  1  .
  f   99  6  .  6 44 11  .  .  .  .  .  . 17 11  .  .  .  .  .  .  .  .  6  .  .  .
  s   99  8  . 12 20 14  3  3  .  .  .  .  1 28  .  .  1  .  .  .  .  1  7  .  .  2
  q   99  .  .  .  . 50  .  .  .  .  .  .  .  .  .  . 50  .  .  .  .  .  .  .  .  .
  j   99 12  . 74  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 12  .  .  .
  d   99  1  .  5 37 10  .  2  .  1  .  .  5  .  .  .  .  .  1  .  .  . 37  .  .  .
  y   99  9  1 65  1  1  .  .  .  .  .  .  1  .  4  1  5  1  4  .  .  7  1  .  .  1
  g   99  .  . 27 18  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 54  .  .  .
  ?   99  .  . 12  6  6  6  . 12  .  .  .  .  .  6  .  .  .  6  .  .  . 17  . 29  .
  -   99  .  .  .  . 14  .  .  .  .  .  . 14  .  .  .  .  . 42  .  . 14 14  .  .  .
      -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
  TOT 99  3  0 11 11 15  5  4  6  1  2  6  4  5  5  1  4  1  4  0  0  5  7  0  1  0

Previous-symbol probability (× 99):

      TT  .  /  =  a  o  e  i  l  m  n  r  c  h  t  p  k  f  s  q  j  d  y  g  ?  -
      -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
  .    3  .  .  . 10  3  .  .  .  .  .  . 10  .  2  .  .  .  9  .  .  9  2 18  6  .
  /    0  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .
  =   11  .  .  .  1 51  2  1  .  .  .  1 24  .  1  .  2  . 26 99  . 17  9  .  6  .
  a   11  .  .  1  .  .  1 47 56 87 10 59  .  .  1  .  6  .  5  . 50  3  1  9 12  .
  o   15  2  .  2  3  .  6  4 40  6  2 25 12  . 81 60 66 61 20  . 25 24  1  . 12  .
  e    5  .  .  .  2  9 24  1  .  3  .  1  1  .  4 13  4 11  6  .  .  4 10  9  6  .
  i    4  .  .  .  .  .  . 36  1  . 87 11  .  .  1  .  2  6  .  .  .  .  .  .  6  .
  l    6 16  . 14  9  2  5  3  .  .  .  1  9  .  .  .  5  . 14  .  . 14 11 36  . 14
  m    1  2  .  9  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  9  .  .
  n    2 12 33  8  1  .  .  .  .  .  .  1  .  .  .  .  .  .  .  . 12  1  2  .  . 14
  r    6 31  . 17 14  3  2  2  .  .  .  .  6  .  .  .  .  .  2  .  .  1 11 18  6 14
  c    4  .  .  .  .  .  .  .  .  .  .  .  . 68  3 17  4 11  .  .  .  .  .  .  .  .
  h    5  .  .  .  5 10 27  .  .  .  .  1  2  .  1  .  .  6  3  . 12 14  8  .  6  .
  t    5  1  .  . 15  7 16  .  .  .  .  1  8  3  .  .  .  .  1  .  .  1  4  .  6  .
  p    1  .  .  .  2  1  1  .  .  .  .  .  7  3  .  .  .  .  1  .  .  .  .  .  .  .
  k    4  2  .  . 11  5 13  .  1  .  .  .  7  3  .  .  .  .  1  .  .  .  5  .  6  .
  f    1  1  .  .  2  .  .  .  .  .  .  .  3  1  .  .  .  .  .  .  .  .  1  .  .  .
  s    4  9  .  4  6  3  2  3  .  .  .  .  1 21  .  .  1  .  .  .  .  1  4  .  . 28
  q    0  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .
  j    0  1  .  2  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .
  d    5  1  .  2 15  3  .  3  .  3  .  .  6  .  .  .  .  .  1  .  .  . 25  .  .  .
  y    7 20 33 39  .  .  .  .  .  .  .  .  1  .  6  9 10  6  7  .  . 10  1  .  . 28
  g    0  .  .  1  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  3  .  .  .
  ?    1  .  .  1  .  .  1  .  1  .  .  .  .  .  1  .  .  .  1  .  .  .  2  . 29  .
  -    0  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  3  .  .  1  1  .  .  .
      -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
  TOT 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99

  Symbol entropy: 3.995
  Next-symbol entropy: 2.452

Splitting prefix/midfix/suffix:

  cat .labels-s-eva-gut.wds \
    | sed \
        -e 's/sh/X/g' \
        -e 's/$/}/' \
        -e 's/^/{/' \
        -e 's/{\([qoaydirslmngj][qoaydirslmngj]*\)/\1{/' \
        -e 's/\([qoaydirslmngj][qoaydirslmngj]*\)}/}\1/' \
        -e 's/X/sh/g' \
        -e 's/{}/\./' \
        -e 's/\.//g' \
        -e 's/{/- -/' \
        -e 's/}/- -/' \
    > .labels-s-eva.fwd

  cat .labels-s-eva.fwd \
    | grep -v -e '- -' \
    > .labels-s-unifs-all.wds

  cat .labels-s-eva.fwd \
    | grep -e '- -' \
    | gawk '/./ {print $1}' \
    > .labels-s-prefs-all.wds

  cat .labels-s-eva.fwd \
    | grep -e '- -' \
    | gawk '/./ {print $2}' \
    > .labels-s-midfs-all.wds

  cat .labels-s-eva.fwd \
    | grep -e '- -' \
    | gawk '/./ {print $3}' \
    > .labels-s-suffs-all.wds

  dicio-wc .labels-s-{prefs,midfs,suffs,unifs}-all.wds

   lines   words     bytes file
  ------ ------- --------- ------------
     312     312       940 .labels-s-prefs-all.wds
     312     312      1673 .labels-s-midfs-all.wds
     312     312      1488 .labels-s-suffs-all.wds
      98      98       531 .labels-s-unifs-all.wds

  foreach f ( prefs midfs suffs unifs )
    cat .labels-s-${f}-all.wds \
      | sort | uniq -c | expand | sort +0 -1nr \
      > .labels-s-${f}-all.frq
  end

  dicio-wc .labels-s-{prefs,midfs,suffs,unifs}-all.frq

   lines   words     bytes file
  ------ ------- --------- ------------
      32      64       396 .labels-s-prefs-all.frq
      87     174      1317 .labels-s-midfs-all.frq
     118     236      1621 .labels-s-suffs-all.frq
      79     158      1095 .labels-s-unifs-all.frq

  pr -m -w 64 -e -t \
      .labels-s-{prefs,midfs,suffs,unifs}-all.frq \
    | expand \
    > .labels-s-joint-all.frq

  freq prefix    freq midfix    freq suffix    freq unifix
  ---- --------  ---- --------  ---- --------  ---- --------
   194 o-          66 -t-         41 -y           6 am
    54 -           55 -k-         19 -ol          6 ar
    19 y-          25 -ch-        17 -ar          3 ary
     8 ol-         13 -che-       16 -al          2 dy
     4 d-          13 -te-        14 -or          2 gy
     3 dy-          8 -sh-        11 -ody         2 odor
     2 a-           7 -f-         10 -dy          2 sal
     2 da-          7 -tch-        9 -aly         2 sar
     2 dar-         6 -ke-         7 -            2 sary
     2 so-          6 -kee-        6 -os          2 siiir
     1 adair-       6 -pch-        5 -alar        1 aiin
     1 al-          5 -p-          4 -aiin        1 ainaly
     1 ala-         5 -she-        4 -ain         1 ainam
     1 alam-        4 -kch-        4 -air         1 airar
     1 ali-         3 -tok-        4 -aram        1 al
     1 arar-        3 -tolch-      4 -ary         1 alols
     1 aro-         2 -chet-       4 -dal         1 aly
     1 do-          2 -cph-        4 -oldy        1 araly
     1 dol-         2 -cth-        4 -orain       1 arar
     1 il-          2 -ee-         3 -am          1 araydy
     1 oal-         2 -fch-        3 -o           1 arody
     1 oar-         2 -pche-       3 -odar        1 asy
     1 or-          2 -talsh-      3 -oly         1 daiin
     1 oyd-         2 -tare-       3 -r           1 daiindy
     1 q-           2 -tee-        3 -s           1 dainy
     1 qo-          1 -cfh-        2 -alaiin      1 dal
     1 s-           1 -chckhe      2 -aldy        1 dalary
     1 siiir-       1 -chee-       2 -alody       1 daliir
     1 soi-         1 -cheeee      2 -an          1 dalsy
     1 sol-         1 -chek-       2 -araiin      1 dan
     1 yd-          1 -cheoct      2 -aral        1 dar
     1 yy-          1 -chep-       2 -as          1 daramga
                    1 -chete-      2 -d           1 dararai
                    1 -chf-        2 -oaiin       1 dariiir
                    1 -choee-      2 -oaly        1 dary
                    1 -chof-       2 -olar        1 diin
                    1 -chok-       2 -ols         1 dolaj
                    1 -cholsh      2 -om          1 dolaram
                    1 -chosar      2 -yd          1 dolary
                    1 -chotee      1 -aday        1 dolory
                    1 -ckhe-       1 -ainy        1 doly
                    1 -cphe-       1 -airdy       1 dydarii
                    1 -e-          1 -airy        1 oaiin
                    1 -eep-        1 -aj          1 odaiin
                    1 -ekeee-      1 -ala         1 odaiir
                    1 -eoe-        1 -alain       1 odiiir
                    1 -eolale      1 -alal        1 odory
                    1 -et-         1 -alalg       1 ody
                    1 -faef-       1 -alaly       1 oin
                    1 -fche-       1 -alam        1 olaran
                    1 -fysk-       1 -ald         1 olaras
                    1 -karch-      1 -aldar       1 oldam
                    1 -kche-       1 -aldm        1 oldar
                    1 -kchoch      1 -aldo        1 onary
                    1 -kchsh-      1 -algar       1 oral
                    1 -keech-      1 -aloiir      1 orald
                    1 -keee-       1 -alrar       1 oram
                    1 -keeep-      1 -alsain      1 orar
                    1 -kocfh-      1 -alsy        1 oraraly
                    1 -kocth-      1 -alyd        1 oroj
                    1 -koee-       1 -any         1 orol
                    1 -kolsh-      1 -ao          1 osaro
                    1 -kshdch      1 -aralar      1 salal
                    1 -kydse-      1 -araldy      1 saldam
                    1 -pee-        1 -aralgy      1 saloiin
                    1 -pocph-      1 -arar        1 salols
                    1 -psh-        1 -aro         1 soaiin
                    1 -shch-       1 -dagy        1 sodar
                    1 -shockh      1 -daiir       1 solsy
                    1 -sholsh      1 -dajy        1 soly
                    1 -taik-       1 -dar         1 sorala
                    1 -tak-        1 -din         1 sororal
                    1 -takaik      1 -dorgy       1 sorory
                    1 -talch-      1 -g           1 sosainr
                    1 -talef-      1 -iir         1 sydarar
                    1 -talek-      1 -lairgy      1 sysam
                    1 -tche-       1 -ldam        1 y
                    1 -tchosh      1 -m           1 yorain
                    1 -teee-       1 -oaldy       1 ys
                    1 -tockh-      1 -odady
                    1 -toee-       1 -odaiin
                    1 -tolcht      1 -odaiir
                    1 -tooee-      1 -odals
                    1 -torche      1 -odol
                    1 -tose-       1 -oj
                    1 -tosh-       1 -olaiin
                    1 -tshsh-      1 -olam
                                   1 -olarol
                                   1 -oldain
                                   1 -olg
                                   1 -olinj
                                   1 -oloara
                                   1 -olor
                                   1 -ora
                                   1 -orad
                                   1 -oraj
                                   1 -oraldy
                                   1 -oram
                                   1 -orol
                                   1 -ory
                                   1 -osal
                                   1 -osam
                                   1 -osar
                                   1 -osarar
                                   1 -osdy
                                   1 -oys
                                   1 -ral
                                   1 -sas
                                   1 -sody
                                   1 -sos
                                   1 -sy
                                   1 -yar
                                   1 -yda
                                   1 -ydal
                                   1 -ydary
                                   1 -ydy
                                   1 -ys
                                   1 -ysam

Summary of the prefixes:

                     labels       herbal-A      herbal-B
                   -----------   -----------   -----------
  {o-,y-,a-}       215 (69%)    1234 (21%)     715 (29%)
  {-}               54 (17%)    3656 (61%)    1234 (50%)
  {qo-}              1 (0.3%)    603 (10%)     300 (12%)
  {ol-,al-}          9 (2.8%)     35 (0.6%)     62 (2.5%)
  {dy-,da-,do-}      6 (1.9%)     22 (0.4%)     13 (0.5%)
  {d-}               4 (1.3%)    201 (3.3%)     35 (1.4%)

There are also a few "micro-complex" prefixes with 1-2 occurrences
each.

The frequencies are roughly similar to those in the text, except that

  * the empty prefix got supplanted by {o-,y-,a-}: the frequencies
    of {-} and {o-,y-,a-} are 50% and 29% in B text, 61% and 21% in
    A text, but 17% and 69% in labels.

  * On the other hand, the qo- prefix is practically non-existent in
    labels: 1 occurrence. (There is also an occurrence of "q-" alone,
    perhaps a transcription error?)

      qokoaiin.ockhey={Label on line West from central square}
      qkol={pharma label}

    In contrast, qo- occurs in 10% of herbal-A words, and 12% of
    herbal-B words. This is all the more remarkable given the
    increased frequency of the o- prefix in labels.

The low frequency of "qo-" in labels supports the thesis that "q" is
not part of the word, but a prefixed particle (article, conjunction,
preposition).

The enhanced frequency of {o-,y-,a-} suggests that words with that
prefix are nouns, or that the prefix is an article.

We should compare the occurrences of a tailfix with and without the
q-, o-, and qo- prefixes...
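A possible starting point for that comparison, as a rough sketch only. The helper name tail_by_prefix is hypothetical (it is not one of the tools used above), and it simply assumes that the "tailfix" is whatever remains after stripping a leading qo-, q-, or o-. It reads one word per line and prints, for each tail, how often it was seen bare and with each of the three prefixes.

```shell
#!/bin/sh
# Sketch only: tail_by_prefix is a hypothetical helper, not part of the
# original toolchain. The "tail" is taken to be the word minus a leading
# qo-, q-, or o- (an assumption about what should be compared).
# Input:  one EVA word per line on stdin.
# Output: "tail n_bare n_o n_q n_qo", one tail per line, sorted by tail.
tail_by_prefix() {
  awk '
    NF == 0 { next }                      # skip blank lines
    {
      w = $1; p = "bare"
      if      (w ~ /^qo./) { p = "qo"; w = substr(w, 3) }
      else if (w ~ /^q./)  { p = "q";  w = substr(w, 2) }
      else if (w ~ /^o./)  { p = "o";  w = substr(w, 2) }
      n[w, p]++                           # count this (tail, prefix) pair
      seen[w] = 1
    }
    END {
      for (t in seen)
        printf "%s %d %d %d %d\n", t, n[t, "bare"], n[t, "o"], n[t, "q"], n[t, "qo"]
    }
  ' | sort
}

# Demo: the two q-initial labels quoted above plus a few made-up variants.
printf '%s\n' qokoaiin qkol okol kol kol | tail_by_prefix
```

On real data the input would presumably be a word list such as .labels-s-eva-gut.wds, with the same function run on the herbal-A and herbal-B word lists for comparison.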
The midfixes are dominated by {-t-,-k-}, which again is a
characteristic of herbal-B. (In herbal-A the midfixes {-ch-,-sh-} are
roughly twice as common as {-k-,-t-}.) Compare:

  by many       by Friedman   by Currier       by Friedman   by Currier
  Labels        language B    language B       language A    language A

  freq midfix   freq midfix   freq midfix      freq midfix   freq pc midfix
  ---- ------   ---- ------   ---- ---------   ---- ------   ---- -- ------
    66 -t-       407 -k-       288 -k-         1045 -ch-      985 19 -ch-
    55 -k-       183 -t-       155 -ke-         526 -sh-      470  9 -sh-
    25 -ch-      179 -ke-      138 -che-        469 -k-       438  8 -k-
    13 -che-     172 -ch-      127 -ch-         444 -t-       427  8 -t-
    13 -te-      163 -che-     116 -t-          353 -cth-     298  6 -cth-
     8 -sh-      110 -she-      88 -kee-        335 -tch-     280  5 -tch-
     7 -f-       101 -kee-      85 -te-         297 -kch-     260  5 -kch-
     7 -tch-      95 -te-       75 -she-        251 -che-     201  4 -che-
   ... ...       ... ...       ... ...          ... ...       ... .. ...

Note however the difference in the relative frequencies of -k- versus
-t-: 10:12 in labels, 10:5 in B text, 10:9 in A text. Moreover, -te-
replaces -ke- as the most common "e"-modified midfix.

These numbers support the thesis that -t- and -k- are merely variant
shapes of the same letter, with -t- being more formal and -k- more
cursive.

As for suffixes, here is a summary:

                          labels        A-text        B-text
                        -----------   -----------   -----------
  {-y,-o}                44 (14%)    2200 (37%)     583 (23%)
  {-ol,-al,-or,-ar}      67 (21%)    1886 (32%)     400 (16%)
  -                       7 (2.2%)    124 (2.1%)     63 (2.5%)
  -ody                   11 (3.5%)    218 (3.7%)    111 (4.5%)
  -dy                    10 (3.2%)     23 (0.4%)    639 (11%)
  -aiin                   4 (1.2%)    316 (5.3%)    143 (5.8%)

The frequencies of {-ol,-al,-or,-ar}, {-}, and {-dy} seem roughly
consistent with a mixture of A and B text. In particular the -y:-dy
ratio is 90:21, which lies between the ratios for A text (90:1) and B
text (90:105).

However, the frequencies of {-y,-o} and {-ody} are a bit too low, and
that of {-aiin} is significantly lower. In fact the suffix
distribution in the labels has a longer tail than the midfix
distribution (118 vs. 87 distinct forms), whereas in the text the
midfixes have a much longer tail.
Perhaps these observations can be explained by selective omission or insertion of spaces in the text vs. labels.
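
For the record, the ratios quoted for -k- versus -t- and for -y versus -dy can be rechecked from the tabulated counts with a few lines of awk. This is only a sketch: the helper name ratio is hypothetical, and it rescales a pair A:B so that the first term equals a chosen base (10 or 90, the scales used in the text), printing the second term to one decimal.

```shell
#!/bin/sh
# Sketch only: "ratio" is a hypothetical helper, not part of the original
# toolchain. It rescales the pair A:B so that the first term equals BASE
# and prints the second term to one decimal place.
ratio() {
  awk -v a="$1" -v b="$2" -v base="$3" \
    'BEGIN { printf "%d:%.1f\n", base, b * base / a }'
}

# -k- : -t- counts, taken from the midfix comparison table
ratio  55  66 10    # labels
ratio 407 183 10    # language B, Friedman
ratio 469 444 10    # language A, Friedman

# -y : -dy counts in the labels, on the 90:x scale used in the text
ratio 41 10 90
```

With these inputs the second terms come out as 12.0, 4.5, 9.5 and 22.0, so the figures in the text (10:12, 10:5, 10:9, 90:21) agree to within a unit. My guess is that the quoted values were rounded or truncated by hand.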