I stole some text in pinyin from http://www-personal.umich.edu/~wbaxter/,
cleaned it up some, and saved it to chin-mch.txt.

This is a bad sample: in the first statistics I ran, "zhong1 guo2" (China)
came out near the top. That's because half the sample is a Voice of America
semi-political speech... So I removed all (but one) occurrences of
"zhong1 guo2" from the sample.

Let's run some statistics. First, words overall:

    cat chin-mch.txt \
      | tr ' ' '\012' \
      | grep '.' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      | compute-freqs \
      > .chin.frq

    count freqy word
    ----- ----- -----------
      244 0.065 de
      118 0.031 shi4
       78 0.021 ren2
       62 0.016 you3
       61 0.016 ta1
       55 0.015 xue2
       54 0.014 wen2
       50 0.013 shi2
       50 0.013 zai4
       42 0.011 guo2
       41 0.011 yi2
       40 0.011 yi4
       37 0.010 le
       35 0.009 ge
       34 0.009 shuo1
       33 0.009 bu4
      ... ..... .....

Now, without tones:

    cat chin-mch.txt \
      | tr ' ' '\012' \
      | tr -d '0-9' \
      | grep '.' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      | compute-freqs \
      > .chin-notone.frq

    count freqy word
    ----- ----- -----------
      245 0.065 de
      223 0.059 shi
      129 0.034 yi
       93 0.025 ren
       71 0.019 you
       65 0.017 bu
       61 0.016 ta
       60 0.016 guo
       58 0.015 wen
       55 0.015 xue
       55 0.015 zi
       50 0.013 zai
       47 0.012 ji
       44 0.012 yu
       44 0.012 zhi
       43 0.011 ge
       40 0.011 mei

Now for the initial consonant:

    cat chin-mch.txt \
      | tr ' ' '\012' \
      | tr -d '0-9' \
      | sed -e 's/[aeiouü].*$//g' \
      | grep '.' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      | compute-freqs \
      > .chin-initial.frq

    count freqy word
    ----- ----- -----------
      473 0.126 d
      402 0.107 y
      364 0.097 sh
      215 0.057 j
      209 0.056 x
      198 0.053 h
      197 0.053 zh
      178 0.047 g
      173 0.046 l
      166 0.044 z
      157 0.042 b
      138 0.037 w
      130 0.035 r
      130 0.035 t
       91 0.024 f
       90 0.024 m
       89 0.024 n
       89 0.024 q
       75 0.020 k
       74 0.020 ch

Now for the final (vowels plus terminators):

    cat chin-mch.txt \
      | tr ' ' '\012' \
      | tr -d '0-9' \
      | sed -e 's/^[^aeiouü]*//g' \
      | grep '.'
\ | sort | uniq -c | expand \ | sort +0 -1nr \ | compute-freqs \ > .chin-final.frq count freqy word ----- ----- ----------- 654 0.173 i 434 0.115 e 311 0.082 u 238 0.063 en 179 0.047 ai 168 0.045 uo 145 0.038 ou 130 0.034 a 126 0.033 an 123 0.033 ing 122 0.032 ong 118 0.031 ei 113 0.030 ian 109 0.029 eng 102 0.027 ao 98 0.026 ang 73 0.019 ui 67 0.018 ue 59 0.016 iao Changing subject again, I have been looking at the differences between languages A and B, particularly the tail (midfix+suffix) distribution. They really look like different languages. Even taking into account possible letter confusion, there seems no simple correspondence between the tails of one and those of the other. Just to be sure, let's try to recompute the tail distributions after collapsing everything that could be equivalent: t,k ---------> t p,f ---------> p r,s ---------> e ei ----------> o o,a,y -------> o ch,sh -------> ee cth,ckh -----> tee cph,cfh -----> pee iiii,iii,ii -> i foreach lang ( a b ) cat Note-009/he${lang}-f.factored \ | sed \ -e 's/sh/ee/g' \ -e 's/ch/ee/g' \ -e 's/s/e/g' \ -e 's/r/e/g' \ -e 's/k/t/g' \ -e 's/f/p/g' \ -e 's/cth/tee/g' \ -e 's/ckh/tee/g' \ -e 's/cph/pee/g' \ -e 's/cfh/pee/g' \ -e 's/ei/o/g' \ -e 's/a/o/g' \ -e 's/y/o/g' \ -e 's/iiii/i/g' \ -e 's/iii/i/g' \ -e 's/ii/i/g' \ > .he${lang}-f-ere.factored cat .he${lang}-f-ere.factored \ | grep -e '- -' \ | gawk '/./ {print $2}' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-ere-midfs-all.frq cat .he${lang}-f-ere.factored \ | grep -e '- -' \ | gawk '/./ {print $3}' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-ere-suffs-all.frq cat .he${lang}-f-ere.factored \ | grep -e '- -' \ | gawk '/./ {print ($2 $3)}' \ | sed -e 's/--//g' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-ere-tails-all.frq end dicio-wc .he{a,b}-f-ere-{midf,suff,tail}s-all.frq lines words bytes file ------ ------- --------- ------------ 179 358 2964 .hea-f-ere-midfs-all.frq 131 262 1815 .hea-f-ere-suffs-all.frq 
655 1310 10600 .hea-f-ere-tails-all.frq 133 266 2169 .heb-f-ere-midfs-all.frq 82 164 1118 .heb-f-ere-suffs-all.frq 420 840 6716 .heb-f-ere-tails-all.frq foreach elem ( midf suff tail ) foreach lang ( A.a B.b ) set file = "he${lang:e}-f-ere-${elem}s-all" echo "${file}.frq -> ${file}.fmt" cat .${file}.frq \ | compute-freqs \ | gawk '\ BEGIN {\ printf "by Friedman\nlanguage '"${lang:r}"'\n"; \ printf "freq pc '"${elem}"'ix\n---- -- ------------------\n";} \ /./ {printf "%4d %2d %s\n",$1,int($2*100+0.5),$3; t+=$1;} \ END {printf "---- -- ------------------\n%4d 99 TOTAL\n",t;} \ ' \ > .${file}.fmt end end foreach elem ( midf suff tail ) set tfiles = ( ) foreach lang ( a b ) set file = "he${lang}-f-ere-${elem}s-all" set tfiles = ( ${tfiles} .${file}.fmt ) end pr -m -t -i' '1 -w 54 ${tfiles} \ | expand \ > .herbal-f-ere-${elem}-cmp.txt end dicio-wc .herbal-f-ere-{midf,suff,tail}-cmp.txt Here are the results: by Friedman by Friedman language A language B freq pc midfix freq pc midfix ---- -- ------------------ ---- -- ------------------ 1595 27 -ee- 590 24 -t- 1313 22 -tee- 293 12 -tee- 913 15 -t- 279 12 -eee- 419 7 -eee- 274 11 -te- 241 4 -teee- 269 11 -ee- 191 3 -pee- 95 4 -teee- 155 3 -te- 66 3 -eetee- 132 2 -eeot- 60 3 -eeet- 100 2 -eeee- 52 2 -pee- 100 2 -eeotee- 49 2 -eeee- 99 2 -eetee- 48 2 -p- 60 1 -p- 45 2 -peee- 57 1 -eet- 25 1 -eeetee- 46 1 -peee- 24 1 -eet- 40 1 -teeee- 15 1 -eeete- 24 0 -eeet- 14 1 -eeteee- .... .. ....... .... .. ..... 
---- -- ------------------ ---- -- ------------------ 5967 99 TOTAL 2431 99 TOTAL Tails: by Friedman by Friedman language A language B freq pc tailix freq pc tailix ---- -- ------------------ ---- -- ------------------ 579 10 -teeo 153 6 -toe 395 7 -eeo 150 6 -tedo 370 6 -eeol 135 6 -teedo 337 6 -eeoe 131 5 -toin 226 4 -teeoe 118 5 -eeedo 212 4 -teeol 92 4 -teeo 197 3 -to 88 4 -tol 189 3 -tol 87 4 -eedo 178 3 -toin 65 3 -to 167 3 -eeeo 52 2 -eeteeo 156 3 -teeeo 51 2 -eeeo 119 2 -toe 41 2 -teeedo 96 2 -eeoin 39 2 -teeeo 91 2 -eeeoe 31 1 -teo Hmm, it seems that scribe A does not use "d" in the suffixes very much. Perhaps if we delete "d" we will get a better resemblance: foreach lang ( a b ) cat Note-009/he${lang}-f.factored \ | sed \ -e 's/d//g' \ -e 's/sh/ee/g' \ -e 's/ch/ee/g' \ -e 's/s/e/g' \ -e 's/r/e/g' \ -e 's/k/t/g' \ -e 's/f/p/g' \ -e 's/cth/tee/g' \ -e 's/ckh/tee/g' \ -e 's/cph/pee/g' \ -e 's/cfh/pee/g' \ -e 's/ei/o/g' \ -e 's/a/o/g' \ -e 's/y/o/g' \ -e 's/iiii/i/g' \ -e 's/iii/i/g' \ -e 's/ii/i/g' \ > .he${lang}-f-erf.factored end foreach lang ( a b ) cat .he${lang}-f-erf.factored \ | grep -e '- -' \ | gawk '/./ {print $2}' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erf-midfs-all.frq cat .he${lang}-f-erf.factored \ | grep -e '- -' \ | gawk '/./ {print $3}' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erf-suffs-all.frq cat .he${lang}-f-erf.factored \ | grep -e '- -' \ | gawk '/./ {print ($2 $3)}' \ | sed -e 's/--//g' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erf-tails-all.frq end dicio-wc .he{a,b}-f-erf-{midf,suff,tail}s-all.frq lines words bytes file ------ ------- --------- ------------ 162 324 2666 .hea-f-erf-midfs-all.frq 85 170 1159 .hea-f-erf-suffs-all.frq 535 1070 8572 .hea-f-erf-tails-all.frq 125 250 2028 .heb-f-erf-midfs-all.frq 54 108 722 .heb-f-erf-suffs-all.frq 329 658 5186 .heb-f-erf-tails-all.frq foreach elem ( midf suff tail ) foreach lang ( A.a B.b ) set file = 
"he${lang:e}-f-erf-${elem}s-all" echo "${file}.frq -> ${file}.fmt" cat .${file}.frq \ | compute-freqs \ | gawk '\ BEGIN {\ printf "by Friedman\nlanguage '"${lang:r}"'\n"; \ printf "freq pc '"${elem}"'ix\n---- -- ------------------\n";} \ /./ {printf "%4d %2d %s\n",$1,int($2*100+0.5),$3; t+=$1;} \ END {printf "---- -- ------------------\n%4d 99 TOTAL\n",t;} \ ' \ > .${file}.fmt end end foreach elem ( midf suff tail ) set tfiles = ( ) foreach lang ( a b ) set file = "he${lang}-f-erf-${elem}s-all" set tfiles = ( ${tfiles} .${file}.fmt ) end pr -m -t -i' '1 -w 54 ${tfiles} \ | expand \ > .herbal-f-erf-${elem}-cmp.txt end dicio-wc .herbal-f-erf-{midf,suff,tail}-cmp.txt lines words bytes file ------ ------- --------- ------------ 168 893 6707 .herbal-f-erf-midf-cmp.txt 91 449 3316 .herbal-f-erf-suff-cmp.txt 541 2624 20105 .herbal-f-erf-tail-cmp.txt by Friedman by Friedman language A language B freq pc tailix freq pc tailix ---- -- ------------------ ---- -- ------------------ 611 10 -teeo 231 10 -teeo 422 7 -eeo 185 8 -teo 374 6 -eeol 172 7 -eeeo 338 6 -eeoe 153 6 -toe 228 4 -teeoe 131 5 -toin 218 4 -teeol 116 5 -eeo 216 4 -to 90 4 -tol 195 3 -tol 84 4 -teeeo 180 3 -toin 69 3 -to 172 3 -eeeo 59 2 -eeteeo 161 3 -teeeo 36 2 -eeeeo 120 2 -toe 34 1 -eeoe 98 2 -eeoin 34 1 -eeol 91 2 -eeeoe 30 1 -peeeo 79 1 -eeoteeo 30 1 -tom 76 1 -eeoo 29 1 -eeeto 70 1 -eeteeo 29 1 -eeoo 69 1 -teeoo 29 1 -peeo 64 1 -eeeeo 26 1 -eeeoe 62 1 -eeoto 26 1 -teoo Good news, at least we got the top entry to match. Now what else can we do? We could map "teeoe" and "teeol" to "teo", but that seems a bit ad-hoc... Let's try again. Let's compare the frequencies of "k" and "t", "sh" and "ch" in each language" foreach lang ( a b ) cat Note-009/he${lang}-f.factored \ | sed \ -e 's/[- .:]//g' \ -e 's/ch/C/' \ -e 's/sh/S/' \ -e 's/$/\./' \ | count-digraph-freqs \ -v pad="." \ -v showentropy=0 \ -v chars=".CSaoeilmnrchtpkfsqjdvxyg" end Language A: Digraph counts: TT . 
C S a o e i l m n r c h t p k f s q d y ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- n 1376 1341 1 . 1 7 . . 3 2 . 1 1 . 1 . . . 2 . 8 8 m 265 261 . . 1 3 . . . . . . . . . . . . . . . . r 1569 1302 32 6 59 59 1 5 4 3 . 2 4 . . . 1 . 2 . 16 73 l 1720 1310 35 14 11 88 7 . . 3 . 2 6 . 6 1 7 1 40 1 120 68 y 3189 2543 91 16 5 16 3 . 3 1 . 2 10 . 183 21 186 2 16 2 89 . s 669 316 51 5 104 107 16 . . 2 . . 4 12 . 1 3 . 1 . . 47 d 2234 160 145 51 1109 175 24 . 14 5 . 2 10 . 3 3 5 . 11 . 7 510 k 1650 24 377 75 223 252 258 . . 1 . . 43 226 . . . . 3 . 2 166 t 1790 17 423 57 161 273 124 . 1 . . . 28 522 1 1 . . 4 1 5 172 p 324 7 117 11 9 35 . . . . . . 16 101 1 . . . . . 3 24 f 106 9 28 5 6 14 . . . . . . 2 30 . . . . . . 4 8 . 7812 . 1507 745 79 1145 26 12 41 16 3 57 615 . 267 95 352 33 356 708 1266 489 c 1001 . . . . . . . . . . . . 122 522 101 226 30 . . . . o 5711 410 59 24 74 18 60 101 1325 91 7 993 141 . 726 83 742 35 117 4 641 60 a 2318 43 4 . 1 4 . 1305 311 131 54 412 3 . 4 2 7 2 10 . 19 6 i 2601 2 1 2 . 1 3 1173 4 6 1300 83 3 . 1 . 14 . 3 . 5 . e 1958 32 11 1 118 529 475 1 1 3 12 4 12 . 34 11 37 1 79 . 10 587 h 1013 10 4 1 93 335 177 1 2 . . 2 . . 1 . 1 . 9 . 4 373 S 1016 15 5 . 47 525 233 . 3 . . . 21 . 6 . 13 1 4 . 6 137 C 2892 10 . 3 217 1427 549 3 7 1 . 9 80 . 32 5 51 1 12 . 28 457 q 716 . 1 . . 698 2 . 1 . . . 2 . 2 . 5 . . . 1 4 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 41930 7812 2892 1016 2318 5711 1958 2601 1720 265 1376 1569 1001 1013 1790 324 1650 106 669 716 2234 3189 Next-symbol probability (× 99): TT . C S a o e i l m n r c h t p k f s q d y -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 99 . 19 9 1 15 . . 1 . . 1 8 . 3 1 4 . 5 9 16 6 C 99 . . . 7 49 19 . . . . . 3 . 1 . 2 . . . 1 16 S 99 1 . . 5 51 23 . . . . . 2 . 1 . 1 . . . 1 13 a 99 2 . . . . . 
56 13 6 2 18 . . . . . . . . 1 . o 99 7 1 . 1 . 1 2 23 2 . 17 2 . 13 1 13 1 2 . 11 1 e 99 2 1 . 6 27 24 . . . 1 . 1 . 2 1 2 . 4 . 1 30 i 99 . . . . . . 45 . . 49 3 . . . . 1 . . . . . l 99 75 2 1 1 5 . . . . . . . . . . . . 2 . 7 4 m 99 98 . . . 1 . . . . . . . . . . . . . . . . n 99 96 . . . 1 . . . . . . . . . . . . . . 1 1 r 99 82 2 . 4 4 . . . . . . . . . . . . . . 1 5 c 99 . . . . . . . . . . . . 12 52 10 22 3 . . . . h 99 1 . . 9 33 17 . . . . . . . . . . . 1 . . 36 t 99 1 23 3 9 15 7 . . . . . 2 29 . . . . . . . 10 p 99 2 36 3 3 11 . . . . . . 5 31 . . . . . . 1 7 k 99 1 23 5 13 15 15 . . . . . 3 14 . . . . . . . 10 f 99 8 26 5 6 13 . . . . . . 2 28 . . . . . . 4 7 s 99 47 8 1 15 16 2 . . . . . 1 2 . . . . . . . 7 q 99 . . . . 97 . . . . . . . . . . 1 . . . . 1 d 99 7 6 2 49 8 1 . 1 . . . . . . . . . . . . 23 y 99 79 3 . . . . . . . . . . . 6 1 6 . . . 3 . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 18 7 2 5 13 5 6 4 1 3 4 2 2 4 1 4 0 2 2 5 8 Previous-symbol probability (× 99): TT . C S a o e i l m n r c h t p k f s q d y -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 18 . 52 73 3 20 1 . 2 6 . 4 61 . 15 29 21 31 53 98 56 15 C 7 . . . 9 25 28 . . . . 1 8 . 2 2 3 1 2 . 1 14 S 2 . . . 2 9 12 . . . . . 2 . . . 1 1 1 . . 4 a 5 1 . . . . . 50 18 49 4 26 . . . 1 . 2 1 . 1 . o 13 5 2 2 3 . 3 4 76 34 1 63 14 . 40 25 45 33 17 1 28 2 e 5 . . . 5 9 24 . . 1 1 . 1 . 2 3 2 1 12 . . 18 i 6 . . . . . . 45 . 2 94 5 . . . . 1 . . . . . l 4 17 1 1 . 2 . . . 1 . . 1 . . . . 1 6 . 5 2 m 1 3 . . . . . . . . . . . . . . . . . . . . n 3 17 . . . . . . . 1 . . . . . . . . . . . . r 4 17 1 1 3 1 . . . 1 . . . . . . . . . . 1 2 c 2 . . . . . . . . . . . . 12 29 31 14 28 . . . . h 2 . . . 4 6 9 . . . . . . . . . . . 1 . . 12 t 4 . 14 6 7 5 6 . . . . . 3 51 . . . . 1 . . 5 p 1 . 4 1 . 1 . . . . . . 2 10 . . . . . . . 1 k 4 . 13 7 10 4 13 . . . . . 4 22 . . . . . . . 5 f 0 . 1 . . . . . . . . . . 3 . . . . . . . . s 2 4 2 . 4 2 1 . . 1 . 
. . 1 . . . . . . . 1 q 2 . . . . 12 . . . . . . . . . . . . . . . . d 5 2 5 5 47 3 1 . 1 2 . . 1 . . 1 . . 2 . . 16 y 8 32 3 2 . . . . . . . . 1 . 10 6 11 2 2 . 4 . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 Language B Digraph counts: TT . C S a o e i l m n r c h t p k f s q d x y ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- n 513 500 1 . . 1 . . . . . 4 . . . . . . 1 1 1 . 4 y 1754 1473 18 8 1 6 2 . 5 4 . 2 1 . 75 17 110 8 4 . 20 . . m 113 107 . . 2 . . . . 1 . . . . . . . . . . 3 . . r 670 532 6 2 77 16 2 4 1 . . . 1 . . . . . . . 8 . 21 l 612 315 35 15 36 34 2 . 1 . . 4 3 . 8 . 55 4 10 1 52 . 37 s 191 94 6 . 60 7 5 . 1 . . . . 6 . 2 1 . 1 . 2 . 6 d 1477 71 19 13 421 38 12 1 6 . . 1 2 . . 1 2 . 2 . . . 888 f 86 5 33 1 20 7 2 . . . . . 2 9 . . . . 1 . 1 . 5 p 142 5 65 9 22 12 1 . . . . . 5 12 . . . . . . 4 . 7 . 3223 . 540 256 171 760 14 7 49 5 . 21 42 . 137 53 163 21 75 330 341 2 236 x 4 1 . . . 3 . . . . . . . . . . . . . . . . . c 219 . . . . . . . . . . . . 23 63 12 112 9 . . . . . o 1695 51 5 2 16 8 21 4 297 3 1 174 35 . 216 35 517 28 40 . 230 1 11 a 1368 17 . 1 . 1 . 569 245 99 4 398 4 . 1 1 10 1 7 . 8 . 2 i 1051 . . . . 2 . 464 1 . 508 64 3 . . 1 6 . . . 2 . . k 1106 20 94 21 374 49 330 . 1 . . . 4 112 . . . . 4 . 3 . 94 t 530 5 73 18 128 53 145 . . . . . 5 63 . . . . . . . . 40 h 225 3 1 1 5 10 65 1 . . . . 1 . . . . 1 . . 26 1 110 S 350 2 . . 5 50 206 . . . . . 12 . . . 6 . 3 . 44 . 22 C 909 2 . 1 19 93 406 1 2 1 . 1 71 . 9 3 25 1 6 . 212 . 56 e 1497 20 13 2 11 219 279 . 3 . . 1 27 . 21 17 99 13 37 . 520 . 215 q 332 . . . . 326 5 . . . . . 1 . . . . . . . . . . 
----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 18067 3223 909 350 1368 1695 1497 1051 612 113 513 670 219 225 530 142 1106 86 191 332 1477 4 1754 Next-symbol probability (× 99): TT . C S a o e i l m n r c h t p k f s q d x y -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 99 . 17 8 5 23 . . 2 . . 1 1 . 4 2 5 1 2 10 10 . 7 C 99 . . . 2 10 44 . . . . . 8 . 1 . 3 . 1 . 23 . 6 S 99 1 . . 1 14 58 . . . . . 3 . . . 2 . 1 . 12 . 6 a 99 1 . . . . . 41 18 7 . 29 . . . . 1 . 1 . 1 . . o 99 3 . . 1 . 1 . 17 . . 10 2 . 13 2 30 2 2 . 13 . 1 e 99 1 1 . 1 14 18 . . . . . 2 . 1 1 7 1 2 . 34 . 14 i 99 . . . . . . 44 . . 48 6 . . . . 1 . . . . . . l 99 51 6 2 6 6 . . . . . 1 . . 1 . 9 1 2 . 8 . 6 m 99 94 . . 2 . . . . 1 . . . . . . . . . . 3 . . n 99 96 . . . . . . . . . 1 . . . . . . . . . . 1 r 99 79 1 . 11 2 . 1 . . . . . . . . . . . . 1 . 3 c 99 . . . . . . . . . . . . 10 28 5 51 4 . . . . . h 99 1 . . 2 4 29 . . . . . . . . . . . . . 11 . 48 t 99 1 14 3 24 10 27 . . . . . 1 12 . . . . . . . . 7 p 99 3 45 6 15 8 1 . . . . . 3 8 . . . . . . 3 . 5 k 99 2 8 2 33 4 30 . . . . . . 10 . . . . . . . . 8 f 99 6 38 1 23 8 2 . . . . . 2 10 . . . . 1 . 1 . 6 s 99 49 3 . 31 4 3 . 1 . . . . 3 . 1 1 . 1 . 1 . 3 q 99 . . . . 97 1 . . . . . . . . . . . . . . . . d 99 5 1 1 28 3 1 . . . . . . . . . . . . . . . 60 x 99 25 . . . 74 . . . . . . . . . . . . . . . . . y 99 83 1 . . . . . . . . . . . 4 1 6 . . . 1 . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 18 5 2 7 9 8 6 3 1 3 4 1 1 3 1 6 0 1 2 8 0 10 Previous-symbol probability (× 99): TT . C S a o e i l m n r c h t p k f s q d x y -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 18 . 59 72 12 44 1 1 8 4 . 3 19 . 26 37 15 24 39 98 23 50 13 C 5 . . . 1 5 27 . . 1 . . 32 . 2 2 2 1 3 . 14 . 3 S 2 . . . . 3 14 . . . . . 5 . . . 1 . 2 . 3 . 1 a 7 1 . . . . . 54 40 87 1 59 2 . . 
1 1 1 4 . 1 . . o 9 2 1 1 1 . 1 . 48 3 . 26 16 . 40 24 46 32 21 . 15 25 1 e 8 1 1 1 1 13 18 . . . . . 12 . 4 12 9 15 19 . 35 . 12 i 6 . . . . . . 44 . . 98 9 1 . . 1 1 . . . . . . l 3 10 4 4 3 2 . . . . . 1 1 . 1 . 5 5 5 . 3 . 2 m 1 3 . . . . . . . 1 . . . . . . . . . . . . . n 3 15 . . . . . . . . . 1 . . . . . . 1 . . . . r 4 16 1 1 6 1 . . . . . . . . . . . . . . 1 . 1 c 1 . . . . . . . . . . . . 10 12 8 10 10 . . . . . h 1 . . . . 1 4 . . . . . . . . . . 1 . . 2 25 6 t 3 . 8 5 9 3 10 . . . . . 2 28 . . . . . . . . 2 p 1 . 7 3 2 1 . . . . . . 2 5 . . . . . . . . . k 6 1 10 6 27 3 22 . . . . . 2 49 . . . . 2 . . . 5 f 0 . 4 . 1 . . . . . . . 1 4 . . . . 1 . . . . s 1 3 1 . 4 . . . . . . . . 3 . 1 . . 1 . . . . q 2 . . . . 19 . . . . . . . . . . . . . . . . . d 8 2 2 4 30 2 1 . 1 . . . 1 . . 1 . . 1 . . . 50 x 0 . . . . . . . . . . . . . . . . . . . . . . y 10 45 2 2 . . . . 1 4 . . . . 14 12 10 9 2 . 1 . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 The relative frequencies of "t" and "k", "sh" and "ch" are as follows: Language A: t = 1790 k = 1650 ratio t/k = 1.085 Language B: t = 530 k = 1106 ratio t/k = 0.479 Language A: S = 1016 C = 2892 ratio S/C = 0.351 Language B: S = 350 C = 909 ratio S/C = 0.385 So it seems we must collapse t and k, otherwise it will be very hard to find a correspondence between the two languages. We could keep ch and sh distinct, but their next-symbol probabilities are so similar that it seems silly to distinguish them. 
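The ratios above are plain quotients of the TOT counts in the two digraph tables. As a cross-check, here is a minimal Python sketch (not part of the original csh/gawk pipeline; the dictionary layout and the function name `ratio` are assumptions made for illustration) that recomputes them from those counts:

```python
# Total symbol counts, copied from the TOT rows of the digraph tables above.
TOTALS = {
    "A": {"t": 1790, "k": 1650, "S": 1016, "C": 2892},
    "B": {"t": 530, "k": 1106, "S": 350, "C": 909},
}

def ratio(lang, num, den):
    """Return count(num) / count(den) for one language, rounded to 3 places."""
    counts = TOTALS[lang]
    return round(counts[num] / counts[den], 3)

for lang in ("A", "B"):
    print(lang, "t/k =", ratio(lang, "t", "k"), " S/C =", ratio(lang, "S", "C"))
```

The S/C ratio barely moves between the two languages, while t/k drops by more than half, which is the asymmetry the paragraph above relies on.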
Just to be sure, let's compare the sh and ch contexts in the two languages:

    foreach lang ( a b )
      foreach f ( sh.ch ch.sh )
        cat Note-009/he${lang}-f.factored \
          | sed \
              -e 's/[- .:]//g' \
              -e 's/k/t/' \
              -e 's/p/f/' \
              -e 's/ckh/K/' \
              -e 's/cph/P/' \
              -e 's/'"${f:r}"'/@/' \
              -e 's/'"${f:e}"'/~/' \
          | grep '@' \
          | sort | uniq -c | expand \
          | sort +0 -1nr \
          | compute-freqs \
          > .tmp-he${lang}-${f:r}.frq
      end
    end

    dicio-wc .tmp-he{a,b}-{sh,ch}.frq

     lines   words     bytes file
    ------ ------- --------- ------------
       322     966      6438 .tmp-hea-sh.frq
       860    2580     17698 .tmp-hea-ch.frq
       173     519      3466 .tmp-heb-sh.frq
       389    1167      7964 .tmp-heb-ch.frq

    Language A                                Language B
    ----------------------------------------  ----------------------------------------
    contexts of sh       contexts of ch       contexts of sh       contexts of ch
    -------------------  -------------------  -------------------  -------------------
      98 0.096 @ol        221 0.076 @ol         35 0.100 @edy        60 0.066 @edy
      92 0.091 @o         150 0.052 @or         16 0.046 @dy         51 0.056 @dy
      63 0.062 @or         95 0.033 @y          11 0.031 @ol         43 0.047 @cthy
      61 0.060 @y          88 0.030 qot@y       10 0.029 @ey         22 0.024 @ety
      39 0.038 @ey         55 0.019 @ey         10 0.029 @y          22 0.024 t@dy
      23 0.023 @ody        53 0.018 t@y          9 0.026 @ody        20 0.022 qot@dy
      19 0.019 @eey        44 0.015 ot@y         8 0.023 @eedy       15 0.017 @ey
      15 0.015 @eol        37 0.013 @oty         8 0.023 @eody       13 0.014 @ol
      14 0.014 @aiin       36 0.012 t@or         7 0.020 @eo         12 0.013 @ecthy
      14 0.014 @e          34 0.012 @aiin        6 0.017 @ety        12 0.013 @ody
      14 0.014 @odaiin     33 0.011 @eor         6 0.017 @or         12 0.013 ot@dy
      12 0.012 @eor        32 0.011 ot@ol        5 0.014 @ed         11 0.012 @eody
      11 0.011 t@o         31 0.011 @ody         5 0.014 @eey        10 0.011 @y
      10 0.010 @eo         30 0.010 t@ol         5 0.014 @eol         9 0.010 @ty
      10 0.010 @octhy      30 0.010 yt@y         5 0.014 d@edy        9 0.010 t@edy
      10 0.010 ot@y        29 0.010 @cthy        5 0.014 t@dy         7 0.008 @daiin
       9 0.009 @cthy       29 0.010 @o           4 0.011 @cthey       7 0.008 @eol

Obviously "sh" and "ch" are very different.
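For readers who do not have the csh tools at hand, the context tally can be sketched in Python. This is a simplified illustration only: it keeps the separator stripping, the first-occurrence marking with "@" and "~", and the filtering on "@", but omits the k/t, p/f, ckh and cph rewrites. The function name `contexts` and the toy word list are assumptions; the real input is the he${lang}-f.factored file.

```python
import re
from collections import Counter

def contexts(words, target, other):
    """Tally the contexts of `target`, marking it '@' and `other` '~'."""
    tally = Counter()
    for word in words:
        w = re.sub(r"[- .:]", "", word)  # drop factor separators, like sed 's/[- .:]//g'
        w = w.replace(target, "@", 1)    # first occurrence only (sed without the g flag)
        w = w.replace(other, "~", 1)
        if "@" in w:                     # like grep '@'
            tally[w] += 1
    total = sum(tally.values())
    # Rows are (count, relative frequency, context), most frequent first.
    return [(n, round(n / total, 3), ctx) for ctx, n in tally.most_common()]

# Hypothetical toy input, not taken from the actual data files.
for row in contexts(["ch-ol", "chol", "sh-ol", "qo-t-chy"], "ch", "sh"):
    print(row)
```

Swapping the last two arguments gives the contexts of the other glyph, which is what the `sh.ch ch.sh` loop does with the `:r` and `:e` modifiers.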
Just to make double sure, we can play the same game with t and k:

    foreach lang ( a b )
      foreach f ( t.k k.t )
        cat Note-009/he${lang}-f.factored \
          | sed \
              -e 's/[- .:]//g' \
              -e 's/p/f/' \
              -e 's/'"${f:r}"'/@/' \
              -e 's/'"${f:e}"'/~/' \
          | grep '@' \
          | sort | uniq -c | expand \
          | sort +0 -1nr \
          | compute-freqs \
          > .tmp-he${lang}-${f:r}.frq
      end
    end

    dicio-wc .tmp-he{a,b}-{t,k}.frq

    pr -m -t -i' '1 -w 104 .tmp-he{a,b}-{t,k}.frq \
      | expand \
      > .tmp-t-k-cmp.txt

     lines   words     bytes file
    ------ ------- --------- ------------
       642    1926     13627 .tmp-hea-t.frq
       683    2049     14438 .tmp-hea-k.frq
       271     813      5693 .tmp-heb-t.frq
       437    1311      9223 .tmp-heb-k.frq

    Language A                                Language B
    ----------------------------------------  ----------------------------------------
    contexts of t        contexts of k        contexts of t        contexts of k
    -------------------  -------------------  -------------------  -------------------
      96 0.054 c@hy        39 0.024 qo@chy      18 0.034 o@edy       41 0.037 qo@edy
      51 0.029 c@hol       33 0.020 o@y         16 0.030 o@ar        35 0.032 chc@hy
      49 0.028 qo@chy      29 0.018 @chy        14 0.027 o@al        35 0.032 o@aiin
      42 0.024 c@hor       28 0.017 c@hy        13 0.025 o@aiin      29 0.026 o@ar
      38 0.021 o@y         27 0.017 o@aiin      12 0.023 o@y         27 0.025 qo@ar
      34 0.019 c@hey       25 0.015 qo@y        12 0.023 qo@edy      25 0.023 o@edy
      29 0.016 o@chy       22 0.014 qo@ol       11 0.021 y@edy       23 0.021 o@al
      28 0.016 o@ol        21 0.013 o@ol        10 0.019 @edy        20 0.018 qo@aiin
      28 0.016 qo@y        20 0.012 @chor        9 0.017 @ar         19 0.017 che@y
      27 0.015 o@aiin      19 0.012 @chol        8 0.015 chc@hy      18 0.016 @ar
      24 0.014 @chy        18 0.011 @aiin        7 0.013 @chdy       17 0.015 o@y
      24 0.014 o@chol      18 0.011 @ol          7 0.013 o@am        16 0.015 o@eedy
      20 0.011 c@ho        18 0.011 y@chy        7 0.013 o@chdy      15 0.014 @chdy
      20 0.011 cho@y       17 0.010 cho@y        7 0.013 o@eol       15 0.014 qo@chdy
      18 0.010 qo@ol       16 0.010 qo@aiin      7 0.013 qo@ar       15 0.014 y@ar
      17 0.010 @ol         15 0.009 c@hol        7 0.013 y@eedy      14 0.013 @edy

Hm, there is some resemblance, but not as much as I would like.
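One way to put a number on "some resemblance" is to add up, over the contexts the two lists share, the smaller of the two relative frequencies. This measure is not something these notes compute; it is an assumption introduced here purely as an illustration, and the function name `overlap` is hypothetical. The sample values are the top five Language A rows of the table above, so the result only covers the head of each list.

```python
def overlap(p, q):
    """Sum of min(p[c], q[c]) over shared contexts c.

    For two complete distributions the result lies between 0.0
    (no context in common) and 1.0 (identical distributions).
    """
    return sum(min(freq, q[ctx]) for ctx, freq in p.items() if ctx in q)

# Top five rows for Language A, copied from the table above (truncated lists).
A_T = {"c@hy": 0.054, "c@hol": 0.029, "qo@chy": 0.028, "c@hor": 0.024, "o@y": 0.021}
A_K = {"qo@chy": 0.024, "o@y": 0.020, "@chy": 0.018, "c@hy": 0.017, "o@aiin": 0.017}

print("overlap of top-5 t and k contexts in A:", overlap(A_T, A_K))
```

Running it over the full .frq files instead of the truncated tables would give a figure that can be compared before and after each rewriting step.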
Perhaps it will get better if I delete the [oqy] prefixes and replace
cth, ckh by tch, kch:

    foreach lang ( a b )
      foreach f ( t.k k.t )
        cat Note-009/he${lang}-f.factored \
          | sed \
              -e 's/[- .:]//g' \
              -e 's/p/f/' \
              -e 's/^[qoy]*//' \
              -e 's/c\([tkpf]\)h/\1ch/' \
              -e 's/'"${f:r}"'/@/' \
              -e 's/'"${f:e}"'/~/' \
          | grep '@' \
          | sort | uniq -c | expand \
          | sort +0 -1nr \
          | compute-freqs \
          > .tmp-he${lang}-${f:r}.frq
      end
    end

    dicio-wc .tmp-he{a,b}-{t,k}.frq

    pr -m -t -i' '1 -w 104 .tmp-he{a,b}-{t,k}.frq \
      | expand \
      > .tmp-t-k-cmp.txt

     lines   words     bytes file
    ------ ------- --------- ------------
       446    1338      9376 .tmp-hea-t.frq
       453    1359      9489 .tmp-hea-k.frq
       205     615      4279 .tmp-heb-t.frq
       318     954      6630 .tmp-heb-k.frq

    Language A                                Language B
    ----------------------------------------  ----------------------------------------
    contexts of t        contexts of k        contexts of t        contexts of k
    -------------------  -------------------  -------------------  -------------------
     218 0.123 @chy       137 0.084 @chy        51 0.097 @edy        89 0.081 @ar
     109 0.061 @chol       80 0.049 @y          35 0.067 @ar         89 0.081 @edy
      99 0.056 @chor       75 0.046 @aiin       26 0.049 @chdy       77 0.070 @aiin
      82 0.046 @y          70 0.043 @ol         23 0.044 @y          44 0.040 @al
      73 0.041 @ol         62 0.038 @chol       20 0.038 @aiin       43 0.039 @eedy
      70 0.039 @chey       61 0.037 @chor       19 0.036 @al         41 0.037 @chdy
      63 0.036 @aiin       50 0.031 @chey       16 0.030 @chedy      35 0.032 ch@chy
      47 0.027 @or         49 0.030 @eey        13 0.025 @eedy       30 0.027 @y
      40 0.023 @cho        36 0.022 @or         12 0.023 @eey        25 0.023 @chy
      29 0.016 @chody      30 0.018 @cho        11 0.021 @am         25 0.023 @eey
      26 0.015 cho@chy     25 0.015 @eol        11 0.021 @chy        19 0.017 @eody
      23 0.013 @char       23 0.014 @al         10 0.019 @chey       19 0.017 che@y
      20 0.011 @eey        21 0.013 ch@chy      10 0.019 @eol        18 0.016 @ain
      20 0.011 cho@y       20 0.012 @ey         10 0.019 @ody        18 0.016 @am
      19 0.011 @chaiin     20 0.012 cho@chy      9 0.017 @or         14 0.013 @ol
      17 0.010 @al         19 0.012 @shy         8 0.015 @ol         13 0.012 @chedy
      17 0.010 ch@chy      18 0.011 @chody       8 0.015 ch@chy      11 0.010 @ey

Not perfect, but convincing enough... Ok,
let's try again to equalize the tail distributions: foreach lang ( a b ) cat Note-009/he${lang}-f.factored \ | sed \ -e 's/d//g' \ -e 's/k/t/g' \ -e 's/f/p/g' \ -e 's/cth/tch/g' \ -e 's/ckh/tch/g' \ -e 's/cph/pch/g' \ -e 's/cfh/pch/g' \ -e 's/ei/a/g' \ -e 's/a/o/g' \ > .he${lang}-f-erg.factored end foreach lang ( a b ) cat .he${lang}-f-erg.factored \ | gawk '/./ {print ($1 $2 $3)}' \ | sed -e 's/--//g' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erg-words-all.frq cat .he${lang}-f-erg.factored \ | grep -v -e '- -' \ | gawk '/./ {print $1}' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erg-unifs-all.frq cat .he${lang}-f-erg.factored \ | grep -e '- -' \ | gawk '/./ {print $1}' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erg-prefs-all.frq cat .he${lang}-f-erg.factored \ | grep -e '- -' \ | gawk '/./ {print $2}' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erg-midfs-all.frq cat .he${lang}-f-erg.factored \ | grep -e '- -' \ | gawk '/./ {print $3}' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erg-suffs-all.frq cat .he${lang}-f-erg.factored \ | grep -e '- -' \ | gawk '/./ {print ($2 $3)}' \ | sed -e 's/--//g' \ | sort | uniq -c | expand | sort +0 -1nr \ > .he${lang}-f-erg-tails-all.frq end dicio-wc .he{a,b}-f-erg-{word,unif,pref,midf,suff,tail}s-all.frq lines words bytes file ------ ------- --------- ------------ 1563 3126 23155 .hea-f-erg-words-all.frq 193 386 2524 .hea-f-erg-unifs-all.frq 47 94 600 .hea-f-erg-prefs-all.frq 286 572 4647 .hea-f-erg-midfs-all.frq 126 252 1732 .hea-f-erg-suffs-all.frq 888 1776 14121 .hea-f-erg-tails-all.frq 880 1760 12786 .heb-f-erg-words-all.frq 132 264 1735 .heb-f-erg-unifs-all.frq 28 56 345 .heb-f-erg-prefs-all.frq 193 386 3105 .heb-f-erg-midfs-all.frq 76 152 1018 .heb-f-erg-suffs-all.frq 506 1012 7898 .heb-f-erg-tails-all.frq foreach elem ( word unif pref midf suff tail ) foreach lang ( A.a B.b ) set file = "he${lang:e}-f-erg-${elem}s-all" echo "${file}.frq -> 
${file}.fmt" cat .${file}.frq \ | compute-freqs \ | gawk '\ BEGIN {\ printf "by Friedman\nlanguage '"${lang:r}"'\n"; \ printf "freq pc '"${elem}"'ix\n---- -- ------------------\n";} \ /./ {printf "%4d %2d %s\n",$1,int($2*100+0.5),$3; t+=$1;} \ END {printf "---- -- ------------------\n%4d 99 TOTAL\n",t;} \ ' \ > .${file}.fmt end end foreach elem ( word unif pref midf suff tail ) set tfiles = ( ) foreach lang ( a b ) set file = "he${lang}-f-erg-${elem}s-all" set tfiles = ( ${tfiles} .${file}.fmt ) end pr -m -t -i' '1 -w 54 ${tfiles} \ | expand \ > .herbal-f-erg-${elem}-cmp.txt end dicio-wc .herbal-f-erg-{word,unif,pref,midf,suff,tail}-cmp.txt lines words bytes file ------ ------- --------- ------------ 1569 7361 55938 .herbal-f-erg-word-cmp.txt 199 1007 7275 .herbal-f-erg-unif-cmp.txt 53 257 1901 .herbal-f-erg-pref-cmp.txt 292 1469 11188 .herbal-f-erg-midf-cmp.txt 132 638 4738 .herbal-f-erg-suff-cmp.txt 894 4214 32524 .herbal-f-erg-tail-cmp.txt With these transformations, the prefixes are obviously still the same in both languages: by Friedman by Friedman language A language B freq pc prefix freq pc prefix ---- -- ------------------ ---- -- ------------------ 3857 65 - 1269 52 - 825 14 o- 504 21 o- 607 10 qo- 300 12 qo- 440 7 y- 227 9 y- 56 1 s- 66 3 ol- 42 1 ol- 26 1 l- 21 0 so- 6 0 s- 16 0 l- 5 0 o:i- 13 0 or- 4 0 or- 12 0 r- 3 0 lo- 11 0 oy- 2 0 olo- 7 0 o:i- 2 0 qol- 6 0 yo- 2 0 r- 5 0 os- 1 0 lol- 4 0 ro- 1 0 lqo- 4 0 sol- 1 0 o:ii- 4 0 sy- 1 0 o:n- 3 0 lo- 1 0 oo- 2 0 ls- 1 0 oro- 2 0 o:in- 1 0 orol- 2 0 oo- 1 0 oy- 2 0 oro- 1 0 so:i- 2 0 qoo:i- 1 0 sol- The suffixes are close enough: by Friedman by Friedman language A language B freq pc suffix freq pc suffix ---- -- ------------------ ---- -- ------------------ 1853 31 -y 1173 48 -y 1028 17 -ol 250 10 -or 881 15 -or 202 8 -ol 456 8 -o 179 7 -oiin 370 6 -oiin 122 5 -oy 266 5 -oy 96 4 - 136 2 - 73 3 -o 130 2 -om 47 2 -om 96 2 -ooiin 34 1 -os 84 1 -oly 33 1 -oin 77 1 -s 31 1 -oly 76 1 -os 31 1 -s 53 1 -oin 20 1 
-oir 44 1 -ory 11 1 -ooiin 40 1 -oor 10 0 -oor 35 1 -on 9 0 -ory 28 1 -ool 6 0 -orom 12 0 -n 6 0 -oror 12 0 -ols 6 0 -yy 12 0 -yy 5 0 -ool The midfixes are still very different: by Friedman by Friedman language A language B freq pc midfix freq pc midfix ---- -- ------------------ ---- -- ------------------ 1090 18 -tch- 590 24 -t- 1045 18 -ch- 274 11 -te- 913 15 -t- 172 7 -ch- 526 9 -sh- 163 7 -che- 251 4 -che- 141 6 -tch- 191 3 -tche- 135 6 -tee- 181 3 -pch- 110 5 -she- 155 3 -te- 79 3 -sh- 142 2 -she- 64 3 -tche- 131 2 -tee- 57 2 -chtch- 96 2 -chot- 48 2 -chet- 93 2 -tsh- 48 2 -p- 69 1 -chtch- 46 2 -pch- 61 1 -chotch- 39 2 -pche- 60 1 -p- 24 1 -shee- 58 1 -chee- 23 1 -chee- 50 1 -cht- 19 1 -cht- 43 1 -pche- 18 1 -ee- 36 1 -shee- 18 1 -tsh- 30 1 -tchee- 18 1 -tshe- 25 0 -eee- 16 1 -chetch- And the tails, oh my: by Friedman by Friedman language A language B freq pc tailix freq pc tailix ---- -- ------------------ ---- -- ------------------ 379 6 -tchy 171 7 -tey 269 5 -chol 149 6 -tor 232 4 -chor 113 5 -tchy 195 3 -tol 108 4 -toiin 192 3 -tchol 99 4 -teey 191 3 -tchor 97 4 -chey 182 3 -ty 89 4 -tol 163 3 -toiin 73 3 -chy 154 3 -chy 63 3 -ty 121 2 -tchey 58 2 -shey 116 2 -sho 52 2 -tchey 114 2 -tor 49 2 -chtchy 104 2 -shol 30 1 -shy 95 2 -tcho 30 1 -tom 88 2 -chey 28 1 -pchey 83 1 -shy 25 1 -pchy 82 1 -shor 25 1 -teoy 75 1 -teey 23 1 -chety 61 1 -cho 23 1 -toin 58 1 -cheor 22 1 -toly 58 1 -choiin 21 1 -teol 53 1 -tchoy 20 1 -chol 47 1 -chotchy 18 1 -toy 45 1 -choy 17 1 -sheey 45 1 -shey 16 1 -chetchy 44 1 -teol 16 1 -chor 43 1 -chtchy 16 1 -choy 42 1 -pchy 14 1 -teo The unifixes are rather OK, I think, except for the inversion between "oiin" and "or": by Friedman by Friedman language A language B freq pc unifix freq pc unifix ---- -- ------------------ ---- -- ------------------ 441 24 oiin 149 19 or 175 10 or 126 16 oiin 145 8 ol 75 10 ol 107 6 y 35 4 y 88 5 s 25 3 soiin 77 4 oin 22 3 om 55 3 om 18 2 oroiin 40 2 soiin 17 2 oly 31 2 ooiin 16 2 oy 30 2 sor 13 2 
oloiin 28 2 oir 12 2 oin 28 2 sol 12 2 olor 25 1 o 12 2 s 25 1 sy 10 1 ory 20 1 qooiin 9 1 ooiin The words as a whole are rather different: by Friedman by Friedman language A language B freq pc wordix freq pc wordix ---- -- ------------------ ---- -- ------------------ 441 6 oiin 149 5 or 247 3 chol 126 4 oiin 201 3 chor 80 3 chey 182 2 tchy 75 2 ol 175 2 or 64 2 chy 145 2 ol 56 2 qotey 126 2 chy 54 2 otey 108 1 sho 54 2 otor 107 1 tchol 53 2 shey 107 1 y 49 2 otoiin 104 1 tchor 48 2 chtchy 101 1 qotchy 44 1 otol 100 1 shol 43 1 tchy 88 1 s 35 1 qotor 79 1 otol 35 1 y 79 1 shor 33 1 tor 77 1 oin 31 1 oteey 77 1 oty 30 1 oty
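
All the frequency tables in these notes follow the same recipe: tally the items, sort by decreasing count, divide each count by the total, and print the rounded percentage. The Python sketch below illustrates that recipe; it is not the original compute-freqs tool, whose behaviour is only inferred here from its output, and the names `freq_table` and `format_rows` as well as the alphabetical tie-breaking are assumptions.

```python
from collections import Counter

def freq_table(items):
    """Return (count, relative frequency, item) rows, most frequent first."""
    tally = Counter(items)
    total = sum(tally.values())
    rows = [(n, n / total, item) for item, n in tally.items()]
    rows.sort(key=lambda r: (-r[0], r[2]))  # count descending, then alphabetical
    return rows

def format_rows(rows):
    """Format rows like the gawk script: count, rounded percentage, item."""
    return ["%4d %2d %s" % (n, int(f * 100 + 0.5), item) for n, f, item in rows]

# Hypothetical toy input, not taken from the actual data files.
for line in format_rows(freq_table(["de", "de", "shi4", "ren2"])):
    print(line)
```

Because every percentage is rounded separately, a "pc" column built this way need not add up to exactly 100.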