I took the list of good words and examined the contexts of all
"gallows" letters:

    cat bio-j-jsa.wds \
      | jsa2hoc \
      | enum-contexts -vCTX=1 -vPAT='[HP]' \
      | wfreq4

     710 AHc
     361 AHa
     320 cHc
     319 oHc
     170 oHa
     146 eHc
     108 Hc
      90 AH9
      80 ?Hc
      72 cH9
      57 9Hc

Statistics of \c/ strings:

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vCTX=0 -vPAT='cc*' \
      | wfreq4

     860 c
    1317 cc
     884 ccc
     145 cccc
       1 ccccc

These numbers suggest that \cc/ may be a single letter.
Let's look at the frequencies of single \c/:

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vCTX=00 -vPAT='[^c]c[^c]' \
      | wfreq4

     399 Hc8        9 qcH        4 zce        1 rca
     247 Hc9        8 zca        2 acH        1 qc9
      30 zc8        8 zcH        2 Pca        1 Pco
      25 zco        7 Hcz        2 Pc9        1 Pc8
      23 Hco        6 Hca        2 8c8        1 9cP
      13 zc9        6 AcH        1 zcz        1 9cH
      10 ocH

It seems that single \c/s are all associated with \z/, \H/, or \P/.
I will try to remove the cHc and cPc sequences and see what happens:

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | sed -e 's/c[HP]c/X/g' \
      | enum-contexts -vCTX=00 -vPAT='[^c]c[^c]' \
      | wfreq4

     380 Hc8       25 zco        4 zce        1 qc9
      78 Hc9       19 Hco        2 qcH        1 Xcz
      60 zcX       13 zc9        2 Hca        1 Xco
      30 zc8        8 zca        2 8c8        1 AcX
      30 Xc9        7 Hcz        1 zcz        1 9cP
      29 Xc8        5 zcH        1 rca

It seems that \c8/ and \c9/ are also major sources of single \c/s:

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | sed \
          -e 's/c[HP]c/X/g' \
          -e 's/zc/Z/g' \
          -e 's/c8/K/g' \
          -e 's/c9/W/g' \
      | enum-contexts -vCTX=00 -vPAT='[^c]c[^c]' \
      | wfreq4

     315 HcK        9 Zca        2 rcW        1 Zcz
     166 HcW        6 XcK        2 qcH        1 ZcP
      52 ZcK        6 AcK        2 XcW        1 Xcz
      48 ZcX        5 PcK        2 Hcz        1 Xco
      40 Zco        5 HcZ        2 Hca        1 PcW
      34 ZcW        5 8cK        1 rca        1 AcX
      25 ZcH        3 qcK        1 qcW        1 AcW
      19 Hco        3 KcW        1 ocW        1 9cP
      15 ecK        3 8cW        1 ocK        1 9cK
      14 ecW

Not very illuminating.
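The tools enum-contexts and wfreq4 used above are local scripts whose source is not reproduced in these notes. The following is only a rough sketch of what the pipelines appear to compute. It assumes that enum-contexts prints each match of PAT together with up to CTX neighbouring characters on each side, truncated at word edges (an assumption inferred from the outputs above). The function name enum_contexts_sketch is hypothetical, and this version only finds non-overlapping matches, which may differ from the real tool.

```shell
# Hypothetical stand-in for enum-contexts; NOT the original script.
# Usage: enum_contexts_sketch PATTERN CTX   (reads one word per line)
# Prints each leftmost-longest, non-overlapping match of PATTERN with
# up to CTX characters of context on each side, cut off at word edges.
enum_contexts_sketch() {
  awk -v PAT="$1" -v CTX="$2" '
    {
      w = $0; pos = 1
      while (pos <= length(w)) {
        rest = substr(w, pos)
        # stop when there is no further (non-empty) match in this word
        if (!match(rest, PAT) || RLENGTH == 0) break
        start = pos + RSTART - 1            # match start within the word
        len = RLENGTH
        lo = start - CTX; if (lo < 1) lo = 1
        hi = start + len - 1 + CTX; if (hi > length(w)) hi = length(w)
        print substr(w, lo, hi - lo + 1)
        pos = start + len                   # resume after this match
      }
    }'
}
```

With this sketch, a tally similar to the ones above could be obtained with `cat bio-j-jsa-gut.wds | jsa2hoc | enum_contexts_sketch 'cc*' 1 | sort | uniq -c | sort -nr`, where `sort | uniq -c | sort -nr` stands in for wfreq4 (which additionally seems to lay the counts out in columns).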
Let's examine again the \c/ strings in context:

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vCTX=1 -vPAT='cc*' \
      | wfreq4

     401 Hc8         11 8ccc8        3 Hcccc9       1 ocP
     347 Hcc8        11 ccca         3 8cco         1 eccz
     322 zcc8        10 cccz         3 8ccco        1 eccccz
     255 Hc9         10 cP           3 8c8          1 ecccco
     200 Hcc9         9 zcca         3 ccccz        1 ecccP
     192 ccc8         9 qcH          3 ccP          1 ecca
     116 zcc9         9 Pcc8         2 zcccP        1 eccP
      97 eccc8        9 cce          2 zc           1 eccH
      90 cccH         8 zca          2 rcccH        1 ecc
      82 zccH         8 zcH          2 rcc9         1 Pco
      67 ccc9         8 Pcc9         2 occc8        1 Pcca
      59 zcccH        7 Hcz          2 ecce         1 Pc8
      53 Pccc8        7 8cc9         2 eccca        1 HccccH
      52 zccc8        7 cccP         2 acH          1 Hc
      51 ccccH        6 rccc9        2 Pcccc9       1 Accc9
      42 eccc9        6 rccc8        2 Pca          1 AccH
      40 zcco         6 eccccH       2 Pc9          1 Acc9
      37 Hccc8        6 Hca          2 Hccco        1 9ccco
      34 zccc9        6 Acc8         2 Hcccc8       1 9ccccz
      31 zc8          6 AcH          2 Accc8        1 9cccc9
      31 cccc9        6 9ccc8        2 9ccccH       1 9cccc8
      26 zco          6 cc9          1 zcz          1 9ccc9
      25 Hco          5 eccco        1 zccz         1 9cc8
      25 ccco         5 ecccc8       1 zccccH       1 9cP
      24 Hccc9        5 8cc8         1 zcccc9       1 9cH
      22 cc8          4 zce          1 rccz         1 8cccco
      20 cccc8        4 zcccz        1 rcccz        1 8cccc9
      17 cco          4 zccP         1 rcccc9       1 8cccc8
      17 cH           4 ecccc9       1 rcccc8       1 8cccH
      15 ecc8         4 Pccco        1 rccca        1 8ccc9
      14 zc9          4 Hccz         1 rca          1 8c9
      14 ecc9         4 Hcca         1 qccc8        1 cccco
      14 ccH          3 zccco        1 qcc9         1 ccccco
      13 cca          3 qcc8         1 qc9          1 cccca
      11 ocH          3 ecco         1 occca        1 ccccP
      11 Pccc9        3 ecccH        1 occ9         1 ca
      11 Hcco         3 Pcco         1 occ8

Let's look more closely at the \Hc*8/ strings:

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vCTX=1 -vPAT='Hcc*8' \
      | wfreq4

     198 AHc89       14 Hccc89       2 AHcccc89     1 cHc8
     184 AHcc89      14 Hcc89        2 9Hcc8a       1 AHccc8
      85 oHc89       11 AHccc89      2 8Hc89        1 AHcc8a
      59 oHcc89       9 oHc8a        2 Hcc8         1 AHc8z
      31 cHcc89       6 oHccc89      1 oHcc8a       1 AHc8o
      29 eHc89        5 AHcc8        1 oHc8c        1 AHc8c
      27 eHcc89       5 AHc8a        1 oHc8         1 Hccc8a
      26 Hc89         3 oHcc8        1 eHcc8        1 Hcc8o
      20 cHc89        3 AHc8         1 eHc8c        1 Hcc8a
      15 9Hc89        2 eHccc89      1 eHc8         1 Hc8a
      14 9Hcc89       2 cHccc8       1 cHcc8c

Curiously, the string \Hc*8/ is almost always followed by \9/.
I wonder if that is true in general:

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=0 -vRCTX=2 -vPAT='c8' \
      | wfreq4

    1526 c89        5 c8oe       2 c8av       1 c89z
      45 c8         4 c8cc       1 c8zc       1 c89o
      24 c8ar       4 c8af       1 c8c9       1 c89H
      22 c8ae       3 c8or       1 c8c8       1 c89A
      10 c8am       3 c89e       1 c8an

It does seem that \c8/ is almost always followed by \9/, and
occasionally by end-of-word. Let's look at the left context:

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=1 -vRCTX=0 -vPAT='cc*8' \
      | wfreq4

     401 Hc8        31 zc8         6 9ccc8       1 rcccc8
     347 Hcc8       22 cc8         5 ecccc8      1 qccc8
     322 zcc8       20 cccc8       5 8cc8        1 occ8
     192 ccc8       15 ecc8        3 qcc8        1 Pc8
      97 eccc8      11 8ccc8       2 occc8       1 9cccc8
      53 Pccc8       9 Pcc8        2 Hcccc8      1 9cc8
      52 zccc8       6 rccc8       2 Accc8       1 8cccc8
      37 Hccc8       6 Acc8        2 8c8         1 c8

Let's look at four \c/s:

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=0 -vRCTX=1 -vPAT='cccc' \
      | wfreq4

      61 ccccH      30 cccc8       3 cccco       1 cccca
      44 cccc9       5 ccccz       1 ccccc       1 ccccP

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=1 -vRCTX=0 -vPAT='cccc' \
      | wfreq4

     109 cccc        6 Hcccc       3 8cccc       2 rcccc
      17 ecccc       5 9cccc       2 zcccc       2 Pcccc

Let's examine the contexts of \zc/:

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=1 -vRCTX=0 -vPAT='zc' \
      | wfreq4

     580 zc         31 8zc        13 rzc         5 czc
     108 ezc        19 9zc         9 zzc         4 ozc
      41 Hzc        14 Pzc

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=0 -vRCTX=1 -vPAT='zc' \
      | wfreq4

     730 zcc        14 zc9         8 zcH         2 zc
      31 zc8         8 zca         4 zce         1 zcz
      26 zco

It is time to change my ad-hoc encoding to reflect the
consonant/vowel theory (as suggested by Grove).

We have identified several categories of symbols:

  \iiiu/, \iiu/, \iis/, \ij/
      The ziggies: strictly final, always preceded by \ci/ or,
      more rarely, by \o/.
  \cy/
      Almost always final, but occasionally followed by other
      letters. Preceded by about the same letters as \ci/; indeed,
      it is probably the final form of \ci/.
  \cg/
      May be followed by many letters, most often \cy/ and \ci/.
      Almost always preceded by \c/, or initial; rarely by \ix/
      or \o/.
  \cs/
      Most often followed by \c/, somewhat less often by \o/,
      \ci/, or word break. Most often initial, but also preceded
      by \ix/, gallows, \c/, \cy/, \cg/, \is/.
  \lg/, \qg/, \lj/, \qj/
      The capitals: very similar to each other, different from
      the rest. Probably to be combined with \c/ on both sides.
  \qo/
      Strictly initial, almost always followed by a capital.
  \ix/
      Usually initial or preceded by \ci/ or \o/; followed by any
      letter except ziggies and \qo/, \ix/, \is/.
  \is/
      Similar to \ix/ except that it cannot be followed by
      capitals or \cg/, either.
  \ci/
      May be followed only by the ziggies, \ix/, or \ir/. Often
      follows a capital, but also \cg/, \cs/, \c/, \ix/, \is/, or
      word break.
  \o/
      Similar to \a/, but is very often word-initial.

With these considerations, I defined a new encoding, "hic":

    jsa2hic
    -------------------------------------------------
    #! /n/gnu/bin/sed -f
    # Recoding superanalytic to ad-hoc encoding:
    s/ij/J/g
    s/ix/I/g
    s/ci/S/g
    s/iiiu/M/g
    s/iiu/N/g
    s/iis/L/g
    s/is/C/g
    s/csc/Z/g
    s/cg/U/g
    s/cy/S/g
    s/qo/W/g
    s/qj/A/g
    s/qg/E/g
    s/lj/Y/g
    s/lg/O/g
    s/o/R/g
    s/cc/T/g
    -------------------------------------------------

    extract-words-from-interlin \
      -recode jsa2hic \
      -chars "AEIOUYWSRTMNLc" \
      bio-j-jsa.evt \
      bio-j-hic

     lines   words     bytes file
    ------ ------- --------- ------------
      7054    7054     41600 bio-j-hic.wds
      1934    1934     15117 bio-j-hic.dic
      4626    4626     26296 bio-j-hic-gut.wds
       800     800      5150 bio-j-hic-gut.dic
       875     875      2655 bio-j-hic-fun.wds
        33      33       189 bio-j-hic-fun.dic
      1553    1553     12649 bio-j-hic-bad.wds
      1101    1101      9778 bio-j-hic-bad.dic

Digraph counts (row: symbol; column: next symbol; "." as a row or
column label is the word boundary, "." as an entry is zero):

             .     S     T     R     L     N     Z     H     Q     A     E     O     U     c   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    .       .   161   965   282    78   215   727   362  1215     .     .     .     .   621  4626
    S    2781     1     3   410   274    69    24    11     2    33    19   108   328    18  4081
    T      14     1     .   758   170   502     6    11     4     4     1     1     9    17  1498
    Q       7     4     1   134     7  1044     .     7     1     .     .     1     .    17  1223
    R     824   113   105     3     2   190   115    53     1     .     .     .     .   205  1611
    L     396    64    36     .     .     .    13     2     .     .     .     .     .    21   532
    N      12   807   110     6     .     .    55     6     .     .     .     .     .  1420  2416
    Z      41    72    63     1     .     3     9     3     .     .     .     .     .   823  1015
    H      49  1939    37     3     1     4    31     1     .     .     .     .     .    37  2102
    A      36     .     1     .     .     .     .     .     .     .     .     .     .     .    37
    E      19     1     .     .     .     .     .     .     .     .     .     .     .     .    20
    O     109     .     1     .     .     .     .     .     .     .     .     .     .     .   110
    U     334     3     .     .     .     .     .     .     .     .     .     .     .     .   337
    c       4   915   176    14     .   389    35  1646     .     .     .     .     .  3509  6688
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  4626  4081  1498  1611   532  2416  1015  2102  1223    37    20   110   337  6688 26296

Next-symbol probability (× 99):

             .     N     R     L     S     T     Z     Q     H     A     E     O     U     c   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    .       .     5     6     2     3    21    16    26     8     .     .     .     .    13    99
    N       .     .     .     .    33     5     2     .     .     .     .     .     .    58    99
    R      51    12     .     .     7     6     7     .     3     .     .     .     .    13    99
    L      74     .     .     .    12     7     2     .     .     .     .     .     .     4    99
    S      67     2    10     7     .     .     1     .     .     1     .     3     8     .    99
    T       1    33    50    11     .     .     .     .     1     .     .     .     1     1    99
    Z       4     .     .     .     7     6     1     .     .     .     .     .     .    80    99
    Q       1    85    11     1     .     .     .     .     1     .     .     .     .     1    99
    H       2     .     .     .    91     2     1     .     .     .     .     .     .     2    99
    A      96     .     .     .     .     3     .     .     .     .     .     .     .     .    99
    E      94     .     .     .     5     .     .     .     .     .     .     .     .     .    99
    O      98     .     .     .     .     1     .     .     .     .     .     .     .     .    99
    U      98     .     .     .     1     .     .     .     .     .     .     .     .     .    99
    c       .     6     .     .    14     3     1     .    24     .     .     .     .    52    99
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT    17     9     6     2    15     6     4     5     8     0     0     0     1    25 26296

Previous-symbol probability (× 99):

             .     N     R     L     S     T     Z     Q     H     A     E     O     U     c   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    .       .     9    17    15     4    64    71    98    17     .     .     .     .     9    17
    N       .     .     .     .    20     7     5     .     .     .     .     .     .    21     9
    R      18     8     .     .     3     7    11     .     2     .     .     .     .     3     6
    L       8     .     .     .     2     2     1     .     .     .     .     .     .     .     2
    S      60     3    25    51     .     .     2     .     1    88    94    97    96     .    15
    T       .    21    47    32     .     .     1     .     1    11     5     1     3     .     6
    Z       1     .     .     .     2     4     1     .     .     .     .     .     .    12     4
    Q       .    43     8     1     .     .     .     .     .     .     .     1     .     .     5
    H       1     .     .     .    47     2     3     .     .     .     .     .     .     1     8
    A       1     .     .     .     .     .     .     .     .     .     .     .     .     .     0
    E       .     .     .     .     .     .     .     .     .     .     .     .     .     .     0
    O       2     .     .     .     .     .     .     .     .     .     .     .     .     .     0
    U       7     .     .     .     .     .     .     .     .     .     .     .     .     .     1
    c       .    16     1     .    22    12     3     .    78     .     .     .     .    52    25
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT    99    99    99    99    99    99    99    99    99    99    99    99    99    99 26296

    cat bio-j-jsa.wds \
      | /n/gnu/bin/awk -f foo.awk \
      | sort | uniq -c | sort -nr | expand \
      > bio-j-jsa-wpairs.freq
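
The scripts that produced the digraph and next-symbol tables are not reproduced in these notes. A minimal sketch of how such numbers could be recomputed from a word list follows. It assumes one word per line and "." as the word-boundary marker, and it rounds to the nearest integer after scaling by 99. The rounding rule and the function name next_symbol_table are assumptions, not taken from the original tools.

```shell
# Hypothetical sketch; NOT the script used for the tables above.
# Reads one word per line and prints, for every observed digraph,
#   FIRST NEXT COUNT SCALED
# where SCALED = round(99 * COUNT / total successors of FIRST)
# and "." marks the word boundary on either side of each word.
next_symbol_table() {
  awk '
    length($0) == 0 { next }             # skip empty lines
    {
      w = "." $0 "."
      for (i = 1; i < length(w); i++) {
        a = substr(w, i, 1); b = substr(w, i + 1, 1)
        pair[a SUBSEP b]++               # times a is followed by b
        tot[a]++                         # times a has any successor
      }
    }
    END {
      for (k in pair) {
        split(k, ab, SUBSEP)
        printf "%s %s %d %d\n", ab[1], ab[2], pair[k], \
               int(99 * pair[k] / tot[ab[1]] + 0.5)
      }
    }' | LC_ALL=C sort
}
```

For example, `next_symbol_table < bio-j-hic-gut.wds` would list one digraph per line; arranging the result into a square table like the ones above is left out of the sketch.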