Restarted everything from the beginning.

    cat bio-m-evt.evt \
      | fsg2jsa \
      > bio-m-jsa.evt

Prepared a raw text file for training:

    cat bio-m-jsa.evt \
      | egrep '^<.*;[FC]> ' \
      | sed \
          -e 's/<.*;[FC]> *//g' \
          -e 's/{[^}]*}//g' \
      > bio-m-jsa.txt

Note that fsg2jsa removes the "%" and "!" characters, so the lines in
the *-jsa.evt output files are no longer aligned (realigning them would
require some dynamic programming).

    cat bio-m-jsa.txt \
      | grep -v '[*]' \
      | sed -e 's/^/ /g' \
      > bio-m-jsa-trainset.txt

    cat bio-m-jsa-trainset.txt \
      | generate-fix-patterns -vMINOCC=10 \
      > bio-m-jsa-fixer.sed

     lines   words     bytes file
    ------ ------- --------- ------------
      1669    3486    106719 bio-m-evt.evt
      1530    1530     75404 bio-m-evt.txt
      1530    1530    117592 bio-m-jsa.txt
      1470    1470    115821 bio-m-jsa-trainset.txt
       596     716      9932 bio-m-jsa-fixer.sed

Generate consensus:

    cat bio-m-jsa.evt \
      | make-consensus-interlin \
      > bio-x-jsa.evt

    cat bio-x-jsa.evt \
      | egrep '^<.*;J> ' \
      | sed \
          -e 's/{[^}]*}//g' \
      > bio-j-jsa.evt

    extract-words-from-interlin \
      -chars "qocilgysxju" \
      bio-j-jsa.evt \
      bio-j-jsa

     lines   words     bytes file
    ------ ------- --------- ------------
      7054    7054     62690 bio-j-jsa.wds
      2132    2132     24925 bio-j-jsa.dic
      4661    4661     40897 bio-j-jsa-gut.wds
       992     992      9720 bio-j-jsa-gut.dic
       840     840      2445 bio-j-jsa-fun.wds
         2       2         5 bio-j-jsa-fun.dic
      1553    1553     19348 bio-j-jsa-bad.wds
      1138    1138     15200 bio-j-jsa-bad.dic

Digraph counts (rows: first symbol; columns: second symbol; the
unlabeled row and column stand for the word boundary):

                  q     o     c     i     l     g     y     s     x     j     u   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .  1398   965  1877   361    60     .     .     .     .     .     .  4661
    q       1     .  1229    18     .     1   154     .     .     .   700     .  2103
    o      21   486     1    63  1087  1071     .     .     .     .     .     .  2729
    c       4   167   176  6137  1209   232  2114  2921  1019     .     .     . 13979
    i       4     1     1     8  1997     2     .     .   560  1616    37   457  4683
    l       .     .     .     .     .     .    16     .     .     .  1566     .  1582
    g      52     .    74  2150     4     4     .     .     .     .     .     .  2284
    y    2790    26     2    47    13    43     .     .     .     .     .     .  2921
    s     463     1    99  1013     1     2     .     .     .     .     .     .  1579
    x     827    24   105   488     5   167     .     .     .     .     .     .  1616
    j      46     .    76  2175     6     .     .     .     .     .     .     .  2303
    u     453     .     1     3     .     .     .     .     .     .     .     .   457
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  4661  2103  2729 13979  4683  1582  2284  2921  1579  1616  2303   457 40897

Word length statistics:

    cat bio-j-jsa-gut.wds \
      | tr 'a-z0-9' '..........................................................................' \
      | sort | uniq -c

       2 .
      21 ..
     177 ...
     176 ....
     295 .....
     568 ......
     640 .......
    1021 ........
     793 .........
     627 ..........
     184 ...........
      91 ............
      40 .............
      12 ..............
      11 ...............
       1 ................
       2 .................

Ditto, removing limb letters:

    cat bio-j-jsa-gut.wds \
      | tr -d 'gysxju' \
      | tr 'a-z0-9' '..........................................................................' \
      | sort | uniq -c

      19 .
     227 ..
     334 ...
     638 ....
    1115 .....
    1290 ......
     685 .......
     263 ........
      72 .........
      13 ..........
       3 ...........
       2 ............

Treating limbs as end-of-words:

    cat bio-j-jsa-gut.wds \
      | sed -e 's/\([gysxju]\)/\1 /g' \
      | tr ' ' '\012' \
      | egrep '.' \
      | sort | uniq -c \
      | sort +0 -1nr \
      | pr -4 -t -s' ' \
      | expand

    2058 cy           27 ciij          4 ciiu          1 ciiiis
     973 cs           24 ccccqj        4 cix           1 ciocs
     821 qolj         20 qoqg          4 ocqj          1 ciqj
     721 cccg         19 ciiis         4 oij           1 coclj
     628 oix          16 ccois         4 olg           1 cocqg
     454 ccccg        15 ccccs         4 qoclj         1 ij
     438 cg           15 cqj           3 c             1 occcciiiu
     436 ccg          13 ccciix        3 ccciij        1 occcg
     374 ciix         12 ccciis        3 ci            1 occcy
     353 cccy         12 o             3 colj          1 ocy
     327 ciiiiu       11 ccix          3 qcccg         1 oiis
     303 ix           11 clj           2 ccccois       1 oiiu
     272 ccy          10 cccqg         2 ccis          1 oqo
     269 lj            9 cois          2 ccocg         1 oqoix
     245 ciis          9 cqg           2 ccolj         1 oqolg
     237 olj           9 ocg           2 ccoqj         1 oqolj
     219 oqj           9 oiiiu         2 cicg          1 q
     205 qoqj          8 cccciix       2 ciiix         1 qccccg
     186 ccccy         8 cciis         2 cilj          1 qcccy
     138 ois           8 ccs           2 cis           1 qccy
     134 qj            7 cccs          2 clg           1 qci
     133 qoix          7 cciix         2 co            1 qcs
     108 ciiiu         7 ccqg          2 occccg        1 qcy
     101 ccclj         7 lg            2 qclj          1 qlj
      86 is            7 qcqj          2 qoccccg       1 qoccccy
      71 qg            7 qocg          2 qocqj         1 qocccy
      65 cclj          7 qois          1 cc            1 qocclj
      55 ccoix         6 cccois        1 ccccciiii     1 qociiiiu
      54 cccqj         6 oclj          1 cccccoix      1 qociis
      44 cccccy        6 qo            1 cccciij       1 qociix
      37 cccclj        6 qocccg        1 ccccoix       1 qocy
      37 cccoix        5 cccccs        1 ccccqg        1 qoiiu
      36 coix          5 cccciis       1 cciij         1 qolg
      35 oqg           5 ocs           1 cclg          1 qooix
      32 ccqj          4 cics          1 ciclj         1 qoqolj
      30 cccccg        4 ciiiiiu       1 cicqj

Obviously, not all "letters" or "syllables" end with limb strokes; some
"cc" groups must be broken, too. (Many are probably the [T] and [S]
characters.)

Here is the table again, sorted in reverse-lex order:

    cat bio-j-jsa-gut.wds \
      | sed -e 's/\([gysxju]\)/\1 /g' \
      | tr ' ' '\012' \
      | egrep '.' \
      | revbytes | sort | revbytes | uniq -c \
      | pr -4 -t -s' ' \
      | expand

       3 c             3 ccciij        2 co            4 ciiiiiu
       1 cc            1 cccciij       6 qo            9 oiiiu
     438 cg            4 oij           1 oqo           1 oiiu
     436 ccg         269 lj            1 q             1 qoiiu
     721 cccg         11 clj         973 cs          303 ix
     454 ccccg        65 cclj          8 ccs           4 cix
      30 cccccg      101 ccclj         7 cccs         11 ccix
       2 occccg       37 cccclj       15 ccccs       374 ciix
       2 qoccccg       1 qocclj        5 cccccs        7 cciix
       1 qccccg        1 ciclj         4 cics         13 ccciix
       1 occcg         6 oclj          5 ocs           8 cccciix
       6 qocccg        1 coclj         1 ciocs         1 qociix
       3 qcccg         4 qoclj         1 qcs           2 ciiix
       2 cicg          2 qclj         86 is          628 oix
       9 ocg           2 cilj          2 cis          36 coix
       2 ccocg       237 olj           2 ccis         55 ccoix
       7 qocg          3 colj        245 ciis         37 cccoix
       7 lg            2 ccolj         8 cciis         1 ccccoix
       2 clg         821 qolj         12 ccciis        1 cccccoi
       1 cclg          1 oqolj         5 cccciis       1 qooix
       4 olg           1 qoqolj        1 qociis      133 qoix
       1 qolg          1 qlj          19 ciiis         1 oqoix
       1 oqolg       134 qj            1 ciiiis     2058 cy
      71 qg           15 cqj           1 oiis        272 ccy
       9 cqg          32 ccqj        138 ois         353 cccy
       7 ccqg         54 cccqj         9 cois        186 ccccy
      10 cccqg        24 ccccqj       16 ccois        44 cccccy
       1 ccccqg        1 cicqj         6 cccois        1 qoccccy
       1 cocqg         4 ocqj          2 ccccois       1 occcy
      35 oqg           2 qocqj         7 qois          1 qocccy
      20 qoqg          7 qcqj          4 ciiu          1 qcccy
       3 ci            1 ciqj        108 ciiiu         1 qccy
       1 qci         219 oqj           1 occccii       1 ocy
       1 ij            2 ccoqj       327 ciiiiu        1 qocy
      27 ciij        205 qoqj          1 cccccii       1 qcy
       1 cciij        12 o             1 qociiii

Let's analyze the family {i,ii,iii,iiii}{u,s,j,x} specifically:

    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/cs/z/g' \
          -e 's/\([^i]i\)/ \1/g' \
          -e 's/i\([^iusxj]\)/i\1 \1/g' \
          -e 's/\([usjx]\)/\1 \1/g' \
          -e 's/z/cs/g' \
      | tr ' ' '\012' \
      | egrep 'i' \
      | revbytes | sort | revbytes \
      | uniq -c | expand

Results:

    |  32 ciij |   1 gis    |   4 ciiu    | 281 ix    |   4 cic
    |   4 oij  |  79 is     | 109 ciiiu   |  14 cix   |   1 i
    |   1 yij  |   2 xis    |   2 oiiu    |   8 yix   |   5 ci
    |          |   4 cis    | 329 ciiiiu  |   3 gix   |   3 oi
    |          |   4 yis    |   9 oiiiu   |   1 csix  |   2 cil
    |          | 271 ciis   |   4 ciiiiiu |   6 jix   |   1 cio
    |          | 178 ois    |             |   3 xix   |   1 ciq
    |          |  19 ciiis  |             | 403 ciix  |   4 cics
    |          |   1 oiis   |             | 890 oix   |
    |          |   1 ciiiis |             |   2 ciiix |

Observations:

Note that these suffixes are almost always preceded by \c/ or \o/.
The exceptions are \ix/ and \is/, which are often initial and
occasionally preceded by \y/, \lj/, \qj/ or another \ix/.

Except for the \iu/ family, it seems that \ci/ and \o/ are
statistically equivalent. The peculiarity of the \iu/ family could be
explained by the fact that FSG has separate codes [M] and [N] for its
members, whereas the other families are denoted [IR], [IIR], [IIIR],
etc.

Identifying \o/ with \ci/, the most frequent members of these
families are

    \ciij/ (0.97 of the \ij/ family)
    \is/ (0.16)  \ciis/ (0.80)  \ciiis/ (0.04)
    \ciiiu/ (0.24)  \ciiiiu/ (0.74)
    \ix/ (0.19)  \ciix/ (0.78)

Let's check the hypothesis that \o/ ~ \ci/ by looking at the
distribution of adjacent letters. I will exclude the words that begin
with \qo/, since these appear to be special:

    cat bio-j-jsa-gut.wds \
      | egrep -v '^qo' \
      | sed \
          -e 's/[ql][jg]/p/g' \
          -e 's/ci/a/g' \
      | enum-trigraphs \
      | egrep '.[ao].' \
      > .ao.tri

    cat .ao.tri \
      | egrep '.a.' \
      > .a.tri

    cat .ao.tri \
      | egrep '.o.' \
      > .o.tri

     lines   words     bytes file
    ------ ------- --------- ------------
       855     855      3420 .a.tri
      1460    1460      5840 .o.tri
      2315    2315      9260 .ao.tri

    cat .a.tri \
      | tr ' ' '_' \
      | sed -e 's/\(.\).\(.\)/\1.\2 \1a\2/g' \
      | sort | uniq -c \
      > .a.frq

    cat .o.tri \
      | tr ' ' '_' \
      | sed -e 's/\(.\).\(.\)/\1.\2 \1o\2/g' \
      | sort | uniq -c \
      > .o.frq

    join -t' ' -a1 -a2 -e '000' -j1 2 -j2 2 -o1.3,1.1,2.3,2.1 .a.frq .o.frq \
      | expand \
      | gawk ' { printf " %3s %4.2f %3s %4.2f %4.2f\n", $1, ($2/855), $3, ($4/1460), ($2/855 + $4/1460) } ' \
      | sed -e 's/\b000/.../g' \
      | sort +4 -5nr

    _ai 0.07   _oi 0.31   0.39
    gai 0.35   goi 0.02   0.38
    pai 0.31   poi 0.05   0.36
    _ap 0.00   _op 0.33   0.33
    sai 0.13   soi 0.06   0.19
    cai 0.07   coi 0.10   0.17
    xai 0.03   xoi 0.06   0.09
    _ac 0.00   _oc 0.02   0.02
    ... 0.00   xo_ 0.01   0.01
    cax 0.01   ... 0.00   0.01

Nothing clear emerges. There may be confusion between \ci/ and \o/,
but the two do not seem to be equivalent.

In all, it seems that \ci/ and \o/ are lexically similar but distinct
letters. The valid \i/ sequences are \ij/ \is/ \iis/ \iiu/ \iiiu/
\ix/; the others are likely to be scribal or transcription errors.

So let's replace the \i/ sequences by distinct letters, then \ci/ by
`a', and look at what we get. Let's use this ad-hoc encoding:

    jsa2hoc
    -------------------------------------------------
    #! /n/gnu/bin/sed -f
    # Recoding superanalytic to ad-hoc encoding:
    s/ij/f/g
    s/ix/e/g
    s/ci/a/g
    s/iiiu/m/g
    s/iiu/n/g
    s/iis/v/g
    s/is/r/g
    s/cs/z/g
    s/cg/8/g
    s/cy/9/g
    s/qo/A/g
    s/qj/H/g
    s/qg/P/g
    s/lj/H/g
    s/lg/P/g
    -------------------------------------------------

    extract-words-from-interlin \
      -recode jsa2hoc \
      -chars "aocz89AHPermnvf" \
      bio-j-jsa.evt \
      bio-j-hoc

     lines   words     bytes file
    ------ ------- --------- ------------
      7054    7054     41600 bio-j-hoc.wds
      1995    1995     15504 bio-j-hoc.dic
      4626    4626     26296 bio-j-hoc-gut.wds
       858     858      5513 bio-j-hoc-gut.dic
       875     875      2655 bio-j-hoc-fun.wds
        33      33       189 bio-j-hoc-fun.dic
      1553    1553     12649 bio-j-hoc-bad.wds
      1104    1104      9802 bio-j-hoc-bad.dic

Digraph counts:

                  a     o     c     z     8     9     A     H     P     e     r     m     n     v     f   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .    66   965   621   727   362    95  1215   151    64   282    78     .     .     .     .  4626
    a       3     .     1     2     4     2     .     .     3     .   402   270   328   108    19    32  1174
    o      14     .     .    17     6    11     1     4   463    39   758   170     9     1     1     4  1498
    c       4    60   176  3509    35  1646   855     .   358    31    14     .     .     .     .     .  6688
    z      41    68    63   823     9     3     4     .     2     1     1     .     .     .     .     .  1015
    8      49   308    37    37    31     1  1631     .     4     .     3     1     .     .     .     .  2102
    9    2778     1     2    16    20     9     .     2    64     2     8     4     .     .     .     1  2907
    A       7     3     1    17     .     7     1     1  1022    22   134     7     .     1     .     .  1223
    H      10   583    74  1323    41     3   207     .     .     .     6     .     .     .     .     .  2247
    P       2    11    36    97    14     3     6     .     .     .     .     .     .     .     .     .   169
    e     824    32   105   205   115    53    81     1   180    10     3     2     .     .     .     .  1611
    r     396    42    36    21    13     2    22     .     .     .     .     .     .     .     .     .   532
    m     334     .     .     .     .     .     3     .     .     .     .     .     .     .     .     .   337
    n     109     .     1     .     .     .     .     .     .     .     .     .     .     .     .     .   110
    v      19     .     .     .     .     .     1     .     .     .     .     .     .     .     .     .    20
    f      36     .     1     .     .     .     .     .     .     .     .     .     .     .     .     .    37
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  4626  1174  1498  6688  1015  2102  2907  1223  2247   169  1611   532   337   110    20    37 26296

Next-symbol probability (× 99):

                  a     o     c     z     8     9     A     H     P     e     r     m     n     v     f   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .     1    21    13    16     8     2    26     3     1     6     2     .     .     .     .    99
    a       .     .     .     .     .     .     .     .     .     .    34    23    28     9     2     3    99
    o       1     .     .     1     .     1     .     .    31     3    50    11     1     .     .     .    99
    A       1     .     .     1     .     1     .     .    83     2    11     1     .     .     .     .    99
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    e      51     2     6    13     7     3     5     .    11     1     .     .     .     .     .     .    99
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    c       .     1     3    52     1    24    13     .     5     .     .     .     .     .     .     .    99
    z       4     7     6    80     1     .     .     .     .     .     .     .     .     .     .     .    99
    8       2    15     2     2     1     .    77     .     .     .     .     .     .     .     .     .    99
    H       .    26     3    58     2     .     9     .     .     .     .     .     .     .     .     .    99
    P       1     6    21    57     8     2     4     .     .     .     .     .     .     .     .     .    99
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    9      95     .     .     1     1     .     .     .     2     .     .     .     .     .     .     .    99
    r      74     8     7     4     2     .     4     .     .     .     .     .     .     .     .     .    99
    m      98     .     .     .     .     .     1     .     .     .     .     .     .     .     .     .    99
    n      98     .     1     .     .     .     .     .     .     .     .     .     .     .     .     .    99
    v      94     .     .     .     .     .     5     .     .     .     .     .     .     .     .     .    99
    f      96     .     3     .     .     .     .     .     .     .     .     .     .     .     .     .    99
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT    17     4     6    25     4     8    11     5     8     1     6     2     1     0     0     0 26296

Previous-symbol probability (× 99):

                  a     9     o     c     z     8     A     H     P     e     r     m     n     v     f   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .     6     3    64     9    71    17    98     7    37    17    15     .     .     .     .    17
    a       .     .     .     .     .     .     .     .     .     .    25    50    96    97    94    86     4
    o       .     .     .     .     .     1     1     .    20    23    47    32     3     1     5    11     6
    A       .     .     .     .     .     .     .     .    45    13     8     1     .     1     .     .     5
    c       .     5    29    12    52     3    78     .    16    18     1     .     .     .     .     .    25
    z       1     6     .     4    12     1     .     .     .     1     .     .     .     .     .     .     4
    8       1    26    56     2     1     3     .     .     .     .     .     .     .     .     .     .     8
    9      59     .     .     .     .     2     .     .     3     1     .     1     .     .     .     3    11
    H       .    49     7     5    20     4     .     .     .     .     .     .     .     .     .     .     8
    P       .     1     .     2     1     1     .     .     .     .     .     .     .     .     .     .     1
    e      18     3     3     7     3    11     2     .     8     6     .     .     .     .     .     .     6
    r       8     4     1     2     .     1     .     .     .     .     .     .     .     .     .     .     2
    m       7     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     1
    n       2     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     0
    v       .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     0
    f       1     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     0
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT    99    99    99    99    99    99    99    99    99    99    99    99    99    99    99    99 26296
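
The digraph and next-symbol tables above came from tools that are not shown in these notes. As a rough sketch only, counts of this kind could be recomputed from a word file with plain awk. The function names `digraph_counts` and `next_symbol_prob99` are made up for this illustration. The sketch assumes one word per line and one character per symbol, and it writes the word boundary as `_` instead of a blank.

```shell
# Hypothetical helpers, not the tools used in these notes.

# Reads one word per line; prints "count first second" for every
# adjacent symbol pair, with "_" standing for the word boundary.
digraph_counts() {
  awk '
    length($0) == 0 { next }               # skip empty lines
    {
      w = "_" $0 "_"                       # mark both word boundaries
      for (k = 1; k < length(w); k++)
        n[substr(w, k, 1) " " substr(w, k + 1, 1)]++
    }
    END { for (p in n) print n[p], p }
  ' | LC_ALL=C sort -k2,2 -k3,3
}

# Reads one word per line; prints "first second p", where p is the
# count scaled by 99 / (total for the first symbol), rounded to the
# nearest integer, in the spirit of the "x 99" tables.
next_symbol_prob99() {
  digraph_counts | awk '
    { cnt[NR] = $1; a[NR] = $2; b[NR] = $3; tot[$2] += $1 }
    END {
      for (k = 1; k <= NR; k++)
        printf "%s %s %d\n", a[k], b[k], int(99 * cnt[k] / tot[a[k]] + 0.5)
    }
  '
}
```

For example, `digraph_counts < bio-j-hoc-gut.wds` should print one line per nonzero cell of the count table, provided the file really has the one-word-per-line layout assumed here (its line and word counts are equal, which suggests it does).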