Summary of previous notebooks
=============================

  On 97-07-05 I obtained Landini's interlinear transcription of the VMs,
  version 1.6 (landini-interln16.evt) from

    http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip

  I manually extracted from it a homogeneous, full-text sample
  bio-m-evt.evt, consisting of pages 147-166 (f75r--f84v) of the
  "biological" section, in Currier's Language B, hand 2.

  This section includes Currier's and Friedman's transcriptions.
  Currier's seems to be the more complete of the two.

  The two versions have many differences (affecting 5-10% of the
  words), and often disagree even in the grouping of symbols: where
  one sees two words the other sees a single word, what is [A] for
  one may be [CI] for the other, and so on.

  So I decided to break all characters down to individual "logical"
  strokes, and use one (computer) character to encode each stroke.
  I called this new encoding "jsa" (Jorge's Super-Analytic).

  After mapping to jsa, I generated a "consensus" version of the
  biological section, and got the digraph counts below. (Rows give
  the first stroke of each pair, columns the second; the unlabeled
  first row and column stand for the word break, and "." means zero.)

                   q     o     c     i     l     g     y     s     x     j     u   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .  1398   965  1877   361    60     .     .     .     .     .     .  4661
      q     1     .  1229    18     .     1   154     .     .     .   700     .  2103
      o    21   486     1    63  1087  1071     .     .     .     .     .     .  2729
      c     4   167   176  6137  1209   232  2114  2921  1019     .     .     . 13979
      i     4     1     1     8  1997     2     .     .   560  1616    37   457  4683
      l     .     .     .     .     .     .    16     .     .     .  1566     .  1582
      g    52     .    74  2150     4     4     .     .     .     .     .     .  2284
      y  2790    26     2    47    13    43     .     .     .     .     .     .  2921
      s   463     1    99  1013     1     2     .     .     .     .     .     .  1579
      x   827    24   105   488     5   167     .     .     .     .     .     .  1616
      j    46     .    76  2175     6     .     .     .     .     .     .     .  2303
      u   453     .     1     3     .     .     .     .     .     .     .     .   457
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  4661  2103  2729 13979  4683  1582  2284  2921  1579  1616  2303   457 40897

  Some conclusions we get from this and other data:

    The valid \i/ sequences are \ij/ \is/ \iis/ \iiu/ \iiiu/ \ix/;
    the others are likely to be scription or transcription errors.

    \ci/ and \o/ are lexically similar but distinct glyphs.

    The suffixes \ij/, \is/, \iiu/, and \iiiu/ are preceded almost
    exclusively by \ci/ and are strictly word-final.

    It seems plausible that these are errors:

      \oij/     (  4 occurrences) should be \ciij/   ( 32 occurrences)
      \oiiu/    (  2 occurrences) should be \ciiiu/  (109 occurrences)
      \ciiu/    (  4 occurrences) should be \ciiiu/  (109 occurrences)
      \oiiiu/   (  9 occurrences) should be \ciiiiu/ (329 occurrences)
      \ciiiiiu/ (  4 occurrences) should be \ciiiiu/ (329 occurrences)
      \ciiix/   (  2 occurrences) should be \ciix/   (403 occurrences)

    \ciiis/ (19 occurrences) may also be a misreading of \ciis/
    (291 occurrences).

    \cg/ is always a glyph.

    \qo/ is a combination that occurs only in word-initial position.

    \qc/ is likely to be a misreading/miswriting of \qo/.

    \cy/ is always a glyph, almost certainly a final form of \ci/.

    \qj/, \lj/, \qg/, \lg/ are glyphs.

    \cs/ is a glyph closely related to (but distinct from) \c/.

    \ccg/ is almost always followed by \ci/ or \cy/.

  Here "glyph" means a group of strokes that can be treated as a
  single symbol for analysis; it may actually be part of a larger,
  still unrecognized symbol.

  Summarizing again:

    \iiiu/, \iiu/, \iis/, \ij/
      The ziggies: strictly final, preceded always by \ci/ or,
      more rarely, by \o/.

    \ix/
      Usually initial or preceded by \ci/ or \o/; followed by
      any letter except ziggies and \qo/, \ix/, \is/.

    \is/
      Similar to \ix/ except that it cannot be followed by
      capitals or \cg/, either.

    \cy/
      Almost always final, but occasionally followed by other
      letters. Preceded by about the same letters as \ci/;
      indeed, it is probably the final form of \ci/.

    \cg/
      May be followed by many letters, most often \cy/ and \ci/.
      Almost always preceded by \c/, or initial; rarely by \ix/
      or \o/.

    \cs/
      Most often followed by \c/, somewhat less often by \o/,
      \ci/, or word break. Most often initial, but also preceded
      by \ix/, gallows, \c/, \cy/, \cg/, \is/.

    \lj/, \qj/
      The H-gallows: Very similar to each other, different from
      the rest, but somewhat similar to the P-gallows. They
      probably combine with \c/ on both sides to make glyphs.
      It is very likely that \l/ and \q/ are exactly equivalent.

    \lg/, \qg/
      The P-gallows: Very similar to each other, different from
      the rest, but somewhat similar to the H-gallows. They
      probably combine with \c/ on both sides to make glyphs.
      It is very likely that \l/ and \q/ are exactly equivalent.
      They may be merely ornate forms of some letter, or several
      letters (\cg/, perhaps), used mainly in the first line of
      each paragraph (and perhaps of each page?)

    \qo/
      Strictly initial, almost always followed by a capital.
      Sometimes misread as \qc/?

    \ci/
      May be followed only by the ziggies, \ix/, or \ir/.
      Often follows a capital, but also \cg/, \cs/, \c/, \ix/,
      \is/, or word break.

    \o/
      Similar to \a/, but is very often word-initial.

  Other conclusions:

    * The manuscript does not appear to use any hyphenation mark.
      Either words are not broken across lines, which would be
      unusual, or they are broken without any extra marks. Such
      word breaks may result in statistical anomalies at the
      beginning and end of lines. Could this explain Currier's
      claim that lines are "functional units"?

    * Note that parsing sequences like \cij/, \ciis/, and \ciiis/
      requires some care: the right parsings are c+ij, c+iis,
      ci+iis.

    * The parsing of \ciis/ is ambiguous: ci+is or c+iis.
      Declaring \ciiis/ to be a misreading of \ciis/ would
      remove the ambiguity.

    * The parsing of \ciiiu/ is ambiguous, too; but since the
      \iu/ series does not seem to follow a bare \c/, it seems
      safe to parse it as ci+iiu.

    * The gallows characters \qj/ and \lj/ appear to be closely
      related: for every common word with \lj/, there appears to
      be a word with \qj/ that occurs with about 1/4 the
      frequency.

    * There seems to be a kinship between the glyphs \cs/ (when
      not attached to the following \c/s), \ir/, and the gallows
      \lj/ and \qj/ (also, when unattached).

    * The same phenomenon can be noted with respect to prefixes
      containing \cc/ and \csc/: for every word beginning with
      \cc/, there is a word where the first \cc/ is replaced by
      \csc/, with practically the same frequency.

    * There appears to be much confusion between the suffixes
      \iiu/ and \iiiu/. They are almost surely distinct letters,
      but in about one half of the cases, Currier sees \iiu/
      where Friedman has \iiiu/.

    * There appears to be much confusion between \o/ and \ci/.

  The strings of \c/, \cs/, \lj/, \qj/, \lg/, \qg/ must be treated
  together, after collapsing the glyphs listed above, since there
  seem to be glyphs consisting of gallows preceded and followed by
  \c/ or \cc/. When this is taken into account, we can see that a
  single \c/ is not a glyph, but \cs/ is.

  In fact, after shrinking \ci/ to `a', \cs/ to `z', and the gallows
  to `H' or `P', the only possible glyphs of the form [czHP]* with
  length at most 3 are

      freq  glyph
      ----  -----
       795  H
        52  P
       152  z
       138  cc
        70  zc
       482  Hc
       484  ccc
       439  zcc ?
       493  Hcc ?
        19  cHc
         4  cPc

  The ones marked `?' may be composite, z+cc and H+cc, but this
  hypothesis does not seem very likely (perhaps they are
  *sometimes* composite?)

  The significant strings of length 4 that cannot be parsed into
  the glyphs above are

        20  cHcc
         4  cPcc

  Strings with 4 or more [czHP]'s tend to be quite ambiguous.

  Looking at the raw texts, it seems that the main source of "?"s
  is the confusion between "M" and "N" by Currier and/or Friedman.
  So I decided to map both [N] and [M] (and other lookalikes) to "m".
  I christened the new encoding "hop".

--- fsg2hop ------------------------
#! /n/gnu/bin/gawk -f

# Recoding an interlinear file from the FSG alphabet to
# my Lossy Ad-hoc Semi-Analytic Fault-Tolerant encoding

BEGIN {
  print "# Output of fsg2hop - Stolfi's Semi-Analytic Fault-Tolerant alphabet"
}

# Blank lines, comments, and page/unit headers pass through unchanged.
/^ *$/ { print; next }

/^ *#/ { print; next }

/^<[^>.;]*>/ { print; next }

# Text lines: a 19-column location code followed by the text.
/^<[^>]*\.[^>]*;[A-Z]> / {
  curtxt = substr($0,20);

  # We discard "%" and "!" since the conversion
  # will destroy synchronism anyway.
  gsub(/[%!]/, "", curtxt);

  # First, the conversion from FSG to JSA (Stolfi's super-analytic).
  # Longer FSG sequences are replaced before their substrings.
  gsub(/IIIK/, "iiiij", curtxt);
  gsub(/IIIL/, "iiiiu", curtxt);
  gsub(/IIIR/, "iiiis", curtxt);
  gsub(/IIIE/, "iiiix", curtxt);
  gsub(/IIE/, "iiix", curtxt);
  gsub(/IIR/, "iiis", curtxt);
  gsub(/IIK/, "iiij", curtxt);
  gsub(/HZ/, "cqjc", curtxt);
  gsub(/PZ/, "cqgc", curtxt);
  gsub(/DZ/, "cljc", curtxt);
  gsub(/FZ/, "clgc", curtxt);
  gsub(/IE/, "iix", curtxt);
  gsub(/IR/, "iis", curtxt);
  gsub(/IK/, "iij", curtxt);
  gsub(/2/, "cs", curtxt);
  gsub(/4/, "q", curtxt);
  gsub(/6/, "cj", curtxt);
  gsub(/7/, "ig", curtxt);
  gsub(/8/, "cg", curtxt);
  gsub(/A/, "ci", curtxt);
  gsub(/C/, "c", curtxt);
  gsub(/D/, "lj", curtxt);
  gsub(/E/, "ix", curtxt);
  gsub(/F/, "lg", curtxt);
  gsub(/G/, "cy", curtxt);
  gsub(/H/, "qj", curtxt);
  gsub(/I/, "i", curtxt);
  gsub(/K/, "ij", curtxt);
  gsub(/L/, "iu", curtxt);
  gsub(/M/, "iiiu", curtxt);
  gsub(/N/, "iiu", curtxt);
  gsub(/O/, "o", curtxt);
  gsub(/P/, "qg", curtxt);
  gsub(/R/, "is", curtxt);
  gsub(/S/, "csc", curtxt);
  gsub(/T/, "cc", curtxt);
  gsub(/V/, "?", curtxt);
  gsub(/Y/, "?", curtxt);

  # Now, the conversion from JSA to HOP.
  # The order matters: y -> i runs before ci -> a, so \cy/ is
  # folded into \ci/; i*n -> m merges \iiu/ and \iiiu/.
  gsub(/[ql]j/, "H", curtxt);
  gsub(/[ql]g/, "P", curtxt);
  gsub(/cs/, "z", curtxt);
  gsub(/ij/, "k", curtxt);
  gsub(/ix/, "e", curtxt);
  gsub(/is/, "r", curtxt);
  gsub(/iiu/, "n", curtxt);
  gsub(/y/, "i", curtxt);
  gsub(/ci/, "a", curtxt);
  gsub(/cg/, "8", curtxt);
  gsub(/ir/, "w", curtxt);
  gsub(/i*n/, "m", curtxt);

  print (substr($0,1,19) curtxt);
  next
}
------------------------------------
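
  For reference, the digraph counts given near the top of these notes
  can be tallied with a few lines of awk. This is only a minimal
  sketch, not the script that produced that table. The function name
  `digraphs` is hypothetical. It assumes the input has already been
  reduced to plain jsa words separated by blanks (no location codes,
  comments, or "?" placeholders), and it writes the word break as "."
  on both sides of each word.

```sh
# Hypothetical helper, not part of the original notebooks.
# Reads blank-separated words (stdin or file arguments) and prints
# one line per digraph: the two-character pair and its count.
digraphs() {
  awk '
    {
      for (k = 1; k <= NF; k++) {
        w = "." $k "."                  # "." stands for the word break
        for (p = 1; p < length(w); p++)
          count[substr(w, p, 2)]++      # tally each adjacent pair
      }
    }
    END {
      for (d in count) print d, count[d]
    }
  ' "$@" | LC_ALL=C sort
}
```

  For example, `printf 'qocy cy\n' | digraphs` lists six pairs, among
  them `cy 2` and `y. 2` (a \y/ followed by a word break). Arranging
  the pairs into a square table with row and column totals is left out
  to keep the sketch short.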