From this preliminary hacking, I drew the following conclusions:

  The manuscript does not appear to use any hyphenation mark. Either
  words are not broken across lines, which would be unusual, or they
  are broken without any extra marks. Such word breaks may result in
  statistical anomalies at the beginning and end of lines. Could this
  explain Currier's claim that lines are "functional units"?

  Comparing the two versions (Currier and Friedman), and looking at
  the word statistics, it seems that both are highly contaminated with
  errors (5-10% of the words). This large amount of noise will mess up
  any statistical analysis based on either text alone.

Therefore, before spending more time on the analysis, I must first
prepare a "corrected" interlinear where discrepancies between FSG and
Currier are resolved, taking into account the probabilities above.

Looking at the actual shape of the characters, I realized that the FSG
encoding was not very good for my purposes, since it assigns completely
different codes to glyphs which may be just calligraphic variations of
the same grapheme. Thus I decided to do most processing using a more
analytical encoding, which can be lumped later.

I considered using Jacques Guy's "Neo-Frogguy" or "Gui2" encoding, but
even that is a bit too synthetic: for example, his <2> should be "i'",
and his <9> should be "c)", for consistency. (The statistics on the
occurrence of repeated strokes apparently confirm this choice.) Thus I
decided to define my own "super-analytic" or "JSA" encoding.

My super-analytic encoding
--------------------------

The idea is to break all characters down into individual "logical"
strokes, and use one (computer) character to encode each stroke.

There is some question as to what is a logical stroke, and when two
strokes are different.
Obviously, the definition of a stroke must include not only its shape
but also the way it connects to the neighboring strokes; and, given the
irregularity of handwritten glyphs, that may be hard to decide.

For instance, FSG's [A] character can be broken down into two strokes,
shaped like the [C] and [I] glyphs. Supposedly, the difference between
an [A] and a [CI] is that in the former the strokes are connected into
a closed shape. Is this difference significant?

I checked the occurrences of [CI], [CM], and [CN] in the interlinear
file. Two things are curious. First, these combinations are extremely
rare. Second, a good many of them are transcribed differently by
Currier and the FSG: where one has [CIIR] the other often has [AIR],
and vice versa. The same holds for [CM] versus [AN], etc.

In light of these observations, I have decided to treat all occurrences
of [A] as [CI]. If the two are indeed different, that will be just one
more ambiguity added to the inherent ambiguity of natural language; so
it cannot make the decipherment task more difficult. Confusing the two
will change the letter frequencies, it is true; but, since the language
does not appear to be a standardized one, there is not much information
we can extract from absolute letter frequencies. The methods we hope to
use, such as automaton analysis, are not significantly disturbed by
collapsing letters.

On the other hand, if [A] and [CI] are the same grapheme, using
different encodings will seriously confuse the statistics, especially
if the spacing depends on the immediate context.

For similar reasons, it is best to ignore the distinction between [T]
and [CC], or between [S] and [2C]. The ligature is often lost, and we
don't know whether it is significant.

Also, the characters that Currier transcribes as [6] are usually
transcribed [K] by Friedman, and the two are very similar. Strangely,
[K] seems to occur mostly at the end of *lines*.

The characters [7], [V], [Y] do not occur in this corpus.
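
The effect of ignoring these ligatures can be checked mechanically. The
following is only a minimal sketch, not the procedure actually used for
these notes: the helper names are hypothetical, the word lists are made
up for illustration (they are not taken from the interlinear file), and
it assumes the two transcriptions are already aligned word by word.

```python
# Substitutions proposed in the text: [A] = [CI], [T] = [CC], [S] = [2C].
FSG_EQUIVALENTS = {"A": "CI", "T": "CC", "S": "2C"}


def normalize_fsg(word: str) -> str:
    """Spell each ligatured FSG code as its unligatured equivalent."""
    return "".join(FSG_EQUIVALENTS.get(ch, ch) for ch in word)


def disagreement_rate(words_a, words_b, normalize=lambda w: w):
    """Fraction of aligned word pairs that differ after normalization."""
    if len(words_a) != len(words_b):
        raise ValueError("transcriptions must be aligned word by word")
    if not words_a:
        return 0.0
    differing = sum(normalize(a) != normalize(b)
                    for a, b in zip(words_a, words_b))
    return differing / len(words_a)


# Toy data, invented for this example only.
currier = ["CIIR", "OE", "TOE", "2CO"]
fsg = ["AIR", "OE", "CCOE", "SOK"]

raw = disagreement_rate(currier, fsg)                       # 3 of 4 differ
collapsed = disagreement_rate(currier, fsg, normalize_fsg)  # 1 of 4 differs
print(raw, collapsed)
```

Discrepancies that survive the normalization (the last pair in the toy
data) are the ones that would still have to be resolved by hand or by
some probabilistic rule.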
Summarizing, the JSA encoding breaks down every character into strokes,
which are cast into one of these types:

  1. "Body" strokes:

       q   same as FSG [4], Guy <4>; also part of [H], [P], [HZ], ...

       o   same as [O].

       c   same as [C]; also part of [A], [8], etc.

       i   same as [I]; also part of [A], [M], [N], [R], etc.

       l   long vertical bar of [D], [F], [DZ], [FZ].

  2. "Limb" strokes ("flourishes", "plumes", ...):

       g   an 8-shaped loop with both ends attached to the previous
           letter, as the right three-fourths of [8] and [7]; and also
           the right-hand swirls of [P], [F], [PZ], [FZ].

       y   a curving descender shaped like a right parenthesis,
           attached to the top of the preceding stroke; the right-hand
           stroke of [G] = <9>.

       s   a plume attached to the top of the preceding character,
           pointing NE and curving up, as in [2], [R] = <2>, and [S].

       x   a hook attached to the top of an \i/ stroke, curving sharply
           down and crossing the \i/; half of [E].

       j   a P-shaped loop with one end attached to the top of the
           previous stroke, and the other extending straight down; as
           in the right half of [H], [D], [HZ], [DZ], and [K].

       u   a plume similar to \s/, but attached to the *bottom* of the
           preceding stroke; as in [L], [N], [M].

The ligature in [T] is ignored; i.e., Guy's ligatured and unligatured
variants of this stroke are identified, and denoted uniformly by \c/.
This identification is consistent with the digraph statistics.

The character [A] is rendered \ci/. In fact, [A] is probably not a
letter: it appears to be a \c/ stroke (possibly half of the preceding
letter) accidentally connected to an \i/ stroke (probably the beginning
of the next letter).

The weirdo symbols [Y], [V], etc. will be translated as \?/.

The FSG -> JSA correspondence is, therefore:

    A -> ci      N -> iiu     2 -> cs     IE   -> iix       HZ -> cajc
    C -> c       O -> o       4 -> a      IR   -> iis       PZ -> cagc
    D -> lj      P -> ag      6 -> cj     IK   -> iij       DZ -> cljc
    E -> ix      R -> is      7 -> ig     IIE  -> iiix      FZ -> clgc
    F -> lg      S -> csc     8 -> cg     IIR  -> iiis
    G -> cy      T -> cc                  IIK  -> iiij
    H -> aj      V -> ?                   IIIE -> iiiix
    I -> i       Y -> ?                   IIIR -> iiiis
    K -> ij                               IIIK -> iiiij
    L -> iu                               IIIL -> iiiiu
    M -> iiiu

Note that the \i/ groups have one more \i/ in JSA than they have in
Guy's encoding. This is redundant, but makes it more evident that <2>
and its counterparts are homologous members of their respective series.
Also, this encoding fixes a minor discrepancy of Guy2, which uses one
extra \i/ in one of these series.

Ad-hoc encodings
----------------

After mapping everything to the JSA encoding, and looking at the
digraph frequency tables, I observed that:

  The stroke `l' is always followed by either `j' or `g', hence `lj'
  and `lg' should be single letters.

Note also that there are two clearly different kinds of strokes, "body"
B = {`c',`o',`t',`i',`q',`l'} and "limb" L = {`u',`x',`y',`j',`g',`s'}.
If we reduce the digraph count matrix to these two classes, plus word
break W, we get (rows give the first symbol of the digraph, columns the
second):

              W     B     L
          ----- ----- -----
      W       .  6420     .
      B      59 19849 15616
      L    6361  9255     .
          ----- ----- -----

  Next-symbol probabilities (× 99):

              W     B     L
          ----- ----- -----
      W       .    99     .
      B       .    55    44
      L      40    59     .
          ----- ----- -----

  Previous-symbol probabilities (× 99):

              W     B     L
          ----- ----- -----
      W       .    18     .
      B       1    55    99
      L      98    26     .
          ----- ----- -----

Note that every word begins with a body stroke; this was expected from
the definition of the limb strokes (they can be recognized only by
their relationship to a previous stroke).

Note also that a limb stroke cannot be followed by another limb stroke;
this too is not wholly unexpected.

The surprise is that almost no words *end* in a body stroke. The least
rare body stroke in word-final position is `o'. The words that end in
body strokes appear to be errors or the result of breaking a line in
the middle of a word.

An interesting observation from the body/limb frequency tables above is
that the transition probabilities from a body stroke to a body stroke
and to a limb stroke are respectively about 56% and 44%. Thus, if the
limb strokes mark the end of a syllable (or letter?), then the average
number of body strokes in a syllable is slightly over 2.
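
The two probability tables can be recomputed from the class digraph
counts. What follows is a sketch under stated assumptions rather than
the original computation: rows are taken to be the first symbol and
columns the second, with W standing for the word break; the "× 99"
scale was presumably chosen so that every entry fits in two digits; and
all function and variable names are hypothetical. Entries printed as
"." in the tables come out as 0 here.

```python
SYMBOLS = ["W", "B", "L"]  # word break, body stroke, limb stroke

# COUNTS[first][second], copied from the class digraph counts in the text.
COUNTS = {
    "W": {"W": 0, "B": 6420, "L": 0},
    "B": {"W": 59, "B": 19849, "L": 15616},
    "L": {"W": 6361, "B": 9255, "L": 0},
}


def next_symbol_probs(counts, scale=99):
    """Normalize each row: P(second | first), scaled and rounded."""
    table = {}
    for first in SYMBOLS:
        row_total = sum(counts[first].values())
        table[first] = {second: round(scale * counts[first][second] / row_total)
                        for second in SYMBOLS}
    return table


def prev_symbol_probs(counts, scale=99):
    """Normalize each column: P(first | second), scaled and rounded."""
    table = {first: {} for first in SYMBOLS}
    for second in SYMBOLS:
        col_total = sum(counts[first][second] for first in SYMBOLS)
        for first in SYMBOLS:
            table[first][second] = round(scale * counts[first][second] / col_total)
    return table


next_probs = next_symbol_probs(COUNTS)
prev_probs = prev_symbol_probs(COUNTS)

# Unscaled body -> body probability, and body strokes per limb stroke.
body_total = sum(COUNTS["B"].values())
limb_total = sum(COUNTS["L"].values())
p_body_body = COUNTS["B"]["B"] / body_total
body_per_limb = body_total / limb_total

print(next_probs["B"], prev_probs["L"])
print(round(p_body_body, 3), round(body_per_limb, 2))
```

The ratio of body strokes to limb strokes is one way to read "the
average number of body strokes in a syllable"; it treats every limb
stroke as closing exactly one unit, which is itself only a hypothesis.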
(Considering that we are counting each \i/ as a body stroke, the
correct number may well be precisely 2.)
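
Returning to the FSG -> JSA correspondence given earlier, the conversion
is easy to mechanize. The sketch below is one possible way to do it,
with hypothetical names, and is not the script actually used. It follows
the correspondence table as given, which writes `a' for FSG [4] where
the stroke list uses `q'. It also assumes that [Z] occurs only as the
second half of [HZ], [PZ], [DZ], [FZ], and that any unlisted character
should become `?'. The \i/ groups ([IE], [IIR], [IIIL], ...) need no
entries of their own, because each one equals the concatenation of its
single-character codes.

```python
# Single- and double-character FSG codes, copied from the correspondence table.
FSG_TO_JSA = {
    "A": "ci", "C": "c", "D": "lj", "E": "ix", "F": "lg", "G": "cy",
    "H": "aj", "I": "i", "K": "ij", "L": "iu", "M": "iiiu", "N": "iiu",
    "O": "o", "P": "ag", "R": "is", "S": "csc", "T": "cc",
    "V": "?", "Y": "?",
    "2": "cs", "4": "a", "6": "cj", "7": "ig", "8": "cg",
    "HZ": "cajc", "PZ": "cagc", "DZ": "cljc", "FZ": "clgc",
}


def fsg_to_jsa(word: str) -> str:
    """Convert one FSG-coded word to JSA strokes, longest match first."""
    strokes = []
    pos = 0
    while pos < len(word):
        pair = word[pos:pos + 2]
        if pair in FSG_TO_JSA:
            # Two-character codes such as "HZ" take precedence.
            strokes.append(FSG_TO_JSA[pair])
            pos += 2
        else:
            # Assumption: anything not in the table is rendered as "?".
            strokes.append(FSG_TO_JSA.get(word[pos], "?"))
            pos += 1
    return "".join(strokes)


# Pairs that the two transcriptions tend to confuse collapse to one spelling.
for left, right in [("AIR", "CIIR"), ("AN", "CM"), ("T", "CC"), ("S", "2C")]:
    print(left, right, fsg_to_jsa(left), fsg_to_jsa(right))
```

With this mapping [AIR] and [CIIR] both become `ciiis', and [AN] and
[CM] both become `ciiiu', so these transcription differences vanish at
the stroke level. [6] and [K] remain distinct (`cj' versus `ij').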