98-03-09 stolfi
===============

  [ Lots of work on PGMTextFilter and PGMRankFilter omitted. ]

  Recreating the images in the norm-filter site:

    list .gifs

    --- .gifs ------------------------
    test1-4-05-05-4-95-95-nf
    test1-4-05-05-4-50-80-nf
    test1-6-05-05-6-95-95-nf
    test1-4-05-05-6-50-80-nf
    test1-6-05-05-6-50-80-nf
    test1-1-05-05-6-50-80-nf
    test1-1-05-05-1-50-80-nf
    test1-1-05-05-1-95-95-nf
    test1-3-05-05-3-50-80-nf
    test1-2-05-05-2-95-95-nf
    test1-2-05-05-2-50-80-nf
    test1-3-05-05-3-95-95-nf
    test1-4-05-05-9-50-80-nf
    test1-1-05-05-9-50-80-nf
    test1-4-05-05-6-50-80-hi
    test1-4-05-05-9-50-80-hi
    ----------------------------------

    --- test-nf ------------------------
    #! /bin/csh -f
    set usage = "$0 DIR NAME LORAD LOFRAC LOCOLOR HIRAD HIFRAC HICOLOR OP"
    set notify

    if ( $#argv != 9 ) then
      echo "usage: ${usage}"; exit 1
    endif

    set dir = "$1"; shift;
    set name = "$1"; shift;
    set lorad = "$1"; shift; set lofrc = "$1"; shift
    set locol = "$1"; shift
    set hirad = "$1"; shift; set hifrc = "$1"; shift
    set hicol = "$1"; shift
    set op = "$1"; shift

    set otname = "${name}-${lorad}-${lofrc}-${locol}-${hirad}-${hifrc}-${hicol}-${op}"
    echo "${otname}.pgm"
    ( cat ${dir}/${name}.pgm \
        | nice PGMNormFilter \
            -write ${op} \
            -loRadius ${lorad} -loFraction 0.${lofrc} -loColor 0.${locol} \
            -hiRadius ${hirad} -hiFraction 0.${hifrc} -hiColor 0.${hicol} \
            -minWeight 1e-6 \
        > ${dir}/${otname}.pgm \
      && xv ${dir}/${otname}.pgm ) &
    ------------------------------------

    foreach f ( `cat .gifs` )
      echo $f
      if ( ! ( -r filter-gal/norm-filter/${f}.pgm ) ) then
        set parms = ( `echo $f | tr 'a-' 'a '` )
        test-nf filter-gal/norm-filter ${parms}
      endif
      cat filter-gal/norm-filter/${f}.pgm \
        | pnmdepth 255 \
        | ppmtogif \
        > filter-gal/norm-filter/${f}.gif
    end

  In parallel I collected some EVA text in fine-structure.txt and
  separated its words by hand into "fine elements", with these
  conventions:

    # Notation:
    #
    #   \       bogus line break
    #   /       legitimate word break
    #   /       bogus word break
    #   :       missing word break
    #   .       fine element separator
    #   (x)     missing "x" here
    #   (=y)    previous char(s) should be "y"
    #   w!      "w" is an anomalous word
    #   p{m}s   "p" is prefix, "s" is suffix, "m" may be midfix

  I tried to use these fine elements:

    # Prefix elements:
    #
    #   q
    #   o y
    #   ch sh
    #
    # Midfix elements:
    #
    #   d
    #   k t
    #   ckh cth
    #   ke te
    #   kee tee
    #   che she
    #   chee shee
    #   ch ch
    #   sh sh
    #   p f (polyvalent?)
    #   cph cfh (polyvalent?)
    #
    # Suffix elements:
    #
    #   o y
    #   dy
    #   ar or
    #   al ol
    #   ain oin
    #   air oir
    #   aiin oiin
    #   aiiin oiiin
    #   aiir oiir
    #   ch ch
    #   sh sh
    #
    # Unsure still:
    #
    #   s r

  I extracted the "words" into fine-structure.wds and tried to
  identify prefix, suffix, midfix:

    cat fine-structure.txt \
      | sed \
          -e '/^#/d' \
          -e 's/^<[^ ]*> *//g' \
      | tr -d ' .\!\\-' \
      | tr '/:' '\012\012' \
      | egrep '.' \
      > fine-structure.wds

    factor-words fine-structure A

    factor-words bio B

  Checking distribution of word endings:

    cat hea-u.wds \
      | sed \
          -e 's/$/#/g' \
          -e 's/^/__/g' \
          -e 's/\(ii*[nr]\)#/-\1/g' \
          -e 's/\(.\)#/-\1/g' \
          -e 's/^.*-//g' \
      | sort | uniq -c | expand | sort +0 -1nr \
      > head-u-ends.frq

    count ending
    ----- ------
      220 y
      158 r
      150 l
      101 iin
       62 s
       49 o
       39 m
       22 in
       10 d
        6 *
        6 a
        6 h
        5 ir
        4 iiin
        4 n
        3 iir
        2 e
        1 c
        1 i

  Last pair distribution:

    cat hea-u.wds \
      | egrep -v '[*]' \
      | egrep -v '(^s|es|c|i)$' \
      | sed \
          -e 's/[cs]h/C/g' \
          -e 's/c[ktpf]h/T/g' \
          -e 's/[ktpf]/t/g' \
          -e 's/[ao]/o/g' \
          -e 's/^q//g' \
          -e 's/^/_/g' -e 's/$/}3/g' \
          -e 's/\([ao]*i*[rlmns]\)}3/}2 \1/g' \
          -e 's/\(i*[rlmns]\)}3/}2 \1/g' \
          -e 's/\([ao]*[oy]\)}3/}2 \1/g' \
          -e 's/}3/}2 _/g' \
          -e 's/\([aoy]*[CSTKPFtkpfdrls]e*\)}2/}1 \1/g' \
          -e 's/\(.e*[aoy]*\)}2/}1 \1/g' \
          -e 's/^.*}1 //g' \
      | sort | uniq -c | expand | sort -b +2 -3 +0 -1nr \
      | gawk '/./{if($3!=o){printf "\n";o=$3;} print;}' \
      > hea-u-endp.frq

  It looks like the final -es is usually -ees or -eees, so it could be
  -sh with the ligature omitted. However, these groups are rather common
  and are not transcription errors. Either ligatures are unimportant,
  or the IST is true...
  The other final -s could be separate words.

  The empty suffix:

      5 d _
      5 od _
      2 C _
      2 T _
      1 Ce _
      1 oC _
      1 ooT _
      1 te _

  The "oi*[nrm]" family:

     42 d oiin      9 d oin      1 d oiiin     4 d oir     1 d oiim    1 d oim
     13 od oiin     5 od oin     1 od oiiin    1 od oir
     10 ot oiin     3 ot oin     1 oot oiiin
      8 t oiin      1 Ce oin     1 yt oiiin
      7 C oiin      1 ol oin
      3 Ce oiin     1 or oin
      3 yt oiin     1 s oin
      2 or oiin
      2 s oiin
      1 T oiin
      1 yC oiin
      1 yod oiin
      1 yr oiin

  The "y" and "o" suffixes. Note that their distributions are neither
  very similar nor very different. Perhaps the "o" suffixes are
  actually split words?

     49 C y        25 C o        2 C ooiin
     33 od y        9 Ce o       2 t ooiin
     19 T y         4 ol o       1 _ ooiin
     17 d y         3 ot o       1 yt ooiin
     17 ot y        2 T o
     14 Ce y        2 d o
     11 t y         2 oC o
      9 ol y        2 od o
      6 Cee y       1 l o
      6 or y        1 oCe o
      5 oT y        1 ote o
      4 oC y        1 t o
      3 Te y        1 yC o
      3 tee y       1 yt o
      3 yte y
      2 ote y
      2 r y
      2 s y
      2 yt y
      1 _ y
      1 de y
      1 oTe y
      1 ode y
      1 oor y
      1 yC y
      1 yCe y

  The "ol", "or", "om", "oy", and "os" suffixes. Note that "os"
  doesn't belong.

     57 C ol       67 C or       6 C om      5 C oy      2 C os
     20 ot ol      14 ot or      5 ot om     1 ot oy     2 T os
     19 T ol       13 Ce or      4 T om                  2 _ os
     17 d ol       13 d or       4 d om                  2 yC os
      5 Ce ol      11 T or       3 od om                 1 Cee os
      5 od ol       8 _ or       3 t om                  1 l os
      4 _ ol        5 t or       2 Ce om                 1 ote os
      4 ote ol      4 od or      1 oT om                 1 s os
      3 oC ol       3 s or       1 oor om
      3 te ol       3 yC or      1 or om
      2 s ol        3 yCe or     1 yC om
      2 t ol        2 l or       1 yte om
      1 Cee ol      2 oC or
      1 de ol       1 Cee or
      1 oT ol       1 oCe or
      1 ol ol       1 ol or
      1 otee ol     1 yt or
      1 r ol
      1 yC ol

  The double-"o"-"i" suffixes:

      1 _ ooiir
      1 _ ooin
      2 _ oor

  And the complete misfits:

      1 or iin

      1 C oiir
      1 _ oiir

     39 _ s
      2 Ce s

      1 C l
      1 _ l

      2 C on
      1 Ce on
      1 T on

      2 d m
      1 _ m
      1 od m

      1 C r
      1 _ r
      1 ot r

  Distribution of last block ignoring suffixes:

    cat hea-u.wds \
      | egrep -v '[*]' | egrep -v '[ci]$' \
      | sed \
          -e 's/eee/che/g' -e 's/ee/ch/g' \
          -e 's/ch/C/g' -e 's/sh/S/g' \
          -e 's/[kt]/t/g' -e 's/[pf]/p/g' \
          -e 's/ckh/K/g' -e 's/cth/T/g' -e 's/cph/P/g' -e 's/cfh/F/g' \
          -e 's/[ao]/o/g' \
          -e 's/^q//g' \
          -e 's/y\(.\)/o\1/g' \
          -e 's/^/_/g' -e 's/$/}3/g' \
          -e 's/\([ao]*i*[rlmns]\)}3/}2 \1/g' \
          -e 's/\(i*[rlmns]\)}3/}2 \1/g' \
          -e 's/\([ao]*[oy]\)}3/}2 \1/g' \
          -e 's/}3/}2 _/g' \
          -e 's/\([aoyc]*\)\([CSTKPFtkpfdrls]\)\(e*\)}2/}1 \1. \2 .\3/g' \
          -e 's/\([aoyc]*\)\(.\)\(e*\)}2/}1 \1. \2 .\3/g' \
          -e 's/^.*}1 //g' \
      | gawk '/./{print $1, $2, $3;}' \
      | sort | uniq -c | expand | sort -b +2 -3 +0 -1nr \
      | gawk '/./{if($3!=o){printf "\n";o=$3;} print;}' \
      | sed -e 's/[.] //' -e 's/ [.]//' \
      > hea-u-penp.frq

  Here is the distribution. Note that "d" and "T" have basically the
  same [ao]*/e* environment. Ditto for "C" and "S". We can include
  "t": the patterns are the same, but the frequencies are very
  different.

    118 d      56 T     197 C     46 S     78 ot     5 P     5 op
     73 od      7 oT     39 Ce    14 Se    27 t      1 Pe    1 cp
      2 de      2 Te     20 oC     5 oS    12 ote
      1 ode     1 oTe     7 oCe             4 te
      1 ood     1 ooT                       4 ct
                                            1 oot

     12 or     16 ol     11 s     63 _
      2 oor     4 l
      3 r

  Let's now split the entire word:

    cat hea-u.wds \
      | egrep -v '[*]' | egrep -v '[ci]$' \
      | sed \
          -e 's/eee/che/g' -e 's/ee/ch/g' \
          -e 's/ch/C/g' -e 's/sh/S/g' \
          -e 's/[kt]/t/g' -e 's/[pf]/p/g' \
          -e 's/ckh/K/g' -e 's/cth/T/g' -e 's/cph/P/g' -e 's/cfh/F/g' \
          -e 's/[ao]/o/g' \
          -e 's/^q//g' \
          -e 's/y\(.\)/o\1/g' \
          -e 's/^/{/' -e 's/$/#/' \
          -e 's/\([ao]*i*[rlmns]\)#/}.\1/g' \
          -e 's/\(i*[rlmns]\)#/}.\1/g' \
          -e 's/\([ao]*[oy]\)#/}.\1/g' \
          -e 's/#/}._/g' \
          -e ':x' \
          -e 's/\([aoyc]*[CSTKPFtkpfdrls]e*\)}/}.\1/g' \
          -e 'tx' \
          -e 's/{}//g' \
      > hea-u-fact.fac

    cat hea-u-fact.fac \
      | tr '.' '\012' \
      | sort | uniq -c | expand | sort -b +0 -1nr \
      > hea-u-elem.frq

  Total element frequencies:

    136 d       2 de      66 T      3 Te      8 P      1 Pe
     79 od      1 ode      8 oT     1 oTe     1 oP
      1 ood                1 ooT

    289 C      47 Ce      87 S     19 Se     63 s
     26 oC      7 oCe      6 oS              12 os

    147 ot     14 ote     14 op    168 or    175 ol    32 om    4 on
     73 t       5 te       9 p       6 r       8 l      4 m
      1 oot                          4 oor
      6 ct                 1 cp

     18 _ (no termination)
    214 y
     94 oiin
     55 o
     21 oin
      6 ooiin
      6 oy
      5 oir
      4 oiiin
      2 oiir
      1 iin
      1 oiim
      1 oim
      1 ooiir
      1 ooin

  This parsing has three groups of letters:

    { d de T Te P Pe C Ce S Se s }   prefixed by "o" only ~20% of the time
    { t te p (pe?) }                 prefixed by "o" half the time
    { r l m n }                      prefixed by "o" ~90% of the time

  There are many terminations, some beginning with "o" (or "y") and
  some with "oo" (or "oy").

  The "oo" group occurs once each before the letters { d t T } and 4
  times before "r" (almost as often as the empty premodifier).

  The "c" (half-platform) modifier occurs before "t" and "p", to the
  exclusion of the "o" and "e" modifiers.

  Let's try again, attaching the [ao]* groups after the letter:

    cat hea-u.wds \
      | egrep -v '[*]' | egrep -v '[ci]$' \
      | sed \
          -e 's/eee/che/g' -e 's/ee/ch/g' \
          -e 's/ch/C/g' -e 's/sh/S/g' \
          -e 's/[kt]/t/g' -e 's/[pf]/p/g' \
          -e 's/ckh/K/g' -e 's/cth/T/g' -e 's/cph/P/g' -e 's/cfh/F/g' \
          -e 's/[ao]/o/g' \
          -e 's/^q//g' \
          -e 's/y\(.\)/o\1/g' \
          -e 's/^/{/' -e 's/$/#/' \
          -e 's/\(i*[rlmns]\)#/}.\1_/g' \
          -e 's/#/}._/g' \
          -e ':x' \
          -e 's/\([c]*[CSTKPFtkpfdrls]e*[aoy]*\)}/}.\1/g' \
          -e 'tx' \
          -e 's/{\([oy][ao]*\)}/{}.@\1/g' \
          -e 's/{}//g' \
      > hea-u-fcto.fac

    cat hea-u-fcto.fac \
      | sed -e 's/y/o/g' \
      | tr '.' '\012' \
      | sort | uniq -c | expand | sort -b +1 -2 +0 -1nr \
      > hea-u-elmo.frq

  Element distributions:

    209 @o       9 @oo

    266 Co      46 Ceo     79 So     18 Seo      9 Po      1 Peo
     42 C        8 Ce      14 S       1 Se
      7 Coo

    112 to      18 teo     71 To      4 Teo    191 do      3 deo
    104 t        1 te       4 T                 25 d
      5 too

      4 cto      1 cpo
      2 ct

    150 l_     157 r_      62 s_     36 m_       4 n_
     23 lo      17 ro      12 so
     10 l        4 r        1 s

     15 p        8 po

      4 iiin_
      1 iim_
    101 iin_
      3 iir_
      1 im_
     22 in_
      5 ir_

  This second alternative shows these groups of letters:

    { m n }       only final
    { l r s }     can be final; when non-final, they behave like the next group
    { C Ce S Se P Pe cp t te T Te ct d de }
                  followed by "o" about 60-90% of the time

  There are fewer terminations, and the letters seem more homogeneous.
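  The right-to-left peeling done by the ":x ... tx" loop in this second
  alternative can be tried in isolation. What follows is a minimal
  sketch, not part of the original session: it wraps the same sed
  substitutions in a shell function (the name factor_words is made up
  here and is not the factor-words script used earlier) and runs them
  on three sample EVA words that were picked only for illustration,
  not taken from hea-u.wds.

```shell
#!/bin/sh
# Sketch only: same substitutions as the hea-u-fcto.fac pipeline,
# wrapped in a hypothetical function. Reads one word per line on stdin.
factor_words() {
  sed \
    -e 's/eee/che/g' -e 's/ee/ch/g' \
    -e 's/ch/C/g' -e 's/sh/S/g' \
    -e 's/[kt]/t/g' -e 's/[pf]/p/g' \
    -e 's/ckh/K/g' -e 's/cth/T/g' -e 's/cph/P/g' -e 's/cfh/F/g' \
    -e 's/[ao]/o/g' \
    -e 's/^q//g' \
    -e 's/y\(.\)/o\1/g' \
    -e 's/^/{/' -e 's/$/#/' \
    -e 's/\(i*[rlmns]\)#/}.\1_/g' \
    -e 's/#/}._/g' \
    -e ':x' \
    -e 's/\([c]*[CSTKPFtkpfdrls]e*[aoy]*\)}/}.\1/g' \
    -e 'tx' \
    -e 's/{\([oy][ao]*\)}/{}.@\1/g' \
    -e 's/{}//g'
}

# "{" marks the start of the word and "}" the boundary between the
# part still to be parsed (left) and the elements already split off
# (right). Each pass of the loop moves one more element across "}";
# "tx" repeats until the substitution no longer matches.
printf '%s\n' daiin chol qokedy | factor_words
# .do.iin_
# .Co.l_
# .@o.te.dy._
```

  Piping this output through tr '.' '\012' | sort | uniq -c gives
  element counts in the same form as hea-u-elmo.frq (once "y" has been
  mapped to "o").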
  Perhaps if we treat final "y" as a termination, things get better:

    cat hea-u.wds \
      | egrep -v '[*]' | egrep -v '[ci]$' \
      | sed \
          -e 's/eee/che/g' -e 's/ee/ch/g' \
          -e 's/ch/C/g' -e 's/sh/S/g' \
          -e 's/[kt]/t/g' -e 's/[pf]/p/g' \
          -e 's/ckh/K/g' -e 's/cth/T/g' -e 's/cph/P/g' -e 's/cfh/F/g' \
          -e 's/[ao]/o/g' \
          -e 's/^q//g' \
          -e 's/y\(.\)/o\1/g' \
          -e 's/^/{/' -e 's/$/#/' \
          -e 's/\(i*[rlmns]\)#/}.\1_/g' \
          -e 's/\([y]\)#/}.\1_/g' \
          -e 's/#/}._/g' \
          -e ':x' \
          -e 's/\([c]*[CSTKPFtkpfdrls]e*[ao]*\)}/}.\1/g' \
          -e 'tx' \
          -e 's/{\([oy][ao]*\)}/{}.@\1/g' \
          -e 's/{}//g' \
      > hea-u-fcty.fac

    cat hea-u-fcty.fac \
      | tr '.' '\012' \
      | sort | uniq -c | expand | sort -b +1 -2 +0 -1nr \
      > hea-u-elmy.frq

  Element frequencies:

    208 @o      9 @oo

    217 Co     70 So     35 Ceo    14 Seo    48 To     3 cto    13 teo    141 do    8 Po    1 cpo
     96 C      23 S      19 Ce      5 Se     27 T      3 ct      6 te      75 d     1 P
      2 Coo

    131 t       3 Te     17 p       1 Pe      2 de
     86 to      1 Teo     6 po                1 deo
      4 too

     73 _

    150 l_    157 r_     62 s_
     19 l      13 r      10 so
     14 lo      8 ro      3 s

     36 m_      4 n_

    220 y_
    101 iin_
     22 in_
      5 ir_
      4 iiin_
      3 iir_
      1 iim_
      1 im_

  This schema has the following groups of letters:

    { C Ce S Se T ct te d P cp }   most often followed by "o"
    { t Te p Pe de }               most often NOT followed by "o"
    { l r s }                      typically final, otherwise ~50% followed by "o"
    { m n }                        only final

  This division seems less logical than the previous one. Also, "s" is
  rather different from "l" and "r" in its affinity for "o".

  If we treat final "y" as a termination, we should probably include
  the "o" in "oiin" etc. Let's try it:

    cat hea-u.wds \
      | egrep -v '[*]' | egrep -v '[ci]$' \
      | sed \
          -e 's/eee/che/g' -e 's/ee/ch/g' \
          -e 's/ch/C/g' -e 's/sh/S/g' \
          -e 's/[kt]/t/g' -e 's/[pf]/p/g' \
          -e 's/ckh/K/g' -e 's/cth/T/g' -e 's/cph/P/g' -e 's/cfh/F/g' \
          -e 's/[ao]/o/g' \
          -e 's/^q//g' \
          -e 's/^/{/' -e 's/$/#/' \
          -e 's/\([ao]ii*[rmnsl]\)#/}.\1_/g' \
          -e 's/\(i*[rmnsl]\)#/}.\1_/g' \
          -e 's/\([y]\)#/}.\1_/g' \
          -e 's/#/}._/g' \
          -e ':x' \
          -e 's/\([c]*[CSTKPFtkpfdrls]e*[aoy]*\)}/}.\1/g' \
          -e 'tx' \
          -e 's/{\([oy][ao]*\)}/{}.@\1/g' \
          -e 's/{}//g' \
      > hea-u-fctz.fac

    cat hea-u-fctz.fac \
      | tr '.' '\012' \
      | sort | uniq -c | expand | sort -b +1 -2 +0 -1nr \
      > hea-u-elmz.frq

  Let me extract my herbal-A transcription again:

    rm -f hea-u.txt
    foreach f ( `cat filter-gal/pages.dir` )
      echo "<${f}>" >> hea-u.txt
      cat filter-gal/$f/$f.P?.txt >> hea-u.txt
    end

    cat hea-u.txt \
      | tr -d '%\!' \
      | sed \
          -e '/^#/d' \
          -e 's/^<[^ ]*> *//g' \
          -e 's/{[^{}]*}//g' \
      | tr ' ' '.' \
      | tr '.,=-' '\012\012\012\012' \
      | egrep -e '.' \
      | egrep -v '[*?]' \
      > hea-u.wds

    dicio-wc hea-u.txt hea-u.wds

     lines   words     bytes file
    ------ ------- --------- ------------
       144    1146      8209 hea-u.txt
       803     803      4681 hea-u.wds

  Let's try to make sense out of this.

  [ See Notes/017/Note-017.txt ]

JUNK from now on:

    set letters = ( t te T Te p pe P Pe d de S Se C Ce r l s m n )

    cat ${file}.fac \
      | egrep -e '([*]|[ci]$|ee|[aoy][aoy]|c[ktpf]([^h]|$))' \
      > .${file}-fine.wds

    foreach k ( ${letters} )
      cat .${file}-fine.fac | egrep -v '[@]' \
        | sed -e

          -e 's/[kt]/t/g' -e 's/[pf]/p/g' \
          -e 's/[ao]/o/g' \
          -e 's/^q//g' \

    cat hea-u-fctz.fac \
      | tr '.' '\012' \
      | sort | uniq -c | expand | sort -b +1 -2 +0 -1nr \
      > hea-u-elmz.frq

    ???

    cat hea-u-fact.fac \
      | sed -e 's/^/./' -e 's/[.][^.]*$/#/g' \
      | tr -d 'eo.' \
      | sort | uniq -c | expand | sort -b +0 -1nr \
      > hea-u-patt.frq