Hacking at the Voynich manuscript - Side notes
019 Analyzing word frequencies per section

Last edited on 1999-07-28 13:12:14 by stolfi

[ Originally part of Notes/021; First version done on 1998-04-28.
  Redone 1998-06-20 with fresher data.
  Redone 1998-07-02 with different dictionaries and axes.
  Redone 1999-01-30 with 1.6e6 majority transcription (Notes/045).
  Split off from Notes/021 on 1999-01-31. ]

1998-07-02 stolfi
=================

  The goal of this note is to compare the word distributions among
  the various sections of the Voynich manuscript.

  [ NEEDS TO BE REDONE. A BUG IN lines-from-evt WOULD SPLIT WORDS AT "!"
    THUS GENERATING MANY FALSE n, y, and words ending in "ai". ]

I. EXTRACTING AND COUNTING WORDS

  The source file will be the majority version, with weirdos mapped
  to "*" and other basic EVA chars, and chopped into pages and
  subsections:

    ln -s ../045/subsecs-m text-subsecs
    ln -s ../045/pages-m text-pages

  However, I must filter those files, since they contain labels and
  other stuff besides plain text. Labels are useful, but here they
  would distort the analysis: pages with more labels would seem to be
  in a different language than pages with few labels.

    ln -s ../../L16+H-eva

    cat L16+H-eva/unit16e6.txt \
      | gawk -v FS=":" '/./{print $2,$6}' \
      > unit-to-type.tbl

    cat unit-to-type.tbl | gawk '//{print $2}' | sort | uniq

      -
      circular-lines
      circular-text
      labels
      letters
      parags
      radial-lines
      starred-parags
      titles
      words

  We will prepare two sets of statistics, one using raw words ("-RAW")
  and one using word equivalence classes ("-EQV").

    word-to-class -describe

      word equivalence: map_sh_to_ch ignore_gallows_eyes join_ei
        equate_aoy collapse_ii equate_eights equate_pt erase_q
        crush_invalid_words append_tilde

  This mapping will hopefully reduce noise in several ways. For one
  thing, it collapses pairs of characters that are similar and most
  easily misread or confused by image noise. It also reduces the
  sampling error, by increasing the number of occurrences of keywords
  in the page.
  Finally, it also neutralizes certain transcriber bias, such as the
  frequent misreading of "daiin" versus "dain" in Friedman's
  transcription of Bio (as pointed out by John Grove on 1998-05-02).

  One thing it does not fix is inconsistent transcription of spaces.

  Creating a combined file of the source text for archiving:

    ( cd text-pages && cat `cat all.names | sed -e 's/$/.evt/'` ) \
      | select-units \
          -v types='parags,starred-parags,circular-lines,circular-text,radial-lines,titles' \
          -v table=unit-to-type.tbl \
      > all.evt

  Selecting the plain text:

    foreach utype ( pages subsecs )
      foreach f ( `cat text-${utype}/all.names` )
        set ofile = "/tmp/${utype}-${f}.txt"
        echo ${ofile}
        cat text-${utype}/${f}.evt \
          | select-units \
              -v types='parags,starred-parags,circular-lines,circular-text,radial-lines,titles' \
              -v table=unit-to-type.tbl \
          | lines-from-evt | egrep '.' \
          > ${ofile}
      end
    end

  Extracting words and mapping them to classes:

    foreach ep ( word-to-clean.RAW word-to-class.EQV )
      set etag = ${ep:e}; set ecmd = ${ep:r}
      foreach utype ( pages subsecs )
        foreach f ( `cat text-${utype}/all.names` )
          set ofile = "/tmp/${utype}-${f}-${etag}.wds"
          echo ${ofile}
          cat /tmp/${utype}-${f}.txt \
            | words-from-evt | egrep '.' \
            | tr '*%' '??' \
            | ${ecmd} | egrep '.' \
            > ${ofile}
        end
      end
    end

  Counting words and computing relative frequencies:

    # mkdir -p RAW EQV
    # foreach etag ( RAW EQV )
    #   /bin/rm -rf ${etag}/wfreqs
    #   mkdir -p ${etag}/wfreqs
    #   foreach utype ( pages subsecs )
    #     set frdir = "${etag}/wfreqs/${utype}"
    #     mkdir -p ${frdir}
    #   end
    # end

    foreach etag ( RAW EQV )
      foreach utype ( pages subsecs )
        set frdir = "${etag}/wfreqs/${utype}"
        cp -p text-${utype}/all.names ${frdir}/
        foreach f ( `cat text-${utype}/all.names` )
          set ofile = "${frdir}/$f.frq"
          echo ${ofile}
          mv ${ofile} ${ofile}~
          cat /tmp/${utype}-${f}-${etag}.wds \
            | sort | uniq -c | expand \
            | sort -b +0 -1nr \
            | compute-freqs \
            > ${ofile}
        end
      end
    end

    /bin/rm /tmp/{pages,subsecs}-*{,-RAW,-EQV}.{txt,wds}

  Combining data by section instead of subsection:

    foreach etag ( RAW EQV )
      foreach sec ( `cat ${etag}/wfreqs/secs/all.names` )
        set ofile = "${etag}/wfreqs/secs/${sec}.frq"
        set ifiles = ( `cd ${etag}/wfreqs/subsecs/ && ls ${sec}.*.frq` )
        echo "$ifiles"
        mv ${ofile} ${ofile}~
        (cd ${etag}/wfreqs/subsecs/ && cat ${ifiles} ) \
          | gawk '/./{print $1, $3;}' \
          | combine-counts \
          | sort -b +0 -1nr \
          | compute-freqs \
          > ${ofile}
      end
    end

  Compute total frequencies:

    foreach etag ( RAW EQV )
      foreach utype ( pages subsecs secs )
        set fmt = "${etag}/wfreqs/${utype}/%s.frq"
        set frfiles = ( \
          `cat ${etag}/wfreqs/${utype}/all.names | gawk '/./{printf "'"${fmt}"'\n",$0;}'` \
        )
        echo ${frfiles}
        cat ${frfiles} \
          | gawk '/./{print $1, $3;}' \
          | combine-counts \
          | sort -b +0 -1nr \
          | compute-freqs \
          > ${etag}/wfreqs/${utype}/tot.frq
      end
    end

II. TABULATING WORD FREQUENCIES PER SUBSECTION

    cat text-subsecs/all.names

  (Edited ${subsectags} manually to match presumed writing order.)
    set subsectags = ( \
      pha.1 pha.2 \
      hea.1 hea.2 \
      unk.1 unk.2 \
      str.1 \
      cos.1 cos.2 \
      zod.1 \
      cos.3 \
      str.2 \
      heb.2 heb.1 \
      bio.1 \
      unk.3 unk.4 unk.5 unk.6 unk.7 unk.8 \
    )

    echo $subsectags | tr ' ' '\012' | sort > .foo
    diff text-subsecs/all.names .foo

    foreach etag ( RAW EQV )
      tabulate-frequencies \
        -dir ${etag}/wfreqs/subsecs \
        -title "word" \
        tot ${subsectags}
    end

  Frequencies of raw words (RAW/wfreqs/subsecs/all.cmp-frq)
  in each section (× 9999), minus the "unk"s:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  236  572  364  543  230  119  108  140   89  158  120  215  230  120 daiin
  146  129  175   65  126   52    .   80   29  260  101  197   75  352 ol
  137   10    .    1    .    .    .   26   39   90  179  143  203  319 chedy
  123  172  119   29   46  132  108  120  277  271  173  179  199   54 aiin
  116   10    .    .    .   13    .    6   49  113  109   71  110  367 shedy
  107  140  231  297  126  105    .   93   19   33   63   71   30   21 chol
   97   97  175   66  115   52    .   53   19  361   59  215  192   98 or
   96   10   13   21   22  291  108  221  217  203  136   71  151   35 ar
   93  107   84   75   22   13    .   80   59   11  115   53   58  137 chey
   90  140   63   90   69   66   54  154   79  135   46  107  155  105 dar
   81    .    .    .    .    .    .    .    .   11  126   17   34  224 qokeedy
   81  107   77   14   22   79    .   33    9   22  146   53   20  125 qokeey
   76   21  126   45   57   13    .   80   49    .   80   17   24  144 shey
   74   53   84  148   11   66    .  127   89   45   11   71  124   79 dy
   72   32   35    8   34  132    .  154  287  113  127   17   37   35 al
   71   10    6   23    .   79    .    .    .   45  109  125   44  125 qokaiin
   71   10    .    .    .    .    .    .    .   22   54   17  134  238 qokedy
   70  280   49   58   34  158  108  140   79   22   27   35   82   99 dal
   64  183   84  133  357   79  162   67   89   56   10   53   72   21 s
   58    .    .    1    .   13    .    .    .   11   56   17   10  219 qokain
   56  118   77  199   80   26    .   33    .    .   16   17   20    1 chor
   56   32    6   37   11   13    .   20   39   22   88  107   93   48 okaiin
   52   10   13    2    .  132    .   46    .   33   38   89   17  163 qokal
   50   21   49  133   80   66    .   33   39   33   23   17   37   26 shol
   48   86   28  104   22    .    .   20    .   11   27   17   27   65 dain
   46   53   84   20   22    .    .   40   49   11   69   17   13   61 cheey
   45  151  161   26   80   39    .   26   29    .   51    .   20   45 cheol
   44   43  140   10   11   13   54   60   39    .   87    .   27   24 okeey
   42   32    6    2    .    .    .    .    .    .   28    .    3  175 qol
   40   43    6  138   11   26    .   60   39   22   10   53   24   13 chy
   40   10   13   39    .   26    .    6   89   45   69   53   20   17 otaiin
   40    .    .    .    .    .    .    6    9  124   56   35   68   71 otedy
   40   32   13    4    .   92   54    .    .   11   38  107   75   65 qokar
   39   10    .    8    .   13    .   26   19   67   37  323  120   33 chdy
   39   21   28   34   22   52    .    .    .   11   25  107   17   89 qoky
   39   86   13   30  103   39   54    6    9   11   38   17   58   54 saiin
   39   64   63   26   22   39    .   40    .   56   46    .   20   55 sheey
   37   21   13   21    .    .    .    6    .   11   36  107  103   61 chckhy
   37   21   35   20   11  132  108   26   39   67   38  125   61   32 okal
   37   21    .    7    .   52    .   33   59   79   50  107   51   35 otar
   37   10    .   49   11   79  486   80   69   33   19   71   48   33 y
   36   21   13   13   11  119   54   20   39    .   51   71   41   30 otal
   35    .   35    5   46    .    .   87  148   67   56   71    6   23 oteey
   34   43   13    5   11   52    .   60   19   33   37  125   86   27 okar
   33    .   49  126   34   26    .   33   19   11   11   17    3    . sho
   31    .    .    .    .   13    .    .    .    .   51   17    3   86 lchedy
   30   10   20  136   46    .    .   20    .    .    .    .   10    1 cthy
   30   53   91   55   57    .    .   20    .   11    7   35   13   45 dol
   30    .    .    5    .   26    .    .    .   11   48   35   34   59 okain
    .    .    .    .    .    .    .    .    .    .    .    .    .    . ...
  603  734  799  203  483 1085  918  851 3049 1888  611  843  255  393 ?

  Unknown sections:

  tot unk1 unk2 unk3 unk4 unk5 unk6 unk7 unk8 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  236  328  642  212  264  175  102  154    . daiin
  146   46    .    .  231  204  163  180    . ol
  137    .    .  212  165  116  143   51    . chedy
  123    .    .    .   33  204  143  413    . aiin
  116    .    .  212  198   58   40   77    . shedy
  107  375  214    .    .   58  102   25    . chol
   97  140    .    .    .  146  265  232    . or
   96   46    .    .   99    .  204  258    . ar
   93   93   71    .   33   29  102  103    . chey
   90   93   71    .  132  175  224   51    . dar
   81    .    .    .   33   29   40    .    . qokeedy
   81    .    .    .   33   29   40   25    . qokeey
   76  140    .    .   66  146   61  103    . shey
   74    .  285  425   33   58   61   77    . dy
   72    .    .    .   33   58   20  103    . al
   71    .    .    .    .   29  245   25    . qokaiin
   71    .    .    .   33   87    .    .    . qokedy
   70   93   71    .  165  116  102   51    . dal
   64   93   71    .    .   58    .    .    . s
   58    .    .    .    .    .   20    .    . qokain
   56  140  357    .    .   87   20   25    . chor
   56   46    .    .   33   87   61   25    . okaiin
   52    .    .    .   66   29   61   25    . qokal
   50  140  214    .   66    .   40    .    . shol
   48  281  142    .    .    .    .    .    . dain
   46    .   71    .    .   87    .   51    . cheey
   45    .   71    .   33   29   20    .    . cheol
   44    .    .    .    .    .    .    .    . okeey
   42    .    .    .    .    .    .    .    . qol
   40   46   71    .    .   29   20    .    . chy
   40    .    .    .    .   29   40  154    . otaiin
   40    .    .    .    .  175    .    .    . otedy
   40    .    .    .   66   58  224  154    . qokar
   39   46    .    .   99   58   61   51    . chdy
   39    .    .    .  165    .   61   25    . qoky
   39    .    .    .    .   58   20    .    . saiin
   39    .    .    .    .    .   81    .    . sheey
   37    .    .    .   33    .    .   25    . chckhy
   37    .    .    .   33   29   20   25    . okal
   37    .    .    .    .  116  143   77    . otar
   37  187    .  212    .    .    .   25    . y
   36    .    .    .    .   58  102  180    . otal
   35   46    .    .   33    .    .   25    . oteey
   34    .    .    .   33   58  102   51    . okar
   33   93  214    .    .    .    .    .    . sho
   31    .    .    .   33    .    .    .    . lchedy
   30  140   71    .    .    .    .   25    . cthy
   30    .   71    .   66   58    .    .    . dol
   30   46    .    .    .    .    .    .    . okain
    .    .    .    .    .    .    .    .    . ...
  603  516  285  638  198  964 1165  775    . ?

  Frequencies of word classes (EQV/wfreqs/subsecs/all.cmp-frq)
  in each section (× 9999), minus unknowns:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  353   97   35  190   34  317   54   46  158  339  494  502  255  574 otoin~
  294  669  406  675  253  119  108  167   89  169  153  233  261  188 doin~
  261  194  217   77  161  185    .  234  316  373  258  215  120  563 ol~
  255   21    .    1    .   13    .   33   89  203  289  215  317  686 chedo~
  242  345  231  203  126  529  216  214  128  124  201  359  213  342 otol~
  212  205  350   55  161  105   54  248  267  192  367  125   99  200 oteeo~
  201  215  280  139  172   39    .  227  148   22  244   71  106  284 cheo~
  200  107  189   97  138  357  108  274  237  588  202  287  344  143 or~
  194  151   91  116  115  317   54  194  118  226  192  466  330  172 otor~
  179   10    .    .    .    .    .   20    9  180  170  125  410  464 otedo~
  175  161  280  441  207  264    .  147   79   67  114   89   68   68 chol~
  168    .    6    1   11    .    .   26   59   79  269   89  141  386 oteedo~
  160  237  147   40   92  145  108  140  277  328  228  179  241  103 oin~
  142   64   63  221  115  132  108  120  118  113   85  341  134  181 oto~
  120   97   91  399  103   66    .  160   79   45   38   89   65   32 cho~
  110  151   98  336   92  145    .   46    9   56   61   89   48   21 chor~
  106  151  119  142   69   66   54  154   79  135   49  125  168  112 dor~
  101  334  140  115   92  158  108  160   79   33   35   71   96  144 dol~
   99   21   28  329  115   13    .   80   19   45   49  125   48   21 otcho~
   91  118  154   46   46   39    .   87   49   67  133   17   41  117 cheeo~
   87  237  217   39  195  158    .   60   79    .  102    .   37   98 cheol~
   85   32   13   66   11    .    .   20    9   33   87  197  130  155 chctho~
   83  107  168   24   34  145  162  140  178   45   88    .   79  115 oteo~
   79   53   91  165   11   66    .  134   89   56   12   71  130   81 do~
   74    .    .    1    .   13    .    6   19  101  142   71   82   95 otchedo~
   64   53   20   69   34  105 1189  140  148  101   37   71   55   55 o~
   64  183   84  133  357   79  162   67   89   56   10   53   72   21 s~
   60  107   49   49  115   39   54   20    9   33   60   17   58  101 soin~
   59  237  112   75  115   66    .   60   49   11   48   17   24   48 cheor~
   59   10   42   80   46   13   54   53   49  113   68   53   37   39 otcheo~
   54   43   35  214  138   39    .   20    9    .    4    .   34    4 ctho~
   54  140  385   33   80   66    .   67   49   45   52   17   44   13 oteol~
   52   10    .   11    .   13    .   46   19   90   49  376  168   43 chdo~
   51   10    6    4    .    .    .   40    .   79   62  161  155   42 otchdo~
   49   21    6   49   11   39    .    .   19   11   66  161   41   45 toin~
   46   32   70   53   57  132    .   46   29   22   29    .   34   62 tol~
   43   32   56   27   46   79    .    6    .   11   38  161  110   33 tor~
   42    .    .    .    .   13    .    .    .    .   64   35    3  121 lchedo~
   42   53   49   65   11   39    .   26    9  169   44   71   41    4 odoin~
   42   43    .   49   46   52    .    6   19   56   49   17   99   13 otod~
   40   75   42   56  161   13    .   33    .   22   34   89   68    1 chodo~
   38   21    6    .    .    .    .    .    9   22   49   17   30   99 cheedo~
   37   32   28   20   22   52  108   46   19    .   21   35  113   51 cheto~
   36  172   70   11   57   13    .   67   39   56   44   35   65    . cheodo~
   36   97   35   45   46   39   54  154   29   33   29    .   30    5 doir~
   35  151  126    4   57   13    .  120  128   33   24    .   61    . oteodo~
   34   10    .    4    .    .    .    .    .   11   34    .   48  106 chectho~
   34   43   28  131   80   13    .   33    9   22    8   17    3    2 otchol~
   34   53   70   24  103    .    .   46    9    .   12   17    3   93 sol~
    .    .    .    .    .    .    .    .    .    .    .    .    .    . ...
  603  734  799  203  483 1085  918  851 3049 1888  611  843  255  393 ?~

  tot unk1 unk2 unk3 unk4 unk5 unk6 unk7 unk8 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  353  234    .    .   66  175  613  620    . otoin~
  294  657  785  212  264  175  102  154    . doin~
  261   46    .    .  264  263  183  284    . ol~
  255    .    .  425  364  175  183  129    . chedo~
  242  140   71  212  231  263  388  490    . otol~
  212   46    .  212   66   58   61  180    . oteeo~
  201  281  285    .  132  175  163  206    . cheo~
  200  187    .    .   99  146  470  490    . or~
  194    .    .  212  165  467  633  671    . otor~
  179    .    .    .  198  380   61   51    . otedo~
  175  516  428    .   66   58  163   25    . chol~
  168    .    .    .  132   58   81   25    . oteedo~
  160   46    .    .   33  204  163  413    . oin~
  142   93    .  212  231   29  183  258    . oto~
  120  187  571    .   66   29   20   51    . cho~
  110  187  357    .   66  116   81  103    . chor~
  106  140   71    .  231  175  224   51    . dor~
  101   93  142    .  231  175  102   51    . dol~
   99   46  285    .  132  116   81  103    . otcho~
   91    .   71    .   33   87   81   51    . cheeo~
   87    .   71    .  198   58   81   25    . cheol~
   85    .  214  212  132   29    .   51    . chctho~
   83    .    .  425   99    .    .   25    . oteo~
   79    .  285  425   33   58   61   77    . do~
   74    .    .  212  165  204  102    .    . otchedo~
   64  234    .  425    .    .    .   25    . o~
   64   93   71    .    .   58    .    .    . s~
   60    .    .    .    .   58   20    .    . soin~
   59   93   71    .   99    .   61    .    . cheor~
   59   93    .  425   66   87   81   77    . otcheo~
   54  234   71  212    .    .   40   25    . ctho~
   54   46   71    .    .   29    .    .    . oteol~
   52   46    .    .  165   87   81   77    . chdo~
   51    .    .    .   66  438  122   25    . otchdo~
   49   93    .    .    .    .  183  154    . toin~
   46   46   71  212   33    .   61  103    . tol~
   43  140    .    .    .   58  102  154    . tor~
   42    .    .    .   99    .    .    .    . lchedo~
   42  140    .    .   33   58   40  103    . odoin~
   42    .    .    .   66   29   40  180    . otod~
   40  140    .    .  132   29   20  154    . chodo~
   38    .    .    .   66  116    .    .    . cheedo~
   37   46   71  212  132    .    .   51    . cheto~
   36    .    .  425   99   29   20   77    . cheodo~
   36    .    .    .  132   87   40    .    . doir~
   35    .    .  212  231  116   20   51    . oteodo~
   34    .    .    .   33    .   20    .    . chectho~
   34   46    .    .    .   29    .   25    . otchol~
   34    .   71    .    .    .    .   51    . sol~
    .    .    .    .    .    .    .    .    . ...
  603  516  285  638  198  964 1165  775    . ?~

III. CLASSIFYING THE WORDS

  Let's manually sort the words according to their relative
  frequencies over the subsecs. We will exclude the "unk" section
  for clarity.

  Words with fairly uniform frequencies:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  236  572  364  543  230  119  108  140   89  158  120  215  230  120 daiin
  146  129  175   65  126   52    .   80   29  260  101  197   75  352 ol
  123  172  119   29   46  132  108  120  277  271  173  179  199   54 aiin
   97   97  175   66  115   52    .   53   19  361   59  215  192   98 or
   93  107   84   75   22   13    .   80   59   11  115   53   58  137 chey
   90  140   63   90   69   66   54  154   79  135   46  107  155  105 dar
   76   21  126   45   57   13    .   80   49    .   80   17   24  144 shey

  Words that are almost specific to herbal-A:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
   50   21   49  133   80   66    .   33   39   33   23   17   37   26 shol
   48   86   28  104   22    .    .   20    .   11   27   17   27   65 dain
   30   10   20  136   46    .    .   20    .    .    .    .   10    1 cthy
   40   43    6  138   11   26    .   60   39   22   10   53   24   13 chy
   33    .   49  126   34   26    .   33   19   11   11   17    3    . sho

  Words almost specific to Pharma:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
   45  151  161   26   80   39    .   26   29    .   51    .   20   45 cheol

  Words almost specific to language A:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  107  140  231  297  126  105    .   93   19   33   63   71   30   21 chol
   64  183   84  133  357   79  162   67   89   56   10   53   72   21 s
   56  118   77  199   80   26    .   33    .    .   16   17   20    1 chor

  Words more common in language A:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----

  Words almost specific to herbal-B:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
   39   10    .    8    .   13    .   26   19   67   37  323  120   33 chdy
   40   32   13    4    .   92   54    .    .   11   38  107   75   65 qokar
   34   43   13    5   11   52    .   60   19   33   37  125   86   27 okar
   39   21   28   34   22   52    .    .    .   11   25  107   17   89 qoky
   37   21    .    7    .   52    .   33   59   79   50  107   51   35 otar

  Words almost specific to the Biological section:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
   42   32    6    2    .    .    .    .    .    .   28    .    3  175 qol

  Words almost specific to language B:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  137   10    .    1    .    .    .   26   39   90  179  143  203  319 chedy
  116   10    .    .    .   13    .    6   49  113  109   71  110  367 shedy
   81    .    .    .    .    .    .    .    .   11  126   17   34  224 qokeedy
   71   10    .    .    .    .    .    .    .   22   54   17  134  238 qokedy
   58    .    .    1    .   13    .    .    .   11   56   17   10  219 qokain
   31    .    .    .    .   13    .    .    .    .   51   17    3   86 lchedy
   40    .    .    .    .    .    .    6    9  124   56   35   68   71 otedy

  Words more common in language B:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
   71   10    6   23    .   79    .    .    .   45  109  125   44  125 qokaiin
   56   32    6   37   11   13    .   20   39   22   88  107   93   48 okaiin
   37   21   13   21    .    .    .    6    .   11   36  107  103   61 chckhy
   30    .    .    5    .   26    .    .    .   11   48   35   34   59 okain

  Words more common in the Cosmo section:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
   37   10    .   49   11   79  486   80   69   33   19   71   48   33 y

  Words more common in the Stars section:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
   96   10   13   21   22  291  108  221  217  203  136   71  151   35 ar

  Words more common in the Zodiac section:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
   72   32   35    8   34  132    .  154  287  113  127   17   37   35 al
   35    .   35    5   46    .    .   87  148   67   56   71    6   23 oteey

  Words more common in the Stars-1 section:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
   36   21   13   13   11  119   54   20   39    .   51   71   41   30 otal

  Words with peculiar distributions:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
   74   53   84  148   11   66    .  127   89   45   11   71  124   79 dy
   70  280   49   58   34  158  108  140   79   22   27   35   82   99 dal
   81  107   77   14   22   79    .   33    9   22  146   53   20  125 qokeey
   52   10   13    2    .  132    .   46    .   33   38   89   17  163 qokal
   46   53   84   20   22    .    .   40   49   11   69   17   13   61 cheey
   44   43  140   10   11   13   54   60   39    .   87    .   27   24 okeey
   40   10   13   39    .   26    .    6   89   45   69   53   20   17 otaiin
   39   86   13   30  103   39   54    6    9   11   38   17   58   54 saiin
   39   64   63   26   22   39    .   40    .   56   46    .   20   55 sheey
   37   21   35   20   11  132  108   26   39   67   38  125   61   32 okal
   30   53   91   55   57    .    .   20    .   11    7   35   13   45 dol

  There seems to be no simple pattern for the differences, except
  that words ending with are almost specific to language B. The
  word seems specific to herbal-B (not to other language B
  subsecs), and , , to herbal-A (not to Pharma).

  Let's now do the same with the word classes:

  Classes with fairly uniform distribution:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----

  Classes almost exclusive to language A:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----

  Classes more common in language A:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----

  Classes almost exclusive to language B:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----

  Classes more common in language B:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----

  Classes more common in the Cosmo section:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----

  Classes more common in the Zodiac section:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----

  Classes to sort:

  tot pha1 pha2 hea1 hea2 str1 cos1 cos2 zod1 cos3 str2 heb2 heb1 bio1 word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  353   97   35  190   34  317   54   46  158  339  494  502  255  574 otoin~
  294  669  406  675  253  119  108  167   89  169  153  233  261  188 doin~
  261  194  217   77  161  185    .  234  316  373  258  215  120  563 ol~
  255   21    .    1    .   13    .   33   89  203  289  215  317  686 chedo~
  242  345  231  203  126  529  216  214  128  124  201  359  213  342 otol~
  212  205  350   55  161  105   54  248  267  192  367  125   99  200 oteeo~
  201  215  280  139  172   39    .  227  148   22  244   71  106  284 cheo~
  200  107  189   97  138  357  108  274  237  588  202  287  344  143 or~
  194  151   91  116  115  317   54  194  118  226  192  466  330  172 otor~
  179   10    .    .    .    .    .   20    9  180  170  125  410  464 otedo~
  175  161  280  441  207  264    .  147   79   67  114   89   68   68 chol~
  168    .    6    1   11    .    .   26   59   79  269   89  141  386 oteedo~
  160  237  147   40   92  145  108  140  277  328  228  179  241  103 oin~
  142   64   63  221  115  132  108  120  118  113   85  341  134  181 oto~
  120   97   91  399  103   66    .  160   79   45   38   89   65   32 cho~
  110  151   98  336   92  145    .   46    9   56   61   89   48   21 chor~
  106  151  119  142   69   66   54  154   79  135   49  125  168  112 dor~
  101  334  140  115   92  158  108  160   79   33   35   71   96  144 dol~
   99   21   28  329  115   13    .   80   19   45   49  125   48   21 otcho~
   91  118  154   46   46   39    .   87   49   67  133   17   41  117 cheeo~
   87  237  217   39  195  158    .   60   79    .  102    .   37   98 cheol~
   85   32   13   66   11    .    .   20    9   33   87  197  130  155 chctho~
   83  107  168   24   34  145  162  140  178   45   88    .   79  115 oteo~
   79   53   91  165   11   66    .  134   89   56   12   71  130   81 do~
   74    .    .    1    .   13    .    6   19  101  142   71   82   95 otchedo~
   64   53   20   69   34  105 1189  140  148  101   37   71   55   55 o~
   64  183   84  133  357   79  162   67   89   56   10   53   72   21 s~
   60  107   49   49  115   39   54   20    9   33   60   17   58  101 soin~
   59  237  112   75  115   66    .   60   49   11   48   17   24   48 cheor~
   59   10   42   80   46   13   54   53   49  113   68   53   37   39 otcheo~
   54   43   35  214  138   39    .   20    9    .    4    .   34    4 ctho~
   54  140  385   33   80   66    .   67   49   45   52   17   44   13 oteol~
   52   10    .   11    .   13    .   46   19   90   49  376  168   43 chdo~
   51   10    6    4    .    .    .   40    .   79   62  161  155   42 otchdo~
   49   21    6   49   11   39    .    .   19   11   66  161   41   45 toin~
   46   32   70   53   57  132    .   46   29   22   29    .   34   62 tol~
   43   32   56   27   46   79    .    6    .   11   38  161  110   33 tor~
   42    .    .    .    .   13    .    .    .    .   64   35    3  121 lchedo~
   42   53   49   65   11   39    .   26    9  169   44   71   41    4 odoin~
   42   43    .   49   46   52    .    6   19   56   49   17   99   13 otod~
   40   75   42   56  161   13    .   33    .   22   34   89   68    1 chodo~
   38   21    6    .    .    .    .    .    9   22   49   17   30   99 cheedo~
   37   32   28   20   22   52  108   46   19    .   21   35  113   51 cheto~
   36  172   70   11   57   13    .   67   39   56   44   35   65    . cheodo~
   36   97   35   45   46   39   54  154   29   33   29    .   30    5 doir~
   35  151  126    4   57   13    .  120  128   33   24    .   61    . oteodo~
   34   10    .    4    .    .    .    .    .   11   34    .   48  106 chectho~
   34   43   28  131   80   13    .   33    9   22    8   17    3    2 otchol~
   34   53   70   24  103    .    .   46    9    .   12   17    3   93 sol~
    .    .    .    .    .    .    .    .    .    .    .    .    .    . ...
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----

1999-07-26 stolfi
=================

IV. TABULATING WORD FREQUENCIES PER SECTION

    cat text-secs/all.names

  (Edited ${sectags} manually to match presumed writing order.)

    set sectags = ( \
      pha \
      hea \
      str \
      cos \
      zod \
      heb \
      bio \
      unk \
    )

    echo $sectags | tr ' ' '\012' | sort > .foo
    diff RAW/wfreqs/secs/all.names .foo

    foreach etag ( RAW EQV )
      tabulate-frequencies \
        -dir ${etag}/wfreqs/secs \
        -title "word" \
        tot ${sectags}
    end

  tot  pha  hea  str  cos  zod  heb  bio  unk word
 ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  603  773  235  642 1214 3049  349  393  749 ?
  236  446  508  120  144   89  228  120  218 daiin
  146  157   72   98  136   29   95  352  156 ol
  137    4    1  167   46   39  193  319   98 chedy
  123  140   30  170  171  277  196   54  161 aiin
  116    4    .  103   42   49  104  367   72 shedy
  107  195  277   65   66   19   37   21   98 chol
   97  144   72   58  156   19  196   98  156 or
   96   12   21  146  206  217  138   35  124 ar
   93   93   69  108   50   59   57  137   72 chey
   90   93   87   47  140   79  147  105  135 dar
   81    .    .  117    3    .   31  224   20 qokeedy
   81   89   15  142   27    9   25  125   25 qokeey
   76   84   46   76   46   49   23  144   88 shey
   74   72  133   14   89   89  115   79   78 dy
   72   33   11  127  128  287   34   35   41 al
   71    8   20  107   15    .   57  125   72 qokaiin
   71    4    .   51    7    .  115  238   20 qokedy
   70  140   55   36   97   79   75   99   98 dal
   64  123  158   14   70   89   69   21   25 s
   58    .    1   53    3    .   11  219    5 qokain
   56   93  186   17   19    .   20    1   67 chor
   56   16   34   83   19   39   95   48   46 okaiin
   52   12    2   45   39    .   28  163   36 qokal
   50   38  127   25   31   39   34   26   52 shol
   48   50   95   25   15    .   25   65   41 dain
   46   72   20   65   27   49   14   61   31 cheey
   45  157   32   51   15   29   17   45   20 cheol
   44  101   10   82   39   39   23   24    . okeey
   42   16    2   26    .    .    2  175    . qol
   40   21  124   11   42   39   28   13   20 chy
   40   12   34   66   19   89   25   17   46 otaiin
   40    .    .   52   46    9   63   71   31 otedy
   40   21    3   42    7    .   80   65  109 qokar
   39    4    7   35   39   19  153   33   57 chdy
   39   25   33   27    3    .   31   89   46 qoky
   39   42   38   38   11    9   52   54   15 saiin
   39   63   25   45   42    .   17   55   20 sheey
   37   16   19   33    7    .  104   61   10 chckhy
   37   29   19   45   46   39   72   32   20 okal
   37    8    6   50   46   59   60   35   72 otar
   37    4   45   23   93   69   52   33   31 y
   36   16   12   55   15   39   46   30   72 otal
   35   21   10   52   74  148   17   23   15 oteey
   34   25    6   38   46   19   92   27   52 okar
   33   29  116   12   23   19    5    .   25 sho
   31    .    .   48    .    .    5   86    5 lchedy
   30   16  126    .   11    .    8    1   25 cthy
   30   76   55    6   15    .   17   45   25 dol
   30    .    5   46    3    .   34   59    5 okain
    .    .    .    .    .    .    .    .    . ...
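
  For readers without the helper scripts, here is a minimal sketch of
  how tables like the ones above can be produced from per-section word
  counts. It is written in Python rather than csh/gawk, and it is not
  the actual code of compute-freqs, combine-counts or
  tabulate-frequencies, whose sources are not reproduced in this note.
  The function names are stand-ins, the truncation (rather than
  rounding) of the scaled values is an assumption, and the word lists
  are toy data invented for the example. Only the 9999 scale factor
  and the "." convention for absent words come from the tables.

```python
from collections import Counter

SCALE = 9999  # scale factor quoted in the table captions of this note


def scaled_freqs(counts, scale=SCALE):
    """Map each word to its relative frequency times `scale`, truncated.

    Truncation is an assumption; the real compute-freqs may round instead.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: (n * scale) // total for word, n in counts.items()}


def combine_counts(counters):
    """Add up several word counters (stand-in for combine-counts)."""
    combined = Counter()
    for c in counters:
        combined.update(c)
    return combined


def tabulate(sections, order):
    """Return rows (tot, [per-section values], word), most frequent first.

    A word absent from a section yields None, printed as "." below.
    """
    total = combine_counts(sections[name] for name in order)
    tot_f = scaled_freqs(total)
    sec_f = {name: scaled_freqs(sections[name]) for name in order}
    rows = []
    for word in sorted(total, key=lambda w: (-total[w], w)):
        cells = [sec_f[name].get(word) for name in order]
        rows.append((tot_f[word], cells, word))
    return rows


def format_row(tot, cells, word):
    """Format one row with right-aligned 4-character columns."""
    cols = [str(tot)] + ["." if c is None else str(c) for c in cells]
    return " ".join(f"{c:>4}" for c in cols) + " " + word


# Toy word lists, invented for illustration only (not manuscript data).
sections = {
    "hea": Counter(["daiin", "chol", "daiin", "chor"]),
    "bio": Counter(["chedy", "shedy", "chedy", "ol", "daiin"]),
}
rows = tabulate(sections, ["hea", "bio"])
for row in rows:
    print(format_row(*row))
```

  Note that the "tot" column here is computed from the pooled counts,
  not by averaging the per-section columns, so long sections weigh
  more than short ones. That matches the way tot.frq is built above
  (counts are combined first, then frequencies are computed).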