Hacking at the Voynich manuscript - Side notes
110 Comparative word rank-frequency (Zipf law) plots

Last edited on 2023-05-14 19:26:50 by stolfi

INTRODUCTION

  In this note we produce the rank-frequency (Zipf law) plots
  for the words of Voynichese and other languages.

SETTING UP THE ENVIRONMENT

  Links:

    ln -s ../../../work
    ln -s work/compute-freqs

  Data, figure, and export directories:

    ln -s ../tr-stats/dat
    ln -s ../tr-stats/exp
    ln -s ../tr-stats/fig

CREATING DESCRIPTIONS FOR ALL SAMPLE TEXTS AND SECTIONS

  For every language {lang}, book {book}, and section {sec}, we create
  files "dat/{lang}/caption.wik", "dat/{lang}/{book}/caption.wik", and
  "dat/{lang}/{book}/{sec}/caption.wik" with descriptions of the items
  in Wikipedia markup. These files will be combined to make the
  descriptions "fig/{oname}.wik" for uploading the Zipf plots
  "fig/{oname}.svg" to the Wikimedia Commons.

    create-book-captions.sh

COMPARING THE ZIPF LAW PLOTS WITH OTHER LANGS

  Preparing plots of word frequency vs. word frequency rank:

    generate-all-zipf-plots.sh -format svg -show 1
    generate-all-zipf-plots.sh -format eps -show 1
    generate-all-zipf-plots.sh -format png -show 1
    make-zipf-index > zipf-index.html

  Exporting the plots:

    cp -avu fig/zipf-* exp/

EXTRACTING THE TOP N WORDS

  Extract the N most frequent words of each language:

    ntop=6
    for f in \
        chin/ptt/tot.1 \
        chin/red/tot.1 \
        engl/cul/tot.1 \
        engl/wow/tot.1 \
        arab/quv/tot.1 \
        geez/gok/tot.1 \
        grek/nwt/tot.1 \
        latn/ptt/tot.1 \
        span/qvi/tot.1 \
        tibe/ccv/tot.1 \
        tibe/vim/tot.1 \
        viet/ptt/tot.1 \
        voyn/prs/bio.1 \
        voyn/prs/hea.1 \
        voyn/prs/heb.1 \
        voyp/grs/tot.1 \
        voyp/grm/tot.1 \
      ; do
        ifile="dat/$f/gud.wfr"
        ofile="dat/$f/.top"
        cat ${ifile} | sort -b -k1,1nr | head -n ${ntop} > ${ofile}
        echo -n "${f} "
        cat ${ofile} | condense-popular-words
      done

    chin/ptt/tot.1 de5(0.054) ni3(0.019) ta1(0.019) he2(0.018) men5(0.018) ren2(0.015)
    chin/red/tot.1 le5(0.019) yi1(0.019) bu4(0.017) ren2(0.014) de5(0.013) lai2(0.011)
    engl/cul/tot.1 the(0.079) and(0.048) of(0.037) in(0.026) to(0.020) it(0.019)
    engl/wow/tot.1 the(0.083) and(0.042) of(0.038) a(0.028) to(0.019) in(0.017)
    arab/quv/tot.1 mîn(0.030) fîy(0.014) mâa(0.013) alllâhî(0.012) allâ£îynâ(0.011) alllâhû(0.011)
    geez/gok/tot.1 keme(0.017) 'Igzi'AbHEr(0.014) 'Isme(0.012) wste(0.010) 'Inze(0.005) `hebe(0.005)
    grek/nwt/tot.1 kai(0.073) o(0.028) de(0.020) tou(0.018) en(0.017) autou(0.015)
    latn/ptt/tot.1 et(0.075) in(0.031) ad(0.016) est(0.015) dominus(0.011) de(0.011)
    span/qvi/tot.1 que(0.054) de(0.049) y(0.047) a(0.025) la(0.025) el(0.024)
    tibe/ccv/tot.1 PA(0.071) LA(0.032) YIN(0.028) BA(0.028) MA(0.026) PA'I(0.026)
    tibe/vim/tot.1 PA(0.054) DANG(0.028) DE(0.027) PAR(0.023) LA(0.021) BA(0.021)
    viet/ptt/tot.1 va`(0.020) ngu+o+`i(0.019) cu?a(0.019) ca'c(0.015) ngu+o+i(0.015) ddu+'c(0.014)
    voyn/prs/bio.1 shedy(0.038) ol(0.037) chedy(0.033) qokedy(0.025) qokeedy(0.023) qokain(0.023)
    voyn/prs/hea.1 daiin(0.056) chol(0.030) chor(0.020) dy(0.015) chy(0.014) cthy(0.014)
    voyn/prs/heb.1 daiin(0.024) chedy(0.021) aiin(0.021) or(0.020) dar(0.016) ar(0.016)
    voyp/grs/tot.1 chedy(0.024) kdy(0.023) shedy(0.022) chey(0.021) ky(0.018) kedy(0.015)
    voyp/grm/tot.1 ol(0.034) ky(0.030) chey(0.023) daiin(0.023) qol(0.018) shedy(0.018)

  Actually the "liao3" in Chinese samples should be "le5".

  To print in full:

    # echo " --- ${f} ----------------"
    # cat ${ofile} | format-popular-words
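
  The script "condense-popular-words" is not listed in this note. For
  readers who want to reproduce one-line listings in the "word(freq)"
  form shown above, here is a minimal sketch of what such a filter
  might look like. It ASSUMES that each line of a ".wfr" file has three
  whitespace-separated fields (COUNT FREQ WORD); that layout is a guess
  suggested by the "sort -k1,1nr" step, not something stated here. The
  function names "top_words" and "condense_top_words" are hypothetical.

```shell
#!/bin/bash
# Hypothetical stand-ins for the top-N extraction and for
# "condense-popular-words"; the real script is not shown in the note.
# ASSUMPTION: each input line reads "COUNT FREQ WORD".

# Keep the ${1} lines with the highest count (field 1, numeric, descending).
top_words() {
  local ntop="$1"
  LC_ALL=C sort -b -k1,1nr | head -n "${ntop}"
}

# Join all input lines into a single line of "WORD(FREQ)" items,
# with the frequency printed to three decimal places.
condense_top_words() {
  LC_ALL=C awk '
    { printf "%s%s(%.3f)", (NR > 1 ? " " : ""), $3, $2 }
    END { if (NR > 0) printf "\n" }
  '
}

# Example use (path taken from the loop above):
#   top_words 6 < dat/engl/wow/tot.1/gud.wfr | condense_top_words
```

  The two steps are kept as separate filters so that the intermediate
  top-N list can still be saved to a file, as the loop above does.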