Hacking at the Voynich manuscript - Side notes
110 Comparative word rank-frequency (Zipf law) plots

Last edited on 2023-05-14 19:26:50 by stolfi

INTRODUCTION

  In this note we produce the rank-frequency (Zipf law)
  plots for the words of Voynichese and other languages.
  
SETTING UP THE ENVIRONMENT

  Links:
  
    ln -s ../../../work 

    ln -s work/compute-freqs

  Data, figure, and export directories:

    ln -s ../tr-stats/dat
    ln -s ../tr-stats/exp
    ln -s ../tr-stats/fig
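
  Since "ln -s" succeeds even when its target does not exist, it may be
  worth checking that the links resolve before running the later steps.
  The following is only a sketch; the function name "check_links" is
  made up for this note and is not one of the project's scripts.

```shell
# Hypothetical helper (not part of the original tool set):
# reports every given path that does not resolve to an existing
# file or directory. A dangling symbolic link counts as missing,
# because "[ -e ]" follows links.
check_links() {
  local bad=0 p
  for p in "$@"; do
    if [ ! -e "${p}" ]; then
      echo "** broken or missing: ${p}" 1>&2
      bad=1
    fi
  done
  return ${bad}
}

# Example call for the links created above:
#   check_links work compute-freqs dat exp fig
```

  The function returns a nonzero status if any link is broken, so it can
  guard the rest of a script.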

CREATING DESCRIPTIONS FOR ALL SAMPLE TEXTS AND SECTIONS

  For every language {lang}, book {book}, and section {sec}, we create files
  "dat/{lang}/caption.wik", "dat/{lang}/{book}/caption.wik", and
  "dat/{lang}/{book}/{sec}/caption.wik", each containing a description
  of the corresponding item in Wikipedia markup.
  
  These files will be combined to make the descriptions "fig/{oname}.wik"
  that accompany the Zipf plots "fig/{oname}.svg" when they are uploaded
  to Wikimedia Commons.

    create-book-captions.sh
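
  As an illustration of how the three caption levels might be combined,
  here is a minimal sketch. It ASSUMES that combining means plain
  concatenation in the order language, book, section; the actual scripts
  may do something else. The function name "combine_captions" is
  hypothetical, and the mapping from {lang}/{book}/{sec} to {oname} is
  not specified here, so the output goes to standard output.

```shell
# Hypothetical helper, NOT the author's script.
# ASSUMPTION: the combined description is the plain concatenation of
# the language, book, and section caption files, in that order.
combine_captions() {
  local dat="$1" lang="$2" book="$3" sec="$4" f
  for f in \
    "${dat}/${lang}/caption.wik" \
    "${dat}/${lang}/${book}/caption.wik" \
    "${dat}/${lang}/${book}/${sec}/caption.wik" \
  ; do
    if [ ! -r "${f}" ]; then
      echo "** missing caption file: ${f}" 1>&2
      return 1
    fi
    cat "${f}"
  done
}

# Example call (redirect the output to the desired "fig/{oname}.wik"):
#   combine_captions dat voyn prs bio.1
```

  The function stops at the first missing caption file, so an incomplete
  description is reported rather than silently truncated.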

COMPARING THE ZIPF LAW PLOTS WITH THOSE OF OTHER LANGUAGES

  Preparing the plots of word frequency vs. frequency rank:

    generate-all-zipf-plots.sh -format svg -show 1
    generate-all-zipf-plots.sh -format eps -show 1
    generate-all-zipf-plots.sh -format png -show 1
    
    make-zipf-index > zipf-index.html
    
  Exporting the plots:
  
    cp -avu fig/zipf-* exp/
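
  For reference, the data behind a rank-frequency plot can be computed
  with standard tools, as sketched below. This is not the contents of
  "generate-all-zipf-plots.sh". The function name "zipf_table" and the
  one-word-per-line input format are assumptions made for this example.

```shell
# Hypothetical helper: reads one word per line from stdin and writes
# lines "RANK COUNT RELFREQ WORD", most frequent word first.
# Ties in COUNT are broken by the byte order of the words.
zipf_table() {
  LC_ALL=C sort | uniq -c | LC_ALL=C sort -k1,1nr -k2,2 | \
  awk '
    { cnt[NR] = $1; wd[NR] = $2; tot += $1 }
    END {
      for (i = 1; i <= NR; i++)
        printf "%d %d %.3f %s\n", i, cnt[i], cnt[i]/tot, wd[i]
    }'
}

# Tiny made-up input, not taken from any of the sample texts:
printf '%s\n' daiin ol daiin chedy ol daiin | zipf_table
# prints:
#   1 3 0.500 daiin
#   2 2 0.333 ol
#   3 1 0.167 chedy
```

  Plotting RELFREQ against RANK on log-log axes gives the usual Zipf
  plot; a text that follows Zipf's law closely yields a roughly
  straight line.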
  
EXTRACTING THE TOP N WORDS

  Extract the N most frequent words of each language:
  
    ntop=6
    
    for f in \
      chin/ptt/tot.1 \
      chin/red/tot.1 \
      engl/cul/tot.1 \
      engl/wow/tot.1 \
      arab/quv/tot.1 \
      geez/gok/tot.1 \
      grek/nwt/tot.1 \
      latn/ptt/tot.1 \
      span/qvi/tot.1 \
      tibe/ccv/tot.1 \
      tibe/vim/tot.1 \
      viet/ptt/tot.1 \
      voyn/prs/bio.1 \
      voyn/prs/hea.1 \
      voyn/prs/heb.1 \
      voyp/grs/tot.1 \
      voyp/grm/tot.1 \
    ; do
      ifile="dat/$f/gud.wfr"
      ofile="dat/$f/.top"
      # Sort by decreasing value of the first field, keep the top ${ntop} entries:
      sort -b -k1,1nr "${ifile}" | head -n "${ntop}" > "${ofile}"
      printf '%s ' "${f}"
      condense-popular-words < "${ofile}"
    done
    
    chin/ptt/tot.1  de5(0.054) ni3(0.019) ta1(0.019) he2(0.018) men5(0.018) ren2(0.015)
    chin/red/tot.1  le5(0.019) yi1(0.019) bu4(0.017) ren2(0.014) de5(0.013) lai2(0.011)
    engl/cul/tot.1  the(0.079) and(0.048) of(0.037) in(0.026) to(0.020) it(0.019)
    engl/wow/tot.1  the(0.083) and(0.042) of(0.038) a(0.028) to(0.019) in(0.017)
    arab/quv/tot.1  mîn(0.030) fîy(0.014) mâa(0.013) alllâhî(0.012) allâ£îynâ(0.011) alllâhû(0.011)
    geez/gok/tot.1  keme(0.017) 'Igzi'AbHEr(0.014) 'Isme(0.012) wste(0.010) 'Inze(0.005) `hebe(0.005)
    grek/nwt/tot.1  kai(0.073) o(0.028) de(0.020) tou(0.018) en(0.017) autou(0.015)
    latn/ptt/tot.1  et(0.075) in(0.031) ad(0.016) est(0.015) dominus(0.011) de(0.011)
    span/qvi/tot.1  que(0.054) de(0.049) y(0.047) a(0.025) la(0.025) el(0.024)
    tibe/ccv/tot.1  PA(0.071) LA(0.032) YIN(0.028) BA(0.028) MA(0.026) PA'I(0.026)
    tibe/vim/tot.1  PA(0.054) DANG(0.028) DE(0.027) PAR(0.023) LA(0.021) BA(0.021)
    viet/ptt/tot.1  va`(0.020) ngu+o+`i(0.019) cu?a(0.019) ca'c(0.015) ngu+o+i(0.015) ddu+'c(0.014)
    voyn/prs/bio.1  shedy(0.038) ol(0.037) chedy(0.033) qokedy(0.025) qokeedy(0.023) qokain(0.023)
    voyn/prs/hea.1  daiin(0.056) chol(0.030) chor(0.020) dy(0.015) chy(0.014) cthy(0.014)
    voyn/prs/heb.1  daiin(0.024) chedy(0.021) aiin(0.021) or(0.020) dar(0.016) ar(0.016)
    voyp/grs/tot.1  chedy(0.024) kdy(0.023) shedy(0.022) chey(0.021) ky(0.018) kedy(0.015)
    voyp/grm/tot.1  ol(0.034) ky(0.030) chey(0.023) daiin(0.023) qol(0.018) shedy(0.018)

  Actually the "liao3" in Chinese samples should be "le5".
  
  To print the top words in full, use these commands inside the loop
  (commented out here):

      # echo "  --- ${f} ----------------"
      # cat ${ofile} | format-popular-words
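
  The sort/head/condense step of the loop above can be tried without the
  project's scripts using the sketch below. It ASSUMES that each line of
  a ".wfr" file has the form "COUNT FREQ WORD"; that layout is inferred
  from the sort key and from the listing above, not confirmed. The
  function "top_words" is a hypothetical stand-in and is not the
  "condense-popular-words" script.

```shell
# Hypothetical stand-in for the sort/head/condense step.
# ASSUMPTION: each input line is "COUNT FREQ WORD".
# Prints the ${ntop} most frequent words as "WORD(FREQ)" on one line.
top_words() {
  local ntop="$1" ifile="$2"
  sort -b -k1,1nr "${ifile}" | head -n "${ntop}" | \
  awk '
    { printf "%s%s(%.3f)", (NR > 1 ? " " : ""), $3, $2 }
    END { if (NR > 0) printf "\n" }'
}

# Tiny made-up sample, not real data from any of the texts:
tmp="$(mktemp)"
printf '%s\n' '2 0.200 chol' '5 0.500 daiin' '3 0.300 chor' > "${tmp}"
top_words 2 "${tmp}"
# prints: daiin(0.500) chor(0.300)
rm -f "${tmp}"
```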