Hacking at the Voynich manuscript - Side notes
203 Statistics of classical word paradigms

Last edited on 2002-01-04 00:52:17 by stolfi

INTRODUCTION

  In this note we compute the correctness and coverage of the
  classical word paradigms: Tiltman, Roe, and Firth.

SETTING UP THE ENVIRONMENT

  Commands:

    ln -s ../../combine-counts
    ln -s ../../compute-freqs

  Data:

    ln -s ../101/lang

  Paper directories:

    set tbldir = "/home/staff/stolfi/papers/voynich-words/techrep/tables/auto"
    set figdir = "/home/staff/stolfi/papers/voynich-words/techrep/figures/auto"

  Result directories:

    mkdir para

COLLECTING VMS WORD DATA

  First, collect the words and word frequencies for text and labels
  together:

    cat lang/voyn/{text,labs}/gud.wfr \
      | gawk '/./{print $1, $3;}' \
      | combine-counts \
      | compute-freqs \
      | sort -b -k1,1nr -k3,3 \
      > para/vms.wfr

    cat para/vms.wfr \
      | gawk '/./{print $3;}' \
      | sort | uniq \
      > para/vms.wds

ENUMERATING CLASSICAL PARADIGMS

  Now enumerate the words generated by the classical paradigms
  (John Tiltman, Mike Roe, and Robert Firth):

    set guys = ( tiltman roe firth )

    foreach guy ( ${guys} )
      set ofile = "para/enum-${guy}.wds"
      echo "--> ${ofile}"
      generate-words-${guy} \
        | sort | uniq \
        > ${ofile}
      dicio-wc ${ofile}
    end

COMPUTING CONTAINMENT AND COVERAGE RATIOS

  Compute the subset of words from each paradigm that belong to the
  VMS lexicon:

    foreach guy ( ${guys} )
      set ifile = "para/enum-${guy}.wds"
      set ofile = "para/ok-${guy}.wfr"
      echo "${ifile} --> ${ofile}"
      cat para/vms.wfr \
        | grep -F -w -f ${ifile} \
        | gawk '/./{ print $1, $3; }' \
        | combine-counts \
        | compute-freqs \
        | sort -b -k1,1nr -k3,3 \
        > ${ofile}
    end

  Compute the total number of words and tokens in the VMS:

    set nwrds = `cat para/vms.wfr | gawk '/./{w++;} END{print w}'`
    set ntoks = `cat para/vms.wfr | gawk '/./{t+=$1;} END{print t}'`
    echo "vms: ${nwrds} words and ${ntoks} tokens"

    foreach guy ( ${guys} )
      set ifile = "para/ok-${guy}.wfr"
      echo " "; echo "${guy}"
      cat ${ifile} \
        | gawk -v nwrds="${nwrds}" -v ntoks="${ntoks}" \
            ' \
              /./{w++; t+=$1;} \
              END{ \
                printf "%7d (%6.2f%%) words\n", w, (100.0*w)/nwrds; \
                printf "%7d (%6.2f%%) tokens\n", t, (100.0*t)/ntoks; \
              } \
            '
    end

  Counting tokens:

    foreach guy ( ${guys} )
      echo "=== paradigm of ${guy} ==="
    end

  with those observed in the VMS:
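
As a cross-check on the COMPUTING CONTAINMENT AND COVERAGE RATIOS step, the word and token percentages can also be obtained in a single pass over the frequency file. The following is only a sketch in POSIX sh and awk, not the commands used for this note. The function name "coverage" is hypothetical, and the sketch assumes that the frequency file has the three-column layout (count, frequency, word) implied by the gawk commands above, and that the word list has one word per line.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original note.
# Usage: coverage WORDLIST WFRFILE
#   WORDLIST: one word per line (e.g. the output of a paradigm enumerator)
#   WFRFILE:  lines of the assumed form "count freq word"
coverage() {
  awk '
    # First file: remember every word generated by the paradigm.
    FILENAME == ARGV[1] { if ($1 != "") ok[$1] = 1; next }
    # Second file: tally all entries, and those whose word is listed.
    NF >= 3 {
      nw++; nt += $1
      if ($3 in ok) { w++; t += $1 }
    }
    END {
      # Avoid dividing by zero on an empty frequency file.
      if (nw == 0 || nt == 0) exit 1
      printf "%7d (%6.2f%%) words\n", w, (100.0 * w) / nw
      printf "%7d (%6.2f%%) tokens\n", t, (100.0 * t) / nt
    }
  ' "$1" "$2"
}
```

With the files built above, a call would look like "coverage para/enum-tiltman.wds para/vms.wfr". The sketch compares the whole third field with each listed word, whereas grep -F -w accepts a listed word wherever it is delimited by characters other than letters, digits, and underscore. The two approaches therefore agree only if no entry in the frequency file contains such characters inside a word, which is an assumption that should be checked against the data.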