Hacking at the Voynich manuscript - Side notes
063 Correlations between first and last letters across spaces

(See Notes/663 for the former, now obsolete, Notes/063.)

  This note investigates the frequencies of first and last 
  letters of consecutive words in the Starred Parags section (SPS)
  of the VMS and in the modern Mandarin pinyin version of the
  Shennong Bencao Jing (SBJ).
  
SETUP

  ln -s ../.. work
  ln -s work/convert_pinyin_to_numeric.py
  
  ln -s ../077
  ln -s 077/convert_starps_raw_to_lin_ivt.py
  ln -s 077/convert_starps_lin_to_par_ivt.py
  
  ln -s ../074

THE SPS FILE

  The SPS file "in/2026-06-29-starps-wc.ivp" is derived from my own
  transcription "../074/star25e1.ivt" as of 2026-03-15 15:59:31. It is
  all lowercase with no ligature brackets {}, no inline comments <!...>,
  no locus IDs <f...>, no alignment markers [-«»], no parag markers <%>
  <$>, and with all weirdos and strange characters converted to '?'. It
  has one parag per line, with all word separators [-,.] converted to
  single blanks, and one blank at start and end of each line. The
  encoding is nominally Unicode in UTF-8 but it actually uses only ASCII
  characters.
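
  The line format described above can be checked mechanically. The
  following is only a minimal sketch, not part of the note's scripts:
  the function name is hypothetical, and it assumes that words consist
  solely of lowercase ASCII letters and '?', which the description
  suggests but does not state outright.

```python
import re

# Assumed shape of one line of the ".ivp" file: a leading blank, then
# words made of lowercase ASCII letters or '?', each followed by exactly
# one blank (so the line also ends with a blank).
IVP_LINE = re.compile(r" (?:[a-z?]+ )+")

def is_clean_ivp_line(line: str) -> bool:
    """Return True if `line` (without its newline) has the assumed shape."""
    return IVP_LINE.fullmatch(line) is not None
```

  For instance, " daiin chedy " passes, while a line with a double
  blank, an uppercase letter, or a missing leading blank does not.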

  There is another version "in/2026-06-29-starps-wp.ivp" that is obtained
  in the same way but with commas deleted, so that only [-.] become
  word breaks.  However, most statistics are done on the "-wc" version.
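
  The difference between the two variants can be illustrated as
  follows. This is a sketch under assumptions, not the code of
  create_starps_raw_file.sh: the function name and the sample words are
  made up, and the input is assumed to be already stripped of all
  markup other than the separators [-,.].

```python
import re

def split_words(raw: str, utype: str) -> str:
    """Turn separators into blanks, in the "wc" or the "wp" manner.

    "wc": every one of [-,.] is a word break.
    "wp": commas are deleted first, so only [-.] are word breaks.
    """
    if utype == "wp":
        raw = raw.replace(",", "")
    elif utype != "wc":
        raise ValueError(f"unknown utype: {utype!r}")
    # Split on any remaining separator and drop empty pieces.
    words = [w for w in re.split(r"[-,.]", raw) if w]
    # One blank before the first word and one after the last.
    return " " + " ".join(words) + " "
```

  With the input "daiin,chedy.qokeedy-ol" the "wc" variant gives
  " daiin chedy qokeedy ol " and the "wp" variant gives
  " daiinchedy qokeedy ol ", that is, fewer but longer words.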

    create_starps_raw_file.sh 2026-06-29

      utype "wc"

         330 parags
       11205 tokens
        2850 lexemes

      entropy (min ct = 1):    2850   11205   9.5603 1
      entropy (min ct = 2):     941    9296   8.4917 1
      entropy (min ct = 3):     608    8630   8.0996 1

      utype "wp"

         330 parags
        9892 tokens
        3323 lexemes

      entropy (min ct = 1):    3323    9892  10.0786 1
      entropy (min ct = 2):     923    7492   8.6547 1
      entropy (min ct = 3):     567    6780   8.1729 1
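
  The three numbers on each "entropy" line above appear to be the
  lexeme count, the token count, and an entropy in bits, restricted to
  words that occur at least "min ct" times. The sketch below computes
  such a triple, assuming (the note does not say so) that the entropy
  is the plain Shannon entropy of the surviving word frequencies. The
  function name is hypothetical, and the trailing "1" in the output is
  not reproduced.

```python
import math
from collections import Counter

def word_stats(words, min_ct=1):
    """Return (lexemes, tokens, entropy_bits) over words seen >= min_ct times."""
    counts = [c for c in Counter(words).values() if c >= min_ct]
    tokens = sum(counts)
    if tokens == 0:
        return 0, 0, 0.0
    # Shannon entropy in bits of the renormalized word distribution.
    entropy = -sum((c / tokens) * math.log2(c / tokens) for c in counts)
    return len(counts), tokens, entropy
```

  Raising the threshold from 1 to 2 simply drops the words that occur
  only once, so the lexeme and token counts fall by the same amount,
  as they do in the tables above.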

THE SBJ FILE

  The SBJ file "in/2026-06-27-bencao-py.utf" is derived from two files
  obtained from the internet (one from the Chinese Text Project, one
  from the Chinese Wikisource), with many corrections, converted to
  modern Mandarin pinyin by Google Translate. (It is not the right
  version, namely the one that would be seen in the white-on-black text
  of the Zhenghe Bencao in 1400 CE. In particular, it usually has
  [zhǔ zhì] instead of just [zhǔ].) The vowels in this file are
  "[aeiouü]". Tones are indicated by diacritics (acute, grave, macron,
  and caron) on vowels, including on "ü". The encoding is Unicode in
  UTF-8.
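
  Since the tones are carried by combining diacritics, they can be
  separated from the letters by Unicode normalization. The following is
  a minimal sketch, not the code of convert_pinyin_to_numeric.py (whose
  output format is not described here). The function name is
  hypothetical, and the tone numbering is the usual Mandarin one
  (macron 1, acute 2, caron 3, grave 4).

```python
import unicodedata

# Combining marks that encode the four tones.
TONE_MARKS = {
    "\u0304": 1,  # macron
    "\u0301": 2,  # acute
    "\u030c": 3,  # caron
    "\u0300": 4,  # grave
}

def split_tone(syllable: str):
    """Return (toneless syllable, tone number), with 0 for no tone mark."""
    tone = 0
    kept = []
    # NFD separates each vowel from its combining marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # NFC recomposes "u" + diaeresis into "ü", which is a vowel of its
    # own here and not a tone mark.
    return unicodedata.normalize("NFC", "".join(kept)), tone
```

  For example, [zhǔ] yields ("zhu", 3) and [zhì] yields ("zhi", 4),
  while a syllable with a toned "ü" keeps the "ü" and loses only the
  tone mark.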

    ofile="in/2026-06-27-bencao-py.utf"
    wfile="in/.wfile-sbj"
    # Drop blank and '#' comment lines, put one word per line, count each word.
    grep -E -v -e '^[ ]*([#]|$)' "${ofile}" \
      | tr ' ' '\012' | grep -E -e '.' \
      | sort | uniq -c \
      > "${wfile}"

    # Feed the words that occur at least ${n} times to the entropy script.
    for n in 1 2 3 ; do
      printf "entropy (min ct = %s): " "${n}" 1>&2
      gawk -v n="${n}" '//{ if ($1 >= n) { printf "%6d 1 %s\n", $1, $2 }}' "${wfile}" \
        | work/compute_cond_entropy.gawk \
        1>&2
    done

    entropy (min ct = 1):   13266   7.5340 1
    entropy (min ct = 2):   13134   7.4577 1
    entropy (min ct = 3):   12982   7.3797 1


SCRIPTS

  The following scripts count the initial and final letters:
  
    compute_abs_counts.sh
    compute_cond_counts.sh
  
  To run them all, run "do_note_063.sh".
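
  As a rough illustration of the kind of counting involved, the sketch
  below tallies, for each pair of consecutive words on one line, the
  last letter of the first word and the first letter of the second. It
  is not the code of the scripts above. The function names are
  hypothetical, pairs are not formed across line (parag) boundaries,
  and conditioning on the last letter (rather than on the first) is an
  arbitrary choice made here.

```python
from collections import Counter

def boundary_pairs(line: str) -> Counter:
    """Count (last letter of word, first letter of next word) pairs in a line."""
    words = line.split()
    return Counter((a[-1], b[0]) for a, b in zip(words, words[1:]))

def cond_freqs(pairs: Counter) -> dict:
    """Turn pair counts into P(first letter of next word | last letter)."""
    totals = Counter()
    for (last, _first), n in pairs.items():
        totals[last] += n
    return {(last, first): n / totals[last]
            for (last, first), n in pairs.items()}
```

  Counts for a whole file can be obtained by summing the Counter
  objects of its lines before calling cond_freqs.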