Hacking at the Voynich manuscript - Side notes
111 Analyzing the statistics of Vietnamese word elems

Last edited on 2012-05-06 00:13:30 by stolfilocal

INTRODUCTION

  This note computes the length distribution (over tokens and words)
  for each element of Vietnamese words.

SETTING UP THE ENVIRONMENT

  Links:

    ln -s ../tr-stats/dat
    ln -s ../tr-stats/fig
    ln -s ../tr-stats/exp
    ln -s ../../../work

COMPUTING THE LENGTH DISTRIBUTION OF VIETNAMESE SLOTS

  Let's compute the length distribution of slots in Vietnamese words,
  in the VIQR encoding:

    make LANG=viet BOOK=ptt ELEM=viqr -f slot-length-stats.make all

    # viet/ptt/tot.1/gud slot = inic counts = t
    # len  count   freq strings
    # --- ------ ------ ------------------
        0   1955 0.0558 -
        1  21739 0.6206 q,k,x,d,g,h,b,r,s,n,m,t,v,l,c
        2  11233 0.3207 gh,kh,nh,ph,th,tr,ng,ch,dd
        3    100 0.0029 ngh
        4      0 0.0000

    # viet/ptt/tot.1/gud slot = vows counts = t
    # len  count   freq strings
    # --- ------ ------ ------------------
        0      0 0.0000 _
        1  14460 0.4128 a|e|i|o|u|y
        2  13386 0.3822 a(|a^|ai|ao|au|ay|e^|eo|ia|ie|io|iu|o+|o^|oa|oe|oi|u+|ua|ui|uy
        3   4881 0.1393 a^u|a^y|e^u|ia(|ia^|iai|iao|iau|iay|ie^|ieo|io+|io^|ioi|iu+|o+i|o^i|oa(|oai|u+a|u+i|u+u|ua(|ua^|uay|ue^|uo+|uo^|ye^
        4   1127 0.0322 a(e^|e^a^|ia^u|ie^u|iea^|io+i|iu+a|u+o+|u+oi|ua^i|ua^y|uo^i|uye^|ye^u
        5   1173 0.0335 u+o+i|u+o+u
        6      0 0.0000

    # viet/ptt/tot.1/gud slot = tone counts = t
    # len  count   freq strings
    # --- ------ ------ ------------------
        0  12727 0.3633 _
        1  22300 0.6367 '|.|?|`|~
        2      0 0.0000

    # viet/ptt/tot.1/gud slot = finc counts = t
    # len  count   freq strings
    # --- ------ ------ ------------------
        0  20723 0.5916 _
        1   9728 0.2777 c|m|n|p|t
        2   4576 0.1306 ch|mg|ng|nh
        3      0 0.0000

  In the lexicon:

    # viet/ptt/tot.1/gud slot = inic counts = w
    # len  count   freq strings
    # --- ------ ------ ------------------
        0     64 0.0378 _
        1   1089 0.6436 b|c|d|g|h|k|l|m|n|q|r|s|t|v|x
        2    528 0.3121 ch|dd|gh|kh|ng|nh|ph|th|tr
        3     11 0.0065 ngh
        4      0 0.0000

    # viet/ptt/tot.1/gud slot = vows counts = w
    # len  count   freq strings
    # --- ------ ------ ------------------
        0      0 0.0000 _
        1    610 0.3605 a|e|i|o|u|y
        2    694 0.4102 a(|a^|ai|ao|au|ay|e^|eo|ia|ie|io|iu|o+|o^|oa|oe|oi|u+|ua|ui|uy
        3    286 0.1690 a^u|a^y|e^u|ia(|ia^|iai|iao|iau|iay|ie^|ieo|io+|io^|ioi|iu+|o+i|o^i|oa(|oai|u+a|u+i|u+u|ua(|ua^|uay|ue^|uo+|uo^|ye^
        4     89 0.0526 a(e^|e^a^|ia^u|ie^u|iea^|io+i|iu+a|u+o+|u+oi|ua^i|ua^y|uo^i|uye^|ye^u
        5     13 0.0077 u+o+i|u+o+u
        6      0 0.0000

    # viet/ptt/tot.1/gud slot = tone counts = w
    # len  count   freq strings
    # --- ------ ------ ------------------
        0    475 0.2807 _
        1   1217 0.7193 '|.|?|`|~
        2      0 0.0000

    # viet/ptt/tot.1/gud slot = finc counts = w
    # len  count   freq strings
    # --- ------ ------ ------------------
        0    765 0.4521 _
        1    637 0.3765 c|m|n|p|t
        2    290 0.1714 ch|mg|ng|nh
        3      0 0.0000
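
  For readers who want to reproduce tables of this shape without the
  makefile, the tabulation can be sketched in Python as follows. This
  is only an illustration, not the contents of slot-length-stats.make:
  it assumes that each word has already been split into the four slots
  (inic, vows, tone, finc), and the names SLOTS, TOY_TOKENS,
  length_distribution and format_rows are hypothetical. The toy tokens
  are made up for the example and are not taken from the ptt text.
  Slot length is measured in VIQR characters, so "o^" has length 2.

```python
from collections import Counter, defaultdict

# Hypothetical slot order, named after the slot labels in the tables above.
SLOTS = ("inic", "vows", "tone", "finc")

def length_distribution(tokens, slot, lexicon=False):
    """Return rows (length, count, freq, strings) for one slot.

    tokens:  list of 4-tuples (inic, vows, tone, finc), one per token,
             with "" for an empty slot (assumed input format).
    lexicon: if True, count each distinct word once ("counts = w");
             otherwise count every token ("counts = t").
    """
    idx = SLOTS.index(slot)
    items = set(tokens) if lexicon else tokens
    counts = Counter()
    strings = defaultdict(set)
    for word in items:
        value = word[idx]
        counts[len(value)] += 1
        strings[len(value)].add(value if value else "_")
    if not counts:
        return []
    total = sum(counts.values())
    rows = []
    # One extra row past the maximum length, as in the tables above.
    for n in range(max(counts) + 2):
        rows.append((n, counts[n], counts[n] / total,
                     "|".join(sorted(strings[n]))))
    return rows

def format_rows(rows):
    """Format rows in a layout similar to the tables above."""
    return ["%5d %6d %6.4f %s" % row for row in rows]

# Made-up tokens in VIQR, already split into slots (illustrative only).
TOY_TOKENS = [
    ("t", "o^i", "", ""),     # to^i
    ("ngh", "i", "~", ""),    # nghi~
    ("kh", "o^", "", "ng"),   # kho^ng
    ("t", "o^i", "", ""),     # to^i again (same word, second token)
    ("", "a", "", "nh"),      # anh
]

for slot in SLOTS:
    print("# slot = %s counts = t" % slot)
    for line in format_rows(length_distribution(TOY_TOKENS, slot)):
        print(line)
```

  Calling length_distribution(TOY_TOKENS, slot, lexicon=True) gives
  the corresponding per-word table. In both cases the freq column is
  the count divided by the column total, which matches the tables
  above: for example 21739/35027 = 0.6206 for initials of length 1
  over tokens.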