Last edited on 1998-07-16 00:40:19 by stolfi
East Asian languages
- Vietnamese
Text fetched from WWW
[ full ]
[ page ]
Vietnamese has a large alphabet,
Latin-based but with many diacritics. This text uses some
standard computer encoding, which seems to be almost a
letter-for-letter transcription of the official Vietnamese
script (except that some characters will look very wrong on
your ISO Latin-1 browser). The entropy is lower than that of
English, probably because of the monosyllabic word
structure. Note, in particular, that the word breaks are fairly
predictable. Another source of low entropy is the use of
digraphs (nh, ng, ph, th, kh, eå, uù, etc.) to represent some
relatively frequent phonemes.
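The effect of context on predictability can be illustrated with a small sketch. The function below is a plain plug-in (frequentist) estimate of the entropy of a character given the `left` characters before it. Reading the l parameter of the tables as "number of left-context characters" is an assumption, and this is not necessarily the formula used to produce them. The name `conditional_entropy` and the diacritic-free sample string are made up for illustration.

```python
from collections import Counter
from math import log2


def conditional_entropy(text: str, left: int) -> float:
    """Plug-in estimate, in bits per character, of the entropy of a
    character given the `left` characters that precede it."""
    context_counts = Counter()
    joint_counts = Counter()
    for i in range(left, len(text)):
        context = text[i - left:i]
        context_counts[context] += 1
        joint_counts[(context, text[i])] += 1
    total = sum(context_counts.values())
    if total == 0:
        return 0.0
    h = 0.0
    for (context, _), n in joint_counts.items():
        # weight: joint probability; term: log of conditional probability
        h -= (n / total) * log2(n / context_counts[context])
    return h


# Hypothetical ASCII sample rich in the digraphs nh, ng, ph, th, kh.
sample = "nha nghe pho thanh khong nhanh nghi phai thanh kho"
for left in (0, 1, 2):
    print(left, round(conditional_entropy(sample, left), 3))
```

On the string "abababab" the estimate is 1 bit with no context and 0 bits with one character of context, since each letter then fully determines the next. Digraphs and predictable word breaks lower the entropy in the same way, only less drastically.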
l = 1 r = 0
[ colorized page ]
[ bits per tuple ]
l = 2 r = 0
[ colorized page ]
[ bits per tuple ]
l = 3 r = 0
[ colorized page ]
[ bits per tuple ]
l = 1 r = 1
[ colorized page ]
[ bits per tuple ]
- Phonetic Mandarin (numeric tones)
Lao-Tsu's Tao Te King
[ full ]
[ page ]
The low entropy of this text is
obvious. The Chinese scholars who designed the modern phonetic
spelling (pinyin) had the double goal of accurately
recording the Chinese phonemes while preserving the occidental
letter values as much as possible. Therefore many common
consonant sounds came to be represented by digraphs,
e.g. zh, ch, sh, t'.
Moreover, each Chinese non-compound
word is a single syllable, which must begin with a single
consonant sound or semivowel, continue with a very limited
choice of vowels and diphthongs, and end with r,
n, ng, or nothing. This rigid
(hence predictable) word structure contributes to lowering the
entropy. Finally, this particular variant of pinyin uses
postfix digits to denote the four tones, which means each digit
carries less than 2 bits of information.
Note that most of the information is
carried by the initial consonant of each syllable; on the other
hand, the spaces in this encoding carry almost no information.
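The 2-bit bound on the tone digits follows from there being only four possible values: a symbol drawn from four alternatives carries at most log2(4) = 2 bits, and less when the alternatives are not equally frequent. A minimal sketch, assuming a short numeric-tone sample typed in for illustration (it is not taken from the text file used on this page, whose spelling variant may differ); `entropy_bits` is a hypothetical helper name.

```python
from collections import Counter
from math import log2


def entropy_bits(symbols) -> float:
    """Shannon entropy, in bits per symbol, of the observed frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())


# Illustrative numeric-tone sample (an assumption, not the page's data).
sample = "dao4 ke3 dao4 fei1 chang2 dao4 ming2 ke3 ming2 fei1 chang2 ming2"
tones = [c for c in sample if c in "1234"]

print("tone digit entropy:", round(entropy_bits(tones), 3))
print("upper bound:", log2(4))
```

Only a perfectly uniform use of the four tones would reach the 2-bit bound; any skew in the tone frequencies pushes the figure below it.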
l = 1 r = 0
[ colorized page ]
[ bits per tuple ]
l = 2 r = 0
[ colorized page ]
[ bits per tuple ]
l = 3 r = 0
[ colorized page ]
[ bits per tuple ]
l = 1 r = 1
[ colorized page ]
[ bits per tuple ]
- Phonetic Mandarin (diacritic tones)
Lao-Tsu's Tao Te King
[ full ]
[ page ]
This is the same text as in the
previous example, but with a more compact encoding.
The numeric tones have been replaced by diacritics over the
main vowel, as in official pinyin, except that
the umlaut `¨' is used for the macron
`¯', and the circumflex for the háček, due
to font limitations. Also, the common digraphs
ng, zh, ch, sh have been arbitrarily replaced by the single
characters ñ, ð, þ, ç.
These changes raise the mean
h3 entropy of this Chinese text to levels
comparable to the Western texts above (2.7 bits/letter by the
Bayesian formula, 2.0 by the frequentist one). The main
source of inefficiency still left is the word break, which is
almost entirely predictable due to the rigid word structure.
Indeed, in the official pinyin script, compound words (which
are the majority of the entries in a dictionary) are written
without intervening spaces, as in zhöngguó (= China,
lit. middle-country) and huôxïng (= Mars, lit. fire-planet).
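The digraph substitution described above can be sketched as follows. The mapping is the one given in the text (ng, zh, ch, sh to ñ, ð, þ, ç); the helper names `compact` and `entropy_bits`, the sample string, and the use of a simple order-0 frequency estimate are assumptions made for illustration, not the h3 computation used for the figures quoted here.

```python
from collections import Counter
from math import log2

# Digraph-to-single-character mapping taken from the description above.
DIGRAPHS = {"ng": "ñ", "zh": "ð", "ch": "þ", "sh": "ç"}


def compact(text: str) -> str:
    """Replace each listed digraph by its single-character stand-in."""
    for digraph, single in DIGRAPHS.items():
        text = text.replace(digraph, single)
    return text


def entropy_bits(symbols) -> float:
    """Order-0 Shannon entropy, in bits per symbol, of the observed frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())


# Hypothetical toneless sample, chosen to contain many digraphs.
sample = "zhong guo chang shang zheng cheng sheng"
short = compact(sample)

for label, text in (("digraphs", sample), ("compact", short)):
    print(label, len(text), "chars,",
          round(entropy_bits(text), 3), "bits/char")
```

The compact form conveys the same syllables in fewer characters, so each character has to carry a larger share of the information; that is the sense in which the more compact encoding raises the per-letter entropy.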
l = 1 r = 0
[ colorized page ]
[ bits per tuple ]
l = 2 r = 0
[ colorized page ]
[ bits per tuple ]
l = 3 r = 0
[ colorized page ]
[ bits per tuple ]
l = 1 r = 1
[ colorized page ]
[ bits per tuple ]