Last edited on 1998-07-04 11:35:12 by stolfi

Scatter-plots of VMs pages

Plots based on wordclass frequencies

Projection on "Herbal" axes

         

Projection on "Pharma" axes

         

Click on any image to see a full-size version

For these plots, each word in the sample was mapped into a "wordclass" by merging similar-looking characters and removing other possible sources of noise, as follows:

      "sh" --> "ch"
      { "p" "f" "k" } --> "t"
      { "ei" "a" "y" } --> "o"
      { "ii" "iii" "iiii" } --> "i"
      { "j" "g" "m" } --> "d"
      delete the "q" prefix
      append "~"

So, for example, "qopshedy ytchedy daiin otam" would become "otchedo~ otchedo~ doin~ otod~".
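
The mapping rules above can be sketched in Python roughly as follows. This is only an illustration, not the script actually used for the plots: the function name `wordclass` is hypothetical, and the order of the substitutions (rules applied in the order listed, with runs of two to four "i" collapsed in one step) is an assumption.

```python
import re

def wordclass(word):
    """Map one transcribed word to its wordclass (illustrative sketch)."""
    w = word.replace("sh", "ch")       # "sh" --> "ch"
    w = re.sub(r"[pfk]", "t", w)       # { "p" "f" "k" } --> "t"
    w = re.sub(r"ei|a|y", "o", w)      # { "ei" "a" "y" } --> "o"
    w = re.sub(r"i{2,4}", "i", w)      # { "ii" "iii" "iiii" } --> "i"
    w = re.sub(r"[jgm]", "d", w)       # { "j" "g" "m" } --> "d"
    if w.startswith("q"):              # delete the "q" prefix
        w = w[1:]
    return w + "~"                     # append "~"

sample = "qopshedy ytchedy daiin otam"
print(" ".join(wordclass(w) for w in sample.split()))
# prints: otchedo~ otchedo~ doin~ otod~
```

Applying the rules in a different order could give different results for some words (for instance, words containing "eii"), so the order is worth fixing explicitly in any reimplementation.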

The hope is that mapping words into wordclasses reduces the effect of transcription errors and/or calligraphic variation. As a side effect, the mapping also reduces the sampling error: several raw words collapse into one wordclass, so the counts for wordclasses on each page are higher than those of raw words.

The coordinates of each page are derived from the relative frequencies of the following 50 wordclasses in that page:

  chctho~ chdo~ chectho~ chedo~ cheedo~ cheeo~ cheodo~ cheol~
  cheor~ cheo~ cheto~ chodo~ chol~ chor~ choto~ cho~ ctho~
  doin~ doir~ dol~ dor~ do~ lchedo~ odoin~ oin~ ol~ oroin~
  or~ otchdo~ otchedo~ otcheo~ otchol~ otcho~ otedo~ oteedo~
  oteeo~ oteodo~ oteol~ oteo~ otod~ otoin~ otol~ otor~ oto~
  o~ soin~ sor~ s~ toin~ tor~

These are roughly the 50 most popular wordclasses from Rene's word frequency list. (Another set of plots was prepared using instead the 50 wordclasses whose frequencies showed the most variation from section to section, but the differences were not very great.)

The initial coordinates of each page are thus the relative frequencies of these wordclasses in that page, that is, a point in 50-dimensional wordclass frequency space. To produce the plots, I picked three mutually orthogonal unit-length vectors X, Y, and Z in that space, and projected the page points onto them.
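
As a rough illustration of how the page coordinates could be computed, here is a minimal Python sketch. The names `page_vector` and `SELECTED` are hypothetical, the three-entry list stands in for the full list of 50 wordclasses, and the toy page is made up. The sketch also assumes that "relative frequency" means the count of a wordclass divided by the total number of words on the page; the text does not say which denominator was used.

```python
from collections import Counter

# Shortened stand-in for the 50 selected wordclasses (assumption for brevity).
SELECTED = ["doin~", "chol~", "otchedo~"]

def page_vector(page_classes, selected=SELECTED):
    """Relative frequency of each selected wordclass among the words of one page."""
    counts = Counter(page_classes)
    total = len(page_classes)
    if total == 0:
        # An empty page has no defined frequencies; use the zero vector.
        return [0.0] * len(selected)
    return [counts[c] / total for c in selected]

# Toy page of four words, already mapped to wordclasses (made-up data).
toy_page = ["doin~", "chol~", "doin~", "otol~"]
print(page_vector(toy_page))
```

Words whose wordclass is not among the selected ones (here "otol~") still count toward the page total under this assumption, so the coordinates of a page need not sum to 1.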

For the "Herbal" projection, the X, Y, and Z axes are the result of Gram-Schmidt orthogonalization applied to the vectors HEA-TOT, HEB-TOT, and BIO-TOT, where TOT is the vector of wordclass frequencies for the whole sample, and HEA, HEB, and BIO are the frequencies for the Herbal-A, Herbal-B, and biological sections, respectively. Here are the X, Y, and Z coordinates of each page in this projection.
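
The construction of the axes can be sketched in Python as below. This is a generic Gram-Schmidt routine, not the code used for the plots. The helper names (`dot`, `gram_schmidt`, `project`) are hypothetical, and the four-dimensional vectors are made-up toy values standing in for the real 50-dimensional frequency vectors. Whether the page points were measured from an origin such as TOT before projecting is not stated in the text, so `origin` is left as an explicit parameter.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors):
    """Orthonormalize the vectors in the given order (modified Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)                      # component of w along b
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            raise ValueError("input vectors are linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis

def project(point, origin, axes):
    """Coordinates of (point - origin) along each axis."""
    d = [p - o for p, o in zip(point, origin)]
    return [dot(d, a) for a in axes]

# Toy frequency vectors (made-up numbers, 4 wordclasses instead of 50).
TOT = [0.25, 0.25, 0.25, 0.25]
HEA = [0.40, 0.30, 0.20, 0.10]
HEB = [0.10, 0.40, 0.30, 0.20]
BIO = [0.20, 0.10, 0.30, 0.40]

diffs = [[s - t for s, t in zip(sec, TOT)] for sec in (HEA, HEB, BIO)]
X, Y, Z = gram_schmidt(diffs)
print(project(HEA, TOT, [X, Y, Z]))
```

Because X is the normalized HEA-TOT vector, the Herbal-A section itself lands on the positive X axis in this sketch, with Y and Z coordinates equal to zero up to rounding.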

The "Pharma" projection is an attempt to separate the Herbal-A pages from the Pharma pages by using a different projection to three-space. The X, Y, and Z axes were obtained by orthogonalization of PHA-HEA, HEB-HEA, and BIO-HEA, where PHA is the vector of wordclass frequencies for the "Pharma" section. Here are the X, Y, and Z coordinates of each page in this projection.

You may want to check another version of these plots without the identification of similar characters.