Last edited on 1999-01-31 07:23:11 by stolfi

Scatter-plots of VMs pages

Analysis and discussion

The plots have several notable features:

Page clustering

The pages are obviously lumped into a few well-separated clusters, which correspond to major sections:

The Pharma cluster is very close to Herbal-A, but one can still separate most of one from most of the other by a plane, with a suitable projection.
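As a rough illustration of what "separable by a plane" means here, the sketch below builds one candidate plane (the perpendicular bisector of the segment joining the two cluster centroids) and reports the fraction of points that fall on the correct side. This is only one crude choice of plane, not the method used to make the plots; the function names and the coordinate layout (each page as a tuple of coordinates) are assumptions made for illustration.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def plane_separation(a_points, b_points):
    """Split two point sets with the plane that bisects their centroids.

    Returns (normal, offset, fraction), where the plane is
    dot(normal, x) == offset and `fraction` is the share of all
    points lying strictly on their own cluster's side of it.
    """
    ca, cb = centroid(a_points), centroid(b_points)
    normal = [x - y for x, y in zip(ca, cb)]      # points from B towards A
    mid = [(x + y) / 2 for x, y in zip(ca, cb)]   # midpoint of the centroids
    offset = sum(w * m for w, m in zip(normal, mid))

    def side(p):
        # Positive on the A side of the plane, negative on the B side.
        return sum(w * x for w, x in zip(normal, p)) - offset

    ok_a = sum(1 for p in a_points if side(p) > 0)
    ok_b = sum(1 for p in b_points if side(p) < 0)
    return normal, offset, (ok_a + ok_b) / (len(a_points) + len(b_points))
```

A fraction close to 1 would support the "most of one from most of the other" reading; a better-chosen plane (for instance from a linear discriminant) could only improve on this figure.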

We cannot say much about the other sections ("Astronomical", "Cosmological", and "Zodiac"), because only a couple of pages have been transcribed from each. We will ignore them in the rest of this discussion.

There are several possible explanations for this clustering:

Randomness within each cluster

There is no clear evidence for clustering, or any other structure, within each section. (Again, it may be that we simply have chosen the wrong projection.)

In fact, the pages seem to jump fairly randomly within each cluster. Some of that randomness may be due to sampling error, given the relatively small amount of text in many pages. To judge how much, I should have computed and plotted the sampling error ellipses...
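For reference, here is a minimal sketch of how such a sampling error ellipse could be computed for one page. It assumes, and this is an assumption not stated above, that each plotted coordinate is a linear combination of the page's relative frequencies, and that the page's tokens behave like independent draws (a multinomial model). The names `counts`, `u`, `v` and `n_sigma` are hypothetical.

```python
import math


def sampling_ellipse(counts, u, v, n_sigma=1.0):
    """Sampling-error ellipse of a page's projected position.

    counts : token counts per category on the page
    u, v   : weights defining the two plot axes (assumed linear projection)
    Returns (center, (semi_major, semi_minor), angle_in_radians).
    """
    n = sum(counts)
    p = [c / n for c in counts]                    # observed frequencies
    x = sum(pi * ui for pi, ui in zip(p, u))       # projected position
    y = sum(pi * vi for pi, vi in zip(p, v))
    # Under the multinomial model the frequency vector has covariance
    # (diag(p) - p p^T) / n; project that onto the two axes.
    sxx = (sum(pi * ui * ui for pi, ui in zip(p, u)) - x * x) / n
    syy = (sum(pi * vi * vi for pi, vi in zip(p, v)) - y * y) / n
    sxy = (sum(pi * ui * vi for pi, ui, vi in zip(p, u, v)) - x * y) / n
    # Eigenvalues of the symmetric 2x2 covariance matrix, in closed form.
    mean = (sxx + syy) / 2
    root = math.hypot((sxx - syy) / 2, sxy)
    lam_major = mean + root
    lam_minor = max(mean - root, 0.0)              # guard against rounding
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)   # direction of major axis
    axes = (n_sigma * math.sqrt(lam_major), n_sigma * math.sqrt(lam_minor))
    return (x, y), axes, angle
```

Under this model the ellipse axes shrink as one over the square root of the page's token count, so a page with a quarter of the text has an ellipse twice as wide. If the scatter within a cluster is comparable to these ellipses, sampling error alone could account for it.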

No linear trends?

Another conspicuous (non-)feature is the "random" placement of the clusters in space. There seems to be no obvious alignment between the cluster centers, and no obvious ordering of the clusters along a line.

It is not clear how to interpret this observation. At first, I thought it was a point against the "gradual evolution" theory. Under gradual evolution (due to aging, practice, or increasing sloppiness), the page points should form an elongated cloud, whereas I saw only discrete clusters.

However, after seeing René's paper, I don't know what to think. René sees a single elongated cloud (even though it has a 90-degree turn at some point), and takes that as an argument for gradual evolution.
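One way to put a number on "elongated cloud" versus "no alignment" is the share of the total variance that lies along the single best-fitting line through the points (the first principal axis). The sketch below is an illustration under that assumption, not the analysis used here or in René's paper. It finds the axis by power iteration, which could in principle stall on contrived, highly symmetric data; the function name is hypothetical.

```python
import math


def elongation(points, iterations=200):
    """Fraction of total variance along the first principal axis.

    Values near 1 mean the points lie close to a single straight line;
    values near 1/d (d = number of dimensions) mean no preferred direction.
    """
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    centered = [[p[k] - mean[k] for k in range(d)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    total = sum(cov[i][i] for i in range(d))
    if total == 0:
        return 0.0                       # all points coincide
    # Start the power iteration from the point farthest from the mean.
    v = list(max(centered, key=lambda c: sum(x * x for x in c)))
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the top eigenvalue.
    top = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return top / total
```

Applied to the cluster centers, a low value would match the "no obvious alignment" impression. Note that a cloud with a 90-degree turn, as René describes, would also score low on this straight-line measure, so a low value alone does not rule out a bent but continuous trend.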

Words are better than elements

Comparing the plots based on words or wordclasses to the ones based on elements, we see that clustering is much more evident in the former than in the latter.

This observation seems consistent with many theories about the nature of the clustering. It would be expected if the section clustering is due to different but related languages or dialects. But it could also be explained by different subject matter or grammatical style (e.g. different usage of "-ed" and "will" in narrative vs. romance). It could also be due to a change in spelling rules. And so on...
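The claim that clustering is "more evident" with words than with elements is a visual judgment. A simple way to compare the two feature sets numerically is the ratio of between-cluster scatter to total scatter, computed separately for each set of page coordinates. The sketch below assumes the pages are already labeled by section and given as coordinate tuples; the function name and data layout are hypothetical.

```python
def separation_score(clusters):
    """Between-cluster scatter divided by total scatter, in [0, 1].

    clusters : dict mapping a section label to a list of page points.
    A score near 1 means tight, well-separated clusters; a score near 0
    means the section labels explain almost none of the spread.
    """
    all_points = [p for pts in clusters.values() for p in pts]
    d = len(all_points[0])
    grand = [sum(p[k] for p in all_points) / len(all_points)
             for k in range(d)]
    between = 0.0
    within = 0.0
    for pts in clusters.values():
        c = [sum(p[k] for p in pts) / len(pts) for k in range(d)]
        # Squared distance of the cluster centroid from the grand mean,
        # weighted by the number of pages in the cluster.
        between += len(pts) * sum((c[k] - grand[k]) ** 2 for k in range(d))
        # Squared distances of the pages from their own centroid.
        within += sum((p[k] - c[k]) ** 2 for p in pts for k in range(d))
    total = between + within
    return between / total if total > 0 else 0.0
```

Computing this score once on the word-based coordinates and once on the element-based ones would give a direct comparison, since the ratio does not depend on the overall scale of either set of coordinates.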