Last edited on 1998-07-06 10:18:46 by stolfi

Scatterplots of VMs pages

Abstract

This document presents a set of "page scatterplots", showing the similarities and differences between various sections of the VMs, based on the relative frequencies of words and characters in each page. The page distribution turns out to be strongly clustered, and the clusters correspond pretty closely to the traditional sections.

Related work

These plots can be considered an independent replication of some of the plots Rene presented at the Teddington meeting, specifically those in the "Language characteristics" section. (The plots themselves are quite similar but, curiously, we drew nearly opposite conclusions: while Rene sees the plots as confirming unity and gradual change, I see discrete clusters and clear gaps. But those may be just biases and holes in my data...)

More generally, these plots fit into a series of attempts to classify the VMs pages by various statistical criteria. Here we can mention Rene's distance matrix image in his original paper, now redone in the Teddington paper; dendrograms by Karl Kluge and Gabriel; and informal tabulations by many others.

Contents

The plots

The input text

Mathematical details

Analysis and discussions

Lab notebooks

The terminally curious reader can browse my "lab notebooks", which contain the Unix "recipes" for these plots, and the data files mentioned therein: