Last edited on 1999-01-31 08:37:45 by stolfi

# Scatter-plots of VMs pages

## Mathematical details

### What is being plotted, exactly?

All the scatterplots in this document are built according to the same general principle. Each page of the manuscript is represented by one dot, whose coordinates are derived from the frequencies of certain "keys" in that page. The keys could be words, characters, character groups, or any other independent discrete tokens extracted from the text.

Let K = (K1, K2, ..., Kn) be the list of all keys, in some fixed order, and let Pi be the relative frequency of key Ki in page P, expressed as a number between 0 and 1. Then we can view each page P as a point of the n-dimensional space Rn, whose coordinates are the key frequencies (P1, P2, ..., Pn). Each scatterplot is then the projection of all those points onto some two-dimensional subspace of Rn.

### Computing the key frequencies Pi

The frequency Pi of a key Ki in a page P is estimated by the formula

$$P_i = \frac{\#(K_i, P) + 1}{\#(K, P) + n} \qquad (1)$$

where
• n is the number of distinct keys in the list K;
• #(Ki , P) is the number of times key Ki occurs on page P;
• #(K , P) is the total number of key occurrences on page P, that is, the sum of #(Ki , P) for i = 1, 2, ..., n.

We use formula (1), rather than the obvious one

$$P_i = \frac{\#(K_i, P)}{\#(K, P)} \qquad (2)$$

in order to reduce the impact of sampling error on short pages. As #(K , P) tends to zero, formula (1) converges towards the uniform distribution Pi = 1/n, whereas formula (2) jumps around erratically. On the other hand, as the total number of key occurrences #(K , P) increases, both formulas converge to the true relative frequency of key Ki in the page's "language".
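As an illustration, formula (1) can be sketched in a few lines of Python. This is only a minimal sketch, not the code used to produce the plots: the function name `key_frequencies` and the choice to ignore tokens that are not in the key list are assumptions made for this example.

```python
from collections import Counter

def key_frequencies(tokens, keys):
    """Estimate the frequencies P_i of formula (1) for one page.

    tokens: the keys extracted from the page, in reading order.
    keys:   the fixed list K of all distinct keys.
    Tokens that are not in `keys` are ignored (an assumption of this sketch).
    """
    n = len(keys)
    key_set = set(keys)
    counts = Counter(t for t in tokens if t in key_set)
    total = sum(counts[k] for k in keys)  # this is #(K, P)
    # Formula (1): add 1 to every count and n to the total.
    return [(counts[k] + 1) / (total + n) for k in keys]
```

For an empty page and a list of two keys, `key_frequencies([], ["a", "b"])` returns `[0.5, 0.5]`, the uniform distribution, whereas formula (2) would divide by zero.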

### The probability simplex

Note that, with either formula, the estimated frequencies Pi are always non-negative, and their sum is 1. Therefore, the point P = (P1,P2, ..., Pn) actually lies on a particular (n-1)-dimensional hyperplane H of n-space, and in fact inside a regular simplex on that hyperplane.
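For formula (1), the sum-to-one property can be checked directly, using only the definition of #(K , P) as the sum of the individual key counts:

$$\sum_{i=1}^{n} P_i = \sum_{i=1}^{n} \frac{\#(K_i, P) + 1}{\#(K, P) + n} = \frac{\left(\sum_{i=1}^{n} \#(K_i, P)\right) + n}{\#(K, P) + n} = \frac{\#(K, P) + n}{\#(K, P) + n} = 1$$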

In any case, since the values of n used here are rather large (10 or more), and the projection plane is always contained in H, the difference between n-space and (n-1)-space has no visible consequence in the plots.

### Projecting the points

In order to make sense of this cloud of points, we project them orthogonally onto some 3-dimensional subspace of n-space, defined by an orthogonal frame with respect to which the coordinates of the projected points are measured. In other words, we must select three mutually orthogonal vectors X, Y, and Z of Rn, and map each page's point P to the three numbers Px = (P-O)·X, Py = (P-O)·Y, Pz = (P-O)·Z, where "·" denotes scalar product in Rn, and O is the global distribution (the vector of key frequencies in the whole text).
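The projection step amounts to three scalar products per page. The sketch below is illustrative only; the function names `dot` and `project` are made up for this example, and vectors are represented as plain Python lists of equal length.

```python
def dot(u, v):
    """Scalar product of two vectors of the same length."""
    return sum(a * b for a, b in zip(u, v))

def project(P, O, X, Y, Z):
    """Map the page point P to (Px, Py, Pz), measured from the origin O.

    X, Y, Z are assumed to be mutually orthogonal vectors of R^n.
    """
    d = [p - o for p, o in zip(P, O)]  # the displacement P - O
    return (dot(d, X), dot(d, Y), dot(d, Z))
```

Each scatterplot then shows two of the three numbers returned by `project` for every page.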

### Selecting the projection axes

There is no obvious choice for the three projection axes X, Y, and Z. For the plots shown in these pages, I have merely applied Gram-Schmidt orthogonalization to three vectors A-R, B-R, and C-R, where A, B, and C are the frequency vectors of three selected sections, and R is a reference frequency vector.

In particular, for the top three plots in each display, I have orthogonalized the three vectors HEA-O, HEB-O, and BIO-O, where O is the key frequency vector for the whole sample, and HEA, HEB, and BIO are the key frequency vectors of the Herbal-A, Herbal-B, and Biological sections (the first segment of each), respectively. That is, X is the direction from the global key frequencies O towards the frequencies in Herbal-A; Y is the direction perpendicular to X that is closest to the direction from O to Herbal-B; and Z is the direction perpendicular to X and Y that is closest to the direction from O to the Biological frequencies.

For the bottom three plots, I have used instead the vectors PHA-HEA, HEB-HEA, and BIO-HEA, normalized, where PHA is the key frequency vector of the Pharma section.
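The Gram-Schmidt construction of the axes from three difference vectors can be sketched as follows. This is a minimal illustration rather than the code actually used for the plots: the function name `gram_schmidt_axes` is hypothetical, and scaling every axis to unit length is an assumption of the sketch.

```python
import math

def dot(u, v):
    """Scalar product of two vectors of the same length."""
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt_axes(A, B, C, R):
    """Return mutually orthogonal unit vectors X, Y, Z built from A-R, B-R, C-R.

    X points from R towards A; Y is the part of B-R perpendicular to X;
    Z is the part of C-R perpendicular to both X and Y.
    """
    axes = []
    for V in (A, B, C):
        w = [v - r for v, r in zip(V, R)]
        # Remove the components along the axes already chosen.
        for e in axes:
            c = dot(w, e)
            w = [wi - c * ei for wi, ei in zip(w, e)]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            raise ValueError("difference vectors are linearly dependent")
        axes.append([wi / norm for wi in w])  # unit length (assumption)
    return tuple(axes)
```

With the names used in the text, the top plots would correspond to `gram_schmidt_axes(HEA, HEB, BIO, O)`, with the frequency vectors computed beforehand.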