Last edited on 1999-01-31 08:37:45 by stolfi

Scatter-plots of VMs pages

Mathematical details

What is being plotted, exactly?

All the scatterplots in this document are built according to the same general principle. Each page of the manuscript is represented by one dot, whose coordinates are derived from the frequencies of certain "keys" in that page. The keys could be words, characters, character groups, or any other independent discrete tokens extracted from the text.

Let K = (K₁,K₂, ..., K_n) be the list of all keys, in some fixed order, and let P_i be the relative frequency of key K_i in page P, expressed as a number between 0 and 1. Then we can view each page P as a point of the n-dimensional space Rⁿ, whose coordinates are the key frequencies (P₁,P₂, ..., P_n). Each scatterplot is then the projection of all those points on some two-dimensional subspace of Rⁿ.

Computing the key frequencies P_i

The frequency P_i of a key K_i in a page P is estimated by the formula

P_i = (#(K_i , P) + 1) / (#(K , P) + n)

(1)

where

n is the number of distinct keys in the list K;
#(K_i , P) is the number of times key K_i occurs on page P;
#(K , P) is the total number of key occurrences on page P, that is, the sum of #(K_i , P) for i = 1, 2, ..., n.

We use formula (1), rather than the obvious one

P_i = #(K_i , P) / #(K , P)

(2)

in order to reduce the impact of sampling error on short pages. As #(K , P) tends to zero, formula (1) converges towards the uniform distribution P_i = 1/n, whereas formula (2) jumps around erratically. On the other hand, as the page's size #(K , P) increases, both formulas converge to the true relative frequency of key K_i in the page's "language".
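The difference between the two estimators can be sketched in a few lines of Python. This is only an illustration, not the program used to make the plots: the function name key_frequencies, the four-letter key list, and the sample pages are all made up for the example.

```python
from collections import Counter

def key_frequencies(page_tokens, keys, smoothed=True):
    """Estimate the frequency of each key on one page.

    smoothed=True  uses formula (1): (count + 1) / (total + n)
    smoothed=False uses formula (2): count / total
    """
    key_set = set(keys)
    # Only tokens that are in the key list count towards #(K , P).
    counts = Counter(t for t in page_tokens if t in key_set)
    total = sum(counts.values())
    n = len(keys)
    if smoothed:
        return [(counts[k] + 1) / (total + n) for k in keys]
    if total == 0:
        raise ValueError("formula (2) is undefined for a page with no keys")
    return [counts[k] / total for k in keys]

# Hypothetical data: four keys, one very short page and one longer page.
keys = ["a", "b", "c", "d"]
short_page = ["a"]
long_page = ["a"] * 50 + ["b"] * 30 + ["c"] * 20

print(key_frequencies(short_page, keys))                  # formula (1)
print(key_frequencies(short_page, keys, smoothed=False))  # formula (2)
print(key_frequencies(long_page, keys))
print(key_frequencies(long_page, keys, smoothed=False))
```

On the one-token page, formula (2) puts all the weight on a single key, while formula (1) stays close to the uniform distribution; on the longer page the two estimates are already close to each other.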

The probability simplex

Note that, with either formula, the estimated frequencies P_i are always non-negative, and their sum is 1. Therefore, the point P = (P₁,P₂, ..., P_n) actually lies on a particular (n-1)-dimensional hyperplane H of n-space, and in fact inside a regular simplex on that hyperplane.
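For formula (1), the claim that the frequencies add up to 1 follows directly from the definition of #(K , P) as the sum of the individual counts. A short derivation, written in LaTeX notation:

```latex
\sum_{i=1}^{n} P_i
  = \sum_{i=1}^{n} \frac{\#(K_i, P) + 1}{\#(K, P) + n}
  = \frac{\sum_{i=1}^{n} \#(K_i, P) + n}{\#(K, P) + n}
  = \frac{\#(K, P) + n}{\#(K, P) + n}
  = 1
```

The same argument without the "+ 1" and "+ n" terms gives the result for formula (2), provided that #(K , P) is not zero.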

In any case, since the values of n used here are rather large (10 or more), and the projection plane is always contained in H, the difference between n-space and (n-1)-space has no visible consequence in the plots.

Projecting the points

In order to make sense of this cloud of points, we project them orthogonally onto some 3-dimensional subspace of n-space, defined by an orthogonal frame; the coordinates of the projected points are measured with respect to that frame. In other words, we must select three mutually orthogonal vectors X, Y, and Z of Rⁿ, and map each page's point P to the three numbers P_x = (P-O)·X, P_y = (P-O)·Y, P_z = (P-O)·Z, where "·" denotes the scalar product in Rⁿ, and O is the global distribution (the vector of key frequencies in the whole text).
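As a rough illustration of this mapping, the sketch below computes the three projected coordinates for one page in a toy 4-dimensional key space. The helper names dot and project, and all the numbers, are invented for the example. The axes are also assumed to have unit length, which is a choice made here to keep the coordinates on a natural scale.

```python
import math

def dot(u, v):
    """Scalar product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def project(P, O, X, Y, Z):
    """Map the point P to (P_x, P_y, P_z) relative to the origin O."""
    d = [p - o for p, o in zip(P, O)]   # the displacement P - O
    return dot(d, X), dot(d, Y), dot(d, Z)

# Hypothetical frame: three mutually orthogonal unit vectors in R^4.
# Their components sum to zero, so they are parallel to the hyperplane
# on which all frequency vectors lie.
s = 1 / math.sqrt(2)
X = [s, -s, 0.0, 0.0]
Y = [0.0, 0.0, s, -s]
Z = [0.5, 0.5, -0.5, -0.5]

O = [0.25, 0.25, 0.25, 0.25]   # made-up global key frequencies
P = [0.40, 0.20, 0.20, 0.20]   # made-up key frequencies of one page

P_x, P_y, P_z = project(P, O, X, Y, Z)
print(P_x, P_y, P_z)
```

Any two of the three resulting numbers can then be used as the plot coordinates of that page's dot.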

Selecting the projection axes

There is no obvious choice for the three projection axes X, Y, and Z. For the plots shown in these pages, I have merely applied Gram-Schmidt orthogonalization to three vectors A-R, B-R, and C-R, where A, B, and C are the frequency vectors for three selected sections, and R is a reference frequency vector.

In particular, for the top three plots in each display, I have orthogonalized the three vectors HEA-O, HEB-O, and BIO-O, where O is the key frequency vector for the whole sample, and HEA, HEB, and BIO are the key frequencies in the Herbal-A, Herbal-B, and Biological sections (the first segment of each), respectively. That is, X is the direction from the global key frequencies O towards the frequencies in Herbal-A; Y is the direction perpendicular to X that is closest to the direction from O to Herbal-B; and Z is the direction perpendicular to X and Y that is closest to the direction from O to the Biological frequencies.
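The construction of the axes can be sketched as follows. This is a minimal Gram-Schmidt routine in plain Python, not the code actually used for the plots. The function names, the 4-dimensional key space, and the frequency values are all made up, and each axis is assumed to be scaled to unit length.

```python
import math

def dot(u, v):
    """Scalar product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors):
    """Turn linearly independent vectors into an orthonormal list."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)   # component of w along an earlier axis
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            raise ValueError("vectors are linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis

def projection_axes(A, B, C, R):
    """Axes X, Y, Z obtained from the vectors A-R, B-R, C-R."""
    diffs = [[s - r for s, r in zip(S, R)] for S in (A, B, C)]
    return gram_schmidt(diffs)

# Hypothetical frequency vectors over four keys.
O = [0.25, 0.25, 0.25, 0.25]     # whole sample
HEA = [0.40, 0.30, 0.20, 0.10]   # Herbal-A
HEB = [0.10, 0.40, 0.30, 0.20]   # Herbal-B
BIO = [0.20, 0.10, 0.40, 0.30]   # Biological

X, Y, Z = projection_axes(HEA, HEB, BIO, O)
print(X)
print(Y)
print(Z)
```

Because each vector is only corrected by the axes computed before it, X points exactly from O towards Herbal-A, while Y and Z are the nearest perpendicular directions to O-to-Herbal-B and O-to-Biological, as described above.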

For the bottom three plots, I have used instead the vectors PHA-HEA, HEB-HEA, and BIO-HEA, normalized, where PHA is the Pharma section.