Last edited on 1999-01-31 08:37:45 by stolfi
All the scatterplots in this document are built according to the same general principle. Each page of the manuscript is represented by one dot, whose coordinates are derived from the frequencies of certain "keys" in that page. The keys could be words, characters, character groups, or any other independent discrete tokens extracted from the text.
Let K = (K_{1}, K_{2}, ..., K_{n}) be the list of all keys, in some fixed order, and let P_{i} be the relative frequency of key K_{i} in page P, expressed as a number between 0 and 1. Then we can view each page P as a point of the n-dimensional space R^{n}, whose coordinates are the key frequencies (P_{1}, P_{2}, ..., P_{n}). Each scatterplot is then the projection of all those points on some two-dimensional subspace of R^{n}.
The frequency P_{i} of a key K_{i} in a page P is estimated by the formula

(1) 
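We use formula (1), rather than the obvious one
P_{i} = #(K_{i}, P) / #(K, P)  (2)
where #(K_{i}, P) is the number of occurrences of key K_{i} in page P, and #(K, P) is the total number of key occurrences in that page.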
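As an illustration of the "obvious" estimate (2) only, here is a minimal Python sketch. It assumes that a page is available as a list of tokens and that tokens not in the key list are simply ignored; the function name `key_frequencies` and the sample data are hypothetical, not taken from the original analysis. It does not implement formula (1).

```python
from collections import Counter

def key_frequencies(page_tokens, keys):
    """Return [P_1, ..., P_n] for one page, using the plain ratio of formula (2)."""
    key_set = set(keys)
    # Count only tokens that are keys (assumption: other tokens are ignored).
    counts = Counter(tok for tok in page_tokens if tok in key_set)
    total = sum(counts.values())  # total occurrences of all keys in the page
    if total == 0:
        # The plain ratio is undefined for a page that contains no keys.
        raise ValueError("page contains none of the keys")
    return [counts[k] / total for k in keys]

# Hypothetical data: three keys and one short page ("x" is not a key).
keys = ["a", "b", "c"]
page = ["a", "b", "a", "x", "a", "c"]
freqs = key_frequencies(page, keys)
```

With this sample page the three frequencies are 3/5, 1/5, and 1/5, so they are nonnegative and add up to 1.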
Note that, with either formula, the estimated frequencies P_{i} are always nonnegative, and their sum is 1. Therefore, the point P = (P_{1}, P_{2}, ..., P_{n}) actually lies on a particular (n-1)-dimensional hyperplane H of n-space, and in fact inside a regular simplex on that hyperplane.
In any case, since the values of n used here are rather large (10 or more), and the projection plane is always contained in H, the difference between n-space and (n-1)-space has no visible consequence in the plots.
In order to make sense of this cloud of points, we project them orthogonally onto some 3-dimensional subspace of n-space, defined by an orthogonal frame with respect to which the coordinates of the projected points are measured. In other words, we must select three mutually orthogonal vectors X, Y, and Z of R^{n}, and map each page's point P to the three numbers P_{x} = (P - O)·X, P_{y} = (P - O)·Y, P_{z} = (P - O)·Z, where "·" denotes scalar product in R^{n}, and O is the global distribution (the vector of key frequencies in the whole text).
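The projection step can be sketched in a few lines of Python: each coordinate is the scalar product of the offset of P from O with one of the three axes. The function names and the 4-key sample vectors below are hypothetical, chosen only so that the axes are mutually orthogonal and the arithmetic is easy to follow.

```python
def dot(u, v):
    """Scalar product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def project(P, O, X, Y, Z):
    """Map a page's frequency vector P to its three plot coordinates."""
    d = [p - o for p, o in zip(P, O)]  # offset of the page from the global distribution
    return dot(d, X), dot(d, Y), dot(d, Z)

# Hypothetical example with n = 4 keys.
O = [0.25, 0.25, 0.25, 0.25]   # global key frequencies
P = [0.50, 0.25, 0.25, 0.00]   # one page's key frequencies
X = [1, -1, 0, 0]              # three mutually orthogonal axes whose
Y = [0, 0, 1, -1]              # components sum to zero, so they are
Z = [1, 1, -1, -1]             # parallel to the hyperplane of frequencies

Px, Py, Pz = project(P, O, X, Y, Z)
```

For this page the offset from O is (0.25, 0, 0, -0.25), which gives the coordinates (0.25, 0.25, 0.5). The sample axes are orthogonal but not of unit length; whether the axes should be normalized is left open here.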
There is no obvious choice for the three projection axes X, Y, and Z. For the plots shown in these pages, I have merely applied Gram-Schmidt orthogonalization to three vectors A - R, B - R, C - R, where A, B, and C are the frequency vectors of three selected sections, and R is a reference frequency vector.
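For readers unfamiliar with the procedure, the following is a minimal Python sketch of Gram-Schmidt orthogonalization applied to three difference vectors. It is not the code used for the plots: the section vectors A, B, C and the reference R are made-up 4-key examples, and scaling each axis to unit length is an assumption of this sketch.

```python
import math

def dot(u, v):
    """Scalar product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors):
    """Return mutually orthogonal unit vectors spanning the same space as the input."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)  # component of w along an axis already chosen
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            raise ValueError("input vectors are linearly dependent")
        basis.append([wi / norm for wi in w])  # unit length (assumption of this sketch)
    return basis

# Hypothetical frequency vectors for n = 4 keys.
R = [0.25, 0.25, 0.25, 0.25]
A = [0.4, 0.3, 0.2, 0.1]
B = [0.1, 0.4, 0.3, 0.2]
C = [0.3, 0.1, 0.2, 0.4]

def diff(u, v):
    """Componentwise difference u - v."""
    return [a - b for a, b in zip(u, v)]

X, Y, Z = gram_schmidt([diff(A, R), diff(B, R), diff(C, R)])
```

The first axis keeps the direction of the first difference vector, and each later axis is what remains of its difference vector after the components along the earlier axes are removed. Since every difference of two frequency vectors has components summing to zero, so do the resulting axes.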
In particular, for the top three plots in each display, I have orthogonalized the three vectors HEA - O, HEB - O, BIO - O, where O is the key frequency vector for the whole sample, and HEA, HEB, and BIO are the key frequencies in the Herbal-A, Herbal-B, and Biological sections (the first segment of each), respectively. That is, X is the direction from the global key frequencies O towards the frequencies in Herbal-A; Y is the direction perpendicular to X that is closest to the direction from O to Herbal-B; and Z is the direction perpendicular to X and Y that is closest to the direction from O to the Biological frequencies.
For the bottom three plots, I have used instead the vectors PHA - HEA, HEB - HEA, BIO - HEA, normalized, where PHA is the Pharma section.