# Last edited on 2005-01-17 02:42:00 by stolfi

DATA SETS FOR MUTUAL INFORMATION CONTENT ANALYSIS
--------------------------------------------------

OVERVIEW
--------

  This directory contains two complete data sets for the analysis
  program ("frb_analyze"):

    Data set "s2" is used for the analysis of mutual information
    content between "true" candidates (closely fitting pairs of
    fragment outlines) selected by hand from those used in Helena's
    thesis.

    Data set "f2" is a "control experiment" consisting of "false"
    candidates, namely fragment outline segments that were picked
    and paired at random.

  The program requires two kinds of input data:

    * the geometry of the fragment outlines (the ".flc" files in
      "curves-001/", shared by both data sets); and

    * a representative sample of candidates taken from the target
      set (file "sample.can" in the directory "cands-001/SET/raw/",
      where SET is "s2" or "f2").

  The program writes many output files, mostly for debugging and
  documentation purposes. The actual results of the analysis are
  written as two tables, the information content of each Fourier
  coefficient ("sample.ifc") and the same data condensed by
  frequency band ("sample.ibd"), in the directories
  "cands-001/SET/ref/".

FRAGMENT-RELATED FILES
----------------------

  The input fragment outlines are in directory "curves-001/",
  specifically in the files

    "curves-001/NNNN/f001.flc"

  where NNNN is the fragment number, ranging from 0000 to 0111.

  These outlines have been smoothed with a "geometric" Gaussian
  filter (as described in Helena's thesis) with characteristic width
  sigma = 1 pixel = 1/300 inch = 0.085 mm, then sampled with a
  uniform step as close as possible to sigma/4, namely
  delta = 0.25 pixel = 0.021 mm.

CANDIDATE-RELATED FILES
-----------------------

  Candidate-related files, input and output, live in the directories
  "cands-001/SET" where SET is either "s2" or "f2".

  The input sample candidates, to be analyzed, are described in the
  file

    "cands-001/SET/raw/sample.can"

  For each candidate listed in this file, the program will extract
  its two segments from the respective fragment outlines, and write
  them to files

    "cands-001/SET/raw/NNNNNN/SIDE.flc"
    "cands-001/SET/raw/NNNNNN/SIDE.fsh"

  where NNNNNN is the candidate number, and SIDE is either "a" or
  "b". These ".flc" files describe plane curves, like those that
  describe the whole fragment outlines; each data point has three
  coordinates X Y Z (with Z = 0 for our fragments). The ".fsh" files
  are the corresponding "shape functions"; each data point is a
  single number.

  The program then adjusts the alignment of each candidate and trims
  its segments to the proper number of data points (2^9+1 = 513).
  The vital parameters of these "refined" candidates are written to
  the file

    "cands-001/SET/ref/sample.can"

  The geometry of these segments and their "shape functions" are
  written to the files

    "cands-001/SET/ref/NNNNNN/SIDE.flc"
    "cands-001/SET/ref/NNNNNN/SIDE.fsh"

  A plot of the two segments, rotated and translated to their
  "about-to-fit" positions, can be found in

    "cands-001/SET/ref/NNNNNN/ab-flc.eps"

  The program also writes the Fourier transforms (actually, sine
  transforms) and the power spectra of the shape functions to the
  files

    "cands-001/SET/ref/NNNNNN/SIDE.fft"
    "cands-001/SET/ref/NNNNNN/SIDE.fpw"

  respectively. In addition, the program computes the mean "m" and
  the difference "d" of the two shape functions, and writes them and
  their derived data to the files

    "cands-001/SET/ref/NNNNNN/m.fsh"
    "cands-001/SET/ref/NNNNNN/m.fft"
    "cands-001/SET/ref/NNNNNN/m.fpw"

    "cands-001/SET/ref/NNNNNN/d.fsh"
    "cands-001/SET/ref/NNNNNN/d.fft"
    "cands-001/SET/ref/NNNNNN/d.fpw"

  Note that the candidate number NNNNNN is *not* necessarily the
  same in the "raw" and "refined" sets, because the program may
  discard "raw" candidates that are too short for realignment and
  trimming.
  In the two data sets provided here, however, no candidates were
  eliminated, so the numbers happen to match.

  The program also writes the files

    "cands-001/SET/ref/m.fvt"
    "cands-001/SET/ref/d.fvt"
    "cands-001/SET/ref/ab.fvt"

  These files contain one line for each frequency k from 1 to 511,
  and show the average and variance of the corresponding Fourier
  coefficient for the "m", "d", and "a"/"b" shape functions,
  respectively.

  Finally, the program writes

    "cands-001/SET/ref/sample.ifc"

  which gives the mutual information content for each frequency; and

    "cands-001/SET/ref/sample.ibd"

  which gives the same information, condensed for each frequency
  band.

FILE FORMATS
------------

  The format of these files should be mostly self-explanatory. Lines
  beginning with "|", when present, are merely comments. The named
  "unit" parameter defined in some files is a scale factor to be
  multiplied into the data values that follow it.

  Fragment outlines and segments (".flc" extension):

    After the named parameter definition "samples = M" there are M
    data points equally spaced along the fragment's contour, one per
    line. Each point is a triplet X Y Z of integer coordinates,
    implicitly multiplied by the "unit" parameter to yield actual
    coordinates in mm. The Z coordinate is always zero in these
    data sets.

    The full outlines are closed curves, so the last sample is
    implicitly followed by the first one. This assumption does not
    apply to the ".flc" files of extracted segments, which are open
    curves.

  Shape functions (".fsh" extension):

    The format is similar to that of ".flc" files, except that each
    data point (each line) is a single number.

  Fourier transforms and spectra (".fft" and ".fpw" extensions):

    These have the same format as the ".fsh" (shape function) files.

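
  As an illustration of the formats above, here is a minimal Python
  sketch of a reader for ".flc"-style and ".fsh"-style text. It is not
  part of "frb_analyze". The function name "read_samples" is
  hypothetical, and the sketch assumes that named parameters appear on
  lines of the form "name = value" and that any other header line can
  be skipped; only the "|" comments, the "samples" count, and the
  "unit" scale factor come from the description above.

```python
def read_samples(text):
    """Parse ".flc"/".fsh"-style text into (params, rows).

    Assumptions of this sketch (not confirmed by the README):
    parameters look like "name = value", and unrecognized header
    lines may be ignored.
    """
    params = {}
    rows = []
    unit = 1.0  # scale factor; stays 1.0 if no "unit" parameter is given
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("|"):
            continue  # blank line or "|" comment
        if "=" in line:
            name, value = (s.strip() for s in line.split("=", 1))
            params[name] = value
            if name == "unit":
                unit = float(value)
            continue
        try:
            values = [float(tok) for tok in line.split()]
        except ValueError:
            continue  # unrecognized header line, skipped in this sketch
        # Apply the scale factor to every value on the data line.
        rows.append(tuple(v * unit for v in values))
    expected = params.get("samples")
    if expected is not None and int(expected) != len(rows):
        raise ValueError(
            "expected %s samples, found %d" % (expected, len(rows))
        )
    return params, rows


# Made-up example data, for illustration only.
EXAMPLE_FLC = """\
| hypothetical outline with three points
unit = 0.5
samples = 3
0 0 0
2 4 0
-6 8 0
"""
```

  With this sketch, read_samples(EXAMPLE_FLC) yields three X Y Z
  triplets already scaled to mm; the same function handles ".fsh"-style
  text, where each row is a one-element tuple.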
  Candidate sets (".can" extension):

    After the file header and some named parameters, there is one
    line for each candidate (segment pair) in the format

      ACRV ATOT AINI ALEN ADIR  BCRV BTOT BINI BLEN BDIR  R S T

    The first five fields describe one of the paired segments
    (segment "a"), and the next five fields describe the other
    segment ("b"). The last three fields are not used by
    frb_analyze.

    For segment "a":

      ACRV is the index of its fragment outline in directory
      "curves-001/";

      ATOT is the total number of samples in the whole outline;

      AINI is the index of the first sample of segment "a" within
      that outline;

      ALEN is the number of samples (not steps) in segment "a";

      ADIR is the direction ("+" or "-") in which the segment is to
      be read.

    Thus, segment "a" consists of samples

      curve[ACRV][(AINI+ADIR*k)%ATOT]

    for k varying from 0 through ALEN-1, where ADIR is read as +1
    or -1 and "%" is the mathematical MOD (remainder) operator,
    whose result always lies between 0 and ATOT-1. The description
    of segment "b" is entirely similar.
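
  The indexing rule above can be sketched in Python as follows. This
  is an illustration only, not code from "frb_analyze": the function
  names are hypothetical, the fields are assumed to be separated by
  whitespace, and the sample line in the usage note is made up.
  Python's "%" already behaves as the mathematical MOD for a positive
  modulus, so negative intermediate values wrap around correctly.

```python
def segment_indices(ini, length, direction, total):
    """Return the outline sample indices covered by one segment.

    ini, length, total correspond to AINI, ALEN, ATOT (or the "b"
    fields); direction is the string "+" or "-".
    """
    if direction not in ("+", "-"):
        raise ValueError("direction must be '+' or '-'")
    step = 1 if direction == "+" else -1
    # (ini + step*k) mod total, for k = 0 .. length-1
    return [(ini + step * k) % total for k in range(length)]


def parse_candidate_line(line):
    """Split one candidate line into two segment descriptions.

    Assumes whitespace-separated fields in the order
    ACRV ATOT AINI ALEN ADIR BCRV BTOT BINI BLEN BDIR R S T;
    the last three fields are ignored, as frb_analyze does not use them.
    """
    fields = line.split()
    if len(fields) < 10:
        raise ValueError("candidate line has fewer than 10 fields")

    def segment(f):
        return {
            "crv": int(f[0]),  # outline index in curves-001
            "tot": int(f[1]),  # samples in the whole outline
            "ini": int(f[2]),  # first sample of the segment
            "len": int(f[3]),  # number of samples in the segment
            "dir": f[4],       # "+" or "-"
        }

    return segment(fields[0:5]), segment(fields[5:10])
```

  For example, with a made-up line "12 10 8 4 + 47 10 1 4 - 0 0 0",
  segment "a" covers samples 8, 9, 0, 1 of outline 12 (wrapping past
  the end of the closed curve), and segment "b" covers samples
  1, 0, 9, 8 of outline 47, read backwards.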