Supervised classification by OPF

In the supervised OPF classifier, the data samples are the nodes of a complete graph whose arcs are weighted by the distance between the feature vectors of the nodes they connect. The path-value function assigns to each path the maximum arc weight along it, and the IFT (image foresting transform) algorithm minimizes this value for all nodes. The prototypes are the closest samples from distinct classes, as found by a minimum spanning tree of the graph. As mentioned in [4], this does not guarantee zero errors in the training set, but it considerably reduces training-set misclassification.
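
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not LibOPF's implementation: the function names (find_prototypes, train_opf, classify_opf) are hypothetical, prototypes are assumed to be the endpoints of minimum-spanning-tree arcs that join samples with different labels, and the dataset is assumed to contain at least two classes. A new sample is labeled by the training sample that offers it the path of minimum value.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_prototypes(X, y, dist=euclidean):
    """Build an MST of the complete graph with Prim's algorithm and return
    the indices of the endpoints of MST arcs that join different classes."""
    n = len(X)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known arc into the growing tree
    parent = [-1] * n
    best[0] = 0.0
    prototypes = set()
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1 and y[parent[u]] != y[u]:
            prototypes.update((u, parent[u]))
        for v in range(n):
            if not in_tree[v]:
                d = dist(X[u], X[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return prototypes


def train_opf(X, y, dist=euclidean):
    """Propagate labels from the prototypes, minimizing for every node the
    maximum arc weight along its path (an IFT-style priority-queue loop)."""
    n = len(X)
    cost = [math.inf] * n
    label = list(y)
    heap = []
    for p in find_prototypes(X, y, dist):
        cost[p] = 0.0          # prototypes start paths at zero cost
        heapq.heappush(heap, (0.0, p))
    done = [False] * n
    while heap:
        c, s = heapq.heappop(heap)
        if done[s]:
            continue           # stale queue entry
        done[s] = True
        for t in range(n):
            if not done[t]:
                candidate = max(c, dist(X[s], X[t]))
                if candidate < cost[t]:
                    cost[t] = candidate
                    label[t] = label[s]
                    heapq.heappush(heap, (candidate, t))
    return cost, label


def classify_opf(x, X, cost, label, dist=euclidean):
    """Assign x the label of the training node offering the cheapest path."""
    best = min(range(len(X)), key=lambda s: max(cost[s], dist(X[s], x)))
    return label[best]
```

For example, with X = [(0, 0), (0, 1), (5, 5), (5, 6)] and y = [0, 0, 1, 1], calling train_opf(X, y) and then classify_opf((0.2, 0.5), X, cost, label) assigns the new sample to class 0. The sketch recomputes distances on demand and runs in quadratic time, which is acceptable only for small illustrative datasets.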

A dataset containing a feature vector and a label for each sample must be provided in the OPF file format, which is specified in the README file of the software distribution (Section 2). When the distance function is time-consuming to compute, a precomputed distance file can be supplied. LibOPF also provides the program opf_distance, which offers several distance options, to create precomputed distance files. The Euclidean distance is the default.
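
To illustrate why precomputing helps, the sketch below builds an in-memory table of pairwise distances once, so that later steps only perform lookups. It is a hypothetical Python illustration: the function name precompute_distances is an assumption, and the table is not the file layout written by opf_distance, which is defined by the LibOPF distribution.

```python
import math


def euclidean(a, b):
    """Euclidean distance, the default mentioned in the text."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def precompute_distances(vectors, dist=euclidean):
    """Return a symmetric n-by-n table with dist evaluated once per pair."""
    n = len(vectors)
    table = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Each unordered pair is computed once and mirrored.
            table[i][j] = table[j][i] = dist(vectors[i], vectors[j])
    return table
```

Any function of two feature vectors can be passed as dist, so the same table-building step applies to distances other than the Euclidean one.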

For large datasets (thousands or millions of samples), it is usually desirable to cap the size of the training set. An evaluation set can nevertheless be used to improve the choice of training samples through pseudo-tests (the learning procedure). LibOPF therefore provides a program, opf_split, that randomly splits the dataset into training, evaluation and test sets.
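
The idea of such a split can be sketched as follows. This is a hypothetical Python illustration, not the opf_split program: the function name split_dataset, its parameters, and the choice to split each class separately (so that every set keeps roughly the original class proportions) are assumptions made for the example.

```python
import random


def split_dataset(samples, train_frac, eval_frac, test_frac, seed=0):
    """Randomly split (features, label) pairs into three sets, class by class."""
    if abs(train_frac + eval_frac + test_frac - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = random.Random(seed)
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)
    train, evaluation, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = round(train_frac * len(group))
        n_eval = round(eval_frac * len(group))
        train += group[:n_train]
        evaluation += group[n_train:n_train + n_eval]
        test += group[n_train + n_eval:]   # remainder goes to the test set
    return train, evaluation, test
```

Fixing the seed makes the split reproducible, which is convenient when comparing classifiers on the same partition.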

One can design an OPF classifier with the program 'opf_train' and test it with the program 'opf_classify'. For large datasets, however, the program 'opf_learn' replaces 'opf_train': it learns from classification errors in the evaluation set without increasing the training set size. The resulting classifier is then tested with 'opf_classify'.
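
Learning from errors with a fixed training set size can be sketched as a loop that trains, classifies the evaluation set, and swaps misclassified evaluation samples with training samples. The Python below is a simplified, hypothetical sketch and not the opf_learn program: the function name learn_from_errors and the fit and predict callables are assumptions, and training samples are chosen at random for the swap, whereas LibOPF's actual procedure may restrict which samples are eligible.

```python
import random


def learn_from_errors(train_set, eval_set, fit, predict, iterations=10, seed=0):
    """Repeat train/evaluate rounds, swapping evaluation errors into the
    training set; return the best model seen and its evaluation accuracy.

    fit(train_set) -> model and predict(model, features) -> label are
    caller-supplied (hypothetical) callables."""
    rng = random.Random(seed)
    train_set, eval_set = list(train_set), list(eval_set)
    best_model, best_acc = None, -1.0
    for _ in range(iterations):
        model = fit(train_set)
        wrong = [i for i, (x, y) in enumerate(eval_set)
                 if predict(model, x) != y]
        acc = 1.0 - len(wrong) / len(eval_set)
        if acc > best_acc:
            best_model, best_acc = model, acc
        if not wrong:
            break
        # Swap each misclassified evaluation sample with a distinct, randomly
        # chosen training sample; both sets keep their original sizes.
        slots = rng.sample(range(len(train_set)),
                           min(len(wrong), len(train_set)))
        for i, j in zip(wrong, slots):
            eval_set[i], train_set[j] = train_set[j], eval_set[i]
    return best_model, best_acc
```

Because samples are exchanged rather than added, the training set never grows, which is the property the text emphasizes for large datasets.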

The main advantages of this supervised approach are:

It is very efficient, running much faster than SVMs and MLPs.
It is naturally multi-class.
It makes no assumption about the shape or separability of the classes in any feature space, so it supports some degree of class overlap.
It does not depend on parameter tuning.
It can use any distance function, which makes it possible to verify the effectiveness of the same set of features in different distance spaces.
It has been successfully evaluated on datasets from several applications, including texture images [5,6], laryngeal pathology detection [7], and oropharyngeal dysphagia identification [8].

Subsections

Accuracy measure

Joao Paulo Papa 2009-09-30