Improving SVM Accuracy by Training on Auxiliary Data Sources
Pengcheng Wu - Oregon State University
Thomas Dietterich - Oregon State University
The standard model of supervised learning assumes that training and test data are drawn from the same underlying distribution. This paper explores an application in which a second, auxiliary source of data is available, drawn from a different distribution. This auxiliary data is more plentiful, but of significantly lower quality, than the training and test data. In the SVM framework, a training example has two roles: (a) as a data point to constrain the learning process and (b) as a candidate support vector that can form part of the definition of the classifier. The paper considers using the auxiliary data in either (or both) of these roles. This auxiliary data framework is applied to a problem of classifying images of leaves of maple and oak trees using a kernel derived from the shapes of the leaves. Experiments show that when the training data set is very small, training with auxiliary data can produce large improvements in accuracy, even when the auxiliary data is significantly different from the training (and test) data. The paper also introduces techniques for adjusting the kernel scores of the auxiliary data points to make them more comparable to the training data points.
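The two roles described above can be separated in a small sketch. This is not the paper's formulation: it assumes an RBF kernel in place of the leaf-shape kernel, a simple subgradient update on a hinge loss in place of a full SVM solver, and a lower per-example cost for auxiliary points. The names `rbf`, `train`, `predict`, `aux_cost`, and the toy data are all hypothetical and chosen only for illustration.

```python
import math

def rbf(a, b, gamma=1.0):
    """Stand-in kernel (assumption); the paper uses a leaf-shape kernel."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))

def train(candidates, constraints, costs, epochs=200, lr=0.1, lam=0.01):
    """Fit f(x) = sum_j beta[j] * K(candidates[j], x) + b.

    candidates  : points allowed to act as support vectors (role b)
    constraints : (x, y) pairs whose hinge loss is penalised (role a)
    costs       : per-constraint penalty; lower values trust an example less
    """
    beta = [0.0] * len(candidates)
    b = 0.0
    for _ in range(epochs):
        for (x, y), c in zip(constraints, costs):
            k = [rbf(xj, x) for xj in candidates]
            f = sum(bj * kj for bj, kj in zip(beta, k)) + b
            # Weight decay on the coefficients: a simplification of the
            # usual SVM regulariser, used here to keep the sketch short.
            beta = [bj * (1.0 - lr * lam) for bj in beta]
            if y * f < 1.0:  # margin violated: push f(x) towards label y
                beta = [bj + lr * c * y * kj for bj, kj in zip(beta, k)]
                b += lr * c * y
    return beta, b

def predict(candidates, model, x):
    """Return +1 or -1 from the sign of the kernel expansion."""
    beta, b = model
    f = sum(bj * rbf(xj, x) for bj, xj in zip(beta, candidates)) + b
    return 1 if f >= 0 else -1

# Toy data (hypothetical): a tiny training set plus a larger auxiliary set.
train_set = [((0.0, 0.0), -1), ((1.0, 1.0), 1)]
aux_set = [((0.2, 0.1), -1), ((0.9, 0.8), 1), ((0.1, 0.3), -1), ((0.8, 1.1), 1)]
train_pts = [x for x, _ in train_set]
aux_pts = [x for x, _ in aux_set]
aux_cost = 0.3  # assumed lower trust in auxiliary examples

# Auxiliary data used only as constraints (role a).
m_constraints = train(train_pts, train_set + aux_set, [1.0] * 2 + [aux_cost] * 4)
# Auxiliary data used only as candidate support vectors (role b).
m_candidates = train(train_pts + aux_pts, train_set, [1.0] * 2)
# Auxiliary data used in both roles.
m_both = train(train_pts + aux_pts, train_set + aux_set, [1.0] * 2 + [aux_cost] * 4)
```

Each model is queried with the same candidate list it was trained on, for example `predict(train_pts + aux_pts, m_both, (0.1, 0.1))`. Keeping the candidate set and the constraint set as separate arguments is what lets the auxiliary data enter in one role, the other, or both.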