A Theoretical Characterization of Linear SVM-Based Feature Selection
Douglas Hardin - Vanderbilt University
Ioannis Tsamardinos - Vanderbilt University
Constantin Aliferis - Vanderbilt University
Most prevalent techniques in Support Vector Machine (SVM) feature selection are based on the intuition that features whose weights are close to zero are not required for optimal classification. In this paper we show that indeed, in the sample limit, the irrelevant variables (in a theoretical and optimal sense) will be given zero weight by a linear SVM, in both the soft and the hard margin case. However, SVM-based methods also have certain theoretical disadvantages. We present examples where the linear SVM may assign zero weights to strongly relevant variables (i.e., variables required for optimal estimation of the distribution of the target variable) and where weakly relevant features (i.e., features that are superfluous for optimal classification given the other features) may receive non-zero weights. We contrast and theoretically compare these results with Markov Blanket-based feature selection algorithms, which do not have such disadvantages in a broad class of distributions and can also be used for causal discovery.
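The central claim above (irrelevant variables receive zero weight in the sample limit) can be illustrated empirically. The sketch below is not the paper's construction; it is a hypothetical example on synthetic data, training a soft-margin linear SVM by subgradient descent on the regularized hinge loss, where feature 0 carries the class signal and feature 1 is independent noise. The weight on the irrelevant feature should be driven toward zero.

```python
import random

random.seed(0)

# Synthetic data: feature 0 carries the class signal, feature 1 is
# independent noise -- an "irrelevant" variable in the sense of the abstract.
n = 1000
X, y = [], []
for _ in range(n):
    label = random.choice([-1, 1])
    X.append([2.0 * label + random.gauss(0.0, 0.5),  # relevant feature
              random.gauss(0.0, 1.0)])               # irrelevant feature
    y.append(label)

# Soft-margin linear SVM via full-batch subgradient descent on
#   lam * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i * (w . x_i + b))
w, b = [0.0, 0.0], 0.0
lam, eta = 0.05, 0.1
for _ in range(500):
    gw, gb = [2.0 * lam * w[0], 2.0 * lam * w[1]], 0.0
    for xi, yi in zip(X, y):
        if yi * (w[0] * xi[0] + w[1] * xi[1] + b) < 1.0:  # margin violator
            gw[0] -= yi * xi[0] / n
            gw[1] -= yi * xi[1] / n
            gb -= yi / n
    w = [w[0] - eta * gw[0], w[1] - eta * gw[1]]
    b -= eta * gb

# With enough samples, |w[1]| is small relative to |w[0]|.
print("relevant weight:", w[0], "irrelevant weight:", w[1])
```

Note that this only illustrates the finite-sample behavior on a favorable distribution; the abstract's counterexamples concern distributions where the weight pattern is misleading despite this limiting behavior.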