We consider supervised learning in the presence of very many
irrelevant features, and study two different regularization methods
for preventing overfitting. Focusing on logistic regression, we show
that using L_1 regularization of the parameters, the sample complexity
(i.e., the number of training examples required to learn "well")
grows only logarithmically in the number of irrelevant features.
This logarithmic rate matches the best known bounds for feature
selection, and indicates that L_1 regularized logistic regression
can be effective even if there are exponentially more irrelevant
features than there are training examples. We also give a lower bound
showing that any rotationally invariant algorithm---including logistic
regression with L_2 regularization, SVMs, and neural networks trained
by backpropagation---has a worst case sample complexity that grows at
least linearly in the number of irrelevant features.
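The behavior claimed above can be illustrated with a toy sketch (a hypothetical pure-Python implementation, not the paper's experimental setup): L_1 regularized logistic regression trained by proximal gradient descent on data where only the first of twenty features is relevant. The soft-thresholding step, the proximal operator of the L_1 penalty, drives the weights on irrelevant features to exactly zero.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_l1_logreg(X, y, lam=0.1, lr=0.1, epochs=300):
    """L1-regularized logistic regression via proximal gradient
    descent (ISTA). Illustrative sketch only."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        # average gradient of the logistic loss
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n
        # gradient step, then soft-thresholding (prox of lam * ||w||_1)
        for j in range(d):
            wj = w[j] - lr * grad[j]
            w[j] = math.copysign(max(abs(wj) - lr * lam, 0.0), wj)
    return w

# Toy data: label depends only on feature 0; the other 19 are noise.
random.seed(0)
n, d = 200, 20
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [1 if xi[0] > 0 else 0 for xi in X]
w = train_l1_logreg(X, y)
```

On this data the learned weight vector puts nearly all of its mass on feature 0, while the irrelevant coordinates are thresholded to (or near) zero, which is the sparsity-inducing effect that underlies the logarithmic sample-complexity result.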