Abstract

A Pitfall and Solution in Multi-Class Feature Selection for Text Classification
George Forman - Hewlett-Packard Labs
Information Gain is a well-known and empirically proven method for high-dimensional feature selection. We found that it and other existing methods failed to produce good results on an industrial text classification problem. On investigating the root cause, we find that a large class of feature scoring methods suffers a pitfall: they can be blinded by a surplus of strongly predictive features for some classes, while largely ignoring features needed to discriminate difficult classes. In this paper we demonstrate this pitfall hurts performance even for a relatively uniform text classification task. Based on this understanding, we present solutions inspired by round-robin scheduling that avoid this pitfall, without resorting to costly wrapper methods. Empirical evaluation on 19 datasets shows substantial improvements.
