Probing Classifiers are Unreliable for Concept Removal and Detection
Abhinav Kumar · Chenhao Tan · Amit Sharma
Event URL: https://openreview.net/forum?id=MozmMHehWW8

Neural network models trained on text data have been found to encode undesired linguistic or sensitive attributes in their representations. Removing such attributes is non-trivial because of the complex relationship among the attribute, the text input, and the learned representation. Recent work has proposed post-hoc and adversarial methods to remove such unwanted attributes from a model's representation. Through an extensive theoretical and empirical analysis, we show that these methods can be counterproductive: they are unable to remove the attributes entirely, and in the worst case may end up destroying all task-relevant features. The reason is the methods' reliance on a probing classifier as a proxy for the attribute, which we prove is difficult to train correctly in the presence of spurious correlations.
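To make the failure mode concrete, here is a minimal, self-contained sketch (not the authors' code) of post-hoc removal via a linear probe and a single null-space projection, in the spirit of null-space projection methods. The synthetic data, dimensions, and correlation strength are illustrative assumptions: when the attribute is spuriously correlated with the task label, the probe's direction mixes task and attribute features, so projecting it out can both hurt the task and leave the attribute recoverable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 4000, 32

# Hypothetical setup: task label y and sensitive attribute a are
# spuriously correlated (they agree ~90% of the time).
y = rng.integers(0, 2, n)
flip = rng.random(n) < 0.1
a = np.where(flip, 1 - y, y)

# Representations encode the two signals along different directions.
w_task = np.zeros(d); w_task[0] = 1.0
w_attr = np.zeros(d); w_attr[1] = 1.0
X = rng.normal(size=(n, d))
X += 2.0 * (2 * y[:, None] - 1) * w_task
X += 2.0 * (2 * a[:, None] - 1) * w_attr

# Probing classifier: a linear probe trained to predict the attribute.
probe = LogisticRegression(max_iter=1000).fit(X, a)
w = probe.coef_ / np.linalg.norm(probe.coef_)

# Post-hoc removal: project representations onto the probe's null space.
X_clean = X - (X @ w.T) @ w

# Because of the spurious correlation, the probe's direction mixes task
# and attribute features; removal hurts the task, yet a retrained probe
# can still detect the attribute (accuracies on training data, for
# illustration only).
task_before = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)
task_after = LogisticRegression(max_iter=1000).fit(X_clean, y).score(X_clean, y)
attr_after = LogisticRegression(max_iter=1000).fit(X_clean, a).score(X_clean, a)
print(f"task acc before={task_before:.2f} after={task_after:.2f}, "
      f"attribute still probeable: acc={attr_after:.2f}")
```

Iterating the projection, as some post-hoc methods do, does not escape the problem in this sketch: each probe keeps picking up the correlated task direction, which is how all task-relevant features can eventually be destroyed.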

Author Information

Abhinav Kumar (Microsoft Research)
Chenhao Tan (University of Chicago)
Amit Sharma (Microsoft Research)
