

Poster in Workshop: Spurious correlations, Invariance, and Stability (SCIS)

Probing Classifiers are Unreliable for Concept Removal and Detection

Abhinav Kumar · Chenhao Tan · Amit Sharma

Keywords: [ Fairness ] [ Spurious Correlation ] [ Adversarial Removal ] [ Null-Space Removal ] [ Probing ]


Abstract:

Neural network models trained on text data have been found to encode undesired linguistic or sensitive attributes in their representations. Removing such attributes is non-trivial because of the complex relationship between the attribute, the text input, and the learnt representation. Recent work has proposed post-hoc and adversarial methods to remove such unwanted attributes from a model's representation. Through an extensive theoretical and empirical analysis, we show that these methods can be counter-productive: they are unable to remove the attributes entirely, and in the worst case may end up destroying all task-relevant features. The reason is the methods' reliance on a probing classifier as a proxy for the attribute, which we prove is difficult to train correctly in the presence of spurious correlations.
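As a rough illustration of the second failure mode named in the abstract (loss of task-relevant features), the following sketch builds a toy two-dimensional "representation" in which a task label and a concept label are spuriously correlated. It is not the authors' method or experimental setup. The synthetic data, the mean-difference direction used as a stand-in for a trained linear probe, and every name and parameter (`make_data`, `rho`, `concept_strength`, `project_out`) are assumptions made here for illustration only.

```python
import math
import random

def make_data(n=2000, rho=0.9, concept_strength=0.5, noise=0.5, seed=0):
    """Toy data: z[0] encodes the task label y, z[1] weakly encodes the concept c.
    The concept agrees with the task label with probability rho (spurious correlation)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.choice([-1, 1])
        c = y if rng.random() < rho else -y
        z = [y + rng.gauss(0, noise), concept_strength * c + rng.gauss(0, noise)]
        rows.append((z, y, c))
    return rows

def mean_diff(rows, label_index):
    """Difference between the class means of z, for the label at row[label_index]
    (1 = task label y, 2 = concept label c)."""
    pos = [r[0] for r in rows if r[label_index] > 0]
    neg = [r[0] for r in rows if r[label_index] < 0]
    return [sum(z[i] for z in pos) / len(pos) - sum(z[i] for z in neg) / len(neg)
            for i in range(2)]

def project_out(z, w):
    """Remove the component of z along direction w (a one-step null-space projection)."""
    scale = sum(a * b for a, b in zip(z, w)) / sum(b * b for b in w)
    return [a - scale * b for a, b in zip(z, w)]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

rows = make_data()

# Stand-in "probe" for the concept: the mean-difference direction (an assumption,
# simpler than a trained classifier). Because c correlates with y, this direction
# picks up a large component on the task feature z[0].
probe = mean_diff(rows, 2)

task_gap_before = norm(mean_diff(rows, 1))
cleaned = [(project_out(z, probe), y, c) for z, y, c in rows]
task_gap_after = norm(mean_diff(cleaned, 1))
concept_gap_after = norm(mean_diff(cleaned, 2))

print("probe direction:", probe)
print("task class-mean gap before/after removal:", task_gap_before, task_gap_after)
print("concept class-mean gap after removal:", concept_gap_after)
```

In this toy setting the probe direction leans heavily on the task feature, so projecting it out also collapses most of the separation between the task classes. Whether and how strongly this happens with real models and trained probes is what the paper analyses; the sketch only shows the mechanism under the stated assumptions.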
