Poster
Linear Adversarial Concept Erasure
Shaul Ravfogel · Michael Twiton · Yoav Goldberg · Ryan Cotterell
Hall E #538
Keywords: [ SA: Fairness, Equity, Justice and Safety ] [ SA: Trustworthy Machine Learning ] [ DL: Other Representation Learning ] [ MISC: Representation Learning ] [ SA: Privacy-preserving Statistics and Machine Learning ] [ APP: Language, Speech and Dialog ] [ MISC: General Machine Learning Techniques ] [ Miscellaneous Aspects of Machine Learning ]
Modern neural models trained on textual data rely on pre-trained representations that emerge without direct supervision. As these representations are increasingly used in real-world applications, the inability to control their content becomes a pressing problem. In this work, we study the problem of identifying a linear subspace that corresponds to a given concept and removing it from the representation. We cast this problem as a constrained, linear minimax game and show that existing solutions are generally not optimal for this task. We derive a closed-form solution for certain objectives and propose a convex relaxation that works well for others. When evaluated on binary gender removal, the method recovers a low-dimensional subspace whose removal mitigates bias under both intrinsic and extrinsic evaluation. Surprisingly, we show that the method, despite being linear, is highly expressive, effectively mitigating bias in the output layers of deep, nonlinear classifiers while maintaining tractability and interpretability.
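Below is a minimal, illustrative sketch of the adversarial formulation described in the abstract: a rank-k orthogonal projection is trained, by alternating gradient steps, against a linear adversary that tries to recover a binary concept from the projected representations. This is not the authors' implementation; the synthetic data, dimensions, learning rates, and alternating schedule are all assumptions made for the example.

```python
# Sketch: linear adversarial concept erasure as a minimax game.
# We search for a rank-k orthogonal projection P = I - W W^T that removes a
# linear subspace from the representations, while a linear adversary tries to
# predict a binary concept z from the projected representations X @ P.
# All hyperparameters and the synthetic data below are illustrative assumptions.
import torch

torch.manual_seed(0)
d, n, k = 32, 2000, 1            # representation dim, samples, rank of erased subspace

# Synthetic data: the binary concept z is encoded along one fixed direction.
concept_dir = torch.randn(d)
concept_dir /= concept_dir.norm()
z = torch.randint(0, 2, (n,)).float()
X = torch.randn(n, d) + 2.0 * (z.unsqueeze(1) - 0.5) * concept_dir

W = torch.randn(d, k, requires_grad=True)   # spans the subspace to be erased
adv = torch.zeros(d, requires_grad=True)    # linear (logistic) adversary
opt_w = torch.optim.SGD([W], lr=0.1)
opt_a = torch.optim.SGD([adv], lr=0.1)
bce = torch.nn.BCEWithLogitsLoss()

for step in range(2000):
    # Adversary step: with P fixed, make the concept as predictable as possible.
    Q, _ = torch.linalg.qr(W.detach())      # orthonormal basis of the erased subspace
    P = torch.eye(d) - Q @ Q.T              # orthogonal projection onto its complement
    opt_a.zero_grad()
    loss_a = bce((X @ P) @ adv, z)
    loss_a.backward()
    opt_a.step()

    # Projection step: with the adversary fixed, make the concept unpredictable.
    opt_w.zero_grad()
    Q, _ = torch.linalg.qr(W)               # differentiable orthonormalization
    P = torch.eye(d) - Q @ Q.T
    loss_w = -bce((X @ P) @ adv.detach(), z)
    loss_w.backward()
    opt_w.step()

# After training, a freshly trained linear probe on X @ P should be close to
# chance at predicting z, while the remaining variance in X is preserved
# because P only removes a k-dimensional subspace.
```

Constraining the minimizing player to an orthogonal projection is what keeps the game meaningful in this sketch: the projection can only null out a low-dimensional subspace, so it cannot "cheat" by rescaling or flipping the adversary's predictions, and the adversary is forced to hunt for any remaining linear trace of the concept.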