Balanced Fine-Tuning Is Not Enough: Latent Bias Mitigation in Speech Foundation Models
Abstract
Self-supervised speech foundation models have substantially improved speech recognition and downstream speech classification. However, their behavior can remain uneven across demographic groups even when the downstream fine-tuning data is balanced. We study this issue using Wav2Vec~2.0 and S2T models fine-tuned on balanced splits of Fair-speech and FLEURS. Across gender, ethnicity, and economic-class prediction tasks, we find that precision and recall can diverge sharply across groups, indicating that class balance alone does not remove latent bias inherited from pre-training or introduced during adaptation. To mitigate this problem, we propose a three-phase framework that (1) discovers proxy bias groups without requiring pre-specified protected attributes, (2) learns bias-correlated representations from those proxy labels, and (3) reduces their influence through feature fusion and reweighted supervision. Experiments show strong improvements for gender classification, reducing precision--recall gaps to nearly zero across both datasets and both model families. Economic-class and ethnicity classification show only partial parity improvements, revealing remaining challenges for complex multi-class demographic attributes.
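To make the notion of a precision--recall gap concrete, the following minimal sketch computes per-class precision, recall, and their absolute difference on a small class-balanced toy example. The function name `precision_recall_gaps`, the gap definition |precision - recall|, and the labels are illustrative assumptions, not the paper's evaluation code or data.

```python
from collections import Counter


def precision_recall_gaps(y_true, y_pred):
    """Return {label: (precision, recall, |precision - recall|)} per class.

    Assumption: the gap is taken as the absolute difference between a
    class's precision and recall; the paper's exact metric may differ.
    """
    true_count = Counter(y_true)   # number of reference items per class
    pred_count = Counter(y_pred)   # number of predictions per class
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # true positives

    result = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / true_count[label] if true_count[label] else 0.0
        result[label] = (precision, recall, abs(precision - recall))
    return result


# Toy labels (made up): the reference set is balanced, 4 items per class,
# but the classifier over-predicts class "f".
y_true = ["f", "f", "f", "f", "m", "m", "m", "m"]
y_pred = ["f", "f", "f", "f", "f", "f", "m", "m"]

gaps = precision_recall_gaps(y_true, y_pred)
for label, (precision, recall, gap) in gaps.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} gap={gap:.2f}")
```

In this toy case both classes are equally represented, yet class "f" has high recall with lower precision and class "m" shows the reverse, which is the kind of asymmetry that class balance alone does not rule out.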