Mitigating Algorithmic Bias in Toxicity Classification for Muslim Identity Contexts: A Study on FPR Parity
Umar Hasan
Abstract
As automated moderation systems become ubiquitous, ensuring that they operate fairly across diverse religious identities is critical. This study investigates unintended bias in toxicity classification models, focusing on Muslim identity terms. Using the Jigsaw Unintended Bias dataset, we identify a 10x disparity in false-positive rates (FPR) for Muslim-related content in a baseline model. We propose a sample-weighting mitigation strategy that achieves FPR parity, reducing the identity-specific error rate from 0.0273 to 0.0015. Furthermore, feature importance analysis indicates that the intervention decouples religious identifiers from toxic associations.
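To make the two quantities in the abstract concrete, the following is a minimal sketch of how a subgroup FPR gap can be measured and how per-example training weights might be assigned. It is an illustration under stated assumptions, not the study's implementation: the function names, the toy data, and the `boost` factor are hypothetical, and the weighting rule (upweighting non-toxic comments that mention the identity) is one common choice that the abstract does not confirm.

```python
from typing import List, Sequence


def false_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """FPR = FP / (FP + TN), computed over the non-toxic (label 0) examples."""
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not negatives:
        return float("nan")  # undefined when there are no non-toxic examples
    return sum(negatives) / len(negatives)


def subgroup_fpr(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    in_group: Sequence[bool],
    group_value: bool = True,
) -> float:
    """FPR restricted to examples whose identity flag equals group_value."""
    kept = [(t, p) for t, p, g in zip(y_true, y_pred, in_group) if g == group_value]
    return false_positive_rate([t for t, _ in kept], [p for _, p in kept])


def identity_sample_weights(
    y_true: Sequence[int],
    in_group: Sequence[bool],
    boost: float = 3.0,  # hypothetical value, not taken from the study
) -> List[float]:
    """Assumed scheme: upweight non-toxic examples that mention the identity."""
    return [boost if (g and t == 0) else 1.0 for t, g in zip(y_true, in_group)]


# Toy data (made up for illustration): 1 = toxic, 0 = non-toxic.
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
in_group = [True, True, True, True, True, False, False, False, False, False]

group_fpr = subgroup_fpr(y_true, y_pred, in_group, True)
background_fpr = subgroup_fpr(y_true, y_pred, in_group, False)
disparity = group_fpr / background_fpr  # ratio of subgroup FPR to background FPR
weights = identity_sample_weights(y_true, in_group)
```

The resulting `weights` list could then be passed to a training routine that accepts per-example weights, after which the subgroup and background FPR would be recomputed on held-out data to check how far the gap has closed.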