Harmful content detection models tend to have higher false positive rates for content from marginalized groups. Such disproportionate penalization risks reduced visibility, where marginalized communities lose the opportunity to voice their opinions online. Current approaches to algorithmic harm mitigation are often ad hoc and subject to human bias. We make two main contributions in this paper. First, we design a novel methodology that provides a principled approach to detecting and measuring the severity of potential harms associated with a text-based model. Second, we apply our methodology to audit Twitter’s English marginal abuse model. Without utilizing demographic labels or dialect classifiers, which pose substantial privacy and ethical concerns, we are still able to detect and measure the severity of issues related to the over-penalization of the speech of marginalized communities, such as the use of reclaimed speech, counterspeech, and identity-related terms. To mitigate the associated harms, we experiment with adding additional true negative examples to the training data. We find that doing so yields improvements on our fairness metrics without large degradations in model performance. Lastly, we discuss practical challenges to marginal abuse modeling on social media.