When the Model Shouldn't Cite: Calibrated Abstention for Islamic Citation Hallucination
Ahmad A. Rushdi ⋅ Mansur Ali Khan ⋅ Mehmet Efe Akengin
Abstract
Large language models sometimes misquote the Qur'anic verses and Hadiths they cite, a high-stakes failure in which confidently fabricating sacred text is worse than ordinary hallucination. The 2025 IslamicEval shared task benchmarks this failure, but every system must commit to a label on every span; none abstains. We add a plug-in abstention layer: we combine publicly released LLM outputs with cross-family runs of our own ($K{=}7$ total), score each span by majority disagreement, and abstain when disagreement is high. The threshold is chosen on a $60/40$ calibration split via a union-bound-adjusted Clopper-Pearson upper bound, giving a bound on selective risk that holds with high probability over the calibration sample. On the Subtask 1C dev set, the selective-accuracy curve exceeds the leaderboard reference at $14\%$ abstention and reaches $94\%$ conditional accuracy at $44\%$ abstention; risk-controlled operating points certify the $44\%$ region. Public test labels are unavailable, so leaderboard scores serve as reference lines, not head-to-head comparisons. The layer is plug-and-play, requires no retraining, and is reusable across tasks.
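The abstention rule described above can be sketched in a few lines. This is a minimal illustration of one plausible reading of the abstract, not the authors' implementation. The function names, the label strings, the parameters `target_risk` and `delta`, and the choice to spread the union bound evenly over all candidate thresholds are assumptions made for the example. Disagreement is taken to be the fraction of the $K$ votes that differ from the majority label, and the Clopper-Pearson upper bound is computed by bisection on the exact binomial CDF.

```python
import math
from collections import Counter


def disagreement(votes):
    """Return (majority_label, fraction of votes that differ from it).

    Ties are broken by first occurrence; with K = 7 voters and two labels
    no tie can occur.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label, 1.0 - count / len(votes)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(errors, n, alpha):
    """One-sided (1 - alpha) upper confidence bound on a binomial error rate."""
    if n == 0 or errors >= n:
        return 1.0
    lo, hi = errors / n, 1.0
    for _ in range(60):  # bisection: find p with binom_cdf(errors, n, p) = alpha
        mid = (lo + hi) / 2
        if binom_cdf(errors, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


def choose_threshold(cal_votes, cal_labels, target_risk, delta=0.1):
    """Largest disagreement threshold whose certified selective risk is
    at most target_risk on the calibration split, or None if none passes.

    Assumption: the union bound divides delta evenly over the candidates.
    """
    scored = [disagreement(v) + (y,) for v, y in zip(cal_votes, cal_labels)]
    candidates = sorted({score for _, score, _ in scored})
    alpha = delta / len(candidates)
    best = None
    for t in candidates:
        kept = [(pred, y) for pred, score, y in scored if score <= t]
        errors = sum(pred != y for pred, y in kept)
        if clopper_pearson_upper(errors, len(kept), alpha) <= target_risk:
            best = t  # a larger passing threshold means less abstention
    return best
```

At inference time, a span would be labeled with the majority vote when its disagreement is at or below the chosen threshold and left unlabeled otherwise. Choosing the largest passing threshold keeps abstention as low as the certified risk allows.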