Do LLMs Distinguish Between Halal and Haram? Benchmarking Islamic Cultural Alignment in General vs. Arabic-Centric SLMs
Abstract
Large Language Models (LLMs) frequently exhibit cultural misalignment in non-Western contexts, often failing to grasp theological and societal nuances inherent to the Arab world. This study introduces \textbf{ACE-Adapt}, a unified evaluation framework designed to assess the cultural fidelity of parameter-efficient Small Language Models (SLMs) under 10 billion parameters. Leveraging the PalmX-IC benchmark, eight distinct architectures stratified into general purpose and Arabic-centric categories are rigorously evaluated on tasks covering Islamic rituals, jurisprudence, and history. By transforming static multiple-choice queries into strict conversational constraints and applying Quantized Low-Rank Adaptation (QLoRA), a significant performance dichotomy is demonstrated. Empirical results reveal that Arabic-centric models consistently outperform their general-purpose counterparts, regardless of parameter scale. Notably, the Fanar-1 9B model achieves state-of-the-art accuracy of 79.60\%, while the 3B-parameter NileChat surpasses the larger Llama-3.1 8B baseline. These findings challenge prevailing scaling laws in cultural domains, demonstrating that domain-aligned pre-training priors are fundamentally more critical than model size for resolving semantic ambiguities in Islamic culture.