Can LLMs Compute Zakat? A Symbolic Benchmark for Cross-Lingual Islamic Finance
Abstract
We introduce a verifiable, cross-lingual symbolic benchmark for evaluating large language models (LLMs) on rule-bound Islamic finance reasoning. The benchmark comprises 129 expert-validated templates grounded in formally specified AAOIFI Shariah rules. The templates cover six operation categories: zakat (50), Islamic inheritance, or faraid (31), sukuk and ETB pricing (21), ijara leasing (16), istisna contracts (9), and murabaha financing (2). Stratified parameter sampling turns the templates into 6,450 English instances (50 per template) and 38,700 instances in total across six languages: English, Arabic, Bahasa Indonesia, Urdu, Hindi, and Kazakh. Each instance ships with an executable verifier, which enables exact step-level scoring and, because instances are sampled from templates rather than fixed in advance, mitigates the contamination concerns endemic to static benchmarks. Evaluating eight LLMs zero-shot, we report three findings. First, math-specialised 7B models (MetaMath-7B: 62.4% FAC; WizardMath-7B: 61.2% FAC) substantially outperform much larger frontier and finance-tuned models on Shariah calculation. Second, boolean predicate evaluation collapses below the 50% chance baseline (mean 23.4% FAC). Third, domain-specific errors dominate the failure profile, chiefly Hijri/Gregorian calendar conflation, nisab threshold confusion, and mis-assigned heir shares. The benchmark provides the first reproducible measurement of formal Shariah reasoning in multilingual LLMs.
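To make the template, sampling, and verifier pipeline concrete, the following is a minimal illustrative sketch of what a single zakat template could look like. It is not the benchmark's actual code: the names `ZakatInstance`, `sample_instance`, `reference_answer`, `verify`, and `build_instances` are hypothetical, the two strata are invented for illustration, and the rule used (2.5% of wealth when wealth meets or exceeds nisab, otherwise zero) is a simplified assumption rather than the formally specified AAOIFI rule set.

```python
import random
from dataclasses import dataclass
from fractions import Fraction

# Assumption for this sketch: the simplified lunar-year zakat rate of 2.5%.
ZAKAT_RATE = Fraction(25, 1000)


@dataclass(frozen=True)
class ZakatInstance:
    wealth: int    # zakatable wealth, in currency units
    nisab: int     # nisab threshold, in the same currency units
    question: str  # natural-language prompt rendered from the template


def sample_instance(rng: random.Random, stratum: str) -> ZakatInstance:
    """Draw template parameters from one stratum (hypothetical strata)."""
    nisab = rng.randrange(4000, 8001, 100)
    if stratum == "below_nisab":
        wealth = rng.randrange(100, nisab, 100)
    else:  # "at_or_above_nisab"
        wealth = rng.randrange(nisab, 10 * nisab, 100)
    question = (
        f"A person has held {wealth} units of zakatable wealth for one lunar "
        f"year. The nisab is {nisab} units. How much zakat is due?"
    )
    return ZakatInstance(wealth, nisab, question)


def reference_answer(inst: ZakatInstance) -> Fraction:
    """Symbolic ground truth, computed exactly with rational arithmetic."""
    if inst.wealth < inst.nisab:
        return Fraction(0)
    return inst.wealth * ZAKAT_RATE


def verify(inst: ZakatInstance, model_answer: str) -> bool:
    """Executable verifier: exact match against the reference answer."""
    try:
        return Fraction(model_answer.strip()) == reference_answer(inst)
    except (ValueError, ZeroDivisionError):
        return False  # unparseable answers count as incorrect


def build_instances(seed: int, per_stratum: int) -> list[ZakatInstance]:
    """Stratified sampling: the same number of instances from each stratum."""
    rng = random.Random(seed)
    strata = ("below_nisab", "at_or_above_nisab")
    return [sample_instance(rng, s) for s in strata for _ in range(per_stratum)]


if __name__ == "__main__":
    instances = build_instances(seed=0, per_stratum=25)  # 50 per template
    print(len(instances), instances[0].question)
    # Instance counts reported in the abstract:
    print(129 * 50, 129 * 50 * 6)
```

Because the ground truth is recomputed from the sampled parameters, a fresh seed yields new instances with known answers, which is the property that lets a templated benchmark sidestep memorised test items. Exact rational arithmetic is used here so that the verifier does not depend on floating-point tolerance.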