Culturally Respectful Is Not Enough: Auditing LLM Safety in Diabetes Advice During Ramadan
Abstract
Large language models are increasingly consulted for health information, yet their safety is rarely evaluated in culturally situated medical contexts where a user's religious practice changes the relevant risks, constraints, and answer style. We study Ramadan fasting among Muslims with diabetes, a setting in which safe advice must jointly handle hypoglycemia and dehydration risk, medication adjustment, religious significance, and individualized clinical judgment. We introduce RamadanSafeQA, a preliminary audit benchmark of 68 synthetic vignettes spanning five Ramadan-diabetes categories and IDF-DAR-style risk levels. We generate 816 responses from four LLMs (GPT-4o, Claude Sonnet 4.6, Jais 2 8B, and MedGemma 27B) under vanilla, safety-checklist, and guideline-grounded prompts, and manually score a shuffled subset of 530 responses with a four-item safety rubric. Cultural respect, clinician referral, and autonomy preservation are near ceiling across models, while medical safety varies sharply: fully safe rates range from 0% for Jais 2 8B to 81% for Claude Sonnet 4.6 with checklist prompting. The failures are usually medical omissions or incompleteness, not bare refusal or overt religious disrespect. Guideline-grounded prompting improves three of the four models but does not help Jais in this English-language audit; its dominant failure mode is substituting supportive interpersonal scripts for clinical content. Our results suggest that culturally aware medical safety evaluation must measure both cultural and clinical axes, because culturally sensitive language can mask missing clinical substance.