Answer, Ask, or Escalate? Evaluating Public-Service Triage in Global South Languages
Abstract
Public-service assistants are often judged as if success means answering correctly. In practice, that is not enough. A safe system should sometimes answer, sometimes ask for one missing fact, and sometimes escalate the user to an official channel. We introduce a document-grounded triage benchmark with 90 reviewed items across Sinhala, Tamil, and Swahili, covering civil-registration procedures in Sri Lanka and Tanzania through 15 language-specific service cards. Models must choose Answer, Ask, or Escalate under fixed official evidence. Answer-only evaluation substantially overstates readiness: Qwen 2.5 3B achieves an answer-only score of 1.00 but a triage-safe score of only 0.344, with a harmful over-answer rate of 0.917. A smaller Qwen model fails in the opposite direction and asks on nearly all items. Claude Sonnet 4.6 reaches a triage-safe score of 0.844 with much lower harmful over-answering, which suggests that the task is solvable but depends on model tier. A 45-row reviewed human-evaluation slice is consistent with the automatic ranking. These results suggest that public-service evaluation should measure whether models know when to answer, ask, or escalate, not only whether they answer routine cases correctly. A public reproducibility package is available at https://github.com/globalsouthml/publicservicetriagepublicrepo
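To make the three reported quantities concrete, the following minimal sketch shows one plausible way to compute them from gold and predicted actions. The definitions are assumptions, not taken from the benchmark: the answer-only score is read as the fraction of gold-Answer items on which the model answers, the triage-safe score as the fraction of all items on which the predicted action matches the gold action, and the harmful over-answer rate as the fraction of gold-Ask or gold-Escalate items on which the model answers anyway. The function name `triage_metrics` and the toy labels are hypothetical, and the benchmark's actual scoring may additionally check answer content.

```python
from typing import Dict, Sequence

ACTIONS = ("answer", "ask", "escalate")


def triage_metrics(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Compute three illustrative triage metrics (assumed definitions)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    if any(a not in ACTIONS for a in list(gold) + list(pred)):
        raise ValueError(f"actions must be one of {ACTIONS}")

    pairs = list(zip(gold, pred))
    answer_items = [(g, p) for g, p in pairs if g == "answer"]
    other_items = [(g, p) for g, p in pairs if g != "answer"]

    def rate(hits: int, total: int) -> float:
        # Return NaN when a subset is empty so the gap is visible, not hidden.
        return hits / total if total else float("nan")

    return {
        # Assumed: share of routine (gold-Answer) items the model answers.
        "answer_only": rate(sum(p == "answer" for _, p in answer_items),
                            len(answer_items)),
        # Assumed: share of all items where the chosen action matches gold.
        "triage_safe": rate(sum(g == p for g, p in pairs), len(pairs)),
        # Assumed: share of Ask/Escalate items the model answers anyway.
        "harmful_over_answer": rate(sum(p == "answer" for _, p in other_items),
                                    len(other_items)),
    }


# Hypothetical toy data: a model that answers almost everything.
toy_gold = ["answer", "answer", "ask", "ask", "escalate", "escalate"]
toy_pred = ["answer", "answer", "answer", "answer", "answer", "escalate"]
toy = triage_metrics(toy_gold, toy_pred)
```

On this toy input the model is perfect on routine items yet matches the gold action on only half of all items, which is the same qualitative gap between answer-only and triage-safe evaluation described above.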