Resource-Aware Prompting for Sinhala-Tamil Machine Translation: Methodological Insights for Compute-Limited Settings
Imanthi Okadini ⋅ Randil Pushpananda ⋅ Chamila Liyanage ⋅ Ruvan Weerasinghe
Abstract
Sinhala and Tamil, the national languages of Sri Lanka, form a low-resource but linguistically related pair characterized by high morphological richness and a shared Subject-Object-Verb (SOV) word order. While traditional Neural Machine Translation (NMT) struggles with data sparsity for this pair, this paper provides a targeted empirical analysis of advanced prompting strategies under realistic resource constraints. We evaluate zero-shot prompting, retrieval-based few-shot prompting, Multi-Agent Debate (MAD), and Decomposed Prompting (DecoMT) across models ranging from 8B to 120B parameters, using a core subset purposefully selected to stress-test these strategies on morphologically complex input. Our findings reveal a notable “reliability-capability gap” in smaller models ($\le 20$B parameters), which frequently fail to follow instructions despite possessing latent linguistic knowledge. We observe consistent trends suggesting that providing lexically similar anchors via BM25+CTQScorer retrieval nearly doubles the Llama 3.3 70B model’s performance in the Tamil-to-Sinhala direction. Furthermore, the MAD framework effectively mitigates cross-lingual leakage and reasoning failures, while DecoMT achieves stable character-level consistency. Our analysis highlights that while advanced prompting can rectify specific failures, the “Curse of Multilinguality” remains a primary bottleneck. This work prioritizes methodological insights over large-scale benchmarking, reflecting the realistic constraints faced by researchers in low-resource settings.
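The retrieval-based few-shot setup mentioned above can be illustrated with a minimal sketch. It covers only the BM25 stage, using the standard Okapi BM25 formula with a smoothed IDF; the CTQScorer re-ranking step is not reproduced. The whitespace tokenization, the parameter defaults (`k1`, `b`), the function names, the prompt template, and the placeholder sentence pool are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def bm25_rank(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Return corpus indices sorted by descending Okapi BM25 score."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            # Smoothed IDF, always positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)

    # sorted() is stable, so ties keep their original pool order.
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


def build_few_shot_prompt(source, pool, k=2):
    """Build a prompt from the k pool pairs most lexically similar to `source`.

    `pool` is a list of (source_sentence, target_sentence) pairs.
    Whitespace tokenization is an assumption; a morphologically rich
    language pair would likely need subword or morpheme-level tokens.
    """
    corpus = [src.split() for src, _ in pool]
    top = bm25_rank(source.split(), corpus)[:k]
    lines = []
    for i in top:
        src, tgt = pool[i]
        lines.append(f"Source: {src}\nTarget: {tgt}")
    lines.append(f"Source: {source}\nTarget:")
    return "\n\n".join(lines)


# Placeholder pool (English stand-ins, not real Sinhala-Tamil data).
pool = [
    ("the cat sat", "TGT-1"),
    ("a dog ran fast", "TGT-2"),
    ("the cat ran", "TGT-3"),
]
prompt = build_few_shot_prompt("the cat sat down", pool, k=2)
```

Here the two pool entries sharing the most terms with the input are placed before the sentence to be translated, which is the sense in which retrieved examples act as lexically similar anchors.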