Bridging the Visual Gap: Resource-Efficient VLM Adaptation for Meitei Mayek using Synthetic Multi-Modal Mixtures
Agniva Maiti ⋅ Aditya Tiwari ⋅ Dwarikanath Mahapatra ⋅ Sudipta Roy ⋅ Yash Sinha ⋅ Dhruv Kumar ⋅ Murari Mandal
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in text recognition and visual reasoning, yet these advances remain largely confined to high-resource languages, leaving indigenous scripts like Meitei Mayek severely marginalized. We introduce a resource-efficient synthetic bootstrapping methodology for adapting a pre-trained VLM to an entirely unseen, zero-resource orthography. Our pipeline synthesizes a five-part multi-modal mixture combining custom-rendered Meitei OCR data, dynamically translated VQA datasets, and instruction-tuning pairs, validated by native speakers through a Human-in-the-Loop (HiTL) protocol. Applying QLoRA to PaliGemma-3B, our fine-tuned adapter doubles VQA Exact Match accuracy (8.00\%~$\to$~16.00\%), raises ANLS from 8.62\% to 21.62\%, and achieves 76.0\% POPE Accuracy with an 85.36\% F1-Score, against baselines that score zero on Meitei script comprehension. We release the adapter weights, translated datasets, and rendering pipeline as a replicable framework for extending VLMs to other under-represented scripts.
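For readers unfamiliar with the two VQA metrics reported above, the following is a minimal sketch of how Exact Match and ANLS (Average Normalized Levenshtein Similarity) are commonly computed. It is not taken from the released code: the function names, the whitespace and lowercase normalization, and the 0.5 threshold (the conventional ANLS default) are assumptions, and the edit distance here operates on Unicode code points rather than grapheme clusters, which may differ from the authors' treatment of Meitei Mayek text.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, counted in Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def exact_match(prediction: str, references: List[str]) -> float:
    """1.0 if the normalized prediction equals any normalized reference."""
    pred = prediction.strip().lower()
    return float(any(pred == ref.strip().lower() for ref in references))


def anls_score(prediction: str, references: List[str],
               threshold: float = 0.5) -> float:
    """Best similarity over all references; zero once the normalized
    edit distance reaches the threshold (assumed default: 0.5)."""
    pred = prediction.strip().lower()
    best = 0.0
    for ref in references:
        gt = ref.strip().lower()
        longest = max(len(pred), len(gt))
        nl = levenshtein(pred, gt) / longest if longest else 0.0
        best = max(best, 1.0 - nl if nl < threshold else 0.0)
    return best


def evaluate(predictions: List[str],
             references: List[List[str]]) -> Dict[str, float]:
    """Average both metrics over a non-empty set of questions."""
    n = len(predictions)
    em = sum(exact_match(p, r) for p, r in zip(predictions, references)) / n
    anls = sum(anls_score(p, r) for p, r in zip(predictions, references)) / n
    return {"exact_match": em, "anls": anls}
```

Because ANLS gives partial credit to near-miss answers while Exact Match does not, an ANLS above the Exact Match score, as in the figures reported above, indicates that some predictions are close to the reference without matching it character for character.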