Single-Token Compositional Interfaces for Productive Morphology in Global South Languages
Abstract
Standard subword tokenization in multilingual language models tends to fragment productive morphological forms, which are central to many languages of the Global South. This fragmentation increases sequence length and weakens the representation of compositional structure. We propose an alternative interface in which each productive word is treated as a single token whose embedding is computed compositionally from its ordered morphological parts. This design enables a controlled comparison against whole-word lookup and subword tokenization under matched transformer backbones and training data. Across a staged evaluation protocol, compositional single-token representations improve performance on morphology-sensitive tasks in most of the settings we test. On Turkish, the proposed interface outperforms both whole-word lookup and a pretrained multilingual WordPiece baseline in static reconstruction, generalization to unseen valid combinations, and next-token prediction. A multilingual extension shows clear gains for Hindi, consistent improvements for German, smaller gains for Finnish, and mixed results for Telugu. These results indicate that subword tokenization is not uniformly optimal and that compositional token interfaces offer a targeted alternative for morphologically rich languages of the Global South.
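
As a rough illustration of the proposed interface, the sketch below builds one embedding per word from its ordered morphological parts. The abstract does not specify the composition function, so the slot-gated sum used here is an assumption, and the names CompositionalEmbedder, part_table, and slot_table are hypothetical. It is a stand-in for the idea, not the method evaluated in this work.

```python
import numpy as np


class CompositionalEmbedder:
    """Hypothetical sketch: one vector per word, composed from ordered morphemes."""

    def __init__(self, morphemes, dim=8, max_parts=6, seed=0):
        rng = np.random.default_rng(seed)
        self.index = {m: i for i, m in enumerate(morphemes)}
        # One row per morpheme (stems and affixes share a table in this sketch).
        self.part_table = rng.normal(scale=0.1, size=(len(morphemes), dim))
        # One row per position slot, used as a multiplicative gate.
        self.slot_table = rng.normal(scale=0.1, size=(max_parts, dim))
        self.max_parts = max_parts

    def embed(self, parts):
        """Return a single embedding for a word given as an ordered morpheme list."""
        if not 0 < len(parts) <= self.max_parts:
            raise ValueError("expected between 1 and max_parts morphemes")
        ids = [self.index[p] for p in parts]
        # Assumed composition: gate each morpheme vector by its slot, then sum.
        # The gate is multiplicative because a plain sum of morpheme and slot
        # vectors would be identical for every ordering of the same parts.
        gated = self.part_table[ids] * (1.0 + self.slot_table[: len(ids)])
        return np.tanh(gated.sum(axis=0))


# Turkish example: ev-ler-de ("in the houses") = stem + plural + locative.
embedder = CompositionalEmbedder(["ev", "ler", "de", "im"], dim=8)
houses_loc = embedder.embed(["ev", "ler", "de"])
print(houses_loc.shape)  # (8,)

# A combination never stored as a whole word still gets an embedding,
# because it is assembled from the same morpheme rows: ev-ler-im-de.
my_houses_loc = embedder.embed(["ev", "ler", "im", "de"])
```

In this sketch the parameter count grows with the morpheme inventory rather than with the number of surface forms, and each word occupies one sequence position. A whole-word lookup would need one row per surface form, and a subword tokenizer would typically spread the same word over several positions.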