Tokenization as Cultural Erasure: How Corpus Composition Shapes the Representation of Aymara Morphology in NLP Systems
BRUNO FERNANDO SILVA PLATA
Abstract
Tokenization is not a neutral preprocessing step for agglutinative languages whose morphology encodes culturally meaningful distinctions. In Aymara, evidentiality, temporal orientation, and relational meaning are expressed through productive morpheme combinations that may become obscured when tokenizers are trained primarily on frequent surface forms. We present a controlled study of five SentencePiece Unigram tokenizers trained on linguistically stratified Spanish--Aymara corpora containing 17,856 translation pairs. Across 15 training runs with identical downstream T5 architectures, the tokenizer trained exclusively on morphologically simple forms achieves the strongest performance at every evaluation level, reaching $17.01 \pm 0.23$ chrF globally and $17.73 \pm 0.40$ chrF on compositional structures despite having the highest fertility and smallest vocabulary. We further show that a commonly used morpheme integrity metric may systematically favor boundary fusion in agglutinative settings, assigning the best-performing tokenizer the lowest score because correct segmentation reduces surface-form preservation. Based on these findings, we propose the Morphological Boundary Hypothesis: tokenizers trained on morphologically simple forms learn reusable roots and suffixes as independent vocabulary units, enabling better compositional generalization downstream. Our results suggest that tokenizer corpus composition substantially influences morphological representation quality in low-resource agglutinative language systems and that morphologically grounded tokenization can improve translation performance with minimal additional computational cost.
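The abstract reports that the best-performing tokenizer also has the highest fertility. As a minimal sketch of why those two properties can coexist, the snippet below computes fertility, taken here to mean the average number of subword tokens per word, for two toy vocabularies. Everything in it is an assumption made for illustration: the greedy longest-match segmenter stands in for a trained SentencePiece Unigram model, the helper names `fertility` and `make_greedy_tokenizer` are hypothetical, and the word list is a small hand-picked set of Aymara forms built on uta ('house'), not data from the study.

```python
from typing import Callable, Iterable, List, Set


def fertility(words: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens produced per word."""
    word_list = list(words)
    if not word_list:
        raise ValueError("need at least one word")
    total_tokens = sum(len(tokenize(w)) for w in word_list)
    return total_tokens / len(word_list)


def make_greedy_tokenizer(vocab: Set[str]) -> Callable[[str], List[str]]:
    """Hypothetical stand-in for a trained tokenizer: greedy longest match.

    Characters not covered by any vocabulary entry fall back to single
    characters. This is NOT the Unigram algorithm, only an illustration.
    """
    def tokenize(word: str) -> List[str]:
        pieces: List[str] = []
        i = 0
        while i < len(word):
            # Try the longest candidate first, shrinking toward one character.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    pieces.append(word[i:j])
                    i = j
                    break
            else:
                # No vocabulary entry starts here: emit a single character.
                pieces.append(word[i])
                i += 1
        return pieces
    return tokenize


# Illustrative forms: uta 'house', -naka plural, -xa topic marker.
words = ["utaxa", "utanaka", "utanakaxa"]

# Vocabulary of reusable roots and suffixes (morpheme-level units).
morpheme_tok = make_greedy_tokenizer({"uta", "naka", "xa"})
# Vocabulary of frequent fused surface forms.
fused_tok = make_greedy_tokenizer({"utaxa", "utanaka"})

morpheme_fertility = fertility(words, morpheme_tok)
fused_fertility = fertility(words, fused_tok)
unseen_morpheme = morpheme_tok("utanakaxa")
unseen_fused = fused_tok("utanakaxa")
```

In this toy setting the morpheme-level vocabulary yields higher fertility, yet it segments the unseen combination utanakaxa into reusable units, whereas the fused vocabulary falls back to stray characters for the final suffix. This only illustrates the intuition behind the hypothesis stated in the abstract and does not reproduce the reported results.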