mmBERT: A Modern Multilingual Encoder with Annealed Language Learning
Abstract
Encoder-only language models are frequently used for a variety of language tasks, including classification and retrieval. However, recent research effort on encoder models has been limited, especially for multilingual models. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text covering over 1800 languages. To build mmBERT we introduce several novel elements for massively multilingual encoder training, including phased data curation and scheduled language inclusion. We add over 1700 low-resource languages to the data mix only during the decay phase, showing that this boosts performance dramatically and maximizes the gains from their relatively small amount of training data without excessive repetition. The model incorporates recent advances in architecture and training recipes, making it faster and more multilingual than prior encoders, and we release the weights, data, and code. We show that mmBERT significantly outperforms the previous generation of models on a variety of tasks, for both high- and low-resource languages.
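
To make the idea of scheduled language inclusion concrete, the following is a minimal illustrative sketch, not the paper's actual recipe: it assumes a three-phase schedule (pretraining, mid-training, decay) in which the language pool grows per phase and the low-resource languages enter only in the decay phase, with a temperature-style sampling exponent that is annealed toward a flatter distribution. All language names, token counts, and exponent values below are hypothetical placeholders.

```python
import random

# Hypothetical per-language token counts (in billions) -- illustration only.
HIGH_RESOURCE = {"en": 1000, "zh": 400, "de": 200, "fr": 180}
MID_RESOURCE = {"sw": 5, "is": 4, "km": 3}
LOW_RESOURCE = {"gd": 0.10, "fo": 0.08, "dz": 0.05}  # stand-in for the ~1700 languages added late

# Assumed 3-phase schedule: (phase name, language pool, sampling exponent alpha).
# A lower alpha flattens the sampling distribution toward uniform, upweighting rare languages.
SCHEDULE = [
    ("pretrain",  {**HIGH_RESOURCE},                                  0.7),
    ("mid_train", {**HIGH_RESOURCE, **MID_RESOURCE},                  0.5),
    ("decay",     {**HIGH_RESOURCE, **MID_RESOURCE, **LOW_RESOURCE},  0.3),
]

def language_distribution(counts, alpha):
    """p_i proportional to counts_i ** alpha (temperature-style sampling)."""
    weights = {lang: n ** alpha for lang, n in counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

def sample_language(phase_index, rng=random):
    """Pick the language for the next training batch according to the current phase."""
    _, pool, alpha = SCHEDULE[phase_index]
    dist = language_distribution(pool, alpha)
    langs, probs = zip(*dist.items())
    return rng.choices(langs, weights=probs, k=1)[0]

if __name__ == "__main__":
    for i, (name, pool, alpha) in enumerate(SCHEDULE):
        print(f"{name}: {len(pool)} languages, alpha = {alpha}")
        print("  sampled batch languages:", [sample_language(i) for _ in range(5)])
```

The sketch only shows the scheduling mechanism: because the decay phase is short, adding the low-resource pool there lets each of those languages be seen without being repeated excessively over the full 3T-token run.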