ICML Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough

Poster
in
Workshop: 2nd Workshop on Advancing Neural Network Training : Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024)

Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough

Konstantin Dobler · Gerard de Melo

[ Abstract ] [ Project Page ]

[ Slides] [ Poster] [ OpenReview]

Abstract:

We investigate continued pretraining of LLMs for language adaptation on a tight academic budget: a setting in which only a few GPUs can be used in parallel, for a heavily constrained duration. We focus on adapting Mistral-7B to German or Arabic and evaluate several techniques to improve efficiency and performance in this setting. Our German models adapted on this tight compute budget underperform compared to the base Mistral-7B, while our Arabic models outperform several baselines, showing that for sufficiently well-represented languages, continued pretraining for specialization is not always helpful. Our main findings focus on training precision and tokenizer swapping. Based on our results, we conclude that pure bfloat16 training is a viable alternative to mixed-precision training, while being much faster when only using a few GPUs. Swapping the tokenizer for a specialized one yields more efficient tokenization and is competitive with the original tokenizer, which already contains some German tokens, but did not significantly increase performance for German.

Chat is not available.

Poster in Workshop: 2nd Workshop on Advancing Neural Network Training : Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024)

Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough

Konstantin Dobler · Gerard de Melo

Poster
in
Workshop: 2nd Workshop on Advancing Neural Network Training : Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024)