Poster in Workshop on Theoretical Foundations of Foundation Models (TF2M)
Multilingual Compression Parity: How Efficiently Large Language Models Represent Information Across Languages?
Alexander Tsvetkov · Alon Kipnis
Large Language Models (LLMs) are increasingly deployed in user-facing applications worldwide, necessitating the handling of multiple languages across a variety of tasks. However, no single metric can predict an LLM's multilingual capabilities. To address this gap, we propose Compression Parity (CP) – a metric based on Shannon's information measure – to assess the multilingual capabilities of an LLM in a task-agnostic manner. We evaluate CP on open-source LLMs (Llama2, Gemma, Mistral) and demonstrate a strong correlation with existing task-specific metrics from the literature – stronger than that of any existing metric we are aware of, e.g., tokenizer parity and fertility. These findings show that CP is a good predictor of an LLM's performance in a given language, and hence it may serve as a useful tool for ranking multilingual LLMs' capabilities regardless of the downstream task.
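The abstract does not spell out how CP is computed; below is a minimal sketch of one plausible reading, in which CP compares the total code length (in bits, via the model's negative log-likelihood) that an LLM assigns to parallel texts in a target language versus English. The function names, the model identifier, and the exact ratio definition are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: compression parity as a ratio of code lengths (bits)
# an LLM assigns to parallel texts. Names and the exact formula are
# assumptions for illustration; see the paper for the authors' definition.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def bits_for_text(model, tokenizer, text: str) -> float:
    """Total code length in bits = negative log2-likelihood of the text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids yields the mean cross-entropy (in nats)
        # over the predicted tokens, with the usual one-token shift.
        loss = model(ids, labels=ids).loss.item()
    n_predicted = ids.shape[1] - 1  # the first token has no prediction target
    return loss * n_predicted / math.log(2)


def compression_parity(model, tokenizer, parallel_en, parallel_xx) -> float:
    """Ratio of bits used for English vs. a target language on parallel data.
    Values near 1 would indicate the model compresses both languages
    about equally well."""
    bits_en = sum(bits_for_text(model, tokenizer, t) for t in parallel_en)
    bits_xx = sum(bits_for_text(model, tokenizer, t) for t in parallel_xx)
    return bits_en / bits_xx


# Usage (assumed model id and parallel corpora):
# tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# lm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
# cp_de = compression_parity(lm, tok, english_sentences, german_sentences)
```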