Poster in Workshop on Theoretical Foundations of Foundation Models (TF2M)
Multilingual Compression Parity: How Efficiently Large Language Models Represent Information Across Languages?
Alexander Tsvetkov · Alon Kipnis
Large Language Models (LLMs) are increasingly deployed in user-facing applications worldwide, necessitating the handling of multiple languages across a variety of tasks. However, no single metric can predict an LLM's multilingual capabilities. To address this gap, we propose Compression Parity (CP) – a metric based on Shannon's information measure – to assess the multilingual capabilities of an LLM in a task-agnostic manner. We evaluate CP on open-source LLMs (Llama2, Gemma, Mistral) and demonstrate a strong correlation with existing task-specific metrics from the literature – stronger than that of any existing metric we are aware of, e.g., tokenizer parity and fertility. These findings show that CP is a good predictor of an LLM's performance in a given language, and hence it may serve as a useful tool for ranking multilingual LLMs' capabilities regardless of the downstream task.
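The abstract does not spell out how CP is computed; below is a minimal sketch of one plausible reading, in which CP compares the total code length (in bits, via the model's negative log-likelihood) that an LLM assigns to parallel texts in a target language versus English. The function names, the model identifier, and the exact ratio definition are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: compression parity as a ratio of code lengths (bits)
# an LLM assigns to parallel texts. Names and the exact formula are
# assumptions for illustration; see the paper for the authors' definition.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def bits_for_text(model, tokenizer, text: str) -> float:
    """Total code length in bits = negative log2-likelihood of the text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids yields the mean cross-entropy (in nats)
        # over the predicted tokens, with the usual one-token shift.
        loss = model(ids, labels=ids).loss.item()
    n_predicted = ids.shape[1] - 1  # the first token has no prediction target
    return loss * n_predicted / math.log(2)


def compression_parity(model, tokenizer, parallel_en, parallel_xx) -> float:
    """Ratio of bits used for English vs. a target language on parallel data.
    Values near 1 would indicate the model compresses both languages
    about equally well."""
    bits_en = sum(bits_for_text(model, tokenizer, t) for t in parallel_en)
    bits_xx = sum(bits_for_text(model, tokenizer, t) for t in parallel_xx)
    return bits_en / bits_xx


# Usage (assumed model id and parallel corpora):
# tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# lm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
# cp_de = compression_parity(lm, tok, english_sentences, german_sentences)
```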