Zero-Shot Rankability: Revealing Latent Ordinal Structure in Multimodal Large Language Models via Language
Abstract
Recent work shows that vision encoders capture ordinal attributes along linear axes, which can be recovered from as few as two labeled images. In the zero-shot setting, however, the text-driven rank axis for Vision-Language Models (VLMs) such as CLIP remains suboptimal. In this work, we study the embeddings of Multimodal Large Language Models (MLLMs). We hypothesize that MLLMs can overcome this limitation owing to three potential advantages: their inherent ordinal understanding, their capacity for conditional embeddings, and their small cross-modal gap. We show that MLLMs are rankable using only text prompts. Experiments demonstrate that a text-driven rank axis for MLLM embeddings achieves 90% of the performance of the supervised linear rank axis, significantly outperforming the 61% observed for VLM embeddings. We validate that this capability stems from MLLMs' conditional embeddings and their smaller cross-modal gap relative to VLMs. Furthermore, we demonstrate that this property generalizes to the audio domain. Our findings suggest that language provides a direct interface for probing latent ordinal structure in MLLMs.