Poster Mon, Jul 6, 2026 • 10:00 PM – 11:45 PM PDT HALL A #3908

TextME: Bridging Unseen Modalities Through Text Descriptions

Soyeon Hong ⋅ Jinchan Kim ⋅ Jaegook You ⋅ Seungtaek Choi ⋅ Suha Kwak ⋅ Hyunsouk Cho

Abstract

Expanding multimodal representations to novel modalities is constrained by reliance on large-scale paired datasets (e.g., text–image, text–audio, text–3D, text–molecule), which are costly and often infeasible to collect in domains requiring expert annotation, such as medical imaging and molecular analysis. We introduce TextME, to the best of our knowledge the first modality expansion framework based on text-only training, which projects diverse modalities into an LLM embedding space that serves as a unified anchor. Our approach exploits the geometric structure of pretrained contrastive encoders, specifically their consistent modality gap, to enable zero-shot cross-modal transfer using only text descriptions, without paired supervision. We empirically validate that such gaps exist consistently across image, video, audio, 3D, X-ray, and molecular domains, and demonstrate that text-only training can preserve a substantial share of the pretrained encoders' performance. We further show that our framework enables emergent cross-modal retrieval between modality pairs not explicitly aligned during training (e.g., audio-to-image, 3D-to-image). These results establish text-only projection training as a practical alternative to paired supervision for modality expansion. The code is available at https://soyeonhh.github.io/TextME/.
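
As a rough illustration of the modality-gap idea the abstract relies on, the toy sketch below treats the gap as a constant offset between text embeddings and another modality's embeddings in a shared contrastive space. This is not the TextME implementation: the function names (`estimate_gap`, `shift_text`), the synthetic data, and the constant-offset assumption are all hypothetical and chosen only to make the geometry concrete. Note that the offset is estimated from the two embedding means, so no text-to-item pairing is needed for that step.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length, as contrastive encoders typically do.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def estimate_gap(text_emb, modality_emb):
    # Hypothetical helper: the gap is taken to be the difference of the two
    # embedding centroids. Only per-modality means are used, so the rows of
    # the two arrays do not have to be paired.
    return modality_emb.mean(axis=0) - text_emb.mean(axis=0)

def shift_text(text_emb, gap):
    # Hypothetical helper: move text embeddings across the gap so they can
    # stand in for the other modality's embeddings.
    return l2_normalize(text_emb + gap)

# Toy data (assumption): the other modality equals text plus a fixed offset.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(200, 16))
true_offset = np.full(16, 3.0)
modality_emb = text_emb + true_offset

gap = estimate_gap(text_emb, modality_emb)
pseudo_modality = shift_text(text_emb, gap)

# Cosine similarity between each shifted text embedding and its true
# counterpart in the other modality.
matched_cos = np.sum(pseudo_modality * l2_normalize(modality_emb), axis=1)
```

In this idealized setting the shifted text embeddings coincide with the other modality's embeddings, which is why a projection trained on text alone could, in principle, transfer to the unseen modality. Real encoders only approximate a constant offset, so the match would be imperfect in practice.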

Lay Summary

Modern AI systems learn to connect different types of data, such as images and the text that describes them, by placing them in a shared representation space where related items sit close together. Extending such a system to a new data type, called modality expansion, normally requires large collections of paired data, where each item (e.g., an X-ray or a molecule) is matched to a text description. Such paired data is costly and, in specialized fields like medicine, often infeasible to collect. We observe that each pretrained model arranges its data with a consistent geometric structure, known as the modality gap, that can be exploited as a shortcut. Building on this, our framework TextME performs modality expansion using text-only training: it learns to map six modalities (images, video, audio, 3D, X-rays, and molecules) into a unified space using only text descriptions, with no paired data. This reduces the required data by over 95 percent and enables cross-modal matching that the system was never explicitly trained on, making modality expansion practical for specialized domains.