Text-Driven Fusion for Infrared and Visible Images: Achieving Image Scene Adaptation in Hyperbolic Space
Abstract
Infrared and visible image fusion aims to integrate complementary information from both modalities. However, most existing methods rely on Euclidean representations, which impose geometric constraints that hinder effective semantic modelling. In particular, Euclidean geometry enforces rigid distance metrics that distort multi-modal feature interactions, especially when preserving parent-to-child semantic hierarchies. To overcome this, we introduce a text-driven fusion framework empowered by hyperbolic manifold learning. During training, our approach models text-attribute correlations by leveraging BLIP-extracted prompts to align with visual attributes, thereby enabling semantically adaptive enhancement strategies. Within the hyperbolic space, the text prompts act as topological anchors, guiding visual-attribute alignment through hyperbolic embeddings that naturally expand with semantic granularity. Exploiting the Poincaré ball's negative curvature, we encode coarse-to-fine semantics without Euclidean distance saturation, while its exponentially growing periphery prevents texture distortion during cross-modal fusion. During inference, the fusion process adapts autonomously to the input content using the learned text-attribute priors, eliminating any dependence on textual input. Experimental results show that the proposed method outperforms state-of-the-art methods on publicly available benchmark datasets.
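As a point of reference only (the abstract does not specify the exact formulation), the sketch below illustrates the standard unit-curvature Poincaré-ball distance, whose unbounded growth toward the boundary is what avoids Euclidean distance saturation, and how text-prompt embeddings could act as anchors for vision features. The names `poincare_distance`, `project_to_ball`, `text_anchor`, and `vision_feat`, as well as the alignment loss, are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch, assuming a unit-curvature Poincare ball and a simple
# anchor-alignment objective; not the paper's actual method.
import torch

def poincare_distance(x, y, eps=1e-5):
    """Geodesic distance on the unit Poincare ball.

    Unlike the Euclidean metric, this distance grows without bound as points
    approach the boundary, so fine-grained (child-level) semantics near the
    periphery remain separable instead of saturating.
    """
    x2 = (x * x).sum(-1).clamp(max=1 - eps)
    y2 = (y * y).sum(-1).clamp(max=1 - eps)
    diff2 = ((x - y) ** 2).sum(-1)
    return torch.acosh(1 + 2 * diff2 / ((1 - x2) * (1 - y2)))

def project_to_ball(v, eps=1e-5):
    """Rescale a Euclidean feature so it lies strictly inside the unit ball."""
    norm = v.norm(dim=-1, keepdim=True).clamp(min=eps)
    factor = torch.where(norm >= 1, (1 - eps) / norm, torch.ones_like(norm))
    return v * factor

# Hypothetical usage: pull vision features toward their text-prompt anchors.
vision_feat = project_to_ball(torch.randn(8, 128) * 0.1)   # fused-image features
text_anchor = project_to_ball(torch.randn(8, 128) * 0.1)   # BLIP-prompt embeddings
alignment_loss = poincare_distance(vision_feat, text_anchor).mean()
```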