GHVL: Geometry-Grounded Hyperbolic Vision-Language Models for Hierarchical Multimodal Representation Learning
Abstract
Vision-language models (VLMs) have achieved remarkable performance by aligning visual and textual representations in a shared Euclidean space. However, Euclidean embeddings distort the hierarchical semantic structure present in multimodal data, such as fine-grained categories and conceptual hierarchies, because Euclidean volume grows only polynomially with radius while the number of nodes in a hierarchy grows exponentially with depth. We propose GHVL, a geometry-grounded hyperbolic VLM that maps images and text into the Poincaré ball model of hyperbolic space to induce hierarchy-aware representations. By exploiting the exponential volume growth of hyperbolic space, GHVL preserves semantic distances across multiple hierarchy levels, enabling faithful modeling of fine-grained concepts. We introduce an adaptive, entropy-driven entailment loss that enforces hierarchical ordering between modalities, and we combine it with a contrastive objective for cross-modal alignment. Evaluation on zero-shot classification and image-text retrieval benchmarks shows consistent improvements over both the Euclidean baseline CLIP and the Lorentz-model hyperbolic baseline MERU, particularly in hierarchy-sensitive scenarios. These results highlight the importance of respecting geometric structure and indicate that hyperbolic representations provide a principled foundation for hierarchical multimodal understanding.
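To make the geometry concrete, the following is a minimal NumPy sketch of the two standard operations that any Poincaré-based embedding relies on: the exponential map at the origin, which lifts a Euclidean encoder output into the open unit ball, and the geodesic distance between two points in the ball. It assumes that the Poincaré manifold referred to above is the standard Poincaré ball with curvature -c. The function names `expmap0` and `poincare_distance`, the parameter `c`, and the clamping constant `eps` are illustrative choices and are not taken from GHVL itself.

```python
import numpy as np


def expmap0(v, c=1.0, eps=1e-9):
    """Exponential map at the origin of the Poincare ball (curvature -c).

    Maps a Euclidean (tangent) vector v to a point strictly inside the
    ball of radius 1/sqrt(c). Assumed helper, not the GHVL implementation.
    """
    v = np.asarray(v, dtype=np.float64)
    sqrt_c = np.sqrt(c)
    # Clamp the norm to avoid division by zero for the zero vector.
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)


def poincare_distance(x, y, c=1.0, eps=1e-9):
    """Geodesic distance between points x and y in the Poincare ball."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    sq_diff = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - c * np.sum(x ** 2, axis=-1)) * (1.0 - c * np.sum(y ** 2, axis=-1))
    # Clamp the denominator so points near the boundary stay finite.
    arg = 1.0 + 2.0 * c * sq_diff / np.maximum(denom, eps)
    return np.arccosh(arg) / np.sqrt(c)


if __name__ == "__main__":
    # Two hypothetical encoder outputs pointing in different directions.
    image_feature = np.array([1.5, 0.0])
    text_feature = np.array([0.0, 1.5])
    p, q = expmap0(image_feature), expmap0(text_feature)
    print("hyperbolic distance:", poincare_distance(p, q))
    print("euclidean distance in the ball:", np.linalg.norm(p - q))
```

Points mapped close to the boundary of the ball are far apart in hyperbolic distance even when their Euclidean separation is small, which is the property that leaves room for many fine-grained concepts beneath a shared, more general concept nearer the origin.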