

Poster in Workshop: Self-Improving Foundation Models Without Human Supervision

DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks

Amin Karimi Monsefi · Kishore Sailaja · Ali Alilooee · Ser-Nam Lim · Rajiv Ramnath

Keywords: [ Contrastive Learning ] [ Self-Improving ] [ CLIP ] [ Segmentation ] [ Vision-Language Models ]


Abstract:

In this paper, we introduce DetailCLIP, a self-improving vision-language foundation model designed to enhance fine-grained feature understanding through self-supervised learning. Foundation models like CLIP have demonstrated strong performance in global image-text alignment but often fail to capture detail-oriented features necessary for tasks such as segmentation. To address this, DetailCLIP integrates self-curated learning objectives that iteratively improve both high-level semantics and detailed visual representations. Specifically, our method employs patch-level self-distillation and pixel-level reconstruction losses to generate refined internal representations, while an attention-based token filtering mechanism curates semantically relevant information during training. By generating and refining self-curated learning signals, DetailCLIP improves segmentation performance and demonstrates superior generalization across diverse tasks. These task-agnostic objectives position DetailCLIP as a self-improving foundation model, enhancing multi-modal systems like CLIP with fine-grained feature understanding.
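The abstract names three ingredients: patch-level self-distillation, pixel-level reconstruction, and attention-based token filtering. The snippet below is a minimal sketch of how those loss terms could be combined; tensor shapes, function names, the keep ratio, and the temperature are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def detail_oriented_losses(student_patches, teacher_patches,
                           reconstructed_pixels, target_pixels,
                           cls_attention, keep_ratio=0.5, temperature=0.1):
    """Hypothetical sketch of the losses described in the abstract.

    student_patches / teacher_patches: (B, N, D) patch embeddings
    reconstructed_pixels / target_pixels: (B, N, P) per-patch pixel values
    cls_attention: (B, N) attention of the [CLS] token over patches
    """
    B, N, D = student_patches.shape

    # Attention-based token filtering: keep the patches the [CLS] token
    # attends to most, i.e. the semantically relevant ones.
    k = max(1, int(keep_ratio * N))
    topk = cls_attention.topk(k, dim=1).indices          # (B, k)
    idx = topk.unsqueeze(-1)                             # (B, k, 1)

    s = torch.gather(student_patches, 1, idx.expand(-1, -1, D))
    t = torch.gather(teacher_patches, 1, idx.expand(-1, -1, D))

    # Patch-level self-distillation: cross-entropy between teacher and
    # student patch distributions, teacher treated as a fixed target.
    t_probs = F.softmax(t.detach() / temperature, dim=-1)
    s_logp = F.log_softmax(s / temperature, dim=-1)
    distill_loss = -(t_probs * s_logp).sum(dim=-1).mean()

    # Pixel-level reconstruction: L2 error on the same filtered patches.
    P = reconstructed_pixels.shape[-1]
    rec = torch.gather(reconstructed_pixels, 1, idx.expand(-1, -1, P))
    tgt = torch.gather(target_pixels, 1, idx.expand(-1, -1, P))
    recon_loss = F.mse_loss(rec, tgt)

    return distill_loss, recon_loss


# Toy usage with random tensors (batch of 2, 196 patches).
if __name__ == "__main__":
    B, N, D, P = 2, 196, 512, 768
    d, r = detail_oriented_losses(
        torch.randn(B, N, D), torch.randn(B, N, D),
        torch.randn(B, N, P), torch.randn(B, N, P),
        torch.rand(B, N),
    )
    print(f"distill={d.item():.4f}  recon={r.item():.4f}")
```

In practice these terms would be weighted and added to CLIP's image-text contrastive objective; the weighting scheme is not specified in the abstract.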
