CARPRT: Class-Aware Zero-Shot Prompt Reweighting for Vision-Language Models
Abstract
Pre-trained vision-language models (VLMs) enable zero-shot image classification by computing a similarity score between an image and textual descriptions, typically formed by inserting a class label (e.g., "cat") into a prompt (e.g., "a photo of a"). Existing studies have shown that the score for a given image-class pair is highly sensitive to the choice of prompt, and they propose weighting vectors that aggregate the scores obtained from different prompts. We observe that these studies assign the same weighting vector to all classes, implicitly assuming conditional independence between classes and weights, which often does not hold in practice. For instance, a prompt like "an aerial view of" might be apt for "airport" but ill-suited for "apple". To address this, we propose class-aware zero-shot prompt reweighting (CARPRT), a scoring scheme that adjusts the weighting vector for each class by capturing the class-specific relevance of different prompts in a training-free manner. For each class and every available prompt, CARPRT first identifies the maximum image-text relevance score attained by that prompt-class pair across the dataset. These maximum scores are then normalized to estimate class-specific weights that reflect how effectively each prompt represents different semantic labels. Evaluations on standard fine-grained image classification benchmarks show that CARPRT outperforms existing class-independent reweighting schemes, confirming that modeling prompt-class dependency is crucial for effective zero-shot prediction and, more broadly, for VLM-based applications that rely on prompt ensembling.
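The following is a minimal NumPy sketch of the scoring scheme as the abstract describes it: take, for each (prompt, class) pair, the maximum image-text score over the dataset, normalize these maxima into per-class prompt weights, and ensemble with class-specific weighting vectors. The function names (carprt_weights, carprt_predict) and the choice of softmax as the normalization over prompts are illustrative assumptions, not details given in the abstract.

```python
import numpy as np

def carprt_weights(scores):
    """Estimate class-specific prompt weights from zero-shot similarity scores.

    scores: array of shape (num_images, num_prompts, num_classes) holding
    image-text similarity scores (e.g., CLIP cosine similarities) for every
    image, prompt, and class label in the dataset.
    Returns weights of shape (num_prompts, num_classes), one weighting
    vector over the prompt pool per class.
    """
    # For each (prompt, class) pair, the maximum relevance score attained
    # by any image in the dataset, as described in the abstract.
    max_scores = scores.max(axis=0)                       # (num_prompts, num_classes)

    # Normalize over prompts; softmax is one plausible choice (assumption).
    shifted = max_scores - max_scores.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    weights = exp / exp.sum(axis=0, keepdims=True)        # each column sums to 1
    return weights

def carprt_predict(scores, weights):
    """Class-aware prompt ensembling: weight each prompt's score per class,
    sum over prompts, then pick the highest-scoring class per image."""
    ensembled = (scores * weights[None]).sum(axis=1)      # (num_images, num_classes)
    return ensembled.argmax(axis=1)
```

Note that, unlike class-independent reweighting, the weight matrix here has a separate column per class, so a prompt such as "an aerial view of" can receive high weight for "airport" and low weight for "apple".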