Large-scale vision-language pre-trained (VLP) models, such as CLIP, excel at diverse applications in zero-shot settings. However, their performance often falters in specialized domains due to domain gaps and underrepresentation in training data. While supervised fine-tuning with human-annotated labels can address this, it's expensive and time-consuming, especially for complex tasks.
LATTECLIP is an unsupervised approach for fine-tuning CLIP models on domain-specific classification tasks without human-annotated labels. It leverages Large Multimodal Models (LMMs) to generate expressive text descriptions for individual images and for groups of images, providing contextual information that guides the fine-tuning process.
LATTECLIP uses LMMs to create text descriptions at multiple levels: per-image descriptions that capture instance-specific details, and group-level descriptions that summarize sets of related images.
These multi-level descriptions provide richer supervision for training than pseudo-labels alone.
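The snippet below is a minimal sketch (not the authors' code) of how such descriptions can be embedded with CLIP's text encoder; the example strings are invented placeholders for the kind of text an LMM might produce for a single image and for a group of images.

```python
# Minimal sketch: encoding LMM-style descriptions with CLIP's text encoder.
# The strings are illustrative placeholders, not real LATTECLIP outputs.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

texts = [
    "a photo of a boeing 737",                                   # bare pseudo-label prompt
    "a narrow-body jet with two underwing engines on a runway",  # per-image description
    "commercial passenger jets photographed at airports",        # group-level description
]

inputs = tokenizer(texts, padding=True, return_tensors="pt")
with torch.no_grad():
    feats = model.get_text_features(**inputs)
feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize for cosine similarity

# Pairwise cosine similarities: the richer descriptions sit near, but not on top of,
# the plain class-name prompt, which is the extra supervision signal being exploited.
print(feats @ feats.T)
```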
Directly fine-tuning CLIP with LMM-generated text can lead to poor performance due to hallucinations and noise. LATTECLIP addresses this by employing a prototype learning framework that learns class-specific representations from the generated texts.
LATTECLIP's prototype learning approach combines the different sources of supervision available, blending pseudo-labels with the per-image and group-level descriptions so that noisy or hallucinated text does not dominate the learned class prototypes (a simplified sketch is shown below).
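The sketch below illustrates one simplified reading of such a framework: class prototypes maintained as a momentum average of text embeddings and matched against image features with a CLIP-style loss. The function names, mixing weights, and loss are assumptions for illustration, not the paper's exact formulation.

```python
# Simplified prototype-learning sketch (assumed details, not the official method):
# class prototypes are an EMA of description embeddings per pseudo-class, and
# images are pulled toward the prototype of their pseudo-label.
import torch
import torch.nn.functional as F

num_classes, dim = 10, 512
momentum, temperature = 0.99, 0.07

# One unit vector per class, e.g. initialized from zero-shot class-name prompt
# embeddings (random here to keep the example self-contained).
prototypes = F.normalize(torch.randn(num_classes, dim), dim=-1)

def update_prototypes(prototypes, text_feats, pseudo_labels):
    """EMA update: fold description embeddings into their pseudo-class prototype,
    so transient hallucinated content gets averaged out over training."""
    for c in pseudo_labels.unique():
        class_mean = text_feats[pseudo_labels == c].mean(dim=0)
        prototypes[c] = momentum * prototypes[c] + (1 - momentum) * class_mean
    return F.normalize(prototypes, dim=-1)

def prototype_loss(image_feats, prototypes, pseudo_labels):
    """CLIP-style cross-entropy between image features and class prototypes."""
    logits = image_feats @ prototypes.T / temperature
    return F.cross_entropy(logits, pseudo_labels)

# Toy batch: in real training these would come from CLIP's image encoder and
# from encoding the LMM-generated descriptions with CLIP's text encoder.
image_feats = F.normalize(torch.randn(32, dim, requires_grad=True), dim=-1)
text_feats = F.normalize(torch.randn(32, dim), dim=-1)
pseudo_labels = torch.randint(0, num_classes, (32,))

prototypes = update_prototypes(prototypes, text_feats, pseudo_labels)
loss = prototype_loss(image_feats, prototypes, pseudo_labels)
loss.backward()  # gradients would flow back into the CLIP image encoder
```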
LATTECLIP was evaluated on 10 domain-specific classification datasets, demonstrating significant improvements in top-1 accuracy over baseline methods.
Ablation studies were also conducted to analyze the contributions of LATTECLIP's individual components.
LATTECLIP is a promising unsupervised method for fine-tuning CLIP models on specialized datasets where human annotations are costly. Its ability to leverage LMMs for expressive text generation and its robust prototype learning framework effectively address the challenges of noise and domain adaptation, achieving significant performance improvements without relying on human labels.