[CVPR 2023 Series #6] LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models

By Adrian Bulat Samsung AI Center - Cambridge
By Georgios Tzimiropoulos Samsung AI Center - Cambridge

The Computer Vision and Pattern Recognition Conference (CVPR) is a world-renowned international Artificial Intelligence (AI) conference co-hosted by the Institute of Electrical and Electronics Engineers (IEEE) and the Computer Vision Foundation (CVF) which has been running since 1983. CVPR is widely considered to be one of the three most significant international conferences in the field of computer vision, alongside the International Conference on Computer Vision (ICCV) and the European Conference on Computer Vision (ECCV).

In this relay series, we introduce seven research papers presented at CVPR 2023. The papers are listed below.

- Part 1 : SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting with Neural Radiance Fields (by Samsung AI Center – Toronto)

- Part 2 : StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos (by Samsung AI Center – Toronto)

- Part 3 : GENIE: Show Me the Data for Quantization (by Samsung Research)

- Part 4 : A Unified Pyramid Recurrent Network for Video Frame Interpolation (By Samsung R&D Institute - Nanjing)

- Part 5 : MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation (By Samsung R&D Institute United Kingdom)

- Part 6 : LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models (By Samsung AI Center - Cambridge)

- Part 7 : Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style (By Samsung AI Center - Cambridge)


Welcome to our research blog post, where we delve into the fascinating world of vision and language models. In recent years, large-scale pre-training of neural networks has paved the way for ground-breaking advancements in Vision & Language (V&L) understanding. These pre-trained models, such as BERT [5] and CLIP [1], have demonstrated their ability to capture the intricacies of the world, enabling them to adapt seamlessly to new tasks and datasets.

In this blog post, we focus on the remarkable potential of V&L models trained with contrastive learning, which have opened doors to few-shot and even zero-shot adaptation. By leveraging the power of contrastive learning, these models can quickly adapt to new downstream tasks with minimal training examples. Prompt engineering and learning have emerged as powerful techniques for adapting V&L models to novel tasks, drawing inspiration from their counterparts in Natural Language Processing (NLP). Initially, researchers relied on manual templates or prompts to create class-specific weights for zero-shot recognition. However, recent advancements have introduced the concept of "soft prompts" [2,3] – learnable vectors that are fed into the text encoder along with the class name. These soft prompts are learned from a few training examples with the entire V&L model kept frozen. The whole process can be seen as parameter-efficient fine-tuning of the model on a small training dataset.

Despite the promising results of soft prompt learning, there is a noticeable challenge known as base class overfitting. While accuracy on the base classes improves significantly, the accuracy on unseen novel classes suffers. This issue arises because soft prompts are learned from only a few examples belonging to the base classes. Interestingly, hand-engineered prompts still outperform existing soft prompt learning methods when it comes to recognizing novel classes.

Our Approach

To address the problem of base class overfitting, we propose a simple, yet highly effective solution motivated by a keen observation: prompt learning enhances accuracy on base classes, while prompt engineering excels in recognizing novel classes. Drawing on this insight, we introduce a novel text-to-text loss that encourages the learned prompts to be close, in embedding space, to the textual prompts. By exploiting the intrinsic information captured by the text encoder, we enable language-only optimization for V&L model adaptation, a unique approach compared to previous soft prompt learning methods that mainly focus on V&L interactions.

Our Contributions

We propose a novel framework for soft prompt learning which we call Language-Aware Soft Prompting (LASP). Our main contributions within the LASP framework are as follows:

1. Language-Only Optimization: For the first time, we propose language-only optimization for V&L model adaptation. Our novel text-to-text cross-entropy loss maximizes the probability of the learned prompts being correctly classified with respect to the hand-engineered ones. We demonstrate the effectiveness of this approach in alleviating base class overfitting.

2. Grouped Language-Aware Prompt Representation: To increase the representation capacity of prompts, we draw inspiration from grouped convolution and multi-head attention. We introduce a grouped language-aware prompt representation, where each group of prompts specializes in a different subset of pre-defined manual templates.

3. Addressing Visual-Language Misalignment: Prompt learning and, more generally, contrastive pre-training introduce a visual-language misalignment that impacts generalization. To tackle this challenge, we propose a re-calibration mechanism, which involves Layer Normalization fine-tuning and learning a class-agnostic bias.

4. Training with Virtual Classes: Leveraging our language-only learning framework, we propose training LASP with virtual classes, even when visual samples are unavailable. This strategy further enhances the robustness of the learned prompts.

Through extensive experiments, we showcase the superiority of our approach over existing soft prompting methods. Our methods set a new state-of-the-art for few-shot and zero-shot image classification on multiple datasets. Notably, we present a prompt learning method that outperforms the strong baseline of hand-crafted prompts and CLIP for recognizing novel classes.

Figure 1. While standard prompt learning is based on image-text interactions (L_VL loss), LASP additionally models text-text interactions using the proposed Text-to-Text loss L_TT. There are G groups of learned prompts, passed through the text encoder to form G text embeddings summarizing the input. The L_TT loss is then applied over the different groups of the text embeddings and the textual prompts. Moreover, to alleviate data distribution shift and visual-language misalignment, the LN layers of the visual encoder are fine-tuned, and the embeddings are “corrected” at the output space by the learnable vector b, shared for all classes. The text encoder remains entirely frozen. Notably, LASP can be trained with virtual classes by including, during training, class names for which no visual samples are available.

Language-Only Optimisation: Language-Aware Soft Prompting (LASP)

Our method, called Language-Aware Soft Prompting (LASP), aims to enhance the generalisation and robustness of few-shot adaptation in Vision & Language (V&L) models. Unlike previous methods that primarily focus on V&L interactions, LASP leverages language-only optimization to improve generalization and mitigate base-class overfitting.

As such, in addition to the vision-language classification loss, we introduce a second cross-entropy loss that pulls the learned soft prompts towards hand-engineered textual prompts in the embedding space. The textual prompts, obtained by encoding class-specific templates, act as class weights in the language space, and the loss encourages each learnable prompt to be correctly classified against them. These supporting hard prompts can be randomly constructed or simply taken from those used in [1]. During training, we calculate the probability of a learnable prompt being classified as a specific class. This probability is based on the cosine similarity between the encoded textual prompt for the target class and the encoded learnable prompt in the embedding space of CLIP’s text encoder.
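A minimal sketch of such a text-to-text loss is given below. It assumes one hand-crafted template per class and a softmax temperature, neither of which is specified above, and it uses toy 2-D vectors in place of real CLIP text embeddings. The helper names `cosine` and `text_to_text_loss` are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def text_to_text_loss(learned, handcrafted, temperature=0.07):
    """Mean cross-entropy of classifying each learned prompt embedding
    against the hand-crafted prompt embeddings, used as class weights.

    learned[c] and handcrafted[c] are the text embeddings for class c.
    The temperature value is an assumption made for this sketch.
    """
    total = 0.0
    for target, learned_embedding in enumerate(learned):
        logits = [cosine(learned_embedding, h) / temperature for h in handcrafted]
        # Numerically stable log-sum-exp over all classes.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_norm - logits[target]
    return total / len(learned)


# Toy embeddings: two classes in a 2-D embedding space.
handcrafted = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # learned prompts match their class
swapped = [[0.0, 1.0], [1.0, 0.0]]   # learned prompts match the wrong class
loss_aligned = text_to_text_loss(aligned, handcrafted)
loss_swapped = text_to_text_loss(swapped, handcrafted)
```

In this toy setting the loss is close to zero when each learned prompt sits next to the hand-crafted prompt of its own class, and large when it sits next to another class's prompt.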

The overall training objective is then a combination of the V&L loss and the language-aware loss, weighted by user-defined scaling coefficients. This combined objective ensures that the model learns both visual and language cues effectively, leading to improved performance on novel classes and reduced overfitting to base classes. The overall framework is depicted in Figure 1.
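Written out, the combined objective takes the following form. The symbols $\alpha_{VL}$ and $\alpha_{TT}$ for the two user-defined scaling coefficients are notation assumed here for illustration; $L_{VL}$ and $L_{TT}$ denote the vision-language loss and the text-to-text loss.

```latex
L = \alpha_{VL} \, L_{VL} + \alpha_{TT} \, L_{TT}
```

Under this form, setting $\alpha_{TT} = 0$ would reduce the objective to soft prompt learning driven by image-text interactions alone.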

Intuitively, our method can be interpreted in several ways: as a regularizer, as a language-based augmentation and as a data-free distillation.

LASP as a regularizer: As the proposed loss encourages the learned prompts to be close in the embedding space to the textual ones, our method can be viewed as a regularizer that prevents the learned prompt-conditioned features from diverging too much from the hand-crafted ones.

LASP as a language-based augmentation: The current best practice for learning from visual data is to randomly apply a set of transformations (such as rotation, scaling etc.) at train time. Our text-to-text optimisation opens the door for a language-based augmentation for V&L adaptation too. In practice, we can achieve this by targeted prompting, where we can specify certain characteristics and/or apply text-based transformations to the class name, e.g.: “A sketch of a dog” or “A rotated photo of a dog”.

LASP as a data-free distillation: Typically, knowledge distillation requires a training set of images, where a teacher network provides a training signal for the student. LASP’s text-to-text loss can also be interpreted as a data-free distillation (i.e. one that does not use any image data) where the learnable prompts define the “samples”. As CLIP learns a joint V&L space, similar concepts are close together across both domains. Hence, optimizing against a concept or object in the language domain, using the proposed loss, should also help make a step in the visual domain. In this way, LASP distils knowledge from the textual prompts to improve image classification, even without using any image data for this loss.

Grouped Language-Aware Prompt Representation

Building on the success of techniques like grouped convolutions and multi-head attention, we propose a new approach called Grouped Prompt Representation. This approach aims to optimize prompt learning by dividing the set of textual prompts into separate groups, where each group specializes in a specific subset of prompts. Similarly to how grouped convolutions and multi-head attention combine the expertise of individual groups or heads, our grouped prompt representation leverages the specialization of each group to enhance prompt learning.

To create the groups, we evenly split the set of prompts into multiple subsets. Each subset is associated with a specific group and is optimized to capture unique aspects of the prompts. These specialized prompt groups learn transformations tailored to their respective subset of prompts. The model is then trained using an adapted text-to-text loss that extends the original one to incorporate the grouped prompts. This loss ensures that the model accurately predicts the class probabilities based on the prompts from each group.
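The even split described above can be sketched as follows. The function name `split_templates` and the example templates are hypothetical, and the sketch assumes the number of templates is divisible by the number of groups.

```python
def split_templates(templates, num_groups):
    """Evenly split hand-crafted templates into num_groups contiguous subsets.

    Each subset would be paired with one group of learned prompts.
    """
    if num_groups <= 0 or len(templates) % num_groups != 0:
        raise ValueError("this sketch assumes an even split is possible")
    size = len(templates) // num_groups
    return [templates[i * size:(i + 1) * size] for i in range(num_groups)]


# Hypothetical templates, for illustration only.
templates = [
    "a photo of a {}.",
    "a sketch of a {}.",
    "a rotated photo of a {}.",
    "a close-up photo of a {}.",
]
groups = split_templates(templates, num_groups=2)
```

With four templates and two groups, each group of learned prompts would be optimized against two of the templates.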

During inference, the final prediction is obtained by averaging the similarity scores between each group's text feature and the visual feature. This aggregation strategy combines the information from different groups to make a robust and comprehensive prediction.
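The averaging step at inference can be sketched as below, assuming cosine similarity as the score and toy 2-D features in place of real CLIP embeddings. The names `cosine` and `grouped_predict` are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def grouped_predict(image_feature, group_text_features):
    """Average per-group similarity scores and return (predicted class, scores).

    group_text_features[g][c] is the text feature of class c produced by
    prompt group g.
    """
    num_groups = len(group_text_features)
    num_classes = len(group_text_features[0])
    scores = []
    for c in range(num_classes):
        per_group = [
            cosine(image_feature, group_text_features[g][c])
            for g in range(num_groups)
        ]
        scores.append(sum(per_group) / num_groups)
    predicted = max(range(num_classes), key=scores.__getitem__)
    return predicted, scores


# Toy example: two groups, two classes, 2-D features.
image_feature = [1.0, 0.0]
group_text_features = [
    [[1.0, 0.0], [0.0, 1.0]],  # group 0: class 0, class 1
    [[1.0, 1.0], [0.0, 1.0]],  # group 1: class 0, class 1
]
predicted, scores = grouped_predict(image_feature, group_text_features)
```

Here both groups favour class 0, so the averaged score selects it.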

Re-aligning LASP

For some downstream tasks, there might be a discrepancy between the data distribution of the downstream image dataset and the one used during CLIP training. It is crucial to address this data distribution shift in the downstream adaptation process. However, optimizing the visual encoder directly to account for this shift can lead to overfitting on the base classes, where the V&L embeddings are pushed away from the joint space learned by CLIP. Early experiments with visual adapters have shown a negative impact on zero-shot accuracy.

To overcome this challenge, we propose an alternative approach: fine-tuning the Layer Normalization (LN) parameters of CLIP’s visual encoder. Fine-tuning the LN parameters provides a more robust way to adapt the visual encoder while maintaining alignment with the joint space learned by CLIP. By selectively adjusting the LN parameters, the model can capture the distributional shift without sacrificing zero-shot accuracy. This fine-tuning process helps the model effectively combat data distribution shift during downstream adaptation.

After fine-tuning the LN parameters, there is a possibility of misalignment between the visual and language modalities. To address this issue, we propose learning a "correction" at the output of the text encoder in the form of a learnable offset or bias. This offset aims to re-align the two modalities and ensure their compatibility. Specifically, we introduce a learnable vector b, which is added to the set of weights W of the linear classifier obtained by passing the learned prompts to the text encoder. It is important to note that the learned offset is shared among all classes, allowing it to be readily applied even in the case of novel classes. This approach enables us to effectively correct for V&L misalignment and improve the overall alignment and compatibility between the two modalities.
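The shared offset can be illustrated with a short sketch. The function name `add_shared_offset` and the toy values are hypothetical; in practice b would be learned and W would come from the text encoder. LN fine-tuning is not shown.

```python
def add_shared_offset(class_weights, offset):
    """Add the same offset vector b to every class weight vector in W."""
    return [[w + b for w, b in zip(weights, offset)] for weights in class_weights]


# Toy values: b has the same dimensionality as each class weight vector.
b = [0.5, -0.5]
base_weights = [[1.0, 0.0], [0.0, 1.0]]   # W for two base classes
novel_weights = [[2.0, 2.0]]              # W for one novel class

corrected_base = add_shared_offset(base_weights, b)
corrected_novel = add_shared_offset(novel_weights, b)
```

Because b does not depend on the class, the same vector corrects the weights of base and novel classes alike.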

Training with Virtual Classes (LASP-V)

A direct observation that can be drawn from the text-to-text optimisation loss is that, in practice, we do not have to use only the class names for which we have labelled image data, as the value of the text-to-text loss is independent of the input image. To this end, we propose to learn the prompts using both annotated image-text pairs and class names outside the base set (for which we have no images available). We call this setting training LASP with virtual classes. It combines the best of both worlds: the guidance from the few annotated image samples and the zero-shot generalizability of language-based training. As we will show below, LASP with virtual classes can significantly improve the robustness of the learned prompts. We refer to this variant of our method in the tables below as LASP-V.
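One way to picture this setting is a sketch in which the image-based loss sees only base classes while the text-based loss also sees virtual class names. The function name, templates and class names below are all hypothetical.

```python
def build_training_targets(base_classes, virtual_classes, templates):
    """Return (image_loss_classes, text_loss_prompts).

    image_loss_classes: classes with labelled images (base set only).
    text_loss_prompts: hand-crafted prompts for base and virtual classes,
    which is all a language-only loss needs.
    """
    image_loss_classes = list(base_classes)
    # Virtual classes contribute names only; skip any already in the base set.
    text_loss_classes = image_loss_classes + [
        name for name in virtual_classes if name not in image_loss_classes
    ]
    text_loss_prompts = {
        name: [template.format(name) for template in templates]
        for name in text_loss_classes
    }
    return image_loss_classes, text_loss_prompts


templates = ["a photo of a {}.", "a sketch of a {}."]
image_classes, text_prompts = build_training_targets(
    base_classes=["dog", "cat"],
    virtual_classes=["zebra", "dog"],
    templates=templates,
)
```

In this example "zebra" has no images, yet it still contributes prompts to the language-only part of training.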

Experimental Results

Our method is trained and tested on a suite of 11 datasets (ImageNet, Caltech-101, Oxford Pets, Stanford Cars, Flowers-102, Food-101, FGVC Aircraft, SUN397, DTD, EuroSAT and UCF-101). Each dataset is split into a base and a novel set, each containing samples belonging to an equal number of disjoint classes. Training is then performed in a few-shot manner (herein, 16 samples per class) on the base set, while testing is performed on the base and new sets separately. As the results in Table 1 show, our method outperforms prior work by a large margin, especially on the new set, where, notably, our approach outperforms CLIP with hand-crafted prompts, with LASP-V (trained with virtual classes) setting a new state-of-the-art.
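The evaluation protocol can be sketched as follows. How classes are ordered before splitting is not stated above, so sorting by name is an assumption of this sketch, as are the function names and the requirement of an even class count.

```python
import random


def split_base_new(class_names):
    """Split classes into two disjoint halves: (base, new)."""
    ordered = sorted(class_names)
    if len(ordered) % 2 != 0:
        raise ValueError("this sketch assumes an even number of classes")
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]


def sample_few_shot(labels, classes, shots=16, seed=0):
    """Pick up to `shots` sample indices per class from a list of labels."""
    rng = random.Random(seed)
    picks = {}
    for name in classes:
        indices = [i for i, label in enumerate(labels) if label == name]
        picks[name] = rng.sample(indices, min(shots, len(indices)))
    return picks


# Toy dataset: one label per sample.
labels = ["a"] * 20 + ["b"] * 5 + ["c"] * 20 + ["d"] * 20
base, new = split_base_new(["d", "c", "b", "a"])
train_picks = sample_few_shot(labels, base, shots=16)
```

Only base classes are sampled for training; the new classes are held out entirely and used for testing.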

Table 1. Aggregated classification accuracy (top-1) across a suite of 11 datasets for the base and new set. H represents the harmonic mean of the two.
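For reference, the harmonic mean H of the base and new accuracies can be computed as in the sketch below. The function name is illustrative and the numbers in the example are hypothetical, not results from the table.

```python
def harmonic_mean(base_acc, new_acc):
    """Harmonic mean of base-class and new-class accuracy."""
    if base_acc + new_acc == 0:
        return 0.0
    return 2 * base_acc * new_acc / (base_acc + new_acc)


# Hypothetical accuracies in percent, for illustration only.
h = harmonic_mean(80.0, 60.0)
```

For these hypothetical values H is about 68.57, below the arithmetic mean of 70, which is why H penalizes a large gap between base and new accuracy.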

While the improvements vary depending on the distribution of the data, on certain datasets, such as EuroSAT, the gains are as large as 17% (see Figure 2).

Figure 2. Classification accuracy (top-1) on the novel classes subset on EuroSAT, DTD, UCF-101 and SUN397 datasets.


In summary, in this work we introduced LASP, a language-aware soft prompting method for V&L adaptation that is shown to outperform prior work by a large margin. Our work is the first to propose a text-to-text optimisation framework for vision-language adaptation, largely alleviating the problem of base-class overfitting. Moreover, we showed that our approach, unlike prior work, can include virtual classes during training, i.e., class names for which no visual samples are available, significantly increasing the robustness of the learned prompts. We hope that LASP/LASP-V will serve as a strong baseline and stepping stone for future work in the area of few-shot adaptation for V&L models.



[1] Learning transferable visual models from natural language supervision, Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever, ICML 2021

[2] Learning to Prompt for Vision-Language Models, Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu, IJCV 2022

[3] Conditional Prompt Learning for Vision-Language Models, Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu, CVPR 2022

[4] Prompt Distribution Learning, Yuning Lu, Jianzhuang Liu, Yonggang Zhang, Yajing Liu, Xinmei Tian, CVPR 2022

[5] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, NAACL 2019