FFF: Fixing Flawed Foundations in Contrastive Pre-training Results in Very Strong Vision-Language Models

By Adrian Bulat, Samsung AI Center - Cambridge
By Yassine Ouali, Samsung AI Center - Cambridge
By Georgios Tzimiropoulos, Samsung AI Center - Cambridge

The Conference on Computer Vision and Pattern Recognition (CVPR) is an annual conference widely regarded as one of the most important venues in its field.

CVPR covers a wide range of topics related to computer vision and pattern recognition, including object recognition, image segmentation, motion estimation, 3D reconstruction, and deep learning.

In this blog series, we introduce some of our research papers presented at CVPR 2024. Here is the list:

#1. Multiscale Vision Transformers Meet Bipartite Matching for Efficient Single-Stage Action Localization (Samsung AI Center - Cambridge)

#2. FFF: Fixing Flawed Foundations in Contrastive Pre-training Results in Very Strong Vision-Language Models (Samsung AI Center - Cambridge)

#3. MR-VNet: Media Restoration Using Volterra Networks (Samsung R&D Institute India-Bangalore)


Large-scale contrastive image-text pre-training remains the prevalent method for training Vision-Language Models (VLMs) [1,2]. A key aspect behind the success of such approaches is the availability of vast amounts of web-collected data consisting of image-caption pairs, where the text associated with each image is often extracted from its corresponding HTML tags. Despite the scalability of this process, the raw captions are frequently of low quality, impeding the training of VLMs and hindering their performance. Additionally, VLMs are typically trained with very large batches, with the objective of maximising the similarity between each image and its associated caption while simultaneously minimising its similarity to the rest of the samples in the batch. However, this one-to-one assumption is generally incorrect, as captions associated with other images may also correctly describe the current image. Intuitively, there are many ways of describing a given image, and it is also likely that some images in a batch share semantic similarities. As a result, spurious false negative pairs occur, making training noisy and less efficient.

In this blog post, we present a new promising solution introduced in our paper, published at the Conference on Computer Vision and Pattern Recognition (CVPR), where we tackle these two problems with two simple, yet powerful, algorithms that constitute the proposed FFF (Fixing Flawed Foundations) method. The first algorithm uses text-image, image-image, and text-text similarities to eliminate incorrectly assigned negatives and mine new true positives on-the-fly. The second algorithm employs batch text augmentation for training with multiple pseudo-captions per image within the same batch instead of the noisy raw captions. Both solutions generate multiple new positives for each training image. This effectively corrects the assumption of one-to-one correspondence, used in prior work, to one that allows many-to-many correspondences. To accommodate these many-to-many correspondences, we adopt a sigmoid-based loss [2] for training the model.

Flaws of Web-collected Datasets & Their Solutions

To better motivate the proposed FFF and its different components, let's dive together into several key observations drawn from analysing the flaws of the web-collected CC3M dataset [3]. Specifically, we highlighted and analysed the following limitations, proposing in the process new solutions to alleviate such shortcomings:

Noisy and Repetitive Original Captions: The captions in web-collected datasets are often noisy and repetitive, with many generic captions that appear frequently being associated with multiple images. As such, many captions do not provide any information about their associated images, thus reducing the overall quality of the training data. A potential solution to this issue is re-captioning or pseudo-labelling, which consists of generating new captions using an off-the-shelf captioning model such as BLIP2 [4].
Quality Issues with Generated Pseudo-Captions: When compared to the raw captions, the generated captions are more diverse and semantically relevant to their associated images. However, they still suffer from some quality issues. Specifically, since most BLIP2 models were also trained on some web-collected data, they can produce captions that are ambiguous, contain hallucinations, and exhibit stylistic biases similar to those found in the raw captions. A potential solution to this issue is the use of multiple pseudo-captions, where instead of using a single pseudo-caption per image, we propose to use multiple ones so that their aggregate contains less noise and better reflects the image content.
The Presence of False Negatives: As pointed out in the introduction, when training VLMs with large batch sizes, it is likely that multiple captions correctly describe the contents of a given image, and training each image with only its associated ground-truth introduces many false negatives during training. A potential solution to this issue is mining of new positives, which consists of finding new positives in an online manner based on image-text, image-image, and text-text similarities.

FFF: Fixing Flawed Foundations of VLMs

Now, based on the limitations and solutions presented in the previous section, we introduce the proposed FFF method for improved training of VLMs. FFF consists of three main components: fixing incorrect negatives, batch text augmentation, and the loss function. We present each component below. The overall approach is depicted in Fig. 1.

Fixing Incorrect Negatives

During training, given a batch of text and image pairs, we first compute the text and image features using our text and image encoders. We then go on to compute three similarity matrices: text-text, image-image, and image-text, obtained by taking the dot product between the L2-normalized features of each modality.

The next step is to generate positives using each one of these three similarity matrices. First, we define three sets of similarity thresholds, one for each matrix, and then use them to find new positive pairs. A given pair is considered positive if its similarity score is higher than the associated threshold. Finally, we combine the three sets of pseudo-labels into a single set by taking their union. Note that for text-text based pseudo-labelling, we also use image-text similarities to reduce the noise and avoid using generic captions as new positives.
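The mining step above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name and threshold values are made up for the example, and the extra image-text filtering applied to text-text pseudo-labels is omitted for brevity.

```python
import numpy as np

def mine_positives(img_feats, txt_feats, t_it=0.8, t_ii=0.9, t_tt=0.9):
    """Sketch of the fixing-incorrect-negatives step.

    Returns a boolean (batch, batch) matrix where entry (i, j) marks
    caption j as a positive for image i. Thresholds are illustrative.
    """
    # L2-normalise features so dot products become cosine similarities.
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)

    sim_it = img @ txt.T   # image-text similarities
    sim_ii = img @ img.T   # image-image similarities
    sim_tt = txt @ txt.T   # text-text similarities

    # Ground-truth pairs sit on the diagonal.
    pos = np.eye(len(img), dtype=bool)

    # A pair becomes a new positive when its similarity exceeds the
    # corresponding threshold; the three sets are combined by union.
    pos |= sim_it > t_it
    pos |= sim_ii > t_ii   # if images i and j match, caption j fits image i
    pos |= sim_tt > t_tt   # if captions i and j match, caption j fits image i

    return pos
```

For instance, if two images in the batch are near-duplicates, their image-image similarity exceeds the threshold and each image inherits the other's caption as an additional positive.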

Batch Text Augmentation

To improve data quality, we use BLIP2 [4], an off-the-shelf image captioner, to generate multiple pseudo-captions for each image in the training set and replace the original raw text captions. Inspired in part by [5], and to reduce the noise of the pseudo-labels, we train with all pseudo-captions as true positives within the same batch, which we call batch text augmentation.
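Conceptually, batch text augmentation replaces the usual diagonal label matrix with a block-diagonal one: each image is paired with all k of its pseudo-captions in the same batch. A small sketch, with a hypothetical helper name, under the assumption that the k captions of image i occupy consecutive columns:

```python
import numpy as np

def batch_augment_labels(batch_size, k):
    """Sketch of the label matrix under batch text augmentation.

    Each image contributes k pseudo-captions to the batch, all of
    which are treated as true positives for that image. Returns a
    boolean (batch_size, batch_size * k) positive-label matrix.
    """
    labels = np.zeros((batch_size, batch_size * k), dtype=bool)
    for i in range(batch_size):
        # Captions i*k .. (i+1)*k - 1 all describe image i.
        labels[i, i * k:(i + 1) * k] = True
    return labels
```

With k = 1 this reduces to the standard diagonal assignment used by CLIP-style training.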

Figure 1. Overall approach (b,c) combining fixing incorrect negatives with batch text augmentation, shown next to the CLIP baseline (a). The synthetic pseudo-captions are generated offline and packed as part of the dataset.

Loss Function

The standard loss used to train VLMs is the contrastive loss [6], computed over both image-text and text-image similarity. With this loss function, each image or text is assigned only to its corresponding pair. However, in our case, with multiple pseudo-captions per image and the new online mined positives, each image and text instance is assigned many positives, making the standard contrastive loss inapplicable. As such, we opt for the binary formulation based on [2]. Note that the choice of this loss was also motivated by its robustness to false negatives thanks to its negative-logit bias.
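The binary formulation treats every image-text pair in the batch as an independent binary classification problem, which naturally accommodates any number of positives per sample. A minimal NumPy sketch of a sigmoid loss in the spirit of [2], extended to a many-to-many positive mask; the function name and bias value are illustrative:

```python
import numpy as np

def multi_positive_sigmoid_loss(logits, pos_mask, bias=-10.0):
    """Sketch of a sigmoid-based loss with many-to-many positives.

    logits:   (batch, batch) scaled image-text similarity matrix.
    pos_mask: boolean matrix of positive pairs (not just the diagonal).
    bias:     negative-logit bias, initialised negative so that the
              many negatives dominate less at the start of training.
    """
    # Targets are +1 for positive pairs and -1 for negative pairs.
    targets = np.where(pos_mask, 1.0, -1.0)
    # Pairwise binary log-loss: log(1 + exp(-target * (logit + bias))),
    # computed stably via logaddexp.
    z = -targets * (logits + bias)
    loss = np.logaddexp(0.0, z)
    return loss.mean()
```

Because each pair contributes an independent term, adding extra positives from pseudo-captions or mined pairs only changes `pos_mask`, with no renormalisation over the batch as in the softmax-based contrastive loss.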

Experimental Results and Comparisons

Effect of the Different Components: To evaluate the effectiveness of each proposed component of FFF, in the table below we show the obtained zero-shot classification accuracy on ImageNet with either no pseudo-captions, a single pseudo-caption, or five pseudo-captions, with or without fixing the incorrect negatives.

Table 1. Effect of fixing incorrect negatives. Zero-shot evaluation on ImageNet in terms of Top-1 (%) accuracy. All models pretrained on CC3M dataset.

We observe consistent gains for all three cases of interest: a) when using the web-collected captions (+2.7% gain), b) when using one pseudo-caption (+3.5% improvement), and c) when using all available pseudo-captions at once (+1.8%). Overall, compared to the baseline accuracy of 18.6%, our approach improves by +14.3% (top-1 accuracy of 32.9%). The results show that our approach provides gains across all options considered.

Comparison with State-of-the-Art on Zero-Shot Classification: In the figure shown below, we present the obtained zero-shot classification accuracy on different datasets and the overall average obtained over 11 image classification datasets.

Figure 2. Comparison with the state of the art for zero-shot classification on a selected set of datasets (ImageNet, CIFAR100, Pets and SUN397), alongside aggregated results averaged over 11 datasets. Results reported in terms of Top-1 (%) accuracy.

As the results show, our approach outperforms all prior methods, improving by 6.2% in absolute terms on top of the previous best result of HiDeCLIP [7] (which benefits from a better architecture) when aggregated across 11 datasets. Notably, we set a new state-of-the-art result on ImageNet (51.1%). Finally, we significantly improve upon ALIP [8], which also makes use of synthetic captions, outperforming it by 9.1%.

Comparison with State-of-the-Art on Zero-Shot Image-Text Retrieval: To further showcase the versatility of our approach, in the table below, we present a comparison with state-of-the-art for image retrieval on Flickr30k and MSCOCO datasets.

Table 2. Comparison with state-of-the-art for zero-shot image retrieval. Results reported on Flickr30k and MS-COCO test sets. All models were pretrained on YFCC-15M dataset.

As can be observed, our approach offers significant gains across all metrics and datasets used, improving on top of the prior state-of-the-art ALIP [8] by 14.8% and 18.7% in terms of R@1 on Flickr30k for text and image retrieval, respectively. Similarly, we outperform the previous best result by 14.9% and 15.0% in terms of R@1 on MSCOCO for text and image retrieval. This highlights that our approach produces representations that capture subtle and fine-grained details.


In our CVPR work, we propose a new approach, dubbed FFF, for vision-language pre-training based on multi-positive sample pairing that fixes incorrect negatives and addresses low caption quality. The latter is tackled by a newly introduced batch text augmentation strategy, in which multiple new positive pairs are concomitantly added via synthetic re-captioning. Departing from the typical contrastive loss, to enable efficient training under an arbitrary number of positives per sample, we propose to train the model with a sigmoid loss. In the process, we highlight the crucial role of noise and caption quality in vision-language pre-training, offering an in-depth analysis. All in all, we show large improvements over the current state-of-the-art methods for both zero-shot image recognition (+6% on average over 11 datasets) and retrieval (+19% on Flickr30k and ~+15% on MSCOCO).

Link to the paper

FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models. Adrian Bulat, Yassine Ouali, Georgios Tzimiropoulos.
Conference on Computer Vision and Pattern Recognition (CVPR), 2024.


[1] Alec Radford et al.: Learning transferable visual models from natural language supervision. International conference on machine learning (ICML), 2021.

[2] Xiaohua Zhai et al.: Sigmoid Loss for Language Image Pre-Training. International Conference on Computer Vision (ICCV), 2023.

[3] Piyush Sharma et al.: Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. Association for Computational Linguistics (ACL), 2018.

[4] Junnan Li et al.: BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. International conference on machine learning (ICML), 2023.

[5] Elad Hoffer et al.: Augment Your Batch: Improving Generalization Through Instance Repetition. Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[6] Ting Chen et al.: A Simple Framework for Contrastive Learning of Visual Representations. International conference on machine learning (ICML), 2020.

[7] Shijie Geng et al.: HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention. International Conference on Learning Representations (ICLR), 2023.

[8] Kaicheng Yang et al.: ALIP: Adaptive Language-Image Pre-Training with Synthetic Caption. International Conference on Computer Vision (ICCV), 2023.