
[CVPR 2022 Series #5] P>M>F: The Pre-Training, Meta-Training and Fine-Tuning Pipeline for Few-Shot Learning

By Shell Xu Hu, Samsung AI Center - Cambridge
By Da Li, Samsung AI Center - Cambridge
By Jan Stühmer, Samsung AI Center - Cambridge
By Minyoung Kim, Samsung AI Center - Cambridge
By Timothy Hospedales, Samsung AI Center - Cambridge

The Computer Vision and Pattern Recognition Conference (CVPR) is a world-renowned international Artificial Intelligence (AI) conference co-hosted by the Institute of Electrical and Electronics Engineers (IEEE) and the Computer Vision Foundation (CVF) which has been running since 1983. CVPR is widely considered to be one of the three most significant international conferences in the field of computer vision, alongside the International Conference on Computer Vision (ICCV) and the European Conference on Computer Vision (ECCV).

This year, Samsung Research's R&D centers around the world present a total of 20 papers at CVPR 2022. In this relay series, we introduce 6 of these research papers, summarized below.

- Part 1 : Probabilistic Procedure Planning in Instructional Videos (by Samsung AI Center – Toronto)

- Part 2 : Day-to-Night Image Synthesis for Training Nighttime Neural ISPs (by Samsung AI Center – Toronto)

- Part 3 : GP2: A Training Scheme for 3D Geometry-Preserving and General-Purpose Depth Estimation
   (by Samsung AI Center - Moscow)

- Part 4 : Stereo Magnification with Multi-Layer Images (by Samsung AI Center - Moscow)

- Part 5 : P>M>F: The Pre-Training, Meta-Training and Fine-Tuning Pipeline for Few-Shot Learning
   (by Samsung AI Center - Cambridge)

- Part 6 : Gaussian Process Modeling of Approximate Inference Errors for Variational Autoencoders
  (by Samsung AI Center - Cambridge)


About Samsung AI Center – Cambridge

Many technical challenges must be overcome to provide an excellent AI experience to users of Samsung devices. The Cambridge AI Center focuses on multimodal AI, data-efficient AI, and on-device AI technology that makes AI models work efficiently and robustly on Samsung's devices.

First, handling multimodality is essential for a natural interaction experience with users. Second, AI models need to be updated from less training data to provide users with customized services and experiences. Lastly, these experiences should achieve optimal performance on devices with a wide range of specifications, and at the same time users expect the AI models to remain consistently good across a wide variety of daily environments.

At this year's CVPR, the Cambridge AI Center has published papers on accelerating the inference process of VAEs and on improving the accuracy of few-shot learning. In these papers, researchers propose new ideas for deploying efficient AI models on Samsung devices and for increasing the accuracy of AI models in user environments where only a few labeled examples are available.

Through future research, we will continue to develop technologies that improve the efficiency and usability of AI models for Samsung devices and services.

Motivation

How different are few-shot learning (FSL) and classical supervised learning? They are indeed very different in the sense of classical generalization theory, but we would like to argue that they are not that different in practice. While we do need millions of examples to build up a good feature representation, which holds in general for both learning paradigms, once we have a sensible feature representation the task-specific model can be pinned down with just a few shots.

Bearing this in mind, we propose a simple pipeline for few-shot image classification which leverages recent advances in self-supervised learning [1] and foundation models [2]. Our pipeline, pre-training → meta-training → fine-tuning (P>M>F, shown in Figure 1), focuses on how to build on top of a self-supervised pre-trained feature network (a.k.a. feature backbone) with the simplest meta-training algorithm and task-specific adaptation.

Apart from its simplicity, the pipeline is advocated because (i) using foundation models amortizes the carbon cost (they are trained once and used by many), and (ii) it may actually broaden access to the state of the art by allowing downstream research and applications to be done by more stakeholders, given that the up-front pre-training cost has already been paid.

Figure 1.  A schematic of our pipeline. Following the red arrows, the pre-trained class-agnostic feature backbone is turned into a generic one, which is then personalized differently with a few shots (plus data augmentation) on different tasks.

Self-Supervised Pre-Training

The biggest challenge in few-shot learning is to learn the feature backbone in a class-agnostic manner, such that it can be applied to unseen classes with fast adaptation. The current conventional wisdom achieves such out-of-distribution (OOD) generalization through a learning-to-learn, or meta-learning, scheme.

However, meta-learning is often hard to scale for various reasons (e.g., the higher-order gradient issue of MAML [3]). An alternative scheme has emerged as self-supervised learning progresses, which is valid because many self-supervised losses (e.g., the DINO loss [4]) and contrastive losses (e.g., the MoCo loss [5]) are class-agnostic by default.

Although self-supervised learning does not explicitly seek OOD generalization, the learned feature representation is often very generic. As the example in Figure 2 shows, a DINO pre-trained vision transformer (ViT) can be turned into an object detector for novel objects by solving a Normalized Cut optimization, without further training [6].

Figure 2.  An example showing that a DINO self-supervised ViT is capable of recognizing novel objects (note that Pokémon is not included in ImageNet, the training set of DINO). These figures are taken from [6] and show that local features associated with image patches learn to group semantically without semantic label supervision.

Given that self-supervised learning based methods yield surprising results on many computer vision problems [1], we are interested in whether self-supervised learning can be adopted to form a simple pipeline for few-shot learning. If so, how does it perform compared to its meta-learning counterpart? Can we combine the two schemes in a simple way?

To answer these questions, we evaluate the pre-training regime (including algorithm and dataset) as well as the network architecture on three few-shot learning benchmarks: Meta-Dataset (MD), miniImageNet (miniIN), and CIFAR-FS, where the average accuracy is reported over various-way-various-shot tasks for MD and 5-way-5-shot tasks for miniIN and CIFAR-FS. We take the ProtoNet (nearest-centroid) classifier [7] as the standard approach for meta-testing throughout, and compare the different training configurations listed in Table 1.

Table 1.  The impact of architecture, learning algorithm and dataset on downstream few-shot learning performance. Benchmarks: Meta-Dataset (MD), miniImageNet (miniIN) and CIFAR-FS. Pre-training options: DINO on the ImageNet1k (IN1K) dataset, CLIP on the YFCC100M dataset, BEiT on the ImageNet21k (IN21K) dataset, as well as an unfair supervised pre-training (Sup.) on IN1K as a sanity check. For all configurations, we use the ProtoNet (PN) classifier for meta-testing, which requires only the feature backbone. We can also apply ProtoNet in meta-training to update the feature backbone (more details in the next section).

From the results in Table 1, we can draw the following conclusions to support the introduction of self-supervised pre-training to the FSL pipeline:

1. For a strong pre-training regime, such as DINO, the unsupervised pre-trained backbone (ID=2) performs favorably compared to fully supervised meta-trained backbones (ID=9, 10). This shows the capability of pre-training to learn class-agnostic feature representation.

2. The standard meta-training recipe does not work well for ViT (see ID=6). This somewhat verifies our intuition that performing meta-training from scratch may be difficult.

3. Pre-training can be combined with meta-training to provide a significant improvement across the board compared to the conventional few-shot learning pipelines (see ID=4 vs. 6 and 7 vs. 9).

4. As a sanity check, self-supervised pre-training performs as well as or even better than supervised pre-training (see ID=0 vs. 1 and 2 vs. 3 and 4 vs. 5 and 7 vs. 8). This is noteworthy because some classes in meta-testing also appear in pre-training (e.g., miniImageNet is a subset of ImageNet); in this case, if supervision is available, meta-test tasks boil down to supervised classification on seen classes, which almost gives the upper bound of the performance.

Remark: Although self-supervised pre-training exhibits the potential to boost few-shot learning performance, it makes the pipeline incomparable to previous meta-learning algorithms, as class overlap between pre-training and meta-testing may exist. From a practical point of view, however, this class overlap issue is ubiquitous, and it should not prevent us from benchmarking the capability to quickly construct a classifier from very few labels.

Meta-Training

Given the pre-trained feature backbone, we are ready to deploy the backbone to any K-way-N-shot task using the ProtoNet classifier. This means the training of the backbone is completely unsupervised, except that we use the labeled support set to construct centroids for the ProtoNet classifier when deploying the backbone to a task. If some tasks come with a labeled query set, we can use these labeled tasks to further update the backbone and additionally train complementary meta-models (e.g., the synthetic gradient generator in [8]).
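To make this concrete, below is a minimal PyTorch-style sketch of deploying a frozen backbone to a K-way-N-shot task with a nearest-centroid classifier; this is an illustration rather than our exact implementation, and the cosine-similarity scoring and helper names are illustrative choices.

```python
# Minimal sketch of ProtoNet deployment (illustrative, not our exact code):
# build one centroid per class from the labeled support set, then classify
# queries by similarity to the nearest centroid.
import torch
import torch.nn.functional as F

@torch.no_grad()
def protonet_predict(backbone, support_x, support_y, query_x, num_classes):
    z_s = F.normalize(backbone(support_x), dim=-1)   # support embeddings [K*N, D]
    z_q = F.normalize(backbone(query_x), dim=-1)     # query embeddings   [Q, D]
    # One centroid (prototype) per class: the mean of its support embeddings.
    protos = torch.stack([z_s[support_y == c].mean(0) for c in range(num_classes)])
    protos = F.normalize(protos, dim=-1)             # [K, D]
    logits = z_q @ protos.t()                        # similarity to each centroid [Q, K]
    return logits.argmax(dim=-1)                     # predicted class per query
```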

We call this step meta-training. In the case of ProtoNet, this step is essential. As we can see from the t-SNE plots in Figure 3, DINO pre-training yields a high-quality feature representation for novel tasks -- many semantic clusters have already emerged before any semantic supervision is given, but the margins between clusters are still random. By meta-training the backbone with ProtoNet classification, tighter clusters are formed with clear margins between them.

Figure 3.  Comparison of feature representation in t-SNE plot for pre-trained backbone with and without meta-training.
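To make the meta-training update concrete, here is a minimal sketch of one ProtoNet episode, assuming an episodic sampler and an optimizer over the backbone parameters are already set up; the temperature `tau` is an assumed hyper-parameter, not necessarily the value we use.

```python
# Sketch of one ProtoNet meta-training step on a labeled episode (illustrative).
import torch
import torch.nn.functional as F

def protonet_meta_step(backbone, optimizer, support_x, support_y,
                       query_x, query_y, num_classes, tau=0.1):
    z_s = F.normalize(backbone(support_x), dim=-1)
    z_q = F.normalize(backbone(query_x), dim=-1)
    protos = F.normalize(
        torch.stack([z_s[support_y == c].mean(0) for c in range(num_classes)]), dim=-1)
    # Cross-entropy on the labeled query set drives the backbone update,
    # which tightens the clusters seen in Figure 3.
    loss = F.cross_entropy(z_q @ protos.t() / tau, query_y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```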

We also explore how other meta-learning algorithms fit our pipeline. We conduct the same experiments on miniImageNet and CIFAR-FS with two SOTA meta-learning algorithms: MetaOptNet [9] and MetaQDA [10].

Table 2.  The impact of architecture and pre-training on state-of-the-art few-shot learners: MetaQDA, MetaOptNet.

From the results in Table 2, we can see that:

1. MetaQDA (ID=3) and MetaOptNet (ID=5) do improve on direct feature transfer (ID=0) and on the simple ResNet features they were initially evaluated with (see ID=5 vs. 4, 3 vs. 2).

2. With the stronger features, they are outperformed by the simpler ProtoNet (see ID=3 vs. 5 vs. 1).

This suggests that previous conclusions about comparative meta-learner performance may need re-evaluating in the new regime of foundation models.

Meta-Testing with Fine-Tuning

The last step in our pipeline is to fine-tune the backbone at model deployment. This is an important step, since the model may be deployed to an unseen domain, where the learned feature representation may fail to generalize due to a substantial shift in the data distribution. To this end, we propose a simple fine-tuning algorithm. Suppose we observe a support set for a particular task, consisting of a few labeled examples. The idea is to fit the support set using the feature backbone, with centroids derived from data augmentation of the support set. PyTorch-like pseudo code of the backbone update is shown below.
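(The sketch below is a minimal illustration of this idea rather than the exact pseudo code from the paper; the augmentation transform `augment`, the number of steps, and the temperature `tau` are placeholder choices.)

```python
# Fine-tune the backbone on a single task's support set (illustrative sketch):
# centroids are built from an augmented copy of the support set, and the
# un-augmented support set is fitted against them with cross-entropy.
import torch
import torch.nn.functional as F

def finetune_backbone(backbone, support_x, support_y, num_classes,
                      augment, lr, num_steps=50, tau=0.1):
    optimizer = torch.optim.Adam(backbone.parameters(), lr=lr)
    for _ in range(num_steps):
        # Centroids from a randomly augmented view of the support set.
        z_aug = F.normalize(backbone(augment(support_x)), dim=-1)
        protos = F.normalize(
            torch.stack([z_aug[support_y == c].mean(0) for c in range(num_classes)]), dim=-1)
        # Fit the original support set to its own class centroids.
        z = F.normalize(backbone(support_x), dim=-1)
        loss = F.cross_entropy(z @ protos.t() / tau, support_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return backbone
```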

We observe that the fine-tuning performance is relatively sensitive to the choice of learning rate (lr). However, the existing few-shot learning problem formulation does not offer a per-task validation set with which to choose the best learning rate for fine-tuning. In practice, this is not a big problem when deploying the model to a particular task, because we can always annotate a few more examples as a validation set. We also notice that the best learning rates for different tasks within a domain are almost the same.

This motivates us to select the learning rate in a domain-wise fashion: we sample 5 validation tasks from each domain and pick the learning rate from {0, 0.0001, 0.001, 0.01} that yields the best performance.
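A minimal sketch of this domain-wise search, where `eval_with_finetune(task, lr)` is a hypothetical helper that returns the accuracy obtained after fine-tuning with a given learning rate (lr = 0 means fine-tuning is effectively skipped):

```python
# Pick the learning rate that performs best on a handful of validation tasks
# sampled from the target domain (illustrative sketch).
def select_lr(eval_with_finetune, val_tasks, candidates=(0.0, 0.0001, 0.001, 0.01)):
    scores = {lr: sum(eval_with_finetune(task, lr) for task in val_tasks) / len(val_tasks)
              for lr in candidates}
    return max(scores, key=scores.get)
```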

To validate the effectiveness of our fine-tuning algorithm, we conduct an experiment (see Figure 4) on Meta-Dataset to compare the results with and without fine-tuning.

Figure 4.  The impact of fine-tuning (FT) during meta-test on Meta-Dataset. Two train-test split options of Meta-Dataset: a) IN: meta-train on ImageNet (INet) only, and b) MD: meta-train on 8 domains, with Sign and COCO held out for meta-test. We compare DINO + PN (IN) with DINO + PN (IN) + FT to see whether FT leads to a performance boost, and DINO + PN (IN) + FT with DINO + PN (MD) to see the gap between FT and meta-training for some domains.

It is clear that, for domains in which images have very different statistics from those of ImageNet, such as Omniglot and QuickDraw, or for domains in which class semantics are distinct from those in ImageNet, such as Aircraft and Sign, the meta-trained backbone performs poorly (compared to the accuracy on INet) and fine-tuning indeed mitigates the performance drop (e.g., compare the orange and green bars on Omniglot and Aircraft).

Hence, we recommend always turning on fine-tuning when deploying the model, since we can always collect a small validation set to decide whether a positive learning rate is needed.

Comparison to State of the Art

Now we compare our P>M>F pipeline with the prior state of the art. Results are listed in Tables 3 and 4. We would like to emphasize that our results are actually not comparable to much prior work in terms of architecture and the use of external data. We draw this comparison to demonstrate how simple changes compare against 5 years of intensive research on few-shot learning.

Table 3.  Meta-Dataset: Comparison with SOTA algorithms. Please check our arXiv paper for the citations.

Table 4.  Cross-domain few-shot learning: Comparison with SOTA algorithms. Please check our arXiv paper for the citations.

"Zero-Shot" Demo

To test the OOD generalization of our pipeline, we build a demo that classifies a user-uploaded image into arbitrary user-defined classes. The idea is inspired by CLIP zero-shot classification [11], but we do not use a language model to generate class centroids. Instead, we rely on Google Image Search to retrieve support images given the user-defined class names. The demo is hosted at https://huggingface.co/spaces/hushell/pmf_with_gis.
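The demo logic roughly follows the sketch below; this is only an illustration of the idea, not the demo's actual code. `search_images(name, n)` is a hypothetical stand-in for the Google Image Search retrieval step, and `protonet_predict` is the nearest-centroid classifier sketched earlier.

```python
# Illustrative sketch of the demo's idea: user-defined class names become a
# support set via image retrieval, then the uploaded image is classified by
# nearest centroid. `search_images(name, n)` is a hypothetical helper that
# returns n preprocessed image tensors for a class name.
import torch

def classify_with_retrieved_support(backbone, query_image, class_names,
                                    search_images, n_support=10):
    support_x, support_y = [], []
    for label, name in enumerate(class_names):
        imgs = search_images(name, n_support)     # [n_support, C, H, W]
        support_x.append(imgs)
        support_y += [label] * len(imgs)
    support_x = torch.cat(support_x)
    support_y = torch.tensor(support_y)
    pred = protonet_predict(backbone, support_x, support_y,
                            query_image.unsqueeze(0), len(class_names))
    return class_names[pred.item()]
```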

Figure 5.  P>M>F "zero-shot" Gradio demo.


Link to the paper

https://arxiv.org/abs/2204.07305

References

[1] Ericsson, Linus, et al. "Self-Supervised Representation Learning: Introduction, advances, and challenges." IEEE Signal Processing Magazine 39.3 (2022): 42-62.

[2] Bommasani, Rishi, et al. "On the opportunities and risks of foundation models." arXiv preprint arXiv:2108.07258 (2021).

[3] Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." International Conference on Machine Learning. PMLR, 2017.

[4] Caron, Mathilde, et al. "Emerging properties in self-supervised vision transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

[5] He, Kaiming, et al. "Momentum contrast for unsupervised visual representation learning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

[6] Wang, Yangtao, et al. "Self-supervised transformers for unsupervised object discovery using normalized cut." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[7] Snell, Jake, Kevin Swersky, and Richard Zemel. "Prototypical networks for few-shot learning." Advances in neural information processing systems 30 (2017).

[8] Hu, Shell Xu, et al. "Empirical bayes transductive meta-learning with synthetic gradients." arXiv preprint arXiv:2004.12696 (2020).

[9] Lee, Kwonjoon, et al. "Meta-learning with differentiable convex optimization." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.

[10] Zhang, Xueting, et al. "Shallow bayesian meta learning for real-world few-shot recognition." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

[11] Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021.