Object-Conditioned Bag of Instances for Few-Shot Personalized Instance Recognition

By Umberto Michieli Samsung R&D Institute United Kingdom
By Mete Ozay Samsung R&D Institute United Kingdom
By Jijoong Moon Samsung Research

The IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is an annual flagship conference organized by the IEEE Signal Processing Society.

ICASSP is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. Its technical program presents the latest developments in research and technology across the field and attracts thousands of professionals.

In this blog series, we introduce our research papers presented at ICASSP 2024. Here is the list:

#1. MELS-TTS: Multi-Emotion Multi-Lingual Multi-Speaker Text-To-Speech System via Disentangled Style Tokens (Samsung Research)

#2. Latent Filling: Latent Space Data Augmentation for Zero-Shot Speech Synthesis (Samsung Research)

#3. FSPEN: An Ultra-Lightweight Network for Real Time Speech Enhancement (Samsung R&D Institute China-Beijing)

#4. Enabling Device Control Planning Capabilities of Small Language Model (Samsung Research America)

#5. Dynamic Video Frame Interpolator with Integrated Difficulty Pre-Assessment (Samsung R&D Institute China-Nanjing)

#6. Object-Conditioned Bag of Instances for Few-Shot Personalized Instance Recognition (Samsung R&D Institute United Kingdom)

#7. Robust Speaker Personalisation Using Generalized Low-Rank Adaptation for Automatic Speech Recognition (Samsung R&D Institute India-Bangalore)


Introduction

Smart devices are becoming ubiquitous in everyday life, and their users demand instance-level personalized detection from the vision systems mounted on such devices [2]. For example, vacuum cleaners will soon be capable of monitoring the behaviour of users’ specific pets and of staying away from those pets that are scared by the robot’s noise [3].

Nonetheless, users do not like to provide extensive feedback. Therefore, we introduce a new task: few-shot instance-level personalization of object detection models to detect and recognize personal instances of objects (e.g., dog1 and dog2 rather than just dog). The limited availability of data distinguishes our task from previous instance-level personalization attempts [4][5], which, to the best of our knowledge, assume large amounts of labelled data and fine-tune (FT) the models through computationally expensive updates. FT-based methods inevitably fail when only few-shot samples are provided.

The desiderata of our setup are as follows:

1. Few-shot instance-level personalization of object detection models with backpropagation-free mechanisms;
2. A computationally and data-efficient framework to minimize the footprint on tiny mobile devices.

Our Framework: OBoI-AEE

The proposed system is composed of three main components, as we outline in Fig. 1 and below:

1. An object detector M_o = D_o ∘ E_o, composed of an encoder E_o and a decoder D_o (e.g., YOLOv8 [6] in our experiments), trained to recognize generic objects.
2. A mechanism that (i) downsamples the predicted object-level output to the feature-level space spanned by E_o, (ii) generates an object-level binary mask that is applied to the features extracted by the object detector encoder, and (iii) Augments the Encoded Embeddings (AEE) produced by E_o via the concatenation of high-order central statistical feature moments.
3. An Object-conditioned Bag of Instances (OBoI), implemented via an instance-level prototype-based few-shot learner such as ProtoNet [7] or SimpleShot [8], which allows instance-level personalization of standard object detectors.


Figure 1. A generic object detector (e.g., YOLOv8) is adapted to detect personal instances via our backpropagation-free Object-conditioned Bag of Instances (OBoI) approach with augmented embeddings by multi-order statistics
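To make the pipeline concrete, the masking, moment augmentation, and prototype-based recognition steps can be sketched in a few lines of NumPy. This is a minimal illustration under our own simplifying assumptions: the feature shapes, moment orders, and function names are illustrative, not the exact implementation.

```python
import numpy as np

def augment_embedding(feats, mask, orders=(2, 3, 4)):
    """AEE sketch: pool masked encoder features and concatenate
    higher-order central moments (orders are illustrative)."""
    # feats: (C, H, W) encoder feature map; mask: (H, W) object-level binary mask
    sel = feats[:, mask.astype(bool)]            # (C, N) features inside the object
    mu = sel.mean(axis=1)                        # first-order statistics (mean)
    stats = [mu]
    for k in orders:                             # central moment of order k, per channel
        stats.append(((sel - mu[:, None]) ** k).mean(axis=1))
    return np.concatenate(stats)                 # augmented embedding (AEE)

def build_bag(support_embs, support_ids):
    """One prototype per personal instance: the mean of its few-shot embeddings."""
    ids = sorted(set(support_ids))
    protos = np.stack([np.mean([e for e, i in zip(support_embs, support_ids)
                                if i == iid], axis=0) for iid in ids])
    return ids, protos

def recognize(query_emb, ids, protos):
    """Nearest-prototype (SimpleShot-style) instance recognition."""
    dists = np.linalg.norm(protos - query_emb, axis=1)
    return ids[int(np.argmin(dists))]
```

Since the moments are computed from features the detector has already extracted, the augmentation adds almost no compute, which is consistent with the backpropagation-free, on-device goals above.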

Experimental setup

Models. In the analyses, we implement the model M_o with YOLOv8, pre-train the model on MSCOCO [9] and then on samples from Open-Images-V7 (OIV7) [10] of the same object-level classes as in the personal dataset. We use the default learning parameters [6] for pre-training. We design several setups whereby we assign a few samples to the training set and split the remaining ones into testing (80%) and validation (20%) sets.

Datasets. We employ two datasets to evaluate performance:

1. CORe50 [5]: we consider a subset of 45 personal instances belonging to 9 object-level classes, acquired over 11 variable-background sequences, i.e., different domains.
2. iCubWorld-Transformations (iCWT) [11]: we consider a subset of 9 object-level classes with 10 personal instances each, acquired under 5 sequences with diverse affine transformations of the items.


On both datasets, we restrict the personalization stage to the frames being correctly labelled by YOLOv8n, maintaining a balanced number of samples per instance and per sequence.
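The few-shot split construction can be sketched as follows; the frame representation and field names are hypothetical, chosen only to illustrate the one-shot protocols evaluated below (one shot per instance from every sequence, or from a single sequence).

```python
import random
from collections import defaultdict

def one_shot_split(frames, mode="1SAS", seed=0):
    """Sketch of one-shot split construction (illustrative, not the exact code).
    frames: list of dicts with 'instance' and 'sequence' keys.
    1SAS: one training shot per instance from every sequence.
    1S1S: one training shot per instance from a single sequence."""
    rng = random.Random(seed)
    by_key = defaultdict(list)
    for f in frames:
        by_key[(f["instance"], f["sequence"])].append(f)
    seqs = sorted({f["sequence"] for f in frames})
    keep = seqs if mode == "1SAS" else seqs[:1]  # sequences providing training shots
    train, test = [], []
    for (inst, seq), group in sorted(by_key.items()):
        if seq in keep:
            pick = rng.randrange(len(group))     # one shot for this instance/sequence
            train.append(group[pick])
            test.extend(g for j, g in enumerate(group) if j != pick)
        else:
            test.extend(group)                   # whole sequence held out for testing
    return train, test
```

Keeping the per-instance, per-sequence groups balanced before sampling mirrors the balanced selection described above.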

Metrics. We compute the instance recognition accuracy averaged within each object-level class (Acc_o, %) and averaged over all instances (Acc_i, %) on the test set. We define the relative gain between two methods obtaining Acc_(i,1) and Acc_(i,2) (with Acc_(i,2) > Acc_(i,1)) as:

Δ≔(Acc_(i,2)-Acc_(i,1))/(Acc_(i,1))
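As a quick worked example, the relative gain can be computed as:

```python
def relative_gain(acc_1, acc_2):
    """Relative gain between a baseline accuracy acc_1 and an improved acc_2."""
    assert acc_2 > acc_1 > 0
    return (acc_2 - acc_1) / acc_1

# e.g., improving Acc_i from 60% to 75% corresponds to a 25% relative gain
print(f"{relative_gain(60.0, 75.0):.0%}")  # → 25%
```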

Our Results

Unless otherwise stated, we report all results on YOLOv8n, the most suitable variant for deployed applications, and in the case of 2 instances per object-level category.

Same Domain. First, we consider the scenario where we have 1-Shot from All Sequences (1SAS), so the same domains are seen during few-shot training and testing. The results are shown in Table 1. We observe that gradient-based fine-tuning methods (e.g., FT) are not effective and obtain results comparable to a random classifier (lower bound). OBoI via PFSL methods such as SimpleShot and ProtoNet shows large gains compared to FT by learning a metric space from the extracted features. In both cases, augmenting the embeddings via our multi-order statistics boosts the recognition accuracy significantly, especially in the presence of multiple instances per object. Remarkably, we can personalize YOLOv8n to achieve 77.08% Acc_i when detecting 18 personal instances from just a few samples and via a backpropagation-free approach, assuming that a correct object-level classification and bounding box regression are output by the detection head. Figure 2a shows that our proposed solution consistently improves or achieves comparable results on every object class.

Table 1. Acc_i of OBoIs on 1SAS (same domain)

Other Domain. We design a more realistic, yet challenging, setup considering 1-Shot from 1 Sequence (1S1S) during training and all remaining samples during testing, so the model experiences different domains at training and testing time. Table 2 summarizes the main results. Accuracy is generally lower than in the 1SAS case due to fewer samples being used during training and the different domain distributions between training and testing. Our method significantly improves performance in every case, even in the presence of a domain shift, and especially with multiple instances per object category. We argue that the gain attained by our approach is lower than in the previous setup due to the difficulty of reliably matching multi-order statistics between a single input sample from a single domain and several target samples from several domains. Figure 2b reports Acc_o; as in the previous case, we confirm that our solution obtains robust results across most of the classes.

Table 2. Acc_i of OBoIs on 1S1S (other domain)

Figure 2. Acc_i and per-object Acc_o for OBoIs via ProtoNet. EE: encoder embeddings, AEE: augmented EE (ours)

Variable training shots. Figure 3 shows that our AEE improves personal recognition accuracy regardless of the number of available training samples (i.e., shots).

Figure 3. Acc_i at variable shots. Samples are drawn randomly from each sequence. PN: ProtoNet, SS: SimpleShot

Other YOLOv8 sizes are evaluated in Table 3. Larger YOLOv8 models can improve detection performance, and this correlates with the personal instance recognition accuracy. The improvement of larger YOLOv8 models comes at the cost of a significantly larger model size and slower inference: YOLOv8x improves personal recognition by about 25% compared to YOLOv8n, while having about 22× larger size and 3.6× slower inference. The final choice depends on the hardware specifications of the target devices.

Table 3. Ablation on YOLOv8 models. General object detection results are computed on the subset from OIV7. Personal instance recognition is evaluated on the subset from CORe50

The computational inference time of our AEE on top of the OBoI with ProtoNet increases by as little as 0.8%, making our method lightweight with a nearly negligible impact.

Results on iCWT are shown in Table 4 against the strongest baseline, ProtoNet. Our approach exhibits robust gains across all setups, ranging from 18 to 90 instances.

Table 4. Acc_i on iCWT. PN: ProtoNet

Conclusions

We proposed a new task: few-shot instance-level personalization of object detection models to localize and recognize personal objects.

We proposed a new framework: OBoI-AEE, a backpropagation-free system that is data- and computational-efficient.

We believe that this setup and our method could pave the way to personal instance-level detection and could stimulate future research and applications.

Our full paper is available at: https://arxiv.org/abs/2404.01397

Bibliography

[1] U. Michieli, J. Moon, D. Kim and M. Ozay, "Object-conditioned Bag Of Instances For Few-shot Personalized Instance Recognition," in IEEE ICASSP, 2024.

[2] N. Arora, D. Ensslen, L. Fiedler, W. Liu, K. Robinson, E. Stein and G. Schüler, "The value of getting personalization right - or wrong - is multiplying," in McKinsey & Company, 2021.

[3] "An Intelligent At-Home Helper – How the Bespoke Jet Bot™ AI+ Takes Care of Your Pet When You’re Away," 2023.

[4] R. Camoriano, G. Pasquale, C. Ciliberto, L. Natale, L. Rosasco and G. Metta, "Incremental robot learning of new objects with fixed update time," in IEEE ICRA, 2017.

[5] V. Lomonaco and D. Maltoni, "CORe50: a new dataset and benchmark for continuous object recognition," in CoRL, 2017.

[6] Ultralytics, "Yolov8," 2023.

[7] J. Snell, K. Swersky and R. Zemel, "Prototypical networks for few-shot learning," in NeurIPS, 2017.

[8] Y. Wang, W.-L. Chao, K. Q. Weinberger and L. van der Maaten, "SimpleShot: Revisiting nearest-neighbor classification for few-shot learning," in arXiv:1911.04623, 2019.

Link to the paper

https://ieeexplore.ieee.org/document/10446073