Multiscale Vision Transformers Meet Bipartite Matching for Efficient Single-Stage Action Localization

By Enrique Sanchez Samsung AI Center - Cambridge
By Georgios Tzimiropoulos Samsung AI Center - Cambridge

The Conference on Computer Vision and Pattern Recognition (CVPR) is an annual conference on computer vision and pattern recognition, which is regarded as one of the most important conferences in its field.

CVPR covers a wide range of topics related to computer vision and pattern recognition, such as object recognition, image segmentation, motion estimation, 3D reconstruction, and deep learning.

In this blog series, we introduce some of our research papers presented at CVPR 2024. Here is the list:

#1. Multiscale Vision Transformers Meet Bipartite Matching for Efficient Single-Stage Action Localization (Samsung AI Center - Cambridge)

#2. FFF: Fixing Flawed Foundations in Contrastive Pre-training Results in Very Strong Vision-Language Models (Samsung AI Center - Cambridge)

#3. MR-VNet: Media Restoration Using Volterra Networks (Samsung R&D Institute India-Bangalore)


Action Localization is a challenging task that combines person detection and action recognition. Given a clip of 1 or 2 seconds, the goal of Action Localization is to detect the bounding boxes of the persons appearing in the central frame, and to classify which actions each of them is performing.

This task shares similarities with object detection, with the particularity that the detected objects are always people and the classes correspond to various actions that can sometimes co-occur. It poses two additional challenges: actions require temporal reasoning, and a person detector can contribute false positives if a given person is not performing any of the target actions. For this reason, we often distinguish between foreground people, who perform at least one of the actions of interest, and background people, who perform none. The foreground people are referred to as actors.

Figure 1. a) Traditional two-stage methods focus on developing strong vision transformers, which are applied to Action Localization by outsourcing the bounding box detections to an external detector. ROI Align is applied to the output of the transformer using the detected bounding boxes, and the pooled features are forwarded to an MLP that returns the class predictions. b) Recent approaches in one-stage Action Localization leverage DETR's capacity to model both the bounding boxes and the action classes. A video backbone produces strong spatio-temporal features that are handled by a DETR transformer encoder. A set of learnable queries is then used by a DETR transformer decoder to produce the final outputs.

Most works tackling the problem of Action Localization (Fig. 1) resort to an external person detector finetuned to filter out background persons, and focus on the classification task alone. However, such approaches are expensive, as they require two independent networks to run. In particular, extracting the bounding boxes first requires a network operating at high resolution. To overcome these limitations, a relatively recent research trend advocates training joint networks that perform both tasks together in an end-to-end fashion, i.e. networks that are "single-stage".

These recent works build on adapting the DETR architecture, well established for the task of object detection, to handle video data so as to perform the classification task over the time domain. DETR is a transformer encoder-decoder architecture with learnable queries, each of which is assigned to a unique output prediction; the predictions are matched one-to-one to the ground-truth instances through a Bipartite Matching or Hungarian Loss [3,4]. While applying DETR to the domain of Action Localization has shown promising results, there is still room for further optimization, as well as for better modelling of both subtasks.

Our solution

In this work, we observe that the architecture can be further simplified by introducing the following innovations:

      1. Train the Vision Transformer using the Bipartite Matching Loss in a similar fashion to that of DETR. Rather than using learnable queries and a DETR encoder-decoder, we propose to treat the output embeddings of the Vision Transformer as those that are assigned to ground-truth triplets using the aforementioned Bipartite Matching Loss.
      2. Divide and conquer: A Video-based Vision Transformer produces a set of spatio-temporal embeddings, which can be arranged independently for the task of detection and for the task of classification, as long as a one-to-one correspondence is kept between the two rearrangements to form the final outputs that will be matched to the ground-truth instances using a Bipartite Matching Loss. This allows us to tackle two seemingly distant tasks: detecting the actors, which requires only spatial reasoning over the central frame, and classifying actions, which requires spatio-temporal reasoning and hence benefits from the temporal output embeddings.


The goal of Action Localization is to detect the actors’ bounding boxes and to classify their corresponding actions in a video clip centred at a particular frame. Because not every person in a clip is performing one of the actions of interest, we distinguish between a person and an actor. A bounding box detector will produce a tuple <b, p(α)> indicating the bounding box coordinates b as well as the probability p(α) that the bounding box belongs to an actor rather than to the background. A classifier will produce, for each of these bounding boxes, a set of C outputs p(a) corresponding to the class probabilities. The full prediction is composed of triplets of the form <b, p(α), p(a)>.

Multi-scale Vision Transformers (MViT, [1,2]) are architectures that apply a cascade of self-attention blocks over a tokenized video composed of L = THW patches, with T referring to the number of temporal patches and H, W to the number of spatial ones. The self-attention blocks operate over the L patches, producing a corresponding set of output embeddings. The output of the transformer is therefore a list of L embeddings. Traditional two-stage methods rearrange these L tokens into the corresponding spatio-temporal dimensions. Having deployed an external bounding box detector to produce the bounding boxes b for the actors in the central frame, they apply ROI-based classification on the spatio-temporal rearrangement of the L embeddings, producing for each box the corresponding class probabilities p(a). These methods, while exhibiting high performance, are computationally expensive.
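To make the token bookkeeping concrete, the following minimal sketch shows how a flat list of L = THW embeddings relates to its spatio-temporal rearrangement. The sizes (T = 8, H = W = 16, and a tiny channel dimension D = 4) and the row-major (t, h, w) ordering are assumptions made for illustration; NumPy is used only as a stand-in for the actual framework.

```python
import numpy as np

# Hypothetical sizes: T temporal patches, H x W spatial patches, D channels.
T, H, W, D = 8, 16, 16, 4
L = T * H * W  # total number of output tokens

# Stand-in for the transformer output: one D-dimensional embedding per token.
tokens = np.arange(L * D, dtype=np.float32).reshape(L, D)

# Assumption: tokens are stored in (t, h, w) row-major order.
volume = tokens.reshape(T, H, W, D)


def token_index(t, h, w):
    """Flat index of the token at temporal position t and spatial position (h, w)."""
    return (t * H + h) * W + w


# The rearranged volume and the flat list address the same embeddings.
assert np.array_equal(volume[3, 5, 7], tokens[token_index(3, 5, 7)])
print(L, volume.shape)
```

A two-stage method would pool features from `volume` inside each externally detected box, whereas the flat `tokens` list is what a per-token head can consume directly.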

In this work, we propose to apply dedicated MLP heads to each of the L output tokens, directly producing a set of L triplets of the form <b, p(α), p(a)>. Each of the three MLP heads takes the output embeddings and produces one of the b, p(α), and p(a) outputs, respectively. Note that these MLPs add negligible complexity to the inference. Because the model produces a fixed set of L outputs, we first need to "match" them to the actual actors in the clip. During inference, this is easily accomplished by directly using p(α), discarding all bounding boxes whose confidence is below a threshold. The outputs already form triplets, so neither an external detection network nor a convoluted head is needed!
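The per-token heads and the threshold-based selection can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the paper: each head is reduced to a single random linear layer instead of an MLP, boxes are assumed to be four normalized coordinates, action probabilities are assumed to be independent sigmoids (actions can co-occur), and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def predict_triplets(embeddings, w_box, w_actor, w_action):
    """Apply three independent heads to every token embedding.

    Assumption: each head is a single linear layer standing in for an MLP.
    """
    boxes = sigmoid(embeddings @ w_box)               # (L, 4) normalized box b
    actor_prob = sigmoid(embeddings @ w_actor)[:, 0]  # (L,)   p(alpha)
    action_prob = sigmoid(embeddings @ w_action)      # (L, C) p(a), multi-label
    return boxes, actor_prob, action_prob


def select_actors(boxes, actor_prob, action_prob, threshold=0.5):
    """Keep only the triplets whose actor confidence exceeds the threshold."""
    keep = actor_prob > threshold
    return boxes[keep], actor_prob[keep], action_prob[keep]


# Hypothetical sizes: 2048 tokens, 32 channels, 5 action classes.
num_tokens, dim, num_classes = 2048, 32, 5
embeddings = rng.normal(size=(num_tokens, dim))
w_box = rng.normal(size=(dim, 4))
w_actor = rng.normal(size=(dim, 1))
w_action = rng.normal(size=(dim, num_classes))

boxes, actor_prob, action_prob = predict_triplets(embeddings, w_box, w_actor, w_action)
kept_boxes, kept_actor, kept_actions = select_actors(boxes, actor_prob, action_prob)
print(kept_boxes.shape, kept_actions.shape)
```

With untrained random weights roughly half of the tokens pass the threshold; after training, only the few tokens matched to actors are expected to be confident.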

To train this model, we propose the use of the standard Bipartite Matching Loss, also referred to as Hungarian Loss. This loss is indeed the core of DETR models. However, contrary to DETR, we do not use learnable queries with an encoder-decoder architecture that transforms them, through self-attention and cross-attention to the backbone features, into the output triplets. Instead, we let the visual tokens act as the queries, and we directly match their corresponding output embeddings to the ground truths. In an oversimplified way, we could say that our method bridges Vision Transformers with the DETR training recipe! Fig. 2 illustrates our proposed pipeline. Note that while training will favour each patch producing an output that relates to nearby visual content, that need not be the case: each token can carry any triplet.
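The matching step can be illustrated with a toy example. The sketch below finds the minimum-cost one-to-one assignment between token predictions and ground-truth instances by exhaustive search, which yields the same optimum as the Hungarian algorithm but is only viable for tiny inputs; in practice one would use a Hungarian solver such as SciPy's `linear_sum_assignment`. The cost function (actor confidence, action probability, and L1 box distance, all equally weighted) is an assumption for illustration and not necessarily the cost used in the paper.

```python
from itertools import permutations


def l1(box_a, box_b):
    """L1 distance between two boxes given as coordinate tuples."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_cost(pred, target):
    """Hypothetical matching cost between one predicted triplet and one ground truth.

    The cost is low when the token is confident it is an actor, assigns high
    probability to the target's actions, and predicts a nearby box.
    """
    box, actor_prob, action_prob = pred
    target_box, target_actions = target
    class_cost = -sum(action_prob[c] for c in target_actions)
    return -actor_prob + class_cost + l1(box, target_box)


def bipartite_match(preds, targets):
    """Return a tuple whose j-th entry is the prediction index matched to target j.

    Exhaustive search over all one-to-one assignments (toy sizes only).
    """
    return min(
        permutations(range(len(preds)), len(targets)),
        key=lambda perm: sum(match_cost(preds[i], targets[j]) for j, i in enumerate(perm)),
    )


# Three token predictions <b, p(alpha), p(a)> and two ground-truth actors.
preds = [
    ((0.1, 0.1, 0.4, 0.9), 0.9, [0.8, 0.1, 0.1]),
    ((0.5, 0.5, 0.6, 0.6), 0.2, [0.1, 0.1, 0.1]),
    ((0.6, 0.2, 0.9, 0.8), 0.8, [0.1, 0.7, 0.2]),
]
targets = [
    ((0.6, 0.2, 0.9, 0.8), [1]),  # box and list of active action classes
    ((0.1, 0.1, 0.4, 0.9), [0]),
]

assignment = bipartite_match(preds, targets)
print(assignment)  # (2, 0): target 0 -> prediction 2, target 1 -> prediction 0
```

Predictions left unmatched (here, prediction 1) are treated as background when computing the loss, which is what teaches most tokens to output a low p(α).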

Figure 2. Our method builds only a vision transformer, which is trained with a bipartite matching loss between the individual predictions given by the output spatio-temporal tokens and the ground-truth bounding boxes and classes. Our method needs neither learnable queries nor a DETR decoder, and can combine the backbone and the DETR encoder into a single architecture.

Bearing these remarks in mind, we ask ourselves the following question: do we need all tokens (in a Vision Transformer these can be 2048)? If not, which tokens would be a better fit for the task of Action Localization? Note that the tasks of actor detection and action classification are opposed by definition: while detecting actors requires only information from the central frame, action recognition benefits from temporal support. We can therefore select which tokens are of better use to the detection and recognition subtasks. Because the outputs of each triplet are produced by dedicated MLPs, a different set of tokens can be used for the task of detection and for the task of classification. The only technical limitation is that the outputs of each head need to be in one-to-one correspondence with those from the other heads, to form the triplets that will be matched to the ground-truth instances.

This allows us to "play" with the output embeddings and select only as many as necessary from the temporal and the spatial domains. In other words, as we are not limited to using the same output tokens for each task, we can consider e.g. the HW tokens corresponding to the central frame for detection, and apply temporal pooling to the whole output volume to produce a different set of HW tokens from which to compute the action probabilities p(a). There will be the same number of tokens for both subtasks, i.e. HW; however, the tokens used for bounding box detection will contain information related to the spatial domain in the central frame, whereas those used to classify the corresponding actions will contain the temporal information that is necessary for better action understanding. We observe that HW might be too few tokens, and we propose an alternative solution (see Fig. 3) that applies temporal pooling independently to the first and second halves of the output temporal embeddings for the action classification head, and uses only the two spatial embeddings around the central frames for the detection head.

This approach allows us to double the number of tokens while respecting the different nature of each subtask. Once the final predictions are produced, we can proceed to training using the standard Bipartite Matching Loss, as mentioned above.
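A minimal sketch of this token selection is given below, assuming an output volume of shape (T, H, W, D) with an even T. Which detection slice is paired with which pooled half is an assumption made here for illustration; the only requirement stated above is that the two sets stay in one-to-one correspondence.

```python
import numpy as np

# Hypothetical sizes: 8 temporal x 16 x 16 spatial tokens, 4 channels.
T, H, W, D = 8, 16, 16, 4
rng = np.random.default_rng(0)
volume = rng.normal(size=(T, H, W, D))  # stand-in for the rearranged output embeddings


def select_tokens(volume):
    """Build 2*H*W detection tokens and 2*H*W classification tokens.

    Row k of both outputs refers to the same (half, h, w) position, so the
    heads applied to them produce outputs that can be zipped into triplets.
    """
    num_t, height, width, dim = volume.shape
    mid = num_t // 2
    # Detection: the two temporal slices around the centre of the clip.
    det = volume[mid - 1:mid + 1].reshape(2 * height * width, dim)
    # Classification: temporal average pooling over each half of the clip.
    first_half = volume[:mid].mean(axis=0)
    second_half = volume[mid:].mean(axis=0)
    cls = np.stack([first_half, second_half]).reshape(2 * height * width, dim)
    return det, cls


det_tokens, cls_tokens = select_tokens(volume)
print(det_tokens.shape, cls_tokens.shape)
```

In this configuration each head sees 2HW = 512 tokens instead of the full THW = 2048, while the classification tokens still summarize the whole clip.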

Figure 3. The output spatio-temporal tokens are fed to 3 parallel heads. We use the central tokens to predict the bounding box and the actor likelihood while averaging the output tokens over the temporal axis to generate the action tokens. Each head comprises a small MLP that generates the output triplets.

We want to highlight that our method, which consists of reformulating the training objective of vision transformers for action localization and of observing that this formulation admits different alternatives for token selection, is amenable to different backbones, token selection designs, and even output resolutions. The fact that tokens are fixed and assigned to ground-truth instances also implies that we can directly resize the input frames to a fixed square resolution without any concern about losing the aspect ratio, something not possible in ROI-based approaches. This spares our method the need to use different views, making it computationally more efficient.

Quantitative Results

We validated our approach on the challenging AVA 2.2 benchmark [5,6]. We also validated its generalization capabilities on UCF101-24 [7] and JHMDB51-21 [8]. The standard metric to assess the capabilities of a model or method is the mean Average Precision (mAP), which accounts for a combination of precision and recall. This metric should be read along with the complexity of the network. In this work, we target maximal performance in the low-complexity regime.
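For readers unfamiliar with the metric, the sketch below computes a simplified, non-interpolated Average Precision for one class and averages it over classes. It assumes that each detection has already been flagged as a true or false positive (in detection benchmarks this typically depends on an overlap criterion with the ground-truth boxes); the official evaluation protocols of the benchmarks above differ in their details, so this is an illustration only.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Simplified, non-interpolated AP for a single class.

    Detections are ranked by score; the precision at the rank of every true
    positive is accumulated and normalized by the number of ground truths.
    """
    ranked = sorted(zip(scores, is_true_positive), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, hit) in enumerate(ranked, start=1):
        if hit:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_ground_truth if num_ground_truth else 0.0


def mean_average_precision(per_class):
    """Average the per-class AP over all classes."""
    return sum(average_precision(*args) for args in per_class) / len(per_class)


# Toy example with two classes: (scores, true-positive flags, number of ground truths).
per_class = [
    ([0.9, 0.8, 0.7, 0.6], [True, False, True, False], 2),  # AP = (1/1 + 2/3) / 2
    ([0.95, 0.4], [True, True], 2),                          # AP = 1.0
]
print(mean_average_precision(per_class))
```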

In Table 1, we show the influence of rearranging the output tokens, as described above, on the performance of the network. We observe that the right number and balance of tokens is crucial to achieve better performance.

Table 1. We observe that having a proper balance of tokens is important to achieve the best performance.

In Tables 2 and 3, we report the results of our method against state-of-the-art works, along with their corresponding complexities. We see that our simplified pipeline delivers the best accuracy-complexity tradeoff.

Table 2. Results on AVA 2.2. Please refer to the paper for the full Table description.

Table 3. Results on JHMDB and UCF24. Please refer to the paper for a full Table description.

Qualitative Results

In Fig. 4, we provide three qualitative examples from three validation videos of AVA 2.2. In the top left image, we represent the confidence scores p(α) for the actor/no-actor prediction, for each of the 16 × 16 output tokens corresponding to one of the central frames. Those with high confidence p(α) > θ are selected as positive examples, and their corresponding bounding boxes and class predictions then form the final outputs. The images in the bottom left show the bounding boxes predicted by each of the same 16 × 16 output tokens, with those in yellow corresponding to the positive tokens (i.e. to the final output bounding boxes). On the right, we represent the last layer’s attention maps corresponding to each of the selected tokens on the left, i.e. their attention scores w.r.t. the whole set of 8 × 16 × 16 spatio-temporal tokens. We observe that the confident tokens not only attend to the central information to produce the bounding box, but also track the corresponding actor across the video to estimate the corresponding actions. In the second example, we only represent three actors for the sake of clarity. We observe that even with a change of scene, the attention maps can properly track each actor’s information. In the last example, the self-attention maps show how they can track each actor despite the self-occlusion. These examples show that our method can track the actors’ information and regress the bounding boxes with a Vision Transformer that assigns each vision token a different output, with the outputs assigned to the ground-truth set through bipartite matching.

Figure 4. Qualitative examples of how our method can detect and classify actions.


[1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

[2] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. MViTv2: Improved multiscale vision transformers for classification and detection. In CVPR, 2022.

[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.

[4] Russell Stewart, Mykhaylo Andriluka, and Andrew Y Ng. End-to-end people detection in crowded scenes. In CVPR, 2016.

[5] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. AVA: A video dataset of spatio-temporally localized atomic visual actions. In CVPR, 2018.

[6] Ang Li, Meghana Thotakuri, David A Ross, João Carreira, Alexander Vostrikov, and Andrew Zisserman. The AVA-Kinetics localized human actions video dataset. arXiv, 2020.

[7] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv, 2012.

[8] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J Black. Towards understanding action recognition. In ICCV, 2013.