
[CVPR 2023 Series #2] StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos

By Isma Hadji, Samsung AI Center - Toronto
By Ran Zhang, Samsung AI Center - Toronto
By Konstantinos G. Derpanis, Samsung AI Center - Toronto
By Richard Wildes, Samsung AI Center - Toronto
By Allan Jepson, Samsung AI Center - Toronto

The Computer Vision and Pattern Recognition Conference (CVPR) is a world-renowned international Artificial Intelligence (AI) conference co-hosted by the Institute of Electrical and Electronics Engineers (IEEE) and the Computer Vision Foundation (CVF), and it has been held since 1983. CVPR is widely considered to be one of the three most significant international conferences in the field of computer vision, alongside the International Conference on Computer Vision (ICCV) and the European Conference on Computer Vision (ECCV).

In this relay series, we introduce summaries of the research papers presented by Samsung at CVPR 2023:

- Part 1 : SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting with Neural Radiance Fields (by Samsung AI Center – Toronto)

- Part 2 : StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos (by Samsung AI Center – Toronto)

- Part 3 : GENIE: Show Me the Data for Quantization (by Samsung Research)

- Part 4 : A Unified Pyramid Recurrent Network for Video Frame Interpolation (by Samsung R&D Institute - Nanjing)

- Part 5 : MobileVOS: Real-Time Video Object Segmentation Contrastive Learning meets Knowledge Distillation (By Samsung R&D Institute United Kingdom)

- Part 6 : LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models (By Samsung AI Center - Cambridge)

- Part 7 : Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style (By Samsung AI Center - Cambridge)


Introduction

Observing someone perform a task (e.g., cooking, assembling furniture or fixing an electronic device) is a common approach for humans to acquire new skills. Instructional videos provide an excellent resource to learn such procedural activities for both humans and AI agents. However, the problem with using instructional videos from the web is that they tend to be long and noisy, i.e., only a limited number of frames in the video correspond to the instruction steps, while the remaining video segments are unrelated to the task (e.g., title frames, close-ups of people talking, and product advertisements). Thus, a major challenge when dealing with instructional videos is filtering out the uninformative frames and focusing only on the task-relevant segments, i.e., the key-steps. As a result, many recent efforts tackle the problem of instruction key-step localization, e.g., [2, 3, 4, 5, 6, 7]. However, most previous work on temporally localizing key-steps in instructional videos relies on some form of supervision and is therefore not deployable at scale without human involvement.

In this blog, we present StepFormer [1], a novel self-supervised approach that simultaneously discovers and temporally localizes procedure key-steps in long untrimmed videos, without requiring human annotations or step descriptions, as illustrated in Figure 1.

Figure 1.  StepFormer for instruction step discovery and localization. StepFormer is a transformer decoder trained to discover instruction steps in a video, supervised purely from video subtitles. At inference, it only needs the video to discover an ordered sequence of step slots and temporally localize them in the video.

StepFormer Training and Inference Procedures

StepFormer is our model for procedure step discovery in instructional videos. Given an N-second-long video as input, StepFormer returns K step slots, s: a sequence of vectors capturing the ordered instruction steps in the video. We train StepFormer on a large dataset of instructional videos, HowTo100M [5], without any supervision, by temporally aligning the step slots with the narrations that accompany each video. Our full training pipeline is summarized in Figure 2.

Figure 2.  StepFormer training overview. (left) We first embed an untrimmed instructional video with a frozen UniVL encoder [8]. Next, StepFormer attends to the video with its learned step queries to extract step slots. (right) To form the training targets, we take the corresponding video subtitles, extract a sequence of verb phrases, and embed them with the UniVL text encoder [8]. (middle) To supervise StepFormer, we find a matching subsequence between the step slots and verb phrases via sequence-to-sequence alignment with outlier rejection [2]; the green entries in the alignment matrix denote correspondences. The resulting alignment is used to define a contrastive loss.
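To make this objective more concrete, below is a minimal PyTorch sketch of a contrastive loss computed over the slot/phrase pairs that survive the alignment step. The function name, feature shapes, and temperature are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: contrastive supervision from aligned (step slot, verb phrase) pairs.
import torch
import torch.nn.functional as F

def contrastive_step_loss(step_slots, phrase_embs, matches, temperature=0.07):
    """step_slots: (K, d) slot features, phrase_embs: (M, d) verb-phrase features,
    matches: list of (slot_idx, phrase_idx) pairs kept by the aligner (outliers dropped)."""
    slots = F.normalize(step_slots, dim=-1)
    phrases = F.normalize(phrase_embs, dim=-1)
    sim = slots @ phrases.t() / temperature            # (K, M) similarity matrix
    slot_idx = torch.tensor([i for i, _ in matches])
    phrase_idx = torch.tensor([j for _, j in matches])
    # Each matched slot should score highest with its paired phrase; the other phrases act as negatives.
    return F.cross_entropy(sim[slot_idx], phrase_idx)

# Toy example: 32 slots, 10 phrases, 4 matched pairs.
loss = contrastive_step_loss(torch.randn(32, 512), torch.randn(10, 512),
                             matches=[(0, 1), (5, 3), (12, 6), (20, 9)])
print(float(loss))
```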

StepFormer is implemented as a transformer decoder with learnable input queries. Similar to the learnable object queries in DETR [9], StepFormer's queries learn to attend to informative video segments and can thus be viewed as step proposals. To enforce that the output step slots follow the temporal order of the video, we use an order-aware loss based on temporal alignment between the learned steps and the video narrations. Since video narrations tend to be noisy and do not always describe visually groundable steps, we use a flexible sequence-to-sequence alignment algorithm previously designed by our team, Drop-DTW [2], which allows non-alignable narrations to be dropped. With this alignment technique, we obtain a set of paired step slots and narrations. We use these pairings to form positive and negative pairs and rely on contrastive learning as our main supervisory signal. In addition, we introduce two regularization techniques: (1) slot smoothness, which encourages each slot to correspond to a contiguous set of frames in the video, and (2) slot diversity, which encourages different slots to represent different steps in the video.
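The decoder-with-queries design can be sketched in a few lines of PyTorch. The layer sizes, number of slots, and module names below are our own illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of a StepFormer-style module: learnable step queries cross-attend to video features.
import torch
import torch.nn as nn

class StepFormerSketch(nn.Module):
    def __init__(self, d_model=512, num_slots=32, num_layers=4, num_heads=8):
        super().__init__()
        # K learnable step queries, analogous to DETR's object queries.
        self.step_queries = nn.Parameter(torch.randn(num_slots, d_model))
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)

    def forward(self, video_feats):
        # video_feats: (B, T, d_model) frozen per-clip embeddings from a video encoder.
        B = video_feats.shape[0]
        queries = self.step_queries.unsqueeze(0).expand(B, -1, -1)   # (B, K, d_model)
        # The queries attend to the video and become ordered step slots.
        step_slots = self.decoder(tgt=queries, memory=video_feats)   # (B, K, d_model)
        return step_slots

# Usage: 200 clips of 512-d features -> 32 step slots.
slots = StepFormerSketch()(torch.randn(1, 200, 512))
print(slots.shape)  # torch.Size([1, 32, 512])
```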

Figure 3.  Step localization inference. Given a video, StepFormer first extracts a sequence of step slots. The step slots can be interpreted as temporally ordered key-step candidates. Next, it aligns the sequence of step slots with the video sequence, using Drop-DTW [2]. This step puts the elements of the two sequences in correspondence, while identifying outliers. Here, the step slots and video frames of the same color are matched by Drop-DTW, while the white slots and frames are dropped from the alignment. The colored video segments represent the final step localization result.

At inference time, StepFormer detects and localizes key-steps without requiring narrations. StepFormer's inference procedure is illustrated in Figure 3. It takes an instructional video as input and returns an ordered sequence of K step slots conditioned on this video. Each step slot corresponds to a procedural key-segment in the video and carries its semantics. However, given that we use a large, fixed number K of step slots for all videos, some step slots may be duplicates or only weakly bound to the video content. For this reason, we need to select the subset of step slots that concisely describes the given video. To do so, we once again rely on sequence-to-sequence alignment with Drop-DTW [2] as our inference procedure for step localization, as sketched below.
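The following is a simplified dynamic-programming sketch in the spirit of Drop-DTW [2]: frames may be dropped at a fixed cost while step slots are matched in temporal order. Unlike the full algorithm, this illustrative version does not drop step slots, and the function name, drop cost, and feature shapes are our own assumptions.

```python
# Hedged sketch of Drop-DTW-style step localization: align ordered step slots to video
# frames while allowing background frames to be dropped (illustrative approximation only).
import numpy as np

def localize_steps(slot_feats, frame_feats, drop_cost=0.4):
    """slot_feats: (K, d), frame_feats: (T, d). Returns per-frame labels in {-1, 0..K-1},
    where -1 marks frames dropped as background."""
    def norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    K, T = len(slot_feats), len(frame_feats)
    cost = 1.0 - norm(slot_feats) @ norm(frame_feats).T            # (K, T) matching costs
    INF = 1e9
    D = np.full((K + 1, T + 1), INF)
    D[0, 0] = 0.0
    D[0, 1:] = np.cumsum([drop_cost] * T)                          # frames before the first step are dropped
    ptr = np.zeros((K + 1, T + 1), dtype=int)                      # 0 = drop frame, 1 = continue slot, 2 = start slot
    for i in range(1, K + 1):
        for t in range(1, T + 1):
            choices = [D[i, t - 1] + drop_cost,                    # drop frame t
                       D[i, t - 1] + cost[i - 1, t - 1],           # keep matching slot i
                       D[i - 1, t - 1] + cost[i - 1, t - 1]]       # start matching slot i
            ptr[i, t] = int(np.argmin(choices))
            D[i, t] = choices[ptr[i, t]]
    # Backtrack to recover the per-frame assignment.
    labels, i, t = np.full(T, -1), K, T
    while t > 0 and i > 0:
        if ptr[i, t] != 0:
            labels[t - 1] = i - 1
        if ptr[i, t] == 2:
            i -= 1
        t -= 1
    return labels

# Toy example: 4 step slots aligned to 30 frames of 64-d features.
print(localize_steps(np.random.randn(4, 64), np.random.randn(30, 64)))
```

In this toy version, consecutive frames assigned to the same slot form that step's temporal segment, mirroring the colored segments in Figure 3.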

In summary, our work proposes an efficient architecture that can capture the ordered set of key steps that happen in an instructional video without requiring any annotations for training and that only requires the video input at inference time.

Evaluation Results

Given that our approach does not need human annotation, we train it on the largest un-annotated instructional video dataset, HowTo100M [5]. Following previous work [10], we evaluate our proposed method on the CrossTask [7] and ProceL [11] datasets. We also report results on the largest annotated instructional video dataset, COIN [6]. For all evaluations, the same model pre-trained on HowTo100M is used directly, without finetuning or adaptation to the downstream datasets. Our results speak decisively in favor of our approach: we outperform previous baselines by sizeable margins on all standard metrics. Overall, the obtained results (i) highlight StepFormer's capability to automatically discover key-steps while being completely unsupervised and having no access to the evaluation datasets during training; and (ii) show that our method yields better localization, as highlighted in the qualitative results in Figure 4.

Figure 4.  Qualitative comparison of temporal localization. Comparison of our self-supervised StepFormer with the weakly supervised baselines on CrossTask.

Conclusion

In summary, this work has introduced an efficient architecture that can capture the ordered set of key steps that happen in an instructional video without requiring any annotations for training and that only requires the video input at inference time. Evaluation results on multiple benchmarks and metrics show the superiority of our approach.

About Samsung AI Center – Toronto

The Samsung AI Research Center in Toronto (SAIC-Toronto) was established in 2018. With research scientists, research engineers, faculty consultants, and MSc/PhD interns, its broad mission is to develop core AI technology that improves the user experience with Samsung devices. One research pillar at SAIC-Toronto is the integration of language and vision in support of a multimodal personal assistant (or, more generally, an artificial agent) that can better respond to a user's natural language query if it can observe how the user is interacting with their environment. Specifically, we would like to enable the agent to guide the user through a complex task, observing their progress, offering advice and assistance, and responding to their natural language queries. One application that falls under this research umbrella is instruction following, where the role of the AI agent is to help a user carry out a complex task and answer the user's queries about the instructions. In this blog, we presented a novel self-supervised approach to key-step discovery and localization in instructional videos that does not require any annotations for training. Unlike prior approaches, which require expensive human supervision, our self-supervised model discovers and localizes instructional steps in a video using video subtitles, and it only requires the video input at inference time. This work will be presented at CVPR 2023.

Link to the paper

https://arxiv.org/abs/2304.13265

References

[1] Nikita Dvornik, Isma Hadji, Ran Zhang, Konstantinos G. Derpanis, Animesh Garg, Richard P. Wildes, Allan D. Jepson. StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[2] Nikita Dvornik, Isma Hadji, Konstantinos G Derpanis, Animesh Garg, and Allan Jepson. Drop-DTW: Aligning common signal between sequences while dropping outliers. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

[3] Nikita Dvornik, Isma Hadji, Hai Pham, Dhaivat Bhatt, Brais Martinez, Afsaneh Fazly, and Allan Jepson. Flow graph to video grounding for weakly-supervised multi-step localization. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.

[4] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-End Learning of Visual Representations from Uncurated Instructional Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[5] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In Proceedings of the International Conference on Computer Vision (ICCV), 2019.

[6] Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. COIN: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[7] Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. Cross-task weakly supervised learning from instructional videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[8] Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. UniVL: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353, 2020.

[9] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.

[10] Yuhan Shen, Lu Wang, and Ehsan Elhamifar. Learning to segment actions from visual and language instructions via differentiable weak sequence alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[11] Ehsan Elhamifar and Dat Huynh. Self-supervised multi-task procedure learning from instructional videos. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.