[CVPR 2022 Series #1] Probabilistic Procedure Planning in Instructional Videos

By Isma Hadji Samsung AI Center - Toronto
By Nikita Dvornik Samsung AI Center - Toronto
By Konstantinos G. Derpanis Samsung AI Center - Toronto
By Richard Wildes Samsung AI Center - Toronto
By Allan Jepson Samsung AI Center - Toronto

The Computer Vision and Pattern Recognition Conference (CVPR) is a world-renowned international Artificial Intelligence (AI) conference co-hosted by the Institute of Electrical and Electronics Engineers (IEEE) and the Computer Vision Foundation (CVF) which has been running since 1983. CVPR is widely considered to be one of the three most significant international conferences in the field of computer vision, alongside the International Conference on Computer Vision (ICCV) and the European Conference on Computer Vision (ECCV).

This year, Samsung Research’s R&D centers around the world presented a total of 20 papers at CVPR 2022. In this relay series, we introduce summaries of 6 of those research papers.

In particular, two papers submitted by the Toronto AI Center were selected for oral presentation. Oral presentations at CVPR are offered only to the top 4-5% of all submitted papers. For Samsung’s Toronto AI Center, this is the second time in two years it has earned such a chance, as it was also selected for an oral presentation in 2020.

Here are the 6 research papers summarized in this series, out of the 20 papers shared at CVPR 2022.

- Part 1 : Probabilistic Procedure Planning in Instructional Videos (by Samsung AI Center – Toronto)

- Part 2 : Day-to-Night Image Synthesis for Training Nighttime Neural ISPs (by Samsung AI Center – Toronto)

- Part 3 : GP2: A Training Scheme for 3D Geometry-Preserving and General-Purpose Depth Estimation
  (by Samsung AI Center - Moscow)

- Part 4 : Stereo Magnification with Multi-Layer Images (by Samsung AI Center - Moscow)

- Part 5 : P>M>F: The Pre-Training, Meta-Training and Fine-Tuning Pipeline for Few-Shot Learning
  (by Samsung AI Center - Moscow)

- Part 6 : Gaussian Process Modeling of Approximate Inference Errors for Variational Autoencoders
  (by Samsung AI Center - Cambridge)

About Samsung AI Center – Toronto

The Samsung AI Research Center in Toronto consists of research scientists, research engineers, faculty consultants, and MSc/PhD interns. Its broad mission is to develop core AI technology that improves the user experience with Samsung devices.

One research pillar at SAIC-Toronto is the integration of language and vision in support of a multimodal personal assistant (or, more generally, an artificial agent) that can better respond to a user’s natural language query if it can observe how the user is interacting with their environment. Specifically, we’d like to enable the agent to guide the user through a complex task, observing their progress, offering advice/assistance, and responding to their natural language queries. One application that falls under this research umbrella is the task of procedure planning, where the role of the AI agent is to help a user devise a plan to perform a complex task given an initial state and a goal state.

Below, we describe a novel approach to such procedure planning, to be presented orally at CVPR 2022. Unlike prior approaches, which require expensive visual supervision, our approach relies on the tight integration of language and vision to solve the task with less labeling effort. This interplay between vision and language is key to many of the core technologies developed in our team.


Procedure planning is a natural task for humans: one must plan out a sequence of actions to go from an initial state to a desired goal state (e.g., determining what actions should be taken to turn a large piece of raw steak into slices of grilled steak, as illustrated in Figure 1 below). While effortless for humans, procedure planning is notoriously hard for artificial agents. Nevertheless, solving procedure planning is of great importance for building next-level artificial intelligence systems capable of analyzing and mimicking human behaviour, and eventually assisting humans in goal-directed problem solving, such as cooking, or installing and repairing devices.

Figure 1.  Illustration of the proposed procedure planning with weak language supervision. Fully supervised approaches (bottom row) require intermediate visual steps and therefore need the start and end times for each intermediate step, i.e., {s_i, e_i}. In contrast, our approach (top row) exploits natural language representations, {ι_i}, as a cheaper surrogate supervision, which only requires labeling the order of events.

In this blog post, we present our research work [1], which tackles procedure planning while addressing several shortcomings of prior efforts. Our key contributions are threefold:

• First, we remove the need for strong visual supervision, which entails very expensive annotation of the start and end times for each instructional step. Instead, we propose to use their linguistic counterparts as an affordable surrogate.

• Second, we propose a generative adversarial framework to capture the inherent uncertainty in procedure planning, i.e., we model the various possible plans to go from start to goal state.

• Third, we use a memory-augmented transformer architecture that can generate all action steps in parallel, thereby greatly improving inference efficiency. The use of a memory module also improves long-range sequence modeling and sequence coherence.

Technical Novelty: Probabilistic Procedure Planning with Weak Supervision

Figure 2.  Overview of our procedure planning approach. First, we add random latent variables to the visual observations of the start and goal states to support diverse plan sampling. Second, we pass the input to the transformer decoder that interacts with the global memory to generate feasible procedure plans. Third, we produce state and action vectors and use losses calculated from natural language representations, together with others, to supervise our architecture.

The first key contribution of our approach is exploiting natural language representations of step labels as a source of weak supervision, instead of the expensive vision-based supervision. To supervise our model with language features, we adopt contrastive learning [2]. For each predicted intermediate feature, we use the corresponding ground truth language embedding as the positive example and all the other embeddings in the language vocabulary as negative examples.
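To make the language supervision concrete, here is a minimal sketch of a contrastive loss over a step-label vocabulary. It uses a softmax-based (InfoNCE-style) formulation with dot-product similarity, which is an assumption made for illustration and may differ from the exact loss used in the paper. The function name, the `temperature` parameter, and the toy embeddings are all hypothetical.

```python
import math

def contrastive_language_loss(pred, vocab, target_idx, temperature=0.1):
    """Negative log-probability of the ground-truth language embedding.

    pred:       predicted feature for one intermediate step
    vocab:      language embeddings for every step label in the vocabulary
    target_idx: index of the ground-truth (positive) embedding; all
                other entries of vocab act as negatives
    """
    # Dot-product similarity between the prediction and each embedding.
    logits = [sum(p * v for p, v in zip(pred, emb)) / temperature
              for emb in vocab]
    # Numerically stable log-sum-exp over positive and negatives.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[target_idx]

# Toy vocabulary of three step-label embeddings (hypothetical values).
vocab = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
pred = [0.9, 0.1]
loss_if_label_0 = contrastive_language_loss(pred, vocab, target_idx=0)
loss_if_label_2 = contrastive_language_loss(pred, vocab, target_idx=2)
```

The loss is small when the predicted feature lies close to its ground-truth label embedding and large when it lies close to a different label, which is what pushes the predicted intermediate features toward the correct language representations.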

To capture the uncertainty in prediction, where multiple plans are plausible (e.g., there is more than one procedure to grill a steak), we augment our model with a stochastic component using generative adversarial learning. More concretely, we augment the input with random latent variables, e.g., vectors sampled from a normal distribution, to enable diverse plan generation. For training, we use an adversarial loss as an additional objective to distinguish feasible plans from infeasible ones.
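The following sketch illustrates the two ingredients described above: augmenting the start and goal observations with a random latent vector, and an adversarial objective on critic scores. Concatenating the noise to the features and using the standard non-saturating GAN losses are assumptions made for illustration; the paper may combine the noise and define the loss differently. All function names here are hypothetical.

```python
import math
import random

def make_generator_input(start_feat, goal_feat, noise_dim, rng):
    """Append a noise vector drawn from N(0, 1) to the start/goal features."""
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return start_feat + goal_feat + z

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def discriminator_loss(real_score, fake_score):
    """Binary cross-entropy: ground-truth plans should score high,
    generated plans should score low."""
    return -math.log(sigmoid(real_score)) - math.log(1.0 - sigmoid(fake_score))

def generator_adversarial_loss(fake_score):
    """Non-saturating loss: the generator wants its plans to look feasible."""
    return -math.log(sigmoid(fake_score))

# Same {start, goal} pair, four different noise draws.
rng = random.Random(0)
start, goal = [0.2, 0.7], [0.9, 0.1]
inputs = [make_generator_input(start, goal, noise_dim=3, rng=rng)
          for _ in range(4)]
```

Because only the noise part of the input changes between draws, any variation in the generated plans for a fixed start and goal comes from the latent variable, which is what allows several plausible plans to be sampled.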

We use a memory-augmented, non-autoregressive transformer decoder as our model. The decoder is a stack of standard multi-head attention layers, all of which have access to a global learnable memory. Each layer thus performs two key operations: first, the query input is processed with self-attention; second, the cross-attention module attends to the learnable memory to generate the output. Intuitively, the memory module can be seen as a collection of learnable plan embeddings shared across the entire dataset. See Figure 2 for details.
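The two operations of a decoder layer can be sketched in plain Python as follows. This is a deliberately stripped-down illustration, not the paper's architecture: it uses a single attention head and omits the learned projections, layer normalization, and feed-forward sublayers of a real transformer. The memory is filled with random values here, whereas in the model it would be learned during training. All names are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention with no causal mask, so all plan
    steps are computed together rather than one after another."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def add(xs, ys):
    """Element-wise residual connection."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def memory_decoder_layer(queries, memory):
    # 1) Self-attention among the step queries.
    h = add(queries, attend(queries, queries, queries))
    # 2) Cross-attention: every step reads from the shared memory slots.
    return add(h, attend(h, memory, memory))

rng = random.Random(0)
dim, horizon, slots = 4, 3, 5
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(horizon)]
memory = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(slots)]
plan_features = memory_decoder_layer(queries, memory)
```

The output has one feature vector per plan step, and no step depends on a previously decoded step, which is what makes parallel (non-autoregressive) generation possible.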

Collectively, our work proposes an efficient architecture that can capture the uncertainty inherent to the task of procedure planning, all while relying on cheaper supervision.

Evaluation Results

We examine our approach from both deterministic and probabilistic perspectives. The former assumes that each test sample has a single plan as ground truth, and the metrics of interest include mean accuracy, mean Intersection over Union, and success rate. Our approach achieves competitive results on all metrics on three public benchmarks (i.e., CrossTask [3], COIN [4] and NIV [5]) and demonstrates robust performance over a wide range of planning horizons.
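As a rough guide to what the three deterministic metrics measure, the sketch below computes them for a list of predicted plans. The definitions follow common usage in the procedure planning literature and are assumptions here: success rate counts plans that match exactly, mean accuracy counts per-step matches, and mean IoU compares the sets of predicted and ground-truth actions regardless of order. The exact evaluation protocol in the paper (for example, how IoU is averaged) may differ. The function name and the toy plans are hypothetical.

```python
def plan_metrics(predicted, ground_truth):
    """Deterministic metrics over paired lists of non-empty action sequences."""
    pairs = list(zip(predicted, ground_truth))
    n = len(pairs)
    # Success rate: the whole plan must match, step for step.
    success_rate = sum(p == g for p, g in pairs) / n
    # Mean accuracy: fraction of individual steps that match in position.
    correct_steps = sum(a == b for p, g in pairs for a, b in zip(p, g))
    total_steps = sum(len(g) for g in ground_truth)
    mean_accuracy = correct_steps / total_steps
    # Mean IoU: overlap between the sets of actions, ignoring order.
    mean_iou = sum(len(set(p) & set(g)) / len(set(p) | set(g))
                   for p, g in pairs) / n
    return {"success_rate": success_rate,
            "mean_accuracy": mean_accuracy,
            "mean_iou": mean_iou}

predicted = [["season", "grill", "slice"], ["pour", "stir", "serve"]]
ground_truth = [["season", "grill", "slice"], ["stir", "pour", "serve"]]
metrics = plan_metrics(predicted, ground_truth)
```

In this toy example the second plan contains the right actions in the wrong order, so it counts fully toward mean IoU, partially toward mean accuracy, and not at all toward success rate.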

In contrast, the probabilistic point of view considers a set of plausible plans (i.e., a distribution) as ground truth and focuses on stochasticity-oriented metrics that measure both sample quality and diversity, such as negative log-likelihood, Kullback–Leibler divergence and mode coverage. Our approach also yields state-of-the-art results on these metrics. A visualization of sampled results from our approach is shown in Figure 3. More detailed results are provided in our paper.
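One simple way to compute metrics of this kind for a single {start, goal} pair is sketched below. It treats each distinct action sequence as a mode and compares smoothed empirical distributions of sampled and ground-truth plans. The smoothing constant `eps`, the direction of the KL divergence, and the definition of mode coverage as the fraction of ground-truth plans that appear among the samples are assumptions made for illustration, so the paper's exact formulations may differ.

```python
import math
from collections import Counter

def plan_distribution(plans, support, eps=1e-6):
    """Smoothed empirical distribution of plans over a fixed support."""
    counts = Counter(tuple(p) for p in plans)
    total = len(plans) + eps * len(support)
    return {s: (counts[s] + eps) / total for s in support}

def probabilistic_metrics(sampled, ground_truth):
    gt_modes = {tuple(p) for p in ground_truth}
    sampled_modes = {tuple(p) for p in sampled}
    support = sorted(gt_modes | sampled_modes)
    p_gt = plan_distribution(ground_truth, support)
    p_model = plan_distribution(sampled, support)
    # KL(ground truth || model): large when the model misses real plans.
    kl = sum(p_gt[s] * math.log(p_gt[s] / p_model[s]) for s in support)
    # Average negative log-likelihood of the ground-truth plans.
    nll = -sum(math.log(p_model[tuple(p)])
               for p in ground_truth) / len(ground_truth)
    # Fraction of distinct ground-truth plans that were sampled at least once.
    coverage = len(gt_modes & sampled_modes) / len(gt_modes)
    return {"kl": kl, "nll": nll, "mode_coverage": coverage}

ground_truth = [["season", "grill"], ["grill", "season"]]
collapsed = [["season", "grill"]] * 4
diverse = [["season", "grill"], ["grill", "season"]] * 2
collapsed_metrics = probabilistic_metrics(collapsed, ground_truth)
diverse_metrics = probabilistic_metrics(diverse, ground_truth)
```

A sampler that always returns the same plan covers only half of the ground-truth modes here and is penalized on both divergence and likelihood, whereas a sampler that produces both orderings matches the ground-truth distribution.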

Figure 3.  Sampled plausible plans from our probabilistic generative model for the same {start, goal} observations. Our sampled results are both realistic and diverse.


In summary, this work has introduced a weakly-supervised method for probabilistic procedure planning using instructional videos. Unlike previous work, we eschew the need for visual supervision in favor of a cheaper language surrogate. In addition, we demonstrated the crucial role of modeling uncertainty in the obtained plans to yield a principled approach to planning in videos. Evaluation results on multiple benchmarks show competitive performance on deterministic metrics and state-of-the-art results on probabilistic metrics.

Link to the paper


[1] H. Zhao, I. Hadji, N. Dvornik, K. G. Derpanis, R. P. Wildes, A. D. Jepson, P3IV: Probabilistic procedure planning from instructional videos with weak supervision, CVPR 2022.

[2] M. Gutmann, A. Hyvärinen, Noise-contrastive estimation: A new estimation principle for unnormalized statistical models, AISTATS 2010.

[3] D. Zhukov, J.-B. Alayrac, R. G. Cinbis, D. Fouhey, I. Laptev, J. Sivic, Cross-task weakly supervised learning from instructional videos, CVPR 2019.

[4] Y. Tang, D. Ding, Y. Rao, Y. Zheng, D. Zhang, L. Zhao, J. Lu, J. Zhou, COIN: A large-scale dataset for comprehensive instructional video analysis, CVPR 2019.

[5] J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, S. Lacoste-Julien, Unsupervised learning from narrated instruction videos, CVPR 2016.