Learning to Align Temporal Sequences

By Isma Hadji, Samsung AI Center - Toronto
By Konstantinos G. Derpanis, Visiting Professor, Samsung AI Center - Toronto
By Allan D. Jepson, Vice President, Samsung AI Center - Toronto


Temporal sequences (e.g., videos) are an appealing data source because they provide rich information and additional constraints to leverage in learning. By far the main focus of temporal sequence analysis in computer vision has been on learning representations (i.e., compact abstractions of the input data) that target high-level distinctions between signals (e.g., action classification, “What action is present in the video?”). Alternatively, there is work that seeks a finer-grained understanding of sequences (e.g., “What particular phase of the action is being performed?”) but relies on fully supervised methods, where labels are attached to each element of a sequence. Acquiring such labels for training at large scale is a highly laborious and expensive process.

In this blog post, we present our recent work [1], published at CVPR 2021, which reduces the amount of labeled training data needed for learning useful representations. We cast the training problem as one of learning to globally align temporal sequences. Here, an alignment amounts to identifying corresponding elements between two sequences while enforcing that the matches follow the temporal orderings of the signals; see Figure 1 for an illustration.

Figure 1. Alignment as a proxy task for representation learning. We introduce a representation learning approach based on (globally) aligning pairs of temporal sequences (e.g., video) depicting the same process (e.g., human action). Our training objective is to learn an element-wise mapping of the high-dimensional data to a low-dimensional one (termed the embedding space) that supports the alignment process. For example, here we illustrate the alignment (denoted by black dashed lines) in the embedding space between videos of the same human action (i.e., tennis forehand) containing significant variations in their appearances and dynamics. Empirically, we show that our learned embeddings are sensitive to both human pose and fine-grained temporal distinctions, while being invariant to appearance, camera viewpoint, and background.

Sequence Alignment as a Proxy Task for Representation Learning

A key aspect of our work is an extension of the classical dynamic time warping (DTW) algorithm [2]. DTW is a standard algorithm for measuring the similarity between temporal sequences that are not synchronized, e.g., signals that are dilated or translated with respect to each other. To avoid the combinatorial explosion in the matching problem, standard DTW uses dynamic programming to efficiently find the optimal alignment between sequences under certain constraints, e.g., the matches must respect the time ordering of the signals. In our work [1], we introduce a DTW-based training loss to guide the learning of elementwise representations that are distinctive enough to support sequence alignment.
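To make the dynamic programming step concrete, here is a minimal sketch of classical DTW in Python with NumPy. It is a textbook formulation rather than the implementation used in [1]; the function name `dtw`, the three allowed moves (diagonal, down, right), and the absolute-difference cost in the toy example are assumptions made for illustration.

```python
import numpy as np


def dtw(cost):
    """Classical DTW on a precomputed (M, N) pairwise cost matrix.

    Returns the cumulative cost of the optimal alignment and the
    alignment path as a list of (i, j) index pairs.
    """
    M, N = cost.shape
    # acc[i, j] holds the best cumulative cost of aligning the first i
    # elements of one sequence with the first j elements of the other.
    acc = np.full((M + 1, N + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev

    # Backtrack from the last cell to the first to recover the path.
    i, j = M, N
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    return float(acc[M, N]), path[::-1]


# Toy example: y is a time-dilated copy of x, so a zero-cost alignment exists.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 2.0, 2.0, 3.0])
total, path = dtw(np.abs(x[:, None] - y[None, :]))
```

The `min` in the recursion is the non-differentiable component that prevents this version from being used directly as a training loss.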

More specifically, we propose a probabilistic path finding view of DTW that encompasses the following three key features.

  • Classical DTW cannot be used as a training loss because it contains components that are not differentiable, notably the min operator. We introduce a differentiable smoothMin operator that effectively selects each successive path extension, and show that this operator has a contrastive effect across paths which improves learning.
  • The pairwise embedding similarities that form our cost function are also defined as probabilities, using the softmax operator. Optimizing our loss is shown to correspond to finding the maximum probability of any feasible alignment between paired sequences. The softmax operator over element pairs also provides a contrastive component which we show is crucial to prevent the model from learning trivial embeddings.
  • As an additional supervisory signal, our probabilistic framework admits a straightforward global cycle-consistency loss that matches the alignments recovered through a cycle of sequence pairings.
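The first two features above can be sketched in a few lines of Python with NumPy. This is an illustrative sketch under stated assumptions, not the code from [1]: the softmin-weighted form of `smooth_min`, the cosine similarity, and the temperature parameters `gamma` and `beta` are choices made here, and the exact definitions in the paper may differ. NumPy is used only for readability; an actual training loss would be written in an automatic differentiation framework.

```python
import numpy as np


def smooth_min(values, gamma=0.1):
    """Softmin-weighted average of the inputs (assumed form).

    Approaches min(values) as gamma -> 0 and, unlike min, depends
    smoothly on every input.
    """
    values = np.asarray(values, dtype=float)
    # Subtract the minimum before exponentiating for numerical stability.
    weights = np.exp(-(values - values.min()) / gamma)
    return float(np.sum(values * weights) / np.sum(weights))


def contrastive_cost(X, Y, beta=0.1):
    """cost[i, j] = -log softmax_j(cos_sim(x_i, y_j) / beta).

    Each row of exp(-cost) is a probability distribution over the
    elements of Y, so the cost is not symmetric in X and Y.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    logits = Xn @ Yn.T / beta
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob


def smooth_dtw(cost, gamma=0.1):
    """DTW recursion with min replaced by smooth_min."""
    M, N = cost.shape
    acc = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            prev = []
            if i > 0 and j > 0:
                prev.append(acc[i - 1, j - 1])
            if i > 0:
                prev.append(acc[i - 1, j])
            if j > 0:
                prev.append(acc[i, j - 1])
            acc[i, j] = cost[i, j] + (smooth_min(prev, gamma) if prev else 0.0)
    return float(acc[-1, -1])


# Toy usage with random embeddings (5 and 7 elements, dimension 4).
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 4)), rng.normal(size=(7, 4))
loss_xy = smooth_dtw(contrastive_cost(X, Y))
loss_yx = smooth_dtw(contrastive_cost(Y, X))
```

Because the softmax normalizes over the elements of the second sequence only, `loss_xy` and `loss_yx` generally differ, which is why the alignment loss is computed in both directions.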

Collectively, our method takes into account long-term temporal information, which allows us to learn temporally distinctive embeddings that support fine-grained temporal distinctions (e.g., human pose) while being largely unaffected by unimportant aspects, such as camera viewpoint, background, and appearance. Figure 2 provides an illustrative technical overview of our alignment approach to representation learning.

Figure 2. Summary of our alignment process. Our sequence alignment approach to representation learning begins by encoding each element comprising our sequences (e.g., image frames) using a trainable framewise backbone encoder plus embedding network, ϕ(·), yielding two sequences of embeddings, X and Y. The cost of matching these two sequences is expressed as negative log probabilities and consists of two parts: (i) alignment losses, smoothDTW(·, ·), from X to Y and Y to X based on the cumulative cost along the optimal respective paths and (ii) a global cycle consistency loss that verifies the correspondences computed between each ordered pair of sequences, where · denotes matrix multiplication and I_(M ×M) is the square identity matrix. Note, our alignment cost smoothDTW(·, ·) is not symmetric in its two arguments (due to the asymmetric pairwise matching cost). Higher intensities in the cells comprising the accumulated cost matrices indicate lower values.
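The idea behind a cycle-consistency term of this kind can be illustrated with a small Python (NumPy) sketch. It rests on simplifying assumptions: the correspondence matrices here are plain row-wise softmaxes over embedding similarities, whereas the correspondences in [1] are computed from the alignment itself, and the squared-error penalty, the temperature `beta`, and both function names are hypothetical.

```python
import numpy as np


def soft_correspondence(X, Y, beta=0.1):
    """Row-stochastic (M, N) matrix; row i softly assigns x_i to elements of Y."""
    logits = X @ Y.T / beta
    logits = logits - logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)


def cycle_consistency_loss(X, Y, beta=0.1):
    """Penalize deviation of the X -> Y -> X round trip from the identity."""
    A_xy = soft_correspondence(X, Y, beta)  # shape (M, N)
    A_yx = soft_correspondence(Y, X, beta)  # shape (N, M)
    M = X.shape[0]
    round_trip = A_xy @ A_yx                # shape (M, M)
    return float(np.sum((round_trip - np.eye(M)) ** 2))


# Distinct, perfectly matched embeddings give a loss near zero;
# indistinguishable embeddings do not.
distinct = np.eye(3)
collapsed = np.ones((3, 2))
loss_distinct = cycle_consistency_loss(distinct, distinct, beta=0.01)
loss_collapsed = cycle_consistency_loss(collapsed, collapsed)
```

In this sketch, mapping each element of X to Y and back should return it to its starting position, so the round-trip matrix is compared against the M×M identity.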

Results and Broader Impact

We evaluated the efficacy of our learned embeddings on challenging temporal fine-grained tasks, thereby going beyond traditional clip-level recognition tasks. Our approach yields state-of-the-art results on tasks such as action phase classification and video synchronization. Detailed results are provided in our paper.

The ability to temporally align sequences can support a variety of downstream tasks including:

  • Video synchronization could be useful for coaching (e.g., comparing a user's motion with an instructor's video) or video editing (e.g., generating synchronized split-frame videos).
  • Fine-grained retrieval can use a video query, or perhaps a language query. We could support, "Show me when the gymnast pushes off the horse." This would involve having such temporal descriptions added to one example video that would be used for alignment.
  • Audio queries could assist with the timing of events, such as selecting the frames at which a tennis player hits a ball.
  • It is feasible to search for audio intervals that match a query image or video clip.
  • A generalization is to embed an audio-visual clip and use the combined signal to improve the temporal alignment.

Video 1 provides examples of several downstream applications.

Video 1. Example downstream applications enabled by our alignment loss.


In summary, this work introduces a novel weakly supervised method for representation learning that relies on sequence alignment as a supervisory signal and takes a probabilistic view of the problem. We evaluate our learned representations on tasks requiring fine-grained temporal distinctions and establish a new state of the art. In addition, we present various applications, thereby opening up new avenues for future investigations grounded in our sequence alignment approach.

Link to the paper


[1] Isma Hadji, Konstantinos G. Derpanis, and Allan D. Jepson. Representation Learning via Global Temporal Alignment and Cycle-Consistency. CVPR, 2021.

[2] Hiroaki Sakoe and Seibi Chiba. Dynamic Programming Algorithm Optimization for Spoken Word Recognition. TASSP, 26(1):43–49, 1978.