The problem of sequence-to-sequence alignment is central to many computational applications. Aligning two sequences (e.g., temporal signals) entails computing the optimal pairwise correspondence between the sequence elements while preserving their match orderings. For example, video or audio synchronization are important applications of sequence alignment in the same modality, while the alignment of video to audio or text represents a cross-modal task.
Dynamic Time Warping (DTW)  is a standard algorithm for finding the optimal alignment between two variable length sequences. To find a meaningful alignment, DTW assumes that every element in one sequence has a good match in the other sequence. In other words, DTW assumes that the sequences do not contain outliers, which is not the case for most realistic scenarios. A real-world example of sequences containing outliers is instructional videos. These are long, untrimmed videos depicting a person performing a given activity (e.g., “making a latte”) following a pre-defined set of ordered steps (e.g., a recipe). Typically, only a few frames in the video correspond to the instruction steps, while the rest of the video frames are unrelated to the main activity (e.g., the person talking); see Figure 1 (left) for an illustration. In this case, matching the outlier frames to instruction steps will not yield a meaningful alignment. Thus, there is a need for a sequence alignment algorithm that can deal with outliers.
Figure 1. Aligning instructional videos with DTW vs Drop-DTW
Both video sequences (top and bottom) depict the three main steps of “making latte”, however, there are unrelated video segments, i.e., outliers, in between the steps. Here, green denotes correct correspondences and red indicates false matches. Left: DTW aligns all the frames with each other and creates false correspondences where outliers are matched to signal. Right: In contrast, Drop-DTW finds the optimal alignment, while simultaneously dropping unrelated frames (crossed out), leaving only correct correspondences (green links).
In this blog, we present Drop-DTW, our work  published at NeurIPS’21, that addresses the problem of matching sequences that contain interspersed outliers as illustrated in Figure 1 (right). During the alignment process, Drop-DTW adaptively finds the outlier sequence elements and drops them from the matching, which results in an alignment limited to the inlier sequence elements. Using this algorithm, we are able to improve step localization performance in instructional videos, as well as learning from video and audio signal with minimal human supervision.
We make the implementation of Drop-DTW publicly available (see the official GitHub repository).
The key aspect of our work is an extension of the classical dynamic time warping (DTW) algorithm  to handle interspersed outliers during sequence matching. Given that matching any two elements incurs a certain cost (low for similar elements and high otherwise), the classical DTW casts sequence alignment as an optimization problem. The objective is to find a set of pairwise matches between sequence elements, such that the total matching cost is minimized, while the matches follow the temporal order and every element has at least one match. The alignment formulation of DTW completely ignores the fact that some of the elements in the sequences may be outliers and should not have a match in the other sequence.
To handle outliers, Drop-DTW explicitly allows any sequence element to be dropped during matching, i.e., no element from the other sequence will be assigned to the dropped element. Drop-DTW implements this by casting sequence alignment as an optimization problem with a novel component specifying the cost of dropping an element within the optimization process. As a result, Drop-DTW solves optimal temporal alignment while jointly detecting outliers and can effectively skip through irrelevant parts of the sequence during alignment. Figure 2 illustrates how the drop capability is implemented in Drop-DTW. Our algorithm is efficiently realized using a dynamic program that naturally admits a differentiable approximation and can be efficiently used at training and inference time.
Figure 2. Optimal alignment with DTW and Drop-DTW
Aligning two different videos where the digit “3” moves across the square frame. The colored matrices represent the pairwise matching costs, with darker cell indicating higher cost. The paths on the grid are alignment paths, while the points on them indicate a pairwise match between the corresponding row and column elements. (a) All three paths are feasible DTW paths, while only one of them (in green) is optimal. (b) When sequence X contains an outlier (i.e., digit “0”), DTW uses it in the alignment and incurs a high cost (red point). (c) In contrast, Drop-DTW skips the outlier and only keeps the relevant matches. Here, Drop-DTW identifies the 2nd element of the sequence x as an outlier and drops it, which incurs the cost dx2.
While Drop-DTW enables a variety of applications (see  for more information) in this blog post we showcase the more practical of the applications - step localization in instructional videos. For this task, we are given as input: (i) a long untrimmed video of a person performing a task (e.g., “making salad”), where the steps involved in performing the main task are interspersed with irrelevant actions (e.g., intermittently washing dishes), and (ii) an ordered set of textual descriptions of the main steps (e.g., “cut tomatoes”) in this video. The goal is to temporally localize each step in the video.
To approach this problem, we first embed the video frames and step text descriptions into high-dimensional vectors, such that we can compute pairwise match and drop costs. Then, we feed the embedded sequences to Drop-DTW for alignment. As a result, our algorithm labels each video frame with one of the instruction steps, or drops it from the alignment, which gives rise to the desired temporal video segmentation (see Figure 3).
Figure 3. Drop-DTW for step localization in instructional videos
Given a raw untrimmed video of “making salad” and an ordered list of the recipe steps as text descriptions, we embed each modality into a high-dimensional space using dedicated neural networks. The embedded video frames and sentence sequences are then aligned using Drop-DTW. As a result, every video frame is associated to one of the instruction steps, or dropped otherwise. Matched video frames are labeled with the corresponding instruction step, or marked as “background” in case they were dropped. This labeling is then used to obtain the start and end time of each instruction step.
Figure 4 below gives an example of how Drop-DTW localizes steps in an instruction video depicting “Making Kerala Fish Curry” and compares our approach to other alignment methods (e.g., OTAM and DTW). We can see that the predicted step locations (in the row “Drop-DTW”) are very close to the ground truth locations (in the row “Ground Truth”), especially when comparing to the competing methods – OTAM and DTW. This demonstrates the effectiveness of Drop-DTW for step localization in instructional videos.
Figure 4. Step localization with Drop-DTW vs alternative methods
Each row gives a temporal segmentation of the video into instruction steps, with different colors indicating different steps. First row depicts the ground-truth segmentation, while rows 2-4 demonstrate segmentations produced by different step localization methods. Drop-DTW allows to identify interspersed unlabelled clips and much more closely approximates the ground truth.
In summary, we introduced an extension to the classic DTW algorithm, which can handle interspersed outliers during sequence alignment. The proposed algorithm is efficiently implemented as a dynamic program and naturally admits a differentiable approximation. Drop-DTW can be used both as a flexible alignment cost between sequences and a loss during training. Finally, Drop-DTW has a wide range of applications, e.g., step localization in instructional videos, as highlighted in this blog.
 Hiroaki Sakoe and Seibi Chiba. Dynamic Programming Algorithm Optimization for Spoken Word Recognition. TASSP, 26(1):43–49, 1978.
 Nikita Dvornik, Isma Hadji, Konstantinos G. Derpanis, Animesh Garg and Allan D. Jepson, Drop-DTW: Aligning Common Signal Between Sequences While Dropping Outliers, NeurIPS, 2021.