
Enhanced Bi-directional Motion Estimation for Video Frame Interpolation

By Xin Jin Samsung R&D Institute China-Nanjing
By Longhai Wu Samsung R&D Institute China-Nanjing
By Guotao Shen Samsung R&D Institute China-Nanjing
By Youxin Chen Samsung R&D Institute China-Nanjing
By Jie Chen Samsung R&D Institute China-Nanjing
By Jay Koo Samsung Electronics' Visual Display Business
By Cheul-hee Hahm Samsung Electronics' Visual Display Business

Introduction and Motivation

Video frame interpolation aims to increase the frame rate of videos by synthesizing non-existent intermediate frames between original successive frames. With recent advances in optical flow, motion-based interpolation has developed into a promising framework. Estimating the bi-directional motions between input frames is a crucial step for most motion-based interpolation methods. Existing methods typically employ an off-the-shelf optical flow model or a U-Net for bi-directional motions, which may suffer from a large model size or limited capacity in handling various challenging motion cases.

This blog presents a novel model to simultaneously estimate the bi-directional motions between input frames. Our motion estimator is extremely lightweight (15x smaller than PWC-Net [1]), yet enables reliable handling of large and complex motion cases (see Figure 1). Based on estimated bi-directional motions, we employ a synthesis network to fuse forward-warped representations and predict the intermediate frame. Our method achieves excellent performance on a broad range of frame interpolation benchmarks.

Figure 1.   First two columns: overlaid input frames and ground-truth frame. Middle two columns: motion field estimated by PWC-Net and the corresponding interpolation. Last two columns: our motion field and interpolated frame.

Overview of Motion-based Video Frame Interpolation

Motion-based interpolation involves two steps: motion estimation and frame synthesis. A motion field is estimated to guide the synthesis of the intermediate frame, by forward-warping or backward-warping the input frames towards the intermediate frame. Forward-warping is guided by motion from the input frames to the intermediate frame, while backward-warping requires motion in the reversed direction. In particular, once the bi-directional motions between input frames have been estimated, the motions from the input frames to an arbitrary intermediate frame, as required by forward-warping, can be easily approximated by linearly scaling the motion magnitude. In this work, we follow the frame interpolation paradigm enabled by forward-warping, due to its simplicity in overall pipeline design.
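As a concrete illustration, the following minimal PyTorch sketch shows this linear scaling. The `forward_warp` operator (e.g., a softmax-splatting implementation) and the tensor shapes are assumptions for illustration, not our exact pipeline code.

```python
import torch

def flows_to_time_t(flow_0to1: torch.Tensor, flow_1to0: torch.Tensor, t: float):
    """Approximate the motions from the two input frames to intermediate time t
    by linearly scaling the bi-directional motions (linear motion assumption)."""
    flow_0_to_t = t * flow_0to1          # motion from frame 0 towards time t
    flow_1_to_t = (1.0 - t) * flow_1to0  # motion from frame 1 towards time t
    return flow_0_to_t, flow_1_to_t

# Usage (flows: [B, 2, H, W], frames: [B, 3, H, W]):
# f0t, f1t = flows_to_time_t(flow_0to1, flow_1to0, t=0.5)
# warped0 = forward_warp(frame0, f0t)  # hypothetical forward-warping (splatting) op
# warped1 = forward_warp(frame1, f1t)
```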

Many existing methods employ an off-the-shelf optical flow model (e.g., PWC-Net [1]) for bi-directional motions. However, such models suffer from a large model size, need to be run twice, and can hardly handle extremely large motion beyond the training data. Recently, the BiOF-I module was proposed for simultaneous bi-directional motion estimation [2]. It is based on a flexible pyramid recurrent structure, which enables customizable pyramid levels in testing to handle large motions. However, its U-Net-based estimator is over-simplified for optical flow, due to the lack of a correlation volume, which is a vital ingredient in modern optical flow models. We present a solution that overcomes the limitations of both PWC-Net [1] and BiOF-I [2].

Enhanced Bi-directional Motion Estimation for Frame Interpolation

Figure 2.   Overview of our frame interpolation pipeline. (a) We repeatedly apply a novel recurrent unit across image pyramids to refine the estimated bi-directional motions between input frames. (b) Based on the estimated bi-directional motions, we forward-warp the input frames and their context features, and employ a synthesis network to predict the intermediate frame.

Our frame interpolation pipeline involves two steps: (a) bi-directional motion estimation, and (b) frame synthesis. Our main innovation is the bi-directional motion estimator. Once trained, it can operate on image pyramids with a customizable number of levels to handle large motion at test time. An overview of our pipeline is shown in Figure 2.

As shown in Figure 2 (a), the macro structure of our bi-directional motion estimator is a pyramid recurrent network. Given two input frames, we first construct image pyramids for them, then repeatedly apply a recurrent unit across the pyramid levels to refine the estimated bi-directional motions. At each pyramid level, we first up-sample the estimated bi-directional motions from the previous level. Based on the up-scaled initial motions, we forward-warp both input frames towards a hidden middle frame. Then, we employ an extremely lightweight feature encoder to extract CNN features for both warped frames. Lastly, we construct a correlation volume from the CNN features of the warped frames, and estimate the bi-directional motions from the correlation-injected features.
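The coarse-to-fine recurrence can be summarized with the sketch below; `recurrent_unit` stands in for the module described above (warping, feature extraction, correlation, motion update) and is an assumed interface rather than our exact code.

```python
import torch
import torch.nn.functional as F

def estimate_bidirectional_motion(frame0, frame1, recurrent_unit, num_levels=3):
    """Sketch of the pyramid recurrence: the same recurrent unit is applied from
    the coarsest to the finest level; num_levels is customizable at test time."""
    # Build image pyramids (index 0 = full resolution)
    pyr0 = [F.interpolate(frame0, scale_factor=0.5 ** i, mode='bilinear',
                          align_corners=False) for i in range(num_levels)]
    pyr1 = [F.interpolate(frame1, scale_factor=0.5 ** i, mode='bilinear',
                          align_corners=False) for i in range(num_levels)]

    flow = None  # bi-directional motions, [B, 4, h, w]
    for i in reversed(range(num_levels)):  # coarse to fine
        if flow is not None:
            # Up-sample and re-scale the motions estimated at the coarser level
            flow = 2.0 * F.interpolate(flow, scale_factor=2.0, mode='bilinear',
                                       align_corners=False)
        flow = recurrent_unit(pyr0[i], pyr1[i], flow)
    return flow
```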

There are three key components in our recurrent unit: middle-oriented forward-warping, an extremely lightweight feature encoder, and correlation-based bi-directional motion estimation. In particular, due to the reduced motion magnitude, our middle-oriented forward-warping reduces the impact of possible artifacts in the warped images in case of large motion. Furthermore, warping both input frames to a hidden middle frame allows us to construct a single correlation volume for simultaneous bi-directional motion estimation.
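For readers unfamiliar with correlation volumes, the sketch below shows a generic PWC-Net-style local correlation between the features of the two warped frames. It only illustrates the idea; the kernel radius and normalization are assumptions, not the exact layer used in our estimator.

```python
import torch
import torch.nn.functional as F

def local_correlation(feat_a: torch.Tensor, feat_b: torch.Tensor, radius: int = 3):
    """Local correlation (cost) volume between two feature maps [B, C, H, W],
    computed over a (2*radius+1)^2 neighbourhood around each pixel."""
    b, c, h, w = feat_a.shape
    pad_b = F.pad(feat_b, [radius] * 4)  # pad left/right/top/bottom
    costs = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = pad_b[:, :, dy:dy + h, dx:dx + w]
            # mean dot-product over channels for this displacement
            costs.append((feat_a * shifted).mean(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)  # [B, (2*radius+1)^2, H, W]
```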

Based on the estimated bi-directional motions, we employ a synthesis network to predict the intermediate frame from forward-warped representations. Our synthesis network follows the design of previous context-aware synthesis networks, which take both warped frames and warped context features as input. In particular, we develop a high-resolution version of the synthesis network, which up-samples the input frames for frame synthesis, and then down-samples the interpolation result with learned dynamic filters.
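The idea of down-sampling with learned dynamic filters can be sketched as follows. The kernel size, softmax normalization, and tensor shapes are illustrative assumptions and may differ from the exact configuration of EBME-H.

```python
import torch
import torch.nn.functional as F

def dynamic_downsample(hr_image: torch.Tensor, filters: torch.Tensor, k: int = 4):
    """Down-sample a 2x-resolution synthesis result with per-pixel learned filters.
    hr_image: [B, 3, 2H, 2W]; filters (predicted by the network): [B, k*k, H, W].
    Assumes an even kernel size k with stride 2."""
    b, c, H2, W2 = hr_image.shape
    H, W = H2 // 2, W2 // 2
    # Normalize each per-pixel filter so its weights sum to one
    weights = torch.softmax(filters.view(b, k * k, H * W), dim=1)
    pad = (k - 2) // 2
    patches = F.unfold(hr_image, kernel_size=k, stride=2, padding=pad)  # [B, 3*k*k, H*W]
    patches = patches.view(b, c, k * k, H * W)
    out = (patches * weights.unsqueeze(1)).sum(dim=2)  # weighted sum per output pixel
    return out.view(b, c, H, W)
```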

We name our frame interpolation method EBME -- Enhanced Bi-directional Motion Estimation for frame interpolation. We construct three versions of EBME, with almost the same model size but increasing computational cost: (1) EBME, which combines our bi-directional motion estimator with the base version of the synthesis network; (2) EBME-H, which combines our motion estimator with the high-resolution version of the synthesis network; (3) EBME-H∗, which applies test-time augmentation to EBME-H.
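Test-time augmentation for interpolation typically averages predictions over simple input transformations. The snippet below shows one common variant (temporal reversal) only to illustrate the idea behind EBME-H∗; `model` is an assumed interface, and the exact augmentations are described in the paper.

```python
import torch

def interpolate_with_tta(model, frame0, frame1, t=0.5):
    """Average the prediction on the original pair with the prediction on the
    temporally reversed pair at time 1 - t (one common TTA scheme)."""
    pred = model(frame0, frame1, t)
    pred_rev = model(frame1, frame0, 1.0 - t)
    return 0.5 * (pred + pred_rev)
```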

Comparisons with State-of-the-art

While our method is trained only on Vimeo90K [3], we evaluate it on a broad range of benchmarks with different resolutions, including UCF101 [4], Vimeo90K [3], SNU-FILM [5], and 4K1000FPS [6].

Table 1.   Quantitative (PSNR/SSIM) comparisons to state-of-the-art methods on the UCF101, Vimeo90K and SNU-FILM benchmarks. RED: best performance, BLUE: second-best performance.

Figure 3.   Visual comparisons on two examples from the “extreme” subset of SNU-FILM. The first two rows show the synthesis results for detailed textures, while the last two rows demonstrate the results with complex and large motion.

We compare with many state-of-the-art methods on UCF101, Vimeo90K and SNU-FILM, including DAIN [7], CAIN [5], SoftSplat [8], AdaCoF [9], BMBC [10], ABME [11], XVFI [2], and ECM [12]. As shown in Table 1, our EBME-H∗ achieves the best performance on these benchmarks. Our EBME also outperforms many state-of-the-art models, including DAIN, CAIN, AdaCoF, BMBC, XVFI, and ECM. It is worth noting that our models are about 4.5x smaller than ABME and run much faster. Figure 3 gives two examples from the “extreme” subset of SNU-FILM. Our methods produce better interpolation results than ABME for some detailed textures (first two rows), and give promising results for large motion cases (last two rows), much better than CAIN and AdaCoF, and slightly better than ABME.

Table 2.   Comparisons on 4K1000FPS for 8x interpolation.

Figure 4.   Visual comparisons on 4K1000FPS. XVFI tends to miss small moving objects, while our EBME-H gives interpolation results close to the ground truth.

We also report 8x interpolation results on 4K1000FPS [6]. As shown in Table 2, our method achieves the best performance in terms of SSIM, but slightly inferior results to ABME [11] and XVFI [2] in terms of PSNR. Note that XVFI is trained on 4K high-resolution data, while the other models are trained on low-resolution data. Our method supports arbitrary-time frame interpolation, and can fully re-use the estimated bi-directional motions when interpolating multiple intermediate frames at different time positions (see the sketch below). By contrast, while XVFI can re-use the bi-directional motions, it must refine the approximated intermediate flow with an extra network at each time position. Figure 4 gives two interpolation examples. Our methods perform better for small moving objects. The U-Net-based pyramid motion estimator in XVFI might have difficulty capturing the motion of extremely small objects.
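The re-use of motions across time positions can be made concrete with a short sketch: the bi-directional motions are estimated once and only linearly re-scaled for each of the seven intermediate time steps of 8x interpolation. `motion_estimator` and `synthesis_net` are hypothetical interfaces used for illustration.

```python
import torch

def interpolate_8x(motion_estimator, synthesis_net, frame0, frame1):
    """Estimate bi-directional motions once, then re-scale them for every
    intermediate time step of 8x interpolation."""
    flow_0to1, flow_1to0 = motion_estimator(frame0, frame1)  # run once per frame pair
    outputs = []
    for i in range(1, 8):
        t = i / 8.0
        # scale motions towards time t, then synthesize the frame at t
        outputs.append(synthesis_net(frame0, frame1,
                                     t * flow_0to1, (1.0 - t) * flow_1to0, t))
    return outputs
```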

Ablation Study of Design Choices

Table 3.   Impacts of the design choices of our motion estimator for frame interpolation.

We present an analysis of our motion estimator on the “hard” and “extreme” subsets of SNU-FILM, which contain various challenging motion cases. As shown in Table 3, we verify the effectiveness of our design choices, including simultaneous bi-directional motion estimation, middle-oriented forward-warping, and correlation-based motion estimation. In particular, we show that simultaneous bi-directional motion estimation not only reduces computational cost (compared with running a single-directional model twice), but also improves quantitative results, partially due to the correlation between the bi-directional motions.

Figure 5.  (a) Our middle-oriented forward-warping can reduce possible artifacts in warped images in case of large motion. (b) Correlation is essential for estimating challenging motions, yet it is overlooked by many frame interpolation methods.

Figure 5 gives visual evidence of the effectiveness of our warping method and correlation-based motion estimation. Due to the reduced motion magnitude, our middle-oriented forward-warping can reduce the impact of possible artifacts in the warped images in case of large motion. The correlation volume enhances the ability of our estimator to estimate complex non-linear motions. The significance of the correlation volume is largely overlooked by many existing motion-based frame interpolation methods.

Our Paper

Xin Jin, Longhai Wu, Guotao Shen, Youxin Chen, Jie Chen, Jayoon Koo, Cheul-hee Hahm. Enhanced Bi-directional Motion Estimation for Video Frame Interpolation. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 2023.

Link to the Paper

https://arxiv.org/pdf/2206.08572.pdf

References

[1]. Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, 2018.

[2]. Hyeonjun Sim, Jihyong Oh, and Munchurl Kim. XVFI: Extreme video frame interpolation. In ICCV, 2021.

[3]. Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. IJCV, 2019.

[4]. Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

[5]. Myungsub Choi, Heewon Kim, Bohyung Han, Ning Xu, and Kyoung Mu Lee. Channel attention is all you need for video frame interpolation. In AAAI, 2020.

[6]. Hyeonjun Sim, Jihyong Oh, and Munchurl Kim. XVFI: Extreme video frame interpolation. In ICCV, 2021.

[7]. Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Depth-aware video frame interpolation. In CVPR, 2019.

[8]. Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In CVPR, 2020.

[9]. Hyeongmin Lee, Taeoh Kim, Tae-young Chung, Daehyun Pak, Yuseok Ban, and Sangyoun Lee. AdaCoF: Adaptive collaboration of flows for video frame interpolation. In CVPR, 2020.

[10]. Junheum Park, Keunsoo Ko, Chul Lee, and Chang-Su Kim. BMBC: Bilateral motion estimation with bilateral cost volume for video interpolation. In ECCV, 2020.

[11]. Junheum Park, Chul Lee, and Chang-Su Kim. Asymmetric bilateral motion estimation for video frame interpolation. In ICCV, 2021.

[12]. Sungho Lee, Narae Choi, and Woong Il Choi. Enhanced correlation matching based video frame interpolation. In WACV, 2022.