
[CVPR 2023 Series #4] A Unified Pyramid Recurrent Network for Video Frame Interpolation

By Xin Jin Samsung R&D Institute - Nanjing
By Longhai Wu Samsung R&D Institute - Nanjing
By Jie Chen Samsung R&D Institute - Nanjing
By Youxin Chen Samsung R&D Institute - Nanjing
By Jay Koo Visual Display Business, Samsung Electronics
By Cheul-Hee Hahm Visual Display Business, Samsung Electronics

The Computer Vision and Pattern Recognition Conference (CVPR) is a world-renowned international Artificial Intelligence (AI) conference co-hosted by the Institute of Electrical and Electronics Engineers (IEEE) and the Computer Vision Foundation (CVF), and has been held since 1983. CVPR is widely considered to be one of the three most significant international conferences in the field of computer vision, alongside the International Conference on Computer Vision (ICCV) and the European Conference on Computer Vision (ECCV).

In this relay series, we introduce seven research papers presented at CVPR 2023. Here is a summary of them.

- Part 1 : SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting with Neural Radiance Fields (by Samsung AI Center – Toronto)

- Part 2 : StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos (by Samsung AI Center – Toronto)

- Part 3 : GENIE: Show Me the Data for Quantization (by Samsung Research)

- Part 4 : A Unified Pyramid Recurrent Network for Video Frame Interpolation (by Samsung R&D Institute - Nanjing)

- Part 5 : MobileVOS: Real-Time Video Object Segmentation Contrastive Learning meets Knowledge Distillation (by Samsung R&D Institute United Kingdom)

- Part 6 : LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models (by Samsung AI Center - Cambridge)

- Part 7 : Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style (by Samsung AI Center - Cambridge)


Motivation and Introduction

Video frame interpolation (VFI) is a classic low-level vision task that synthesizes non-existent intermediate frames between consecutive original frames. If the time interval between the original frames is kept fixed, VFI produces smoother videos at a higher frame rate; if the playback frame rate is kept fixed, VFI produces slow-motion videos. In addition, VFI supports many practical applications, including novel view synthesis [1], video compression [2], cartoon creation [3], etc.

With recent advances in optical flow [4,5,6], flow-guided synthesis has become a popular framework for video frame interpolation. Specifically, inter-frame optical flow is first estimated, and then leveraged to guide the synthesis of intermediate frames via differentiable warping. Despite promising performance [7,8,9], existing flow-guided methods share some common disadvantages. First, while optical flow is typically estimated from coarse to fine by a pyramid network, the intermediate frame is synthesized in a single pass by a synthesis network, missing the opportunity to iteratively refine the interpolation for high-resolution inputs. Second, the obvious artifacts in warped frames may degrade interpolation performance, an issue that has been largely overlooked by existing works. Last, existing methods typically rely on heavy model architectures to achieve good performance, preventing them from being deployed on platforms with limited resources.

This blog describes UPR-Net, a novel Unified Pyramid Recurrent Network for video frame interpolation. Cast in a flexible pyramid framework, UPR-Net exploits lightweight recurrent modules for both bi-directional flow estimation and intermediate frame synthesis. At each pyramid level, it leverages the estimated bi-directional flow to generate forward-warped representations for frame synthesis; across pyramid levels, it enables iterative refinement of both the optical flow and the intermediate frame. In particular, we show that our iterative synthesis strategy can significantly improve the robustness of frame interpolation on large motion cases. Despite being extremely lightweight (1.7M parameters), our base version of UPR-Net achieves excellent performance on a large range of benchmarks. Figure 1 shows the comparison of performance and model size on Vimeo90K [10] and the hard subset of SNU-FILM [11], where our UPR-Net series show competitive results on Vimeo90K, and form an upper envelope of all current frame interpolation methods on the hard subset of SNU-FILM.

Figure 1.  Comparison of performance and model size on Vimeo90K and the hard subset of SNU-FILM. Our UPR-Net series achieve state-of-the-art accuracy with extremely small parameters.

Overview of Flow-guided Video Frame Interpolation

Depending on whether or not optical flow is incorporated to compensate for inter-frame motion, existing methods can be roughly classified into two categories: flow-agnostic methods [11,12,13] and flow-guided synthesis [7,8,9,14]. This work follows the popular pipeline of flow-guided synthesis: estimating optical flow for the desired time step, warping the input frames and their context features based on the optical flow, and synthesizing the intermediate frame from the warped representations.

Where technical choices diverge in flow-guided synthesis is the warping operation and the optical flow it requires. Backward warping is traditionally used for frame interpolation, but acquiring high-quality bilateral intermediate flow for it is often challenging. Forward warping can directly use linearly-scaled bi-directional flow between the input frames, and has recently emerged as a promising direction for frame interpolation [8,9].

In this work, we adopt forward warping due to its simplicity in the full pipeline design. However, during forward warping, conflicts among pixels mapped to the same target location must be carefully addressed. For this, we adopt average splatting [8] as our forward warping, which simply averages the pixels mapped to the same location.
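To make the conflict handling concrete, here is a minimal NumPy sketch of average splatting. It uses nearest-neighbour splatting rather than the bilinear kernel of [8], and is meant only to illustrate the idea, not to reproduce UPR-Net's implementation: pixels landing on the same target are averaged, and targets that receive no pixel are reported as holes.

```python
import numpy as np

def average_splat(frame, flow):
    """Nearest-neighbour average splatting: every source pixel is pushed to its
    flow target, and pixels landing on the same target are averaged.
    frame: (H, W, C) float array; flow: (H, W, 2) array of per-pixel (dx, dy)."""
    h, w, _ = frame.shape
    acc = np.zeros_like(frame, dtype=np.float64)   # accumulated colours per target
    cnt = np.zeros((h, w, 1), dtype=np.float64)    # number of contributors per target
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.rint(xs + flow[..., 0]).astype(int)    # target x for each source pixel
    ty = np.rint(ys + flow[..., 1]).astype(int)    # target y for each source pixel
    valid = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    np.add.at(acc, (ty[valid], tx[valid]), frame[valid])
    np.add.at(cnt, (ty[valid], tx[valid]), 1.0)
    warped = acc / np.maximum(cnt, 1.0)            # average conflicting pixels
    holes = cnt[..., 0] == 0                       # targets that no source pixel reached
    return warped, holes
```

To warp a frame towards an intermediate time t, the corresponding forward flow is first linearly scaled (by t for one input frame and by 1 - t for the other), as noted above.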

Unified Pyramid Recurrent Network for Frame Interpolation

Figure 2.  Overview of our UPR-Net. Given two input frames, we first construct image pyramids for them, then apply a recurrent structure across pyramid levels to repeatedly refine the estimated bi-directional flow and intermediate frame. Our recurrent structure consists of a feature encoder that extracts multi-scale features for the input frames, a bi-directional flow module that refines the bi-directional flow with correlation-injected features, and a frame synthesis module that refines the intermediate frame estimate with forward-warped representations.

We illustrate the overall pipeline of UPR-Net in Figure 2. It unifies bi-directional flow estimation and frame synthesis within a single pyramid structure, and shares the weights across pyramid levels. This macro pyramid recurrent architecture has two advantages: 1) it reduces the parameters of the full pipeline; 2) it allows the number of pyramid levels to be customized at test time to handle large motions.

Given a pair of consecutive frames, our goal is to synthesize the non-existent intermediate frame at the desired time step. UPR-Net tackles this task via an iterative refinement procedure across several image pyramid levels, from the top level with down-sampled frames to the bottom level with the original input frames. At each pyramid level, UPR-Net employs a feature encoder to extract multi-scale CNN features for both input frames. Then, the CNN features and the optical flow up-sampled from the previous level are processed by a bi-directional flow module to produce refined bi-directional flow. The refined optical flow is leveraged to forward-warp the input frames and multi-scale CNN features. Combining the warped representations with the interpolation up-sampled from the previous level, a frame synthesis module is employed to generate a refined intermediate frame. This estimation process is repeated until the final interpolation is generated at the bottom pyramid level, as sketched below.
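The PyTorch-style sketch below illustrates this coarse-to-fine recurrence. The module interfaces (modules.encoder, modules.flow_module, modules.splat, modules.synthesis), the default number of levels, the flow channel layout, and the bilinear resizing choices are assumptions made for exposition; please refer to the released code for the actual implementation.

```python
import torch
import torch.nn.functional as F

def interpolate_pyramid(frame0, frame1, modules, num_levels=3, t=0.5):
    """Sketch of the coarse-to-fine recurrence described above. `modules` bundles
    the shared feature encoder, bi-directional flow module, forward-warping
    (splatting) operator and frame synthesis module; every call signature here
    is a placeholder, not the interface of the released code."""
    flow, interp = None, None                          # bi-directional flow, frame estimate
    for level in range(num_levels - 1, -1, -1):        # from coarsest level to original resolution
        scale = 1.0 / (2 ** level)
        img0 = F.interpolate(frame0, scale_factor=scale, mode='bilinear', align_corners=False)
        img1 = F.interpolate(frame1, scale_factor=scale, mode='bilinear', align_corners=False)
        feat0, feat1 = modules.encoder(img0), modules.encoder(img1)   # shared feature encoder

        if flow is not None:                           # up-sample the previous estimates
            flow = 2.0 * F.interpolate(flow, scale_factor=2, mode='bilinear', align_corners=False)
            interp = F.interpolate(interp, scale_factor=2, mode='bilinear', align_corners=False)

        flow = modules.flow_module(feat0, feat1, flow)                # refine bi-directional flow
        # forward-warp frames and features with time-scaled flow; channels 0:2 and 2:4
        # are assumed to hold the two flow directions
        warped0 = modules.splat(img0, feat0, t * flow[:, 0:2])
        warped1 = modules.splat(img1, feat1, (1.0 - t) * flow[:, 2:4])
        interp = modules.synthesis(warped0, warped1, interp)          # refine the frame estimate
    return interp                                      # full-resolution interpolation
```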

Figure 3.  Module design for bi-directional flow estimation and intermediate frame synthesis. (a) Our bi-directional flow module simultaneously estimates the bi-directional flow between input frames with correlation-injected features. (b) Our frame synthesis module explicitly refines the up-sampled interpolation from the previous pyramid level with warped representations.

Bi-directional Flow Module

As shown in Figure 3 (a), our optical flow module is designed to simultaneously estimate the bi-directional flow between the input frames with correlation-injected features. Specifically, at each pyramid level, based on the initial optical flow up-sampled from the previous pyramid level, the CNN features of the input frames are forward-warped towards the hidden middle frame to align their pixels. Then, the warped features are leveraged to construct a partial correlation volume. Finally, the warped features, the correlation volume, and the up-sampled bi-directional flow are concatenated and processed by a 6-layer CNN to predict the refined bi-directional flow.
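For intuition on the partial correlation volume, the hedged PyTorch sketch below correlates each pixel of one (already forward-warped) feature map with a local window of the other. The window radius and the normalisation are assumptions chosen for illustration, not necessarily the configuration used in UPR-Net.

```python
import torch
import torch.nn.functional as F

def local_correlation(feat0, feat1, radius=3):
    """Partial (local) correlation volume: for every pixel of feat0, take the dot
    product with feat1 features inside a (2*radius+1)^2 window around it.
    feat0, feat1: (N, C, H, W) feature maps already warped towards the middle frame."""
    n, c, h, w = feat0.shape
    k = 2 * radius + 1
    # gather the (2r+1)^2 shifted copies of feat1 around each location
    feat1_win = F.unfold(feat1, kernel_size=k, padding=radius)   # (N, C*k*k, H*W)
    feat1_win = feat1_win.view(n, c, k * k, h, w)
    corr = (feat0.unsqueeze(2) * feat1_win).sum(dim=1)           # (N, k*k, H, W)
    return corr / c ** 0.5                                       # scale-normalised correlation
```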

Frame Synthesis Module

As shown in Figure 3 (b), we employ a simple encoder-decoder synthesis network to predict the intermediate frame from forward-warped representations. Our frame synthesis module follows the design of previous context-aware synthesis networks [8,9], but has two distinctive features. First, we feed the synthesis network with the up-sampled estimate of the intermediate frame as an explicit reference for further refinement. Second, our synthesis network is extremely lightweight, shared across pyramid levels, and much simpler than the previous grid-like architectures [8,9].
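The toy module below shows, purely for illustration, how such a synthesis network can consume the warped frames together with the up-sampled previous estimate. The channel sizes, depth, and residual formulation are placeholders, not the architecture of UPR-Net.

```python
import torch
import torch.nn as nn

class TinySynthesis(nn.Module):
    """Toy stand-in for a lightweight encoder-decoder synthesis module: it refines
    the up-sampled interpolation given the two warped frames. Channel sizes, depth,
    and the residual form are placeholders, not the configuration used in UPR-Net."""
    def __init__(self, in_ch=9, hidden=32):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, stride=2, padding=1), nn.PReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.PReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.PReLU(),
            nn.Conv2d(hidden, 3, 3, padding=1))

    def forward(self, warped0, warped1, interp_up):
        # the up-sampled previous estimate serves as an explicit reference
        x = torch.cat([warped0, warped1, interp_up], dim=1)
        return interp_up + self.dec(self.enc(x))   # refine the reference (residual form assumed)

# quick shape check on a dummy 64x64 input
net = TinySynthesis()
out = net(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
assert out.shape == (1, 3, 64, 64)
```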

Analysis of Iterative Synthesis

Iterative synthesis from coarse to fine has proven beneficial for high-resolution image synthesis. In this work, we reveal that it can also significantly improve the robustness of frame interpolation on large motion cases.

To understand this, we implement a plain synthesis baseline, which does not feed the frame synthesis module with the up-sampled estimate of the intermediate frame. We comprehensively compare the frames interpolated by plain synthesis and by iterative synthesis on large motion cases, and draw the following non-trivial conclusions.

Figure 4.  (a) Plain synthesis may lead to poor interpolation, due to the obvious artifacts on warped frames. By contrast, iterative synthesis enables robust interpolation, although artifacts also exist on warped frames. (b) While plain synthesis can produce good interpolation on much lower (1/8) resolution, our iterative synthesis strategy can generate compelling results on both low and high resolutions.

As shown in the first row of Figure 4 (a), we observe that plain synthesis may produce poor interpolation, due to the obvious artifacts on warped frames. It is worth noting that these artifacts (e.g., holes in forward-warped frames) are inevitable for large motion cases, even when estimated optical flow is accurate. Nevertheless, as shown in the second row of Figure 4 (a), our iterative synthesis can produce robust interpolation for large motion cases.

We hypothesize that for iterative synthesis, the up-sampled interpolation from a lower-resolution pyramid level may have fewer or no artifacts due to the smaller motion magnitude. Thus, it can guide the synthesis module to produce robust interpolation at higher-resolution levels. This hypothesis is supported by Figure 4 (b), where we interpolate the same example at reduced resolutions. We find that the artifacts produced by plain synthesis are progressively eased as the resolution is reduced. In particular, at 1/8 and 1/16 resolutions, plain synthesis gives good interpolation without artifacts. By contrast, our iterative synthesis generates good interpolation at all scales, by leveraging the interpolations from the low-resolution levels.

Quantitative Results

We construct three versions of UPR-Net by scaling the feature channels: UPR-Net, UPR-Net large, and UPR-Net LARGE, with increasing model size. While our UPR-Net series are trained only on Vimeo90K [10], we evaluate them on a broad range of benchmarks with different resolutions, including UCF101 [15], Vimeo90K [10], SNU-FILM [11], and 4K1000FPS [16].

In Table 1, we report quantitative results on low-resolution and moderate-resolution benchmarks, including UCF101, Vimeo90K, and SNU-FILM. In particular, we compare our UPR-Net series with many state-of-the-art methods, including DAIN, CAIN, SoftSplat, AdaCoF, BMBC, CDFI, ABME, XVFI, RIFE, EBME, VFIformer, and IFRNet. Our UPR-Net LARGE model achieves the best performance on UCF101 and the second best result on Vimeo90K. Our UPR-Net large and UPR-Net models also achieve excellent accuracy on these two benchmarks. In particular, our UPR-Net model, which has only 1.7M parameters, outperforms many recent large models on UCF101, including SoftSplat and ABME. On SNU-FILM, when measured with PSNR, our UPR-Net series outperform all previous state-of-the-art methods. In particular, our base UPR-Net model outperforms the large VFIformer model, in part due to its capability of handling challenging motion.

In Table 2, we report the 8x interpolation results on 4K1000FPS. Our UPR-Net large model achieves the best performance. Furthermore, our method enables arbitrary-time frame interpolation, and the bi-directional flow only needs to be estimated once for multi-frame interpolation.
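The snippet below sketches this reuse pattern for 8x interpolation: one bi-directional flow estimate is linearly scaled to each target time step before forward warping. The arrays are random stand-ins, and the warping and synthesis calls are omitted, since only the reuse-and-scale pattern is being illustrated.

```python
import numpy as np

# Stand-in bi-directional flow maps (H=240, W=424), estimated once per input frame pair.
flow_01 = np.random.randn(240, 424, 2).astype(np.float32)   # F_{0->1}
flow_10 = np.random.randn(240, 424, 2).astype(np.float32)   # F_{1->0}

for i in range(1, 8):                  # seven intermediate frames for 8x interpolation
    t = i / 8.0
    flow_t0 = t * flow_01              # flow used to forward-warp frame 0 towards time t
    flow_t1 = (1.0 - t) * flow_10      # flow used to forward-warp frame 1 towards time t
    # ...forward-warp both frames with these flows and run the frame synthesis module
```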

Table 1.  Quantitative (PSNR/SSIM) comparisons to state-of-the-art methods on UCF101, Vimeo90K and SNU-FILM benchmarks. RED: best performance, BLUE: second best performance.

Table 2.  Comparisons on 4K1000FPS for 8x interpolation.

Figure 5.  Qualitative comparisons on SNU-FILM. The first example (row 1~2) is from the hard subset, and the second example (row 3~4) is from the extreme subset.

Qualitative Results

Figure 5 shows two examples from the hard and extreme subsets of SNU-FILM, respectively. Our methods produce better interpolation results than the IFRNet large model for local textures (first two rows), and give promising results for large motion cases (last two rows), much better than CAIN and VFIformer, and slightly better than ABME and IFRNet large. Figure 6 shows two interpolation examples from 4K1000FPS, where our models are robust to large motion, and give better interpolation for local textures than XVFI.

Figure 6.  Qualitative comparisons on 4K1000FPS. The first example is interpolated at t = 1/8, while the second example is interpolated at t = 1/2.

Ablation Study of Design Choices

Table 3.  Ablation studies of our design choices. Default settings (independent of benchmark datasets) are marked in gray.

In Table 3, we present ablation studies of the design choices of our UPR-Net on Vimeo90K, the hard subset of SNU-FILM, and the X-TEST set of 4K1000FPS. Specifically, we verify that: 1) the recurrent design of the bi-directional flow and frame synthesis modules significantly improves interpolation accuracy for high-resolution videos; 2) iterative synthesis enables more robust interpolation on large motion cases; 3) the unified pipeline (shared feature encoder) is simple and elegant, and achieves slightly better accuracy; 4) the correlation volume improves motion-based frame interpolation on all benchmarks; 5) context features are crucial to achieving good results on Vimeo90K, but, surprisingly, do not lead to obviously better performance on large motion benchmarks.

About Samsung R&D Institute China-Nanjing

Samsung R&D Institute China-Nanjing (SRC-N) was established in April 2004. SRC-N mainly engages in software development for TVs, smartphones, refrigerators, and other electronic products. SRC-N's core technologies span on-device AI vision, TV SW platform/services, mobile system/game solutions, and more. The broad mission of SRC-N's Intelligence Vision Lab (IVL) is to develop core AI vision technology that improves the user experience with Samsung devices. One of our research interests is picture quality enhancement with AI algorithms. Recently, IVL researchers developed a novel unified pyramid recurrent network (UPR-Net) for video frame interpolation. UPR-Net is extremely lightweight, yet achieves state-of-the-art performance on a large range of benchmarks. This work has been accepted by CVPR 2023.

Link to the paper

https://arxiv.org/pdf/2211.03456.pdf
Code and trained models : https://github.com/srcn-ivl/UPR-Net

References

[1] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deepstereo: Learning to predict new views from the world’s imagery. In CVPR, 2016.

[2] Guo Lu, Xiaoyun Zhang, Li Chen, and Zhiyong Gao. Novel integration of frame rate up conversion and HEVC coding based on rate-distortion optimization. TIP, 2017.

[3] Li Siyao, Shiyu Zhao, Weijiang Yu, Wenxiu Sun, Dimitris Metaxas, Chen Change Loy, and Ziwei Liu. Deep animation video interpolation in the wild. In CVPR, 2021.

[4] Tak-Wai Hui, Xiaoou Tang, and Chen Change Loy. LiteFlowNet: A lightweight convolutional neural network for optical flow estimation. In CVPR, 2018.

[5] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, 2018.

[6] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020.

[7] Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Depth-aware video frame interpolation. In CVPR, 2019.

[8] Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In CVPR, 2020.

[9] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In ICCV, 2017.

[10] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. IJCV, 2019.

[11] Myungsub Choi, Heewon Kim, Bohyung Han, Ning Xu, and Kyoung Mu Lee. Channel attention is all you need for video frame interpolation. In AAAI, 2020.

[12] Xianhang Cheng and Zhenzhong Chen. Video frame interpolation via deformable separable convolution. In AAAI, 2020.

[13] Simone Meyer, Abdelaziz Djelouah, Brian McWilliams, Alexander Sorkine-Hornung, Markus Gross, and Christopher Schroers. PhaseNet for video frame interpolation. In CVPR, 2018.

[14] Wenbo Bao, Wei-Sheng Lai, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. MEMC-Net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement. TPAMI, 2019.

[15] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

[16] Hyeonjun Sim, Jihyong Oh, and Munchurl Kim. XVFI: Extreme video frame interpolation. In ICCV, 2021.