We developed a novel architecture named UPR-Net (Unified Pyramid Recurrent Network) for flow-guided video frame interpolation. UPR-Net is extremely lightweight, yet achieves state-of-the-art performance across a wide range of benchmarks. The preliminary results of UPR-Net were published at CVPR 2023, and the extended journal version has recently been accepted by the International Journal of Computer Vision (IJCV). This blog article is based on the journal version of UPR-Net.
- Related blog article: [CVPR 2023 Series #4] A Unified Pyramid Recurrent Network for Video Frame Interpolation
Figure 1. Qualitative results on challenging 4K frames, where UPR-Net significantly outperforms existing state-of-the-art solutions.
Video frame interpolation (VFI) is a classic low-level vision task that aims to synthesize non-existent intermediate frames between original consecutive frames. If the time interval spanned by the original frames is kept fixed, VFI yields smoother videos at a higher frame rate; if the playback frame rate is kept fixed, VFI produces slow-motion videos. The VFI technique plays an essential role in many practical applications, including novel view synthesis [1], video compression [2], and cartoon creation [3].
Flow-guided synthesis provides a popular framework for video frame interpolation [4][5][6], where optical flow is first estimated to warp the input frames, and the intermediate frame is then synthesized from the warped representations. Within this framework, optical flow is typically estimated in a coarse-to-fine manner by a pyramid network [7], but the intermediate frame is commonly synthesized in a single pass, missing the opportunity to refine possibly imperfect synthesis in high-resolution and large-motion cases. While cascading several synthesis networks is a natural idea, it is nontrivial to unify iterative estimation of both optical flow and the intermediate frame into a compact, flexible, and general framework.
In this paper, we present UPR-Net, a novel Unified Pyramid Recurrent Network for frame interpolation. Cast in a flexible pyramid framework, UPR-Net exploits lightweight recurrent modules for both bi-directional flow estimation and intermediate frame synthesis. At each pyramid level, it leverages estimated bi-directional flow to generate forward-warped representations for frame synthesis; across pyramid levels, it enables iterative refinement for both optical flow and intermediate frame. We show that our iterative synthesis significantly improves the interpolation robustness on large motion cases, and the recurrent module design enables flexible resolution-aware adaptation in testing. When trained on low-resolution data, UPR-Net can achieve excellent performance on both low- and high-resolution benchmarks. Despite being extremely lightweight (1.7M parameters), the base version of UPR-Net competes favorably with many methods that rely on much heavier architectures. Figure 1 shows the interpolation results of our UPR-Net on extremely challenging 4K frames. Code and trained models are publicly available at: https://github.com/srcn-ivl/UPR-Net.
Figure 2. Overview of our UPR-Net. Given two input frames, we first construct image pyramids for them, then apply a recurrent structure across pyramid levels to repeatedly refine the estimated bi-directional flow and intermediate frame. Our recurrent structure consists of a feature encoder that extracts multi-scale features for the input frames, a bi-directional flow module that refines bi-directional flow with correlation-injected features, and a frame synthesis module that refines the intermediate frame estimate with forward-warped representations.
The overall pipeline of UPR-Net is illustrated in Figure 2. It unifies bi-directional flow estimation and frame synthesis within a pyramid structure, and shares the weights across pyramid levels. This macro pyramid recurrent architecture has two advantages: 1) it reduces the parameters of the full pipeline; 2) it allows the number of pyramid levels to be customized in testing to handle large motions.
Given a pair of consecutive frames, our goal is to synthesize the non-existent intermediate frame at the desired time step. UPR-Net tackles this task via an iterative refinement procedure across several image pyramid levels, from the top level with down-sampled frames to the bottom level with the original input frames. At each pyramid level, UPR-Net employs a feature encoder to extract multi-scale CNN features for both input frames. Then, the CNN features and the optical flow up-sampled from the previous level are processed by a bi-directional flow module to produce refined bi-directional flow. The refined optical flow is leveraged to forward-warp the input frames and multi-scale CNN features. Combining the warped representations with the interpolation up-sampled from the previous level, a frame synthesis module generates a refined intermediate frame. This estimation process is repeated until the final interpolation is generated at the bottom pyramid level. For detailed network structures of our bi-directional flow module and frame synthesis module, please refer to our released code.
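To make this pipeline concrete, the following is a minimal sketch of the coarse-to-fine recurrence in PyTorch-style Python. The module arguments (encoder, flow_module, synth_module) and their call signatures are illustrative placeholders rather than the actual interfaces; please refer to the released code for the exact implementation.

```python
import torch.nn.functional as F

def interpolate_pyramid(img0, img1, t=0.5, num_levels=3,
                        encoder=None, flow_module=None, synth_module=None):
    """Sketch of UPR-Net's coarse-to-fine recurrence (illustrative only)."""
    flow, interp = None, None
    # Iterate from the coarsest (most down-sampled) level to the original resolution,
    # reusing the same weight-shared modules at every level.
    for level in reversed(range(num_levels)):
        scale = 1.0 / (2 ** level)
        i0 = F.interpolate(img0, scale_factor=scale, mode='bilinear', align_corners=False)
        i1 = F.interpolate(img1, scale_factor=scale, mode='bilinear', align_corners=False)

        # Shared feature encoder extracts multi-scale CNN features for both frames.
        feat0, feat1 = encoder(i0), encoder(i1)

        # Up-sample estimates from the previous (coarser) level; flow values are
        # doubled because pixel displacements grow with the spatial resolution.
        if flow is not None:
            flow = 2.0 * F.interpolate(flow, scale_factor=2, mode='bilinear', align_corners=False)
            interp = F.interpolate(interp, scale_factor=2, mode='bilinear', align_corners=False)

        # Refine bi-directional flow, then refine the intermediate frame
        # from forward-warped representations.
        flow = flow_module(feat0, feat1, flow)
        interp = synth_module(i0, i1, feat0, feat1, flow, interp, t)
    return interp
```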
Coarse-to-fine processing is effective for high-resolution image synthesis. In this work, we reveal that it can also significantly improve the robustness of frame interpolation on large motion cases.
To understand this, we implement a plain synthesis baseline, which does not feed the up-sampled estimate of the intermediate frame to the frame synthesis module. We comprehensively compare the frames interpolated by plain synthesis and by iterative synthesis on large-motion cases, and draw the following non-trivial conclusions.
Figure 3. (a) Plain synthesis may lead to poor interpolation, due to the obvious artifacts on warped frames. By contrast, iterative synthesis enables robust interpolation, although artifacts also exist on warped frames. (b) While plain synthesis can produce good interpolation at lower (1/8) resolution, our iterative synthesis can generate compelling results at both low and high resolutions.
As shown in the first row of Figure 3 (a), we observe that plain synthesis may produce poor interpolation, due to the obvious artifacts on warped frames. It is worth noting that these artifacts (e.g., holes in forward-warped frames) are inevitable for large-motion cases, even when the estimated optical flow is accurate. Nevertheless, as shown in the second row of Figure 3 (a), our iterative synthesis can produce robust interpolation for large-motion cases. We hypothesize that for iterative synthesis, the up-sampled interpolation from the lower-resolution pyramid level may have fewer or no artifacts due to the smaller motion magnitude, and can thus guide the synthesis module to produce robust interpolation at higher-resolution levels. We verify this hypothesis in Figure 3 (b), where we interpolate the same example at reduced resolutions with plain synthesis and iterative synthesis.
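In code terms, the only difference between the two baselines is whether the up-sampled intermediate-frame estimate from the coarser level is passed to the synthesis module. Using the hypothetical synth_module interface sketched above:

```python
# Plain synthesis: the synthesis module only sees the current level's warped
# representations, so artifacts such as holes from forward warping must be
# repaired from scratch at every level.
interp = synth_module(i0, i1, feat0, feat1, flow, None, t)

# Iterative synthesis: the up-sampled estimate from the coarser level, which tends
# to contain fewer artifacts thanks to its smaller motion magnitude, guides the
# refinement at the current (higher-resolution) level.
interp = synth_module(i0, i1, feat0, feat1, flow, interp_coarse_upsampled, t)
```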
The recurrent design of UPR-Net not only reduces the number of parameters, but also allows us to perform test-time resolution-aware adaptation without any extra training. By contrast, simply cascading several optical flow and frame synthesis networks into a pyramid network with a fixed number of levels makes it difficult to adapt to various spatial resolutions in testing. Here, we describe two adaptation strategies to handle the extremely large motions (in high-resolution videos) beyond those seen in the training phase.
Thanks to our recurrent module design, we can customize the number of pyramid levels in testing to handle large motion: roughly, each time the test resolution doubles relative to the training resolution, one extra pyramid level should be added. In addition, the design of UPR-Net allows estimates from any previous level to be up-sampled directly when necessary, i.e., the finest pyramid levels can be skipped. We empirically verify that estimating at such reduced resolutions is beneficial for accurately capturing the extremely large motion in 4K videos.
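As a rough illustration of this rule, the number of test-time pyramid levels can be derived from the ratio between the test and training resolutions. The heuristic below is a hedged sketch (including the default of 3 training levels), not the exact logic of the released code.

```python
import math

def test_pyramid_levels(train_size, test_size, train_levels=3):
    """Add one extra pyramid level each time the test resolution doubles
    relative to the training resolution (illustrative heuristic)."""
    ratio = max(test_size) / max(train_size)      # e.g., 3840 / 256 = 15
    extra = max(0, math.ceil(math.log2(ratio)))   # 15x larger -> 4 extra levels
    return train_levels + extra

# Example: trained on 256x256 crops, tested on 4K (2160x3840) frames.
print(test_pyramid_levels((256, 256), (2160, 3840)))  # -> 3 + 4 = 7
```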
Table 1. Ablation studies on the architecture design of our UPR-Net.
Architecture design. In Table 1, we present ablation studies of the design choices of our UPR-Net on Vimeo90K [8], the hard subset of SNU-FILM [9], and X-TEST of 4K1000FPS [10]. Specifically, we verify that: 1) the recurrent design of the bi-directional flow and frame synthesis modules significantly improves interpolation accuracy for high-resolution videos; 2) iterative synthesis enables more robust interpolation on large-motion cases; 3) the unified pipeline (with a shared feature encoder) is simple and elegant, and achieves slightly better accuracy; 4) the correlation volume improves motion-based frame interpolation on all benchmarks; 5) context features are crucial for achieving good results on Vimeo90K, but, surprisingly, do not lead to obviously better performance on large-motion benchmarks.
Table 2. Ablation studies on the training datasets of our UPR-Net series for multi-frame interpolation.
Training data. By default, our UPR-Net is trained on triplets for intermediate frame interpolation. However, UPR-Net naturally enables arbitrary-time multi-frame interpolation, since forward warping can use linearly scaled bi-directional flow towards an arbitrary-time intermediate frame. In Table 2, we report the results of UPR-Net for 4x interpolation on the HD dataset [4], and 8x interpolation on X-TEST. We show that even when trained on triplets, UPR-Net achieves accuracy similar to RIFEm trained on septuplets. Furthermore, training on septuplets significantly boosts the performance of UPR-Net for multi-frame interpolation.
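Since UPR-Net relies on forward warping, interpolating at an arbitrary time t only requires linearly scaling the bi-directional flow, which is estimated once per input pair. A minimal sketch is given below; forward_warp stands in for a splatting operator (e.g., softmax splatting [5]) and is not part of the released API.

```python
def warp_to_time(img0, img1, flow_01, flow_10, t, forward_warp):
    """Forward-warp both inputs towards time t in (0, 1) by linearly scaling
    the bi-directional flow (illustrative; the flow is estimated only once)."""
    flow_0t = t * flow_01            # flow from frame 0 towards time t
    flow_1t = (1.0 - t) * flow_10    # flow from frame 1 towards time t
    return forward_warp(img0, flow_0t), forward_warp(img1, flow_1t)

# 4x interpolation reuses the same bi-directional flow for t = 0.25, 0.5, 0.75:
# warped_pairs = [warp_to_time(img0, img1, f01, f10, t, splat) for t in (0.25, 0.5, 0.75)]
```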
We construct three versions of UPR-Net by scaling the feature channels: UPR-Net, UPR-Net large, and UPR-Net LARGE, with increasing model size. While our UPR-Net series are trained only on Vimeo90K [8], we evaluate them on a broad range of benchmarks with different resolutions, including UCF101 [11], Vimeo90K [8], SNU-FILM [9], and X-TEST from 4K1000FPS [10].
Table 3. Quantitative comparisons on low-resolution UCF101 and Vimeo90K benchmarks for middle frame interpolation.
Low-resolution benchmarks. In Table 3, we report quantitative results on low-resolution benchmarks (UCF101 and Vimeo90K) for middle frame interpolation. We compare UPR-Net series with many state-of-the-art methods. Our UPR-Net LARGE model achieves the best performance on UCF101 and second best result on Vimeo90K. UPR-Net large and UPR-Net also achieve excellent accuracy on these two benchmarks. In particular, our UPR-Net model, which only has 1.7M parameters, outperforms many recent large models on UCF101, including SoftSplat and ABME.
Table 4. Quantitative comparisons on the moderate-resolution SNU-FILM dataset for middle frame interpolation.
Moderate-resolution benchmark. In Table 4, we report our results on moderate-resolution SNU-FILM dataset for middle frame interpolation. When measured with PSNR, our UPR-Net series outperform all previous state-of-the-art methods. In particular, our base UPR-Net model outperforms the large VFIformer model, in part due to its capability of handling challenging motion.
Table 5. Quantitative comparisons on the moderate-resolution HD dataset for multi-frame interpolation.
In Table 5, we report the 4x interpolation results on the HD dataset. Our UPR-Net series outperforms all other methods trained with the Vimeo90K triplet dataset, and is only slightly inferior to RIFEm, which is trained with Vimeo90K-Septuplet.
Table 6. Quantitative comparisons on the high-resolution X-TEST dataset for multi-frame interpolation.
High-resolution benchmark. In Table 6, we report the 8x interpolation results on X-TEST. Our UPR-Net large model (when skipping the last two pyramid levels for optical flow and the second-to-last level for frame synthesis) achieves the best performance. Furthermore, our method enables arbitrary-time frame interpolation, and the bi-directional flow only needs to be estimated once for multi-frame interpolation.
Figure 4. Qualitative comparisons on Vimeo90K, where UPR-Net produces better interpolation results in fine details.
Figure 4 shows two examples from the Vimeo90K dataset. Our UPR-Net gives finer interpolation for texture details such as the butterfly's feelers and the crossed fingers.
Figure 5. Qualitative comparisons on SNU-FILM. The first example is from the hard subset, while the others are from the extreme subset.
Figure 5 shows four examples from SNU-FILM. Our methods produce better interpolation results than the IFRNet large model for local textures, and give promising results for the large-motion case: much better than CAIN and VFIformer, and slightly better than ABME and IFRNet large.
Figure 6. Qualitative comparisons on two examples from the 4K X-TEST dataset, which contain extremely large motion.
Figure 6 shows two interpolation examples from X-TEST, where our models are robust to large motion, and give better interpolation for local textures than XVFI.
Figure 7. A failure case on X-TEST, where our UPR-Net (with iterative synthesis) fails to fully resolve the artifacts in warped frames.
Figure 7 shows a failure case on X-TEST. The obvious holes in forward-warped frames may lead to artifacts in the interpolated frame when using plain synthesis. In this case, our iterative synthesis strategy gives better interpolation, but still suffers from slight artifacts (see the regions indicated by the orange arrows).
IJCV paper link: https://link.springer.com/article/10.1007/s11263-024-02164-x
Code and trained models: https://github.com/srcn-ivl/UPR-Net
[1] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deepstereo: Learning to predict new views from the world’s imagery. In CVPR, 2016.
[2] Guo Lu, Xiaoyun Zhang, Li Chen, and Zhiyong Gao. Novel integration of frame rate up conversion and HEVC coding based on rate-distortion optimization. TIP, 2017.
[3] Li Siyao, Shiyu Zhao, Weijiang Yu, Wenxiu Sun, Dimitris Metaxas, Chen Change Loy, and Ziwei Liu. Deep animation video interpolation in the wild. In CVPR, 2021.
[4] Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Depth-aware video frame interpolation. In CVPR, 2019.
[5] Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In CVPR, 2020.
[6] Xin Jin, Longhai Wu, Guotao Shen, Youxin Chen, Jie Chen, Jayoon Koo, and Cheul-Hee Hahm. Enhanced bi-directional motion estimation for video frame interpolation. In WACV, 2023.
[7] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, 2018.
[8] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. IJCV, 2019.
[9] Myungsub Choi, Heewon Kim, Bohyung Han, Ning Xu, and Kyoung Mu Lee. Channel attention is all you need for video frame interpolation. In AAAI, 2020.
[10] Hyeonjun Sim, Jihyong Oh, and Munchurl Kim. XVFI: Extreme video frame interpolation. In ICCV, 2021.
[11] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.