Dynamic VFI (Video Frame Interpolation) with Integrated Difficulty Pre-Assessment

By Ban Chen, Samsung R&D Institute China-Nanjing
By Xin Jin, Samsung R&D Institute China-Nanjing

Introduction and Motivation

Video frame interpolation (VFI) aims to generate intermediate frames between consecutive frames. VFI is widely applied in industrial products, including slow-motion video generation, video editing, intelligent display devices, etc. Although recent advances in deep learning have brought performance improvements, VFI models are becoming increasingly computationally expensive, making them infeasible for practical applications. Accurate VFI models typically synthesize high-quality results at the cost of efficiency, while fast VFI models often suffer from unreasonable artifacts.

This work aims to adjust the trade-off between accuracy and efficiency for VFI in a data-driven manner. Based on our investigation, the performance gain of heavy models mainly comes from challenging frames with fast-moving objects or complex texture, whereas for static or clean frames, VFI models show similar performance regardless of model size. Motivated by this phenomenon, we design a dynamic VFI pipeline in which different VFI models are applied according to the interpolation difficulty of the input frames. Our pipeline enables flexible adjustment of the accuracy-efficiency trade-off for VFI by assigning easy samples to fast models for efficiency and difficult samples to heavy models for accuracy.

Figure 1.  Overview of the Dynamic VFI pipeline. First, the VFI-DPA model predicts an interpolation difficulty level for each image pair. We then classify input frames as easy or hard by comparing the predicted score with a customizable threshold. Finally, easy samples are sent to RIFE [2] for efficiency, while hard samples go through VFIformer [3] for better interpolation.

We argue that easy samples with small motion or clear texture can be handled well by simple models and thus do not require heavy computation. Therefore, we present a dynamic VFI pipeline dedicated to efficiently generating pleasing intermediate frames through dynamic VFI model selection. Specifically, it first leverages an assessment model (VFI-DPA) to measure the interpolation difficulty level of the input frames, and then dynamically selects an appropriate VFI model to generate the interpolation result.
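For concreteness, the following is a minimal sketch of this selection logic in PyTorch. The call signatures of the three models and the threshold value are illustrative assumptions, not the paper's exact interface; in practice the threshold is the customizable knob that shifts the accuracy-efficiency balance.

import torch

def interpolate_dynamic(frame0, frame1, vfi_dpa, rife, vfiformer,
                        threshold=0.5):
    # Predict an interpolation difficulty score for the frame pair
    # without running any interpolation model.
    with torch.no_grad():
        score = vfi_dpa(frame0, frame1)
    # A higher score means a lower interpolation difficulty level
    # (see Figure 3), so high-scoring pairs are "easy".
    if score >= threshold:
        return rife(frame0, frame1)       # easy sample: fast model
    return vfiformer(frame0, frame1)      # hard sample: heavy model

Raising the threshold routes more frame pairs to the heavy model (favoring accuracy); lowering it favors efficiency.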

VFI Difficulty Pre-assessment Model (VFI-DPA)

Figure 2.  Overview pipeline of the VFI difficulty pre-assessment model. It consists of a feature extraction module, a feature fusion module, and a patch-wise prediction module.

The overall structure of our VFI difficulty pre-assessment model (VFI-DPA) is shown in Figure 2. Given two frames, VFI-DPA predicts an interpolation difficulty score without seeing the interpolation result. It can be considered a lightweight and flexible plugin that can be easily employed to choose VFI models (not restricted to RIFE and VFIformer) dynamically based on the given inputs.

In our dynamic VFI, VFI-DPA is built upon AHIQ [1], a state-of-the-art full-reference image quality assessment (FR-IQA) model. Our VFI-DPA introduces several innovations that significantly reduce the computational cost of AHIQ while still improving its performance for interpolation difficulty pre-assessment.
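The sketch below shows only the three-module layout from Figure 2 (feature extraction, feature fusion, patch-wise prediction); the plain convolutional blocks are placeholders and do not reproduce AHIQ's or VFI-DPA's actual layers.

import torch
import torch.nn as nn

class VFIDPASketch(nn.Module):
    # Structural sketch: feature extraction -> feature fusion ->
    # patch-wise prediction. Layer choices are illustrative only.
    def __init__(self, channels=64):
        super().__init__()
        # Feature extraction: a shared encoder applied to both frames.
        self.extract = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Feature fusion: merge the two frames' feature maps.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(),
        )
        # Patch-wise prediction: one score per spatial location,
        # averaged into a single difficulty score for the pair.
        self.predict = nn.Conv2d(channels, 1, 1)

    def forward(self, frame0, frame1):
        f0, f1 = self.extract(frame0), self.extract(frame1)
        fused = self.fuse(torch.cat([f0, f1], dim=1))
        patch_scores = self.predict(fused)        # (B, 1, H', W')
        return patch_scores.mean(dim=(1, 2, 3))   # (B,) scalar scores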

VFI Difficulty Assessment Dataset

Figure 3.  Example frames from our VFI Difficulty Assessment dataset. Rows from top to bottom show easy to hard cases; a higher difficulty score represents a lower interpolation difficulty level.

We contribute a large-scale Video Frame Interpolation Difficulty Assessment dataset to train and validate VFI-DPA. Our dataset contains 13,030 triplets, where each triplet consists of three neighboring frames and an interpolation difficulty score. The annotated interpolation difficulty increases from Level 4 to Level 1. Figure 3 illustrates reference images from different difficulty levels; the difficult cases for frame interpolation typically have large motions and rich texture. In this work, we roughly categorize image pairs as easy or hard during inference, but it is worth noting that our dataset has four difficulty levels and supports more fine-grained model selection.
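A minimal loader for this triplet-plus-score annotation format might look as follows. The on-disk layout is not specified in this post, so the (path, path, path, score) sample tuples below are an assumed structure for illustration.

import torch
from torch.utils.data import Dataset
from torchvision.io import read_image

class VFIDifficultyDataset(Dataset):
    # `samples`: assumed list of (frame0_path, frame1_path,
    # frame2_path, difficulty_score) tuples, one per triplet.
    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        p0, p1, p2, score = self.samples[idx]
        # Each triplet holds three neighboring frames; the middle
        # frame is the ground-truth target when interpolating
        # between the first and third frames.
        frames = [read_image(p).float() / 255.0 for p in (p0, p1, p2)]
        return frames, torch.tensor(score, dtype=torch.float32)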

Experimental Performance

Table 1.  Dynamic VFI performance

As shown in Table 1, our pipeline surpasses RIFE by a large margin at a small extra computational cost. VFI-DPA models trained on datasets of different resolutions (256p, 720p, and 1080p) yield different accuracy-efficiency balances, and the VFI-DPA trained on the 720p dataset improves accuracy the most. We believe that our dynamic VFI framework can provide an effective solution for many industrial applications concerned with the accuracy-efficiency trade-off.

Figure 4.  Easy input frame pairs as pre-assessed by VFI-DPA. RIFE (third column) and VFIformer (fourth column) generate similar visual results. In this case, dynamic VFI chooses RIFE for high efficiency.

Figure 5.  Hard input frame pairs as pre-assessed by VFI-DPA show large motion between neighboring frames. RIFE suffers from severe artifacts, such as missing or blurry objects. To avoid this, dynamic VFI chooses VFIformer for better performance.

Our integrated pipeline enables flexible adjustment of the accuracy-efficiency trade-off for VFI by assigning easy samples to fast models for efficiency and difficult samples to heavy models for accuracy. For example, on the easy inputs in Figure 4, RIFE and VFIformer generate similar visual results, so dynamic VFI chooses RIFE for high efficiency. The hard inputs in Figure 5 show large motion between neighboring frames, where RIFE suffers from severe artifacts such as missing or blurry objects; to avoid this, dynamic VFI chooses VFIformer for better performance.

Link to the paper

https://arxiv.org/pdf/2304.12664.pdf

References

[1] S. Lao, Y. Gong, S. Shi, S. Yang, T. Wu, J. Wang, W. Xia, and Y. Yang, “Attentions help cnns see better: Attention-based hybrid image quality assessment network,” in CVPR, 2022.

[2] Z. Huang, T. Zhang, W. Heng, B. Shi, and S. Zhou, “Real-time intermediate flow estimation for video frame interpolation,” in ECCV, 2022.

[3] L. Lu, R. Wu, H. Lin, J. Lu, and Y. Jia, “Video frame interpolation with transformer,” in CVPR, 2022.