AI

Dynamic Video Frame Interpolator with Integrated Difficulty Pre-Assessment

By Chen Ban Samsung R&D Institute China - Nanjing

By Xin Jin Samsung R&D Institute China - Nanjing

By Youxin Chen Samsung R&D Institute China - Nanjing

By Longhai Wu Samsung R&D Institute China - Nanjing

By Jie Chen Samsung R&D Institute China - Nanjing

By Jayoon Koo Visual Display Business

By Cheul-Hee Hahm Visual Display Business

The IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is an annual flagship conference organized by the IEEE Signal Processing Society.

ICASSP is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. Its technical program presents the latest developments in research and technology and attracts thousands of professionals.

In this blog series, we introduce our research papers presented at ICASSP 2024. Here is the list.

#1. MELS-TTS: Multi-Emotion Multi-Lingual Multi-Speaker Text-To-Speech System via Disentangled Style Tokens (Samsung Research)

#2. Latent Filling: Latent Space Data Augmentation for Zero-Shot Speech Synthesis (Samsung Research)

#3. FSPEN: An Ultra-Lightweight Network for Real Time Speech Enhancement (Samsung R&D Institute China-Beijing)

#4. Enabling Device Control Planning Capabilities of Small Language Model (Samsung Research America)

#5. Dynamic Video Frame Interpolator with Integrated Difficulty Pre-Assessment (Samsung R&D Institute China-Nanjing)

#6. Object-Conditioned Bag of Instances for Few-Shot Personalized Instance Recognition (Samsung R&D Institute United Kingdom)

#7. Robust Speaker Personalisation Using Generalized Low-Rank Adaptation for Automatic Speech Recognition (Samsung R&D Institute India-Bangalore)

Motivation and Introduction

Video frame interpolation (VFI) aims to generate intermediate frames between consecutive frames. It is widely applied in video generation [1], video editing [2], frame rate up-conversion [3], etc. Despite accuracy gains driven by deep learning, VFI models are becoming increasingly heavy computationally, which makes them impractical for many real-world applications. This work aims to reduce the computational cost of VFI while maintaining the capacity to handle challenging cases. Our key motivation is that the strength of accurate (yet heavy) models shows mainly in challenging cases with fast-moving objects or complex texture, whereas for static or clean frames, simple and fast models can achieve competitive results. Our work follows the paradigm of sample-wise dynamic computation [4], with a novel model to evaluate sample difficulty. It differs from the spatial-wise dynamic network [5], which assigns varying amounts of computation to different regions. The latter fails to achieve excellent accuracy for VFI, probably due to region-wise inference.

We propose a dynamic VFI pipeline (see Figure 1), consisting of a VFI Difficulty Pre-Assessment (VFI-DPA) model, an off-the-shelf accurate (yet heavy) VFI model, and an off-the-shelf fast (yet less accurate) VFI model. VFI-DPA predicts an interpolation difficulty level for two input frames, based on which an appropriate (accurate or fast) VFI model is applied for frame interpolation. The design of VFI-DPA draws inspiration from full-reference image quality assessment (IQA), which predicts human perceptual judgements of image quality. We apply systematic modifications to AHIQ [6] (an IQA model) to obtain an extremely fast (0.012s per sample) and effective VFI difficulty pre-assessment model. To train and evaluate our VFI-DPA model, we collect a large-scale dataset with manually annotated interpolation difficulty levels.

Our dynamic VFI pipeline is very flexible: any suitable model can be applied as the accurate VFI model and the fast VFI model. In this work, we employ RIFE [7] as our fast VFI model, and VFIformer [8] or UPR-Net [9] as our accurate VFI model. RIFE is a real-time VFI algorithm with competitive results on public benchmarks, while VFIformer and UPR-Net integrate many subtle design choices to achieve superior performance on challenging cases.

Figure 1. Overview of our Dynamic VFI pipeline

Overview of Dynamic Video Frame Interpolation

Fig. 1 shows our dynamic pipeline with integrated VFI difficulty pre-assessment (VFI-DPA). First, VFI-DPA predicts an interpolation difficulty level for each pair of input frames. Then, easy samples are sent to RIFE for efficiency, while hard samples are passed through VFIformer or UPR-Net for better interpolation. We define 4 levels of interpolation difficulty, and by default consider the first two levels as easy and the other two as hard. In this dynamic pipeline, our key contributions are the architecture, loss, and dataset for VFI-DPA. For the off-the-shelf VFI models (RIFE, VFIformer and UPR-Net), we refer readers to the original papers for details.
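The routing step can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not our implementation: the function name interpolate_dynamic, the parameter max_easy_level, and the toy stand-in models are all hypothetical, and frames are represented as plain lists of floats rather than image tensors.

```python
from typing import Callable, List, Tuple

Frame = List[float]  # stand-in for an image; real frames would be tensors
Interpolator = Callable[[Frame, Frame], Frame]


def interpolate_dynamic(
    frame0: Frame,
    frame1: Frame,
    predict_level: Callable[[Frame, Frame], int],
    fast_model: Interpolator,
    accurate_model: Interpolator,
    max_easy_level: int = 2,
) -> Tuple[Frame, str]:
    """Route a frame pair to the fast or the accurate interpolator."""
    # Difficulty level in {1, 2, 3, 4}; levels up to max_easy_level are "easy".
    level = predict_level(frame0, frame1)
    if level <= max_easy_level:
        return fast_model(frame0, frame1), "fast"
    return accurate_model(frame0, frame1), "accurate"


# Toy stand-ins (hypothetical, for illustration only).
def toy_level(a: Frame, b: Frame) -> int:
    """Map the mean absolute frame difference to a level between 1 and 4."""
    mean_abs_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return min(4, 1 + int(mean_abs_diff * 4))


def toy_average(a: Frame, b: Frame) -> Frame:
    """Blend two frames; stands in for a real interpolation model."""
    return [(x + y) / 2 for x, y in zip(a, b)]
```

In this sketch, raising max_easy_level sends more pairs to the fast model, and lowering it sends more pairs to the accurate model.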

Video Frame Interpolation Difficulty Pre-Assessment

Figure 2. Our VFI Difficulty Pre-Assessment (VFI-DPA) model

The architecture of VFI-DPA is illustrated in Fig. 2. Given two frames, VFI-DPA predicts an interpolation difficulty level without seeing the interpolation result. VFI-DPA consists of three modules: feature extraction, feature fusion and patch-wise prediction. Its overall architecture draws inspiration from AHIQ [6], which predicts image quality by taking the restored image and the pristine-quality image as input. We make systematic modifications to AHIQ, making it much more lightweight and more suitable for interpolation difficulty pre-assessment.

Feature extraction. AHIQ employs a two-branch feature extractor that consists of a ViT (base version) [10] branch and a CNN [11] branch for global representations and local textures, respectively. To reduce the computational cost while maintaining global and local representations, we use the tiny version of the hierarchical Swin Transformer (Swin-T) [12] as the Siamese (sharing weights) feature extractor for both input frames. As shown in Fig. 2, input images are fed into Swin-T, producing shallow texture and deep semantic features from stage 2 and stage 4 for further fusion.

Feature fusion. The feature fusion module (Fig. 2) employs a DeformNet to enhance the representations of input images. Our DeformNet combines the idea of deformable convolution in AHIQ with several novel designs. It first aligns the spatial resolution of the deep feature with that of the shallow feature using Pixelshuffle [13]. Compared to the traditional interpolation used in AHIQ, Pixelshuffle can reconstruct high-resolution features with preserved details. Then, deformable convolution is applied to the shallow features with offsets learned from the deep semantic features, driving the network to focus on salient regions in the images. After that, we calculate the temporal difference feature of the input frames by subtraction, and construct rich spatial representations by concatenating deep and shallow features. The temporal difference feature aims to capture the motion between consecutive frames, which we believe is a crucial indicator of interpolation difficulty. Finally, the temporal and spatial features are concatenated as the final representation.

Patch-wise prediction. We follow the patch-wise prediction in AHIQ. In Fig. 2, the patch-wise prediction branch evaluates a difficulty score for each location in the feature map, while the spatial attention branch evaluates the importance of each location. Finally, we calculate the final score as the weighted sum of the patch-wise scores.
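The weighted sum in the patch-wise prediction step can be illustrated as follows. This is a minimal sketch, assuming the attention weights are non-negative and are normalized by their sum (the text above only says "weighted sum", so the normalization is an assumption). The function name fuse_patch_scores is hypothetical, and the maps are plain nested lists rather than tensors.

```python
from typing import List, Tuple

ScoreMap = List[List[float]]


def fuse_patch_scores(scores: ScoreMap, weights: ScoreMap) -> Tuple[float, float]:
    """Return (attention-weighted score, plain mean of the score map)."""
    flat_scores = [s for row in scores for s in row]
    flat_weights = [w for row in weights for w in row]
    if not flat_scores or len(flat_scores) != len(flat_weights):
        raise ValueError("score map and weight map must be non-empty and the same size")
    total_weight = sum(flat_weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Assumption: normalize by the weight sum so the result stays on the score scale.
    weighted = sum(s * w for s, w in zip(flat_scores, flat_weights)) / total_weight
    # Unweighted mean of the patch-wise score map.
    mean_score = sum(flat_scores) / len(flat_scores)
    return weighted, mean_score
```

Locations with zero weight do not contribute to the weighted score, while every location contributes equally to the plain mean.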

Training Loss for VFI-DPA

As shown in Fig. 2, our loss consists of a difficulty level loss and an auxiliary loss based on perceptual similarity.

Difficulty level loss. The difficulty level loss is defined as

L_difficulty = ‖S_GT - S_Pred‖ (1),

where S_GT and S_Pred are the ground truth and predicted scores.

Auxiliary loss. We hypothesize that the perceptual distance between consecutive frames, which implicitly reflects the motion magnitude and dynamic texture, correlates positively with the interpolation difficulty. We introduce an auxiliary loss to guide the interpolation difficulty prediction,

L_aux = ‖S_mean - S_lpips‖ (2),

where S_mean is the mean of the patch-wise score map and S_lpips is the LPIPS [14] (a commonly used metric of perceptual quality) between the input frames.
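For a single sample, Eq. (1) and Eq. (2) can be sketched as below. This is an illustration under assumptions rather than our training code: all scores are scalars, so the norm reduces to an absolute difference; the LPIPS value is passed in as a precomputed number; and the way the two terms are combined (the hypothetical aux_weight parameter) is not specified in the text above.

```python
def difficulty_loss(s_gt: float, s_pred: float) -> float:
    """Eq. (1): distance between ground-truth and predicted difficulty scores."""
    return abs(s_gt - s_pred)


def auxiliary_loss(s_mean: float, s_lpips: float) -> float:
    """Eq. (2): distance between the mean patch-wise score and the LPIPS value."""
    return abs(s_mean - s_lpips)


def total_loss(
    s_gt: float,
    s_pred: float,
    s_mean: float,
    s_lpips: float,
    aux_weight: float = 1.0,  # hypothetical weighting, not given in the text
) -> float:
    """Assumed combination: difficulty loss plus a weighted auxiliary loss."""
    return difficulty_loss(s_gt, s_pred) + aux_weight * auxiliary_loss(s_mean, s_lpips)
```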

VFI-DPA Dataset

Our VFI difficulty pre-assessment dataset contains 13,030 frame triplets. Each triplet contains 3 neighboring frames (I_(t-1), I_t, I_(t+1)) and a human-labeled difficulty score.
Data preparation. We first collect around 8,000 video clips from Vimeo90K [15], Adobe240 [16] and the Internet. These video clips contain various scenes to ensure diversity, including film, game, sports, animal, etc. Then, we extract 13,030 frame triplets from these videos, with random time intervals between triplets to simulate different motion magnitudes. We further divide our dataset into three subsets with different resolutions (1080p, 720p, and 256p). Manual annotations of interpolation difficulty should be resolution-specific, as higher resolution often leads to higher interpolation difficulty even when the contents of the frames are exactly the same.
Data annotation. We use RIFE to predict the middle frame I_t^pred for each triplet, and annotate the interpolation difficulty level for each triplet by comparing I_t^pred with the true middle frame I_t. We define 4 levels of interpolation difficulty, where level 4 is the most difficult.

Experiments

Evaluation datasets. We use SNU-FILM [17] to evaluate our dynamic VFI. It contains Easy, Medium, Hard and Extreme subsets with increasing motion magnitude and interpolation difficulty. We also compare with the previous spatial-wise dynamic VFI network [5] on Xiph-2K, following the settings in [5]. We evaluate VFI-DPA on our annotated dataset.
Evaluation metrics. Video frame interpolation is evaluated by peak signal-to-noise ratio (PSNR) and structural similarity (SSIM [18]). We evaluate the accuracy of VFI-DPA under different error tolerance thresholds. Specifically, we calculate the absolute error between the prediction and the ground truth, and count the prediction as a True Positive when the error is less than a given threshold. We report the runtime by running models on a single A100 GPU.
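The tolerance-based accuracy described above can be written down directly. The sketch below is illustrative only: the function name accuracy_within is hypothetical, and any threshold values passed to it are up to the caller rather than the ones used in our experiments.

```python
from typing import Sequence


def accuracy_within(
    predictions: Sequence[float],
    targets: Sequence[float],
    threshold: float,
) -> float:
    """Fraction of predictions whose absolute error is strictly below the threshold."""
    if not predictions or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and the same length")
    hits = sum(1 for p, t in zip(predictions, targets) if abs(p - t) < threshold)
    return hits / len(predictions)
```

A looser threshold can only keep or raise the accuracy, which is why results are reported under several thresholds.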

Results

Dynamic VFI performance. As shown in Table 1, our dynamic pipeline strikes an excellent trade-off for VFI: it reduces the inference time of UPR-Net and VFIformer by using RIFE to process easy samples, while maintaining relatively high accuracy on the Extreme subset. VFI-DPA models trained with different resolutions (256p, 720p, and 1080p) result in different accuracy-efficiency balances. We consider the 256p version as the default because it tends to recognize more samples as hard samples. In addition, compared to RIFE, dynamic VFI clearly improves the accuracy on the Hard and Extreme subsets, while the samples in the Easy and Medium subsets are mainly processed by RIFE.

Table 1. Dynamic VFI performance on SNU-FILM

Fig. 3 illustrates four samples with different predicted difficulty levels. It clearly shows that difficult (level 3 and 4) cases have larger motions, and that our dynamic VFI can reliably select the accurate VFI model to handle these difficult cases.

Figure 3. Samples with different predicted interpolation difficulty levels. The first two columns show the overlay of I_(t-1) and I_(t+1) and the zoomed-in red box. The third to fifth columns show our interpolated I_t (with red borders) and the Ground Truth.

VFI-DPA performance. As reported in Table 2, our lightweight VFI-DPA model (256p version) achieves accuracy similar to that of the much heavier AHIQ [6]. In addition, our VFI-DPA model is extremely fast (0.012s), and can afford real-time processing when combined with RIFE and UPR-Net for dynamic video interpolation.

Table 2. VFI Pre-Assessment performance under different error tolerance thresholds

Ablation Study of VFI-DPA

Lightweight backbone. As shown in Table 2, our lightweight hierarchical Siamese backbone reduces the number of parameters by about 60%, at the cost of about 0.07 accuracy degradation. Yet, when combined with our other designs, it achieves accuracy similar to that of the original mixed CNN and ViT backbone in AHIQ.
Influence of Pixelshuffle layer. In our DeformNet, we use the Pixelshuffle layer for up-sampling. Compared to nearest-neighbor interpolation, Pixelshuffle improves accuracy by 0.02 (Table 2), as it can recover more spatial details when resizing features from low to high resolution. We believe that the Pixelshuffle layer has potential in interpolation difficulty pre-assessment, where high resolution plays a key role.
Effect of temporal difference. Image difference implicitly captures inter-frame motion, which we believe is beneficial for difficulty pre-assessment. As verified in Table 2, it brings a 0.03 improvement over the baseline.
Effect of auxiliary loss. The auxiliary loss improves accuracy by 0.01 (Table 2), since it provides similarity information to guide interpolation difficulty learning.
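To make the Pixelshuffle ablation more concrete, the sketch below contrasts the two up-sampling operations on plain nested lists indexed as [channel][row][column]. It is a toy illustration, not our DeformNet code: the function names are hypothetical, and the channel-to-position ordering is the common sub-pixel convolution layout [13], which we assume here.

```python
from typing import List

Tensor3D = List[List[List[float]]]  # indexed as [channel][row][column]


def pixel_shuffle(x: Tensor3D, r: int) -> Tensor3D:
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r) without discarding any value."""
    if len(x) % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = len(x) // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                # Each input channel fills one sub-pixel position (i, j).
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


def nearest_upsample(x: Tensor3D, r: int) -> Tensor3D:
    """Repeat every value r times along both spatial axes."""
    return [
        [[row[col // r] for col in range(len(row) * r)] for row in ch for _ in range(r)]
        for ch in x
    ]
```

Nearest-neighbor up-sampling copies each value into an r-by-r block, whereas pixel shuffle places r*r distinct channel values into that block. When a learned layer produces those channels, the network can fill in distinct sub-pixel detail instead of repeating values.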

Conclusion and Discussion

This paper presented a dynamic VFI pipeline based on estimating the interpolation difficulty of input frames. Experimental results showed that our dynamic VFI achieves an excellent trade-off between accuracy and efficiency, and outperforms the previous spatial-wise dynamic VFI network. Here, we briefly discuss two advantages of our VFI-DPA that have not been fully explored in this work. First, our VFI-DPA is a lightweight plug-and-play module, so it can benefit from future advances in VFI models to achieve better performance. Second, the threshold that separates easy and hard samples can be adjusted, allowing us to flexibly control the trade-off between accuracy and efficiency.

References

[1] Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P. Allebach, and Chenliang Xu, “Zooming slow-mo: Fast and accurate one-stage space-time video super-resolution,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020, pp. 3370–3379.

[2] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely, “Deepstereo: Learning to predict new views from the world’s imagery,” in CVPR, 2016.

[3] Demin Wang, Andre Vincent, Philip Blanchfield, and Robert Klepko, “Motion-compensated frame rate upconversion - Part II: New algorithms for frame interpolation,” IEEE Transactions on Broadcasting, 2010.

[4] Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang, “Dynamic neural networks: A survey,” TPAMI, 2021.

[5] Myungsub Choi, Suyoung Lee, Heewon Kim, and Kyoung Mu Lee, “Motion-aware dynamic architecture for efficient frame interpolation,” in ICCV, 2021.

[6] Shanshan Lao, Yuan Gong, Shuwei Shi, Sidi Yang, Tianhe Wu, Jiahao Wang, Weihao Xia, and Yujiu Yang, “Attentions help cnns see better: Attention-based hybrid image quality assessment network,” in CVPR, 2022.

[7] Zhewei Huang, Tianyuan Zhang, Wen Heng, Boxin Shi, and Shuchang Zhou, “Real-time intermediate flow estimation for video frame interpolation,” in ECCV, 2022.

[8] Liying Lu, Ruizheng Wu, Huaijia Lin, Jiangbo Lu, and Jiaya Jia, “Video frame interpolation with transformer,” in CVPR, 2022.

[9] Xin Jin, Longhai Wu, Jie Chen, Youxin Chen, Jayoon Koo, and Cheul-hee Hahm, “A unified pyramid recurrent network for video frame interpolation,” in CVPR, 2023.

[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” ICLR, 2021.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.

[12] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021.

[13] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in CVPR, 2016.

[14] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018.

[15] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman, “Video enhancement with task-oriented flow,” IJCV, 2019.

[16] Wang Shen, Wenbo Bao, Guangtao Zhai, Li Chen, Xiongkuo Min, and Zhiyong Gao, “Blurry video frame interpolation,” in CVPR, 2020.

[17] Myungsub Choi, Heewon Kim, Bohyung Han, Ning Xu, and Kyoung Mu Lee, “Channel attention is all you need for video frame interpolation,” in AAAI, 2020.

[18] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli, “Image quality assessment: from error visibility to structural similarity,” TIP, 2004.

Link to the paper

https://ieeexplore.ieee.org/document/10446457
