The AI Vision Lab at Samsung R&D Institute China-Nanjing has a broad mission: developing core AI vision technologies to enhance the user experience on Samsung devices. One of our research interests is picture quality assessment and enhancement using AI algorithms. Recently, our researchers developed SiamVQA – a simple but effective Siamese network for high-resolution video quality assessment. By sharing weights between its aesthetic and technical branches, SiamVQA achieves state-of-the-art accuracy on high-resolution benchmarks and competitive results on lower-resolution benchmarks. This work has been accepted by ICASSP 2025.
Nowadays, smart mobile devices are ubiquitous. These devices are typically equipped with high-definition video recording functionality and produce high-resolution User Generated Content (UGC) videos every day. It is therefore of significant value to measure the quality of these videos, both on the mobile devices where they are captured and on the social media platforms where they are uploaded, processed, and recommended.
In the last two decades, great efforts have been devoted to Video Quality Assessment (VQA), both by gathering large-scale human quality opinions [1]–[6] and by designing automatic VQA models [1], [2], [7]–[14]. In recent years, deep-learning-based VQA models have developed into a successful family [1], [2], [11], [13], [14], with compelling accuracy on public benchmarks. However, efficient and effective VQA remains challenging, due to the computational burden of high-resolution videos and the diverse content of UGC videos.
To address the computational burden, Grid Mini-patch Sampling (GMS) [13] and Spatial-temporal Grid Mini-cube Sampling (St-GMS) [14] have recently been proposed to extract a set of local mini-patches from the original videos. Splicing these mini-patches forms fragments (see Fig. 1), a novel compact sample for VQA that enables end-to-end representation learning at acceptable computational cost. Since the mini-patches are sampled at raw resolution, fragments can largely preserve technical quality cues concerning low-level distortions such as blur and artifacts. However, the aesthetic perspective, which involves semantic factors such as content and composition, also affects human quality assessment of videos [3], [15]. While fragments may contain rough scene-level semantics [14], in Fig. 1 we reveal a semantic degradation issue on high-resolution videos.
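To make the sampling concrete, below is a minimal NumPy sketch of spatial grid mini-patch sampling for a single frame. The grid and patch sizes are illustrative choices rather than the exact settings of [13], [14], and the temporal mini-cube sampling of St-GMS is omitted.

```python
import numpy as np

def sample_fragment(frame: np.ndarray, grid: int = 7, patch: int = 32) -> np.ndarray:
    """Splice raw-resolution mini-patches, one per grid cell, into a compact fragment.

    Assumes the frame (H x W x C) is at least grid*patch pixels on each side.
    """
    h, w, c = frame.shape
    cell_h, cell_w = h // grid, w // grid
    fragment = np.zeros((grid * patch, grid * patch, c), dtype=frame.dtype)
    for i in range(grid):
        for j in range(grid):
            # Random top-left corner of a raw-resolution mini-patch inside cell (i, j).
            y0 = i * cell_h + np.random.randint(0, cell_h - patch + 1)
            x0 = j * cell_w + np.random.randint(0, cell_w - patch + 1)
            fragment[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = \
                frame[y0:y0 + patch, x0:x0 + patch]
    return fragment

# Example: a 1080p frame becomes a compact 224 x 224 fragment.
fragment = sample_fragment(np.zeros((1080, 1920, 3), dtype=np.uint8))
```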
Taking both the technical and aesthetic perspectives into account, DOVER [1] explicitly decouples VQA with a two-branch network, with each branch measuring one perspective. DOVER leverages fragments as technical inputs and down-sampled video frames as aesthetic inputs (rightmost in Fig. 1). Furthermore, DOVER employs inductive biases such as branch architecture, pre-training, and regularization to drive each branch to focus on its corresponding perspective, measuring the two perspectives as separately as possible.
In this work, we challenge existing fragment-based technical quality evaluation for high-resolution videos. As illustrated in Fig. 2, we argue that technical quality should be measured in context rather than locally. For example, a video of a snowy scene almost always has low lighting, but such low lighting does not indicate low technical quality once the semantics are considered. The technical branch in DOVER, which is trained separately on fragments, may have limited ability to perceive the semantics of high-resolution videos, resulting in inaccurate technical quality scores for local mini-patches (see Fig. 4 for examples).
Based on the above analysis, we present SiamVQA – a simple but effective Siamese network for high-resolution VQA. SiamVQA shares weights between the aesthetic and technical branches, aiming to enhance the semantic perception ability of the technical branch and thereby boost technical-quality representation learning. Furthermore, SiamVQA employs a dual cross-attention layer to fuse the high-level features of both branches. We empirically show the effectiveness of SiamVQA's design choices. SiamVQA achieves state-of-the-art accuracy on high-resolution public benchmarks, and competitive results on lower-resolution benchmarks.
Figure 1. Illustration of fragments for the technical perspective and down-sampled frames for the aesthetic perspective. We observe that fragments sampled from high-resolution videos (e.g., 1080p) suffer from serious semantic degradation, whereas fragments from lower-resolution videos preserve the semantics to a large extent. In this example, even humans can hardly tell the semantics of the original 1080p video purely from the sampled fragments.
Figure 2. Illustration of our argument that technical quality should be measured in context. From the first to the third row, we show videos of a snowy scene, a railway in a forest captured by a fast forward-moving camera, and a live show, respectively. Without considering the semantics, many local patches in the first two videos appear to be of low technical quality (low lighting, motion blur). With semantics, however, these local patches are largely natural, because heavy snow leads to low lighting, and through the lens of a fast-moving camera, nearby objects are always more blurred than distant ones.
As shown in Fig. 3 (b), SiamVQA consists of two identical (weight-shared) Swin-T [16] backbones for extracting technical and aesthetic features. On top of Swin-T, a dual cross-attention layer is applied for feature fusion. A shared per-pixel regression head is then employed to predict technical and aesthetic quality maps. Finally, the two quality maps are concatenated and average-pooled to produce the final prediction of video quality. Formally, let x_t and x_a denote the technical and aesthetic inputs, F(·) and R(·) denote the backbone network and the regression head, and {W_t^Q, W_t^K, W_t^V} and {W_a^Q, W_a^K, W_a^V} denote the query, key, and value embedding weights of the cross-attention layer for the technical and aesthetic branches, respectively. The technical and aesthetic quality maps Q_t and Q_a, and the final quality score s, can then be calculated as in Eq. 1.
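One plausible form of this computation, assuming standard scaled dot-product attention in which each branch's features serve as queries over the other branch's keys and values:

$$
\begin{aligned}
Q_t &= R\big(\mathrm{Attn}\!\left(F(x_t)W_t^Q,\ F(x_a)W_a^K,\ F(x_a)W_a^V\right)\big),\\
Q_a &= R\big(\mathrm{Attn}\!\left(F(x_a)W_a^Q,\ F(x_t)W_t^K,\ F(x_t)W_t^V\right)\big),\\
s   &= \mathrm{AvgPool}\big(\mathrm{Concat}(Q_t,\ Q_a)\big),
\end{aligned}
\quad\text{where}\quad
\mathrm{Attn}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V.
$$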
Figure 3. Architecture of our SiamVQA, with comparison to DOVER.
Weight-sharing in SiamVQA aims to improve technical-quality-related representation learning. This is based on our realization that technical quality should be measured in context: if the technical feature extractor is also the aesthetic feature extractor, it can perceive and leverage semantics to boost technical-quality representation learning.
Furthermore, we notice that semantic pre-training on Kinetics-400 [17] is crucial for single-technical-branch VQA [14], for DOVER, and for our SiamVQA. In analogy to the semantic pre-training and technical fine-tuning performed sequentially in [14], our weight-sharing can be regarded as semantic and technical training in parallel. In the ablation study, we provide a proof-of-concept experiment to verify this: after removing semantic pre-training, our weight-sharing improves SRCC on LSVQ1080p [2] by as much as 5.6%.
We regard the technical and aesthetic inputs as two modalities for VQA. This is analogous to RGB video and optical flow for action recognition, where feature fusion typically performs better than score fusion [18]–[22]. In SiamVQA, we employ dual cross-attention (see Eq. 1) for feature fusion, aiming to mine useful representations from the complementary branch.
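As a rough PyTorch sketch of these design choices – a weight-shared backbone, dual cross-attention fusion, and a shared regression head – using a toy convolutional stem in place of Swin-T and illustrative dimensions (the class name and all hyper-parameters below are our own placeholders, not the authors' implementation):

```python
import torch
import torch.nn as nn

class SiamVQASketch(nn.Module):
    """Toy sketch of a SiamVQA-style forward pass (not the official model)."""

    def __init__(self, feat_dim: int = 96, num_heads: int = 4):
        super().__init__()
        # Placeholder backbone standing in for Swin-T; applying the same module
        # to both inputs realizes the weight-sharing between branches.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=8, stride=8),
            nn.GELU(),
        )
        self.attn_t = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.attn_a = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.head = nn.Linear(feat_dim, 1)  # shared per-location regression head

    def _tokens(self, x):
        f = self.backbone(x)                 # (B, C, H', W')
        return f.flatten(2).transpose(1, 2)  # (B, H'*W', C)

    def forward(self, x_t, x_a):
        f_t, f_a = self._tokens(x_t), self._tokens(x_a)
        # Dual cross-attention: each branch queries the other branch's features.
        z_t, _ = self.attn_t(f_t, f_a, f_a)
        z_a, _ = self.attn_a(f_a, f_t, f_t)
        q_t = self.head(z_t)                 # technical quality map
        q_a = self.head(z_a)                 # aesthetic quality map
        # Concatenate the two maps and average-pool into one quality score.
        return torch.cat([q_t, q_a], dim=1).mean(dim=(1, 2))

# Usage: fragments (technical) and down-sampled frames (aesthetic) as two views.
model = SiamVQASketch()
score = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(score.shape)  # torch.Size([2])
```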
We report our main results on high-resolution datasets, and also cover lower-resolution datasets.
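Throughout the experiments, accuracy is reported as SRCC/PLCC, i.e., the Spearman and Pearson correlation between predicted and ground-truth quality scores. As a reference, here is a minimal SciPy sketch (the full evaluation protocol may additionally fit a nonlinear mapping before computing PLCC):

```python
from scipy.stats import pearsonr, spearmanr

def vqa_metrics(pred, gt):
    """Spearman (SRCC) and Pearson (PLCC) correlation of predicted vs. true scores."""
    srcc = spearmanr(pred, gt).correlation
    plcc = pearsonr(pred, gt)[0]
    return srcc, plcc

# Toy example with made-up scores, for illustration only.
print(vqa_metrics([63.0, 48.5, 70.2, 55.1], [63.8, 49.0, 72.0, 54.0]))
```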
1) On High-resolution Datasets: As shown in Tab. 1, our SiamVQA achieves excellent results on the high-resolution LSVQ1080p, LIVE-Qualcomm, and YouTube-UGC datasets. It outperforms DOVER [1] on all datasets, and surpasses all other methods by a large margin. In particular, when fine-tuned on LIVE-Qualcomm, SiamVQA outperforms DOVER by 1.9% and 3.1% in SRCC and PLCC, respectively. These results demonstrate the effectiveness of SiamVQA for high-resolution VQA.
Table 1. Quantitative (SRCC/PLCC) comparisons on high-resolution datasets under intra-dataset and transfer learning settings.
2) On Lower-resolution Datasets: As shown in Tab. 2, SiamVQA also achieves competitive results on the lower-resolution LSVQtest, KoNViD-1k, and CVD2014 datasets. It still outperforms DOVER on average, and shows clear advantages over other methods. These results suggest that our design choices towards high-resolution VQA do not sacrifice performance on lower-resolution VQA.
Table 2. Quantitative (SRCC/PLCC) comparisons on lower-resolution datasets under intra-dataset and transfer learning settings.
Figure 4 shows two examples of per-branch and merged quality maps, where SiamVQA leverages semantics to predict more accurate technical quality scores in context.
Figure 4. Visualization of branch-level and merged quality maps produced by DOVER [1] and our SiamVQA. For these two examples, SiamVQA gives higher technical scores to the low-lighting and blurred regions, as these low-level distortions appear in a snowy scene and in a video captured by a fast-moving camera. The true quality scores of the two examples are 63.8 and 62.9; the predictions by DOVER are 49.11 and 38.11; the predictions by SiamVQA are 63.04 and 62.19.
Table 3. Ablation study of our design choices.
We conduct ablation experiments on LSVQ1080p, and summarize the results in Table 3.
1) A Simple Two-Branch Baseline: We construct a simple two-branch baseline. It employs Swin-T for both branches without weight-sharing, and directly fuses the technical and aesthetic scores for VQA. Surprisingly, on 1080p videos, this simple baseline performs competitively with (and even slightly better than) DOVER, which is built on different branch architectures.
2) Effectiveness of Weight-sharing: We first show that our weight-sharing strategy improves over the unshared counterpart, with SRCC increasing from 0.797 to 0.803.
We further investigate which branch contributes to the accuracy gain. Once the two-branch baseline is trained, its technical or aesthetic branch can be used individually for prediction. We observe that weight-sharing significantly improves SRCC from 0.601 to 0.648 when only the technical branch is used for inference, suggesting that the overall accuracy gain mainly stems from the improved technical branch.
Furthermore, we remove the semantic pre-training on Kinetics-400, which significantly degrades accuracy, as verified in [14]. Under this setting, the improvement from weight-sharing is even more significant, with SRCC increasing from 0.629 to 0.685.
3) Effectiveness of Feature Fusion: Our dual cross-attention strategy achieves better accuracy than other feature fusion methods. Cross-attention tends to drive the network to learn more robust representations by mining video-quality-related features from the complementary branch.
This work presented SiamVQA, a simple but effective Siamese network for high-resolution VQA. It leverages weight-sharing to enhance technical-quality-related representation learning in context, and achieves state-of-the-art accuracy for high-resolution VQA. We expect that our simple network design will encourage researchers to rethink the design principles of two-branch VQA networks.
Guotao Shen, Ziheng Yan, Xin Jin, Longhai Wu, Jie Chen, Ilhyun Cho, Cheul-hee Hahm. Exploring Simple Siamese Network for High-Resolution Video Quality Assessment. ICASSP. 2025.
Paper link: https://arxiv.org/pdf/2503.02330
[1] H. Wu, E. Zhang, L. Liao, C. Chen, J. Hou, A. Wang, W. Sun, Q. Yan, and W. Lin, “Exploring video quality assessment on user generated contents from aesthetic and technical perspectives,” in ICCV, 2023.
[2] Z. Ying, M. Mandal, D. Ghadiyaram, and A. Bovik, “Patch-vq: ’patching-up’ the video quality problem,” in CVPR, 2021.
[3] F. Götz-Hahn, V. Hosu, H. Lin, and D. Saupe, “Konvid-150k: A dataset for no-reference video quality assessment of videos in-the-wild,” IEEE Access, 2021.
[4] V. Hosu, F. Hahn, M. Jenadeleh, H. Lin, H. Men, T. Szirányi, S. Li, and D. Saupe, “The konstanz natural video database (konvid-1k),” in 2017 Ninth International Conference on Quality of Multimedia Experience (QoMEX). IEEE, 2017.
[5] Z. Sinno and A. C. Bovik, “Large-scale study of perceptual video quality,” IEEE TIP, 2018.
[6] Y. Wang, S. Inguva, and B. Adsumilli, “YouTube UGC dataset for video compression research,” in IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), 2019.
[7] S. Wen and J. Wang, “A strong baseline for image and video quality assessment,” arXiv preprint arXiv:2111.07104, 2021.
[8] Z. Tu, C.-J. Chen, L.-H. Chen, N. Birkbeck, B. Adsumilli, and A. C. Bovik, “A comparative evaluation of temporal pooling methods for blind video quality assessment,” in ICIP, 2020.
[9] Z. Tu, C.-J. Chen, Y. Wang, N. Birkbeck, B. Adsumilli, and A. C. Bovik, “Efficient user-generated video quality prediction,” in 2021 Picture Coding Symposium (PCS). IEEE, 2021.
[10] J. Korhonen, “Two-level approach for no-reference consumer video quality assessment,” TIP, 2019.
[11] D. Li, T. Jiang, and M. Jiang, “Quality assessment of in-the-wild videos,” in ACM MM, 2019.
[12] B. Li, W. Zhang, M. Tian, G. Zhai, and X. Wang, “Blindly assess quality of in-the-wild videos via quality-aware pre-training and motion perception,” IEEE Transactions on Circuits and Systems for Video Technology, 2022.
[13] H. Wu, C. Chen, J. Hou, L. Liao, A. Wang, W. Sun, Q. Yan, and W. Lin, “FAST-VQA: Efficient end-to-end video quality assessment with fragment sampling,” in ECCV, 2022.
[14] H. Wu, C. Chen, L. Liao, J. Hou, W. Sun, Q. Yan, J. Gu, and W. Lin, “Neighbourhood representative sampling for efficient end-to-end video quality assessment,” IEEE TPAMI, 2023.
[15] D. Li, T. Jiang, W. Lin, and M. Jiang, “Which has better visual quality: The clear blue sky or a blurry animal?” TMM, 2018.
[16] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021.
[17] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
[18] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in CVPR, 2016.
[19] R. Christoph and F. A. Pinz, “Spatiotemporal residual networks for video action recognition,” NeurIPS, 2016.
[20] C. Feichtenhofer, A. Pinz, and R. P. Wildes, “Spatiotemporal multiplier networks for video action recognition,” in CVPR, 2017.
[21] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox, “Chained multistream networks exploiting pose, motion, and appearance for action classification and detection,” in ICCV, 2017.
[22] K. Gadzicki, R. Khamsehashari, and C. Zetzsche, “Early vs late fusion in multimodal convolutional neural networks,” in IEEE international conference on information fusion (FUSION), 2020.
[23] D. Ghadiyaram, J. Pan, A. C. Bovik, A. K. Moorthy, P. Panda, and K.-C. Yang, “In-capture mobile video distortions: A study of subjective behavior and objective algorithms,” TCSVT, 2017.
[24] M. Nuutinen, T. Virtanen, M. Vaahteranoksa, T. Vuori, P. Oittinen, and J. Häkkinen, “CVD2014-A database for evaluating no-reference video quality assessment algorithms,” TIP, 2016.
[25] N. Murray, L. Marchesotti, and F. Perronnin, “AVA: A large-scale database for aesthetic visual analysis,” in CVPR, 2012.