
MR-VNet: Media Restoration Using Volterra Networks

By Siddharth Roheda Samsung R&D Institute India-Bangalore
By Amit Unde Samsung R&D Institute India-Bangalore
By Loay Rashid Samsung R&D Institute India-Bangalore

The Conference on Computer Vision and Pattern Recognition (CVPR) is an annual conference on computer vision and pattern recognition, which is regarded as one of the most important conferences in its field.

CVPR covers a wide range of topics related to computer vision and pattern recognition, such as object recognition, image segmentation, motion estimation, 3D reconstruction, and deep learning.

In this blog series, we introduce some of our research papers presented at CVPR 2024. Here is the list:

#1. Multiscale Vision Transformers Meet Bipartite Matching for Efficient Single-Stage Action Localization (Samsung AI Center - Cambridge)

#2. FFF: Fixing Flawed Foundations in Contrastive Pre-training Results in Very Strong Vision-Language Models (Samsung AI Center - Cambridge)

#3. MR-VNet: Media Restoration Using Volterra Networks (Samsung R&D Institute India-Bangalore)

Introduction



In our increasingly visual world, media such as images and videos play a crucial role in conveying information, expressing emotions, and preserving memories. However, images and videos are susceptible to distortions during capture (e.g., sensor noise, blur, zoom, bad exposure), saving/sharing (e.g., compression, down-sampling), and editing (e.g., unnatural artifacts). It is crucial to restore such degraded images so as to prevent loss of information and ensure that the best visual quality is delivered to users.

In this blog post, we present a novel class of media restoration network architectures based on the Volterra Series formulation. In this new architecture, non-linearity is introduced into the system response function via higher order convolutions instead of traditional activation functions.

Volterra Filter

The Volterra Filter serves as an approximation for capturing the non-linear relationship between an input x_t at time t and the corresponding output y_t:

$$ y_t = \sum_{\tau_1=0}^{L-1} W_1(\tau_1)\,x_{t-\tau_1} \;+\; \sum_{\tau_1=0}^{L-1}\sum_{\tau_2=0}^{L-1} W_2(\tau_1,\tau_2)\,x_{t-\tau_1}x_{t-\tau_2} \;+\; \dots \;+\; \sum_{\tau_1,\dots,\tau_K=0}^{L-1} W_K(\tau_1,\dots,\tau_K)\prod_{k=1}^{K} x_{t-\tau_k} $$

Where, L represents the filter memory/length, and W_k is the weight matrix corresponding to the kth order term. Note that the first term in the above equation represents a linear 1D convolution (commonly used in CNNs) between the weight matrix W_1 and the input sequence x, whereas the following terms represent higher order (non-linear) convolutions, which are responsible for making the relationship between x_t and y_t non-linear.
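As a toy illustration (our own sketch, not the paper's code), a second-order 1D Volterra filter can be written directly in NumPy; with the second-order kernel set to zero it reduces to an ordinary convolution:

```python
import numpy as np

def volterra_2nd_order(x, W1, W2):
    """Toy second-order Volterra filter.

    x  : 1D input sequence
    W1 : (L,)  first-order (linear) kernel  -- the ordinary convolution term
    W2 : (L,L) second-order kernel          -- introduces the non-linearity
    """
    L = len(W1)
    xp = np.concatenate([np.zeros(L - 1), x])   # zero-pad past samples
    y = np.zeros(len(x))
    for t in range(len(x)):
        w = xp[t:t + L][::-1]                   # [x_t, x_{t-1}, ..., x_{t-L+1}]
        y[t] = W1 @ w + w @ W2 @ w              # linear term + quadratic term
    return y

x = np.array([1.0, 2.0, 3.0])
W1 = np.array([1.0, 0.5])
# With W2 = 0 the filter reduces to a plain 1D convolution:
print(volterra_2nd_order(x, W1, np.zeros((2, 2))))            # [1.  2.5 4. ]
# A non-zero W2 adds products of delayed inputs, here x_t * x_{t-1}:
print(volterra_2nd_order(x, W1, np.array([[0.0, 1.0],
                                          [0.0, 0.0]])))      # [ 1.   4.5 10. ]
```

The quadratic form w @ W2 @ w is exactly the second term of the series above for K = 2.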

The distinctive feature of the equation above is that non-linearity is introduced via higher order convolutions rather than traditional activation functions.

The limitation of employing a Volterra Filter for deep learning tasks is that its computational complexity grows exponentially with the desired filter order. Specifically, a Kth order filter with length L necessitates solving for L^K coefficients.

Our Approach

We introduce a cascaded implementation of the Volterra Filter along with lossless and lossy approximations in order to alleviate the highly complex nature of the filter.

Specifically, we use the second order Volterra formulation to implement a Volterra Layer. The zth layer of the Volterra Neural Network (VNN) processes the input X_{z-1} as

$$ X_z = W_1^{z} \circledast_1 X_{z-1} \;+\; W_2^{z} \circledast_2 \big(X_{z-1}, X_{z-1}\big) $$

Where, \circledast_1 and \circledast_2 represent the first and second order convolution operations in the zth layer.

When Z second order filters/layers are cascaded, the resulting Volterra Network achieves an effective order of 2^Z.

This leads to a reduction in the parameters required to realize this model from \(\mathcal{O}\big((P^2L)^{2^Z}\big)\) to \(\mathcal{O}\big(Z\,(P^2L)^2\big)\). Where, P is the spatial kernel size, and L is the temporal filter size.
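A scalar sketch (illustrative only, with made-up coefficients) of why cascading squares the order: composing two second-order maps expands into a degree-4 polynomial, i.e. an effective order of 2^2.

```python
def second_order(x, a, b):
    """A scalar 'second-order Volterra layer': linear term + quadratic term."""
    return a * x + b * x * x

# Cascade Z = 2 layers. With a = b = 1: f(x) = x + x^2, and
#   f(f(x)) = (x + x^2) + (x + x^2)^2 = x + 2x^2 + 2x^3 + x^4,
# a degree-4 polynomial, matching the effective order 2^Z.
x = 2.0
cascade = second_order(second_order(x, 1.0, 1.0), 1.0, 1.0)
direct = x + 2 * x**2 + 2 * x**3 + x**4
print(cascade, direct)   # 42.0 42.0
```

Each cascaded layer only adds its own (low-order) parameters, while the polynomial degree of the overall network doubles per layer.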

Figure 1. High level Block Diagram for implementation of the proposed VNN Image Restoration Model.

Lossless Approximation of the Volterra Layer: To implement the second order kernel using traditional libraries such as PyTorch and TensorFlow (which do not support higher order convolutions), we re-formulate the 2nd order layer as

$$ X_z = W_1^{z} \circledast X_{z-1} \;+\; \sum_{s_1=0}^{P-1}\sum_{s_2=0}^{P-1} W_2^{z,(s_1,s_2)} \circledast \big( X_{z-1} \odot S_{(s_1,s_2)}(X_{z-1}) \big) $$

Where, s_1 and s_2 denote spatial shifts in the input feature map X_{z-1}, and S_{(s_1,s_2)}(X_{z-1}) represents the feature map circularly shifted along its rows and columns by s_1 and s_2 respectively. The second term in the proposed equation requires P^2 convolutions. To mitigate redundancy and model complexity, we discard the symmetric terms, resulting in (P^2+1)/2 convolutions. This leads to lower complexity without loss of information, as only redundant terms are discarded.
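The shifted-product trick can be sketched in a few lines of NumPy (shift ranges and naming here are our own convention, not the paper's): each product X ⊙ S_{(s1,s2)}(X) is an ordinary feature map, so only standard convolutions are ever needed downstream.

```python
import numpy as np

def shifted_products(X, P):
    """Quadratic feature maps X ⊙ S_{(s1,s2)}(X) for the lossless reformulation.

    X : (H, W) feature map
    P : spatial kernel size; s1, s2 range over 0..P-1 (a hypothetical convention)
    Each returned map would then be passed through a standard 2D convolution
    (e.g. a plain conv layer), so no higher-order convolution op is required.
    """
    maps = {}
    for s1 in range(P):
        for s2 in range(P):
            # Circular shift along rows by s1 and along columns by s2.
            shifted = np.roll(X, shift=(s1, s2), axis=(0, 1))
            maps[(s1, s2)] = X * shifted
    return maps

X = np.arange(9, dtype=float).reshape(3, 3)
maps = shifted_products(X, P=3)
print(len(maps))            # 9 = P^2 maps before discarding symmetric terms
print(maps[(0, 0)][0, 0])   # 0.0 -> the (0,0) map is simply X squared
```

Discarding the symmetric shift pairs would prune this dictionary to roughly half its size, as described above.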

Low Rank Lossy Approximation of the Volterra Layer: Implementing higher order convolutions can incur significant costs despite the lossless approximation technique discussed previously. To address this, we employ the concept of Low Rank Approximation for the second order term. We make a Qth rank approximation of the convolutional kernel as follows:

$$ W_2 \approx \sum_{q=1}^{Q} W_{2a}^{q}\,\big(W_{2b}^{q}\big)^{\top} $$

With this approximation, the Volterra Block is implemented as

$$ X_z = W_1 \circledast X_{z-1} \;+\; \sum_{q=1}^{Q} \big(W_{2a}^{q} \circledast X_{z-1}\big) \odot \big(W_{2b}^{q} \circledast X_{z-1}\big) $$

This leads to a further reduction in the required number of parameters to \(\mathcal{O}\big(Z\,(2Q+1)\,P^2L\big)\), which grows only linearly with the chosen rank Q.
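A minimal 1D sketch of the rank-Q block (our own toy example using delta kernels, not the paper's configuration): the quadratic term becomes a sum of elementwise products of 2Q ordinary convolutions.

```python
import numpy as np

def low_rank_volterra_block(x, w1, w2a_list, w2b_list):
    """Rank-Q Volterra block on a 1D signal (illustrative sketch).

    x                  : 1D input signal
    w1                 : first-order (linear) kernel
    w2a_list, w2b_list : Q kernel pairs factorizing the second-order kernel
    The quadratic term sum_q (w2a_q * x) ⊙ (w2b_q * x) needs only 2Q
    ordinary linear convolutions instead of a full second-order kernel.
    """
    y = np.convolve(x, w1, mode="same")
    for w2a, w2b in zip(w2a_list, w2b_list):
        y += np.convolve(x, w2a, mode="same") * np.convolve(x, w2b, mode="same")
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
w1 = np.array([1.0])   # identity linear term
# Rank Q = 1 with delta kernels: the quadratic term collapses to x ⊙ x = x^2.
y = low_rank_volterra_block(x, w1, [np.array([1.0])], [np.array([1.0])])
print(y)   # [ 2.  6. 12. 20.]  ->  x + x^2
```

In the actual 2D layers the same structure applies with standard spatial convolutions in place of np.convolve.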

Generalized Activation

A Volterra Filter is capable of approximating any continuous function, including standard activation functions such as ReLU, sigmoid, and tanh. Hence, it can be considered a generalized activation.

Figure 2. Approximation of ReLU and Sigmoid using the Volterra Formulation

However, the traditional activation functions are fixed, whereas the Volterra Filter formulation allows for learning of a data dependent activation/non-linearity.
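To see the generalized-activation claim concretely, one can least-squares-fit a memoryless second-order Volterra response y = w0 + w1·x + w2·x² to a target non-linearity on an interval (a toy fit of our own, not the paper's training procedure):

```python
import numpy as np

# Fit a quadratic (memoryless 2nd-order Volterra response) to ReLU on [-1, 1].
x = np.linspace(-1.0, 1.0, 201)
relu = np.maximum(x, 0.0)

A = np.stack([np.ones_like(x), x, x * x], axis=1)   # basis: 1, x, x^2
coef, *_ = np.linalg.lstsq(A, relu, rcond=None)     # learned [w0, w1, w2]
fit = A @ coef

print(np.round(coef, 3))
print(float(np.max(np.abs(fit - relu))))            # worst-case error on [-1, 1]
```

Unlike a fixed ReLU, the coefficients here are free parameters, so the non-linearity itself can be learned from data, which is exactly the advantage noted above.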

Experiments and Results

To assess the efficacy of the proposed Volterra Restoration Network, we conduct experiments targeting prevalent degradations in images and videos:

        • Motion Blur: This degradation, arising from camera or subject motion, is addressed by training and testing the restoration network on the GoPro [14] and REDS [15] datasets for image deblurring.
        • Camera Sensor Noise: We aim to mitigate noise introduced by the camera sensor during image/video capture. Evaluation is performed using the SIDD [2] and CRVD [25] datasets.

We present two variants of our proposed Volterra Layer-based architecture: MR-VNet-LYA, employing the lossy approximation, and MR-VNet-LLA, incorporating the proposed lossless approximation.

Our experimental evaluations, conducted on diverse datasets including GoPro, REDS, SIDD (image data), and CRVD (video data), demonstrate the effectiveness of MR-VNet in comparison to state-of-the-art algorithms. Figures 3 and 4 showcase the superior de-blurring and de-noising quality obtained by the proposed approach. A detailed quantitative assessment can be found in Tables 1 and 2, utilizing the GoPro and SIDD datasets respectively.

Figure 3. De-Blurring Results on GoPro

Figure 4. De-Noising Results on SIDD. Left to Right: Noisy Image, NAFNet, MR-VNet (Ours)

Table 2. De-Noising Performance on SIDD Dataset. Best results are bold, second best are underlined.

Figure 5. Comparison of PSNR and Computational Complexity of the various models on the GoPro Dataset

Figure 5 compares the computational complexity of the proposed MR-VNet model with other state-of-the-art methods. We note that by employing a more accurate approximation of the second order kernel, we can outperform NAFNet at much lower computational complexity.

Conclusion

In conclusion, our research introduces the Media Restoration-Volterra Network (MR-VNet) as a novel approach to image and video restoration. Leveraging higher order Volterra filters, the proposed architecture demonstrates promising capabilities in addressing common image and video degradations, such as motion blur and camera sensor noise. Through the development of two architectural variants, MR-VNet-LYA and MR-VNet-LLA, employing lossy and lossless approximations respectively, we offer a comprehensive exploration of the network's performance.