
Dynamic Multi-scale Network for Dual-pixel Images Defocus Deblurring with Transformer

By Dafeng Zhang, Samsung R&D Institute China - Beijing
By Xiaobing Wang, Samsung R&D Institute China - Beijing

Background

Recent works have achieved excellent results on the defocus deblurring task using dual-pixel data and convolutional neural networks (CNNs), while the scarcity of data limits the exploration of vision transformers in this task. In addition, existing works use fixed parameters and network architectures to deblur images with different blur distributions and content, which also affects the generalization ability of the model.

In this paper, we propose a dynamic multi-scale network, named DMTNet, for dual-pixel image defocus deblurring. DMTNet mainly contains two modules: a feature extraction module and a reconstruction module. The feature extraction module is composed of several vision transformer blocks, whose powerful feature extraction capability yields richer features and improves the robustness of the model. The reconstruction module is composed of several Dynamic Multi-scale Sub-reconstruction Modules (DMSSRM). DMSSRM restores images by adaptively assigning weights to features from different scales according to the blur distribution and content information of the input images. DMTNet combines the advantages of the transformer and the CNN: the vision transformer raises the performance ceiling of the CNN, and the inductive bias of the CNN enables the transformer to extract robust features without relying on a large amount of data. DMTNet might be the first attempt to use a vision transformer to restore blurred images to clarity; by combining it with a CNN, the vision transformer may achieve better performance on small datasets.

Network Architecture (DMTNet)

Figure 1.  The overall architecture of the proposed DMTNet

We propose a dynamic multi-scale network, named DMTNet, for dual-pixel image defocus deblurring, as shown in Figure 1. DMTNet takes the dual-pixel defocus-blurred images (the left view $I_L$ and the right view $I_R$) as input and outputs one sharp image $I_S$. We concatenate the left and right views to obtain the input of DMTNet, and then send it into the patch partition module to get the token embeddings $F_t$,

$F_t = \mathrm{PP}(\mathrm{CONCAT}(I_L, I_R)), \quad I_L, I_R \in \mathbb{R}^{H \times W \times 3}, \; F_t \in \mathbb{R}^{\frac{H}{P} \times \frac{W}{P} \times C},$

where $(H, W)$ is the resolution of the input images, 3 and $C$ are the numbers of channels of the input views and the token embeddings respectively, $\mathrm{CONCAT}$ is the operation that concatenates the left and right views along the channel dimension, and $\mathrm{PP}(\cdot)$ denotes the patch partition module, which is used to obtain token embeddings as the input of the transformer. In the patch partition module, we use a 2D convolution whose kernel and stride are both fixed to the patch size $P$ to obtain the input features. We then reshape these features into a sequence of flattened 2D patches, apply layer normalization, and reverse the reshaping to get the token embeddings $F_t$.
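The patch partition step can be illustrated with a minimal PyTorch sketch. The embedding dimension (96) and patch size (4) below are illustrative assumptions, not necessarily the configuration used in DMTNet.

```python
import torch
import torch.nn as nn

class PatchPartition(nn.Module):
    """Patch partition sketch: a conv whose kernel and stride both equal the
    patch size P, followed by flatten + LayerNorm, then reshaped back into a
    2D token map (the token embeddings F_t)."""
    def __init__(self, in_chans=6, embed_dim=96, patch_size=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # x: (B, 6, H, W) -- left and right views concatenated on channels
        x = self.proj(x)                              # (B, C, H/P, W/P)
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)              # sequence of patches (B, N, C)
        x = self.norm(x)                              # layer normalization
        return x.transpose(1, 2).reshape(b, c, h, w)  # reverse to F_t

# usage: concatenate the two views on the channel dimension, then partition
left, right = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
tokens = PatchPartition()(torch.cat([left, right], dim=1))  # (1, 96, 64, 64)
```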

We take $F_t$ as the input of the feature extraction module, which is composed of several vision transformer blocks, to extract more robust features $F_r$,

$F_r = \mathrm{FEM}(F_t),$

where $\mathrm{FEM}(\cdot)$ represents the feature extraction module. Inspired by Swin Transformer, self-attention based on the window scheme is used in each transformer block to improve computational efficiency.
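The window scheme can be sketched as follows: tokens are grouped into non-overlapping windows and attention is computed only within each window, which keeps the cost manageable for large images. This is a simplified illustration (no shifted windows, relative position bias, or MLP sub-layer), and the window size and head count are assumptions.

```python
import torch
import torch.nn as nn

class WindowAttentionBlock(nn.Module):
    """Simplified window-based self-attention block (Swin-style windowing)."""
    def __init__(self, dim=96, window=8, heads=4):
        super().__init__()
        self.window = window
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, H, W) token map; H and W assumed divisible by the window size
        b, c, h, w = x.shape
        ws = self.window
        # partition into (B * num_windows, ws*ws, C) token sequences
        x = x.view(b, c, h // ws, ws, w // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        y = self.norm(x)
        y, _ = self.attn(y, y, y)      # self-attention inside each window only
        x = x + y                      # residual connection
        # merge the windows back into the (B, C, H, W) layout
        x = x.view(b, h // ws, w // ws, ws, ws, c)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        return x
```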

Then, we send the robust features $F_r$ into the reconstruction module to restore the blurred image to clarity,

$F_s = \mathrm{RM}(F_r),$

where $F_s$ represents the sharp features from the reconstruction module, and $\mathrm{RM}(\cdot)$ denotes the reconstruction module, which contains several DMSSRMs. DMSSRM is a dynamic multi-scale selection network, which dynamically fuses multi-scale features according to the content information of the input images.
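The dynamic multi-scale weighting idea behind DMSSRM can be illustrated with the sketch below: features are processed at several scales, and per-scale fusion weights are predicted from the globally pooled input content, so the weighting adapts to each image. The branch design and number of scales here are illustrative assumptions, not the exact structure of DMSSRM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMultiScaleFusion(nn.Module):
    """Illustrative dynamic multi-scale fusion: content-dependent weights
    (softmax over scales) blend features computed at different resolutions."""
    def __init__(self, dim=96, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True))
            for _ in scales
        ])
        # predict one fusion weight per scale from globally pooled features
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, len(scales)))

    def forward(self, x):
        h, w = x.shape[-2:]
        weights = torch.softmax(self.gate(x), dim=1)         # (B, num_scales)
        out = 0
        for i, (s, branch) in enumerate(zip(self.scales, self.branches)):
            y = F.avg_pool2d(x, s) if s > 1 else x           # downscale by s
            y = branch(y)
            if s > 1:                                        # restore resolution
                y = F.interpolate(y, size=(h, w), mode='bilinear',
                                  align_corners=False)
            out = out + weights[:, i].view(-1, 1, 1, 1) * y  # weighted sum
        return x + out                                       # residual fusion
```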

Finally, the sharp features $F_s$ are up-sampled via an up-sampling module to output the sharp image,

$I_S = \mathrm{UP}(F_s),$

where $\mathrm{UP}(\cdot)$ and $I_S$ denote the up-sampling module and the final sharp image respectively. We use PixelShuffle in the up-sampling module to preserve information and reduce the number of parameters.
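A PixelShuffle-based up-sampling head can be sketched as below: a convolution expands the channels by a factor of r², and PixelShuffle rearranges them into an r-times larger 3-channel image. The up-sampling factor of 4 matches the assumed patch size above and is an illustrative choice.

```python
import torch.nn as nn

class UpSample(nn.Module):
    """PixelShuffle up-sampling sketch: (B, dim, h, w) -> (B, 3, r*h, r*w)."""
    def __init__(self, dim=96, scale=4, out_chans=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, out_chans * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),   # rearranges channels into spatial resolution
        )

    def forward(self, x):
        return self.body(x)
```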

In summary, $I_S$ can also be represented as follows,

$I_S = F_{\mathrm{DMTNet}}(I_L, I_R),$

where $F_{\mathrm{DMTNet}}(\cdot)$ denotes the function of DMTNet.
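Putting the sketches above together, the overall forward pass can be written as a short composition. The depths and dimensions below are placeholders, not the configurations of DMTNet-T/S/B/H.

```python
import torch
import torch.nn as nn

class DMTNetSketch(nn.Module):
    """Toy composition of the sketched modules: PP -> FEM -> RM -> UP."""
    def __init__(self, dim=96, num_blocks=4, num_dmssrm=2):
        super().__init__()
        self.patch_partition = PatchPartition(in_chans=6, embed_dim=dim, patch_size=4)
        self.feature_extraction = nn.Sequential(
            *[WindowAttentionBlock(dim) for _ in range(num_blocks)])
        self.reconstruction = nn.Sequential(
            *[DynamicMultiScaleFusion(dim) for _ in range(num_dmssrm)])
        self.upsample = UpSample(dim, scale=4)

    def forward(self, left, right):
        x = torch.cat([left, right], dim=1)   # CONCAT(I_L, I_R)
        f_t = self.patch_partition(x)         # F_t = PP(.)
        f_r = self.feature_extraction(f_t)    # F_r = FEM(F_t)
        f_s = self.reconstruction(f_r)        # F_s = RM(F_r)
        return self.upsample(f_s)             # I_S = UP(F_s)

# usage: two 256x256 dual-pixel views in, one 256x256 sharp estimate out
left, right = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
sharp = DMTNetSketch()(left, right)           # (1, 3, 256, 256)
```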

Experimental Results

We compare our DMTNet with 9 algorithms: EBDB, DMENet (CVPR2019), JNB, DPDNet (ECCV2020) [1], IFAN (CVPR2021) [2], RDPD (ICCV2021) [3], KPAC (ICCV2021) [4], Uformer-B (CVPR2022) [5] and Restormer (CVPR2022) [6] on the Canon DP testing dataset. DMTNet significantly improves defocus deblurring performance without using additional data, with PSNR improvements of 6.8%, 7.4%, and 7.1% over DPDNet on the three test scenarios, respectively. Although RDPD uses additional synthetic DP data, its performance improvement is negligible. In particular, even when the number of parameters of our method is comparable to that of the SOTA methods, DMTNet still achieves impressive performance. For example, DMTNet-T only uses the feature extraction module to restore the blurred image to clarity and achieves 25.50 dB PSNR, 0.65 dB higher than KPAC. Similarly, DMTNet-S, DMTNet-B and DMTNet-H are 0.28 dB, 1.24 dB and 1.26 dB higher than IFAN, RDPD and Uformer-B respectively. The visual results are presented in Figure 3, where the images restored by our DMTNet are clearer.

Table 1.  Performance comparison on the Canon DP dataset

Across different parameter capacities, our DMTNet achieves better performance than existing solutions. No matter which evaluation metric is used, our DMTNet is significantly better than the existing state-of-the-art solutions. In particular, when matching the performance of Uformer-B (CVPR2022), our model uses only 2.8% of its parameters and 7.6% of its FLOPs. Moreover, our DMTNet-T (25.50 dB) achieves performance comparable to RDPD (25.37 dB) and DPDNet (25.12 dB), while reducing the number of parameters by 95.53% and 96.44%, respectively.

Figure 2.  The results of defocus deblurring with different numbers of parameters and FLOPs

Figure 3.  Qualitative comparisons with different defocus deblurring methods on Canon DP testing dataset

Conclusion

We propose DMTNet, a Transformer-CNN combined defocus deblurring model. DMTNet combines the advantages of the powerful feature extraction capability of the vision transformer and the inductive bias of the CNN: the transformer raises the performance ceiling of the CNN, and the inductive bias of the CNN enables the transformer to extract robust features without relying on a large amount of data. Experimental results demonstrate that our DMTNet significantly outperforms existing solutions.

References

[1]. Abuolaim A, Brown M S. Defocus deblurring using dual-pixel data[C]//European Conference on Computer Vision. Springer, Cham, 2020: 111-126.

[2]. Lee J, Son H, Rim J, et al. Iterative filter adaptive network for single image defocus deblurring[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 2034-2042.

[3]. Abuolaim A, Delbracio M, Kelly D, et al. Learning to reduce defocus blur by realistically modeling dual-pixel data[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 2289-2298.

[4]. Son H, Lee J, Cho S, et al. Single image defocus deblurring using kernel-sharing parallel atrous convolutions[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 2642-2650.

[5]. Wang Z, Cun X, Bao J, et al. Uformer: A general u-shaped transformer for image restoration[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 17683-17693.

[6]. Zamir S W, Arora A, Khan S, et al. Restormer: Efficient transformer for high-resolution image restoration[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 5728-5739.