
A Mixed Quantization Network for Efficient Mobile Inverse Tone Mapping

By Frederik Laboyrie, Samsung R&D Institute United Kingdom
By Mete Ozay, Samsung R&D Institute United Kingdom
By Paul Wisbey, Samsung R&D Institute United Kingdom
By Cristian Szabo, Samsung R&D Institute United Kingdom

At the British Machine Vision Conference (BMVC) 2021, we presented work on mobile inverse tone mapping. In this work, we tackled converting high-resolution images to high dynamic range (HDR) images in real time with a mobile-focused model that uses low-bit quantization of its parameters to accelerate inference. Models for this task are unique in that they must make heavy use of both global and local information, which typically carries high compute requirements. In addition, mobile devices typically handle images at very high resolutions, which poses a further challenge, as inference cost is sensitive to even modest increases in resolution.

What happens when the model converts an image to HDR?

Figure 1. Mobile inverse tone mapping from single low dynamic range (LDR) images using MQNet. The employment of mixed quantization methods and efficient blocks in MQNet helps reduce the computational complexity of HDR image reconstruction (bottom row) from single LDR images (top row), and enables deployment to mobile platforms, achieving a latency of approx. 21ms on a Samsung Note 20 Exynos 990.

There can be some confusion around conversion of standard bit-depth images to HDR, mainly because the term means different things depending on whether it refers to a camera pipeline or to displays. In reference to displays, the term simply means displaying colours in a 10-bit colour range (0-1023) or higher, as opposed to the standard 8-bit colour range (0-255) that is typically used. Displaying in a 10-bit colour range does not by itself produce a richer, more attractive image. If an 8-bit image is crudely spread into a 10-bit range there will be no improvement in quality: each colour in the 8-bit range simply maps to one colour in the 10-bit range, leaving the rest of the colour range unused. That is to say, HDR monitors are best used when the original content is filmed or rendered directly in HDR. Converting a standard 8-bit image to HDR is therefore about using global and local information in the image to essentially mimic the original content having been HDR. This is useful because a great deal of standard 8-bit material is currently stored for consumption, and if we can convert images to HDR well, then gaming and streaming of 8-bit content can make better use of HDR monitors.
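
To make this concrete, here is a tiny, purely illustrative Python sketch of why naive bit-depth expansion wastes most of the 10-bit range:

```python
# Naive 8-bit -> 10-bit expansion simply rescales each code value,
# so every one of the 256 possible 8-bit levels lands on exactly one
# 10-bit level and the remaining 768 levels are never used.
used_levels = {round(v * 1023 / 255) for v in range(256)}
print(len(used_levels))  # 256 distinct 10-bit levels out of 1024
```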

HDR, when in reference to a camera pipeline, typically refers to combining multiple exposures of a scene to obtain an image with vivid detail in shadowy or bright areas in a way that a single exposure could not achieve. The HDR pipeline essentially gathers extra information by overexposing dark areas and underexposing bright areas; when these exposures are combined, they can give a beautiful image rich in local contrast.

For our task, we try to mimic this camera HDR pipeline without the use of multiple exposures. That is, given a single image, our model is able to recover remarkable detail in over- and under-exposed areas. For images which are not poorly exposed, the model can still make use of local information to improve the overall look of the image, make the best use of the expanded colour range, and provide output which looks as if it has gone through a camera HDR pipeline.

The term tone mapping refers to mapping the initial 10-bit image output by a camera HDR pipeline back to 8-bit in a way which best preserves the information in the 10-bit image. A lot of new information has been gained via the multiple exposures, so mapping back to 8-bit still provides an image with the same information, albeit slightly less attractive. Our model performs the inverse mapping, from 8-bit colours to 10-bit, and is therefore referred to as inverse tone mapping. Note that it performs the inverse of this mapping but not the inverse of the process: our model sits earlier in the pipeline and uses an unprocessed 8-bit image as input rather than the final 8-bit image which contains the information from multiple exposures.

How mixed quantization allows our model to operate in real-time.

Table 1. Results obtained using different quantization schemes. Latency (L.) is measured on the deployment platform (SN20E990).

Models which do not operate under strict latency constraints typically keep both their inputs and parameters in 32-bit float. For almost all tasks it is possible to quantize the 32-bit floats into 8-bit integers, losing some accuracy but greatly accelerating inference, since computational operations are much cheaper when handling 8-bit values. However, because our model must output in 10-bit, we cannot use full 8-bit quantization of the model's parameters, which is why a mixed quantization scheme is used.

Our model, described in more detail below, is comprised of a U-Net backbone which generates high-dimensional feature maps and a head which resolves those into an image. In order to gain as much speed-up from quantization as possible while still outputting in high-precision 10-bit, we treat the backbone and head separately. The backbone is fully 8-bit quantized; the head, however, has only its weights quantized to 8-bit while its activations remain in 32-bit. This is more or less the minimum required to still output in 10-bit, and it was shown both empirically and qualitatively to be sufficient to successfully perform HDR conversion.
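
As a rough illustration of this split, the sketch below shows 8-bit weight quantization of the kind applied to the head, where activations stay in float. This is a minimal sketch of the general idea rather than the exact scheme or tooling used for MQNet; the symmetric per-tensor scale and the helper name are our own assumptions.

```python
import torch

def quantize_weights_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization of a weight tensor.

    Returns the int8 weights plus the float scale needed to dequantize
    them for float32 accumulation at inference time.
    """
    scale = w.abs().max() / 127.0
    w_q = torch.clamp(torch.round(w / scale), min=-127, max=127).to(torch.int8)
    return w_q, scale

# Head layers: weights are stored in int8 but dequantized (w_q * scale)
# and applied to float32 activations, so the high-precision 10-bit
# output range is preserved.
# Backbone layers: both weights and activations are quantized to int8,
# which is where most of the mobile speed-up comes from.
```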

MQNet

Figure 2. Illustration of the base backbone and high-precision head that comprise the MQNet. IRLB blocks are used for fast inference and gated attention mechanisms [11] for improving feature representation learning accuracy. The dotted line indicates the separation between the fully quantized backbone and the dynamically quantized head. The input is added to the output of the head to produce the overall output.

For MQNet's backbone we use a U-Net architecture with skip connections; our quantization scheme reduces latency enough to allow this powerful architecture to operate in real time. To reduce computational cost further, the components of this U-Net are those introduced in MobileNet V2 [7], namely inverted residuals and linear bottlenecks (IRLB blocks henceforth). IRLB blocks are not only cheaper than vanilla convolutions but also suffer very little increase in latency when running inference on CPU. This means our model can run in real time on both the GPU and the CPU, which makes it practical for many applications and avoids costly transfers of data to an NPU.

IRLB blocks are used exclusively at every layer except the first in the encoder part of the U-Net backbone. In the decoder, the final two layers are cheap pointwise convolutional layers which follow the IRLB blocks. Simple upsampling is used in the decoder to return the feature maps to the original resolution; it was favoured over transposed convolutions for its cheapness and lower propensity for artefacts. The resulting backbone is quite lightweight, with 1M parameters compared with 7.76M for U-Net and 9.04M for U-Net++.
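
For readers unfamiliar with these components, the following is a minimal PyTorch sketch of a MobileNetV2-style IRLB block [7]; the exact expansion ratio, kernel sizes and normalization used in MQNet's backbone may differ.

```python
import torch.nn as nn

class IRLB(nn.Module):
    """Inverted residual with linear bottleneck (MobileNetV2-style [7]):
    1x1 expansion -> 3x3 depthwise -> 1x1 linear projection, with a
    residual connection when the stride is 1 and channel counts match."""

    def __init__(self, in_ch, out_ch, stride=1, expansion=6):
        super().__init__()
        hidden = in_ch * expansion
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=stride,
                      padding=1, groups=hidden, bias=False),  # depthwise
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, kernel_size=1, bias=False),  # linear bottleneck
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y
```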

Following the backbone is the head, which recovers a high-precision image from the 8-bit feature maps. It is comprised of convolution layers followed by instance normalization, and it learns a residual which is added to the original input image. Formally, the head produces the final HDR prediction by Ĥ = σ(I + φ(O)), where φ denotes a hyperbolic tangent activation function, I is the input LDR image and O is the output of the head. We use φ to learn the nonlinear transformation between pixel values of the LDR and HDR images, and the purpose of using σ, which is a sigmoid activation, is to map to relative illuminance values, i.e. the [0, 1] interval.
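
The residual combination itself is only a couple of lines; a sketch of it, with hypothetical tensor names, is shown below.

```python
import torch

def reconstruct_hdr(ldr_image: torch.Tensor, head_output: torch.Tensor) -> torch.Tensor:
    """Implements Ĥ = σ(I + φ(O)): tanh (φ) bounds the learned residual O,
    and the sigmoid (σ) maps the sum to relative illuminance in [0, 1]."""
    return torch.sigmoid(ldr_image + torch.tanh(head_output))
```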

Attention mechanisms

Although the model is latency-constrained, we experimented with several cheap attention mechanisms to improve performance. First, at the end of the first IRLB block in the decoder, we add Spatial Attention (SA) [10] gated blocks. We define SA blocks by:

Of = (σ1 ∘ C1)(If) ⊗ If                (1)

where If and Of are the input and output respectively, each with f channels, and ⊗ denotes element-wise product. C1 denotes a convolution with a single filter and a kernel of size 1, σ1 is a sigmoid activation, and ◦ indicates function composition. To improve the previous attention mechanism, we add channel information through a depthwise convolution in parallel to the SA mechanism, which results in the channel spatial attention (CSA) mechanism defined by

Of = σ1(C1(If) + Df(If)) ⊗ If                (2)

where Df is the depthwise convolution with f filters. Finally, although with a higher computational cost due to pooling mechanisms, we also test channel attention (CA) blocks. This operation, inspired by both residual layers and gated attention, is defined by

Of = If + (σ1 ∘ Cf ∘ σ2 ∘ Cf0 ∘ GP)(If) ⊗ If                (3)

where Cf denotes a convolution with f filters and kernels of size 1, f0 = f / r, and r is the reduction ratio of the attention mechanism. Finally, σ2 is the ReLU function and GP denotes global pooling. All three attention mechanisms are depicted in Figure 4.
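
The following PyTorch sketch shows one possible implementation of the three attention blocks, following the descriptions above; the depthwise kernel size, the way the parallel branches are combined, and the reduction ratio are our own assumptions for illustration rather than the exact configuration used in MQNet.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """SA: a 1x1 convolution to a single map, passed through a sigmoid
    and used to gate the input element-wise."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        return x * torch.sigmoid(self.gate(x))

class ChannelSpatialAttention(nn.Module):
    """CSA: channel information from a parallel depthwise convolution is
    combined with the spatial gate before the sigmoid."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, 1, kernel_size=1)
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels)

    def forward(self, x):
        return x * torch.sigmoid(self.spatial(x) + self.depthwise(x))

class ChannelAttention(nn.Module):
    """CA: global pooling (GP), channel reduction to f/r (C_f0), ReLU (σ2),
    expansion back to f channels (C_f), sigmoid gate (σ1), plus a residual
    add following the 'inspired by residual layers' description above."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # GP
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1),  # C_f0
            nn.ReLU(inplace=True),                              # σ2
            nn.Conv2d(channels // r, channels, kernel_size=1),  # C_f
            nn.Sigmoid(),                                       # σ1
        )

    def forward(self, x):
        return x + x * self.fc(self.pool(x))
```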

Figure 3. Illustration of feature maps learned at the Att. 4 layers of the MQNet depicted in Figure 2. The rows show, in order, results obtained without using an attention mechanism, followed by SA, CSA and CA. The first column shows the predicted HDR image Ĥ and the rest show the feature maps learned at different channels of the Att. 4 layer.

When analyzing the resulting feature maps in Figure 3, it is quite clear that the attention mechanisms give the model a much better understanding of the shadowy or bright areas which require additional attention. The CA block was capable of improving PSNR by over 0.5 dB, which is substantial in the context of this task.

Figure 4. Depiction of (a) the Channel Attention (CA) block, (b) the Spatial Attention (SA) block and (c) the Channel Spatial Attention (CSA) block. Conv refers to a standard convolution, the soft sigmoid symbol denotes the sigmoid activation, the rectilinear symbol denotes the ReLU activation, and ⊗ denotes element-wise product.

Experimental results

Figure 5. Visual comparison of results obtained using, from left to right: input LDR images, HDRCNN [3], ExpandNet [6], SingleHDR [5] and our proposed MQNet. All images are produced with the Balanced TMO from the Photomatix suite [1], similarly to [5].

Table 2. Comparison with other state-of-the-art single-image HDR reconstruction methods. Performance metric and latency values are reproduced with the same evaluation criteria and the original code. Blue and red indicate the best and second-best accuracy. P. indicates the number of parameters, L. M. indicates latency for mobile, M. RAM the maximum RAM memory consumed by the model, and O. the number of operations in multiply-accumulate units. Performance values are given in HDRVDP-Q score. *FHDR [4] uses recurrence: the value presented is computed taking into account two iterations.

We compared MQNet in terms of accuracy, latency and size with an array of state-of-the-art methods. The methods selected include both those which take the accuracy-latency tradeoff into consideration and those which do not; in this way we can see what is sacrificed by using such an efficient model. The results in Table 2 show that MQNet models provide accuracy (HDRVDP-Q score) on par with the larger models. Our model is also vastly faster in terms of mobile inference time on CPU, and is still the fastest on GPU platforms despite its operations not being optimized for that inference type.

Conclusion

In this blog post we presented our mixed quantization network, which has shown very promising results and allows for real-time HDR conversion even on over- and under-exposed inputs. It paves the way for increasingly complex models to be applied in real time and is easily adaptable to additional image-to-image tasks. We look forward to further benchmarking as we optimize this model for both camera and gaming pipelines. We provide links to the paper, supplementary material and code below.

Links

https://www.bmvc2021-virtualconference.com/conference/papers/paper_0753.html

https://github.com/BCJuan/ITMMQNet

References

[1] HDRSoft Ltd. Photo Editing Software for HDR & Real Estate Photography | Photomatix, 2017.

[2] Xiangyu Chen, Yihao Liu, Zhengwen Zhang, Yu Qiao, and Chao Dong. HDRUNet: Single Image HDR Reconstruction with Denoising and Dequantization. arXiv:2105.13084, 2021.

[3] Gabriel Eilertsen, Joel Kronander, Gyorgy Denes, Rafal K. Mantiuk, and Jonas Unger. HDR image reconstruction from a single exposure using deep CNNs. ACM Transactions on Graphics (TOG), 36(6):1–15, 2017.

[4] Zeeshan Khan, Mukul Khanna, and Shanmuganathan Raman. FHDR: HDR Image Reconstruction from a Single LDR Image using Feedback Network. arXiv preprint arXiv:1912.11463, 2019.

[5] Yu-Lun Liu, Wei-Sheng Lai, Yu-Sheng Chen, Yi-Lung Kao, Ming-Hsuan Yang, Yung-Yu Chuang, and Jia-Bin Huang. Single-Image HDR Reconstruction by Learning to Reverse the Camera Pipeline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1651–1660, 2020.

[6] Demetris Marnerides, Thomas Bashford-Rogers, Jonathan Hatchett, and Kurt Debattista. ExpandNet: A deep convolutional neural network for high dynamic range expansion from low dynamic range content. In Computer Graphics Forum, volume 37(2), pages 37–49. Wiley Online Library, 2018.

[7] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.

[8] Marcel Santana Santos, Tsang Ing Ren, and Nima Khademi Kalantari. Single image HDR reconstruction using a CNN with masked features and perceptual loss. arXiv preprint arXiv:2005.07335, 2020.

[9] S. M. A. Sharif, Rizwan Ali Naqvi, Mithun Biswas, and Kim Sungjun. A Two-stage Deep Network for High Dynamic Range Image Reconstruction. arXiv preprint arXiv:2104.09386, 2021.

[10] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.

[11] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision – ECCV 2018, Lecture Notes in Computer Science, pages 294–310, Cham, 2018. Springer International Publishing.