The Computer Vision and Pattern Recognition Conference (CVPR) is a world-renowned international Artificial Intelligence (AI) conference co-hosted by the Institute of Electrical and Electronics Engineers (IEEE) and the Computer Vision Foundation (CVF) which has been running since 1983. CVPR is widely considered to be one of the three most significant international conferences in the field of computer vision, alongside the International Conference on Computer Vision (ICCV) and the European Conference on Computer Vision (ECCV).
This year, Samsung Research’s R&D centers around the world are presenting a total of 20 papers at CVPR 2022. In this relay series, we introduce summaries of 6 of these research papers.
In particular, two papers submitted by the Toronto AI Center were selected for oral presentations. Oral presentation slots at CVPR are extended to only the top 4~5% of all submitted papers. For Samsung’s Toronto AI Center, this is the second time in two years they have earned such a chance, as they were also selected for oral presentation in 2020.
Here are the summaries of 6 of the 20 papers presented at CVPR 2022.
- Part 2 : Day-to-Night Image Synthesis for Training Nighttime Neural ISPs (by Samsung AI Center – Toronto)
- Part 3 : GP2: A Training Scheme for 3D Geometry-Preserving and General-Purpose Depth Estimation
The Samsung AI Research Center in Toronto consists of scientists, research engineers, faculty consultants, and MSc/PhD interns. Its broad mission is to develop core AI technology that improves the user experience with Samsung devices.
One research pillar at SAIC-Toronto is the development of core tech to improve the image quality of camera-captured images with emphasis on smartphone cameras, a key Samsung product.
One application that falls under this research umbrella is Night Mode photography, which in recent years has become the defining technology that allows a smartphone camera to compete with DSLRs in low-light conditions. Currently, the Night Mode module requires a laborious data capture process to deploy. Below, we describe a novel approach that substantially cuts down on this deployment time. Our work will be presented as an oral presentation at CVPR 2022.
Capturing images at night is extremely challenging for smartphone cameras because their compact form factor and small sensor size lead to high levels of noise. Many modern smartphones now use a dedicated neural network for ‘Night Mode’ operation. The neural network renders the noisy sensor image to the final processed photograph. Such Night Mode neural networks are trained using large, paired datasets of noisy/clean nighttime images. Capturing such nighttime image pairs is tedious, time-consuming, and error-prone. This blog presents a method that converts daytime images, which are much easier to capture, into noisy/clean nighttime image pairs suitable for Night Mode neural network training. Our day-to-night image synthesis framework significantly reduces the time and effort required to deploy neural networks targeting Night Mode. In addition, a network trained primarily with our synthetic night images yields performance on par with networks trained on real nighttime data.
Daytime images and nighttime images have very different visual characteristics. Outdoor daytime scenes are much brighter than nighttime scenes. Daytime scenes are dominated by sunlight, while nighttime scenes are lit by artificial sources such as incandescent, fluorescent, and LED lights. Most importantly, daytime images have very little noise, whereas nighttime images have much higher noise levels, particularly in smartphone cameras, as shown in Fig. 1. In general, daytime images are much easier to capture than nighttime images.
Figure 1. A comparison between daytime and nighttime images. Nighttime images have very high levels of noise, particularly in smartphone cameras.
Nighttime imaging is challenging because the amount of light in the scene is limited. Low-light environments are particularly troublesome for smartphone cameras because the small sensor size limits the amount of light per pixel. As a result, images must be captured using a high gain factor (i.e., high ISO) to amplify the signal, with the undesired effect of also boosting sensor noise. Cameras use dedicated onboard hardware, called an image signal processor (ISP), to convert the raw sensor data to the final processed output of the camera, which is usually a standard RGB (sRGB) image. When the camera's ISP processes noisy sensor images, photo-finishing operations further amplify the noise, resulting in aesthetically unappealing sRGB output images. One solution to reduce noise for a nighttime scene is to place the camera on a tripod and use a long exposure (e.g., several seconds). The long exposure allows more light to hit the sensor, but any movement of the camera or objects in the scene will cause the image to be blurry due to motion blur. For handheld smartphone cameras, using long exposures is not practical.
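To make the exposure trade-off concrete, here is a minimal sketch of why collecting more light helps. It assumes photon arrivals follow Poisson statistics (the standard shot-noise model) and ignores read noise; the photon counts and the `simulated_snr` helper are illustrative choices, not measurements from any Samsung sensor.

```python
import numpy as np

def simulated_snr(mean_photons, rng, n_pixels=100_000):
    """Empirical signal-to-noise ratio of a flat patch with Poisson photon counts."""
    counts = rng.poisson(mean_photons, size=n_pixels)
    return counts.mean() / counts.std()

rng = np.random.default_rng(0)
short_exposure_snr = simulated_snr(20, rng)    # few photons per pixel (hypothetical)
long_exposure_snr = simulated_snr(2000, rng)   # 100x more light (hypothetical)

# Applying a digital or analog gain multiplies signal and noise alike,
# so it brightens the image without improving the SNR.
print(f"short exposure SNR: {short_exposure_snr:.1f}")
print(f"long exposure SNR:  {long_exposure_snr:.1f}")
```

Under this model the SNR grows roughly with the square root of the photon count, which is why a long exposure is cleaner than a short, high-ISO one.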
Recent advancements in deep networks designed to render noisy raw nighttime images to the processed sRGB outputs have shown impressive results. The idea is to replace the traditional camera ISP with a dedicated "Neural Night Mode ISP" trained for nighttime photography. However, training such a neural ISP brings challenges in terms of data collection. These networks require large-scale datasets of aligned noisy/clean image pairs captured in low-light and nighttime environments for training. In particular, each image pair comprises: (1) a noisy raw input image captured with a short exposure and a high ISO, and (2) a target ground truth low-noise raw image captured with a long exposure and low ISO that has been rendered through the ISP. Capturing such image pairs is tedious and time-consuming, requiring careful setup to ensure alignment between the two images. In addition, ground truth images are often prone to motion blur due to the long exposure. Motion artifacts are especially troublesome when imaging outdoor scenes, where something in the scene will almost inevitably move. Further complicating data collection, data capture and network training are required per sensor, as raw images are sensor specific. From an industry standpoint, capturing data for Night Mode represents a significant burden. This is especially true for smartphone cameras, where sensors are continually updated, and many devices now have multiple cameras with different underlying sensors.
We present a method to reduce the reliance on carefully captured paired nighttime images. Specifically, we propose a procedure that processes daytime images to produce pairs of high-quality and low-quality nighttime images. There are several advantages to using daytime images. Unlike capturing nighttime images under low-light conditions, capturing daytime images with proper exposure is straightforward and does not require careful camera or scene setup. Outdoor daytime scenes are well-illuminated, resulting in images with significantly less noise than images taken at night. Moreover, there is no need to use a long exposure to capture daytime images, minimizing motion blur artifacts.
We show that our day-to-night image synthesis framework is useful for nighttime image processing and enhancement through training neural ISPs to render nighttime noisy raw images to their final clean sRGB outputs, as illustrated in Fig. 2.
Figure 2. We present a procedure to convert day images to aligned pairs of noisy/clean nighttime images. A Night Mode Neural ISP trained on our synthetic nighttime images can be used to process real noisy nighttime data.
Our day-to-night image synthesis procedure is applied to the raw images captured in day environments. Fig. 3 shows an overview of our framework.
Figure 3. An overview of our proposed day-to-night image synthesis framework. Our procedure involves removing illumination in the day raw image, lowering the exposure, relighting the scene with night illuminants, and adding noise to mimic a real nighttime raw image. For visualization, the raw images have been demosaiced, and a gamma has been applied.
Our pipeline begins with a noise-free daylight raw image. We first normalize the raw sensor data to produce a normalized image. Cameras apply a white-balance routine to the raw image to remove the color cast introduced by the scene illumination. While this process ensures that the achromatic (gray) colors are corrected, it cannot guarantee that all colors are illumination compensated. Compared to other illuminations, images captured under daylight have the special property that they incur the least error in terms of overall color correction when white-balanced (Cheng et al.). This is why we specifically select images captured outdoors under daylight to apply our synthesis model. As the next step of our pipeline, we remove the day lighting by applying white balance, which yields an illumination-corrected image.
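The normalization and white-balance steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it works on a demosaiced three-channel image rather than a Bayer mosaic, uses a diagonal (per-channel gain) white balance, and the black level, white level, and illuminant values are made up for the example.

```python
import numpy as np

def normalize_raw(raw, black_level, white_level):
    """Map raw sensor values to [0, 1] using the sensor's black and white levels."""
    raw = raw.astype(np.float64)
    return np.clip((raw - black_level) / (white_level - black_level), 0.0, 1.0)

def apply_white_balance(rgb, illuminant_rgb):
    """Remove the illuminant color cast with per-channel gains.

    Gains are scaled so that the green channel is left unchanged.
    """
    illuminant_rgb = np.asarray(illuminant_rgb, dtype=np.float64)
    gains = illuminant_rgb[1] / illuminant_rgb
    return rgb * gains

# Toy 2x2 image of a gray surface lit by a hypothetical daylight illuminant.
black_level, white_level = 64, 1023            # assumed 10-bit sensor range
day_illuminant = np.array([0.5, 1.0, 0.8])     # hypothetical raw-space illuminant
gray_reflectance = 0.4
pixel = np.round(gray_reflectance * day_illuminant * (white_level - black_level)
                 + black_level)
raw = np.broadcast_to(pixel, (2, 2, 3))

normalized = normalize_raw(raw, black_level, white_level)
balanced = apply_white_balance(normalized, day_illuminant)
```

After white balancing, the three channels of the gray surface are nearly equal (up to quantization), which is the sense in which the day lighting has been removed.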
Nighttime images usually have a lower average brightness than daytime images, and they are often illuminated by multiple different light sources. The next two stages of our pipeline are designed to model these effects. We lower the exposure of the daytime image by multiplying it with a random global scale factor to generate a dimmed image. Next, we locally relight the scene using a small random set of night illuminations with random locations and falloffs. Finally, we denormalize the result to obtain our synthetic nighttime raw image.
This modified raw image can now be rendered through the camera ISP to produce the photo-finished image. The image at this stage represents the high-quality nighttime image, used as the target to train the Night Mode neural ISP. Adding noise to the modified raw data yields the low-quality image, which mimics real nighttime images and serves as the input to the Night Mode neural ISP.
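A rough sketch of the dimming, relighting, and denormalization stages is shown below. The scale-factor range, the Gaussian falloff, the number of lights, and the illuminant colors are all assumptions chosen for illustration; the post does not specify these details, and the function names are hypothetical.

```python
import numpy as np

def dim(image, rng, low=0.05, high=0.3):
    """Lower the exposure with one random global scale factor (range is assumed)."""
    return image * rng.uniform(low, high)

def relight(image, rng, illuminants, max_lights=3):
    """Relight with a few colored lights at random positions (Gaussian falloff assumed)."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    light_map = np.zeros((h, w, 3))
    for _ in range(rng.integers(1, max_lights + 1)):
        color = illuminants[rng.integers(len(illuminants))]
        cy, cx = rng.uniform(0, h), rng.uniform(0, w)
        sigma = rng.uniform(0.3, 1.0) * max(h, w)
        falloff = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        light_map += falloff[..., None] * color
    light_map /= light_map.max()   # relit image is never brighter than the dimmed one
    return image * light_map

def denormalize(image, black_level, white_level):
    """Map a [0, 1] image back to the sensor's raw value range."""
    return image * (white_level - black_level) + black_level

rng = np.random.default_rng(0)
black_level, white_level = 64, 1023                      # assumed sensor range
night_illuminants = np.array([[1.0, 0.6, 0.3],           # warm, incandescent-like
                              [0.8, 1.0, 0.9],           # fluorescent-like
                              [0.7, 0.9, 1.0]])          # cool, LED-like
white_balanced_day = rng.uniform(0.2, 1.0, size=(8, 8, 3))  # stand-in for a real image

dimmed = dim(white_balanced_day, rng)
relit = relight(dimmed, rng, night_illuminants)
night_raw = denormalize(relit, black_level, white_level)
```

Because each light has its own color and position, different regions of the output carry different color casts, which loosely mimics a mixed-illuminant night scene.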
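The noise-addition step might look like the sketch below. It assumes a signal-dependent (heteroscedastic) Gaussian noise model, a common approximation for raw sensor noise; the post does not state which model or parameters were used, so `shot_gain` and `read_var` are hypothetical values that would normally be calibrated per sensor and per ISO.

```python
import numpy as np

def add_sensor_noise(clean, shot_gain, read_var, rng):
    """Add Gaussian noise whose variance is shot_gain * signal + read_var.

    `clean` is a normalized raw image in [0, 1].
    """
    variance = shot_gain * np.clip(clean, 0.0, None) + read_var
    noisy = clean + rng.normal(size=clean.shape) * np.sqrt(variance)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
clean = np.full((64, 64), 0.1)   # flat, dim patch standing in for the clean night image

# Hypothetical parameters for a lower and a higher ISO setting.
noisy_low_iso = add_sensor_noise(clean, shot_gain=1e-4, read_var=1e-6, rng=rng)
noisy_high_iso = add_sensor_noise(clean, shot_gain=1e-2, read_var=1e-4, rng=rng)
```

The clean image and its noisy counterpart are pixel-aligned by construction, which is exactly the property that is hard to guarantee when capturing real pairs.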
We evaluate our day-to-night image synthesis algorithm by using the synthetic data generated by it to train a Night Mode neural ISP. We use a Samsung S20 FE smartphone to collect data. We capture 70 day images at ISO 50, and 105 nighttime scenes at ISO 50 (ground truth) and ISOs 1600 and 3200 (noisy input). Representative examples from our dataset are shown in Fig. 4.
Figure 4. Representative examples from our dataset. Our dataset contains day images and nighttime bursts.
We compare our day-to-night image synthesis with several baselines: directly using the day images for training, day images with dimming, and day images with dimming plus a global relighting using a single illuminant. In addition, we compare against the popular unpaired image translation approach CycleGAN (Zhu et al.) to convert daytime images to nighttime images. We also compare against a fully supervised framework, similar to SID (Chen et al.), where the network is trained purely using real paired nighttime data.
Table 1. Quantitative results of our method, along with comparisons. The models are partitioned based on whether the training data is synthetic only, a mix of synthetic and real, or purely real.
The results presented in Table 1 show that our method outperforms all synthetic data models. More notably, a network trained on our synthetic nighttime raw images, and augmented with a minimal amount of real data (5% to 10%), achieves performance close to that offered by training only on real nighttime paired data. Fig. 5 shows two visual examples.
Figure 5. Qualitative results of our method, along with comparisons. Inset shows zoomed-in region and PSNR (dB) / SSIM values.
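For readers unfamiliar with the PSNR metric reported in Fig. 5, a minimal sketch of the standard definition follows. It assumes images normalized to [0, 1] (so `peak=1.0`); the toy arrays are illustrative only and are unrelated to the results in this post.

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between a reference and an estimate."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

reference = np.zeros((4, 4))
estimate = np.full((4, 4), 0.1)   # uniform error of 0.1 gives MSE = 0.01
score = psnr(reference, estimate)  # 10 * log10(1 / 0.01) = 20 dB
```

Higher values indicate that the network output is closer to the ground truth image.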
We have presented a procedure to convert daytime raw images to pairs of noisy/clean nighttime raw-sRGB images. Currently, a great deal of time and resources are spent capturing paired training data for Night Mode ISPs, and this process is repeated for each new camera model that is released. Our day-to-night data augmentation strategy can greatly reduce the time and effort required to prepare Night Mode neural ISP training data.
C. Chen, Q. Chen, J. Xu and V. Koltun, "Learning to see in the dark," in Computer Vision and Pattern Recognition, 2018.
D. Cheng, B. Price, S. Cohen and M. S. Brown, "Beyond white: Ground truth colors for color constancy correction," in The International Conference on Computer Vision, 2015.
J.-Y. Zhu, T. Park, P. Isola and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in The International Conference on Computer Vision, 2017.