
LP-IOANet: Efficient High-Resolution Document Shadow Removal

By Mehmet Kerim Yucel, Albert Saa-Garriga and Bruno Manganelli, Samsung R&D Institute United Kingdom

Motivation

Imagine you have just come back from a work trip and need to submit an expense claim with the receipts you’ve kept, or your meeting has just finished and you need to transfer the minutes you took on paper to your PC. Instead of copying these contents manually, we all tend to immediately take a picture of these receipts and pages with our mobile phone. With on-device scanner solutions, such as the Samsung document scanner, it is even easier to avoid this laborious digitization process.

These digitization processes, however, are often hampered by low-quality images. A common culprit is actually how we tend to take the pictures: we hold the phone right above the page for better visibility, but in doing so we often block the light source and cast shadows onto the document. We can always change where we shoot from, or reposition the document, to avoid casting these shadows. Now imagine this; what if we didn’t have to?

As shown in previous works [1] [2] [3], including our recent work [4], image-to-image ML models are effective at detecting and removing such shadows automatically. A practical solution, however, needs to check three boxes:

1- It has to be lightweight and fast enough to run on mobile devices.

2- It has to operate at high resolutions without compromising quality or speed.

3- It has to handle various document types: different languages, text-only documents, figure-heavy documents, etc.

Existing ML-based methods [2] [5] [3] [1] fail to achieve 1 and 2, and non-ML methods [6] [7] [8] fail to achieve 3. Our primary aim is to achieve all three at the same time!

We present LP-IOANet (Laplacian Pyramid with Input/Output Attention Network), a novel document shadow removal solution that checks all three boxes (for more details, see our paper). This work will be presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023. LP-IOANet outperforms the state-of-the-art method by a 35% relative improvement in mean absolute error (MAE), while running in real time on a mobile device at four times the resolution of the state-of-the-art.

LP-IOANet

Our work consists of three main contributions: i) a performant network architecture for shadow removal, ii) a cheap-yet-effective way to operate at high resolutions, and iii) dataset curation processes that help our models generalize.

The Core Architecture

We first start with our base network architecture, IOANet. IOANet needs to be as small as possible, yet as accurate as the two-network coarse/refinement setups often used by alternative methods. We utilise an efficient encoder-decoder architecture with which we previously achieved state-of-the-art results in monocular depth estimation [9]. This architecture works adequately off the shelf, but why not push it further?

We now take a step back and focus on what shadow removal actually does. By definition, it only “changes” the parts of the image that contain shadows; the non-shadow areas should not change at all. This means we do not need to spend network capacity on non-shadow areas, and we can use this to our advantage!

To this end, we propose a simple extension to the core architecture: we introduce lightweight attention modules [10] over the input and the output of the network. As we showed in our previous works [11] [4], these modules help the network focus on shadow areas, and also help with the colour-correction and blending operations required after shadow removal.
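To make this concrete, below is a minimal PyTorch sketch of the kind of lightweight coordinate attention block [10] that can be placed over the input and output. This is an illustrative reimplementation under our own simplifying assumptions, not our exact production code; the backbone passed to the wrapper is a placeholder for our efficient encoder-decoder.

import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Simplified coordinate attention (Hou et al., 2021)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 4)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool over width  -> (N, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool over height -> (N, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        xh = self.pool_h(x)                                       # (n, c, h, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)                   # (n, c, w, 1)
        y = self.act(self.conv1(torch.cat([xh, xw], dim=2)))      # joint 1x1 conv
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                       # per-row weights (n, c, h, 1)
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))   # per-column weights (n, c, 1, w)
        return x * ah * aw                                        # re-weight shadow regions

class IOANetSketch(nn.Module):
    """Attention over the raw input and the raw output of the backbone."""
    def __init__(self, backbone):
        super().__init__()
        self.in_att = CoordAttention(3)    # attends over the RGB input
        self.backbone = backbone           # placeholder encoder-decoder
        self.out_att = CoordAttention(3)   # helps colour correction / blending

    def forward(self, x):
        return self.out_att(self.backbone(self.in_att(x)))

Because the attention blocks are just a few 1x1 convolutions over pooled features, they add almost nothing to compute or memory, which matches the ablation we report below.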

Figure 1.  Our IOANet shadow removal network. It is trained on low-resolution images (of our proposed datasets) with L1 and LPIPS losses. This figure represents the first phase of our training.

An Efficient Upsampler

As mentioned before, we want to be fast and accurate on high-resolution images. We could feed high-resolution inputs directly to IOANet, but the memory consumption would be too high for on-device operation. We want to operate at high resolutions with much less memory, without a loss in accuracy.

We leverage Laplacian Pyramid Networks [12] for this purpose. The pyramid network wraps our IOANet and lets it operate at low resolution: it preserves the high-frequency content of the input image while downsampling it, feeds the downsampled input to IOANet, and progressively upsamples the output (to four times the resolution) while re-using the preserved high-frequency content. One issue with using the pyramid network off the shelf is that it is too compute-intensive for on-device operation, so we propose and use a significantly device-optimized pyramid network. Combined with IOANet, we now have our final architecture, LP-IOANet!
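Conceptually, the wrapper looks like the sketch below. It shows only a fixed Laplacian decompose/reconstruct skeleton with IOANet plugged in at the lowest level; the actual LPTN-style network [12], and our device-optimized variant of it, additionally refine the high-frequency bands with learned lightweight layers.

import torch.nn.functional as F

def laplacian_decompose(img, levels=2):
    """Split an image into high-frequency bands plus a low-resolution base."""
    bands, cur = [], img
    for _ in range(levels):
        down = F.interpolate(cur, scale_factor=0.5, mode='bilinear', align_corners=False)
        up = F.interpolate(down, size=cur.shape[-2:], mode='bilinear', align_corners=False)
        bands.append(cur - up)   # high-frequency residual at this scale
        cur = down
    return bands, cur            # levels=2 gives a quarter-resolution base

def laplacian_reconstruct(bands, base):
    """Progressively upsample the (now shadow-free) base and re-add the residuals."""
    cur = base
    for band in reversed(bands):
        cur = F.interpolate(cur, size=band.shape[-2:], mode='bilinear', align_corners=False)
        cur = cur + band
    return cur

# Usage: run the shadow removal network only at 1/4 resolution.
# bands, low = laplacian_decompose(img, levels=2)
# out = laplacian_reconstruct(bands, ioanet(low))   # back to full resolution

Since the expensive network only ever sees the low-resolution base, memory and compute stay almost flat as the input resolution grows.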

Figure 2.  Our LP-IOANet pipeline. Using a pretrained IOANet (see Figure 1), LP-IOANet is then trained on our proposed A-BSDD dataset in high-resolution using L1 loss. This figure represents the second phase of our training.

Datasets

We want our solution to work effectively for any type of document (i.e. different languages, text/figure ratios, colors, etc.). As with other ML problems, we set our sights on a key factor that can help us achieve this: data!

Since there are no freely available large-scale datasets that fit our needs, we cannot simply use existing ones. We can, however, synthesize our own data, which we do to create not one but three new datasets: BSDD, Doc3DS+ and A-OSR.

BSDD consists of 3863 images created from 1328 unique document images. We use Blender to cast various shadows onto them to create the final dataset, largely following the SDSRD dataset creation process [3].

Doc3DS+ is created from the Doc3DShade dataset [13], which we further process to preserve the background colors and to rotate the images closer to our desired A4 document resolution.

Augmenting the datasets. An often-overlooked factor in shadow removal is the colour of the shadow: natural shadows have varying colours and intensities. We therefore need a way to mimic this behaviour in our synthetically generated data. We do so by randomly applying illumination and colour augmentation to our cast shadows, which dramatically broadens our training data distribution. We apply this augmentation to the BSDD and OSR [14] datasets to obtain their final versions, A-BSDD and A-OSR.
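As a rough illustration, such an augmentation can be modelled as compositing the cast shadow with a random darkening factor and a random per-channel colour tint. The sketch below is a simplified model under our own assumptions; the ranges are illustrative, not the exact values we use.

import numpy as np

def augment_shadow(shadow_free, matte, rng=None):
    """Composite a shadow with random intensity and colour tint.

    shadow_free: float image in [0, 1], shape (H, W, 3)
    matte: shadow opacity in [0, 1], shape (H, W, 1)
    """
    if rng is None:
        rng = np.random.default_rng()
    intensity = rng.uniform(0.3, 0.8)       # how much the shadow darkens the page
    tint = rng.uniform(0.85, 1.0, size=3)   # per-channel colour cast of the shadow
    shadowed = shadow_free * intensity * tint
    return matte * shadowed + (1.0 - matte) * shadow_free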

Figure 3.  An overview of our datasets. High-resolution refers to 768x1024 and low-resolution refers to 192x256.

Two-stage Training

Now that we have more data, we need to train our models on it. However, not all our datasets are high-resolution (in fact, only A-BSDD is). If we trained LP-IOANet in one stage, we could only train it on A-BSDD and could not properly utilize A-OSR and Doc3DS+. Note that this is a limitation of the original pyramid network [12] as well. So, how do we fix this?

We propose a simple two-stage training regime that lets us leverage data of any resolution! The idea is simple: we first train the core IOANet architecture on all three datasets at low resolution. In the second stage, we freeze IOANet and only train the “LP” part of LP-IOANet on A-BSDD.
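In pseudo-PyTorch, the two stages look roughly like the following. Here lowres_loader, highres_loader, lpips_loss and lp_ioanet are hypothetical placeholders for our data pipelines, perceptual loss and the pyramid-wrapped model, and the learning rates are illustrative.

import torch
import torch.nn.functional as F

# Stage 1: train IOANet on low-resolution pairs from all three datasets.
opt = torch.optim.Adam(ioanet.parameters(), lr=1e-4)
for inp, gt in lowres_loader:                 # A-BSDD + Doc3DS+ + A-OSR, 192x256
    pred = ioanet(inp)
    loss = F.l1_loss(pred, gt) + lpips_loss(pred, gt)   # L1 + perceptual loss
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: freeze IOANet and train only the pyramid ("LP") layers on A-BSDD.
for p in ioanet.parameters():
    p.requires_grad = False
lp_params = [p for p in lp_ioanet.parameters() if p.requires_grad]
opt = torch.optim.Adam(lp_params, lr=1e-4)
for inp, gt in highres_loader:                # A-BSDD only, 768x1024
    pred = lp_ioanet(inp)
    loss = F.l1_loss(pred, gt)                # L1 only in the second stage
    opt.zero_grad(); loss.backward(); opt.step()

Because stage 1 never touches the pyramid layers, it can freely consume low-resolution datasets that would otherwise be unusable for high-resolution training.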

Experiments

We refer the reader to our full paper for further details; here we present some key highlights. For quantitative analysis and comparison, we use the commonly used metrics: mean absolute error (MAE), peak signal-to-noise ratio (PSNR) and structural similarity (SSIM).
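For reference, all three metrics can be computed with off-the-shelf tooling. A minimal sketch using scikit-image, assuming float images in [0, 1]:

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred, gt):
    """pred, gt: float images in [0, 1], shape (H, W, 3)."""
    mae = np.abs(pred - gt).mean()                                        # lower is better
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)              # higher is better
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)  # higher is better
    return mae, psnr, ssim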

Component Analyses

We have presented several design choices for LP-IOANet. Now, we justify their inclusion in our pipeline with empirical results.

Input-output attention. We first show the benefit of adding the input-output attention blocks to the backbone architecture. The results in Table 1 show the clear improvement (0.4 MAE) that the input-output attention blocks bring. Furthermore, they have virtually no effect on model complexity (GFLOPs) or memory consumption, and only a negligible effect (1.5 ms) on runtime.

Table 1.  Low-resolution evaluation on the BSDD test set on an RTX 3090 GPU. MAE and PSNR values are reported as all regions / non-shadow regions / shadow regions.

Upsampling. We now compare our pyramid network with existing upsampling solutions. The results in Table 2 show that our solution runs nearly three times faster and requires 20 times fewer computations (GFLOPs) than existing upsamplers. Furthermore, the accuracy loss with our method is negligible, which makes our solution quite appealing.

Table 2.  High-resolution evaluation on BSDD test set on an RTX 3090 GPU and Samsung Galaxy S22 Ultra GPU. See our paper for further details on LPTN-lite.

Datasets. Finally, we show the effect of our datasets on performance. Since all datasets are included in the first training phase (i.e. IOANet-only training), we report results at low resolution. The results in Table 3 show that each of the three proposed datasets provides visible improvements, that our dataset augmentation helps, and that all datasets work best when combined.

Table 3.  Low-resolution IOANet evaluation on the BSDD test set using different datasets. All refers to training on all datasets jointly.

Mobile Performance

A key advantage of LP-IOANet is its mobile-friendly nature. Table 4 shows that LP-IOANet is significantly faster at high resolutions and is the only solution that runs in real time on mobile devices.

Table 4.  Runtime performance (at high-resolutions) on Samsung Galaxy S22 Ultra GPU.

Comparison with the State-of-the-art

Finally, we compare our method against the state-of-the-art. BEDSR [3] is the state-of-the-art among ML-based document shadow removal methods, and is therefore the primary baseline we compare against. The results are shown in Table 5.

Table 5.  Low-resolution (first three rows) and high-resolution (last two rows) evaluation on BSDD test-set. Runtime and memory are evaluated on RTX 3090 GPUs. The first two rows are trained only on A-BSDD. The last three rows are trained on all three datasets.

We first evaluate IOANet at low resolution. The first two rows show that IOANet outperforms BEDSR significantly in all metrics, despite running 10 times faster and consuming nearly 30 times less memory. Once we train IOANet with all our datasets, we see even further improvements across all metrics.

Moving to high resolution, we see that BEDSR cannot even be trained on a GPU with 24 GB of VRAM, so it is impractical to use even on high-end GPUs. LP-IOANet, on the other hand, runs at nearly 85 FPS with minimal memory consumption. Note that LP-IOANet runs in real time on mobile too!

Qualitative Results

Figure 4 shows document images from the test sets of our datasets. The figure shows a) the input image and the outputs of b) BEDSR, c) IOANet without attention (marked with *), d) our IOANet and e) our LP-IOANet. We highlight the differences with overlaid red boxes.

Our method successfully handles artefacts (1st and 2nd rows) and preserves high-frequency content (3rd row), even at high resolutions where errors naturally become more visible to the human eye.

Figure 4.  Example outputs of our method.

Conclusion

We introduce LP-IOANet, a lightweight, high-resolution document shadow removal solution. It checks all the boxes required for mobile deployment: it operates at high resolution in real time, while generalizing successfully to various distributions. LP-IOANet consists of the core IOANet architecture and a lightweight upsampling solution that encapsulates it. When trained with our two-stage regime on our proposed datasets, LP-IOANet improves on the state-of-the-art by a relative 35%, while running in real time at high resolutions on a mobile device.

Link to the paper

Our paper "LP-IOANet: Efficient High-Resolution Document Shadow Removal" will appear at IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.

https://arxiv.org/pdf/2303.12862.pdf

References

[1] J. Wang, X. Li and J. Yang, “Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[2] X. Cun, C.-M. Pun and C. Shi, “Towards ghost-free shadow removal via dual hierarchical aggregation network and shadow matting gan,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020.

[3] Y.-H. Lin, W.-C. Chen and Y.-Y. Chuang, “Bedsr-net: A deep shadow removal network from a single document image,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

[4] M. K. Yücel, V. Dimaridou, B. Manganelli, M. Ozay, A. Drosou and A. Saà-Garriga, “LRA&LDRA: Rethinking Residual Predictions for Efficient Shadow Detection and Removal,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023.

[5] L. Fu, C. Zhou, Q. Guo, F. Juefei-Xu, H. Yu, W. Feng, Y. Liu and S. Wang, “Auto-exposure fusion for single-image shadow removal,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.

[6] S. Bako, S. Darabi, E. Shechtman, J. Wang, K. Sunkavalli and P. Sen, “Removing shadows from images of documents,” in Asian Conference on Computer Vision, 2016.

[7] S. Jung, M. A. Hasan and C. Kim, “Water-filling: An efficient algorithm for digitized document shadow removal,” in Asian Conference on Computer Vision, 2018.

[8] N. Kligler, S. Katz and A. Tal, “Document enhancement using visibility detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[9] M. K. Yucel, V. Dimaridou, A. Drosou and A. Saa-Garriga, “Real-time monocular depth estimation with sparse supervision on mobile,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2021.

[10] Q. Hou, D. Zhou and J. Feng, “Coordinate attention for efficient mobile network design,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021.

[11] K. Georgiadis, A. Saà-Garriga, M. K. Yucel, A. Drosou and B. Manganelli, “Adaptive Mask-Based Pyramid Network for Realistic Bokeh Rendering,” in Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part II, 2023.

[12] J. Liang, H. Zeng and L. Zhang, “High-Resolution Photorealistic Image Translation in Real-Time: A Laplacian Pyramid Translation Network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.

[13] S. Das, H. A. Sial, K. Ma, R. Baldrich, M. Vanrell and D. Samaras, “Intrinsic Decomposition of Document Images In-the-Wild,” in British Machine Vision Conference 2020, 2020.

[14] B. Wang and C. L. Chen, “Local Water-Filling Algorithm for Shadow Detection and Removal of Document Images,” Sensors, vol. 20, p. 6929, 2020.