A Modular System for Enhanced Robustness of Multimedia Understanding Networks via Deep Parametric Estimation

By Francesco Barbato, Umberto Michieli and Mete Ozay (Samsung R&D Institute United Kingdom) and Jijoong Moon (Samsung Research)

In this blog we introduce our new work [1] published at the ACM MMSys’24 conference.


I have a challenge for you: if I showed you a very noisy image, would you be able to recognize its content confidently? No? Neither can AI models.
Like humans, they struggle to understand the content of corrupted images and, unfortunately, they tend to be far more sensitive to the problem than we are.

We show an example in the figure below: to a human, the images look identical, and they would have no issue identifying the content (elephant). However, we subtly modified the second image so that a target AI model completely fails to recognize the animal, predicting hamster instead.

Figure 1a. Prediction: African Elephant

Figure 1b. Prediction: Hamster

This is a major problem for AI systems embedded in everyday devices, since common natural corruptions (albeit to a lesser degree) lead to misclassifications as well.

In our paper, referred to as SyMPIE [1], we tackle the issue of AI model robustness efficiently, designing a modular system that cleans the input images before feeding them to the AI models. Our approach improves visual appearance and classification accuracy at the same time: in the next figure, we show the results of applying our strategy to the modified image.

Figure 2. Prediction: African Elephant

Notice how the colours are more saturated and the contrast is higher in the processed image compared to the input. The differences are hard for a human to spot, but AI models are highly sensitive to them.

The desiderata of our setup are as follows:

Improve downstream task accuracy by enhancing the input images.
The cost of the enhancement step must be negligible when compared to the one of the downstream task.
The accuracy improvement brought by the enhancement must generalize to diverse corruptions and downstream tasks.

Our Method: SyMPIE

As the name suggests, our System for Modular Parametric Image Enhancement (SyMPIE) is a parametric image enhancer built with modularity in mind.

In Figure 3 we show a before/after schematic representation of a multimedia system making use of our architecture, which is inserted between the input and the AI model to clean the former.

Figure 3. An illustration of our modular system (SyMPIE) for efficient image enhancement targeting increased model robustness to corruption in different multimedia tasks. SyMPIE contains two modules, namely, NEM and DWM.

Figure 4 shows a more detailed view of the SyMPIE architecture and of the two modules that comprise it: Noise Estimation Module (NEM) and Differentiable Warping Module (DWM).

Figure 4. A detailed scheme of our modules working together to enhance the content of an image. The Noise Estimation Module (NEM) receives a corrupted input and predicts a triple of parameters (Cs, Cm, K). These parameters are used by the Differentiable Warping Module (DWM) to enhance the image using parametric operators.

The core idea behind SyMPIE is that directly estimating the warping parameters is more computationally efficient than cleaning the input with a black-box image-to-image model.
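To make the idea concrete, here is a minimal NumPy sketch of what such a parametric enhancement step could look like. The operator forms and parameter shapes below (a per-channel scale Cs, a 3x3 color-mixing matrix Cm, and a small spatial kernel K) are our illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def dwm_apply(img, Cs, Cm, K):
    """Illustrative parametric warping step in the spirit of the DWM.

    img: HxWx3 float image in [0, 1]
    Cs:  per-channel color scale, shape (3,)   -- hypothetical shape
    Cm:  3x3 color-mixing matrix               -- hypothetical shape
    K:   kxk spatial kernel (e.g. sharpening)  -- hypothetical shape
    """
    # 1) per-channel color scaling (e.g. restore saturation)
    out = img * Cs
    # 2) color mixing across channels (e.g. fix a color cast)
    out = out @ Cm.T
    # 3) depthwise spatial convolution with kernel K, "same" padding
    k = K.shape[0]
    pad = k // 2
    padded = np.pad(out, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    conv = np.zeros_like(out)
    for i in range(k):
        for j in range(k):
            conv += K[i, j] * padded[i:i + out.shape[0], j:j + out.shape[1], :]
    return np.clip(conv, 0.0, 1.0)

# identity parameters leave the image unchanged
img = np.random.rand(8, 8, 3)
Cs = np.ones(3)
Cm = np.eye(3)
K = np.zeros((3, 3)); K[1, 1] = 1.0
restored = dwm_apply(img, Cs, Cm, K)
```

Because every operator is differentiable in (Cs, Cm, K), a small estimation network can be trained to predict these few parameters instead of regressing every output pixel.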

We confirm this in Table 1, where we report the Floating Point Operations (FLOPs) needed for a single inference of each method. The results show that our method is at least 10x faster than the competitors.

Table 1. Computational complexity of our method compared to other input-level image enhancement strategies.
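A back-of-the-envelope FLOPs count illustrates why parameter estimation is so much cheaper than black-box enhancement. The layer sizes below are hypothetical, chosen only to show the order-of-magnitude gap between a full-resolution image-to-image network and a tiny estimator on a downscaled input:

```python
def conv_flops(h, w, c_in, c_out, k):
    """Approximate multiply-add count of one conv layer (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k

# hypothetical full-resolution image-to-image denoiser (10 conv layers at 224x224)
denoiser = sum(conv_flops(224, 224, 64, 64, 3) for _ in range(10))

# hypothetical tiny parameter estimator on a 64x64 downscaled input (4 layers)
estimator = sum(conv_flops(64, 64, 16, 16, 3) for _ in range(4))

print(f"denoiser:  {denoiser / 1e9:.2f} GFLOPs")
print(f"estimator: {estimator / 1e9:.3f} GFLOPs")
print(f"ratio:     {denoiser / estimator:.0f}x")
```

Predicting a handful of warping parameters additionally removes the need to produce a full-resolution output tensor, which is where image-to-image competitors spend most of their compute.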

Unlike other denoising approaches, SyMPIE does not require paired clean-corrupted data for training: it is trained end-to-end by exploiting a frozen downstream network.

The image enhancement objective is an emergent property of the downstream task optimization, since cleaner images lead to better results on the frozen model.
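The principle of training an enhancer through a frozen task model can be shown with a deliberately tiny toy example. Here the "downstream model" is a fixed linear map, the "enhancer" is a single learnable gain, and the gradient is derived by hand; none of this is the paper's actual architecture or loss, only a sketch of the training signal flow:

```python
import numpy as np

rng = np.random.default_rng(0)

# frozen "downstream model": a fixed linear map (its weights are never updated)
W = rng.normal(size=(4, 4))

# toy data: clean inputs and their downstream outputs (the supervision target)
x_clean = rng.random((16, 4))
y = x_clean @ W.T

# "corruption": a global intensity loss (image dimmed to half brightness)
x_corrupt = 0.5 * x_clean

# "enhancer": a single learnable gain a, trained through the frozen model
a = 1.0
lr = 0.02
for _ in range(2000):
    pred = (a * x_corrupt) @ W.T
    err = pred - y
    # dL/da for L = mean(err^2); W is frozen, so only a receives updates
    grad = 2.0 * np.mean(err * (x_corrupt @ W.T))
    a -= lr * grad

print(a)  # the enhancer learns a gain of approximately 2.0, undoing the dimming
```

No paired clean-corrupted pixels are ever compared: the enhancer is rewarded purely for making the frozen downstream model succeed, and "cleaning" emerges as the means to that end.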

In Figure 5, we show a schematic representation of the training strategy employed for SyMPIE, while Algorithm 1 provides a more detailed description.

Figure 5. An overview of the training procedure of our modular system.

We employ a two-step procedure exploiting an exponential moving average (EMA) of our modules' weights to avoid a common failure mode of denoisers: mode collapse upon iterated application.
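The EMA component can be sketched in a few lines. The decay value and the dictionary-of-arrays representation below are illustrative assumptions; the point is that the averaged copy evolves slowly and smoothly, which is what makes repeated application stable:

```python
import numpy as np

def ema_update(ema_weights, new_weights, decay=0.99):
    """Exponential moving average of module weights.

    The EMA copy changes slowly, so applying it repeatedly to an image
    does not drift the way a rapidly-updating training copy can.
    """
    return {k: decay * ema_weights[k] + (1.0 - decay) * new_weights[k]
            for k in ema_weights}

# toy demo: the EMA smoothly tracks a noisy weight trajectory centered at 1
rng = np.random.default_rng(0)
weights = {"w": np.zeros(3)}
ema = {"w": np.zeros(3)}
for step in range(1000):
    weights["w"] = np.ones(3) + 0.5 * rng.normal(size=3)  # noisy updates around 1
    ema = ema_update(ema, weights)

print(ema["w"])  # close to the underlying mean [1, 1, 1], with the noise averaged out
```

Averaging over many noisy updates suppresses the per-step fluctuations that would otherwise compound when the enhancer is applied to its own output.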

Experimental Setup

We evaluate our approach on multiple image classification and semantic segmentation benchmarks, achieving a consistent average improvement of around 5%.

In particular, we used the ImageNetC, ImageNetC-Bar and VizWiz datasets for the evaluation on corrupted image classification. We also tested the architecture on an additional benchmark we developed by adding multiple corruptions from ImageNetC to the same image, which we call ImageNetC-Mixed.
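The construction of ImageNetC-Mixed amounts to composing several corruption functions on the same image. As a hedged sketch, here are two simplified stand-in corruptions (additive Gaussian noise and contrast reduction) chained together; the real benchmark uses the actual ImageNetC corruption implementations and severities:

```python
import numpy as np

def gaussian_noise(img, severity=0.1, rng=None):
    """Additive Gaussian noise, a simplified stand-in for one ImageNetC corruption."""
    rng = rng or np.random.default_rng()
    return np.clip(img + rng.normal(scale=severity, size=img.shape), 0.0, 1.0)

def low_contrast(img, severity=0.5):
    """Contrast reduction: pull pixel values towards the mean intensity."""
    mean = img.mean()
    return np.clip(mean + severity * (img - mean), 0.0, 1.0)

def mix_corruptions(img, corruptions):
    """Apply several corruptions to the same image, in sequence."""
    for corrupt in corruptions:
        img = corrupt(img)
    return img

rng = np.random.default_rng(0)
img = np.random.rand(32, 32, 3)
mixed = mix_corruptions(img, [lambda x: gaussian_noise(x, 0.1, rng), low_contrast])
```

Stacking corruptions in this way yields degradations that no single-corruption benchmark covers, which is precisely the regime the mixed benchmark is meant to probe.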

For semantic segmentation we used the Cityscapes, ACDC and DarkZurich datasets, investigating a common task in domain adaptation: Clean-to-Adverse Weather.


Table 2 shows the main results of our system, confirming an average improvement of 5%. Note that we trained SyMPIE only once with RN50-V2 as the downstream classifier, and used it as-is in all other experiments.

Table 2. Results for the image classification task on the ImageNetC dataset (higher is better).

In Tables 3 and 4 we show the results attained on the ImageNetC-Mixed and VizWiz datasets; again, one can appreciate the improvement brought by our system.

Table 3. Accuracy on ImageNetC-Mixed with ResNet50.

Table 4. Results on the VizWiz dataset with ResNet50-V2.

To the best of our knowledge, VizWiz is the only real-world benchmark providing corrupted images. Remarkably, even though most corruptions in the dataset cannot be effectively modeled by our approach, SyMPIE improves performance even on the clean data. This suggests that the images considered clean by the designers of the dataset also suffer from corruptions, which are removed by our method.

To further investigate the effect of applying our system to images showing corruptions not modeled by the current implementation, we studied the performance on the ImageNetC-Bar dataset, which we show in Table 5.

Table 5. Quantitative results on the ImageNetC-Bar dataset with ResNet50.

The improvement here is similar to the one observed on the VizWiz dataset.

Finally, we report the semantic segmentation results in the following figure and table.

Table 6. Quantitative results for semantic segmentation using a DeepLabV2 [7] architecture with the ResNet50 backbone.

Figure 6. Quantitative results on the ACDC semantic segmentation benchmark.

SyMPIE improves accuracy by 4% even on this completely different task.


In this work, we introduced a small and efficient modular system to enhance the input images of multimedia systems. Our SyMPIE improves performance across a variety of scenarios and tasks at a fraction of the cost of competing approaches, thanks to its explicit parameter estimation and warping.

Useful Links

Link to paper arxiv:
Link to open-source code:


[1] F. Barbato, U. Michieli, M. K. Yucel, P. Zanuttigh and M. Ozay, "A Modular System for Enhanced Robustness of Multimedia Understanding Networks via Deep Parametric Estimation," Proceedings of ACM MMSys, 2024.
[2] M. Akbari, J. Liang, J. Han and C. Tu, "Generalized octave convolutions for learned multi-frequency image compression," arXiv preprint arXiv:2002.10032, 2020.
[3] P. Dhariwal and A. Nichol, "Diffusion models beat GANs on image synthesis," Advances in Neural Information Processing Systems, vol. 34, pp. 8780-8794, 2021.
[4] D. Hendrycks, A. Zou, M. Mazeika, L. Tang, B. Li, D. X. Song and J. Steinhardt, "PixMix: Dreamlike pictures comprehensively improve safety measures," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[5] X. Li and X. Guo, "SPN2D-GAN: Semantic Prior Based Night-to-Day Image-to-Image Translation," IEEE Transactions on Multimedia, 2022.
[6] A. Modas, R. Rade, G. Ortiz-Jiménez, S.-M. Moosavi-Dezfooli and P. Frossard, "PRIME: A few primitives can boost robustness to common corruptions," European Conference on Computer Vision, Springer, pp. 623-640, 2022.
[7] X. Nie, J. Jia, H. Ding and E. K. Wong, "GiGAN: Gate in GAN, could gate mechanism filter the features in image-to-image translation?," Neurocomputing, vol. 462, pp. 376-388, 2021.
[8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger and I. Sutskever, "Learning Transferable Visual Models From Natural Language Supervision," arXiv preprint arXiv:2103.00020, 2021.
[9] TorchVision maintainers and contributors, "TorchVision: PyTorch's Computer Vision library," GitHub repository, 2016.
[10] A. Ulhaq, N. Akhtar and G. Pogrebna, "Efficient Diffusion Models for Vision: A Survey," 2022.
[11] M. K. Yucel, R. G. Cinbis and P. Duygulu, "HybridAugment++: Unified frequency spectra perturbations for model robustness," Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5718-5728, 2023.
[12] H. Zhang, S. Liu, C. Wang, S. Lu and W. Xiong, "Color-patterned fabric defect detection algorithm based on triplet attention multi-scale U-shape denoising convolutional auto-encoder," The Journal of Supercomputing, pp. 1-26, 2023.
[13] H. Wang, S. Gui, H. Yang, J. Liu and Z. Wang, "GAN Slimming: All-in-one GAN compression by a unified optimization framework," European Conference on Computer Vision, Springer, pp. 54-73, 2020.
[14] R. Chen, W. Huang, B. Huang, F. Sun and B. Fang, "Reusing discriminators for encoding: Towards unsupervised image-to-image translation," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8168-8177, 2020.