The Conference on Computer Vision and Pattern Recognition (CVPR) is one of the most prestigious annual conferences in the field of computer vision and pattern recognition. It serves as a premier platform for researchers and practitioners to present cutting-edge advancements across a broad spectrum of topics, including object recognition, image segmentation, motion estimation, 3D reconstruction, and deep learning. As a cornerstone of the computer vision community, CVPR fosters innovation and collaboration, driving the field forward with groundbreaking research.

In this blog series, we delve into the latest research contributions featured in CVPR 2025, offering insights into the studies that are shaping the future of computer vision. Each post will explore the key findings, methodology, and implications of a selected paper, providing readers with a comprehensive understanding of the innovations driving the field forward. Stay tuned as we uncover the transformative potential of these advancements. The papers covered in this series are:

#1. FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion (AI Center - Cambridge)
#2. Augmenting Perceptual Super-Resolution via Image Quality Predictors (AI Center - Toronto)
#3. VladVA: Discriminative Fine-tuning of LVLMs (AI Center - Cambridge)
#4. Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation (Samsung R&D Institute United Kingdom)
#5. Edge-SD-SR: Low Latency and Parameter Efficient On-device Super-Resolution with Stable Diffusion via Bidirectional Conditioning (AI Center - Cambridge)
#6. DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching (Samsung R&D Institute United Kingdom)
In the rapidly evolving field of artificial intelligence, diffusion models have emerged as powerful tools for generating high-quality images. However, a significant hurdle has been their inability to produce images at resolutions higher than those they were trained on without encountering issues like repetitive patterns or distorted structures. Retraining these models for higher resolutions is often computationally prohibitive. This blog post delves into a novel solution, FAM Diffusion, a method designed to enable pre-existing diffusion models to generate superior high-resolution images at flexible test-time resolutions, all without requiring additional training. We will explore the background of this challenge, the innovative methodology behind FAM Diffusion, and its qualitative and quantitative results, discussing its significance in the landscape of image generation.
Diffusion models excel at creating images by progressively removing noise from an initial random signal, guided by a learned process. While effective at their native training resolutions (e.g., 512×512 or 1024×1024 pixels), directly attempting to generate images at, for instance, 4K resolution often leads to severe object repetition and unrealistic local patterns.
Previous attempts to overcome this limitation broadly fall into two camps: patch-based or multi-pass approaches that reuse a native-resolution generation for guidance but incur substantial latency, and single-pass approaches that keep latency low but often sacrifice global and local consistency.
Figure 1. Comparison at 3× higher resolution than native, i.e. 3072 × 3072 px, between our approach and prior state-of-the-art. Notice that prior work suffers from repetitive patterns and unnatural texture and structure.
FAM Diffusion aims to take the best of both worlds: a single-pass generation strategy for low latency, while effectively leveraging the native-resolution generation to guide the high-resolution process and, critically, addressing both global and local consistency issues that are often overlooked. See Figure 1 for a visualisation of some of these problems and the result of our approach, which largely alleviates these challenges.
FAM Diffusion introduces two simple yet effective modules - Frequency Modulation (FM) and Attention Modulation (AM) - that can be seamlessly integrated into any latent diffusion model without additional training or architectural changes. The process generally follows a test-time diffuse-denoise strategy: an image is first generated at the model's native resolution, then upsampled, noise is added, and finally, a denoising process generates the high-resolution output.
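To make this flow concrete, below is a minimal, illustrative sketch of a test-time diffuse-denoise pipeline in PyTorch. The function names (`toy_denoiser`, `diffuse`, `fam_style_pipeline`), the noise strength, and the toy denoiser itself are placeholders of our own, not the paper's implementation; a real pipeline would call a pretrained latent diffusion UNet and apply the FM and AM modules inside the high-resolution denoising loop.

```python
import torch
import torch.nn.functional as F

def toy_denoiser(latent, t):
    # Placeholder for one step of a pretrained latent diffusion UNet;
    # here it simply shrinks the signal as a stand-in.
    return latent * (1.0 - 0.05 * t)

def diffuse(latent, t, noise=None):
    # Forward diffusion: mix the latent with Gaussian noise at strength t in [0, 1].
    noise = torch.randn_like(latent) if noise is None else noise
    return (1.0 - t) * latent + t * noise

def fam_style_pipeline(native_res=64, scale=3, num_steps=10):
    # 1) Generate a latent at the model's native resolution (random stand-in here).
    z_native = torch.randn(1, 4, native_res, native_res)

    # 2) Upsample the native-resolution result to the target resolution.
    z_up = F.interpolate(z_native, scale_factor=scale, mode="bicubic")

    # 3) Diffuse: add noise to the upsampled latent.
    z_t = diffuse(z_up, t=0.7)

    # 4) Denoise at high resolution; FM/AM guidance would be applied inside this loop.
    for step in range(num_steps):
        t = 1.0 - step / num_steps
        z_t = toy_denoiser(z_t, t)
    return z_t

high_res = fam_style_pipeline()
print(high_res.shape)  # torch.Size([1, 4, 192, 192])
```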
When upscaling images, maintaining global structural integrity is paramount. Simply steering the high-resolution denoising process with an upsampled low-resolution image (as done by "skip residuals" in some prior work) can indiscriminately force low-resolution content onto the output, potentially harming high-frequency details.
The FM module uses the Fourier domain, where global structure is represented by low frequencies and finer details by high frequencies. During the high-resolution denoising stage, the FM module selectively conditions the low-frequency components of the image being generated using the low-frequency information from the diffused, upsampled native-resolution image. Crucially, it allows the denoising process full control over the high-frequency components, enabling the generation of new, sharp details.
In essence, high frequencies from the ongoing denoising are combined with low frequencies from the guidance image. In the time domain, this operation is equivalent to adding a low-frequency update to the denoised latent, directed towards the diffused latent, effectively giving the UNet a global receptive field without architectural changes.
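As a rough illustration of this idea, the sketch below blends the low-frequency band of a guidance latent with the high-frequency band of the current denoised latent using a 2-D FFT. The binary radial mask and the `cutoff` value are simplifying assumptions for illustration only; the paper's FM module defines its own frequency conditioning and operates inside the denoising loop.

```python
import torch

def frequency_modulate(denoised, guidance, cutoff=0.25):
    # Move both latents to the Fourier domain (2-D FFT over spatial dims).
    D = torch.fft.fftshift(torch.fft.fft2(denoised), dim=(-2, -1))
    G = torch.fft.fftshift(torch.fft.fft2(guidance), dim=(-2, -1))

    # Centered low-pass mask: 1 inside a radius proportional to `cutoff` (illustrative choice).
    h, w = denoised.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    low_pass = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).to(denoised.dtype)

    # Low frequencies come from the guidance (global structure),
    # high frequencies from the ongoing denoising (new sharp detail).
    fused = low_pass * G + (1.0 - low_pass) * D
    return torch.fft.ifft2(torch.fft.ifftshift(fused, dim=(-2, -1))).real

denoised = torch.randn(1, 4, 128, 128)   # current high-res denoised latent
guidance = torch.randn(1, 4, 128, 128)   # diffused, upsampled native-resolution latent
print(frequency_modulate(denoised, guidance).shape)
```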
While FM effectively handles global structure and prevents issues like object duplication, inconsistencies in local textures can still arise. For example, fur texture might incorrectly appear on a shirt collar, or facial features might be distorted compared to the native resolution image. We hypothesize this stems from incorrect attention maps during high-resolution denoising.
To address this, the AM module leverages attention maps from the denoising process at the native resolution to guide the attention maps at the high resolution. Since native resolution attention maps encode semantic relationships between image regions, they can regularize the high-resolution denoising towards generating consistent finer textures.
The self-attention mechanism in diffusion models computes an attention matrix M from query (Q), key (K), and value (V) projections of an input tensor. AM modifies the attention matrix of the high-resolution denoising process at specific UNet layers (primarily up-blocks known to preserve layout information). This helps preserve local structures and semantic consistency.
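The following sketch shows one simple way such attention modulation could look in code: standard scaled dot-product attention is computed at high resolution, the native-resolution attention map is resized to match the high-resolution token count, and the two are blended. The linear blend with weight `alpha` and the bilinear resizing are our own simplifications for illustration, not the exact mechanism in the paper.

```python
import torch
import torch.nn.functional as F

def modulated_attention(q_hr, k_hr, v_hr, attn_native, alpha=0.5):
    # Standard scaled dot-product attention at high resolution.
    d = q_hr.shape[-1]
    scores_hr = q_hr @ k_hr.transpose(-2, -1) / d ** 0.5
    attn_hr = scores_hr.softmax(dim=-1)

    # Resize the native-resolution attention map to the high-res token count
    # (bilinear over the token-token grid; a simplification of the paper's scheme).
    n_hr = attn_hr.shape[-1]
    attn_native_up = F.interpolate(
        attn_native.unsqueeze(1), size=(n_hr, n_hr), mode="bilinear"
    ).squeeze(1)
    attn_native_up = attn_native_up / attn_native_up.sum(dim=-1, keepdim=True)

    # Blend: the native-resolution attention regularizes the high-res attention.
    attn = (1.0 - alpha) * attn_hr + alpha * attn_native_up
    return attn @ v_hr

q = k = v = torch.randn(2, 1024, 64)                # high-res tokens (e.g. a 32x32 latent)
attn_native = torch.rand(2, 256, 256).softmax(-1)   # native tokens (e.g. a 16x16 latent)
print(modulated_attention(q, k, v, attn_native).shape)  # torch.Size([2, 1024, 64])
```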
Figure 2. Overview of FAM Diffusion. (a) We first generate an image at native resolution, followed by a test-time diffuse-denoise process. We incorporate our Frequency Modulation and Attention Modulation modules during high-res denoising to control global structure and fine local texture, respectively. (b) Details of the Frequency Modulation, where we use the Fourier domain to selectively condition low-frequency components during high-res denoising while leaving high-frequency components fully controllable. (c) Details of the Attention Modulation, where attention maps from the native-resolution denoising are used to correct the high-res denoising.
Across various scaling factors and base models (including SDXL and HiDiffusion), FAM Diffusion consistently achieved state-of-the-art image quality. Crucially, it delivers these gains with only negligible latency overhead compared to direct inference at the target resolution: for SDXL, approximately 0.2, 0.3, and 0.7 minutes extra at 2×, 3×, and 4× scales, respectively. This is in stark contrast to methods like DemoFusion, which add significantly more latency (e.g., 14.2 minutes more than direct SDXL inference at 4×). FAM Diffusion was also shown to enhance the performance of single-pass methods like HiDiffusion while maintaining fast generation.
Figure 3. Qualitative comparison with prior state-of-the-art approaches. Note that FAM Diffusion produces consistent and realistic structure, avoiding over-sharpening and spurious artifacts.
Previous methods often focused on global consistency, sometimes neglecting local texture artifacts or introducing significant computational cost. FAM Diffusion's FM module directly targets global consistency by leveraging the robust low-frequency information from the native-resolution image in the Fourier domain, while the AM module specifically addresses the often-overlooked problem of inconsistent local textures by using attention maps from the native resolution to guide the high-resolution generation.
Additionally, unlike methods requiring retraining or complex architectural modifications, FAM Diffusion is training-free and integrates seamlessly into existing latent diffusion models. Its single-pass nature and the computational efficiency of its modules result in negligible latency overhead, making it a practical solution for generating very high-resolution images. This contrasts sharply with patch-based methods that suffer from high latency due to redundant forward passes.
Generating high-resolution images with diffusion models without sacrificing quality or incurring prohibitive costs has been a significant challenge. FAM Diffusion presents a compelling, training-free solution by introducing two key innovations: Frequency Modulation (FM) for robust global structure consistency and Attention Modulation (AM) for coherent local texture patterns.
This method allows existing, powerful diffusion models to operate at flexible, much higher resolutions than they were trained for, producing images with superior qualitative and quantitative results compared to previous state-of-the-art techniques. Its ability to achieve this with negligible latency overhead and without any need for retraining or architectural modifications makes FAM Diffusion a highly practical and impactful advancement in the field of generative AI. It sets a new standard for high-resolution image generation, opening up new possibilities for applications requiring large-scale, detailed visual content.
https://arxiv.org/pdf/2411.18552
[1] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. MultiDiffusion: Fusing diffusion paths for controlled image generation. In International Conference on Machine Learning, 2023.
[2] Mikołaj Bińkowski, Danica J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.
[3] Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. DemoFusion: Democratising high-resolution image generation with no $$$. In IEEE Conference on Computer Vision and Pattern Recognition, 2024.
[4] Jing Gu, Yilin Wang, Nanxuan Zhao, Tsu-Jui Fu, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, and Xin Eric Wang. Photoswap: Personalized subject swapping in images. In Neural Information Processing Systems, 2023.
[5] Jing Gu, Nanxuan Zhao, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, Yilin Wang, and Xin Eric Wang. SwapAnything: Enabling arbitrary object swapping in personalized image editing. In European Conference on Computer Vision, 2024.
[6] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022.
[7] Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. ScaleCrafter: Tuning-free higher-resolution visual generation with diffusion models. In International Conference on Learning Representations, 2024.
[8] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Neural Information Processing Systems, 2017.
[10] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Neural Information Processing Systems, 2020.
[11] Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. FouriScale: A frequency perspective on training-free high-resolution image synthesis. In European Conference on Computer Vision, 2024a.
[12] Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. FouriScale: A frequency perspective on training-free high-resolution image synthesis. arXiv preprint arXiv:2403.12963, 2024b.
[13] Jaeseok Jeong, Junho Kim, Yunjey Choi, Gayoung Lee, and Youngjung Uh. Visual style prompting with swapping self-attention. arXiv preprint arXiv:2402.12974, 2024.
[14] Yuseung Lee, Kunho Kim, Hyunjin Kim, and Minhyuk Sung. SyncDiffusion: Coherent montage via synchronized joint diffusions. In Neural Information Processing Systems, 2023.
[15] Zhihang Lin, Mingbao Lin, Zhao Meng, and Rongrong Ji. AccDiffusion: An accurate method for higher-resolution image generation. In European Conference on Computer Vision, 2024.
[16] Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Peng Li, Yan Li, Chi-Min Chan, Qifeng Chen, Wei Xue, Wenhan Luo, Qifeng Liu, and Yike Guo. HiPrompt: Tuning-free higher-resolution generation with hierarchical MLLM prompts. arXiv preprint arXiv:2409.02919, 2024.
[17] David Marr and Ellen Hildreth. Theory of edge detection. Proceedings of the Royal Society of London. Series B. Biological Sciences, 207(1167):187–217, 1980.
[18] Mehdi Noroozi, Isma Hadji, Brais Martinez, Adrian Bulat, and Georgios Tzimiropoulos. You only need one step: Fast super-resolution with stable diffusion via scale distillation. In European Conference on Computer Vision, 2024.
[19] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In International Conference on Learning Representations, 2024.
[20] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
[21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[22] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
[23] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In IEEE Conference on Computer Vision and Pattern Recognition, 2023.
[24] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W. Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R. Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In Neural Information Processing Systems - Datasets and Benchmarks Track, 2022.
[25] Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, and Yinqiang Zheng. ResMaster: Mastering high-resolution image generation via structural and fine-grained guidance. arXiv preprint arXiv:2406.16476, 2024.
[26] Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. FreeU: Free lunch in diffusion U-Net. In IEEE Conference on Computer Vision and Pattern Recognition, 2024.
[27] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
[28] B. A. Wandell. Foundations of Vision, 1995.
[29] Wenqing Wang, Haosen Yang, Josef Kittler, and Xiatian Zhu. Single image, any face: Generalisable 3D face generation. arXiv preprint arXiv:2409.16990, 2024.
[30] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. In IEEE International Conference on Computer Vision, 2023.
[31] Kai Xu, Minghai Qin, Fei Sun, Yuhao Wang, Yen-Kuang Chen, and Fengbo Ren. Learning in the frequency domain. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[32] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In IEEE International Conference on Computer Vision, 2021.
[33] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In IEEE International Conference on Computer Vision, 2023.
[34] Shen Zhang, Zhaowei Chen, Zhenyu Zhao, Yuhao Chen, Yao Tang, and Jiajun Liang. HiDiffusion: Unlocking higher-resolution creativity and efficiency in pretrained diffusion models. In European Conference on Computer Vision, 2024.