Publications

Hearable Image: On-Device Image-Driven Sound Effect Generation for Hearing What You See

Published

International Conference on Information and Knowledge Management (CIKM)

Date

2025.11.10

Research Areas

Abstract

There have been various studies on audio generation from images, text, or video. However, existing approaches have not considered the on-device environment, because audio generation models are computationally expensive and require substantial storage to hold a large number of weights. In addition, it is difficult to obtain stable generation outputs, because unexpected results may occur depending on the model inputs. In image-to-audio generation, smartphones contain diverse images, and image features carry many visual contexts, so it is sometimes unpredictable which audio categories will be generated from an image. In this paper, we propose a robust on-device sound effect generation framework for image-to-audio generation based on latent diffusion. First, to avoid unstable and unpredictable audio generation results, we propose a stable sound generation framework with an Audio Feature Dictionary and an Audio-Image Matching Pipeline that generates sound effects from predefined sound effect categories. If an image matches sound effect categories, the proposed framework directly generates sound effects from the audio features corresponding to the matched categories. Second, we propose Multi-Category Generation and a Generation Flow Map to generate robust and diverse sound effects depending on the audio categories. Using global and local features of an image, we can select multiple categories of sound effects. Third, the framework can be deployed on smartphones because we train the proposed model with low computational cost and a small number of model weights under 4-step latent diffusion inference. Extensive experiments show that the proposed framework solves the on-device sound generation problem while maintaining generation quality and audio-image matching performance comparable to large-scale models.
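The matching-then-retrieval idea in the abstract can be sketched roughly as follows: embed an image, score it against precomputed embeddings of the predefined sound-effect categories, and select every category whose similarity clears a threshold (multi-category selection), then look up the corresponding entries in the Audio Feature Dictionary. This is a minimal illustrative sketch, not the paper's implementation; the embedding dimension, threshold, category names, and the use of cosine similarity are all assumptions.

```python
import numpy as np

# Illustrative setup: random vectors stand in for learned embeddings.
rng = np.random.default_rng(0)
EMBED_DIM = 8  # assumed embedding size, for demonstration only

# Hypothetical Audio Feature Dictionary: one precomputed audio feature
# vector per predefined sound-effect category.
audio_feature_dict = {
    "rain": rng.normal(size=EMBED_DIM),
    "ocean": rng.normal(size=EMBED_DIM),
    "birds": rng.normal(size=EMBED_DIM),
}

# Category embeddings used for image-audio matching (here simply the
# normalized dictionary entries; in practice these could differ).
category_embeds = {
    name: vec / np.linalg.norm(vec) for name, vec in audio_feature_dict.items()
}

def match_categories(image_embed, threshold=0.2):
    """Return categories whose cosine similarity with the image embedding
    exceeds the threshold, best match first (multi-category selection)."""
    image_embed = image_embed / np.linalg.norm(image_embed)
    scores = {name: float(image_embed @ emb) for name, emb in category_embeds.items()}
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [name for name, score in ranked if score > threshold]

# Fake "image embedding" placed close to the "rain" category.
image_embed = category_embeds["rain"] + 0.1 * rng.normal(size=EMBED_DIM)
matched = match_categories(image_embed)

# Retrieve audio features for the matched categories; these would then
# condition the (4-step) latent diffusion sound generator.
selected_features = [audio_feature_dict[name] for name in matched]
print(matched)  # "rain" ranks first
```

When no category clears the threshold, the list is empty, which is where a fallback path (or declining to generate) would keep the output predictable, in line with the stability goal stated above.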