SRC-N Wins 1st and 3rd Places at DCASE2024 Challenge

Samsung R&D Institute China-Nanjing (SRC-N) has once again claimed the top ranks in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge.

DCASE, a premier competition in the acoustics field organized by the IEEE Audio and Acoustic Signal Processing (AASP) Technical Committee, attracts hundreds of teams from industry and academia each year. Building on the success of previous editions, this year’s Challenge expanded to 10 tasks. The evaluation continues to support the development of computational scene and event analysis methods by comparing different approaches on common public datasets, driving further improvements in state-of-the-art performance.

Following their success in last year’s DCASE Challenge (second place in Task 4B and third place in Task 4A), SRC-N developers took on new challenges this year. They achieved first place in Task 7 – Sound Scene Synthesis and third place in Task 8 – Language-Based Audio Retrieval, further strengthening the research center’s technical expertise in AI-based acoustic signal processing.


Sound Scene Synthesis involves generating environmental sounds from given textual descriptions. This year’s task expands the scope from last year’s Foley sounds to more general sound scenes, aiming to add further controllability through natural language prompts. While most current text-to-audio (TTA) solutions, typically diffusion-based, focus on training models on datasets of audio–caption pairs, our submitted system created a dataset containing both positive and negative samples and fine-tuned existing TTA models on it. Our proposed systems achieved top rankings with a Fréchet Audio Distance (FAD, computed with PANNs embeddings) of 5.985 and a perceptual evaluation score of 5.832.
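For readers unfamiliar with the FAD(PANNs) metric cited above, the sketch below shows how a Fréchet Audio Distance is typically computed: Gaussians are fitted to embedding sets of reference and generated audio, and the Fréchet distance between them is evaluated. The embeddings here are random stand-ins; in practice they would come from a PANNs audio encoder. This is an illustrative sketch, not the official challenge evaluation code.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_ref, emb_gen: arrays of shape (n_clips, emb_dim), e.g. PANNs
    embeddings of reference and generated audio clips.
    """
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Usage with random stand-in embeddings (real ones would come from a PANNs model):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(200, 128))
    gen = rng.normal(loc=0.1, size=(200, 128))
    print(f"FAD: {frechet_audio_distance(ref, gen):.3f}")
```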


Meanwhile, Language-Based Audio Retrieval entails retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). The task takes human-written text queries as input and ranks the audio files in a given dataset by how well they match each query. Most current language-based audio retrieval solutions focus on training effective audio and text encoders to obtain expressive audio and text representations. In contrast, our submitted system uses audio and text encoders that were jointly pre-trained on VAST-27M, a large multi-modal dataset covering vision, audio, and subtitles. We further used several external audio captioning datasets (AudioCaps, WavCaps, FSD50K, LAION-630k, and ClothoV2), adopted three multi-modal objectives for training the system, and used mix-up as the data augmentation policy. With model ensembling, our proposed systems achieved 0.406 mAP@10 and 0.278 R@1 on the ClothoV2 evaluation set.
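To make the reported mAP@10 and R@1 figures concrete, the sketch below computes both metrics from a query–audio similarity matrix, assuming each text query has exactly one relevant audio clip (the usual ClothoV2 retrieval setup). It is a minimal illustration with random scores, not the system or evaluation code used in the challenge.

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray, relevant: np.ndarray, k: int = 10):
    """Compute R@1 and mAP@k for text-to-audio retrieval.

    sim:      (n_queries, n_audio) similarity matrix, e.g. cosine similarity
              between text-query embeddings and audio embeddings.
    relevant: (n_queries,) index of the single ground-truth audio per query.
    """
    # Rank of the ground-truth audio for each query (0 = top of the list).
    order = np.argsort(-sim, axis=1)
    ranks = np.argmax(order == relevant[:, None], axis=1)

    r_at_1 = float(np.mean(ranks == 0))
    # With one relevant item, AP@k reduces to 1/(rank+1) when it appears in the top k.
    ap = np.where(ranks < k, 1.0 / (ranks + 1), 0.0)
    return r_at_1, float(ap.mean())

# Usage with random scores standing in for real model similarities:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.normal(size=(100, 50))
    gt = rng.integers(0, 50, size=100)
    r1, map10 = retrieval_metrics(sim, gt)
    print(f"R@1 = {r1:.3f}, mAP@10 = {map10:.3f}")
```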


Acoustic signal processing methods extract the information carried in sound and can be applied across intelligent scenarios, aligning closely with the strategy of SRC-N’s AI Software Team. Our team aims to research top-level AI technologies, apply them on devices to drive product innovation, and create new value. In the era of generative AI, we believe cutting-edge technology will create entirely new user experiences in immersive audio and multimedia editing. As such, we will seek opportunities to adopt these award-winning solutions and contribute essential functions to new AI experiences and next-generation products.