RandMasking Augment: A Simple and Randomized Data Augmentation for Acoustic Scene Classification

By Jubum Han Samsung Research
By Mateusz Matuszewski Samsung R&D Institute Poland
By Olaf Sikorski Samsung R&D Institute Poland
By Hosang Sung Samsung Research
By Hoonyoung Cho Samsung Research


Sound recognition aims to enable intelligent systems to understand the acoustic characteristics of a target sound or the surrounding environment based on acoustic features. Samsung also provides many sound recognition services through various devices. The Detection and Classification of Acoustic Scenes and Events (DCASE) challenges [1-2] have been held to stimulate research in the field of sound recognition, and deep learning has become the standard approach in many solutions [3-4] to acoustic scene classification tasks. Because the task is to classify an audio environment into one of a set of predefined classes, it is important that the network understands the overall context of the audio. Various augmentation methods that improve the generalization and performance of neural networks, which were first introduced in image classification tasks, are also widely used in acoustic scene classification tasks. Recent competitions have also demonstrated the importance of data augmentation methods [5-7] as an indispensable strategy for improving model performance.

Most augmentation approaches in acoustic scene classification tasks treat a spectrogram as if it were an image, which allows methods developed for image classification to be applied. SpecAugment [8] was introduced for automatic speech recognition tasks and has been applied successfully to acoustic scene classification tasks [5-6]. SpecAugment proposed time warping, frequency masking, and time masking. Masking is similar to random erasing in image classification tasks, and it works like dropout at the feature level [9]. It is important to preserve undistorted time and frequency information in acoustic scene classification tasks, so only time and frequency masking were considered in [5-6], without warping or shifting. Because acoustic features are usually normalized to have zero mean, a masking value of zero is commonly used.
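The time and frequency masking described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the SpecAugment reference implementation: the function name `mask_time_freq`, the maximum mask widths, and the single-mask-per-axis policy are assumptions made for the example. The spectrogram is assumed to have shape (frequency bins, time frames) and to be normalized to zero mean, so that zero is a neutral masking value.

```python
import numpy as np


def mask_time_freq(spec, max_freq_width=8, max_time_width=20, rng=None):
    """Zero out one random frequency band and one random time span.

    spec: 2-D array of shape (n_freq, n_time), assumed zero-mean normalized.
    The width limits are illustrative values, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()  # leave the input untouched
    n_freq, n_time = out.shape

    # Frequency masking: a horizontal band of f consecutive bins.
    f = int(rng.integers(0, min(max_freq_width, n_freq) + 1))
    f0 = int(rng.integers(0, n_freq - f + 1))
    out[f0:f0 + f, :] = 0.0

    # Time masking: a vertical stripe of t consecutive frames.
    t = int(rng.integers(0, min(max_time_width, n_time) + 1))
    t0 = int(rng.integers(0, n_time - t + 1))
    out[:, t0:t0 + t] = 0.0
    return out


if __name__ == "__main__":
    demo = np.random.default_rng(0).standard_normal((40, 100))
    masked = mask_time_freq(demo, rng=np.random.default_rng(1))
    print(masked.shape, int((masked == 0.0).sum()), "masked cells")
```

Both masks are axis-aligned rectangles, which is the limitation that the proposed method addresses.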

More augmentation methods for image classification were introduced in RandAugment [10]. By applying random, equally probable transformations to an image, RandAugment can generate a wide variety of augmented images and shows improved performance in image classification tasks. Despite these advantages, many of the transforms introduced in RandAugment are difficult to apply in acoustic scene classification because they distort the time and frequency information of acoustic features.
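The idea of applying random, equally probable transformations can be illustrated with a short sketch. It is only a toy: the helper names `sample_ops` and `rand_apply`, the number of operations, and the stand-in transforms in `OPS` are assumptions for the example and do not correspond to the actual RandAugment operation set.

```python
import random


def sample_ops(ops, n_ops, rng):
    """Draw n_ops transforms uniformly at random, with replacement."""
    return [rng.choice(ops) for _ in range(n_ops)]


def rand_apply(x, ops, n_ops=2, rng=None):
    """Apply n_ops randomly chosen transforms to x, one after another."""
    rng = random.Random() if rng is None else rng
    for op in sample_ops(ops, n_ops, rng):
        x = op(x)
    return x


# Toy stand-ins for real transforms (hypothetical, for illustration only).
OPS = [
    lambda v: list(v),                # identity
    lambda v: [2.0 * e for e in v],   # scale values
    lambda v: list(reversed(v)),      # flip order
]

if __name__ == "__main__":
    print(rand_apply([1.0, 2.0, 3.0], OPS, n_ops=2, rng=random.Random(0)))
```

Because every transform is equally likely on each draw, the number of distinct augmentation pipelines grows quickly with the number of operations, which is what makes the augmented samples diverse.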

We propose an advanced data augmentation approach named RandMasking Augment, inspired by SpecAugment and RandAugment. A random transformation is applied to the masking region rather than to the acoustic features themselves. Previous studies [5-8] are limited in that they apply only simple rectangular masking. In contrast, our method can generate more diverse masking regions, which makes the network learn a wider variety of acoustic features. Additionally, because the deformation is applied only to the masking region, the time and frequency information of the acoustic features is preserved. The method is straightforward and improves neural network performance and robustness. Concepts from Mixup and FilterAugment can be combined within a masking region, which gives additional performance improvements. Through extensive evaluation of RandMasking Augment on the DCASE 2018 Task 1A and DCASE 2019 Task 1A datasets with various convolutional neural network (CNN) architectures, we demonstrate that it outperforms other augmentation methods.

RandMasking Augment

The primary goal of RandMasking Augment is to generate various types of augmented spectrograms with time and frequency domain masking. To improve neural network performance and generalization, various masking transformations are applied while maintaining the unique characteristics of the audio features. Several transformations [10] have been introduced in the context of image classification; however, we are only interested in those that can preserve the time and frequency information of a spectrogram:

∙ Rotate
∙ Shear x & y
∙ Translate
∙ Value

Here, Rotate transforms the masking region by a rotation matrix, Shear x & y transform it by the shear-x and shear-y matrices, Translate moves the masking region by a translation matrix, and Value multiplies the data in the masking region by a random value. The x and y axes denote time and frequency, respectively. Figure 1 shows examples of these augmentations. SpecAugment considers only basic time and frequency domain masking, whereas RandMasking Augment can generate a wider variety of mask shapes. The main difference between the Rotate transform and the Shear-x transform is the center of rotation: for Rotate, it is the center of the masking region, whereas for Shear-x it can be anywhere on the spectrogram. Transforms (b) to (f) in Figure 1 have a zero-valued masking region whose shape is changed by the transformation. Figure 1 (g) shows an example of the Value transform, where the data within the masking region is randomly attenuated.

Figure 1.  Examples of audio features augmented by RandMasking Augment. In (b) to (g), only frequency domain examples are shown to illustrate the augmented masking concept clearly; time domain masking, or both, can also be applied. Depending on the probability of each transformation, some augmented features are generated by applying multiple masking transforms, as in (h).
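The mask transformations listed above can be sketched by warping a binary mask instead of the spectrogram itself. The code below is a minimal illustration under stated assumptions, not the implementation from the paper: the helper names (`band_mask`, `warp_mask`, `rotation`, `shear`, `apply_zero_mask`, `apply_value_mask`), the matrix conventions, the nearest-neighbour resampling, and all parameter values are hypothetical. The x axis is time (columns) and the y axis is frequency (rows).

```python
import numpy as np


def band_mask(shape, f0, width):
    """Boolean mask covering `width` frequency bins starting at bin f0."""
    mask = np.zeros(shape, dtype=bool)
    mask[f0:f0 + width, :] = True
    return mask


def rotation(deg):
    """2x2 rotation matrix acting on (x, y) = (time, frequency)."""
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])


def shear(sx=0.0, sy=0.0):
    """2x2 shear matrix; keep sx * sy != 1 so that it stays invertible."""
    return np.array([[1.0, sx], [sy, 1.0]])


def warp_mask(mask, matrix, center=(0.0, 0.0), shift=(0.0, 0.0)):
    """Warp a mask by p' = M (p - c) + c + t using inverse mapping."""
    n_freq, n_time = mask.shape
    y, x = np.mgrid[0:n_freq, 0:n_time]
    (cx, cy), (tx, ty) = center, shift
    inv = np.linalg.inv(matrix)
    # For every output cell, look up the source cell it came from.
    xs = inv[0, 0] * (x - cx - tx) + inv[0, 1] * (y - cy - ty) + cx
    ys = inv[1, 0] * (x - cx - tx) + inv[1, 1] * (y - cy - ty) + cy
    xi, yi = np.rint(xs).astype(int), np.rint(ys).astype(int)
    inside = (xi >= 0) & (xi < n_time) & (yi >= 0) & (yi < n_freq)
    out = np.zeros_like(mask)
    out[inside] = mask[yi[inside], xi[inside]]
    return out


def apply_zero_mask(spec, mask):
    """Zero-valued masking: only cells inside the mask are changed."""
    return np.where(mask, 0.0, spec)


def apply_value_mask(spec, mask, scale):
    """Value transform: attenuate (rather than erase) the masked cells."""
    return np.where(mask, spec * scale, spec)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spec = rng.standard_normal((40, 60))
    mask = band_mask(spec.shape, f0=10, width=4)
    center = ((spec.shape[1] - 1) / 2, 10 + (4 - 1) / 2)  # mask center
    rotated = warp_mask(mask, rotation(15.0), center=center)
    moved = warp_mask(mask, np.eye(2), shift=(0, 5))       # 5 bins up in y
    slanted = warp_mask(mask, shear(sy=0.1))                # origin-anchored
    attenuated = apply_value_mask(spec, mask, scale=rng.uniform(0.0, 1.0))
    print(rotated.sum(), moved.sum(), slanted.sum(), attenuated.shape)
```

Since only the mask is resampled, every spectrogram cell outside the warped mask keeps its original value and position, which is how the time and frequency information is preserved.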

SpecMix [5] and SpecAugment++ [6] both incorporate Mixup but differ in how they combine source and target samples. In SpecMix, the masked region of the source is cut out and replaced with a patch from the target spectrogram, whereas SpecAugment++ uses the mean values of the two samples. They also differ in how they adapt labels: SpecAugment++ does not change the label of the source sample, but SpecMix changes it according to the ratio of the mixed area. Mixup, CutMix and Label Smoothing [11] show that smoothed labels can overcome overfitting issues and improve model generalization compared with one-hot encoded labels.
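The two mixing strategies and their label handling can be contrasted in a small sketch. This is a simplification for illustration only: the function names `cut_and_paste` and `average_in_mask` are hypothetical, both functions operate directly on input spectrograms, and the published methods contain details that are not reproduced here.

```python
import numpy as np


def cut_and_paste(src, tgt, mask, y_src, y_tgt):
    """SpecMix-style sketch: paste the target into the masked region.

    The label is mixed according to the fraction of the area that
    comes from each sample.
    """
    mixed = np.where(mask, tgt, src)
    keep = 1.0 - mask.mean()  # fraction of cells still from the source
    label = keep * y_src + (1.0 - keep) * y_tgt
    return mixed, label


def average_in_mask(src, tgt, mask, y_src):
    """SpecAugment++-style sketch: average both samples inside the mask.

    The source label is returned unchanged.
    """
    mixed = np.where(mask, 0.5 * (src + tgt), src)
    return mixed, y_src


if __name__ == "__main__":
    src = np.zeros((4, 8))
    tgt = np.ones((4, 8))
    mask = np.zeros((4, 8), dtype=bool)
    mask[0, :] = True  # one of four frequency rows, i.e. 25% of the area
    y_src = np.array([1.0, 0.0, 0.0])
    y_tgt = np.array([0.0, 1.0, 0.0])
    _, soft_label = cut_and_paste(src, tgt, mask, y_src, y_tgt)
    _, hard_label = average_in_mask(src, tgt, mask, y_src)
    print(soft_label, hard_label)
```

With a quarter of the area replaced, the area-ratio rule assigns weights of 0.75 and 0.25 to the source and target classes, which yields a smoothed label rather than a one-hot label.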


We use the development datasets of DCASE 2018 Task 1A and DCASE 2019 Task 1A, which are commonly used as benchmark datasets for acoustic scene classification systems. These datasets contain real audio recorded by a single device in 10 scenes, which consist of indoor, outdoor, and transportation-related scenes. The DCASE 2018 Task 1A dataset was recorded in six large European cities and has a total duration of 24 hours, while the DCASE 2019 Task 1A dataset extends this to 40 hours of audio. In both datasets, about 70% of the data is used for training and about 30% for testing, and every recording is split into 10-second segments.


Table 1 shows classification results for the proposed method and other augmentation approaches. We observe that RandMasking Augment outperforms the other augmentation approaches on both datasets. Since the DCASE datasets consist of a relatively small number of samples, overfitting and poor generalization can occur. By generating a wide variety of audio features, RandMasking Augment can mitigate this problem. The best performance is achieved by combining Mixup and FilterAugment with RandMasking Augment, and adding only Mixup also improves performance. This improvement arises because the combined augmentation not only further expands the diversity of samples that the network is exposed to, but also generates spectrograms that more closely resemble those produced from real-world samples.

Table 1.  Comparison of classification results on the DCASE 2018 Task 1A and the DCASE 2019 Task 1A datasets for different data augmentation methods applied to various CNN architectures.

Figure 2 shows box plots of the results for SpecAugment, each of our random masking transformations, and RandMasking Augment. By combining several masking transforms, RandMasking Augment can generate a wide variety of augmented audio features that maintain the characteristics of the original audio samples, and it shows the best performance in terms of both the median and the quartile range. Both SpecAugment and the proposed method induce the network to learn features outside the masked regions; however, unlike SpecAugment, which applies zero-value masking, the proposed method does not entirely occlude the masking region. This avoids situations in which key features are essentially removed from the spectrogram, which would create label noise from the perspective of the model.

Figure 2.  Box plots of the results for SpecAugment and each random transformation of RandMasking Augment, where • is the average value, – is the median value, ⋯ is the line connecting the average values, and RandM++ is the integrated augmentation with Mixup and FilterAugment. The results of all networks for each method are included in each box plot.

Our hypothesis was that the acoustic characteristics of the 10-second-long samples in the DCASE datasets are maintained over their whole length. We expected that incorporating Mixup and FilterAugment would have a regularizing effect on training, and several pieces of evidence support this. First, models trained with the extended RandMasking Augment show improvement across architectures on both datasets. Second, the improvement that the extension yields is higher for the larger and more diverse DCASE 2019 dataset. It seems that our set of augmentations allows the model to better utilize the extra data and improve generalization. Finally, the extended RandMasking Augment consistently outperforms other state-of-the-art augmentation approaches on all CNN architectures.


In this post, we described a simple and randomized data augmentation method for acoustic scene classification. The method applies random transformations to masking regions and generates a wide variety of audio features that maintain their acoustic characteristics. It can be extended by mixing other audio samples into the augmented masking region, applying different weights to frequency bands, and using mixed labels. Compared with other augmentation approaches on the DCASE datasets, the proposed method shows improved performance across various popular CNN architectures.

Link to the paper

JuBum Han, Mateusz Matuszewski, Olaf Sikorski, Hosang Sung, Hoonyoung Cho. Randmasking Augment: A Simple and Randomized Data Augmentation For Acoustic Scene Classification. ICASSP. 2023.

© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.


[1] A. Mesaros, T. Heittola, and T. Virtanen, “A multidevice dataset for urban acoustic scene classification,” in Proc. DCASE Workshop, pp. 9-13, November 2018

[2] A. Mesaros, T. Heittola, and T. Virtanen, “Acoustic scene classification in DCASE 2019 challenge: Closed and open set classification and data mismatch setups,” in Proc. DCASE Workshop, pp. 164-168, October 2019

[3] Y. Han and J. Park, “Convolutional neural networks with binaural representations and background subtraction for acoustic scene classification,” in Proc. DCASE Workshop, pp. 46-50, November 2017

[4] K. Koutini, H. Eghbal-zadeh, and G. Widmer, “Receptive-field-regularized CNN variants for acoustic scene classification,” in Proc. DCASE Workshop, pp. 124-128, October 2019

[5] G. Kim, D. K. Han, and H. Ko, “SpecMix: A mixed sample data augmentation method for training with time frequency domain features,” in Proc. Interspeech, pp. 546-550, Aug./Sep. 2021

[6] H. Wang, Y. Zou, and W. Wang, “SpecAugment++: A hidden space data augmentation method for acoustic scene classification,” in Proc. Interspeech, pp. 551-555, Aug./Sep. 2021

[7] H. Nam, S.-H. Kim, and Y.-H. Park, “FilterAugment: An acoustic environmental data augmentation method,” in Proc. ICASSP, pp. 4308-4312, May 2022

[8] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, pp. 2613-2617, September 2019

[9] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random Erasing Data Augmentation,” in Proc. AAAI, pp. 13001-13008, Feb. 2020

[10] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “RandAugment: Practical automated data augmentation with a reduced search space,” in Proc. NIPS, pp. 18613-18624, December 2020

[11] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. CVPR, pp. 2818-2826, June 2016

[12] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, May 2015

[13] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li, “Bag of tricks for image classification with convolutional neural networks,” in Proc. CVPR, pp. 558-567, June 2019