Publications

A comparison of streaming models and data augmentation methods for robust speech recognition

Published

IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)

Date

2021.12.17

Abstract

We explore three recently proposed data augmentation techniques: multi-condition training using simulated room acoustics and noise, vocal tract length perturbation (VTLP) for speaker variability, and SpecAugment. Experimental results show that unidirectional models are in general more sensitive to the number of noisy samples in the training set, and that the amount of data augmentation affects the final performance of the models. Monotonic Chunkwise Attention (MoChA) models perform better than Recurrent Neural Network Transducer (RNN-T) models under clean conditions, showing that attention-based models have the potential to outperform RNN-T models, which do not use attention. However, MoChA models are more sensitive to training hyperparameters and to the nature of the training data, and it is harder to incorporate additional techniques to improve their performance. On the other hand, RNN-T models perform better in terms of latency, inference time, and ease of training, and are more robust under noisy and far-field conditions, making them the better choice for streaming on-device speech recognition.
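To give a concrete picture of one of the augmentation techniques above, the sketch below applies SpecAugment-style frequency and time masking to a log-mel spectrogram. This is an illustrative approximation, not the paper's implementation: the function name, mask counts, and mask widths are assumptions, and the time-warping step of full SpecAugment is omitted.

```python
import numpy as np

def spec_augment(log_mel, num_freq_masks=2, freq_mask_width=27,
                 num_time_masks=2, time_mask_width=100, rng=None):
    """Mask random frequency bands and time spans of a log-mel
    spectrogram of shape (time, mel_bins). Masked regions are set
    to zero. Parameter defaults are illustrative assumptions."""
    rng = rng or np.random.default_rng()
    out = log_mel.copy()
    num_frames, num_bins = out.shape

    # Frequency masking: zero out a few random bands of mel bins.
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, freq_mask_width + 1))
        start = int(rng.integers(0, max(num_bins - width, 1)))
        out[:, start:start + width] = 0.0

    # Time masking: zero out a few random spans of frames.
    for _ in range(num_time_masks):
        width = int(rng.integers(0, min(time_mask_width, num_frames) + 1))
        start = int(rng.integers(0, max(num_frames - width, 1)))
        out[start:start + width, :] = 0.0

    return out

# Example: augment a random 1000-frame, 80-bin spectrogram.
augmented = spec_augment(np.random.randn(1000, 80))
```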