
Consistency Based Unsupervised ASR Personalisation

By Jisi Zhang, Samsung R&D Institute United Kingdom
By Vandana Rajan, Samsung R&D Institute United Kingdom
By Haaris Mehmood, Samsung R&D Institute United Kingdom
By David Tuckey, Samsung R&D Institute United Kingdom
By Pablo Peso Parada, Samsung R&D Institute United Kingdom
By Md Asif Jalal, Samsung R&D Institute United Kingdom
By Karthikeyan Saravanan, Samsung R&D Institute United Kingdom

Introduction


Voice assistants are widely used to control smartphones. An automatic speech recognition (ASR) model plays a crucial role in a voice assistant system: it recognises the user's voice command, which is then passed to downstream tasks such as spoken language understanding and speech translation.

On-device ASR models trained on the speech of a large population may underperform for users unseen during training. This is due to a domain shift between user data and the original training data, caused by differences in speaking characteristics and environmental acoustic conditions. ASR personalisation aims to exploit user data to improve model robustness. The majority of ASR personalisation methods assume labelled user data for supervision, which requires users to provide or revise transcripts.

In this blog, we present our work [1] on unsupervised ASR personalisation: a novel consistency based training method built on pseudo-labelling, which has been accepted at IEEE ASRU 2023.

Method: Consistency based unsupervised ASR personalisation


A Consistency Constraint (CC) [2,3] forces a model to predict the same result on differently perturbed versions of the same input, and has been shown to be effective for exploiting unlabelled data. Applying varied perturbations introduces randomisation that regularises the model, leading to more stable generalisation [2].
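To make the idea concrete, below is a minimal sketch of a generic consistency constraint in the spirit of FixMatch [3], not our ASR-specific recipe; the model and the weak_augment/strong_augment transforms are hypothetical placeholders.

    import torch
    import torch.nn.functional as F

    def consistency_loss(model, x, weak_augment, strong_augment, threshold=0.95):
        # Pseudo-label from a weakly perturbed view (no gradient).
        with torch.no_grad():
            probs = F.softmax(model(weak_augment(x)), dim=-1)
            conf, pseudo_label = probs.max(dim=-1)
            mask = conf >= threshold  # keep only confident predictions
        # Train the model to predict the same label on a strongly
        # perturbed view of the same input.
        logits = model(strong_augment(x))
        loss = F.cross_entropy(logits, pseudo_label, reduction="none")
        return (loss * mask.float()).mean()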

We exploit CC to improve the robustness of the training process for unsupervised ASR personalisation. A common pipeline for unsupervised self-training consists of data filtering, pseudo-labelling, and training [4]. We introduce CC into this pipeline by applying perturbations to both pseudo-labelling and training, forcing the model to output consistent labels in the vicinity of each training sample. The personalisation pipeline based on the CC training method is shown in Figure 1.

Figure 1 : Unsupervised personalisation pipeline based on data filtering and consistency constraint

The full method is further described in Algorithm 1.

First, data filtering (DataFilter) is applied to the entire unlabelled set χ to obtain the filtered set χ̂.

Second, the model is trained on the filtered set over N rounds of pseudo-labelling and training. SpecAugment [5] is applied during both pseudo-labelling and training. In each round, the model f is trained for M epochs on the paired audio samples and pseudo-labels D̂.
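A minimal sketch of this loop, under the assumption that Algorithm 1 follows the structure described above, is given below; data_filter, spec_augment, transcribe and train_one_epoch are hypothetical placeholders for the components named in the text.

    def personalise(model, unlabelled_set, data_filter, spec_augment,
                    transcribe, train_one_epoch, num_rounds, num_epochs):
        # DataFilter: unlabelled set X -> filtered set X-hat.
        filtered_set = data_filter(model, unlabelled_set)
        for _ in range(num_rounds):  # N rounds
            # Pseudo-labelling on SpecAugment-perturbed inputs (D-hat).
            pseudo_pairs = [(x, transcribe(model, spec_augment(x)))
                            for x in filtered_set]
            for _ in range(num_epochs):  # M epochs per round
                # Training also sees perturbed inputs, enforcing the CC.
                train_one_epoch(model, pseudo_pairs, augment=spec_augment)
        return model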

The proposed consistency constraint based training method is not restricted to any particular ASR model architecture. For example, when we apply it to a conformer transducer model [6], the loss function incorporates the CC within the standard RNN-T loss:

L_CC = L_RNN-T(f(x̃), ŷ)

where ŷ denotes the hard labels generated from pseudo-labelling, and x̃ denotes the augmented input features.
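As a hedged sketch of how such a loss could be computed with torchaudio's RNN-T loss: the model call and spec_augment are hypothetical placeholders, and the joint-network logits are assumed to have shape (batch, time, label_len + 1, vocab).

    import torchaudio.functional as Fa

    def cc_rnnt_loss(model, feats, feat_lens, pseudo_labels, label_lens,
                     spec_augment, blank_id=0):
        # Perturb the input features (x-tilde) before the forward pass.
        aug_feats = spec_augment(feats)
        # Hypothetical model API: returns joint-network logits and the
        # encoder output lengths after subsampling.
        logits, logit_lens = model(aug_feats, feat_lens,
                                   pseudo_labels, label_lens)
        # Standard RNN-T loss against the hard pseudo-labels y-hat.
        return Fa.rnnt_loss(logits, pseudo_labels, logit_lens, label_lens,
                            blank=blank_id)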

Experimental results


    
• Data: In-house synthetic user data for a mobile phone use case. There are 12 speakers, each with three styles of speech: (a) application launch/download commands (Apps), (b) contact call/text commands (Contacts), and (c) common real-world voice assistant commands (Dictations).

• Speech recognition system: A state-of-the-art, two-pass Conformer-T model [7] is used as the pre-trained ASR model, both for filtering unlabelled data and for adaptation to a target speaker. The first-pass model is a conformer based transducer; the second-pass model is an attention-based encoder-decoder (LAS). A neural confidence measure (NCM) classifier [11] serves as the confidence estimation module, using intermediate ASR features for data filtering (a minimal sketch of this filtering step follows the list).
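The sketch below illustrates confidence-based filtering; decode_with_features is a hypothetical API, and ncm stands in for an NCM-style scorer mapping intermediate ASR features to a confidence in [0, 1].

    def filter_utterances(model, ncm, utterances, threshold=0.5):
        # Keep only utterances the pre-trained model is likely to have
        # decoded correctly, as judged by the confidence estimator.
        kept = []
        for utt in utterances:
            # Hypothetical API: decode and expose intermediate features.
            hypothesis, features = model.decode_with_features(utt)
            if ncm(features) >= threshold:
                kept.append(utt)
        return kept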

Table 1 : Word Error Rate (WER) of the proposed method and existing methods for unsupervised personalisation

Table 1 shows that the proposed method outperforms several baselines from the literature, namely noisy student training (NST) [9], an adapter based method [10], and an entropy minimisation method [8], achieving a new state-of-the-art result. The proposed method not only improves performance on the data used for training, but also generalises well to unseen data from the target speaker.

Table 2 : The effect of three data filtering methods (CT, DUST, NCM) on the unsupervised personalisation performance

Table 2 shows the ASR personalisation performance using either the whole unfiltered data set or the data filtered by one of three strategies, namely Confidence Thresholding (CT) [12], DUST [13], and the NCM. It shows that the CC based training is able to exploit samples with erroneous labels to adapt the model. The NCM is favoured for on-device personalisation, as it is a lightweight model that requires only one-time training and can be easily deployed on device.

Figure 2 : Word Error Rate Reduction (WERR) relative to the pre-trained model, using consistency training (CC) and unsupervised NST for 20 rounds with 1, 3 or 5 epochs per round. Higher values are better.

We study the effect of the number of rounds and epochs per round on the overall Word Error Rate Reduction (WERR). Figure 2 shows that our method performs up to 40% better than unsupervised NST, which is more susceptible to overfitting because it gets stuck in local minima more easily. We further observed that training with five epochs per round can lead to divergence on Dictations, a classic example of overfitting due to the increased number of model updates. Conversely, training for a single epoch per round leads to sub-optimal convergence due to the increased stochasticity of regenerating pseudo-labels with input augmentation every round.
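For reference, WERR here is assumed to follow the standard definition of relative WER improvement over the pre-trained model:

    def werr(wer_pretrained: float, wer_adapted: float) -> float:
        # Relative Word Error Rate Reduction, in percent.
        return 100.0 * (wer_pretrained - wer_adapted) / wer_pretrained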

Figure 3 : ASR personalisation results for each of the 12 individual users. The pre-trained model, NST trained on unfiltered data, and the proposed method are compared in the plot.

Finally, we investigate the performance of the proposed method for each individual user; the analysis is shown in Figure 3. Speech recognition accuracy varies widely among the test users, with WERs ranging from 10% to 45%. The proposed method improves recognition accuracy for most users on both held-in and held-out data, demonstrating the robustness of the training process to erroneous labels.

Conclusions


We proposed a novel unsupervised personalisation training method to address the domain shift issue that arises when an ASR model is deployed in the wild.

    
• The proposed method performs data filtering of unlabelled user data and applies a consistency constraint to the training.

• The consistency constraint is applied by introducing data perturbation into the iterative pseudo-labelling process.

• The consistency based training can be used in conjunction with a wide range of data filtering strategies.


References


[1] Zhang, J., Rajan, V., Mehmood, H., Tuckey, D., Parada, P.P., Jalal, M.A., Saravanan, K., Lee, G.H., Lee, J. and Jung, S., Consistency Based Unsupervised Self-Training for ASR Personalisation. In IEEE ASRU 2023.
[2] Sajjadi, M., Javanmardi, M. and Tasdizen, T., Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Advances in neural information processing systems, 2016.
[3] Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C.A., Cubuk, E.D., Kurakin, A. and Li, C.L., FixMatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 2020.
[4] Deng, J., Xie, X., Wang, T., Cui, M., Xue, B., Jin, Z., Li, G., Hu, S. and Liu, X., Confidence score based speaker adaptation of conformer speech recognition systems. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31, pp.1175-1190, 2023.
[5] Park, D.S., Chan, W., Zhang, Y., Chiu, C.C., Zoph, B., Cubuk, E.D. and Le, Q.V., SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Interspeech 2019.
[6] Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y. and Pang, R., Conformer: Convolution-augmented Transformer for Speech Recognition. Interspeech 2020.
[7] Park, J., Jin, S., Park, J., Kim, S., Sandhyana, D., Lee, C., Han, M., Lee, J., Jung, S., Han, C. and Kim, C., Conformer-Based on-Device Streaming Speech Recognition with KD Compression and Two-Pass Architecture. In IEEE SLT. 2023.
[8] Lin, G., Li, S. and Lee, H., Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition. Interspeech 2022.
[9] Park, D.S., Zhang, Y., Jia, Y., Han, W., Chiu, C.C., Li, B., Wu, Y. and Le, Q.V., Improved noisy student training for automatic speech recognition. Interspeech 2020.
[10] Deng, J., Xie, X., Wang, T., Cui, M., Xue, B., Jin, Z., Geng, M., Li, G., Liu, X. and Meng, H.M., Confidence Score Based Conformer Speaker Adaptation for Speech Recognition. Interspeech 2022.
[11] Gupta, A., Kumar, A., Gowda, D., Kim, K., Singh, S., Singh, S. and Kim, C., Neural utterance confidence measure for RNN-Transducers and two pass models. In ICASSP 2021.
[12] Kahn, J., Lee, A. and Hannun, A., Self-training for end-to-end speech recognition. In ICASSP 2020.
[13] Khurana, S., Moritz, N., Hori, T. and Le Roux, J., Unsupervised domain adaptation for speech recognition via uncertainty driven self-training. In ICASSP 2021.