Voice assistants are now widely used to control smartphones. An automatic speech recognition (ASR) model plays a crucial role in a voice assistant system: it recognises the user's voice command, which is subsequently used for downstream tasks such as spoken language understanding and speech translation.
On-device ASR models trained on speech data from a large population may underperform for users unseen during training. This is due to a domain shift between user data and the original training data, caused by differences in users' speaking characteristics and environmental acoustic conditions. ASR personalisation addresses this by exploiting user data to improve model robustness. Most ASR personalisation methods assume labelled user data for supervision, which requires users to provide or revise transcripts.
In this blog, we present our work [1] on unsupervised ASR personalisation, a novel consistency-based training method via pseudo-labelling, which has been accepted at IEEE ASRU 2023.
A consistency constraint (CC) [2,3] forces a model to predict the same result on differently perturbed versions of the same input, and has proven effective for exploiting unlabelled data. Applying varied perturbations introduces randomisation that regularises the model, leading to more stable generalisation [2].
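To make the perturbation concrete, here is a minimal PyTorch/torchaudio sketch of two independently SpecAugment-masked views of the same features; the mask parameters and tensor shapes are illustrative, not those used in the paper.

```python
import torch
import torchaudio

# A minimal illustration of the consistency idea: two independently
# perturbed views of the same log-mel features should yield the same
# model prediction. SpecAugment-style masking serves as the perturbation.
augment = torch.nn.Sequential(
    torchaudio.transforms.FrequencyMasking(freq_mask_param=27),
    torchaudio.transforms.TimeMasking(time_mask_param=100),
)

features = torch.randn(1, 80, 500)  # (batch, mel bins, frames), dummy input
view_a = augment(features)          # first perturbed view
view_b = augment(features)          # second, independently perturbed view
# A consistency-trained model f is encouraged to satisfy f(view_a) == f(view_b).
```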
We exploit the CC to improve the robustness of the training process for unsupervised ASR personalisation. A common pipeline for unsupervised self-training consists of data filtering, pseudo-labelling, and training [4]. We introduce the CC into this pipeline by applying perturbations to both pseudo-labelling and training, forcing the model to output a consistent label in the vicinity of each training sample. The personalisation pipeline based on the CC training method is shown in Figure 1.
Figure 1: Unsupervised personalisation pipeline based on data filtering and the consistency constraint
The full method is further described in Algorithm 1.
First, data filtering (DataFilter) is applied to the entire unlabelled set χ to obtain the filtered set χ̂.
Second, the model is trained on the filtered set through N rounds of pseudo-labelling and training. SpecAugment [5] is applied during both pseudo-labelling and training. In each round, the model f is trained for M epochs on the paired audio samples and pseudo-labels D̂. The structure of this loop is sketched below.
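The following Python sketch captures the shape of Algorithm 1; the callables (data_filter, pseudo_label, augment, train_epoch) are hypothetical stand-ins for the components described above, not the paper's implementation.

```python
from typing import Callable, Sequence

def personalise(model, unlabelled_set: Sequence,
                data_filter: Callable,   # DataFilter in Algorithm 1 (e.g. NCM-based)
                pseudo_label: Callable,  # decodes hard labels from an augmented input
                augment: Callable,       # SpecAugment-style perturbation
                train_epoch: Callable,   # one supervised epoch on (x, y) pairs
                num_rounds: int = 20,    # N rounds of pseudo-labelling + training
                epochs_per_round: int = 3):  # M epochs per round
    filtered = data_filter(unlabelled_set)  # filtered set obtained once, up front
    for _ in range(num_rounds):
        # Pseudo-labelling: the perturbation is applied here as well as in training
        pairs = [(x, pseudo_label(model, augment(x))) for x in filtered]
        for _ in range(epochs_per_round):
            # Each epoch sees freshly augmented inputs, enforcing consistency
            train_epoch(model, [(augment(x), y) for x, y in pairs])
    return model
```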
The proposed consistency-constraint-based training method is not restricted to any particular ASR model architecture. For example, when we apply it to a Conformer transducer model [6], the loss function incorporates the CC within the standard RNN-T loss:

$$\mathcal{L}_{\text{CC}} = \mathcal{L}_{\text{RNN-T}}\big(f(\tilde{x}),\, \hat{y}\big),$$

where $\hat{y}$ denotes the hard labels generated from pseudo-labelling, and $\tilde{x}$ denotes the augmented input features.
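As a rough illustration, this loss could be computed with torchaudio's RNN-T loss as below. The model interface and the greedy_decode helper are hypothetical assumptions; only the overall shape of the computation follows the equation above.

```python
import torch
import torchaudio.functional as F

def cc_rnnt_loss(model, x, x_lens, augment, greedy_decode):
    """Consistency-constrained RNN-T loss sketch: pseudo-labels are decoded
    from one augmented view, then the standard RNN-T loss is computed on
    another, independently augmented view against those hard labels."""
    with torch.no_grad():
        # Hard pseudo-labels from a perturbed input (hypothetical decoder helper)
        y_hat, y_lens = greedy_decode(model, augment(x), x_lens)
    # Joint-network logits on a second perturbed view (hypothetical interface)
    logits, logit_lens = model(augment(x), x_lens, y_hat, y_lens)
    return F.rnnt_loss(
        logits,                      # (batch, T, U + 1, vocab)
        y_hat.to(torch.int32),       # pseudo-label targets
        logit_lens.to(torch.int32),
        y_lens.to(torch.int32),
        blank=0,                     # assumed blank index
        reduction="mean",
    )
```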
Table 1: Word Error Rate (WER) of the proposed method and existing methods for unsupervised personalisation
Table 1 shows that the proposed method outperforms several baselines from the literature, namely noisy student training (NST) [9], an adapter-based method [10], and an entropy minimisation method [8], achieving a new state-of-the-art (SOTA) result. The proposed method not only improves performance on the data used for training, but also generalises well to unseen data from the target speaker.
Table 2: The effect of three data filtering methods (CT, DUST, NCM) on the unsupervised personalisation performance
Table 2 shows the ASR personalisation performance when using either the unfiltered whole data set or data filtered by one of three strategies: confidence thresholding (CT) [12], DUST [13], and the NCM. It shows that CC-based training is able to exploit samples with erroneous labels to adapt the model. The NCM is favoured for on-device personalisation, as it is a lightweight model that requires only one-time training and can be easily deployed on device.
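For intuition, the simplest of these strategies, confidence thresholding, reduces to a few lines; the threshold value here is an illustrative assumption and would be tuned in practice.

```python
from typing import List, Sequence

def confidence_threshold_filter(utterances: Sequence,
                                confidences: Sequence[float],
                                threshold: float = 0.9) -> List:
    """Confidence thresholding (CT) sketch: keep only the utterances whose
    utterance-level decoder confidence exceeds a fixed cut-off. An NCM plays
    the same role, but predicts the confidence with a small trained model."""
    return [u for u, c in zip(utterances, confidences) if c >= threshold]
```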
Figure 2: Word Error Rate Reduction (WERR) relative to the pre-trained model, for consistency constraint (CC) training and unsupervised NST over 20 rounds with 1, 3, or 5 epochs per round. Higher is better.
We study the effect of the number of rounds and epochs per round on the overall Word Error Rate Reduction (WERR). Figure 2 shows that our method performs up to 40% better than unsupervised NST, which is more susceptible to overfitting because it easily gets stuck in a local minimum. We further observed that training with five epochs per round can lead to divergence on Dictation, a classic symptom of overfitting due to the increased number of model updates. Conversely, training for a single epoch per round leads to sub-optimal convergence, because regenerating pseudo-labels with input augmentation every round increases stochasticity.
Figure 3: ASR personalisation results for each of the 12 individual users, comparing the pre-trained model, NST trained on unfiltered data, and the proposed method.
Finally, we investigate the performance of the proposed method for each individual user; the analysis is shown in Figure 3. There is a wide range of speech recognition accuracy among the test users, with WERs ranging from 10% to 45%. The proposed method improves recognition accuracy for most users on both held-in and held-out data, demonstrating the robustness of the training process to erroneous labels.
We proposed a novel unsupervised personalisation training method that addresses the domain shift issue arising when an ASR model is deployed in the wild.