Automatic Speech Recognition (ASR) systems have made significant advances in recent years, but they still struggle with speech containing rare words (words that seldom appear in everyday conversation) or named entities [Le et al., 2021, Munkhdalai et al., 2022, Munkhdalai et al., 2023]. For instance, an ASR system may have difficulty when a user tries to call their best friend with a non-native name through a voice assistant, and the result can end up far from the intended command.
In these challenging scenarios, Contextual Biasing (CB) tries to detect whether a word from a given list is present in the input speech and, if so, biases the ASR output towards the correct transcription. The list of words depends on the command: if the "call" command is used, for instance, the list can be the entire set of contacts on the device.
In this blog, we present our work (Jalal et al., 2023) on locality enhanced dynamic biasing and sampling strategies for CB, which has been accepted at ASRU 2023. We propose end-to-end biasing at the sub-word level with locality enhanced attention distillation over the CB output. Furthermore, we conduct a correlation-based analysis to demonstrate the robustness and faster convergence of the proposed models. Different sampling strategies are explored for training CB for better generalisation in realistic scenarios and are analysed for CB adaptation. We analyse the representation learning dynamics of the different sampling strategies and models, while empirically visualising the bias correlations. To the best of our knowledge, this analysis is novel in the contextual biasing research domain.
The majority of earlier research focused on biasing/correcting the entities after the ASR model prediction [Alon et al., 2019, Pundak et al., 2018, Chai Sim et al., 2019, Huang et al., 2020, Le et al., 2021]. However, end-to-end approaches, where the biasing module is integrated within the ASR, are becoming more popular [Pundak et al., 2018, Munkhdalai et al., 2022, Zhang and Zhou, 2022, Munkhdalai et al., 2023, Alexandridis et al., 2023]. One example is the Neural Associative Memory (NAM) module [Munkhdalai et al., 2022], which learns to find a word piece (WP) token (either a whole word or a sub-section of a word) from the input list of words that is acoustically similar to the current input frame and follows a sequence of preceding WP tokens. By reconstructing previously stored patterns from partial or corrupted variants, NAM captures the associative transitions between WP sub-sequences within the same phrase. Building on this work, [Munkhdalai et al., 2023] introduces further improvements with a top-k search and achieves a 16x inference speedup.
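To build intuition for this associative lookup, the toy sketch below treats the current acoustic frame as a query and the WP token embeddings of the context phrases as keys/values, retrieving a bias vector by attention. This is a simplification for illustration only, not the NAM architecture itself; all shapes and names are assumptions.

```python
import numpy as np

def associative_lookup(frame, wp_keys, wp_values):
    """frame: (d,); wp_keys, wp_values: (num_tokens, d). Returns a bias vector (d,)."""
    scores = wp_keys @ frame / np.sqrt(frame.shape[-1])  # similarity of the frame to each WP token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                             # softmax over the context tokens
    return weights @ wp_values                           # weighted mix of WP values = retrieved bias

# Example: 4 context WP tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
keys, values = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
bias = associative_lookup(rng.normal(size=8), keys, values)
```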
Training a reliable CB module requires effective sampling strategies when constructing the lists of contextual phrases [Le et al., 2021, Bleeker et al., 2023]. Sampling contextual phrases for training poses several challenges. Firstly, it is crucial to include examples that can effectively bias the ASR system towards a desired context, while keeping the amount of context computationally manageable [Munkhdalai et al., 2023]. Secondly, the sampling method should aim for generalisation and prevent overfitting [Le et al., 2021]. To mitigate this, the sampling method should include negative examples that serve as distractors, i.e., phrases that do not provide any valuable contextual information. By incorporating distractors, the CB module learns to distinguish between relevant and irrelevant contextual cues, improving its ability to accurately bias the ASR system [Zhang and Zhou, 2022, Munkhdalai et al., 2022]. Finding an optimal sampling method therefore becomes imperative to curate a manageable yet representative set of contextual phrases. We have explored three sampling strategies for training (SMa, SMb, SMc) and one for evaluation (SMd); a code sketch of all four is given after the descriptions below. Initially, n-grams are built from the training transcripts and kept in a global random n-gram pool. We assume the contextual phrase batch for each utterance transcript may contain a different number of positive samples (PS) and negative samples (NS).
For all the sampling methods, the negative n-grams are sampled from the random n-gram pool.
SMa: With this strategy, a given transcript is split into n-grams where n ranges from 1 to 3. From these, k n-grams are chosen as positive samples and shuffled randomly; they form the context phrases for the current utterance.
SMb: The name/rare-word entities are selected from the current transcript, and n-grams are built from the neighbouring words of those entities (positive samples).
SMc: The name/rare-word entities are selected from the current transcript. Using those entities, k n-grams are randomly selected from the entity n-gram pool (positive samples) and shuffled.
SMd: The name/rare-word entities are selected from the current transcript. These k words (positive samples), together with negative n-grams randomly drawn from the random n-gram pool, form the evaluation context batch.
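The following is a minimal sketch of the four strategies, assuming a transcript is a list of words, `entities` marks the rare-word/name tokens, and `ngram_pool` / `entity_ngram_pool` are pre-built global pools of n-grams. Function names, defaults, and pool contents are illustrative, not the paper's implementation.

```python
import random

def make_ngrams(words, n_min=1, n_max=3):
    """All n-grams (as strings) of a word list for n in [n_min, n_max]."""
    return [" ".join(words[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(words) - n + 1)]

def sample_sma(transcript, ngram_pool, k=3, num_neg=5):
    """SMa: split the transcript into 1-3 grams and pick k of them as positives."""
    grams = make_ngrams(transcript)
    batch = random.sample(grams, min(k, len(grams))) + random.sample(ngram_pool, num_neg)
    random.shuffle(batch)
    return batch

def sample_smb(transcript, entities, ngram_pool, num_neg=5):
    """SMb: n-grams built from the words neighbouring each entity as positives."""
    positives = []
    for i, word in enumerate(transcript):
        if word in entities:
            positives += make_ngrams(transcript[max(0, i - 1):i + 2])
    batch = positives + random.sample(ngram_pool, num_neg)
    random.shuffle(batch)
    return batch

def sample_smc(entities, entity_ngram_pool, ngram_pool, k=3, num_neg=5):
    """SMc: k n-grams containing the entities, drawn from the entity n-gram pool."""
    candidates = [g for g in entity_ngram_pool if any(e in g.split() for e in entities)]
    batch = random.sample(candidates, min(k, len(candidates))) + random.sample(ngram_pool, num_neg)
    random.shuffle(batch)
    return batch

def sample_smd(entities, ngram_pool, num_neg=5):
    """SMd (evaluation): the entity words themselves plus random negative n-grams."""
    batch = list(entities) + random.sample(ngram_pool, num_neg)
    random.shuffle(batch)
    return batch

# Example usage with toy pools.
pool = ["turn the volume", "weather today", "next song", "set an alarm", "good morning"]
entity_pool = ["call aleksandra now", "aleksandra mobile", "message szymon"]
print(sample_sma("please call aleksandra on mobile".split(), pool))
print(sample_smd(["aleksandra"], pool))
```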
We use [Munkhdalai et al., 2022, Chang et al., 2021] as the base of the CB module, which models the transition between contiguous, adjacent sub-word tokens of OOV words. We distil the context embedding using locality bounded information [Hassani et al., 2023] from the sub-word and acoustic representations. The neighbourhood attention models LE-CB-v1 and LE-CB-v2 use sliding windows as locality enhanced contextual biasing (LE-CB) modules.
Initial Bias representation learning
Initially, the input text of each phrase given to the contextual biasing module is tokenized into sub-word (WP) tokens using the WordPiece algorithm [Wu et al., 2016]. A context encoder then extracts a contextual embedding representation for each sub-word unit.
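A minimal sketch of this step is shown below, assuming a WordPiece tokenizer (here the one shipped with a BERT vocabulary via Hugging Face `transformers`) and a simple embedding + BiLSTM context encoder. The encoder architecture and dimensions are illustrative stand-ins, not the blog's exact implementation.

```python
import torch.nn as nn
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # WordPiece tokenizer

class ContextEncoder(nn.Module):
    """Toy context encoder: sub-word embedding followed by a BiLSTM."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (num_phrases, max_wp_len) -> per-WP contextual embeddings
        return self.rnn(self.embed(token_ids))[0]  # (num_phrases, max_wp_len, 2 * hidden_dim)

phrases = ["call aleksandra", "navigate to szczecin"]
wp_tokens = [tokenizer.tokenize(p) for p in phrases]            # e.g. ['call', 'alek', '##sand', '##ra']
ids = tokenizer(phrases, padding=True, return_tensors="pt")["input_ids"]
context_emb = ContextEncoder(tokenizer.vocab_size)(ids)         # one embedding per sub-word unit
```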
Figure 1 : Locality enhanced contextual biasing.
Neighbourhood attention over bias representation
We use local biases for further attention weight distillation. Localised inductive biases are helpful for learning better representations; however, multi-head attention requires training with a huge amount of data or augmentation techniques to learn such nuanced localised biases [Touvron et al., 2021]. Local attention techniques such as [Liu et al., 2021, Ramachandran et al., 2019, Hassani et al., 2023] improve on this by using a sliding window to select salient regions and computing attention over the neighbouring regions. In our scenario, we select specific time-frames and compute attention with the neighbouring frames using neighbourhood attention (NA) [Hassani et al., 2023], which is reported to be faster and to achieve translational equivariance. Two different scenarios, LE-CB-v1 and LE-CB-v2, are considered in Fig. 1. The NA representations are combined with the biasing vector using a weight hyperparameter λ before being combined with the context-agnostic initial audio representation.
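As an illustration, the sketch below uses a simplified 1-D sliding-window attention as a stand-in for neighbourhood attention [Hassani et al., 2023] and shows the weighted combination with the biasing vector via λ. The window size, dimensions, and the final fusion step are assumptions for demonstration, not the exact LE-CB implementation.

```python
import torch
import torch.nn.functional as F

def local_attention(x, window=3):
    """x: (T, d). Each frame attends only to its +/- `window` neighbours."""
    T, d = x.shape
    scores = (x @ x.t()) / d ** 0.5                       # (T, T) pairwise similarities
    idx = torch.arange(T)
    mask = (idx[None, :] - idx[:, None]).abs() > window   # True outside the neighbourhood
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ x                  # locally attended representation

lam = 0.5                                                 # weight hyperparameter lambda
audio = torch.randn(100, 256)                             # context-agnostic audio frames (T, d)
bias = torch.randn(100, 256)                              # per-frame biasing vectors from the CB module
na_out = local_attention(bias)                            # neighbourhood-style attention over the bias
fused_bias = lam * na_out + (1 - lam) * bias              # combine NA output with the biasing vector
biased_audio = audio + fused_bias                         # add to the initial audio representation
```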
Setup
The ASR baseline follows the Conformer encoder [Gulati et al., 2020] and is trained with multi-conditioned training (MCT) [Ko et al., 2017] on LibriSpeech and CommonVoice data. The CB baseline for this work is based on [Munkhdalai et al., 2022]. In this setup, the pre-trained ASR is frozen and only the CB module is trained. All the models trained with SMa, SMb and SMc are evaluated with SMd. The correlation among neural embeddings across sampling strategies and models is calculated with SVCCA [Raghu et al., 2017].
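For readers unfamiliar with SVCCA [Raghu et al., 2017], the sketch below shows one minimal way to compute it between two sets of CB embeddings (e.g. from successive epochs): reduce each activation matrix with an SVD, then take the mean canonical correlation between the reduced views. The variance threshold and shapes are illustrative; the analysis in the paper may differ in detail.

```python
import numpy as np

def svcca(X, Y, var_kept=0.99):
    """X, Y: (num_samples, dim) activation matrices for the same utterances."""
    def reduce(A):
        A = A - A.mean(axis=0)                            # center
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        k = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), var_kept) + 1
        return U[:, :k] * s[:k]                           # keep directions covering `var_kept` variance
    Xr, Yr = reduce(X), reduce(Y)
    Qx, _ = np.linalg.qr(Xr)                              # orthonormal bases for CCA
    Qy, _ = np.linalg.qr(Yr)
    corrs = np.linalg.svd(Qx.T @ Qy, compute_uv=False)    # canonical correlations
    return corrs.mean()

# Example: embeddings of the same 200 utterances from two successive epochs.
rng = np.random.default_rng(0)
epoch_t, epoch_t1 = rng.normal(size=(200, 64)), rng.normal(size=(200, 64))
print(svcca(epoch_t, epoch_t1))
```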
Table 1 : Contextual Biasing results with a context batch of 10 phrases for each utterance, evaluated with SMd.
Results
The results for LE-CB-v1 and LE-CB-v2 are presented in Table 1. NAM is the baseline contextual biasing architecture, LE-CB-v1 and LE-CB-v2 are the proposed models, and CB-C is the convolution-based local representation learning variant used for the ablation study. The results in Table 1 clearly demonstrate that the proposed locality enhanced distillation of contextual bias (LE-CB-v1, LE-CB-v2) achieves superior results compared to the baseline and the baseline NAM.
Figure 2 : Contextual bias embedding representation learning analysis with epoch-wise embedding similarity.
Figure 2 shows the correlation between CB embeddings in successive epochs, which demonstrates that the proposed model learns better bias representations across all sampling scenarios. The LE-CB models are more robust because, after convergence, the bias embeddings of an utterance remain closer over subsequent epochs (the curves are less jittery) even with different context phrase batches. Training with the SMb sampling method results in the best representation learning and robustness. For more analysis, please refer to the paper.
We have proposed a novel contextual bias distillation technique, which uses local attention to learn neighbouring biases between sub-word and acoustic representations. Furthermore, sampling strategies for training contextual biasing have been demonstrated, and the impact of the sampling method while training a contextual biasing model has been discussed. Future work could extend this approach to adapt the model to a large number of contextual phrases.