
Locality Enhanced Dynamic Biasing And Sampling Strategies For Contextual ASR

By Md Asif Jalal, Samsung R&D Institute United Kingdom
By Pablo Peso Parada, Samsung R&D Institute United Kingdom
By Karthikeyan Saravanan, Samsung R&D Institute United Kingdom
By Jisi Zhang, Samsung R&D Institute United Kingdom

Introduction


Automatic Speech Recognition (ASR) systems have made significant advances in recent years, but they still face challenges when recognising speech containing rare words (words not often used in everyday conversation) or named entities [Le et al., 2021, Munkhdalai et al., 2022, Munkhdalai et al., 2023]. For instance, an ASR system may struggle when a user tries to call their best friend, who has a non-native name, using the voice assistant; the result can be far from the user's intended command.

In these challenging scenarios, Contextual Biasing (CB) tries to detect whether a word from a list of words is present in the input speech and, if it is detected, biases the ASR output towards the correct transcription. The list of words depends on the command: for instance, if the "call" command is used, the list can be the entire list of contacts on the device.

In this blog, we present our work [Jalal et al., 2023] on locality enhanced dynamic biasing and sampling strategies for CB, which has been accepted at ASRU 2023. We propose end-to-end biasing at the sub-word level with locality enhanced attention distillation over the CB output. Furthermore, we conduct a correlation-based analysis to demonstrate the proposed models' robustness and faster convergence. Different sampling strategies are explored to train CB for better generalisation in realistic scenarios and analysed for CB adaptation. We analyse the representation learning dynamics of the different sampling strategies and models while empirically visualising the bias correlations. To the best of our knowledge, this analysis is novel in the contextual biasing research domain.

Background: Contextual Bias for ASR

Most research has focused on biasing/correcting the entities after the ASR model prediction [Alon et al., 2019, Pundak et al., 2018, Chai Sim et al., 2019, Huang et al., 2020, Le et al., 2021]. However, end-to-end approaches, where the biasing module is integrated within the ASR model, are becoming more popular [Pundak et al., 2018, Munkhdalai et al., 2022, Zhang and Zhou, 2022, Munkhdalai et al., 2023, Alexandridis et al., 2023]. An example of these end-to-end approaches is NAM [Munkhdalai et al., 2022]. NAM uses a Neural Associative Memory that learns to find the word piece (WP) token (either a whole word or a sub-section of a word) from the input list of words that is acoustically similar to the current input frame and follows a sequence of preceding WP tokens. By reconstructing previously stored patterns from partial or corrupted variants, the NAM captures the associative transitions between WP sub-sequences within the same phrase. Building upon this work, [Munkhdalai et al., 2023] introduces improvements such as a top-k search, achieving a 16x inference speedup.
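To make the lookup mechanism concrete, below is a minimal PyTorch sketch of an associative lookup of this kind. It is not the authors' implementation: the tensor shapes and the scaled dot-product scoring are our own illustrative assumptions.

```python
import torch

def nam_style_lookup(audio_q, wp_keys, wp_values):
    """Illustrative NAM-style associative lookup: the current acoustic
    frame queries the stored word-piece (WP) keys and retrieves a convex
    combination of their values, approximating the WP token that best
    matches the audio and the preceding WP sequence.

    audio_q:   (batch, dim)         current acoustic representation
    wp_keys:   (batch, n_wp, dim)   embeddings of the context WP tokens
    wp_values: (batch, n_wp, dim)   embeddings of their successor WP tokens
    """
    dim = wp_keys.shape[-1]
    scores = torch.einsum('bd,bnd->bn', audio_q, wp_keys) / dim ** 0.5
    attn = scores.softmax(dim=-1)   # soft match over the stored patterns
    return torch.einsum('bn,bnd->bd', attn, wp_values)
```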

Why do we focus on sampling methods?


Training a reliable CB module requires effective sampling strategies when constructing contextual phrase lists [Le et al., 2021, Bleeker et al., 2023]. Sampling contextual phrases for training poses several challenges. Firstly, it is crucial to include appropriate examples that can effectively bias the ASR system towards a desired context, while keeping the context set computationally affordable [Munkhdalai et al., 2023]. Secondly, the sampling method should generalise well and prevent overfitting [Le et al., 2021]. To this end, the sampling method should include negative examples that serve as distractors, i.e., phrases that do not provide any valuable contextual information. By incorporating distractors, the CB module learns to distinguish between relevant and irrelevant contextual cues, improving its ability to accurately bias the ASR system [Zhang and Zhou, 2022, Munkhdalai et al., 2022]. Finding an optimal sampling method is therefore imperative to curate a manageable yet representative set of contextual phrases. We explore three sampling strategies for training (SMa, SMb, SMc) and one sampling strategy for evaluation (SMd). Initially, n-grams are built from the training transcripts and kept in a global random n-gram pool. The contextual phrase batch for each utterance transcript may contain a different number of positive samples (PS) and negative samples (NS).

For all the sampling methods, the negative n-grams are sampled from the random n-gram pool; a minimal sketch of one strategy is given after the list below.
SMa: The given transcript is split into n-grams, where n ranges from 1 to 3. From these, k n-grams are chosen as positive samples and shuffled randomly; they form the context phrases for the current utterance.
SMb: The name/rare-word entities are selected from the current transcript, and the n-grams are built from the neighbouring words of those entities (positive samples).
SMc: The name/rare-word entities are selected from the current transcript. Using those entities, k n-grams are randomly selected from the entity n-gram pool (positive samples) and shuffled.
SMd: The name/rare-word entities are selected from the current transcript as the k positive samples, and the negative n-grams are randomly selected from the random n-gram pool.
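As a concrete illustration, here is a minimal Python sketch of SMa-style sampling. The pool construction and the values of k and the number of negatives are illustrative assumptions, not the paper's exact settings.

```python
import random

def make_ngrams(words, n_max=3):
    """All n-grams (n = 1..n_max) of a token list."""
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]

def sample_sma(transcript, ngram_pool, k=3, n_neg=7):
    """SMa-style context batch: positives are n-grams drawn from the
    transcript itself; negatives (distractors) come from the global
    random n-gram pool."""
    candidates = make_ngrams(transcript.split())
    positives = random.sample(candidates, min(k, len(candidates)))
    negatives = random.sample(ngram_pool, n_neg)
    batch = positives + negatives
    random.shuffle(batch)
    return batch

# Example: a global pool built from all training transcripts.
pool = make_ngrams("the quick brown fox jumps over the lazy dog".split())
print(sample_sma("call Aleksandra on her mobile", pool))
```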

Locality Enhanced Contextual Bias


We build the CB module on [Munkhdalai et al., 2022, Chang et al., 2021], which models the transition between contiguous, adjacent sub-word tokens of out-of-vocabulary (OOV) words. We distil the context embedding using locality-bounded information [Hassani et al., 2023] from the sub-word representation and the acoustic representation. The neighbourhood attention models LE-CB-v1 and LE-CB-v2 use sliding windows and serve as locality enhanced contextual biasing (LE-CB) modules.

Initial bias representation learning
Initially, the input text of each phrase given to the contextual biasing module is tokenized and split into sub-word (WP) tokens using the WordPiece algorithm [Wu et al., 2016]. A context encoder then extracts a contextual embedding representation for each sub-word unit.
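The sketch below shows what this step might look like in PyTorch, using an off-the-shelf WordPiece tokenizer. The BERT vocabulary, the BiLSTM encoder, and the hidden sizes are illustrative assumptions, not the paper's configuration.

```python
import torch.nn as nn
from transformers import BertTokenizerFast

# WordPiece tokenisation of the context phrases (illustrative vocabulary).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
phrases = ["call Aleksandra", "play Sigur Ros"]
batch = tokenizer(phrases, padding=True, return_tensors="pt")

class ContextEncoder(nn.Module):
    """Minimal context encoder: embeds each WP token and contextualises
    the sequence with a BiLSTM, yielding one vector per sub-word unit."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim // 2, bidirectional=True,
                           batch_first=True)

    def forward(self, token_ids):
        out, _ = self.rnn(self.embed(token_ids))
        return out  # (n_phrases, n_wp, dim)

encoder = ContextEncoder(tokenizer.vocab_size)
wp_embeddings = encoder(batch["input_ids"])
```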

Figure 1: Locality enhanced contextual biasing.

Neighbourhood attention over bias representation
We use local biases for further attention-weight distillation. Localised inductive biases help with learning better representations; however, multi-head attention requires training with huge amounts of data or augmentation techniques to learn those nuanced localised biases [Touvron et al., 2021]. Local attention techniques such as [Liu et al., 2021, Ramachandran et al., 2019, Hassani et al., 2023] improve on this by using a sliding window to select salient regions and attend over the neighbouring regions. In our scenario, we select specific time frames and compute attention with the neighbouring frames using neighbourhood attention (NA) [Hassani et al., 2023], which is reported to be faster and to achieve translational equivariance. Two different scenarios, LE-CB-v1 and LE-CB-v2, are considered in Fig. 1. The NA representations are combined with the biasing vector using a weight hyperparameter λ before being combined with the context-agnostic initial audio representation.
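A single-head, 1-D simplification of neighbourhood attention and the λ-weighted combination might look as follows. The window size, the λ value, and the exact way the NA output, biasing vector, and audio representation are combined are our assumptions based on the description above, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def neighbourhood_attention_1d(x, half_window=3):
    """Simplified single-head 1-D neighbourhood attention in the spirit
    of [Hassani et al., 2023]: each position attends only to the
    half_window frames on either side of it.  x: (batch, time, dim)."""
    B, T, D = x.shape
    x_pad = F.pad(x, (0, 0, half_window, half_window))  # pad the time axis
    # Each position's neighbourhood: (B, T, 2*half_window + 1, D)
    neigh = x_pad.unfold(1, 2 * half_window + 1, 1).transpose(2, 3)
    scores = torch.einsum('btd,btkd->btk', x, neigh) / D ** 0.5
    return torch.einsum('btk,btkd->btd', scores.softmax(dim=-1), neigh)

# λ-weighted combination (illustrative): distil the bias representation
# with NA, mix it with the original biasing vector, then add the result
# to the context-agnostic audio representation.
lam = 0.5
bias = torch.randn(2, 20, 256)    # per-frame biasing vectors
audio = torch.randn(2, 20, 256)   # initial audio representation
distilled = lam * neighbourhood_attention_1d(bias) + (1 - lam) * bias
output = audio + distilled
```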

Experimental Results

Setup
The ASR baseline follows the Conformer encoder [Gulati et al., 2020] and is trained with multi-conditioned training (MCT) [Ko et al., 2017] on LibriSpeech and CommonVoice data. The CB baseline for this work is based on [Munkhdalai et al., 2022]. In this setup, the pre-trained ASR is frozen and only the CB module is trained. All models trained with SMa, SMb and SMc are evaluated with SMd. The correlation among neural embeddings across sampling strategies and models is calculated with SVCCA [Raghu et al., 2017].

Table 1: Contextual biasing results with a batch of 10 context phrases for each utterance, evaluated with SMd.

Results
The results with LE-CB-v1 and LE-CB-v2 are presented in Table 1. NAM is the baseline contextual biasing architecture; LE-CB-v1 and LE-CB-v2 are the proposed models; and CB-C is the convolution-based local representation learning variant used for the ablation study. The results in Table 1 clearly demonstrate that the proposed locality enhanced distillation of contextual bias (LE-CB-v1, LE-CB-v2) achieves superior results compared to the ASR baseline and the NAM baseline.

Figure 2: Contextual bias embedding representation learning analysis with epoch-wise embedding similarity.

Figure 2 shows the correlation between CB embeddings at successive epochs, which demonstrates that the proposed models learn better bias representations across all sampling scenarios. The LE-CB models are more robust because, after convergence, the bias embedding of an utterance remains closer over subsequent epochs (the curves are less jittery), even with different context phrase batches. Training with the SMb sampling method results in the best representation learning and robustness. For further analysis, please refer to the paper.
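For readers who want to reproduce this kind of analysis, below is a minimal NumPy sketch of an SVCCA similarity score between the embeddings of the same utterances at two successive epochs. The variance threshold is an illustrative choice; [Raghu et al., 2017] describe the full procedure.

```python
import numpy as np

def svcca_similarity(X, Y, keep_var=0.99):
    """Mean SVCCA correlation between two activation matrices.
    X, Y: (n_samples, n_features), e.g. CB embeddings of the same
    utterances at epoch t and epoch t+1."""
    def svd_reduce(A):
        A = A - A.mean(axis=0)                  # centre each feature
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2),
                                keep_var)) + 1
        return U[:, :k] * s[:k]                 # top SVD directions
    Qx, _ = np.linalg.qr(svd_reduce(X))
    Qy, _ = np.linalg.qr(svd_reduce(Y))
    # Canonical correlations = singular values of Qx^T Qy.
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(rho.mean())

# High similarity between successive epochs indicates a stable,
# converged bias representation (the less jittery curves in Figure 2).
emb_t  = np.random.randn(500, 128)
emb_t1 = emb_t + 0.1 * np.random.randn(500, 128)
print(svcca_similarity(emb_t, emb_t1))
```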

Conclusions

We have proposed a novel contextual bias distillation technique that uses local attention to learn neighbouring biases among sub-word and acoustic representations. Furthermore, we have demonstrated sampling strategies for training contextual biasing and discussed their impact on the trained model; future work could extend this approach to adapt the model to a large number of contextual phrases.

Selected References


[1] [Akbik et al., 2019] Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., and Vollgraf, R. (2019). FLAIR: An easy-to-use framework for state-of-the-art NLP. In NAACL 2019, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59.
[2] [Alon et al., 2019] Alon, U., Pundak, G., and Sainath, T. N. (2019). Contextual speech recognition with difficult negative training examples. In ICASSP, pages 6440–6444.
[3] [Gulati et al., 2020] Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y., and Pang, R. (2020). Conformer: Convolution-augmented transformer for speech recognition. ArXiv, abs/2005.08100.
[4] [Hassani et al., 2023] Hassani, A., Walton, S., Li, J., Li, S., and Shi, H. (2023). Neighborhood attention transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6185–6194.
[5] [Le et al., 2021] Le, D., Jain, M., Keren, G., Kim, S., Shi, Y., Mahadeokar, J., Chan, J., Shangguan, Y., Fuegen, C., Kalinli, O., et al. (2021). Contextualized streaming end-to-end speech recognition with trie-based deep biasing and shallow fusion. arXiv preprint arXiv:2104.02194.
[6] [Liu et al., 2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin Transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9992–10002.
[7] [Munkhdalai et al., 2022] Munkhdalai, T., Sim, K. C., Chandorkar, A., Gao, F., Chua, M., Strohman, T., and Beaufays, F. (2022). Fast contextual adaptation with neural associative memory for on-device personalized speech recognition. In ICASSP, pages 6632–6636.
[8] [Munkhdalai et al., 2023] Munkhdalai, T., Wu, Z., Pundak, G., Sim, K. C., Li, J., Rondon, P., and Sainath, T. N. (2023). NAM+: Towards scalable end-to-end contextual biasing for adaptive ASR. In SLT, pages 190–196.
[9] [Pundak et al., 2018] Pundak, G., Sainath, T. N., Prabhavalkar, R., Kannan, A., and Zhao, D. (2018). Deep context: End-to-end contextual speech recognition. In SLT, pages 418–425.
[10] [Raghu et al., 2017] Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein, J. (2017). SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in Neural Information Processing Systems, 30.
[11] [Ramachandran et al., 2019] Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., and Shlens, J. (2019). Stand-alone self-attention in vision models. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
[12] [Touvron et al., 2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jegou, H. (2021). Training data-efficient image transformers & distillation through attention. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR.
[13] [Vaswani et al., 2017a] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017a). Attention is all you need. Advances in neural information processing systems, 30.