
[INTERSPEECH 2024 Series #1] Relational Proxy Loss for Audio–Text based Keyword Spotting

By Youngmoon Jung, Samsung Research
By Seungjin Lee, Samsung Research
By Joon-Young Yang, Samsung Research
By Jaeyoung Roh, Samsung Research
By Chang Woo Han, Samsung Research
By Hoon-Young Cho, Samsung Research

Interspeech is the world’s leading conference on the science and technology of speech recognition, speech synthesis, speaker recognition and speech and language processing.

The conference plays a crucial role in setting new technology trends and standards, as well as providing direction for future research.

In this blog series, we introduce some of the research papers we are presenting at INTERSPEECH 2024. Here is the list:

#1. Relational Proxy Loss for Audio-Text based Keyword Spotting (Samsung Research)

#2. NL-ITI: Probing optimization for improvement of LLM intervention method (Samsung R&D Institute Poland)

#3. High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model (Samsung Research)

#4. Speaker personalization for automatic speech recognition using Weight-Decomposed Low-Rank Adaptation (Samsung R&D Institute India-Bangalore)

#5. Speech Boosting: developing an efficient on-device live speech enhancement (Samsung Research)

#6. SummaryMixing makes speech technologies faster and cheaper (Samsung AI Center - Cambridge)

#7. A Unified Approach to Multilingual Automatic Speech Recognition with Improved Language Identification for Indic Languages (Samsung R&D Institute India-Bangalore)

Introduction



Keyword spotting (KWS) aims to identify pre-specified keywords in audio streams. Recently, KWS has attracted considerable research attention with the popularity of voice assistants that are triggered by keywords such as “Alexa,” “Hi Bixby,” or “Okay Google.” KWS techniques can be broadly classified into two categories: fixed KWS [1-3] and flexible (or user-defined) KWS [4-13]. Unlike fixed KWS, where users are required to utter fixed keywords exclusively, flexible KWS allows users to enroll and speak arbitrary keywords. Specifically, in flexible KWS, user-defined or customized keywords can be enrolled either in the form of audio [4-7] or text [8-13]. Because text-based keyword enrollment does not require multiple utterances of a target keyword and can be achieved simply by typing, the demand for text-enrolled flexible KWS is increasing.

Text-enrolled flexible KWS systems [8-13] typically utilize a text encoder during the enrollment phase and an acoustic encoder during the test phase, where both encoders are optimized with deep metric learning (DML) [14] objectives such as the contrastive loss [12], triplet-based losses [8], and proxy-based losses [9,10]. In text-enrolled flexible KWS in particular, acoustic and text embeddings representing the same keyword must be drawn closer together, while those representing different keywords must be pushed apart. Because the acceptance or rejection of a keyword is determined by measuring the similarity between the audio and text inputs, we refer to this task as audio-text based KWS.
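To make the matching step concrete, below is a minimal sketch (not the authors' implementation) of how an audio-text KWS system could accept or reject an enrolled keyword by thresholding the cosine similarity between an acoustic embedding and a text embedding. The 512-dimensional size matches the encoders described later in this post, while the function name and the threshold value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def keyword_detected(acoustic_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     threshold: float = 0.6) -> bool:
    """Accept the keyword if the cosine similarity between the acoustic
    embedding (from the test utterance) and the text embedding (from the
    enrolled keyword) exceeds a threshold.

    Both embeddings are 1-D tensors of the same dimension; the threshold
    of 0.6 is an arbitrary placeholder, not a value from the paper.
    """
    score = F.cosine_similarity(acoustic_emb, text_emb, dim=0)
    return bool(score > threshold)

# Example with random 512-dimensional embeddings (the AE/TE size used below).
ae = torch.randn(512)
te = torch.randn(512)
print(keyword_detected(ae, te))
```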

Figure 1. Illustration of Relational Proxy Loss.

In this blog post, we focus on the DML loss function and propose the Relational Proxy Loss (RPL). Motivated by Park et al.'s work [15], we leverage the structural relations of acoustic embeddings (AEs) and text embeddings (TEs). We bring this concept into the DML setting through the RPL, which exploits the relational information within AEs and within TEs. Similar to [9], we treat TEs as proxies. The essential information of the proxies, each representing a word class, is well captured by their structural relations, so we can expect the AEs belonging to the corresponding classes to follow the same relations. Fig. 1 illustrates the distinction between (a) traditional DML-based approaches and (b) our proposed RPL. While conventional DML losses compare an AE and a TE by computing their similarity (point-to-point), the proposed RPL compares structural relations within the embeddings (structure-to-structure). Through RPL, the structural relations of the AEs are adjusted and brought closer to those of the TEs, thereby enhancing the comparability between AEs and TEs. Specifically, we compute distance- and angle-wise relations for AEs and TEs independently. We then combine the point-to-point and structure-to-structure approaches, which yields better overall system performance than either approach alone.

Proposed method



Motivated by [9], we treat TEs as proxies representing distinct word classes because each TE is generated for a specific word class. The primary goal of the Relational Proxy Loss (RPL) is to transfer structural information from TEs to AEs by exploiting the relationships among TEs. By applying the RPL, we train the acoustic encoder to adopt the same relational structure as that of the text encoder through a relational potential function ϕ that quantifies the relational energy of the given n-tuple.

Figure 2. Illustration of three types of Relational Proxy Loss (RPL) in the embedding space: (a) distance-wise RPL (RPL-D), (b) angle-wise RPL (RPL-A), and (c) Prototypical RPL (RPL-P). d and l denote the distances between embeddings used in RPL-D and RPL-P, respectively. c_(y_2)^a refers to the prototype of the class to which a_2 belongs.

First, we define two types of potential functions: a distance-wise function ϕ_D over pairs of examples and an angle-wise function ϕ_A over triplets of examples. Based on ϕ_D and ϕ_A, we define the distance-wise RPL (RPL-D) and the angle-wise RPL (RPL-A), visualized in Fig. 2(a) and (b), respectively. RPL-D is formulated with the Huber loss [17] with a scaling factor of 1 and minimizes the difference between the relational energies measured via ϕ_D on the text and acoustic sides. Through this, the acoustic encoder is guided to focus on the distance structure of the AEs. The angle-wise potential function ϕ_A computes the angle formed by three instances in the embedding space. To simplify computation, we calculate the cosine of the angle, which can be expressed as an inner product, rather than the angle itself. RPL-A transfers relational knowledge about angles from the text encoder to the acoustic encoder, encouraging the acoustic encoder to focus on the angle structure of the AEs. Finally, we introduce an additional distance-based RPL that makes use of prototypes (i.e., class centroids). Adopting the concept from [18], we define the prototype of the AEs of class k as their centroid. As shown in Fig. 2(c), RPL-P transfers the distance between proxies (i.e., TEs) to the distance between each AE and its corresponding class centroid.
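The three RPL variants can be sketched roughly as follows, assuming PyTorch mini-batches `ae` and `te` of shape (N, D) with matching class labels and one text embedding per class, and applying the Huber loss (scaling factor 1) to the distance and angle structures. The function names, the absence of any distance normalization, and the treatment of degenerate triplets are our simplifications, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pairwise_distances(x: torch.Tensor) -> torch.Tensor:
    """phi_D: Euclidean distance between every pair of embeddings.
    x: (N, D) -> (N, N) distance matrix."""
    return torch.cdist(x, x, p=2)

def pairwise_angles(x: torch.Tensor) -> torch.Tensor:
    """phi_A: cosine of the angle formed at x_j by each triplet
    (x_i, x_j, x_k), computed as an inner product of unit difference
    vectors. x: (N, D) -> (N, N, N) tensor of cosine values."""
    diff = x.unsqueeze(0) - x.unsqueeze(1)            # diff[j, i] = x_i - x_j
    diff = F.normalize(diff, p=2, dim=-1, eps=1e-8)
    return torch.einsum('jid,jkd->ijk', diff, diff)   # <e_ij, e_kj>

def rpl_d(ae: torch.Tensor, te: torch.Tensor) -> torch.Tensor:
    """Distance-wise RPL: Huber loss (scaling factor 1) between the
    distance structures of the acoustic and text embeddings."""
    return F.smooth_l1_loss(pairwise_distances(ae), pairwise_distances(te))

def rpl_a(ae: torch.Tensor, te: torch.Tensor) -> torch.Tensor:
    """Angle-wise RPL: Huber loss between the angle structures."""
    return F.smooth_l1_loss(pairwise_angles(ae), pairwise_angles(te))

def rpl_p(ae: torch.Tensor, te: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Prototypical RPL (sketch): match the TE-to-proxy distances with
    the distances between each AE and its class centroid c^a."""
    classes = labels.unique()
    centroids = torch.stack([ae[labels == c].mean(dim=0) for c in classes])  # c_k^a
    proxies = torch.stack([te[labels == c][0] for c in classes])             # c_k^t (one TE per class)
    d_text = torch.cdist(te, proxies, p=2)      # phi_D(t_i, c_k^t)
    d_audio = torch.cdist(ae, centroids, p=2)   # phi_D(a_i, c_k^a)
    return F.smooth_l1_loss(d_audio, d_text)
```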

In the above, the classes of the TEs are assumed to be distinct from one another, as is commonly the case within a mini-batch. When the classes coincide, the RPLs behave similarly to other well-known losses. For example, RPL-D resembles the cross-entropy loss used for AE classification: since ϕ_D(t_i, t_j) equals zero, RPL-D pulls a_i and a_j closer together. Likewise, RPL-P is akin to the Prototypical loss [18]: if the class label y_i matches k, ϕ_D(t_i, c_k^t) becomes zero, and the AE moves towards its corresponding centroid because of the term ϕ_D(a_i, c_k^a).

Experiments

1. Model

For the acoustic encoder, we adopted the ECAPA-TDNN [27] with 256 filters in the convolutional layers (2.2M parameters). After statistics pooling, a 512-dimensional AE was obtained. For the text encoder, we incorporated a pre-trained byte pair encoding (BPE) [28] tokenizer and a two-layer bi-LSTM with 512 hidden units (10M parameters). We used the character-level BPE tokenizer implementation available from HuggingFace [29] and trained it with a vocabulary of 100 tokens on an internal dataset of 50M English texts. The tokenizer split the input text into a sequence of subword units, and the resulting sequence was transformed into a sequence of 1024-dimensional vectors by a trainable look-up table. This vector sequence was then fed into the bi-LSTM. At each time step, the outputs of the bi-LSTM were concatenated and passed through global average pooling, followed by a fully connected layer, producing a 512-dimensional TE. The design choices for these architectures were determined by the computational resources available for deployment on our target devices.
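The text encoder described above can be sketched as follows. The hyper-parameters (vocabulary of 100 tokens, 1024-dimensional look-up table, two-layer bi-LSTM with 512 hidden units per direction, 512-dimensional TE) follow the text, while initialization, padding handling, and the class name are our assumptions.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the text encoder: a trainable look-up table over subword
    tokens, a two-layer bi-LSTM with 512 hidden units per direction,
    global average pooling over time, and a fully connected layer
    producing a 512-dimensional text embedding (TE)."""

    def __init__(self, vocab_size: int = 100, emb_dim: int = 1024,
                 hidden: int = 512, te_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, te_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer subword indices from the tokenizer
        x = self.embedding(token_ids)   # (batch, seq_len, 1024)
        x, _ = self.bilstm(x)           # (batch, seq_len, 1024), fwd/bwd concatenated
        x = x.mean(dim=1)               # global average pooling over time
        return self.fc(x)               # (batch, 512) text embedding

# Example: encode a batch of two token sequences of length 6.
te = TextEncoder()(torch.randint(0, 100, (2, 6)))
print(te.shape)  # torch.Size([2, 512])
```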

2. Experiment setup

For training, we used the King-ASR-066 [19] dataset, which contains recordings from 2.6k native English speakers. Each sample was recorded at a 16 kHz sampling rate in home/office environments. All utterances were divided into word-level segments by forced alignment using the Montreal Forced Aligner [20], resulting in 4.6k hours of speech data. The training set comprised phrases containing one to five words, with a total of 211,676 classes. For data augmentation, we convolved the clean speech signals with synthetic room impulse responses (RIRs) from the OpenSLR dataset [21] and added various kinds of noise, such as babble, car, and music, at signal-to-noise ratios (SNRs) randomly selected between -3 and 25 dB. To evaluate all systems under consideration, we used the Wall Street Journal (WSJ) [22] corpus, prepared with the WSJ recipe available in the Kaldi toolkit [23]. To simulate real-world scenarios, we generated noisy and reverberant speech by convolving synthetic RIRs from the OpenSLR dataset and adding noise from the MUSAN dataset [24] at SNRs varying from 5 to 25 dB. During evaluation, we randomly selected 1.3M positive and 13M negative pairs from the test set. We measured model performance using two metrics that are commonly used in KWS: the equal error rate (EER) [11-13] and the average precision (AP) [8,9,25,26].
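As a rough illustration of the two metrics (not the authors' evaluation script), EER and AP can be computed from pairwise similarity scores and binary pair labels with scikit-learn, as sketched below; the toy scores are synthetic.

```python
import numpy as np
from sklearn.metrics import roc_curve, average_precision_score

def eer_and_ap(labels: np.ndarray, scores: np.ndarray):
    """Compute the equal error rate (EER) and average precision (AP)
    from binary pair labels (1 = positive audio-text pair) and
    similarity scores. The EER is approximated at the operating point
    where the false-positive and false-negative rates are closest."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    ap = average_precision_score(labels, scores)
    return eer, ap

# Toy example: 5 positive and 15 negative pairs with synthetic scores.
rng = np.random.default_rng(0)
labels = np.array([1] * 5 + [0] * 15)
scores = np.concatenate([rng.normal(0.7, 0.1, 5), rng.normal(0.3, 0.1, 15)])
print(eer_and_ap(labels, scores))
```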

3. Results

Table 1. Ablation study for Relational Proxy Loss (RPL).

We use ‘P2P’ and ‘S2S’ to represent ‘point-to-point’ and ‘structure-to-structure’ methods, respectively, as shown in Fig. 1. To ensure a fair comparison, all systems share the same experimental setup, including datasets and model architectures, but differ only in their training criteria.

First, we conduct an ablation study to verify the effectiveness of each RPL loss, as presented in Table 1. Here, the P2P loss is the AdaMS method [10], which incorporates an adaptive margin and scale (AdaMS) into the AsyP loss [9]. This serves as the baseline in the table, achieving an average precision (AP) of 71.9%. We observe that each of the three RPL losses (i.e., RPL-D, RPL-A, and RPL-P) improves the performance of the P2P method. Integrating all RPL losses leads to a noticeable improvement over the 'P2P' method, resulting in an AP of 79.4%, a relative improvement (RI) of 10.43% over the baseline. We also investigate the performance when only the three RPL losses are employed without P2P; in this case, the performance declines to an AP of 66.8%. Based on these findings, we conclude that employing both P2P and S2S yields better performance than either approach alone, and that using all four losses gives the best performance among the systems presented in the table.
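The best configuration in Table 1 thus combines the P2P loss with all three RPL terms. One hypothetical way to combine them during training is a simple weighted sum, as sketched below; the λ weights are placeholders, since the blog does not specify the actual weighting.

```python
def total_loss(l_p2p, l_rpl_d, l_rpl_a, l_rpl_p,
               lam_d: float = 1.0, lam_a: float = 1.0, lam_p: float = 1.0):
    """Hypothetical weighted sum of the point-to-point (AdaMS) loss and
    the three RPL terms; the lambda weights are placeholders, not the
    values used in the paper."""
    return l_p2p + lam_d * l_rpl_d + lam_a * l_rpl_a + lam_p * l_rpl_p
```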

Table 2. Effect of auxiliary losses (AL) for RPL.

Table 2 shows the effect of auxiliary losses (AL) on the performance of our proposed system. Here, 'RPL' refers to the system in Table 1 that uses P2P together with the three RPL losses. The first auxiliary loss is the prototype-centroid matching loss, L_pc, computed as the average Euclidean distance between each prototype (i.e., TE) and the corresponding centroid of the AEs within a mini-batch. The losses discussed so far all concern the relationship between AE and TE, whether P2P or S2S. However, we find that incorporating losses from the speech-enrolled KWS approach is also beneficial for text-enrolled KWS. Specifically, we adopt two auxiliary losses from [7], namely L_mono and L_triplet, which are associated exclusively with the acoustic encoder.
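Based on that description, L_pc could be sketched as follows; the batching assumption (one TE acting as the prototype per class in the mini-batch) and the function name are ours.

```python
import torch

def prototype_centroid_matching_loss(ae: torch.Tensor,
                                     te: torch.Tensor,
                                     labels: torch.Tensor) -> torch.Tensor:
    """L_pc sketch: average Euclidean distance between each class
    prototype (its TE) and the centroid of that class's AEs within
    the mini-batch."""
    dists = []
    for c in labels.unique():
        proxy = te[labels == c][0]              # TE acting as the class prototype
        centroid = ae[labels == c].mean(dim=0)  # centroid of the class's AEs
        dists.append(torch.norm(proxy - centroid, p=2))
    return torch.stack(dists).mean()
```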

Table 3. Comparison of the performance between the proposed methods and other state-of-the-art systems.

In Table 3, we compare the performance of our proposed systems, 'RPL' and 'RPL+AL', with that of other state-of-the-art methods. RPL clearly outperforms the baseline systems, and adding the auxiliary losses leads to further improvement. 'RPL+AL' attains a relative improvement of 22.96% in EER over PATN and of 15.71% in AP over AsyP.

Conclusions



We have introduced the Relational Proxy Loss (RPL) for audio-text based keyword spotting (KWS). Unlike previous deep metric learning approaches that compare an acoustic embedding (AE) and a text embedding (TE) only in a point-to-point manner, RPL exploits the structural relations within AEs and within TEs in terms of distance and angle. Specifically, we introduced three variants of RPL, namely RPL-D, RPL-A, and RPL-P. Our experiments demonstrated that combining the point-to-point and structure-to-structure approaches leads to better performance, and further improvement was achieved by incorporating auxiliary losses. On the noisy and reverberant WSJ test set, the proposed RPL and RPL+AL outperformed existing techniques, including both speech-enrolled and text-enrolled KWS methods. In future work, we will focus on audio-text alignment and try to integrate our RPL into audio-text alignment frameworks, which predominantly adopt naive loss functions.

Link to the paper



https://arxiv.org/pdf/2406.05314

References

[1] G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4087–4091.

[2] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Proc. Interspeech, 2015, pp. 1478–1482.

[3] R. Tang and J. J. Lin, “Deep residual learning for small-footprint keyword spotting,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5484–5488.

[4] G. Chen, C. Parada, and T. N. Sainath, “Query-by-example keyword spotting using long short-term memory networks,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5236–5240.

[5] J. Huang, W. Gharbieh, H. S. Shim, and E. Kim, “Query-by-example keyword spotting system using multi-head attention and soft-triple loss,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6858–6862.

[6] K. R, V. K. Kurmi, V. Namboodiri, and C. V. Jawahar, “Generalized keyword spotting using ASR embeddings,” in Proc. Interspeech, 2022, pp. 126–130.

[7] H. Lim, Y. Kim, Y. Jung, M. Jung, and H. Kim, “Learning acoustic word embeddings with phonetically associated triplet network,” arXiv:1811.02736, 2018.

[8] W. He, W. Wang, and K. Livescu, “Multi-view recurrent neural acoustic word embeddings,” in Proc. International Conference on Learning Representations (ICLR), 2017.

[9] M. Jung and H. Kim, “Asymmetric proxy loss for multi-view acoustic word embeddings,” in Proc. Interspeech, 2022, pp. 5170–5174.

[10] M. Jung and H. Kim, “AdaMS: Deep metric learning with adaptive margin and adaptive scale for acoustic word discrimination,” in Proc. Interspeech, 2023, pp. 3924–3928.

[11] H.-K. Shin, H. Han, D. Kim, S.-W. Chung, and H.-G. Kang, “Learning audio-text agreement for open-vocabulary keyword spotting,” in Proc. Interspeech, 2022, pp. 1871–1875.

[12] K. Nishu, M. Cho, and D. Naik, “Matching latent encoding for audio-text based keyword spotting,” in Proc. Interspeech, 2023, pp. 1613–1617.

[13] Y.-H. Lee and N. Cho, “PhonMatchNet: Phoneme-guided zero-shot keyword spotting for user-defined keywords,” in Proc. Interspeech, 2023, pp. 3964–3968.

[14] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott, “Multi-similarity loss with general pair weighting for deep metric learning,” in Proc. the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5022–5030.

[15] W. Park, D. Kim, Y. Lu, and M. Cho, “Relational knowledge distillation,” in Proc. the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3967–3976.

[16] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Deep metric learning for person re-identification,” in Proc. International Conference on Pattern Recognition, 2014, pp. 34–39.

[17] P. J. Huber, “Robust estimation of a location parameter,” The Annals of Mathematical Statistics, vol. 35, pp. 73–101, 1964.

[18] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2017.

[19] DataOceanAI, “King-ASR-066,” 2015. [Online]. Available: https://en.speechocean.com/datacenter/details/1446.html

[20] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal Forced Aligner: Trainable text-speech alignment using Kaldi,” in Proc. Interspeech, 2017, pp. 498–502.

[21] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5220–5224.

[22] D. B. Paul and J. M. Baker, “The design for the wall street journal-based CSR corpus,” in Proc. the Workshop on Speech and Natural Language, 1992, pp. 357–362.

[23] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.

[24] D. Snyder, G. Chen, and D. Povey, “MUSAN: A music, speech, and noise corpus,” arXiv:1510.08484v1, 2015.

[25] Y. Hu, S. Settle, and K. Livescu, “Multilingual jointly trained acoustic and written word embeddings,” in Proc. Interspeech, 2020, pp. 1052–1056.

[26] M. Jung, H. Lim, J. Goo, Y. Jung, and H. Kim, “Additional shared decoder on siamese multi-view encoders for learning acoustic word embeddings,” in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2019.

[27] B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” arXiv:2005.07143, 2020.

[28] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proc. the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016, pp. 1715–1725.

[29] A. Moi and N. Patry, “HuggingFace’s Tokenizers,” 2023. [Online]. Available: https://github.com/huggingface/tokenizers