The IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is an annual flagship conference organized by the IEEE Signal Processing Society. It is the world’s largest and most comprehensive technical conference focused on signal processing and its applications, offering a technical program that presents the latest developments in research and technology and attracts thousands of professionals. In this blog series, we introduce our research papers presented at ICASSP 2025, including:

#4. Text-aware adapter for few-shot keyword spotting (AI Center - Seoul)
#8. Globally Normalizing the Transducer for Streaming Speech Recognition (AI Center - Cambridge)
Keyword spotting (KWS) is a technique for detecting pre-defined keywords in audio streams. Unlike fixed KWS [1]–[3], which requires users to exclusively use specific keywords, flexible KWS allows users to utilize any custom keyword. Custom keywords can be enrolled in flexible KWS either through audio [4]–[6] or text [7]–[11]. Since text-based keyword enrollment does not require multiple utterances of a target keyword and can be achieved easily via text input, the demand for Text-enrolled Flexible KWS (TF-KWS) is growing.
TF-KWS systems typically use a text encoder for enrollment and an acoustic encoder for testing, both of which are optimized using deep metric learning (DML) [12] objectives such as contrastive loss [9], triplet-based loss [7], and proxy-based loss [8], [11]. Specifically for TF-KWS, the text embedding (TE) is learned to act as a representative vector for its corresponding keyword, thus attracting the acoustic embedding (AE) of the same keyword and repelling AEs of different keywords in the shared embedding space.
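To make this attract/repel behavior concrete, the PyTorch sketch below shows a generic proxy-style DML objective in which each keyword’s TE acts as the proxy for its class. This is a simplified stand-in for the cited objectives (e.g., RPL or other proxy-based losses), not their exact formulation; the function name and the scaling factor are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def proxy_style_loss(ae: torch.Tensor, te_proxies: torch.Tensor,
                     labels: torch.Tensor, scale: float = 30.0) -> torch.Tensor:
    """Illustrative proxy-style DML objective (not the exact RPL loss):
    each keyword's text embedding acts as a proxy; acoustic embeddings are
    pulled toward the proxy of their own keyword and pushed away from the
    others via a softmax over scaled cosine similarities."""
    ae = F.normalize(ae, dim=-1)               # (batch, dim)
    proxies = F.normalize(te_proxies, dim=-1)  # (num_keywords, dim)
    logits = scale * ae @ proxies.t()          # scaled cosine similarities
    return F.cross_entropy(logits, labels)     # labels: (batch,) keyword ids
```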
While TF-KWS models can support an unlimited number of keywords, their performance does not match that of keyword-specific models trained with abundant data for each keyword [13]. Thus, there remains room to enhance performance for specific target keywords. Given the difficulty of collecting large amounts of data for a particular keyword, this problem can be approached through few-shot learning [14]. To address it, we propose a few-shot transfer learning approach called the text-aware adapter (TA-adapter). To the best of our knowledge, this is the first work to apply few-shot transfer learning to TF-KWS.
Our research aims to adapt a small portion of the pre-trained acoustic encoder to a target keyword using limited speech data, leveraging the TE extracted from the corresponding text encoder. The TA-adapter consists of three main components: text-conditioned feature modulation (TCFM), a feature weight adapter (FW-adapter), and a TE classifier. Thanks to its modular design, the TA-adapter enables seamless restoration to the original pre-trained model, facilitating rapid adaptation to various target keywords. In our experiments, we evaluate the performance of our method on the Google Speech Commands (GSC) V2 dataset [15] under noisy and reverberant conditions.
Figure 1. Overall architecture of the text-aware adapter (TA-adapter). t and x represent the input text and speech associated with keyword k. The red line indicates the text embedding (TE) classifier and text-conditioned feature modulation (TCFM).
For the pre-trained model, we trained the acoustic and text encoders using Relational Proxy Loss (RPL) [11]. We employed the ECAPA-TDNN architecture [28] as the acoustic encoder.
Figure 2. Comparison between (a) AdaIN-based conditioning and (b) text-conditioned feature modulation (TCFM).
First, for the TA-adapter, we adopt TCFM to transfer the target keyword information from the TE into the pre-trained acoustic encoder. By freezing the text encoder, we extract a TE for the target keyword, which serves as a representative vector for that keyword. Fig. 2 highlights the difference between AdaIN [23]-based conditioning and TCFM. TCFM injects the keyword information by learning a weighted combination of basic activation functions conditioned on the TE, requiring significantly fewer parameters. We employ six activation functions from the set of basic activation functions in [24], selected based on their validation performance: ELU, hard sigmoid, ReLU, softplus, swish, and tanh.
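As an illustration of this mechanism, the PyTorch sketch below conditions an acoustic feature map on the TE by predicting mixture weights over the six basic activation functions. The module name, the softmax over the weights, and the tensor shapes are assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCFM(nn.Module):
    """Sketch of text-conditioned feature modulation: the text embedding (TE)
    predicts mixture weights over a fixed set of basic activation functions,
    which are then applied to the acoustic feature map."""
    def __init__(self, te_dim: int, num_acts: int = 6):
        super().__init__()
        # Small linear head mapping the TE to one weight per activation function.
        self.weight_head = nn.Linear(te_dim, num_acts)
        # The six activations named above; swish corresponds to SiLU in PyTorch.
        self.acts = [F.elu, F.hardsigmoid, F.relu, F.softplus, F.silu, torch.tanh]

    def forward(self, feat: torch.Tensor, te: torch.Tensor) -> torch.Tensor:
        # feat: (batch, channels, time), te: (batch, te_dim)
        # Softmax normalization of the mixture weights is an assumption.
        w = torch.softmax(self.weight_head(te), dim=-1)              # (batch, num_acts)
        stacked = torch.stack([a(feat) for a in self.acts], dim=-1)  # (batch, C, T, num_acts)
        return (stacked * w[:, None, None, :]).sum(dim=-1)           # (batch, C, T)
```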
In addition to feature modulation, we aim to refine the weighting and aggregation process of features within the acoustic encoder by adjusting attention weights and activation distributions. We hypothesize that essential keyword features have already been effectively learned by the pre-trained acoustic encoder using extensive training samples. Therefore, when transferring information about the target keyword with limited samples, it is sufficient to adjust the aggregation of low-level features into higher-level ones by emphasizing feature importance, which can be achieved through squeeze-and-excitation (SE) and batch normalization (BN). As shown in Fig. 1, we adapt only the BN and SE modules, highlighted in green. Our experiments demonstrate that TCFM and the FW-adapter work synergistically, complementing each other.
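A minimal sketch of this FW-adapter-style tuning in PyTorch is shown below: the acoustic encoder is frozen, and gradients are re-enabled only for BN layers and SE blocks. The name-based lookup of SE modules is an assumption that depends on how the encoder is actually implemented.

```python
import torch.nn as nn

def apply_fw_adapter(model: nn.Module, se_keyword: str = "se") -> None:
    """Sketch of FW-adapter-style tuning: freeze the whole acoustic encoder,
    then unfreeze only BatchNorm layers and squeeze-and-excitation (SE) blocks.
    The `se_keyword` used to locate SE modules by name is an assumption."""
    # Freeze every parameter first.
    for p in model.parameters():
        p.requires_grad = False
    # Re-enable training only for BN layers and modules whose name suggests SE.
    for name, module in model.named_modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d)) or se_keyword in name.lower():
            for p in module.parameters():
                p.requires_grad = True
```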
The TE classifier process is illustrated by the red line in Fig. 1 (labeled ‘TE classifier’). With TCFM and the FW-adapter applied, an AE is extracted and passed through the final fully-connected (FC) layer, followed by a sigmoid activation. The acoustic encoder is adapted using a binary cross-entropy loss. Instead of learning a new weight vector from scratch, we fix the TE as the weight vector of the final classification layer. This is reasonable because the TE has already been trained as a representative vector for its corresponding keyword during the pre-training phase using DML.
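The sketch below illustrates this idea in PyTorch: the frozen TE of the target keyword serves as the weight vector of the final layer, and the keyword score is the sigmoid of its dot product with the AE. The class name and the absence of a bias term are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TEClassifier(nn.Module):
    """Sketch of the TE classifier: the frozen text embedding of the target
    keyword is used as the weight vector of the final layer; the keyword
    score is obtained with a sigmoid. Shapes and naming are illustrative."""
    def __init__(self, text_embedding: torch.Tensor):
        super().__init__()
        # Fixed (non-trainable) weight vector taken from the text encoder.
        self.register_buffer("te", text_embedding)  # (embed_dim,)

    def forward(self, acoustic_embedding: torch.Tensor) -> torch.Tensor:
        # acoustic_embedding: (batch, embed_dim) -> keyword score in (0, 1)
        logits = acoustic_embedding @ self.te
        return torch.sigmoid(logits)

# Adaptation then uses binary cross-entropy against keyword/non-keyword labels:
# loss = F.binary_cross_entropy(scores, labels.float())
```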
Table 1. GSC V2 dataset used in the experiment.
The pre-training strategy followed the same approach as our previous work [11], including the use of identical datasets, acoustic features, and model architectures. For the TA-adapter, we used the GSC V2 dataset [15], which contains 35 keywords. Compared to the training set used for pre-training [33], 10 of these keywords were seen during pre-training, while 25 were unseen (as shown in Table 1). We evaluated the model under three low-resource scenarios: 5-shot, 10-shot, and 15-shot learning. For each keyword, we built three separate models by randomly selecting 5, 10, and 15 samples, respectively, for each scenario. To develop a keyword-specific model, we fine-tuned the pre-trained model for each keyword individually. The official validation and test sets of GSC V2 were used for model selection and evaluation. Table 1 shows the average counts of positive and negative samples for the corresponding keywords. To simulate real-world conditions, we generated noisy and reverberant speech by convolving the speech with synthetic room impulse responses (RIRs) from the OpenSLR dataset [34] and adding noise from the MUSAN dataset [35], with signal-to-noise ratios (SNRs) ranging from 5 to 25 dB.
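For reference, the snippet below sketches this kind of augmentation with NumPy/SciPy: the clean speech is convolved with an RIR, and noise is added at a target SNR. Length handling and normalization are simplified, and the function name is an assumption.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_noisy_reverberant(speech: np.ndarray, rir: np.ndarray,
                               noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Sketch of the augmentation described above: convolve clean speech with a
    room impulse response, then add noise scaled to the requested SNR.
    Assumes `noise` is at least as long as `speech`; trimming is simplified."""
    reverberant = fftconvolve(speech, rir)[: len(speech)]
    noise = noise[: len(reverberant)]
    speech_power = np.mean(reverberant ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(speech_power / noise_power') = snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + scale * noise

# e.g., snr_db drawn uniformly from [5, 25] dB, as in the experiments above.
```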
Table 2. AP (%) for FW-adapter and TE classifier (clf). ‘BN/SE Gx’ is BN/SE adaptation in group x from Fig. 1. ‘# Params’ is the number of tunable parameters.
Table 2 presents an ablation study that evaluates the effectiveness of the FW-adapter and the TE classifier. For simplicity, the experiment focuses on a 15-shot scenario with five keywords each from the seen and unseen sets: 1) seen keywords: ‘follow’, ‘happy’, ‘house’, ‘one’, and ‘seven’; 2) unseen keywords: ‘cat’, ‘dog’, ‘eight’, ‘nine’, and ‘off’. The table reports the average AP (%) for the seen and unseen keywords, as well as the average of the two (‘Avg.’). ‘PT’ and ‘FT’ refer to the pre-trained and fully fine-tuned models without the adapter, respectively. ‘FT clf’ represents the model where the network is frozen and only an additional classifier (clf) is fine-tuned. For ‘PT’, the score is obtained from the cosine similarity between the AE and TE, which corresponds to using the TE classifier.
Interestingly, ‘FT (15-shot)’ performs worse (67.04%) than ‘PT’ (72.02%), as 15-shot samples are insufficient for fine-tuning all the parameters. However, when using all available samples for fine-tuning (‘FT (full-shot)’), the model achieves remarkable performance (95.35%), although this is impractical due to the excessive cost of data collection; Table 1 shows the average number of training samples. Freezing the model and fine-tuning only an additional classifier (‘FT clf’) yields poor performance (50.14%). Comparing ‘PT’ and ‘FT clf’, the TE classifier clearly boosts few-shot KWS performance.
We evaluate performance by individually or collectively adapting the BN layers in each group. Hereafter, we omit ‘(15-shot)’ from the method names, but all methods continue to use 15-shot samples. Regardless of the adaptation location within the ECAPA-TDNN, BN adaptation consistently outperforms the pre-trained model for both seen and unseen keywords, validating our hypothesis that adjusting feature weights at each layer enables successful adaptation to a target keyword. Applying BN adaptation across all groups (‘BN’) yields better results (79.40%) than fine-tuning individual groups, indicating that the gains from individual group adaptations are complementary and cumulative. Combining SE and BN adaptations yields further improvements. However, in contrast to BN, applying SE adaptation to all groups (‘SE & BN’) performs worse than applying it individually to each group. We hypothesize that this is due to the excessive number of additional parameters for few-shot KWS. The best performance is an AP of 84.33%, achieved with ‘SE G3 & BN’, without adding any extra parameters.
Table 3. Performance of TCFM in terms of AP (%).
Table 3 presents the ablation results on TCFM, where ‘FW-adapter’ corresponds to ‘SE G3 & BN’. Conditioning keyword information at the lower layers (i.e., G0 to G3) degrades performance. We suspect that directly modifying their features with limited samples impairs performance, as these early layers generate fundamental features that eventually form keyword-specific representations. In contrast, the FW-adapter merely adjusts feature aggregation, enabling effective adaptation at any layer and improving performance compared to the pre-trained model. Our results suggest that applying TCFM to higher layers is suitable for conditioning TE, consistent with AdaKWS [22], where KAMs are inserted just before the final classifier. Optimal performance is achieved when applying TCFM to G4 and G5, boosting the AP of TF-KWS (i.e., ‘PT’ in Table 2) from 72.02% to 87.22%. Although TCFM slightly increases the total number of parameters (by 3.2K, representing 0.14% of the original count), the performance gain is substantial.
Table 4. Performance comparison with baseline approaches in terms of AP (%) and EER (%). †: results without fine-tuning.
Finally, Table 4 compares the performance of the TA-adapter with other baseline approaches using all 35 keywords. ‘TA-adapter’ corresponds to the best-performing model from Table 3. Across all scenarios, the TA-adapter consistently outperforms the baseline systems, including TF-KWS models (‘AdaMS’ and ‘RPL’) and few-shot learning approaches (‘2-class clf’, ‘3-class clf’, and ‘AdaKWS’). Notably, the performance gap widens under lower-resource conditions.
This paper introduces the TA-adapter to address the few-shot transfer learning problem in TF-KWS. The TA-adapter leverages the text embedding to condition keyword information into the acoustic encoder and to generate a keyword score. In addition, BN layers and SE modules are adapted using only a few samples of the target keyword. Experimental results demonstrate that the TA-adapter effectively overcomes the challenges of few-shot KWS. Specifically, it boosts the pre-trained model’s average precision from 77.93% to 87.63% using just 5 target samples. Since the TA-adapter modifies only a small subset of the model, it enables seamless reversion to the original pre-trained model. In future work, we plan to leverage text-to-speech (TTS) technology to generate synthetic data, enabling zero-shot transfer learning for TF-KWS.
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10890609
[1] G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4087–4091.
[2] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Proc. Interspeech, 2015, pp. 1478–1482.
[3] R. Tang and J. J. Lin, “Deep residual learning for small-footprint keyword spotting,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5484–5488.
[4] G. Chen, C. Parada, and T. N. Sainath, “Query-by-example keyword spotting using long short-term memory networks,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5236–5240.
[5] J. Huang, W. Gharbieh, H. S. Shim, and E. Kim, “Query-by-example keyword spotting system using multi-head attention and soft-triple loss,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6858–6862.
[6] K. R, V. K. Kurmi, V. Namboodiri, and C. V. Jawahar, “Generalized keyword spotting using ASR embeddings,” in Proc. Interspeech, 2022, pp. 126–130.
[7] W. He, W. Wang, and K. Livescu, “Multi-view recurrent neural acoustic word embeddings,” in Proc. International Conference on Learning Representations (ICLR), 2017.
[8] M. Jung and H. Kim, “AdaMS: Deep metric learning with adaptive margin and adaptive scale for acoustic word discrimination,” in Proc. Interspeech, 2023, pp. 3924–3928.
[9] K. Nishu, M. Cho, and D. Naik, “Matching latent encoding for audio-text based keyword spotting,” in Proc. Interspeech, 2023, pp. 1613–1617.
[10] Y.-H. Lee and N. Cho, “PhonMatchNet: Phoneme-guided zero-shot keyword spotting for user-defined keywords,” in Proc. Interspeech, 2023, pp. 3964–3968.
[11] Y. Jung, S. Lee, J.-Y. Yang, J. Roh, C. W. Han, and H.-Y. Cho, “Relational proxy loss for audio-text based keyword spotting,” in Proc. Interspeech, 2024, pp. 327–331.
[12] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott, “Multi-similarity loss with general pair weighting for deep metric learning,” in Proc. the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5022–5030.
[13] T. Bluche and T. Gisselbrecht, “Predicting detection filters for small footprint open-vocabulary keyword spotting,” in Proc. Interspeech, 2020, pp. 2552–2556.
[14] W.-T. Kao, Y.-K. Wu, C.-P. Chen, Z.-S. Chen, Y.-P. Tsai, and H.-Y. Lee, “On the efficiency of integrating self-supervised learning and meta-learning for user-defined few-shot keyword spotting,” in Proc. SLT, 2022, pp. 414–421.
[15] P. Warden, “Speech Commands: A dataset for limited-vocabulary speech recognition,” arXiv:1804.03209, 2018.
[16] M. Mazumder, C. Banbury, J. Meyer, P. Warden, and V. J. Reddi, “Few-shot keyword spotting in any language,” in Proc. Interspeech, 2021, pp. 4214–4218.
[17] A. Awasthi, K. Kilgour, and H. Rom, “Teaching keyword spotters to spot new keywords with limited examples,” in Proc. Interspeech, 2021, pp. 4254–4258.
[18] J. Jung, Y. Kim, J. Park, Y. Lim, B.-Y. Kim, Y. Jang, and J. S. Chung, “Metric learning for user-defined keyword spotting,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[19] J. Yuan, Y. Shi, L. Li, D. Wang, and A. Hamdulla, “Few-shot keyword spotting from mixed speech,” in Proc. Interspeech, 2024, pp. 5063–5067.
[20] S.-A. Rebuffi, H. Bilen, and A. Vedaldi, “Learning multiple visual domains with residual adapters,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2017.
[21] J. Pfeiffer, A. Ruckle, C. Poth, A. Kamath, I. Vulic, S. Ruder, K. Cho, and I. Gurevych, “Adapterhub: A framework for adapting transformers,” in Proc. EMNLP, 2020.
[22] A. Navon, A. Shamsian, N. Glazer, G. Hetz, and J. Keshet, “Open-vocabulary keyword-spotting with adaptive instance normalization,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 11656–11660.
[23] X. Huang and S. J. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in Proc. IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1510–1519.
[24] A. G. C. P. Ramos, A. Mehrotra, N. D. Lane, and S. Bhattacharya, “Conditioning sequence-to-sequence networks with learned activations,” in Proc. International Conference on Learning Representations (ICLR), 2022.
[25] S. S. Sarfjoo, S. R. Madikeri, P. Motlicek, and S. Marcel, “Supervised domain adaptation for text-independent speaker verification using limited data,” in Proc. Interspeech, 2020, pp. 3815–3819.
[26] T. Wang, L. Li, and D. Wang, “SE/BN adapter: Parametric efficient domain adaptation for speaker recognition,” in Proc. Interspeech, 2024, pp. 2145–2149.
[27] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141.
[28] B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” arXiv:2005.07143, 2020.
[29] Z. Zhao, Z. Li, W. Wang, and P. Zhang, “PCF: ECAPA-TDNN with progressive channel fusion for speaker verification,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[30] C. L. et al., “Dynamic TF-TDNN: Dynamic time delay neural network based on temporal-frequency attention for dialect recognition,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[31] H. Li, B. Yang, Y. Xi, L. Yu, T. Tan, H. Li, and K. Yu, “Text-aware speech separation for multi-talker keyword spotting,” in Proc. Interspeech, 2024, pp. 337–341.
[32] S. Gao, M.-M. Cheng, K. Zhao, X. Zhang, M.-H. Yang, and P. H. S. Torr, “Res2Net: A new multi-scale backbone architecture,” IEEE TPAMI, 2019.
[33] DataOceanAI, “King-ASR-066,” 2015. [Online]. Available: https://en.speechocean.com/datacenter/details/1446.html
[34] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5220–5224.
[35] D. Snyder, G. Chen, and D. Povey, “MUSAN: A music, speech, and noise corpus,” arXiv:1510.08484, 2015.
[36] Y. Hu, S. Settle, and K. Livescu, “Multilingual jointly trained acoustic and written word embeddings,” in Proc. Interspeech, 2020, pp. 1052–1056.