Online Continual Learning in Keyword Spotting for Low-Resource Devices via Pooling High-Order Temporal Statistics

By Umberto Michieli Samsung R&D Institute United Kingdom
By Pablo Peso Parada Samsung R&D Institute United Kingdom
By Mete Ozay Samsung R&D Institute United Kingdom

In this blog, we introduce our new work [1] published at the INTERSPEECH 2023 Conference.


Recent advancement in speech signal processing has produced extremely large transformer models with billions of parameters achieving outstanding performance compared to previous models.
Nonetheless, one of the main use-cases of such models lies in their implementation on limited-resource devices, such as virtual assistants on smartphones.
According to recent studies [2], users of such virtual assistants ask for on-device customization and personalization of these models, e.g., to recognize their own set of commands or to interact via their specific wake-up words, without sharing any data with a central server.

In this work, we address the need for efficient, on-device personalization of keyword spotting (KWS) models.

We formulate the problem as a class-incremental Online Continual Learning for Embedded devices (EOCL), where we exploit the features identified by a large pre-trained model to continuously recognize new keywords over time.

The desiderata of our setup are as follows:

1. Continual learning of new concepts without forgetting previous ones;

2. Learning online from a stream of data, since samples cannot be stored on device due to storage and privacy limitations;

3. Efficient update of massive speech processing transformer models via frozen backbone, small-batch updates and a limited number of training parameters with no backpropagation.
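The three desiderata above translate into a simple data protocol: classes arrive in tasks, and each sample is seen exactly once. The sketch below illustrates one possible such stream generator (the function name and task grouping are our own illustration, not code from the paper).

```python
import random

def eocl_stream(samples_by_class, classes_per_task=5, seed=0):
    """Sketch of a class-incremental online protocol: keyword classes
    arrive grouped into tasks, and every sample is seen exactly once
    (no replay buffer, matching the storage/privacy constraints above)."""
    rng = random.Random(seed)
    classes = sorted(samples_by_class)
    rng.shuffle(classes)
    for t, start in enumerate(range(0, len(classes), classes_per_task)):
        # gather all samples of this task's classes
        task = [(x, c) for c in classes[start:start + classes_per_task]
                for x in samples_by_class[c]]
        rng.shuffle(task)  # online: a single shuffled pass per task
        for x, y in task:
            yield t, x, y
```

A learner consuming this stream may update its parameters once per yielded sample and must never revisit earlier samples.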

Our Method: TAP-SLDA

The proposed method (TAP-SLDA) is composed of three main components, outlined in Figure 1 and below:

1. A feature extractor, e.g., a large transformer model for speech recognition such as Wav2Vec2 [3], pre-trained on the server side, transmitted to the device, and then frozen during on-device adaptation;

2. A pooling mechanism to extract rich information from the scarce input data. To perform pooling, we propose a Temporal Aware Pooling (TAP) technique that computes and concatenates the first R statistical moments of the feature extractor's output. This idea extends other recent works such as [4], where R=4 was used to enhance speaker verification tasks.

3. A classifier, trained incrementally over the samples of subsequent tasks. We instantiate the classifier as a modified version of SLDA [5] that processes the enriched feature space produced by TAP. It fits one Gaussian model per class, storing a running mean feature vector per class (i.e., a class prototype) and a single covariance matrix shared across classes, all updated online. During inference, the classifier assigns to the input sample the label of the most likely class Gaussian.
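To make the pipeline concrete, here is a minimal numpy sketch of the two learned-free components: TAP pooling over frame-level features, and a streaming-LDA classifier in the spirit of [5]. All names, the shrinkage constant, and implementation details are our own assumptions, not the authors' code.

```python
import numpy as np

def tap_pooling(frames, R=5):
    """Temporal Aware Pooling (sketch): concatenate the first R statistical
    moments of the frame-level features over time.
    frames: (T, D) backbone outputs -> (R*D,) pooled vector."""
    mu = frames.mean(axis=0)
    out = [mu]
    if R >= 2:
        centered = frames - mu
        sigma = centered.std(axis=0) + 1e-8
        out.append(sigma)
        # r-th standardized central moments (skewness, kurtosis, ...)
        for r in range(3, R + 1):
            out.append(((centered / sigma) ** r).mean(axis=0))
    return np.concatenate(out)

class StreamingLDA:
    """Minimal streaming-LDA classifier sketch: one running mean per class
    plus a single shared covariance matrix, all updated online."""
    def __init__(self, dim, shrink=1e-2):
        self.dim, self.shrink = dim, shrink
        self.mu, self.count = {}, {}
        self.cov, self.total = np.zeros((dim, dim)), 0

    def fit_one(self, x, y):
        m = self.mu.setdefault(y, np.zeros(self.dim))
        self.count.setdefault(y, 0)
        d = x - m
        if self.total > 0:  # running shared covariance of class residuals
            self.cov = (self.total * self.cov +
                        (self.total / (self.total + 1)) * np.outer(d, d)) / (self.total + 1)
        # running class mean (prototype) update
        self.mu[y] = (self.count[y] * m + x) / (self.count[y] + 1)
        self.count[y] += 1
        self.total += 1

    def predict(self, x):
        prec = np.linalg.inv(self.cov + self.shrink * np.eye(self.dim))
        # linear discriminant score of each class Gaussian
        return max(self.mu, key=lambda y: x @ prec @ self.mu[y]
                   - 0.5 * self.mu[y] @ prec @ self.mu[y])
```

Note that neither component needs backpropagation: pooling is a fixed function of the frozen backbone's output, and the classifier only accumulates running statistics.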

Figure 1.  Our method (TAP-SLDA) to tackle the EOCL task

Figure 2.  Distribution of statistical moments of the extracted features for different classes

Intuitively, our method achieves superior results because features of different classes have similar distributions of first-order moments, while higher-order moments capture the differences, as illustrated in Figure 2.
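This intuition can be reproduced with a toy numeric experiment (our own illustration): two synthetic feature "classes" share mean and variance, so first-order pooling cannot separate them, while the fourth standardized moment differs markedly.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic "classes" with identical mean and variance but different
# tail behaviour: only higher-order moments tell them apart.
gauss = rng.normal(0.0, 1.0, size=2000)
heavy = rng.laplace(0.0, 1.0 / np.sqrt(2), size=2000)  # same variance as gauss

def moments(x, R=5):
    """First R moments: mean, std, then standardized central moments."""
    z = (x - x.mean()) / x.std()
    return [x.mean(), x.std()] + [(z ** r).mean() for r in range(3, R + 1)]

m_g, m_h = moments(gauss), moments(heavy)
# mean and std nearly coincide, while the kurtosis (4th moment) differs:
# in expectation it is 3 for the Gaussian and 6 for the Laplace samples
```

A classifier fed only the averages of these features would be near chance, whereas one fed the concatenated moments has a clear separating dimension.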

Experimental Setup

Datasets. We employ two datasets to evaluate performance:

1. GSC-V2 [6], which comprises 35 English words, and

2. MSWC [7], from which we pick the 5 most represented languages (en, de, fr, ca, rw) and create 3 micro-sets with different numbers of keywords, N = {25, 50, 100}. We publicly release the splits.

Models. We use 6 recent speech recognition backbones as feature extractors: Wav2Vec2-Base/-Large [3], HuBERT-Base/-Large/-XLarge [8] and Emformer-Base [9]. All models have been pre-trained on LibriSpeech (English corpus).

Metrics. We evaluate performance according to several metrics.
Methods are ranked according to top-1 accuracy, Acc (%).
We consider various CL metrics [10] such as Backward Transfer (BwT, ↑), Forgetting (↓), and Plasticity (↑).
We also compare methods via their relative accuracy gain, i.e., the fraction of the remaining accuracy headroom that one method closes over another.
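For clarity, the relative gain we read from the reported numbers is computed as follows (the function name is ours; the formula is consistent with the figures quoted below, e.g., 76.7% → 85.5% giving a 37.8% relative gain):

```python
def relative_gain(acc_new, acc_ref):
    """Relative accuracy gain (%): the share of the remaining headroom
    (100 - acc_ref) that the new method closes over the reference."""
    return 100.0 * (acc_new - acc_ref) / (100.0 - acc_ref)

print(round(relative_gain(85.5, 76.7), 1))  # -> 37.8
```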

Our Results

TAP-SLDA vs. OCL methods is shown in Table 1. Some methods largely overfit to the most recently seen data and cannot mitigate forgetting (low Acc/BwT and high Forg/Pla). Other methods attempt a trade-off between plasticity and forgetting.

SLDA is the best approach, and we therefore selected it as the base for our TAP pooling mechanism. TAP-SLDA shows strong accuracy gains on every backbone, improving on average by a relative 37.8% over SLDA (85.5% vs. 76.7%). The key element of TAP-SLDA is the computation of the high-order feature space, which provides useful temporal characteristics to the subsequent centroid- and covariance-based modeling. That is, providing richer, temporal-aware statistics of the input waveform is beneficial for EOCL KWS, where models must adapt quickly, exploiting all available information and relating it to previous knowledge.

Table 1.  Results of 10 OCL methods and 6 backbones on the GSC-V2 dataset. Each entry is averaged over 5 distinct class orderings.

TAP improves every OCL method. To show the benefits of TAP for EOCL KWS, we apply it on top of each OCL method in Table 2. Comparing the results against Table 1, it emerges that TAP improves the OCL methods almost every time (7 exceptions out of 54 cases). In some cases, we observe large gains of up to ∼60%.

Table 2.  Accuracy of OCL methods employing our TAP on GSC-V2. Improvement with respect to original methods within brackets.

TAP outperforms other pooling schemes, as we show in Table 3 for the best OCL method (SLDA). The best methods prove to be those extracting temporal cues, for example via the covariance (iSQRT-COV) or variance (TSDP and TSTP) of the features. Using the standard deviation alone (TSDP) is less useful than using it in conjunction with the average (i.e., TSTP). TSTP shows robust gains over AVG in every scenario thanks to the preservation of temporal statistics after pooling. Our TAP builds on similar ideas and encodes further higher-order statistics in the pooled output. TAP is the best-performing method on every architecture except Emformer-Base, where it ranks second. On average, it improves by a relative 37.8% over AVG and by 19.4% over TSTP. This confirms the superiority of TAP for EOCL KWS, achieved by extracting rich temporal dynamics in a single pass over the input data.
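For reference, the simpler pooling baselines named in Table 3 reduce to the following one-liners over (T, D) frame features (sketches under our reading of the baselines; AVG and TSTP are the R=1 and R=2 special cases of TAP):

```python
import numpy as np

def avg_pool(f):
    # AVG: temporal average only
    return f.mean(axis=0)

def tsdp_pool(f):
    # TSDP: temporal standard deviation only
    return f.std(axis=0)

def tstp_pool(f):
    # TSTP: temporal average and standard deviation concatenated
    return np.concatenate([f.mean(axis=0), f.std(axis=0)])
```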

Table 3.  Accuracy of SLDA under different pooling strategies on GSC-V2 (best in bold, runner-up underlined).

In Table 4 we verify that a larger feature space is NOT all we need to succeed at our EOCL task: increasing the size of the pooled output is not, by itself, enough to achieve the performance gains observed so far. RAP [11] extracts the top-k% features: the large increase in feature size does not translate into accuracy gains. Motivated by the success of MAX, and aiming to bring temporal awareness into pooling, we considered a 2·l window around MAX (i.e., MAXWl). MAXWl can only match the results of MAX despite its larger feature size. At the extreme, FLAT flattens all features into a long 1D vector (i.e., no pooling).

Despite the drastic increase in feature size, these schemes cannot capture the temporal dynamics of the features. Our method, on the other hand, improves accuracy thanks to the richer temporal statistics extracted from the input features, while leaving the total number of parameters practically unchanged, as we show next.

Table 4.  Accuracy of pooling methods which increase the pooled feature size, evaluated on GSC-V2 using SLDA as the CL method, together with the average feature-size increment multiplier.

TAP adds only minimal overhead, as we observe in Table 5: 1) R = 5 always brings the highest accuracy for all 3 methods; and 2) the parameter overhead of TAP-SLDA (0.12% at the highest accuracy) is only marginally larger than that of SLDA (0.10%). On average, TAP increases training (inference) time by just 2.1% (2.6%).

Table 5.  Accuracy and parameter increase (%) over the backbone. Metrics averaged over the 6 networks on GSC-V2. We use TAP with variable R (R=1 is AVG). TAP has a minimal footprint.

TAP enhances personalization to other languages. We adapt a HuBERT-Base model, pre-trained on unlabelled English data only, to recognize keywords in different languages. Accuracy is lower than on GSC due to the harder MSWC benchmark (higher results could be achieved via multilingual pre-training, which is out of the scope of this work). Nonetheless, TAP-SLDA improves over the baseline on every language and class set (mean relative gain of 12.8%, and of 12.6% in the hardest case with 100 classes), making it extremely effective when the domain of use changes, as often happens for deployed KWS systems.

Figure 3.  Accuracy of HuBERT-Base on MSWC micro-splits. Accuracy of FT averaged over class sets are respectively: 1.6, 1.5, 1.6, 1.4, 1.7.


Conclusion

We proposed a new task: online continual learning for KWS models targeting low-resource devices with limited computational and storage capability.

We proposed a new method: TAP-SLDA, a parameter-efficient online continual learning method.

TAP-SLDA features:

• New temporal-aware pooling scheme based on the first 5 moments of extracted features

• Lightweight solution: frozen feature extractor + class-conditional Gaussian modelling of feature space

• Extraction of high-order statistical moments of the embedded features of input samples

• Robust results in a variety of scenarios on several backbones

Link to the paper


[1] U. Michieli, P. P. Parada and M. Ozay, “Online Continual Learning in Keyword Spotting for Low-Resource Devices via Pooling High-Order Temporal Statistics,” in INTERSPEECH, 2023.

[2] "The Rising Demand for Personalization is Driving the Intelligent Virtual Assistant Market," in The Digital Journal, 2023.

[3] A. Baevski, Y. Zhou, A. Mohamed and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in NeurIPS, 2020.

[4] L. You, W. Guo, L. Dai and J. Du, "Multi-Task Learning with High-Order Statistics for X-vector based Text-Independent Speaker Verification," in INTERSPEECH, 2019.

[5] T. L. Hayes and C. Kanan, "Lifelong machine learning with deep streaming linear discriminant analysis," in CVPRW, 2020.

[6] P. Warden, "Speech commands: A dataset for limited-vocabulary speech recognition," in arXiv:1804.03209, 2018.

[7] M. Mazumder, S. Chitlangia, C. Banbury, Y. Kang, J. M. Ciro, K. Achorn, D. Galvez, M. Sabini, P. Mattson, D. Kanter, G. Diamos, P. Warden, J. Meyer and V. J. Reddi, "Multilingual spoken words corpus," in NeurIPS, 2021.

[8] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov and A. Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.

[9] Y. Shi, Y. Wang, C. Wu, C.-F. Yeh, J. Chan, F. Zhang, D. Le and M. Seltzer, "Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition," in ICASSP, 2021.

[10] Z. Mai, R. Li, J. Jeong, D. Quispe, H. Kim and S. Sanner, "Online continual learning in image classification: An empirical survey," in Neurocomputing, 2022.

[11] S. Bera and V. K. Shrivastava, "Effect of pooling strategy on convolutional neural network for classification of hyperspectral remote sensing images," in IET Image Processing, 2020.