
[INTERSPEECH 2024 Series #6] SummaryMixing makes speech technologies faster and cheaper

By Titouan Parcollet, Samsung AI Center - Cambridge
By Rogier Van Dalen, Samsung AI Center - Cambridge
By Shucong Zhang, Samsung AI Center - Cambridge
By Sourav Bhattacharya, Samsung AI Center - Cambridge

Interspeech is the world’s leading conference on the science and technology of speech recognition, speech synthesis, speaker recognition and speech and language processing.

The conference plays a crucial role in setting new technology trends and standards, as well as providing direction for future research.

In this blog series, we introduce some of our research papers presented at INTERSPEECH 2024. Here is the list:

#1. Relational Proxy Loss for Audio-Text based Keyword Spotting (Samsung Research)

#2. NL-ITI: Probing optimization for improvement of LLM intervention method (Samsung R&D Institute Poland)

#3. High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model (Samsung Research)

#4. Speaker personalization for automatic speech recognition using Weight-Decomposed Low-Rank Adaptation (Samsung R&D Institute India-Bangalore)

#5. Speech Boosting: developing an efficient on-device live speech enhancement (Samsung Research)

#6. SummaryMixing makes speech technologies faster and cheaper (Samsung AI Center - Cambridge)

#7. A Unified Approach to Multilingual Automatic Speech Recognition with Improved Language Identification for Indic Languages (Samsung R&D Institute India-Bangalore)




Speech technology is integral to our lives. However, it still has pain points. For example, state-of-the-art speech recognisers have trouble with long utterances. Dictate to your phone for more than a minute, and it may slow down. Transcribe a podcast, and your computer might run out of memory and crash. The underlying reason is a mechanism called “self-attention”, which consumes more and more resources as inputs get longer. Deployed systems often work around this, for example by chopping up the input, but that reduces accuracy. Instead, researchers at Samsung AI Center Cambridge (SAIC-C) have attacked the problem at its root by replacing self-attention with a method they have named “SummaryMixing”.

Samsung AI Center Cambridge (SAIC-C) is the leading AI research laboratory from Samsung in Europe. It is tasked with coming up with clever tech that causes step changes in the user experience. These innovations can then be found in Samsung devices, like in the Bixby voice assistant. This blog post summarizes a one-year effort of the Speech Team at SAIC-C to deliver faster (and cheaper to train) speech technologies. Two scientific papers about this work (see the links at the bottom of this page) are being presented this week at Interspeech 2024.

The new method, SummaryMixing, can be inserted into most existing deep learning models used for processing speech. It improves the responsiveness and stability of applications relying on such models by drastically reducing the processing time and the memory required to run them. The source code for SummaryMixing is released under the Creative Commons Attribution-NonCommercial 4.0 International licence as a plug-and-play add-on to the SpeechBrain toolkit, available on the SamsungLabs GitHub: https://github.com/SamsungLabs/SummaryMixing

The longer the audio, the more resources self-attention requires



So what makes self-attention slow on long audio segments? Self-attention takes information from all time steps in the audio, say, 100 per second, and compares every time step with every other time step. Imagine hosting a party and introducing all guests to each other. Four guests require only 6 introductions, but with 20 guests this rockets to 190. This is because each additional guest needs more introductions: the fourth guest meets 3 people, but the 20th guest is introduced to 19 others. It is the same with self-attention, which performs more and more comparisons with each additional second of audio.

The question then arises: are all these individual comparisons necessary? Existing research suggests that most of these comparisons yield roughly the same result, so it should be possible to, effectively, re-use the results of old comparisons. It is as if, instead of introducing guests one by one, we say “I know all these people from work”. This equalises the time taken for each new guest. When the total time spent on introductions is plotted against the number of guests, the result is a straight line, hence the name “linear complexity”, in contrast with the “quadratic complexity” of self-attention.
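To make the party arithmetic concrete, here is a small illustrative Python snippet (written for this post, not taken from the papers) contrasting the number of pairwise comparisons self-attention performs with the single pass over frames that a running average needs:

# Illustrative only: the "introductions" from the party analogy.
# Pairwise comparisons (self-attention) grow quadratically with the number
# of frames; updating one running summary (averaging) grows linearly.

def pairwise_comparisons(num_frames: int) -> int:
    # Every frame meets every other frame: n * (n - 1) / 2 comparisons.
    return num_frames * (num_frames - 1) // 2

def summary_updates(num_frames: int) -> int:
    # Each frame is added to the running average exactly once.
    return num_frames

for seconds in (1, 10, 60):   # speech is typically turned into ~100 frames per second
    n = 100 * seconds
    print(seconds, "s:", pairwise_comparisons(n), "comparisons vs", summary_updates(n), "summary updates")
# 1 s: 4950 comparisons vs 100 summary updates
# 10 s: 499500 comparisons vs 1000 summary updates
# 60 s: 17997000 comparisons vs 6000 summary updates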

SummaryMixing can listen to you for a much longer time



The Speech Team introduces “SummaryMixing” to replace self-attention. Figure 1 shows time and memory use for speech recognition with SummaryMixing and self-attention. For SummaryMixing, the graphs are straight lines, indicating linear complexity. With self-attention, on the other hand, time and memory use spiral out of control much more quickly. Hence, given a fixed amount of memory for the Bixby application, SummaryMixing can listen to the user for much longer than self-attention before exceeding the smartphone's capacity or hurting the responsiveness of the application. It can also transcribe speech much faster.

Figure 1. Use of resources (time and memory) during speech recognition. With SummaryMixing, the novel approach, resource use increases in a straight line, whereas with self-attention, it increases much faster, causing the recognizer to slow down and/or run out of memory.

Self-attention is a block in a neural network, and so is SummaryMixing. Before being processed by a deep learning model, a speech utterance is transformed into a sequence of time steps represented as vectors, also called frames. For instance, one second of a user recording typically results in a sequence of 100 frames. This sequence is then given to the deep learning model, which may contain SummaryMixing or self-attention. The core difference between the two lies in how the interactions between the frames of the input speech are modelled. In self-attention, every single frame is compared with every other frame in the sequence, as in our party analogy, leading to a quadratic increase in computation as the input gets longer. In SummaryMixing, these pair-wise interactions are replaced with an average: the whole sequence of frames is averaged into a single summary vector, which is then combined with every frame. Averaging has linear time complexity, as each frame of the utterance only needs to be added once before dividing by the total number of frames. This average is also very close to what self-attention computes in speech recognition models, except that self-attention does it with quadratic time complexity.
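As an illustration, here is a minimal PyTorch sketch of this idea. It is a simplified, hypothetical layer written for this post, not the released implementation (see the SamsungLabs/SummaryMixing repository for that): each frame goes through a local transformation, a second transformation is averaged over time into a single summary vector, and the two are combined for every frame, so the cost grows linearly with the sequence length.

import torch
import torch.nn as nn

class SummaryMixingSketch(nn.Module):
    """Simplified sketch of the SummaryMixing idea (hypothetical layer name)."""

    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.local = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())    # per-frame branch
        self.summary = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())  # branch to be averaged
        self.combine = nn.Linear(2 * hidden, dim)                        # mixes frame + summary

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim), e.g. ~100 frames per second of audio
        local = self.local(x)                                # (batch, time, hidden)
        summary = self.summary(x).mean(dim=1, keepdim=True)  # (batch, 1, hidden): one vector per utterance
        summary = summary.expand_as(local)                   # broadcast the summary to every frame
        return self.combine(torch.cat([local, summary], dim=-1))  # (batch, time, dim)

x = torch.randn(2, 300, 256)               # 2 utterances of 3 seconds at 100 frames/s
print(SummaryMixingSketch(256)(x).shape)   # torch.Size([2, 300, 256])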

Any existing state-of-the-art speech model can easily be turned into a SummaryMixing variant: only the self-attention block needs to be replaced, while the rest of the model remains identical.
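For instance, the self-attention block of a typical encoder layer could be swapped for the linear-time layer sketched above, leaving the normalisation and feed-forward parts untouched. The released add-on integrates the replacement into SpeechBrain models; the hypothetical snippet below only illustrates this general pattern.

import torch.nn as nn  # reuses SummaryMixingSketch from the previous snippet

class EncoderLayerSketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.mixer = SummaryMixingSketch(dim)   # was e.g. nn.MultiheadAttention(dim, num_heads)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                       # x: (batch, time, dim)
        x = x + self.mixer(self.norm1(x))       # token mixing, now linear in time
        return x + self.ffn(self.norm2(x))      # feed-forward block is unchanged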

Are SummaryMixing-based speech technologies as accurate as self-attention ones?



The Speech Team has evaluated the accuracy of SummaryMixing across a wide range of applications, including streaming and offline speech recognition as well as intent classification, emotion recognition and speaker verification based on self-supervised pre-training. The latter technique learns very large models from gigantic unlabelled speech datasets, i.e. only the raw speech signal is available and no annotations are given. These big models are then adapted to various speech tasks that often offer very limited annotated data, reaching very high levels of accuracy.

For speech recognition, Figure 2 shows the average word error rates (WERs, lower is better) observed across all considered speech datasets (per-dataset WERs are given in the articles) for models equipped with SummaryMixing and with self-attention. The WERs make it clear that SummaryMixing performs on par with, or even slightly outperforms, self-attention, while this performance is obtained at much lower training and recognition costs.

Figure 2. Average word error rates during speech recognition tasks (lower is better) obtained with SummaryMixing, the proposed method, compared to self-attention, the previous state-of-the-art.
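For readers unfamiliar with the metric, word error rate is the standard measure of transcription quality. A quick illustrative computation, using the usual textbook definition rather than code from the papers:

# WER = (substitutions + deletions + insertions) / number of reference words
def word_error_rate(substitutions: int, deletions: int, insertions: int,
                    reference_words: int) -> float:
    return (substitutions + deletions + insertions) / reference_words

# e.g. 3 substituted, 1 deleted and 1 inserted word against a 50-word reference
print(word_error_rate(3, 1, 1, 50))   # 0.1, i.e. a 10% WER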

Figure 3 shows speech recognition results when self-supervised learning is used. The considered datasets are different from, and harder than, the ones used previously. Once again, SummaryMixing performs on par with self-attention, and sometimes even slightly better. It must be stressed that for self-supervision, lowering the training cost is of critical interest: typical self-supervised learning models are trained for weeks or months on thousands of expensive and energy-consuming GPUs. Details of performance on other tasks can be found in the articles.

Figure 3. Average word error rates during speech recognition tasks (lower is better) obtained with SummaryMixing, the proposed method, compared to self-attention, the previous state-of-the-art. These results are obtained with self-supervised learning and on more challenging speech recognition datasets than in Figure 2.

Everyone should use SummaryMixing instead of self-attention for speech technologies



The SAIC-C Speech Team’s invention makes speech technologies faster and cheaper to use than current state-of-the-art methods. The accuracy of trained models does not suffer from this change, and even goes up for some applications.

Samsung has also decided to open-source the code of SummaryMixing as an add-on compatible with one of the most widely used speech toolkits worldwide: SpeechBrain.

GitHub repository

https://github.com/SamsungLabs/SummaryMixing