
[INTERSPEECH 2024 Series #5] Speech Boosting: Low-Latency Live Speech Enhancement for TWS Earbuds

By Hanbin Bae, Samsung Research
By Pavel Andreev, Samsung R&D Institute Russia
By Azat Saginbaev, Samsung R&D Institute Russia
By Nicholas Babaev, Samsung R&D Institute Russia
By Won-Jun Lee, Samsung Research
By Hosang Sung, Samsung Research
By Hoon-Young Cho, Samsung Research

Interspeech is the world’s leading conference on the science and technology of speech recognition, speech synthesis, speaker recognition and speech and language processing.

The conference plays a crucial role in setting new technology trends and standards, as well as providing direction for future research.

In this blog series, we introduce some of our research papers presented at INTERSPEECH 2024; here is the list.

#1. Relational Proxy Loss for Audio-Text based Keyword Spotting (Samsung Research)

#2. NL-ITI: Probing optimization for improvement of LLM intervention method (Samsung R&D Institute Poland)

#3. High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model (Samsung Research)

#4. Speaker personalization for automatic speech recognition using Weight-Decomposed Low-Rank Adaptation (Samsung R&D Institute India-Bangalore)

#5. Speech Boosting: developing an efficient on-device live speech enhancement (Samsung Research)

#6. SummaryMixing makes speech technologies faster and cheaper (Samsung AI Center - Cambridge)

#7. A Unified Approach to Multilingual Automatic Speech Recognition with Improved Language Identification for Indic Languages (Samsung R&D Institute India-Bangalore)

Introduction



Recently, true wireless stereo (TWS) earbuds have become popular alongside mobile phones, increasing convenience for many users. In line with this, companies developing earbuds have introduced a variety of functions to maximize the user experience. Active noise cancellation (ANC), which blocks almost all sounds around the user, is a core feature of TWS earbuds. By removing background noise, this function enhances everyday experiences such as listening to music, making calls, or focusing on work.

The need for additional technology becomes apparent when a person wearing earbuds with ANC enabled wants to converse with people nearby. Currently, to hear the voice of a nearby person clearly, the user must turn off the ANC function or remove the earbuds altogether. A technology that enhances the voice of a nearby person while ANC suppresses ambient sound would remove several inconveniences, such as missed words, delayed conversation, and the increased risk of losing earbuds after taking them out. In this study, we apply a speech enhancement solution focused on advancing the noise suppression capabilities of earbuds, particularly in noisy environments where ANC is active, while ensuring that the suppression does not hinder conversation. This necessitates advanced low-latency speech enhancement models capable of balancing noise reduction with the preservation of critical sounds.

To successfully implement an effective speech enhancement model for this scenario, two key criteria must be satisfied. First, the algorithmic latency of the speech enhancement module must be kept at 3 ms or less. This is critical because, in contrast to remote communication, users here are noticeably more sensitive to delays: since they interact in the same space, the superposition of the direct and delayed speech produces the spectral coloration of the comb-filtering effect [1], which makes even small delays disruptive. Second, the use of computing resources must be minimized. This is particularly crucial in real-time on-device applications, such as ours, where efficient use of resources directly affects performance and user experience.

To meet these requirements, we explored several design choices to achieve efficient low-latency speech enhancement.

    1. We compared the efficiency of a state-of-the-art frequency-domain network with that of a time-domain baseline and discovered that the time-domain baseline was more effective when allocated comparable computational resources and algorithmic latency.

    2. We investigated whether modern structured state space-based models [2, 3] could replace our conventional Wave-U-Net + LSTM baseline structure. Despite the encouraging results for long-context modeling tasks, these models were unable to outperform our simple baseline in a low-latency speech enhancement setup.

    3. We evaluated the efficiency of adversarial losses, a common tool for training contemporary speech enhancement models [4–6], in a low-latency setup and noted their propensity for speech oversuppression. To counter this effect, we suggested a two-stage training scheme that combines Phone-Fortified Perceptual Loss (PFPL) [7], adversarial [4], UTokyo-SaruLab Mean Opinion Score (UTMOS) [8], and Perceptual Evaluation of Speech Quality (PESQ) [9] losses, which we believe enhances speech intelligibility and minimizes artifacts.

    4. We assessed the performance of the magnitude pruning method against that of the novel Sparsity Profiles via DYnamic programming search (SPDY) + Optimal Brain Compression (OBC) method [10, 11]. We observed that the SPDY + OBC method significantly improved the quality of the pruned models.

Overall, the combination of these techniques delivered a low-latency speech enhancement model with 3 ms algorithmic latency and 0.21 GMAC complexity (or 291 MCPS after being ported on-device), making it suitable for on-device usage while outperforming the baselines at lower latency and complexity.

Speech Boosting

1. Wave-U-Net + LSTM

Figure 1. Architecture of the Wave-U-Net + LSTM model.

We began with the Wave-U-Net + LSTM baseline introduced in [12]. We used the adversarial loss function with MSD discriminators because this loss had better perceptual properties than regression-type losses [4, 6, 12]. We used a three-layer architecture with strides of 4, 4, and 2 and 32, 64, and 128 channels, as shown in Figure 1. The model operated at a 16 kHz sample rate with a chunk size of 32 timesteps and a look-ahead of 16 timesteps. The look-ahead is achieved by duplicating the input waveform into several channels, each containing a shifted waveform. Consequently, the total algorithmic latency was 48 timesteps, which is equivalent to 3 ms at 16 kHz. The computational complexity of this model was approximately 2 GMAC/s. In all the experiments, the batch size was 16, the segment size was set to 2 s, and the Adam optimizer was used with a learning rate of 0.0002 and betas of 0.8 and 0.9.
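
For illustration, here is a minimal PyTorch sketch of such a Wave-U-Net + LSTM layout, using the strides (4, 4, 2), channel counts (32, 64, 128), and shifted-channel look-ahead described above. Kernel sizes, activations, padding, and skip handling are our own assumptions, so the sketch only mirrors the overall structure, not the exact model.

```python
import torch
import torch.nn as nn

class WaveUNetLSTM(nn.Module):
    """Toy Wave-U-Net + LSTM: three encoder levels with strides 4/4/2 and
    channels 32/64/128, an LSTM bottleneck, and a mirrored decoder."""

    def __init__(self, strides=(4, 4, 2), channels=(32, 64, 128), lookahead=16):
        super().__init__()
        self.lookahead = lookahead
        enc_in = [lookahead + 1, *channels[:-1]]         # shifted waveform copies as input channels
        self.encoder = nn.ModuleList(
            nn.Sequential(nn.Conv1d(ci, co, kernel_size=2 * s, stride=s, padding=s // 2), nn.PReLU())
            for ci, co, s in zip(enc_in, channels, strides))
        self.bottleneck = nn.LSTM(channels[-1], channels[-1], batch_first=True)
        dec_out = [*reversed(channels[:-1]), 1]          # 64, 32, 1
        self.decoder = nn.ModuleList(
            nn.ConvTranspose1d(ci, co, kernel_size=2 * s, stride=s, padding=s // 2)
            for ci, co, s in zip(reversed(channels), dec_out, reversed(strides)))

    def forward(self, wav):                              # wav: (batch, time), time divisible by 32
        # Look-ahead via shifted copies of the waveform stacked as channels
        # (torch.roll wraps around; a real implementation would pad instead).
        x = torch.stack([torch.roll(wav, -k, dims=-1) for k in range(self.lookahead + 1)], dim=1)
        skips = []
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        h, _ = self.bottleneck(x.transpose(1, 2))        # LSTM over the downsampled time axis
        x = h.transpose(1, 2)
        for dec, skip in zip(self.decoder, reversed(skips)):
            x = dec(x + skip)                            # U-Net skip connection, then upsample
        return x.squeeze(1)                              # enhanced waveform, (batch, time)

model = WaveUNetLSTM()
enhanced = model(torch.randn(1, 16000 * 2))              # a 2 s segment at 16 kHz, as in training
```

With a chunk of 32 samples plus a 16-sample look-ahead, the algorithmic latency adds up to the 48 samples (3 ms at 16 kHz) quoted above.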

2. Loss functions

We observed that the adversarial loss function has two main disadvantages. First, training with this loss is considerably slow because of the training of the discriminators. Second, low-latency models trained with adversarial loss tend to oversuppress the speech content within the recording. This behavior is expected because the adversarial loss encourages the model outputs to lie within the distribution of clean speech recordings rather than preserving the speech content; thus, it may sacrifice some of the speech content to increase the distributional credibility of the generated speech.

Thus, as an alternative, we trained the model using the PFPL [7]. The PFPL is a regression-type loss formulated by combining the time-domain L1 loss and the Wasserstein distance between the wav2vec2.0 [13] features of the generated and reference (clean) waveforms.

Usage of the PFPL in the initial training stage offers two significant benefits: (1) This loss impeccably retains speech content during the noise suppression process. This is likely owing to the incorporation of wav2vec2.0 features, known for their proficiency in extracting speech content. (2) Training with the PFPL is considerably faster than with its adversarial counterpart owing to the absence of discriminator training.
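
A rough sketch of how a PFPL-style objective could be assembled is shown below. A Hugging Face wav2vec 2.0 checkpoint stands in for the feature extractor, a simple sorted-feature surrogate stands in for the Wasserstein distance used in [7], and the weighting factor is likewise an assumption.

```python
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2Model   # stands in for the wav2vec 2.0 model used in [7, 13]

w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
for p in w2v.parameters():
    p.requires_grad_(False)               # frozen feature extractor; gradients still reach the input

def pfpl_like_loss(enhanced, clean, alpha=10.0):   # alpha is an assumed weighting, not from the paper
    # Time-domain L1 term keeps the waveform close to the clean reference.
    l1 = F.l1_loss(enhanced, clean)
    # wav2vec 2.0 features of the generated and reference waveforms: (batch, frames, dim).
    f_e = w2v(enhanced).last_hidden_state
    f_c = w2v(clean).last_hidden_state
    # Crude per-dimension 1-D Wasserstein surrogate (sort over time, then compare);
    # PFPL's actual Wasserstein-distance formulation differs in detail.
    wass = (torch.sort(f_e, dim=1).values - torch.sort(f_c, dim=1).values).abs().mean()
    return l1 + alpha * wass
```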

However, the use of the PFPL during training sometimes leads to the emergence of background squeak artifacts, a typical phenomenon associated with regression-type losses. To address this, we implemented a second stage of training (fine-tuning) that integrated the adversarial [4], UTMOS [8], and PESQ [14, 15] losses with weights of 1, 50, and 5, respectively.

We applied the adversarial loss with MSD discriminators as an effective remedy for squeak artifacts. This encouraged agreement between the distributions of the clean and generated signals, thereby correcting any distributional discrepancies. Concurrently, UTMOS and PESQ augmented speech intelligibility by incorporating insights gleaned from human preference studies. We utilized the official implementation of the UTMOS score and the PyTorch implementation of the PESQ metric. Both metrics are differentiable with respect to their inputs and can therefore be applied as loss functions (multiplied by negative constants). Owing to the initial stage of PFPL training, we only had to fine-tune the models with the second-stage losses for a few epochs, thereby saving time and preserving the speech content captured by the PFPL training.
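
As a hedged illustration of the second-stage objective and its 1/50/5 weighting, the sketch below combines placeholder handles for the MSD-based adversarial generator loss, the official UTMOS predictor, and a differentiable PESQ estimate such as torch-pesq; the handles are hypothetical wrappers, not actual APIs.

```python
# Hypothetical handles: `adv_gen_loss(enhanced)` wraps the MSD-based adversarial
# generator loss [4], `utmos(wav)` the official UTMOS predictor [8], and
# `pesq(ref, deg)` a differentiable PESQ estimate such as torch-pesq [9].
W_ADV, W_UTMOS, W_PESQ = 1.0, 50.0, 5.0            # weights quoted in the text

def second_stage_loss(enhanced, clean, adv_gen_loss, utmos, pesq):
    l_adv = adv_gen_loss(enhanced)                 # pulls outputs toward the clean-speech distribution
    # UTMOS and PESQ are quality scores (higher is better), so they contribute
    # with a negative sign when used as losses to be minimized.
    l_utmos = -utmos(enhanced).mean()
    l_pesq = -pesq(clean, enhanced).mean()
    return W_ADV * l_adv + W_UTMOS * l_utmos + W_PESQ * l_pesq
```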

To verify the efficacy of the proposed training pipeline, we compared it with vanilla adversarial training and vanilla PFPL training. As summarized in the results in Table 1, the proposed two-stage training procedure considerably outperformed the baselines according to human opinion.

Table 1. Comparison of losses

3. Pruning

The original Wave-U-Net + LSTM model had a complexity of approximately 2 GMACs, rendering it unsuitable for on-device deployment. We implemented block-structured pruning to optimize the model in terms of performance and storage.

    1. For convolutional layers, we applied kernel pruning, which enforces sparsity such that if W represents a weight in a convolutional layer, then for certain input channels i and output channels j, W[i, j, :] = 0. We stored only the indices of the non-zero kernels and computed the outputs based on these kernels.
    2. The LSTM layers were pruned using block sparsity. For each non-zero block, we recorded its coordinates within the fully connected LSTM layers and performed computations only for these non-zero blocks. The blocks measured 16 × 1. (A minimal sketch of both mask structures follows this list.)
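
The sketch below illustrates the two mask structures (whole convolution kernels and 16 × 1 LSTM blocks). Note that it selects kernels and blocks by magnitude purely for illustration; in our pipeline the selection is made by SPDY + OBC, as described next.

```python
import torch

def prune_conv_kernels(conv_weight, sparsity):
    # conv_weight: (out_channels, in_channels, kernel_size). Rank each kernel
    # W[i, j, :] by its L2 norm and zero the smallest ones, so whole kernels vanish.
    norms = conv_weight.pow(2).sum(dim=-1).sqrt()              # one norm per (out, in) pair
    k = int(norms.numel() * sparsity)
    if k == 0:
        return conv_weight
    thresh = norms.flatten().kthvalue(k).values
    mask = (norms > thresh).unsqueeze(-1).to(conv_weight.dtype)
    return conv_weight * mask

def prune_lstm_blocks(weight, sparsity, block=(16, 1)):
    # weight: a fully connected LSTM weight matrix, e.g. (4 * hidden, input).
    # Zero whole 16 x 1 blocks; at inference only the coordinates of the
    # surviving blocks would be stored and computed.
    bh, bw = block
    H, W = weight.shape
    blocks = weight.reshape(H // bh, bh, W // bw, bw)
    norms = blocks.pow(2).sum(dim=(1, 3)).sqrt()               # one norm per block
    k = int(norms.numel() * sparsity)
    if k == 0:
        return weight
    thresh = norms.flatten().kthvalue(k).values
    mask = (norms > thresh)[:, None, :, None].to(weight.dtype)
    return (blocks * mask).reshape(H, W)
```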

Our pruning pipeline followed an iterative prune + fine-tune strategy. At each pruning iteration, 10% of the remaining weights were pruned, and the model was fine-tuned for 50 epochs. The procedure was continued until the total sparsity of the model reached approximately 90% (complexity-wise). The key question is which weights to prune at each iteration. We handled this problem by using the SPDY + OBC pruning strategy, which decomposes the pruning process into layer-wise local pruning (OBC) [11] and a search for layer sparsity distributions (SPDY) [10].
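
To make the schedule concrete, here is a small sketch of this loop; prune_step and finetune are hypothetical stand-ins for the SPDY + OBC pruning pass and the training loop. Pruning 10% of the remaining weights per iteration reaches roughly 90% sparsity after about 22 iterations.

```python
def iterative_pruning(model, prune_step, finetune,
                      target_sparsity=0.90, per_iter=0.10, epochs_per_iter=50):
    # `prune_step` and `finetune` are hypothetical stand-ins for the SPDY + OBC
    # pruning pass and the fine-tuning loop described in the text.
    remaining = 1.0                                   # fraction of weights still dense
    iterations = 0
    while 1.0 - remaining < target_sparsity:
        remaining *= 1.0 - per_iter                   # prune 10% of the *remaining* weights
        iterations += 1
        prune_step(model, sparsity=1.0 - remaining)
        finetune(model, epochs=epochs_per_iter)
    return model, iterations                          # 0.9 ** 22 ≈ 0.098, so ~22 iterations to ~90%
```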

In the first step of SPDY + OBC, we used OBC to prune each layer independently so as to optimally reconstruct local activations under the mean squared error criterion, given the sparsity constraint. This approach is based on an exact realization of the classical optimal brain surgeon framework applied to local layer pruning. Using OBC, we obtained, for each layer, a bank of pruned weights at different sparsity levels.

Subsequently, the SPDY search was employed to determine layer sparsities such that the total model sparsity was suitable for the current computational budget while maximizing the model performance on the calibration data. The algorithm assumed a linear dependency of the model quality on the log-sparsity levels of the layers and used dynamic programming to determine the sparsity levels. The linear dependency parameters were optimized using differential evolution and random search (shrinking neighborhood local search) algorithms for global optimization.
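
A simplified version of the dynamic-programming allocation is sketched below: given a bank of (cost, error) options per layer, produced for example by OBC at several sparsity levels, it picks one option per layer so that the total cost fits the budget while the summed calibration error is minimal. The real SPDY search additionally fits the per-layer error model with differential evolution and local search, which is omitted here.

```python
def allocate_sparsity(candidates, budget_units):
    """candidates[l]: list of (cost, error) options for layer l, e.g. OBC-pruned
    variants at several sparsity levels, with integer costs (e.g. kMAC units) and
    calibration errors. Returns (min total error, chosen option index per layer).
    Assumes at least one assignment fits within budget_units."""
    INF = float("inf")
    n = len(candidates)
    dp = [[0.0] * (budget_units + 1)] + [[INF] * (budget_units + 1) for _ in range(n)]
    choice = [[-1] * (budget_units + 1) for _ in range(n)]
    for l, opts in enumerate(candidates):
        for b in range(budget_units + 1):
            for i, (cost, err) in enumerate(opts):
                if cost <= b and dp[l][b - cost] + err < dp[l + 1][b]:
                    dp[l + 1][b] = dp[l][b - cost] + err
                    choice[l][b] = i
    picks, b = [], budget_units                       # backtrack the per-layer choices
    for l in range(n - 1, -1, -1):
        i = choice[l][b]
        picks.append(i)
        b -= candidates[l][i][0]
    return dp[n][budget_units], list(reversed(picks))

# Tiny example: two layers, three sparsity options each, as (cost, calibration error).
layers = [[(10, 0.0), (6, 2.0), (3, 10.0)],
          [(8, 0.0), (5, 3.0), (2, 12.0)]]
print(allocate_sparsity(layers, budget_units=11))     # -> (5.0, [1, 1])
```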

We compared the proposed pruning pipeline with common baseline magnitude pruning and observed that SPDY + OBC pruning drastically improved the quality of the pruned models under similar complexity constraints.

Table 2. Comparison of pruning methods


4. HiFi4 DSP Simulation

We implemented the 0.21 GMAC pruned model in native C code. Running this code on a system with a Cadence Tensilica HiFi4 DSP core [16] showed that the model required 2031 million clocks per second (MCPS). This is the total number of clocks, including the instructions that load and store each variable required for the calculations through the data memory interface. The clock frequencies supported by the microcontroller units are typically approximately 300–600 MHz, so the processing time exceeds the algorithmic latency and delays are inevitable. Therefore, single instruction multiple data (SIMD) operations, such as the 16-bit four-way SIMD operations of the HiFi4 DSP for fixed-point numbers, have to be used to reduce the total MCPS. We converted the inputs and parameters of each layer into fixed-point numbers using the Q format. The input values were converted to Q12 as 32-bit integer variables. The weights and biases of the convolutional layers were converted to Q13 and Q25 as 16-bit short and 32-bit integer variables, respectively. The weights and biases of the LSTM layers were converted to Q13 as 16-bit short variables. Subsequently, we replaced the calculation expressions of the convolutional and LSTM layers with SIMD operations of the HiFi4 DSP. The final optimized model required 291 MCPS and was around 800 kB in size.
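
The Q-format scaling quoted above can be illustrated in a few lines of Python; the actual port uses HiFi4 SIMD intrinsics in C, so this sketch only shows the conversion arithmetic and a single fixed-point multiply-accumulate.

```python
import numpy as np

def to_q(x, frac_bits, dtype):
    # Quantize a float to Q-format: scale by 2**frac_bits, round, cast.
    return np.asarray(np.round(np.asarray(x) * (1 << frac_bits)), dtype=dtype)

def from_q(x, frac_bits):
    return float(x) / (1 << frac_bits)

# Scalings used in the port: inputs Q12 (int32), conv weights Q13 (int16),
# conv biases Q25 (int32); LSTM weights and biases Q13 (int16).
x_q12 = to_q(0.37, 12, np.int32)
w_q13 = to_q(-0.52, 13, np.int16)
b_q25 = to_q(0.01, 25, np.int32)

# A Q12 input times a Q13 weight gives a Q25 product (12 + 13 fractional bits),
# which can be accumulated with the Q25 bias before shifting back down to Q12.
acc_q25 = int(x_q12) * int(w_q13) + int(b_q25)
y_q12 = acc_q25 >> 13
print(from_q(y_q12, 12))   # ~ -0.183, close to 0.37 * -0.52 + 0.01 = -0.1824
```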

Results and Future work



Figure 2. Examples of speech denoising performance

Figure 2 shows the enhanced speech samples of all the processed models, where the input sample is a male voice mixed with subway noise at an SNR of 2.5 dB. Samples mixed with babble noise are shown in the left segment, and samples mixed with additional harmonic noise from an alarm sound are shown in the right segment. As observed, the babble noise was removed well, whereas small amounts of the harmonic noise remained in the utterance segments of the enhanced speech. This observation suggests that Wave-U-Net + LSTM struggles to adequately filter out harmonic signals in noisy speech.

Owing to the inherent complexities of speech signals and noise characteristics, accurately estimating and removing noise while preserving the speech components is a delicate balance. In addition, variations in harmonic gains and fluctuations in the denoising process can contribute to the generation of harmonic noise artifacts, making it challenging to achieve a clean and natural-sounding output [17]. To alleviate this issue, Wave-U-Net + LSTM should be improved to capture harmonic relationships in noisy speech. One promising avenue for future research in this area is hybrid architectures that operate simultaneously in the time and frequency domains [4, 18, 19].

Conclusion



This work advances low-latency, on-device speech enhancement by reevaluating several critical design choices. We examined different model architectures, training losses, and pruning techniques, and selected the combination best suited to efficient low-latency speech enhancement. The resulting model achieves a remarkable balance between performance and resource utilization: it is suitable for on-device usage, exhibits low algorithmic delay, and delivers quality comparable to models with significantly higher algorithmic latency. We expect these results to pave the way for future advancements in speech enhancement technology for TWS earbuds, operating in concert with other audio processing modules such as ANC and beamforming.

References

[1] T. Goehring, J. L. Chapman, S. Bleeck, and J. J. Monaghan, “Tolerable delay for speech production and perception: Effects of hearing ability and experience with hearing aids,” International Journal of Audiology, vol. 57, no. 1, pp. 61–68, 2018.

[2] K. Goel, A. Gu, C. Donahue, and C. Re, “It’s raw! audio generation with state-space models,” in International Conference on Machine Learning. PMLR, 2022, pp. 7616–7633.

[3] A. Gu, K. Goel, and C. Re, “Efficiently modeling long sequences with structured state spaces,” in Proc. ICLR 2022 – 10th International Conference on Learning Representations, 2022.

[4] P. Andreev, A. Alanov, O. Ivanov, and D. Vetrov, “HiFi++: A unified framework for bandwidth extension and speech enhancement,” in ICASSP 2023 – 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.

[5] I. Shchekotov, P. Andreev, O. Ivanov, A. Alanov, and D. Vetrov, “FFC-SE: Fast Fourier convolution for speech enhancement,” in Proc. INTERSPEECH 2022 – 23rd Annual Conference of the International Speech Communication Association, 2022, pp. 2448–2452.

[6] J. Su, Z. Jin, and A. Finkelstein, “HiFi-GAN-2: Studio-quality speech enhancement via generative adversarial networks conditioned on acoustic features,” in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2021, pp. 166–170.

[7] T.-A. Hsieh, C. Yu, S.-W. Fu, X. Lu, and Y. Tsao, “Improving perceptual quality by phone-fortified perceptual loss using wasserstein distance for speech enhancement,” in Proc. INTERSPEECH 2021 – 22nd Annual Conference of the International Speech Communication Association, 2021.

[8] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022,” in Proc. INTERSPEECH 2022 – 23rd Annual Conference of the International Speech Communication Association, 2022, pp. 4521–4525.

[9] Audiolabs, “Pytorch implementation of the perceptual evaluation of speech quality for wideband audio.” [Online]. Available: https://github.com/audiolabs/torch-pesq

[10] E. Frantar and D. Alistarh, “SPDY: Accurate pruning with speedup guarantees,” in Proceedings of the International Conference on Machine Learning (ICML), 2022.

[11] E. Frantar, S. P. Singh, and D. Alistarh, “Optimal brain compression: A framework for accurate post-training quantization and pruning,” Advances in Neural Information Processing Systems, vol. 35, pp. 4475–4488, 2022.

[12] P. Andreev, N. Babaev, A. Saginbaev, I. Shchekotov, and A. Alanov, “Iterative autoregression: A novel trick to improve your low-latency speech enhancement model,” in Proc. INTERSPEECH 2023 – 24th Annual Conference of the International Speech Communication Association, 2023, pp. 2448–2452.

[13] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems 33 (NeurIPS 2020), vol. 33, pp. 12449–12460, 2020.

[14] J. M. Martin-Donas, A. M. Gomez, J. A. Gonzalez, and A. M. Peinado, “A deep learning loss function based on the perceptual evaluation of the speech quality,” IEEE Signal Processing Letters, vol. 25, no. 11, pp. 1680–1684, 2018.

[15] J. Kim, M. El-Khamy, and J. Lee, “End-to-end multi-task denoising for joint sdr and pesq optimization,” arXiv preprint arXiv:1901.09146, 2019.

[16] “Cadence Tensilica HiFi4 DSP core,” https://www.cadence.com/ko_KR/home/tools/silicon-solutions/compute-ip/hifi-dsps/hifi-4.html (accessed Feb. 29, 2024).

[17] E. Cho, J. O. Smith, and B. Widrow, “Exploiting the harmonic structure for speech enhancement,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 4569–4572.

[18] A. Defossez, “Hybrid spectrogram and waveform source separation,” in Proceedings of the ISMIR 2021 Workshop on Music Source Separation, 2022.

[19] S. Rouard, F. Massa, and A. Defossez, “Hybrid transformers for music source separation,” in ICASSP 2023 – 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.