
FSPEN: An Ultra-Lightweight Network for Real Time Speech Enhancement

By Lei Yang, Samsung R&D Institute China-Beijing
By Wei Liu, Samsung R&D Institute China-Beijing

The IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is an annual flagship conference organized by the IEEE Signal Processing Society.

ICASSP is the world's largest and most comprehensive technical conference focused on signal processing and its applications. Its technical program presents the latest developments in research and technology across the field and attracts thousands of professionals.

In this blog series, we introduce our research papers presented at ICASSP 2024. Here is the list:

#1. MELS-TTS: Multi-Emotion Multi-Lingual Multi-Speaker Text-To-Speech System via Disentangled Style Tokens (Samsung Research)

#2. Latent Filling: Latent Space Data Augmentation for Zero-Shot Speech Synthesis (Samsung Research)

#3. FSPEN: An Ultra-Lightweight Network for Real Time Speech Enhancement (Samsung R&D Institute China-Beijing)

#4. Enabling Device Control Planning Capabilities of Small Language Model (Samsung Research America)

#5. Dynamic Video Frame Interpolator with Integrated Difficulty Pre-Assessment (Samsung R&D Institute China-Nanjing)

#6. Object-Conditioned Bag of Instances for Few-Shot Personalized Instance Recognition (Samsung R&D Institute United Kingdom)

#7. Robust Speaker Personalisation Using Generalized Low-Rank Adaptation for Automatic Speech Recognition (Samsung R&D Institute India-Bangalore)


1 Introduction

Speech enhancement, or noise suppression, aims at improving the quality and intelligibility of noisy speech. It is an important front-end module for voice calls, automatic speech recognition (ASR), and hearing aid systems. Over the last decade, the application of deep neural networks to speech enhancement has received significant attention, but their improved performance has come at the cost of increased model complexity.

Figure 1. Overview of the proposed method, FSPEN

In this work, we propose FSPEN, an ultra-lightweight network for the real-time speech enhancement task. We propose a full-band and sub-band network structure for extracting global and local features, and an inter-frame path extension method that enhances network modeling capacity while preserving complexity. Experiments demonstrate that the proposed FSPEN achieves a PESQ of 2.97 on the VoiceBank+Demand dataset with 89M multiply-accumulate operations per second (MACs) and 79k parameters.

2 Method

Our proposed framework consists of three main components. The first is the encoder module, which comprises the sub-band encoder and the full-band encoder shown in Fig. 1; together they extract both global and local features. The second is the dual-path enhancer with path extension (DPE) module, in which we propose an inter-frame path extension method that further improves network performance while preserving complexity. The third is the full-band and sub-band decoder module, whose two decoders produce the complex spectrum mask and the magnitude spectrum mask, respectively.

2.1 Full-band and sub-band encoder

The full-band encoder employs 3-layer CNNs to extract features from the complex spectrum. We then apply a 1x1 convolution along the feature dimension to obtain global features. However, this blurs local information.
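To make this structure concrete, here is a minimal PyTorch sketch of such a full-band encoder. The channel counts, kernel sizes, strides, and activation are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FullBandEncoder(nn.Module):
    """Sketch of the full-band encoder: 3 CNN layers over the complex
    spectrum (real and imaginary parts as 2 input channels), followed by
    a 1x1 convolution that mixes features along the feature dimension to
    produce the frame-level global feature U_global.
    All channel counts, kernel sizes, and strides are assumptions."""

    def __init__(self, channels=(2, 4, 16, 32)):
        super().__init__()
        # Convolutions act along the frequency axis only (kernel height 1
        # over time keeps the encoder causal); strides progressively
        # compress the frequency dimension (assumed values).
        self.convs = nn.Sequential(
            nn.Conv2d(channels[0], channels[1], kernel_size=(1, 5),
                      stride=(1, 2), padding=(0, 2)),
            nn.ELU(),
            nn.Conv2d(channels[1], channels[2], kernel_size=(1, 3),
                      stride=(1, 2), padding=(0, 1)),
            nn.ELU(),
            nn.Conv2d(channels[2], channels[3], kernel_size=(1, 3),
                      stride=(1, 2), padding=(0, 1)),
            nn.ELU(),
        )
        # 1x1 convolution mixing along the feature (channel) dimension.
        self.pointwise = nn.Conv2d(channels[3], channels[3], kernel_size=1)

    def forward(self, complex_spec):
        # complex_spec: (batch, 2, frames, freq_bins) -- real/imag channels
        feats = self.convs(complex_spec)
        return self.pointwise(feats)  # frame-level global feature U_global
```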

To solve this problem, we utilize a set of 1-layer CNNs as sub-band encoders that encode the amplitude spectrum |X| to aid local feature extraction. The K frequency bins are manually divided into M sub-bands, and the M sub-bands are divided into N groups such that sub-bands in the same group contain the same number of frequency bins, as shown in Figure 1(a). N 1D convolutional encoders are used to encode the sub-band signals; a sketch follows below.
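The sketch below illustrates this grouping, assuming a hypothetical band layout; the group widths, number of sub-bands per group, and feature dimension are made-up values chosen only so the shapes work out, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SubBandEncoder(nn.Module):
    """Sketch of the sub-band encoder. K frequency bins are split into M
    sub-bands, grouped into N groups whose sub-bands share the same width;
    each group gets its own 1-layer 1D conv encoder. The band layout below
    is an illustrative assumption (K must equal sum(w * n))."""

    def __init__(self, group_widths=(2, 4, 8), bands_per_group=(8, 8, 8),
                 feat_dim=16):
        super().__init__()
        self.group_widths = group_widths        # bins per sub-band, per group
        self.bands_per_group = bands_per_group  # sub-bands in each group
        # One Conv1d per group: the kernel spans a whole sub-band and the
        # stride jumps to the next sub-band of the same width.
        self.encoders = nn.ModuleList(
            nn.Conv1d(1, feat_dim, kernel_size=w, stride=w)
            for w in group_widths
        )

    def forward(self, magnitude):
        # magnitude: (batch, K) amplitude spectrum |X| for one frame
        outputs, start = [], 0
        for enc, w, n in zip(self.encoders, self.group_widths,
                             self.bands_per_group):
            chunk = magnitude[:, start:start + w * n].unsqueeze(1)  # (B,1,w*n)
            outputs.append(enc(chunk))                              # (B,feat_dim,n)
            start += w * n
        # Local features at the sub-band level: (batch, feat_dim, M)
        return torch.cat(outputs, dim=-1)
```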

In this manner, the full-band encoder concentrates on extracting the global feature U_global at the frame level, while the sub-band encoder concentrates on extracting the local feature at the sub-band level.

2.2 Dual-path enhancer with path extension (DPE)

The traditional dual-path network is depicted in Figure 2(a): one GRU performs causal inter-frame modeling, and one BiGRU performs intra-frame modeling. We propose an inter-frame path extension method that keeps one BiGRU for intra-frame modeling but, for inter-frame modeling, partitions the M features within a frame into P groups, each of which uses its own GRU to perform causal inter-frame modeling independently, as shown in Figure 2(b). This approach enhances the network's modeling capacity while maintaining extremely low complexity: applying path extension increases the model size from 39k to 79k parameters, while the complexity remains 89M MACs. The DPE module comprises 3 cascaded DPE blocks, whose detailed structure is depicted in Figure 3.
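A minimal PyTorch sketch of the inter-frame path extension is given below. The feature count M, feature dimension, and hidden sizes are illustrative assumptions (P = 8 matches the ablation choice in Section 3.2), and the intra-frame BiGRU path of the full DPE block is omitted for brevity.

```python
import torch
import torch.nn as nn

class PathExtensionInterFrame(nn.Module):
    """Sketch of the inter-frame path extension: the M per-frame features
    are partitioned into P groups, and each group has its own small GRU
    that models the time (inter-frame) axis causally and independently.
    Sizes are illustrative assumptions."""

    def __init__(self, num_features=32, feat_dim=16, num_groups=8):
        super().__init__()
        assert num_features % num_groups == 0
        self.num_groups = num_groups
        self.group_size = num_features // num_groups
        # P independent GRUs; each sees (group_size * feat_dim) inputs/frame.
        self.grus = nn.ModuleList(
            nn.GRU(self.group_size * feat_dim, self.group_size * feat_dim,
                   batch_first=True)
            for _ in range(num_groups)
        )

    def forward(self, x):
        # x: (batch, frames, M, feat_dim) -- M features per frame
        b, t, m, d = x.shape
        groups = x.reshape(b, t, self.num_groups, self.group_size * d)
        # Each group is processed by its own causal GRU, independently.
        out = [gru(groups[:, :, p, :].contiguous())[0]
               for p, gru in enumerate(self.grus)]
        out = torch.stack(out, dim=2)   # (batch, frames, P, group_size*d)
        return out.reshape(b, t, m, d)  # back to the original layout
```

Because each of the P GRUs has a hidden size P times smaller than a single monolithic GRU would, the per-frame recurrent cost stays low while the total parameter budget (and thus modeling capacity) grows.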

Figure 2. Path extension

Figure 3. DPE block. The DPE module contains 3 cascaded blocks

2.3 Full-band and sub-band decoder

The full-band and sub-band decoders decode the features output by the DPE module. The full-band decoder employs skip connections to the full-band encoder and uses 3-layer deconvolution to obtain the complex spectrum mask. To further reduce complexity, the feature's channel count is halved by a 1x1 convolution before each deconvolution layer. The sub-band decoder is applied to obtain the magnitude spectrum mask; its structure is similar to that of the full-band decoder, but it employs fully connected (FC) layers to decode the sub-band features.
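Below is a minimal PyTorch sketch of the full-band decoder path. The channel schedule, kernel sizes, and the exact way skip connections are fused are assumptions; the sketch simply shows a 1x1 "squeeze" halving the channels before each deconvolution, as described above.

```python
import torch
import torch.nn as nn

class FullBandDecoder(nn.Module):
    """Sketch of the full-band decoder: before each of the 3 deconvolution
    layers, a 1x1 convolution halves the channel count of the feature
    concatenated with the matching encoder skip connection. The final
    stage emits 2 channels (real/imag) of the complex spectrum mask.
    Channel counts and kernel sizes are illustrative assumptions."""

    def __init__(self, channels=(32, 16, 4, 2)):
        super().__init__()
        self.stages = nn.ModuleList()
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            self.stages.append(nn.ModuleDict({
                # 1x1 conv halves the channels of the skip-concatenated input
                "squeeze": nn.Conv2d(2 * c_in, c_in, kernel_size=1),
                # deconvolution upsamples the frequency axis (assumed config)
                "deconv": nn.ConvTranspose2d(
                    c_in, c_out, kernel_size=(1, 3), stride=(1, 2),
                    padding=(0, 1), output_padding=(0, 1)),
            }))

    def forward(self, x, skips):
        # x: bottleneck feature (batch, C, frames, freq);
        # skips: matching encoder features, deepest first, same shapes as x
        for stage, skip in zip(self.stages, skips):
            x = stage["squeeze"](torch.cat([x, skip], dim=1))  # halve channels
            x = stage["deconv"](x)
        return x  # complex spectrum mask (2 channels: real/imag)
```

In a typical masking pipeline, the complex mask would then be applied to the noisy complex spectrum and the magnitude mask to the noisy magnitude spectrum before waveform reconstruction; the exact combination scheme used by FSPEN is detailed in the paper rather than in this sketch.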

3 Results

3.1 Performance comparison on VoiceBank+Demand

We compare the proposed FSPEN with other lightweight algorithms in terms of PESQ [21] and STOI [22] in Table 1. Our model achieves a PESQ of 2.97 with only 79k model parameters and 89M MACs, demonstrating that FSPEN delivers highly competitive performance with minimal resource usage and is therefore well suited to wearable and IoT devices.

Table 1. Results comparison on VoiceBank+Demand

3.2 Ablation experiment

We conduct an ablation study in Table 2 to demonstrate the effectiveness of our proposed methods. To verify the effectiveness of the DPE, we add it to the dual-path module of the baseline DPCRN_Light; this modification increases PESQ by 0.05. To verify the effectiveness of the full-band and sub-band structure, we add it to the DPCRN_Light model; this modification increases PESQ by 0.07. We also compare the performance of FSPEN with different numbers of GRUs P in the DPE module, and observe that once the number of GRUs exceeds 8, the marginal gain becomes negligible. Consequently, we employ 8 GRUs for path extension.

Table 2. Ablation experiment on VoiceBank+Demand

4 Conclusion

In this paper, we propose an ultra-lightweight model called FSPEN for real-time speech enhancement. A full-band and sub-band network structure is proposed for extracting global and local features, and an inter-frame path extension method is proposed to enhance network modeling capacity while preserving complexity. Our experiments demonstrate that FSPEN achieves a PESQ of 2.97 with tiny resource usage and is therefore well suited to wearable and IoT devices. As future work, we would like to introduce quantization and pruning techniques to further compress the model size and reduce complexity.
