Deep learning techniques have brought a big step forward in speech separation. The current leading methods are based on the time-domain audio separation network (TasNet) [1]. TasNet replaces the fixed T-F domain transformation with a learnable encoder and decoder: it takes waveform inputs, directly reconstructs the sources, and computes a time-domain loss with utterance-level permutation invariant training (uPIT). Several approaches have been proposed on top of the TasNet framework, such as Conv-TasNet [2], the dual-path recurrent neural network (DPRNN) [3], the dual-path Transformer network (DPTNet) [4], the RNN-free Transformer-based SepFormer [5], and Sandglasset [6], a self-attentive network with a novel sandglass shape.
Time-domain separation methods have achieved impressive results. However, the latent space learned by time-domain methods lacks interpretability, and their performance is unstable in extreme conditions. T-F domain separation methods, in contrast, are more robust and highly correlated with the phonetic structure of speech. But the STFT is not learnable; it is a generic signal transformation that is not necessarily optimal for speech separation. To overcome this, an auxiliary encoder after the STFT and a separation approach designed for the T-F domain are necessary.
In this blog, we present our recent work published at ICASSP 2022: a T-F Domain Path Scanning Network (TFPSNet) for end-to-end monaural speech separation. Experiments show that the proposed model achieves state-of-the-art (SOTA) performance on the public WSJ0-2mix dataset.
Our speech separation system consists of three stages: encoder, separator, and decoder, similar to Conv-TasNet. First, an encoder converts the mixture waveform into T-F features. The features are then fed to the separator to predict a mask for each source. Finally, the decoder converts the masked T-F signal back to a waveform with the inverse STFT (iSTFT).
Figure 1. The overall architecture of the proposed TFPSNet
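To make the three-stage layout concrete, here is a minimal sketch of how the pieces could be wired together. The module interfaces and tensor shapes are our own simplification for illustration, not the exact implementation from the paper.

```python
def separate(mixture, encoder, separator, decoder, n_src=2):
    """Three-stage separation pipeline (hypothetical module interfaces).

    mixture   : (batch, samples) waveform
    encoder   : waveform -> T-F feature, shape (batch, H, F, T)
    separator : feature  -> masks, shape (n_src, batch, H, F, T)
    decoder   : masked feature -> (batch, samples) waveform
    """
    feat = encoder(mixture)         # STFT + learnable encoding
    masks = separator(feat)         # one mask per source
    return [decoder(feat * masks[s]) for s in range(n_src)]
```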
The spectrogram is highly correlated with the phonetic structure of speech, and the spectral structure and resolution are very important for the separator. After the STFT, we use an encoder to map the spectrogram to a high-dimensional, non-negative mixture feature.
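As a rough sketch of this encoding step, the snippet below assumes the complex STFT is split into real and imaginary channels and projected to H non-negative channels by a 1×1 convolution with ReLU. The kernel size, activation, and hyperparameters here are illustrative assumptions, not the exact settings of the paper.

```python
import torch
import torch.nn as nn

class TFEncoder(nn.Module):
    """STFT followed by a learnable projection to H non-negative channels."""
    def __init__(self, n_fft=512, hop=128, hidden=64):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.register_buffer("window", torch.hann_window(n_fft))
        # 2 input channels: real and imaginary parts of the spectrogram
        self.proj = nn.Sequential(nn.Conv2d(2, hidden, kernel_size=1), nn.ReLU())

    def forward(self, wav):                                   # wav: (batch, samples)
        spec = torch.stft(wav, self.n_fft, self.hop,
                          window=self.window, return_complex=True)
        x = torch.stack([spec.real, spec.imag], dim=1)        # (batch, 2, F, T)
        return self.proj(x)                                   # (batch, H, F, T)
```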
The separator takes the encoded representation and estimates a group of masks, one per source in the mixture. The masked encoder feature for the s-th source is obtained by element-wise multiplication between the encoded mixture feature and the estimated mask for that source.
We use a transformer to scan three kinds of paths and separate the mixed T-F feature. This new design learns more details of the frequency structure and improves generalizability as well.
Frequency path: It models the transitions from the lowest to the highest frequency bin within one frame. It processes the T-F feature in each frame independently.
Time path: It models the transitions of the same frequency bin along the time axis. It processes the T-F feature in each frequency bin independently. This path has a more obvious physical meaning than the chunk-wise paths of time-domain dual-path networks.
Time-frequency path: The frequency path connects the frequency bins within one frame, and the time path connects the same frequency bin along the time axis. By stacking frequency and time paths, all frequency bins in the utterance are implicitly connected, but that connection is neither strong nor direct. The T-F path therefore models the transitions between adjacent frequency bins of adjacent frames directly. This connection is important and meaningful for speech; for example, pitch and formants usually change frame by frame, and the time-frequency path can trace these changes across adjacent frequency bins. The motivation for scanning along the diagonals of the T-F matrix is that the time-frequency path is complementary to the time path and the frequency path. A simplified scanning sketch is given after Figure 2.
Figure 2. Illustration of path scanning. (a) Frequency path (b) Time path (c) T-F path
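To illustrate the three scanning orders on a (batch, H, F, T) feature, the sketch below reshapes the tensor so that a transformer layer attends along the frequency axis, the time axis, or a diagonal of the T-F plane. Using a single shared layer and `torch.diagonal` is a simplification for brevity; it is not the paper's exact module layout.

```python
import torch
import torch.nn as nn

H, F, T = 64, 257, 50
layer = nn.TransformerEncoderLayer(d_model=H, nhead=4, batch_first=True)
feat = torch.randn(1, H, F, T)                             # (batch, H, F, T)

# Frequency path: each frame is a sequence of F frequency tokens.
freq_seq = feat.permute(0, 3, 2, 1).reshape(-1, F, H)      # (batch*T, F, H)
freq_out = layer(freq_seq)

# Time path: each frequency bin is a sequence of T time tokens.
time_seq = feat.permute(0, 2, 3, 1).reshape(-1, T, H)      # (batch*F, T, H)
time_out = layer(time_seq)

# T-F path (simplified): tokens along one diagonal of the T-F plane,
# i.e. adjacent frequency bins in adjacent frames.
diag = torch.diagonal(feat, offset=0, dim1=2, dim2=3)      # (batch, H, min(F, T))
diag_out = layer(diag.permute(0, 2, 1))                    # (batch, L, H)
```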
In the decoder, a fully connected layer is used to reconstruct the separated T-F signal for the s-th source: it converts the H-dimensional feature back to two dimensions (the real and imaginary parts of the spectrogram). Then the iSTFT is applied to obtain the final waveforms.
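A minimal decoder sketch under the same assumed shapes as the encoder above; the linear projection back to real/imaginary parts and the layer names are illustrative choices.

```python
import torch
import torch.nn as nn

class TFDecoder(nn.Module):
    """Project the H-dim masked feature back to a complex spectrogram, then iSTFT."""
    def __init__(self, n_fft=512, hop=128, hidden=64):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.register_buffer("window", torch.hann_window(n_fft))
        self.proj = nn.Linear(hidden, 2)             # H -> (real, imag)

    def forward(self, feat, length):                 # feat: (batch, H, F, T), F = n_fft//2 + 1
        x = self.proj(feat.permute(0, 2, 3, 1))      # (batch, F, T, 2)
        spec = torch.complex(x[..., 0], x[..., 1])
        return torch.istft(spec, self.n_fft, self.hop,
                           window=self.window, length=length)
```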
We train the proposed model with uPIT to maximize the scale-invariant source-to-distortion ratio (SI-SDR); both the estimated and target signals are normalized to zero mean before the calculation. Instead of using the waveform SI-SDR alone, we also calculate SI-SDR along the frequency path and the time path. With this T-F path loss, the network learns more details of the frequency structure. The proposed loss function consists of three parts: a) frequency-path SI-SDR, computed along the frequency axis for each frame; b) time-path SI-SDR, computed along the time axis for each frequency bin; c) waveform SI-SDR, the same end-to-end training objective used by models such as TasNet. The final loss is a weighted sum of these three parts.
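The snippet below sketches the zero-mean SI-SDR and the weighted combination of the three terms. The weights, the exact spectrogram layout, and the uPIT permutation search are assumed or omitted for brevity, so this is an illustration of the idea rather than the actual training code.

```python
import torch

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR over the last axis; inputs are zero-meaned first."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))

def tfps_loss(est_spec, ref_spec, est_wav, ref_wav, w=(1.0, 1.0, 1.0)):
    """Weighted sum of frequency-path, time-path and waveform SI-SDR (negated for minimization).

    est_spec / ref_spec: (batch, 2, F, T) real+imag spectrograms (hypothetical layout).
    est_wav  / ref_wav : (batch, samples) waveforms.
    """
    freq_term = si_sdr(est_spec.transpose(-1, -2), ref_spec.transpose(-1, -2)).mean()  # along F per frame
    time_term = si_sdr(est_spec, ref_spec).mean()                                      # along T per bin
    wave_term = si_sdr(est_wav, ref_wav).mean()
    return -(w[0] * freq_term + w[1] * time_term + w[2] * wave_term)
```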
Experiments show that the proposed model achieves superior separation performance on the public WSJ0-2mix dataset, reaching 21.1 dB SI-SDRi.
Table 1. Comparison of performances on WSJ0-2mix
Furthermore, our approach generalizes well. The model trained on WSJ0-2mix achieves 18.6 dB SI-SDRi on the Libri-2mix test set without any fine-tuning, which is even 0.4 dB higher than a DPTNet trained on Libri-2mix. This demonstrates the generalizability of our method and further confirms its effectiveness.
Table 2. Comparison of generalizability
In this blog we presented a speech separation model called TFPSNet. It learns more details of the frequency structure with a T-F path scanning transformer. Our experimental results demonstrate that TFPSNet achieves new SOTA performance on the WSJ0-2mix corpus, and the model also generalizes well. As future work, we would like to explore a joint time-domain and T-F-domain approach to further improve performance.
Paper: https://ieeexplore.ieee.org/abstract/document/9747554
[1] Y. Luo and N. Mesgarani, "TasNet: Time-domain audio separation network for real-time, single-channel speech separation," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 696–700.
[2] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[3] Y. Luo, Z. Chen, and T. Yoshioka, "Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation," in Proc. IEEE ICASSP, 2020, pp. 46–50.
[4] J. Chen, Q. Mao, and D. Liu, "Dual-Path Transformer Network: Direct context-aware modeling for end-to-end monaural speech separation," in Proc. Interspeech, 2020.
[5] C. Subakan, M. Ravanelli, S. Cornell, et al., "Attention is all you need in speech separation," in Proc. IEEE ICASSP, 2021, pp. 21–25.
[6] M. W. Y. Lam, J. Wang, D. Su, et al., "Sandglasset: A light multi-granularity self-attentive network for time-domain speech separation," in Proc. IEEE ICASSP, 2021, pp. 5759–5763.
[7] K. Wang, H. Huang, Y. Hu, et al., "End-to-end speech separation using orthogonal representation in complex and real time-frequency domain," in Proc. Interspeech, 2021.
[8] L. Zhang, Z. Shi, J. Han, A. Shi, and D. Ma, "FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks," in Proc. International Conference on Multimedia Modeling, Springer, 2020, pp. 653–665.