Two papers from Samsung R&D Institute China-Beijing (SRC-B) were recently accepted by the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2022, the flagship conference of the Institute of Electrical and Electronics Engineers (IEEE) Signal Processing Society on signal processing and its applications.
Researchers from Samsung R&D Institute China-Beijing (SRC-B)
Read on to learn more about the two speech separation papers submitted by SRC-B.
This study proposed a time-frequency (T-F) domain path scanning network (TFPSNet) for speech separation, which achieved state-of-the-art (SOTA) performance on the public WSJ0-2mix data set. The results are of both practical and theoretical significance.
Proposed TFPSNet’s Overall Architecture
In this paper, the researchers introduced TFPSNet, a novel model specialized for speech separation. Its connections between frequency bins along the frequency, time, and T-F paths were modeled on the transformer, allowing the model to learn more about the frequency structure and to separate features in the T-F domain. Experiments showed that TFPSNet achieved new SOTA performance on the WSJ0-2mix corpus, with a scale-invariant signal-to-distortion ratio improvement (SI-SDRi) of 21.1 dB, and reached 19.7 dB SI-SDRi on Libri-2mix. Furthermore, the approach generalized well: the model trained on WSJ0-2mix achieved 18.7 dB SI-SDRi on the unmatched Libri-2mix test set without any fine-tuning.
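For readers unfamiliar with the metric, the SI-SDRi figures quoted above can be illustrated with a small sketch. It uses the commonly cited definition of scale-invariant SDR (project the estimate onto the reference, then compare the energy of the projection with the energy of the residual) and reports the improvement of the separated signal over the unprocessed mixture. The function names and the toy sine-wave signals are assumptions made for illustration; this is not the evaluation code used in the paper.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length lists of samples."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)
    # Remove the mean so a constant offset does not count as distortion.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to get the target component.
    scale = sum(e * r for e, r in zip(est, ref)) / (sum(r * r for r in ref) + eps)
    target = [scale * r for r in ref]
    # Everything that is not explained by the reference counts as distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the separated estimate over the raw mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy signals (hypothetical): two sine waves stand in for two speakers.
n = 800
reference = [math.sin(2 * math.pi * 5 * k / n) for k in range(n)]
interferer = [math.sin(2 * math.pi * 13 * k / n) for k in range(n)]
mixture = [r + i for r, i in zip(reference, interferer)]
# Pretend a separator reduced the interferer amplitude by a factor of ten.
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]

improvement = si_sdr_improvement(estimate, mixture, reference)
print(f"SI-SDRi: {improvement:.2f} dB")  # prints "SI-SDRi: 20.00 dB"
```

In this toy case the interferer amplitude drops by a factor of ten, so its energy drops by a factor of one hundred and the improvement is 20 dB. Rescaling the estimate leaves the score unchanged, which is what "scale-invariant" refers to.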
Conducted in collaboration between SRC-B and Shanghai Jiao Tong University (SJTU), this study proposed a fast speech separation model for meeting recordings, adopting skipping memory for long sequence modeling and exploring the use of a global feature to guide the model's separation. The model achieved an SDR improvement of 17.1 dB with less than 1.0 ms latency in a simulated meeting-style evaluation.
Proposed Skipping Memory (SkiM) Net
Continuous speech separation (CSS) for meeting preprocessing has recently become a research focus. Compared with the data used in utterance-level speech separation, meeting-style audio streams last longer and contain an unspecified number of speakers. This paper adopted a time-domain speech separation method and the recently proposed Graph-PIT to build an ultra-low-latency online speech separation model, which is crucial for practical applications. However, a low-latency time-domain encoder with a small stride produces a very long feature sequence. To model such sequences, the researchers proposed a simple yet efficient model called Skipping Memory (SkiM). Experimental results showed that SkiM achieved equal or better separation performance than the dual-path recurrent neural network (DPRNN), while its computational cost was 75% lower than that of DPRNN. SkiM's robust long sequence modeling capability and low computational cost make it a suitable model for online CSS applications.
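The trade-off between encoder latency and sequence length mentioned above can be made concrete with a little arithmetic. The sketch below counts the frames a strided time-domain encoder would produce for one minute of audio. The 8 kHz sample rate, the kernel sizes, and the strides are hypothetical values chosen for illustration, not the settings reported in the paper.

```python
def num_frames(num_samples, kernel_size, stride):
    """Frames produced by a strided 1-D encoder with no padding."""
    if num_samples < kernel_size:
        return 0
    return (num_samples - kernel_size) // stride + 1


def window_ms(kernel_size, sample_rate):
    """Duration of one encoder window in milliseconds."""
    return 1000.0 * kernel_size / sample_rate


# Hypothetical setup: one minute of meeting audio sampled at 8 kHz.
sample_rate = 8000
num_samples = 60 * sample_rate

# A short window keeps the encoder delay small but yields many frames.
short = num_frames(num_samples, kernel_size=8, stride=4)
# A long window yields far fewer frames at the cost of a longer delay.
long_ = num_frames(num_samples, kernel_size=256, stride=128)

print(window_ms(8, sample_rate), short)    # 1.0 119999
print(window_ms(256, sample_rate), long_)  # 32.0 3749
```

Under these assumed settings, the 1 ms window produces roughly 32 times as many frames as the 32 ms window, which is the kind of sequence length that motivates a model designed for long sequences.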