
FT-CSR: Cascaded Frequency-Time Method for Coded Speech Restoration

By Liang Wen, Samsung R&D Institute China-Beijing
By Lizhong Wang, Samsung R&D Institute China-Beijing
By Kwang Pyo Choi, Samsung Research

1 Introduction


Speech codecs are commonly used in voice communication and streaming media applications. The objective of speech coding is to maximize the similarity between the original speech and the decoded speech as subjectively perceived by the listener. However, speech coding must balance compression bitrate against auditory quality, since it is constrained by communication network bandwidth. At reduced bitrates, auditory quality suffers from coding distortion, mainly due to high-frequency information loss and coding noise caused by parameter quantization. Coded speech enhancement (CSE) aims to reduce coding noise, while bandwidth extension (BWE) focuses on recovering the lost high-frequency information.

In this paper, we explore methods for joint coded speech enhancement and bandwidth extension. The main contributions of the paper are: (1) we propose FT-CSR, a frequency-time domain two-stage approach for coded speech restoration. FT-CSR can serve as a post-processing module for existing codecs and communication applications. (2) Test results for the Opus codec indicate that FT-CSR achieves a POLQA MOS score of 3.6 or higher at bitrates from 8 to 16 kbps. Subjective tests show that FT-CSR improves the MOS of decoded speech by over 0.85.

2 Proposed Method


The goal of FT-CSR is to jointly address two common distortions introduced by lossy speech codecs: coding noise and bandwidth reduction. Figure 1 shows the complete diagram of FT-CSR. Let x be a low-bandwidth speech signal. A lossy speech codec encodes x into a bitstream for transmission and decodes the received bitstream to x′. FT-CSR consists of two cascaded stages. The first stage (CSE_F) is a frequency-domain neural network that enhances coded speech by reducing coding noise in x′; the enhanced speech signal is denoted x̂. The second stage (CSB_T) is a time-domain neural network for BWE that predicts ŷ from x̂, with y as the target high-bandwidth speech signal.
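The signal flow above can be sketched in code. The following is a minimal numpy sketch of the cascade, not the paper's implementation: the learned residual post-filters of CSE_F are stood in for by an identity spectral mask applied in an STFT/overlap-add loop, and the scale-up modules of CSB_T are stood in for by plain linear pre-upsampling from 16 kHz to 32 kHz. All function names and parameters here are illustrative assumptions.

```python
import numpy as np

def cse_f(x_coded, n_fft=512, hop=128):
    """Frequency-domain enhancement stage (CSE_F), sketched with an
    identity spectral mask in place of the learned residual post-filters."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x_coded) - n_fft) // hop
    # Windowed FFT of each frame (a simple STFT).
    spec = np.stack([np.fft.rfft(win * x_coded[i * hop:i * hop + n_fft])
                     for i in range(n_frames)])
    mask = np.ones_like(spec)        # placeholder for the learned filter
    spec_hat = spec * mask
    # Overlap-add inverse STFT back to the time domain.
    x_hat = np.zeros(len(x_coded))
    norm = np.zeros(len(x_coded))
    for i in range(n_frames):
        frame = np.fft.irfft(spec_hat[i], n=n_fft)
        x_hat[i * hop:i * hop + n_fft] += win * frame
        norm[i * hop:i * hop + n_fft] += win ** 2
    return x_hat / np.maximum(norm, 1e-8)

def csb_t(x_hat, factor=2):
    """Time-domain BWE stage (CSB_T), sketched as the pre-upsampling step
    only; the paper's scale-up networks would then refine the result."""
    t_in = np.arange(len(x_hat))
    t_out = np.arange(len(x_hat) * factor) / factor
    return np.interp(t_out, t_in, x_hat)

def ft_csr(x_coded):
    """Cascade the two stages: denoise in frequency, then extend in time."""
    return csb_t(cse_f(x_coded))
```

With the identity mask, CSE_F reconstructs its input (away from the frame edges), and the output of the cascade has twice as many samples as the input, matching the 16 kHz to 32 kHz setup described later.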

Figure 1. Frequency-Time Method for Coded Speech Restoration (FT-CSR). Coded low-bandwidth speech is enhanced in the frequency domain (CSE_F), while high-frequency content is recovered in the time domain (CSB_T). CSE_F: two enhanced spectra are predicted by two Residual Post Filters (RPF). The RPFs are connected by Feature Modulation (FM) to aid feature propagation. The two enhanced spectra are converted back to the time domain, and Linear Fusion (LF) is employed to obtain the enhanced speech. CSB_T: the enhanced low-bandwidth speech is first pre-upsampled; high-bandwidth speech is then obtained by two Scale-Up (SU) modules, where FM and LF are also employed.

3 Experiments


We train FT-CSR on the CSTR Voice Cloning Toolkit Corpus (VCTK). VCTK contains 44 hours of clean speech from 109 native English speakers with different accents. We divide VCTK into 78% for training, 8% for validation, and 14% for testing, with no speaker overlap between subsets. The objective test set consists of 225 wav files selected from the gender-balanced speakers p336 to p361 in the testing subset. For subjective evaluation, the test samples cover more languages, including Korean and Chinese. In our experiments, the sample rates for low- and high-bandwidth speech are 16 kHz and 32 kHz, respectively. The decoded low-bandwidth speech is obtained with the Opus codec. MPF and HiFi-GAN+ are selected as baseline systems for comparison.
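A speaker-disjoint split like the 78%/8%/14% partition above can be produced by grouping utterances by speaker before splitting. The sketch below is our own illustration, not the paper's procedure; it assumes VCTK-style file names where the speaker ID prefixes each file (e.g. `p336_001.wav`).

```python
import random

def speaker_disjoint_split(utterances, train=0.78, val=0.08, seed=0):
    """Split utterance file names into train/val/test subsets such that
    no speaker appears in more than one subset."""
    # Group utterances by speaker ID (the part before the first '_').
    by_speaker = {}
    for path in utterances:
        spk = path.split('_')[0]
        by_speaker.setdefault(spk, []).append(path)
    # Shuffle speakers deterministically, then cut at the ratio boundaries.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n = len(speakers)
    n_train, n_val = round(train * n), round(val * n)
    subsets = {
        'train': speakers[:n_train],
        'val':   speakers[n_train:n_train + n_val],
        'test':  speakers[n_train + n_val:],
    }
    # Expand each speaker group back into its utterance list.
    return {name: [u for s in spks for u in by_speaker[s]]
            for name, spks in subsets.items()}
```

Splitting at the speaker level rather than the utterance level is what guarantees the "no speaker overlap" property, so test-set performance reflects generalization to unseen voices.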

Table 1. Objective/Subjective Metric Test Results (lr: low bandwidth enhancement performance using low bandwidth speech as a reference. hr: overall restoration performance using high bandwidth speech as a reference.)


Figure 2. Generalization Performance


Figure 3. Example Spectrograms (VCTK p364 001, Opus@9.6kbps)


4 Conclusion


We presented FT-CSR, a frequency-time cascaded method for coded speech restoration that both enhances coded speech and extends its bandwidth. The method can be used as a post-processing module for a codec. Objective and subjective tests show that FT-CSR outperforms state-of-the-art models, yielding a significant improvement in speech quality over coded speech. Next, we plan to explore the capabilities of our method with an end-to-end neural audio codec.

Link to the paper

https://www.online-ecp.org/icme2024/Home/Download?id=2024003108.pdf

References

L. Wen, L. Wang, X. Wen, Y. Zheng, Y. Park, and K. P. Choi, "X-net: A joint scale down and scale up method for voice call," in Interspeech, 2021, pp. 1644–1648.

L. Wen, L. Wang, Y. Zhang, and K. P. Choi, "Multi-stage progressive audio bandwidth extension," in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 422–427.

J. Su, Y. Wang, A. Finkelstein, and Z. Jin, “Bandwidth extension is all you need,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 696–700.

S. Korse, K. Gupta, and G. Fuchs, “Enhancement of coded speech using a mask-based post-filter,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6764–6768.