Audio bandwidth extension (BWE) can enhance subjective sound quality by increasing the bandwidth of an audio signal. We present a novel multi-stage progressive method for causal time-domain bandwidth extension. Each stage of the progressive model contains a lightweight scale-up module that generates the high-frequency signal and a supervised attention module that guides the features propagated between stages. A two-step time-frequency training method with a weighted loss over the progressive outputs is adopted so that bandwidth extension performance improves from stage to stage. Test results show that the multi-stage model progressively improves both objective metrics and perceptual quality. The multi-stage progressive model makes bandwidth extension performance adjustable according to energy consumption, computing capacity, and user preferences.
A bandwidth extension (BWE) model converts a low-bandwidth signal into a high-bandwidth signal, increasing the sample rate and generating high-frequency energy at the same time. We propose an end-to-end time-domain method for multi-stage progressive bandwidth extension, which achieves continuous improvement in BWE performance as the number of stages increases. First, the low-bandwidth time-domain input signal is pre-upsampled to the sample rate of the high-bandwidth signal. Then, the pre-upsampled signal iteratively passes through the BWE model (named ‘scale-up+’) several times to progressively complete multi-stage BWE.
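The two steps above (pre-upsampling, then repeated passes through the stage model) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `pre_upsample_2x`, `progressive_bwe`, and `toy_stage` are hypothetical, linear interpolation stands in for the unspecified polynomial interpolation, and the stage is a placeholder that adds no high-frequency content.

```python
def pre_upsample_2x(x):
    """Double the sample rate by linear interpolation.

    Assumption: linear interpolation (a degree-1 polynomial) stands in for
    the polynomial interpolation mentioned in the text.
    """
    y = []
    for n, cur in enumerate(x):
        nxt = x[n + 1] if n + 1 < len(x) else cur  # hold the last sample
        y.append(cur)                  # original sample
        y.append(0.5 * (cur + nxt))    # interpolated midpoint
    return y


def progressive_bwe(x_low, stages):
    """Run the pre-upsampled signal through a list of stage callables.

    Each stage takes (pre_upsampled, features_from_previous_stage) and
    returns (high_bandwidth_estimate, features_for_next_stage).
    The function returns the estimate produced by every stage.
    """
    x_up = pre_upsample_2x(x_low)
    feats = None  # the first stage receives no features
    outputs = []
    for stage in stages:
        y, feats = stage(x_up, feats)
        outputs.append(y)
    return outputs


def toy_stage(x_up, feats):
    """Placeholder stage (hypothetical): predicts an all-zero high band."""
    high_band = [0.0] * len(x_up)
    y = [a + b for a, b in zip(x_up, high_band)]
    return y, high_band


# Two stages on a three-sample input; each output has six samples.
outputs = progressive_bwe([0.0, 1.0, 2.0], [toy_stage, toy_stage])
```

Because every stage yields a complete high-bandwidth estimate, running fewer stages only requires passing a shorter list of stages, which is what makes the quality and cost trade-off adjustable.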
Figure 1. Multi-stage progressive BWE model
The ‘scale-up+’ model is built on X-net’s scale-up model and the supervised attention module (SAM). As in X-net’s scale-up model, two 1D convolutional layers (Conv1 and Conv2) and an activation layer predict the high-frequency signal. The first convolutional layer and the non-linear activation layer extract low-frequency features and generate high-frequency features. The second convolutional layer integrates the generated high-frequency features into a time-domain high-frequency signal. This predicted signal is added to the pre-upsampled signal to form the high-bandwidth signal.
The purpose of SAM is to make the network propagate only the most relevant features of the current stage to the next stage and discard the less useful ones. We implement a 1D version that performs causal processing in the time domain. As shown in the figure, SAM's attention-guided output features are computed from the scale-up module’s high-frequency features and attention masks. The attention masks are calculated from the current stage's output with a convolutional layer (Conv4) followed by a non-linear activation function. These masks re-calibrate the transformed high-frequency features (obtained after Conv3), and the resulting attention-guided features are added to the scale-up module’s high-frequency features. For every stage except the first, the input is the SAM output of the previous stage concatenated with the pre-upsampled signal.
All convolutional layers in ‘scale-up+’ are causal, i.e. no look-ahead is used to compute any output sample. Pre-upsampling based on polynomial interpolation replaces the ‘duplicated in time’ upsampling used in X-net, which avoids audible mirrored high-frequency artifacts.
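The data flow of one stage can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the names `causal_conv1d`, `scale_up_plus`, and `make_stage_weights` are hypothetical, the weights are random rather than trained, and the choices of tanh for the scale-up activation, sigmoid for the attention masks, 4 feature channels, and 3 taps per kernel are assumptions, since the text does not specify them.

```python
import math
import random


def causal_conv1d(x, w):
    """Causal 1D convolution.

    x: list of input channels, each a list of floats of equal length.
    w: weights indexed as w[out_ch][in_ch][tap]; tap k multiplies x[n - k],
       so no future sample is read (samples before n = 0 count as zero).
    """
    n_samples = len(x[0])
    out = []
    for w_out in w:
        y = [0.0] * n_samples
        for ch, taps in zip(x, w_out):
            for k, coeff in enumerate(taps):
                for n in range(k, n_samples):
                    y[n] += coeff * ch[n - k]
        out.append(y)
    return out


def random_weights(rng, n_out, n_in, n_taps, scale=0.1):
    """Untrained placeholder weights (assumption: illustration only)."""
    return [[[rng.uniform(-scale, scale) for _ in range(n_taps)]
             for _ in range(n_in)] for _ in range(n_out)]


def make_stage_weights(rng, n_feat, n_taps, first_stage):
    """Weights for Conv1..Conv4; later stages also take n_feat feature inputs."""
    n_in = 1 if first_stage else 1 + n_feat
    return (random_weights(rng, n_feat, n_in, n_taps),    # Conv1
            random_weights(rng, 1, n_feat, n_taps),       # Conv2
            random_weights(rng, n_feat, n_feat, n_taps),  # Conv3
            random_weights(rng, n_feat, 1, n_taps))       # Conv4


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def scale_up_plus(x_up, prev_feats, weights):
    """One stage: scale-up module followed by a 1D supervised attention module."""
    conv1, conv2, conv3, conv4 = weights
    # Stage input: pre-upsampled signal, plus previous SAM features if any.
    inp = [x_up] + (prev_feats or [])
    # Conv1 + activation: high-frequency features (tanh is an assumption).
    h = [[math.tanh(v) for v in ch] for ch in causal_conv1d(inp, conv1)]
    # Conv2: integrate the features into one time-domain high-frequency signal.
    hf = causal_conv1d(h, conv2)[0]
    # Stage output: pre-upsampled signal plus predicted high band.
    y = [a + b for a, b in zip(x_up, hf)]
    # SAM: masks from the stage output (Conv4 + sigmoid, an assumption).
    mask = [[sigmoid(v) for v in ch] for ch in causal_conv1d([y], conv4)]
    # Conv3 transforms the features; the masks re-calibrate them.
    t = causal_conv1d(h, conv3)
    # Attention-guided features are added back to the original features.
    feats = [[hv + tv * mv for hv, tv, mv in zip(hc, tc, mc)]
             for hc, tc, mc in zip(h, t, mask)]
    return y, feats


rng = random.Random(0)
stage1 = make_stage_weights(rng, n_feat=4, n_taps=3, first_stage=True)
stage2 = make_stage_weights(rng, n_feat=4, n_taps=3, first_stage=False)
x_up = [math.sin(0.3 * n) for n in range(32)]
y1, feats1 = scale_up_plus(x_up, None, stage1)
y2, feats2 = scale_up_plus(x_up, feats1, stage2)
```

Since tap `k` only ever reads `x[n - k]`, perturbing an input sample cannot change any earlier output sample, which is the causality property the text requires of every convolution.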
We train and evaluate our method on both speech and music data. For speech, we use the CSTR Voice Cloning Toolkit Corpus (VCTK) dataset. VCTK contains clean speech from 109 native English speakers with different accents. Each speaker reads approximately 400 different sentences, taken from newspaper articles, the International Dialects of English Archive’s Rainbow Passage, or an elicitation passage that aims to identify the speaker’s accent. The recordings are 16-bit WAV files at a 48 kHz sample rate. VCTK contains 44 hours of speech data in total.
For music, we use MUSDB18-HQ. It consists of 150 full-track songs of different styles, split into a training subset of 100 tracks and a test subset of 50 tracks. Each track provides the stereo mixture and the original sources as WAV files (mixture, drums, bass, vocals, and other). All signals are stereophonic and encoded at 44.1 kHz.
In our experiments, we use super-wideband (SWB) signals at a 32 kHz sample rate as the high-bandwidth signals, and wideband (WB) signals at a 16 kHz sample rate as the low-bandwidth signals. The proposed progressive BWE method is compared with three methods: the lightweight scale-up net in X-net, Audio U-net, and HiFi-GAN+.
For objective evaluation we use signal-to-noise ratio (SNR), log-spectral distance (LSD), and POLQA (Perceptual Objective Listening Quality Assessment). Results for the WB input and the three baseline models are provided as a reference. The SNR, LSD, and POLQA MOS results all show that the progressive bandwidth extension model gradually improves performance as the number of stages increases. Progressive stage 4 (marked in bold) achieves the best SNR and LSD results, and progressive stage 2 (underlined) already outperforms the three baselines. The POLQA MOS improvement becomes less pronounced after stage 2, possibly because the score is already very close to the POLQA maximum (4.75). Model size, measured in number of parameters, is also provided to show that the progressive BWE model achieves a high MOS with a much smaller network.
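The text does not spell out the metric formulas, so the sketch below uses definitions that are common in the audio super-resolution literature and should be read as an assumption: SNR as the ratio of reference energy to error energy in dB, and LSD as the frame-averaged root-mean-square difference of log10 power spectra. The function names, the frame length of 64, the hop of 32, and the rectangular window are illustrative choices, and the naive DFT is only meant for short test signals.

```python
import cmath
import math


def snr_db(ref, est):
    """SNR in dB: 10 * log10(reference energy / error energy).

    Raises ZeroDivisionError when est equals ref exactly (zero error).
    """
    signal = sum(r * r for r in ref)
    noise = sum((r - e) ** 2 for r, e in zip(ref, est))
    return 10.0 * math.log10(signal / noise)


def log_power_spectrum(frame, eps=1e-10):
    """log10 power spectrum of one frame via a naive DFT (non-negative bins)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                  for t, v in enumerate(frame))
        spec.append(math.log10(abs(acc) ** 2 + eps))  # eps avoids log10(0)
    return spec


def lsd(ref, est, frame_len=64, hop=32):
    """Log-spectral distance averaged over frames (rectangular window)."""
    dists = []
    for start in range(0, len(ref) - frame_len + 1, hop):
        p_ref = log_power_spectrum(ref[start:start + frame_len])
        p_est = log_power_spectrum(est[start:start + frame_len])
        mse = sum((a - b) ** 2 for a, b in zip(p_ref, p_est)) / len(p_ref)
        dists.append(math.sqrt(mse))
    return sum(dists) / len(dists)


# Toy check: an estimate at half the reference amplitude.
ref = [math.sin(0.2 * n) for n in range(256)]
est = [0.5 * v for v in ref]
snr_value = snr_db(ref, est)   # error energy is a quarter of the signal energy
lsd_value = lsd(ref, est)
```

Higher SNR and lower LSD indicate a closer match to the reference; LSD is zero when the two signals are identical.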
Table 1. SNR and LSD results
Table 2. POLQA MOS result and model size
Figure 2. Example spectrograms (top: speech, bottom: music)
A qualitative comparison between X-net’s scale-up net and stage 4 of the multi-stage progressive BWE model is shown in Figure 2. We select one sentence from VCTK (p361_004) (top) and one song from MUSDB18-HQ (Ben Carrigan, ‘We'll Talk About It All Tonight’) (bottom). Spectrograms of the BWE results are shown alongside the WB and SWB signals. Multi-stage progressive bandwidth extension helps the model recover more high-frequency energy than X-net’s single-stage scale-up model (marked with white rectangles).
We introduced a multi-stage progressive method for bandwidth extension. Objective metrics and qualitative inspection show that the method improves BWE performance as stages iterate and delivers better quality than state-of-the-art models. The proposed model is lightweight and fully causal, so it can be deployed as a post-processing module in existing voice or streaming services, with BWE performance tuned by the number of stages. As a next step, we plan to explore multi-stage progressive BWE with customizable conditions for a personalized BWE experience.
Liang Wen, Lizhong Wang, Xue Wen, Yuxing Zheng, Youngo Park, and Kwang Pyo Choi, “X-Net: A Joint Scale Down and Scale Up Method for Voice Call,” Interspeech, 2021, pp. 1644-1648.
Eloi Moliner and Vesa Välimäki, “A Two-Stage U-Net for High-Fidelity Denoising of Historical Recordings,” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022, pp. 841-845.
Volodymyr Kuleshov, S. Zayd Enam, and Stefano Ermon, “Audio Super-Resolution Using Neural Networks,” International Conference on Learning Representations (ICLR), 2017.
Jiaqi Su, Yunyun Wang, Adam Finkelstein, and Zeyu Jin, “Bandwidth Extension Is All You Need,” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021, pp. 659-700.