Publications

Better Exploiting Spatial Separability in Multichannel Speech Enhancement with an Align-and-Filter Framework

Published

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Date

2024.12.21

Research Areas

Abstract

Multichannel speech enhancement (SE) techniques combine multiple microphone signals to extract clean speech from noisy mixtures via spatial filtering. Because the target speech may arrive from arbitrary, unknown directions, current deep learning-based SE systems can suffer from a performance bottleneck when denoising speech in a single stage. In contrast, conventional signal processing algorithms often adopt a two-stage design: a first stage that spatially aligns the received signals with respect to the speech source, followed by a second stage that filters out the noise. In this paper, we introduce the Align-and-Filter network (AFnet), a deep learning-based SE model that decouples the original denoising problem into two sub-problems, imitating the align-then-filter wisdom of signal processing. The key is to leverage relative transfer functions (RTFs), which encode meaningful spatial information, through a carefully designed alignment strategy. Experimental results show that with the proposed RTF-based spatial alignment supervision, AFnet learns interpretable directional features that better exploit the spatial separability of sound sources for improved SE performance.
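For intuition, the sketch below illustrates the classical align-then-filter pipeline that the paper draws on: estimate RTFs with respect to a reference microphone, use them to spatially align the channels, and then apply a simple filtering stage. It is not the AFnet model itself; the RTF estimator, array shapes, and function names are illustrative assumptions.

```python
# Minimal NumPy sketch of the classical "align, then filter" idea in the STFT
# domain. This is NOT the AFnet architecture; it only shows how relative
# transfer functions (RTFs) can align multichannel signals before filtering.
import numpy as np

def estimate_rtf(noisy_stft, ref_ch=0, eps=1e-8):
    """Crude per-frequency RTF estimate w.r.t. a reference channel.

    noisy_stft: complex array of shape (channels, freq, frames).
    Returns rtf of shape (channels, freq); assumes the target speech
    dominates the cross-spectra (illustrative assumption).
    """
    ref = noisy_stft[ref_ch]                              # (freq, frames)
    cross = np.mean(noisy_stft * ref.conj(), axis=-1)     # (channels, freq)
    power = np.mean(np.abs(ref) ** 2, axis=-1) + eps      # (freq,)
    return cross / power

def align_and_filter(noisy_stft, rtf, eps=1e-8):
    """Stage 1: align each channel to the reference by inverting its RTF.
    Stage 2: filter by averaging the aligned channels (delay-and-sum-like).
    """
    inv_rtf = rtf.conj() / (np.abs(rtf) ** 2 + eps)       # (channels, freq)
    aligned = noisy_stft * inv_rtf[..., None]             # (channels, freq, frames)
    return aligned.mean(axis=0)                           # (freq, frames)

# Usage with random data standing in for a 4-channel noisy STFT.
rng = np.random.default_rng(0)
noisy = rng.standard_normal((4, 257, 100)) + 1j * rng.standard_normal((4, 257, 100))
rtf = estimate_rtf(noisy)
enhanced = align_and_filter(noisy, rtf)   # single-channel STFT estimate
```

In AFnet, this hand-crafted alignment is replaced by learned modules, with the RTFs serving as supervision for the alignment stage rather than being applied directly as above.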