Publications

Personalizing Speech Start Point and End Point Detection in ASR Systems from Speaker Embeddings

Published

IEEE Spoken Language Technology Workshop (SLT)

Date

2021.01.19

Abstract

Start Point Detection (SPD) and End Point Detection (EPD) in Automatic Speech Recognition (ASR) systems are the tasks of detecting when the user starts and stops speaking, respectively. Both are crucial in ASR, as inaccurate detection of the start or end point leads to poor ASR performance and a bad user experience. The main challenge in SPD and EPD is accurate detection in noisy environments, especially when the background noise contains significant speech. Current approaches tend to fail to distinguish the speech of the real user from speech in the background. In this work, we aim to improve SPD and EPD in multi-speaker environments. We propose a novel approach that personalizes SPD and EPD to a desired user and helps improve ASR quality and latency. We combine user-specific information (i-vectors) with traditional speech features (log-mel) and build a Convolutional, Long Short-Term Memory, Deep Neural Network (CLDNN) model to achieve personalized SPD and EPD. Over the standard non-personalized models, the proposed approach achieves relative improvements of 46.53% and 11.31% in SPD accuracy, and 27.87% and 5.31% in EPD accuracy, at SNRs of 0 dB and 5 dB respectively.
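
The abstract does not say how the i-vector and the log-mel features are combined. One common option, shown here purely as an illustrative assumption, is to append the same fixed-length speaker vector to every acoustic frame before the frames enter the model. The function name, the toy dimensions, and the values below are hypothetical and are not taken from the paper.

```python
from typing import List


def append_speaker_vector(
    frames: List[List[float]], speaker_vector: List[float]
) -> List[List[float]]:
    """Return new frames, each extended with the same speaker vector.

    Assumption: per-frame concatenation. The paper may combine the
    two feature types differently.
    """
    return [frame + speaker_vector for frame in frames]


# Toy input: 3 frames of 4 log-mel coefficients each (hypothetical values).
log_mel = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]

# Toy 2-dimensional speaker vector standing in for an i-vector.
i_vector = [-1.0, 1.0]

# Each frame now carries acoustic features followed by speaker features.
model_input = append_speaker_vector(log_mel, i_vector)
```

With this layout every frame seen by the model carries the identity of the desired user, which is one way a detector could learn to ignore background speakers.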