Joint End-to-End Spoken Language Understanding and Automatic Speech Recognition Training Based on Unified Speech-to-Text Pre-Training
Published
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Abstract
Modern spoken language understanding (SLU) approaches optimize the system in an end-to-end (E2E) fashion, which has two advantages. First, it can avoid error propagation from upstream systems; second, it makes it straightforward to combine various information types and optimize them towards the same objective. In this study, we build an SLU system by fusing information from two modalities, speech and text, and optimizing related tasks concurrently. We leverage a model pre-trained on speech and text data and fine-tune it for E2E SLU tasks. The SLU model is jointly optimized on automatic speech recognition (ASR) and SLU tasks under single-mode and dual-mode schemes. In the single mode, the model predicts ASR and SLU results sequentially; in the dual mode, the same model learns to predict either ASR or SLU outputs according to a task tag. The proposed method proves superior on benchmarks using the FSC, SLURP, and in-house datasets, improving intent accuracy, SLU-F1, and word error rate (WER).
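The difference between the two training schemes can be illustrated by how the decoder targets might be assembled for one utterance. The following is only a minimal sketch under stated assumptions: the token names `<asr>`, `<slu>`, and `<sep>`, the helper functions, and the choice to place the task tag at the start of the target are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical special tokens; the actual tag vocabulary is an assumption.
ASR_TAG = "<asr>"
SLU_TAG = "<slu>"
SEP = "<sep>"


def single_mode_target(transcript: str, slu_label: str) -> List[str]:
    """Single mode: one target holding the ASR transcript followed by the SLU label."""
    return transcript.split() + [SEP, slu_label]


def dual_mode_targets(transcript: str, slu_label: str) -> Dict[str, List[str]]:
    """Dual mode: one target per task, each prefixed by the tag that selects the task."""
    return {
        "asr": [ASR_TAG] + transcript.split(),
        "slu": [SLU_TAG, slu_label],
    }


# Illustrative utterance and intent label (made up for this sketch).
single = single_mode_target("turn on the lights", "activate_lights")
dual = dual_mode_targets("turn on the lights", "activate_lights")
print(single)
print(dual)
```

In this sketch, the single-mode target forces the model to emit the transcript before the SLU label, whereas the dual-mode targets let the same model produce either output alone depending on the tag it is given.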