In a spoken dialogue system, the natural language understanding (NLU) model is preceded by a speech recognition system whose errors can degrade NLU performance. We propose to investigate the impact of speech recognition errors on NLU models with a procedure that consists of three stages:
Figure 1. Evaluating robustness of NLU models to speech recognition errors
The back transcription procedure applied to the NLU dataset consists of three steps. First, the textual data are converted to speech with a text-to-speech (TTS) model. Next, an automatic speech recognition (ASR) system converts the audio signal back to text. Finally, both the input utterances and the recognized texts are passed to the NLU model to determine their semantic representations.
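The three steps above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the `synthesize`, `recognize`, and `understand` callables, the `BackTranscribedSample` record, and the toy stand-ins are all hypothetical names introduced here so that the sketch runs without real speech models.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class BackTranscribedSample:
    reference_text: str      # original utterance from the NLU dataset
    recognized_text: str     # ASR output for the synthesized audio
    reference_outcome: str   # NLU result for the reference text
    recognized_outcome: str  # NLU result for the back-transcribed text


def back_transcribe(
    utterances: Iterable[str],
    synthesize: Callable[[str], bytes],  # hypothetical TTS: text -> audio
    recognize: Callable[[bytes], str],   # hypothetical ASR: audio -> text
    understand: Callable[[str], str],    # hypothetical NLU: text -> label
) -> List[BackTranscribedSample]:
    samples = []
    for text in utterances:
        audio = synthesize(text)        # step 1: text to speech
        recognized = recognize(audio)   # step 2: speech back to text
        samples.append(                 # step 3: NLU on both texts
            BackTranscribedSample(
                text, recognized, understand(text), understand(recognized)
            )
        )
    return samples


# Toy stand-ins (assumptions) that mimic a homophone recognition error.
fake_tts = lambda text: text.encode("utf-8")
fake_asr = lambda audio: audio.decode("utf-8").replace("four", "for")
fake_nlu = lambda text: "set_alarm" if "four" in text else "unknown"

samples = back_transcribe(["wake me at four"], fake_tts, fake_asr, fake_nlu)
print(samples[0])
```

Passing the models in as plain callables keeps the sketch independent of any particular TTS, ASR, or NLU toolkit.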
As a result, an augmented NLU dataset is obtained that provides for each sample:
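Samples for which the NLU outcome of the reference utterance differs from that of its back-transcribed counterpart can be divided into three categories: C → I (correct for the reference, incorrect for the back-transcription), I → I (incorrect for both, but with different outcomes), and I → C (incorrect for the reference, correct for the back-transcription).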
C → I samples are always considered to have a negative impact on the robustness of the NLU model. I → I samples can be treated either as irrelevant, since they do not affect the performance of the NLU model, or as negative, which yields a definition of robustness that penalizes any change. I → C samples can be treated as negative, to penalize all changes; as irrelevant, so that changes which improve NLU performance do not affect the robustness score; or even as positive, since they improve NLU performance. Combining the two treatments of I → I samples with the three treatments of I → C samples leads to six alternative measures that can be used to assess the robustness of the NLU model.
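The categorization and the six combinations of treatments can be illustrated with a short sketch. The function names, the string labels, and the toy data are assumptions made for this example. The sketch only counts samples treated as negative or positive under each combination; how those counts are turned into a single robustness score is not specified in this post and is deliberately left out.

```python
from collections import Counter
from itertools import product
from typing import Optional


def categorize(expected: str, ref_outcome: str, bt_outcome: str) -> Optional[str]:
    """Return 'C->I', 'I->I', 'I->C', or None if the NLU outcome did not change."""
    if ref_outcome == bt_outcome:
        return None
    if ref_outcome == expected:
        return "C->I"  # outcomes differ, so the back-transcription is incorrect
    return "I->C" if bt_outcome == expected else "I->I"


# Treatments described in the text; C->I is always negative.
I_TO_I_OPTIONS = ("irrelevant", "negative")
I_TO_C_OPTIONS = ("negative", "irrelevant", "positive")


def count_by_treatment(categories, i_to_i: str, i_to_c: str):
    """Count samples treated as negative and as positive under one combination."""
    treatment = {"C->I": "negative", "I->I": i_to_i, "I->C": i_to_c}
    counts = Counter(treatment[c] for c in categories if c is not None)
    return counts["negative"], counts["positive"]


# Toy (expected, reference outcome, back-transcribed outcome) triples.
triples = [
    ("alarm", "alarm", "alarm"),    # unchanged
    ("alarm", "alarm", "weather"),  # C->I
    ("alarm", "music", "weather"),  # I->I
    ("alarm", "music", "alarm"),    # I->C
]
categories = [categorize(*t) for t in triples]

for i_to_i, i_to_c in product(I_TO_I_OPTIONS, I_TO_C_OPTIONS):
    negative, positive = count_by_treatment(categories, i_to_i, i_to_c)
    print(f"I->I={i_to_i:<10} I->C={i_to_c:<10} negative={negative} positive={positive}")
```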
Figure 2. NLU robustness measures
Each of these measures has its own rationale:
To detect the speech recognition errors that affect the robustness of the NLU model, one can determine the differences between the reference texts and their back-transcribed counterparts and relate them to the resulting change in the NLU outcome. For this purpose, the spans of text that differ between the reference and back-transcribed utterances are identified with the Ratcliff/Obershelp algorithm and converted into edit operations that transform the incorrectly transcribed words into the correct words present in the reference. Next, a logistic regression model is built to predict whether the robustness of the model deteriorates (Y=1) or not (Y=0), given the edit operations (editops) present in the alignment of the reference utterances (U) and their back-transcriptions (bt(U)).
Y ~ editops(U, bt(U))
Finally, the speech recognition errors that impact the robustness of the NLU model are determined by extracting the regression coefficients that correspond to the edit operations.
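A rough sketch of this analysis, under several stated assumptions, is given here. Edit operations are extracted at the word level with Python's `difflib.SequenceMatcher`, whose documentation describes its algorithm as closely related to Ratcliff/Obershelp gestalt pattern matching. The regression is fitted with a small hand-written gradient descent and an L2 penalty so that the example depends only on the standard library; in practice a statistics or machine learning library would likely be used instead. The function names, hyperparameters, and toy data are hypothetical.

```python
import math
from difflib import SequenceMatcher


def edit_operations(reference: str, back_transcribed: str):
    """Word-level operations that turn the back-transcription into the reference."""
    bt_words, ref_words = back_transcribed.split(), reference.split()
    matcher = SequenceMatcher(a=bt_words, b=ref_words, autojunk=False)
    ops = set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.add((tag, " ".join(bt_words[i1:i2]), " ".join(ref_words[j1:j2])))
    return ops


def fit_logistic(feature_sets, labels, epochs=500, lr=0.5, l2=0.01):
    """Fit Y ~ editops with binary indicator features (assumed setup)."""
    vocab = sorted({op for ops in feature_sets for op in ops})
    weights = {op: 0.0 for op in vocab}
    bias, n = 0.0, len(labels)
    for _ in range(epochs):
        grad_w = {op: 0.0 for op in vocab}
        grad_b = 0.0
        for ops, y in zip(feature_sets, labels):
            z = bias + sum(weights[op] for op in ops)
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_b += err
            for op in ops:
                grad_w[op] += err
        bias -= lr * grad_b / n
        for op in vocab:
            weights[op] -= lr * (grad_w[op] / n + l2 * weights[op])
    return weights, bias


# Toy data (assumption): (reference U, back-transcription bt(U), Y).
data = [
    ("wake me at four", "wake me at for", 1),
    ("play some jazz", "play sum jazz", 0),
    ("set alarm for four", "set alarm for for", 1),
    ("play some rock", "play sum rock", 0),
]
features = [edit_operations(ref, bt) for ref, bt, _ in data]
weights, bias = fit_logistic(features, [y for _, _, y in data])

# Edit operations sorted from most to least associated with deterioration.
for op, coef in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(op, round(coef, 3))
```

In this toy setup, operations with larger positive coefficients are the ones more strongly associated with Y=1, that is, with a deterioration of robustness.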
The proposed method depends on speech synthesis and recognition models, but it does not rely on the availability of spoken corpora. It makes use of the semantic representation of the user utterance, but it does not require any additional annotation of data. Thus, the dataset used for training and testing the NLU model can be repurposed for robustness assessment at no additional cost.
This study is the outcome of a collaboration between Samsung R&D Institute Poland and the Center of Artificial Intelligence at Adam Mickiewicz University in Poznań. A detailed description of the presented procedure, along with its experimental evaluation, can be found in our paper "Back Transcription as a Method for Evaluating Robustness of Natural Language Understanding Models to Speech Recognition Errors", accepted to the 2023 Conference on Empirical Methods in Natural Language Processing.