Back Transcription as a Method for Evaluating Robustness of Natural Language Understanding Models to Speech Recognition Errors

By Marcin Sowański, Samsung R&D Institute Poland
By Tomasz Ziętkiewicz, Samsung R&D Institute Poland

Introduction

In a spoken dialogue system, the natural language understanding (NLU) model is preceded by an automatic speech recognition (ASR) system whose errors can degrade the performance of natural language understanding. We propose to investigate the impact of speech recognition errors on NLU models with a procedure that consists of three stages:

    
1. The execution of back transcription, a procedure that combines a text-to-speech (TTS) model with an ASR system to prepare a dataset contaminated with speech recognition errors.
2. The automatic assessment of the outcome from the NLU model with respect to the proposed robustness criteria.
3. The fine-grained method of detecting speech recognition errors that deteriorate the robustness of the NLU model.

Figure 1.   Evaluating robustness of NLU models to speech recognition errors

Back Transcription

The back transcription procedure applied to the NLU dataset consists of three steps. First, textual data are synthesized with the use of a TTS model. Next, the ASR system converts the audio signal back to text. Finally, both the input utterances and the recognized texts are passed to the NLU model to determine their semantic representations.

As a result, an augmented NLU dataset is obtained that provides, for each sample s:

    
• r(s): the reference text;
• h(s): the hypothesis, i.e. the r(s) text synthesized with the TTS model and transcribed with the ASR system;
• e(s): the expected outcome of the NLU model for r(s);
• b(s): the outcome of the NLU model for r(s);
• a(s): the outcome of the NLU model for h(s).
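
The three steps and the resulting record can be sketched in Python as follows. This is only an illustration under stated assumptions: `Sample`, `back_transcribe`, and the toy `toy_tts`, `toy_asr`, and `toy_nlu` functions are hypothetical names introduced here, and the toy functions stand in for real TTS, ASR, and NLU models.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Sample:
    r: str  # r(s): reference text
    h: str  # h(s): reference synthesized with TTS and transcribed with ASR
    e: str  # e(s): expected NLU outcome for r(s)
    b: str  # b(s): NLU outcome for r(s)
    a: str  # a(s): NLU outcome for h(s)


def back_transcribe(
    dataset: Iterable[Tuple[str, str]],
    tts: Callable[[str], bytes],
    asr: Callable[[bytes], str],
    nlu: Callable[[str], str],
) -> List[Sample]:
    """Build the augmented dataset from (reference, expected outcome) pairs."""
    samples = []
    for reference, expected in dataset:
        audio = tts(reference)   # step 1: synthesize the text
        hypothesis = asr(audio)  # step 2: convert the audio back to text
        # step 3: run the NLU model on both the reference and the hypothesis
        samples.append(
            Sample(r=reference, h=hypothesis, e=expected,
                   b=nlu(reference), a=nlu(hypothesis))
        )
    return samples


# Toy stand-ins (assumptions for illustration, not real models).
def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


def toy_asr(audio: bytes) -> str:
    # Simulates a single recognition error.
    return audio.decode("utf-8").replace("lights", "lice")


def toy_nlu(text: str) -> str:
    return "lights_on" if "lights" in text else "unknown"


samples = back_transcribe(
    [("turn on the lights", "lights_on")], toy_tts, toy_asr, toy_nlu
)
print(samples[0])
```

Passing the models as callables keeps the sketch independent of any particular TTS, ASR, or NLU toolkit.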

Robustness Assessment

Samples that differ in NLU outcomes obtained for reference utterances and their back-transcribed counterparts can be divided into three categories:     

    
1. C → I, a correct result obtained for the reference text is changed to an incorrect one in the case of the back-transcribed text.
2. I → I, an incorrect result returned for the reference text is replaced by another incorrect result in the case of the back-transcribed text.
3. I → C, an incorrect result obtained for the reference text is changed to a correct result in the case of the back-transcribed text.
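
Given the expected outcome e(s), the outcome for the reference b(s), and the outcome for the back-transcribed text a(s), the three categories can be assigned with a few comparisons. The sketch below is a minimal illustration: the function name `categorize` and the ASCII labels `"C->I"`, `"I->I"`, and `"I->C"` are choices made here, and NLU outcomes are assumed to be comparable with `==`.

```python
from typing import Optional


def categorize(expected: str, before: str, after: str) -> Optional[str]:
    """Return the change category of a sample, or None if the outcome is unchanged.

    expected -- e(s), before -- b(s), after -- a(s).
    """
    if before == after:
        return None  # back transcription did not change the NLU outcome
    before_ok = before == expected
    after_ok = after == expected
    if before_ok and not after_ok:
        return "C->I"
    if not before_ok and after_ok:
        return "I->C"
    # Both outcomes are incorrect and differ from each other.
    return "I->I"


print(categorize("lights_on", "lights_on", "unknown"))
print(categorize("lights_on", "alarm_set", "unknown"))
print(categorize("lights_on", "unknown", "lights_on"))
print(categorize("lights_on", "lights_on", "lights_on"))
```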

C → I samples are always considered to have a negative impact on the robustness of the NLU model. I → I samples can be treated either as irrelevant, since they do not affect the performance of the NLU model, or as negative, to obtain a definition of robustness that penalizes changes. I → C samples can be considered negative, to penalize all changes; irrelevant, to make the definition of robustness unaffected by changes that improve the performance of the NLU model; or even positive, since they improve NLU performance. Combining these choices (one treatment for C → I, two for I → I, three for I → C) yields six alternative measures that can be used to assess the robustness of the NLU model.

Figure 2.  NLU robustness measures

Each of these measures has its own rationale:     

    
• R13+ does not take into account that the behavior of downstream modules of a dialogue system that consume the outcome of an NLU model can deteriorate due to the change in labeling of incorrect results.
• R123+ promotes changing incorrect outcomes to correct ones, which is reasonable if we assume that the downstream module behaves correctly when presented with a correct input.
• R123 should be preferred to R123+ if the downstream module relies on the outcome of NLU regardless of its status.
• R12 penalizes changes between incorrect labels but neglects the impact of I → C changes; thus it is a rational choice for assessment of an NLU model that precedes a downstream module dedicated to correcting incorrect NLU outcomes, such as a rule-based post-processor.
• R1 tracks the volume of samples that become incorrect due to the use of an ASR system. Hence, it is suitable for monitoring the regressions of the ASR-NLU pair across consecutive revisions of the ASR model.
• R13 penalizes positive changes, which makes it a reasonable choice for tracking the robustness of NLU models that should act consistently in the presence of the input typed by the user and the input that comes from an ASR system. However, contrary to R123, it does not take into account the impact of I → I changes.
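
The way the six measures treat each category can be written down as a small lookup table. The table below is our reading of the descriptions above: the subscript lists the categories that are not ignored, and `+` marks I → C changes as positive. The scoring function is an assumption made purely for illustration; the actual formulas are those given in Figure 2 and in the paper, and may normalize differently. The names `TREATMENT`, `tally`, and `illustrative_score` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

# Treatment of each change category: -1 negative, 0 irrelevant, +1 positive.
TREATMENT = {
    "R1":    {"C->I": -1, "I->I": 0,  "I->C": 0},
    "R12":   {"C->I": -1, "I->I": -1, "I->C": 0},
    "R13":   {"C->I": -1, "I->I": 0,  "I->C": -1},
    "R123":  {"C->I": -1, "I->I": -1, "I->C": -1},
    "R13+":  {"C->I": -1, "I->I": 0,  "I->C": 1},
    "R123+": {"C->I": -1, "I->I": -1, "I->C": 1},
}


def tally(measure: str, categories: Iterable[str]) -> Tuple[int, int]:
    """Count the changed samples treated as negative and as positive."""
    treatment = TREATMENT[measure]
    counts = Counter(categories)
    negative = sum(n for c, n in counts.items() if treatment[c] == -1)
    positive = sum(n for c, n in counts.items() if treatment[c] == 1)
    return negative, positive


def illustrative_score(measure: str, categories: Iterable[str], total: int) -> float:
    """ASSUMPTION: a simple normalization chosen here, not the formula from Figure 2.

    total is the number of all samples, including those whose outcome is unchanged.
    """
    negative, positive = tally(measure, categories)
    return 1 - (negative - positive) / total


# Categories of the samples whose NLU outcome changed, out of 10 samples in total.
changed = ["C->I", "C->I", "I->I", "I->C"]
for name in TREATMENT:
    print(name, tally(name, changed), illustrative_score(name, changed, total=10))
```

Keeping the treatments in a table makes it easy to report several measures side by side and to pick the one that matches how the downstream module consumes NLU outcomes.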

Speech Recognition Errors Detection

To detect speech recognition errors that affect the robustness of the NLU model, one can determine the differences between the reference texts and their back-transcribed counterparts and confront them with the impact caused by the change in the NLU outcome. For this purpose, spans of text that differ between reference and back-transcribed utterances are identified with the Ratcliff/Obershelp algorithm and converted into edit operations that transform incorrectly transcribed words into the correct words present in the reference. Next, a logistic regression model is built to predict whether the robustness of the model deteriorates (Y=1) or not (Y=0) due to the edit operations (editops) that are present in the alignment of the reference utterances (U) and their back-transcriptions (bt(U)).

Y ~ editops(U, bt(U))

Finally, the speech recognition errors that impact the robustness of the NLU model are determined by extracting the regression coefficients that correspond to the edit operations.
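
The detection step can be sketched with the Python standard library alone. The example below makes several assumptions: it uses `difflib.SequenceMatcher`, whose matching algorithm is closely related to Ratcliff/Obershelp, at the word level; it encodes each edit operation as a string feature; and it fits the logistic regression with plain gradient descent instead of a statistics package. The function names `editops` and `fit_logistic`, the feature encoding, and the toy data are hypothetical and are not taken from the paper.

```python
import difflib
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def editops(reference: str, hypothesis: str) -> List[str]:
    """Word-level edit operations that turn the hypothesis back into the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    matcher = difflib.SequenceMatcher(a=hyp, b=ref, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append(f"{tag}:{' '.join(hyp[i1:i2])}->{' '.join(ref[j1:j2])}")
    return ops


def fit_logistic(rows: List[List[str]], labels: List[int],
                 epochs: int = 500, lr: float = 0.5) -> Tuple[Dict[str, float], float]:
    """Fit Y ~ editops with batch gradient descent; returns (coefficients, intercept)."""
    weights: Dict[str, float] = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        grad: Dict[str, float] = defaultdict(float)
        grad_bias = 0.0
        for ops, y in zip(rows, labels):
            features = set(ops)  # binary indicator per edit operation
            z = bias + sum(weights[op] for op in features)
            error = 1.0 / (1.0 + math.exp(-z)) - y
            grad_bias += error
            for op in features:
                grad[op] += error
        bias -= lr * grad_bias / len(rows)
        for op, g in grad.items():
            weights[op] -= lr * g / len(rows)
    return dict(weights), bias


# Toy data: (reference, back-transcription, Y), with Y=1 if robustness deteriorates.
data = [
    ("turn on the lights", "turn on the lice", 1),
    ("turn on the lights", "turn on the light", 0),
    ("play some jazz", "play sum jazz", 0),
    ("call my mom", "call my mum", 0),
    ("set an alarm", "said an alarm", 1),
]
rows = [editops(ref, hyp) for ref, hyp, _ in data]
labels = [y for _, _, y in data]
coefficients, intercept = fit_logistic(rows, labels)

# Edit operations with the largest coefficients are the most harmful ones.
for op, weight in sorted(coefficients.items(), key=lambda kv: -kv[1]):
    print(f"{weight:+.2f}  {op}")
```

On real data, a regularized implementation from an established library would be a more robust choice than this hand-written optimizer, especially when many edit operations occur only once.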

Conclusion

The proposed method depends on speech synthesis and recognition models, but it does not rely on the availability of spoken corpora. It makes use of the semantic representation of the user utterance, but it does not require any additional annotation of data. Thus, the dataset used for training and testing the NLU model can be repurposed for robustness assessment at no additional cost.

This study is an outcome of a collaboration between Samsung R&D Institute Poland and the Center of Artificial Intelligence at Adam Mickiewicz University in Poznań. A detailed description of the presented procedure, along with its experimental evaluation, can be found in our paper "Back Transcription as a Method for Evaluating Robustness of Natural Language Understanding Models to Speech Recognition Errors", accepted to the 2023 Conference on Empirical Methods in Natural Language Processing.

Links

• https://arxiv.org/abs/2310.16609
• https://paperswithcode.com/paper/back-transcription-as-a-method-for-evaluating
• https://csi.amu.edu.pl