SEMI-SUPERVISED TRANSFER LEARNING FOR LANGUAGE EXPANSION OF END-TO-END SPEECH RECOGNITION MODELS TO LOW-RESOURCE LANGUAGES

Published

IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)

Date

2021.12.17

Abstract

In this paper, we propose a three-stage training methodology to improve the speech recognition accuracy of low-resource languages. We explore and propose an effective combination of techniques such as transfer learning, encoder freezing, data augmentation using Text-To-Speech (TTS), and Semi-Supervised Learning (SSL). To improve the accuracy of a low-resource Italian ASR, we leverage a well-trained English model, an unlabeled text corpus, and an unlabeled audio corpus using transfer learning, TTS augmentation, and SSL, respectively. In the first stage, we use transfer learning from a well-trained English model. This primarily helps in learning the acoustic information from a resource-rich language, and achieves around a 24% relative Word Error Rate (WER) reduction over the baseline. In stage two, we utilize unlabeled text data via TTS data augmentation to incorporate language information into the model; we also explore freezing the acoustic encoder at this stage. TTS data augmentation helps us further reduce the WER by ∼21% relative. Finally, in stage three, we reduce the WER by another 4% relative by using SSL from unlabeled audio data. Overall, our two-pass speech recognition system, with Monotonic Chunkwise Attention (MoChA) in the first pass and full attention in the second pass, achieves a WER reduction of ∼42% relative to the baseline.
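
The abstract describes the three stages only at a high level. The sketch below is a minimal PyTorch illustration of the general pattern, not the paper's method: the toy Encoder/Decoder modules, the english_encoder/english_decoder stand-ins for the pretrained checkpoint, the pseudo_label helper, and all shapes and hyperparameters are assumptions for illustration, and the paper's actual two-pass MoChA + full-attention architecture is not reproduced here.

```python
import torch
import torch.nn as nn

# Toy encoder/decoder stand-ins (hypothetical); the paper's model is a
# two-pass MoChA + full-attention system, which is not reproduced here.
class Encoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)

    def forward(self, feats):              # feats: (batch, time, n_mels)
        out, _ = self.rnn(feats)
        return out                         # (batch, time, hidden)

class Decoder(nn.Module):
    def __init__(self, hidden=256, vocab=500):
        super().__init__()
        self.proj = nn.Linear(hidden, vocab)

    def forward(self, enc_out):
        return self.proj(enc_out)          # per-frame token logits

encoder, decoder = Encoder(), Decoder()

# Stage 1 -- transfer learning: initialize from a well-trained English
# model. `english_encoder`/`english_decoder` are untrained placeholders
# standing in for the pretrained checkpoint.
english_encoder, english_decoder = Encoder(), Decoder()
encoder.load_state_dict(english_encoder.state_dict())
decoder.load_state_dict(english_decoder.state_dict())

# Stage 2 -- freeze the acoustic encoder, then fine-tune on Italian data
# augmented with TTS-synthesized speech generated from unlabeled text.
for p in encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)

feats = torch.randn(4, 100, 80)            # dummy Italian/TTS features
labels = torch.randint(0, 500, (4, 100))   # dummy token targets
logits = decoder(encoder(feats))
loss = nn.functional.cross_entropy(logits.transpose(1, 2), labels)
loss.backward()
optimizer.step()

# Stage 3 -- SSL: pseudo-label unlabeled audio with the current model.
# A real system would beam-search and filter hypotheses by confidence.
@torch.no_grad()
def pseudo_label(unlabeled_feats):
    return decoder(encoder(unlabeled_feats)).argmax(dim=-1)

pseudo_targets = pseudo_label(torch.randn(4, 100, 80))
```

In this pattern, the frozen encoder preserves the acoustic knowledge transferred from the resource-rich language while the decoder adapts to the target language's text distribution; the pseudo-labeled audio is then folded back into training in the final stage.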