
[INTERSPEECH 2024 Series #7] A Unified Approach to Multilingual Automatic Speech Recognition with Improved Language Identification for Indic Languages

By Nikhil Jakhar Samsung R&D Institute India-Bangalore
By Sudhanshu Srivastava Samsung R&D Institute India-Bangalore

Interspeech is the world’s leading conference on the science and technology of speech recognition, speech synthesis, speaker recognition and speech and language processing.

The conference plays a crucial role in setting new technology trends and standards, as well as providing direction for future research.

In this blog series, we introduce some of our research papers presented at INTERSPEECH 2024, listed below.

#1. Relational Proxy Loss for Audio-Text based Keyword Spotting (Samsung Research)

#2. NL-ITI: Probing optimization for improvement of LLM intervention method (Samsung R&D Institute Poland)

#3. High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model (Samsung Research)

#4. Speaker personalization for automatic speech recognition using Weight-Decomposed Low-Rank Adaptation (Samsung R&D Institute India-Bangalore)

#5. Speech Boosting: developing an efficient on-device live speech enhancement (Samsung Research)

#6. SummaryMixing makes speech technologies faster and cheaper (Samsung AI Center - Cambridge)

#7. A Unified Approach to Multilingual Automatic Speech Recognition with Improved Language Identification for Indic Languages (Samsung R&D Institute India-Bangalore)

Background



Multilingual Automatic Speech Recognition (ASR) presents several challenges, especially when multiple languages are spoken in the same audio. Traditional multilingual ASR systems often rely on language-specific models trained on low-resource Indic language data, which limits their scalability and efficiency. Building individual models is difficult due to the scarcity of Indic language data, while the need for an accurate language identification (LID) model further affects the downstream ASR task.

In this blog post, we introduce our INTERSPEECH 2024 work, which addresses both language identification and multilingual ASR within a unified framework. By leveraging the symbiotic relationship between LID and multilingual ASR, we enhance the performance of both tasks and overcome the limitations described above. The effectiveness of our method is demonstrated by experimental results on benchmark datasets, which show an absolute 19.1% improvement in Word Error Rate (WER) while enhancing language identification performance by 6% in terms of Diarization Error Rate (DER).

Proposed Approach



Our proposed approach consists of two methods, both built on the open-source Whisper model. The pre-trained Whisper model demonstrates a strong ability to generalise to different datasets and domains; however, its predictive capabilities can be improved further for certain languages and tasks through fine-tuning.
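As a concrete illustration, the sketch below shows how an open-source Whisper checkpoint could be loaded for fine-tuning with the Hugging Face transformers toolkit. The checkpoint size and toolkit are assumptions for illustration only; the post does not name a specific checkpoint or framework. The printed decoder prompt shows where the language token sits, which is the token supervised by the LID part of our loss.

```python
# A minimal sketch, assuming the Hugging Face "transformers" toolkit and the
# "openai/whisper-small" checkpoint (both illustrative choices, not stated in the post).
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Whisper's decoder prompt starts with a language token followed by a task token
# (e.g. <|hi|><|transcribe|>); the language token is what the LID term in the
# loss supervises during fine-tuning.
forced_ids = processor.get_decoder_prompt_ids(language="hindi", task="transcribe")
print(forced_ids)  # list of (position, token id) pairs forced at the start of decoding
```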

Figure 1. Model architecture and flow

In the first approach (Proposed-v1), we fine-tune the Whisper model using Indic data collected from various sources. Unlike the traditional approach, in which each language is fine-tuned separately, we fine-tune all five languages together by combining the ASR tokens with the appropriate language token in the loss function for each utterance (Equation 1). We also augment the training data in several ways: utterances from different languages are concatenated, and the language token of the combined utterance is decided by the majority token in the final utterance, as sketched below. This proposed system is depicted in Figure 1.
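A minimal sketch of this augmentation step is shown below, assuming in-memory utterances represented as dicts with audio, text, and language fields. All names are illustrative, and reading "majority token" as the language covering the largest share of the combined audio is an assumption; the post does not specify the exact implementation.

```python
from collections import defaultdict
import numpy as np

def combine_utterances(utterances):
    """Concatenate utterances and label the result with the dominant language.

    `utterances` is a list of dicts with 'audio' (1-D float array), 'text' and
    'language' fields; the field names are illustrative, not from the paper.
    """
    combined_audio = np.concatenate([u["audio"] for u in utterances])
    combined_text = " ".join(u["text"] for u in utterances)

    # One plausible reading of "majority token": the language that accounts for
    # the largest share of the combined audio.
    duration_per_language = defaultdict(int)
    for u in utterances:
        duration_per_language[u["language"]] += len(u["audio"])
    majority_language = max(duration_per_language, key=duration_per_language.get)

    return {"audio": combined_audio, "text": combined_text, "language": majority_language}
```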

In the second approach (Proposed-v2), we experiment with a weighted loss function that combines the language identification loss and the ASR-token loss as a weighted sum (Equation 2). This allows us to control how strongly language identification is learned alongside ASR.

The loss function for the first approach (Whisper fine-tuning) is:
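Reconstructed from the variable definitions that follow (the paper's exact notation may differ), Equation 1 can be written in LaTeX as:

l_{tot} = l(\mathrm{LID}_{pred}, \mathrm{LID}_{true}) + \sum_{i=1}^{n} l(tok^{i}_{pred}, tok^{i}_{true})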

Here, n represents the number of text tokens, l represents the cross-entropy loss function, LID_pred is the predicted LID, LID_true is the actual LID of the audio, tok^i_pred is the i-th predicted ASR token and tok^i_true is the corresponding actual ASR token. Additionally, l_tot represents the final loss for an audio-transcript pair.

This loss is modified to accommodate the weightage, and the modified loss function is:
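Reconstructed from the description above and the weight definitions that follow (again, the paper's exact notation may differ), Equation 2 can be written as:

l_{tot} = w_{1} \cdot l(\mathrm{LID}_{pred}, \mathrm{LID}_{true}) + w_{2} \cdot \sum_{i=1}^{n} l(tok^{i}_{pred}, tok^{i}_{true})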

Here, w_1 is the weight given to the LID token's loss and w_2 is the weight given to the ASR tokens' loss.
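Below is a minimal PyTorch sketch of how the weighted loss of Equation 2 could be computed for one audio-transcript pair. The function name, tensor shapes, and the assumption that w_2 = 1 - w_1 (suggested by the later observation that w_1 = 1.0 corresponds to LID-only fine-tuning) are illustrative, not taken verbatim from the paper.

```python
import torch
import torch.nn.functional as F

def weighted_lid_asr_loss(lid_logits, lid_target, asr_logits, asr_targets, w1=0.8, w2=None):
    """Weighted sum of the LID-token loss and the ASR-token losses (Equation 2).

    lid_logits:  (vocab,) logits at the language-token position
    lid_target:  () ground-truth language-token id (0-d long tensor)
    asr_logits:  (n, vocab) logits at the n ASR-token positions
    asr_targets: (n,) ground-truth ASR token ids
    """
    if w2 is None:
        # Assumption: w2 = 1 - w1, since the post notes that w1 = 1.0
        # reduces fine-tuning to LID only.
        w2 = 1.0 - w1
    lid_loss = F.cross_entropy(lid_logits.unsqueeze(0), lid_target.unsqueeze(0))
    asr_loss = F.cross_entropy(asr_logits, asr_targets, reduction="sum")
    return w1 * lid_loss + w2 * asr_loss
```

With w_1 = w_2 = 1 this reduces to the unweighted loss of Equation 1.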

Datasets used



We used three close-field datasets for training: SPRING-INX, IndicTTS, and IndicVoices [1-3]; details are given in Figure 2. The languages used in our training are listed below:

   • Indian English (En)
   • Hindi (Hi)
   • Bengali (Bn)
   • Telugu (Te)
   • Kannada (Kn)

To reduce the imbalance in language-specific data sizes within the training set, we applied noise augmentation to the languages with comparatively less data so that all languages hold an equal share of the training data; a sketch of one such augmentation step is shown below.
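The sketch below shows one way such noise augmentation could be implemented, assuming additive noise mixed at a target signal-to-noise ratio. The function name, the SNR-based mixing, and the array representation are assumptions; the post does not describe the exact augmentation recipe.

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    """Mix `noise` into `clean` (both 1-D float arrays) at the given SNR in dB."""
    noise = np.resize(noise, clean.shape)          # tile/truncate noise to match length
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```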

For the evaluations, we used the DISPLACE Challenge 2024 dataset [4], whose speakers do not overlap with the training data.

Figure 2. Dataset used for training

Experimental details



The model described above was fine-tuned both for single languages and for multiple languages together. During fine-tuning on the multilingual dataset, the language tag of each audio was also taken into account in the loss calculation. We also ran an experiment in which the loss was computed from the language tag alone.

Apart from fine-tuning the model, we tried different combinations of datasets to assess their effectiveness in improving accuracy. To validate the impact of the datasets, we trained two models: the first (Proposed-v1) on the SPRING-INX and IndicTTS datasets, and the second (Proposed-v1.1) on all three datasets.

Results and Discussion



Across all languages, our proposed system outperforms the baseline with a weighted-average WER improvement of 19.1%. For low-resource languages such as Telugu and Kannada, the individually fine-tuned models perform slightly better than our proposed system, but the proposed system still outperforms the baseline Whisper model on these languages.

The evaluation results for the individually fine-tuned models and the proposed approach are shown in Figure 3. Individual-FT-v1/v1.1 refer to five separate models, each fine-tuned for one of the languages shown, whereas Proposed-v1/v1.1 use only a single model. The results also show that adding the IndicVoices [2] dataset improves performance across languages for all models.

Figure 3. Evaluation results (WER, lower is better)

For the independent single-language fine-tuned models, the performance improvement for the fine-tuned language comes at the cost of performance on the other languages, as shown in Figure 4.

Figure 4. Evaluation results (WER, lower is better) of Telugu and Kannada fine-tuned models compared to proposed approach

As can be seen, in the models fine-tuned for Telugu and Kannada, only the language the model was trained on shows improvement, while performance on the other languages (besides Telugu and Kannada) drops significantly.

One interesting observation is that fine-tuning for either Telugu or Kannada improves the performance of both languages. This can be attributed to Telugu and Kannada belonging to the same language family, Dravidian, and sharing phonetic similarities, suggesting that fine-tuning for one language can improve the system for other languages in the same family.

To improve language identification accuracy, we fine-tuned our model with the loss in Equation 1, jointly optimising the LID and ASR tasks; the results are shown in Figure 5. This improves the DER from 24.04% to 18.79%. In the Proposed-v2 setting where w1 is 1.0, fine-tuning is restricted to LID only, and we observed that LID-only fine-tuning performs worse than joint LID and ASR fine-tuning. Adding the LID token to the loss function improves LID accuracy because the model learns to distinguish between audios by also learning the textual context.

To understand the efficacy of the Proposed-v2 approach and further enhance the LID performance of our system, we experimented with different weightages for the language identification loss in Equation 2. The best result for Proposed-v2 was achieved with a LID loss weightage of 0.8, while setting it to 1.0 decreased overall model performance, indicating that joint fine-tuning is more effective than focusing solely on LID.

Figure 5. Evaluation results (DER, lower is better) of Proposed-v1 and varying Proposed-v2 weights

Conclusion



In conclusion, our work presents an approach to multilingual Automatic Speech Recognition (ASR) that integrates language identification within a unified framework and surpasses conventional methods that fine-tune a separate model for each language. Furthermore, the weighted loss function in the Proposed-v2 approach boosts LID performance. This is validated by experimental results on benchmark datasets, which show an absolute 19.1% improvement in Word Error Rate (WER) and a 6% improvement in language identification performance, underscoring the effectiveness of our approach.

References

[1] Arun Baby et al. “Resources for Indian Languages”. In: Community-based Building of Language Resources (International Conference on Text, Speech and Dialogue). 2016, pp. 37–43.

[2] Tahir Javed et al. IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages. 2024. arXiv: 2403.01926 [cs.CL].

[3] Nithya R et al. “SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab, IIT Madras”. Oct. 2023. arXiv: 2310.14654 [cs.CL].

[4] Shareef Babu Kalluri et al. “The Second DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments”. In: Proc. INTERSPEECH 2024. 2024.