
[INTERSPEECH 2024 Series #4] Speaker Personalization for Automatic Speech Recognition using Weight-Decomposed Low-Rank Adaptation

By George Joseph, Samsung R&D Institute India-Bangalore

Interspeech is the world’s leading conference on the science and technology of speech recognition, speech synthesis, speaker recognition and speech and language processing.

The conference plays a crucial role in setting new technology trends and standards, as well as providing direction for future research.

In this blog series, we are introducing some of our research papers at INTERSPEECH 2024 and here is a list of them.

#1. Relational Proxy Loss for Audio-Text based Keyword Spotting (Samsung Research)

#2. NL-ITI: Probing optimization for improvement of LLM intervention method (Samsung R&D Institute Poland)

#3. High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model (Samsung Research)

#4. Speaker personalization for automatic speech recognition using Weight-Decomposed Low-Rank Adaptation (Samsung R&D Institute India-Bangalore)

#5. Speech Boosting: developing an efficient on-device live speech enhancement (Samsung Research)

#6. SummaryMixing makes speech technologies faster and cheaper (Samsung AI Center - Cambridge)

#7. A Unified Approach to Multilingual Automatic Speech Recognition with Improved Language Identification for Indic Languages (Samsung R&D Institute India-Bangalore)

Background



Personalizing automatic speech recognition (ASR) for voice assistant systems is often considered the holy grail, requiring meticulous attention to detail in model optimization. When dealing with limited speaker data, the selection of hyperparameters becomes paramount in fine-tuning large ASR models. One effective method for this optimization is low-rank adaptation (LoRA), which has proven instrumental in enhancing the performance of large language models (LLMs). A variation of LoRA, Weight-Decomposed Low-Rank Adaptation (DoRA), also promises enhanced performance.

In our study, we employed LoRA and DoRA to refine the state-of-the-art cascaded conformer transducer model for speaker personalization. This involved adding a small number of speaker-specific weights to the existing model and fine-tuning them accordingly. Experimental assessments show an average relative improvement of 20% in word error rate across speakers with limited data, showcasing the efficacy of these methods in addressing the challenge of personalizing ASR systems in real-world applications.

Figure 1. Proposed method: LoRA and DoRA weights are added to the attention layer of the cascaded conformer transducer


Proposed Approach



Two variations of low-rank adaptation, LoRA and DoRA, are proposed for speaker personalisation in ASR systems. These approaches are depicted in Figure 1. We propose two parameter-efficient fine-tuning (PEFT) approaches for speaker personalisation: the first proposed approach (proposed-1) uses LoRA and the second proposed approach (proposed-2) uses DoRA.

2.1. LoRA-based speaker personalisation (Proposed 1)

In the LoRA method, we add a small set of parameters to the weight matrix. The weight matrix of a layer is modified with the help of a new set of weight matrices, as shown in Figure 1 (LoRA training). For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, we constrain its update by representing the latter with a low-rank decomposition

$$W' = W_0 + \Delta W = W_0 + BA \quad (1)$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. During training, $W_0$ is frozen and does not receive gradient updates, while $A$ and $B$ contain trainable parameters. Note that both $W_0$ and $\Delta W = BA$ are multiplied with the same input, and their respective output vectors are summed coordinate-wise. For $h = W_0 x$, our modified forward pass yields:

$$h = W'x = W_0 x + BAx \quad (2)$$

We use a random Gaussian initialization for $A$ and zero for $B$, so $\Delta W = BA$ is zero at the beginning of training.
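To make the update in Eqn 2 concrete, here is a minimal PyTorch-style sketch of a LoRA-augmented linear layer. The class name LoRALinear and the scale of the Gaussian initialization are illustrative assumptions, not the implementation used in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen pre-trained linear layer (W0) and adds trainable B, A."""
    def __init__(self, base_linear: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base_linear                      # holds W0 (and its bias)
        for p in self.base.parameters():
            p.requires_grad = False                  # W0 receives no gradient updates
        d, k = base_linear.out_features, base_linear.in_features
        # A: random Gaussian init, B: zeros, so delta_W = BA is zero at the start
        self.A = nn.Parameter(torch.randn(rank, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, rank))

    def forward(self, x):
        # h = W0 x + B A x   (Eqn 2)
        return self.base(x) + x @ self.A.T @ self.B.T
```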

2.2. DoRA-based speaker personalisation (Proposed 2)

In the DoRA method, we add an extra small set of parameters on top of the LoRA-based method. The weight matrix of a layer is modified with the help of a new set of weight matrices, as shown in Figure 1 (DoRA training) [19]. The weight decomposition of $W_0 \in \mathbb{R}^{d \times k}$ can be formulated as:

$$W_0 = m \, \frac{V}{\|V\|_c} = \|W_0\|_c \, \frac{W_0}{\|W_0\|_c} \quad (3)$$

where $m \in \mathbb{R}^{1 \times k}$ is the magnitude vector, $V \in \mathbb{R}^{d \times k}$ is the directional matrix, and $\|\cdot\|_c$ is the vector-wise norm of a matrix across each column. This decomposition ensures that each column of $V / \|V\|_c$ remains a unit vector, and the corresponding scalar in $m$ defines the magnitude of each vector. We initialize DoRA with the pre-trained weight $W_0$ as outlined in Eqn 3, where $m = \|W_0\|_c$ and $V = W_0$ after initialization. We then keep $V$ frozen and make $m$ a trainable vector. The directional component is then updated through LoRA. DoRA can be formulated, similarly to Eqn 1, as:

$$W' = m \, \frac{V + \Delta V}{\|V + \Delta V\|_c} = m \, \frac{W_0 + BA}{\|W_0 + BA\|_c} \quad (4)$$
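The following is a minimal sketch of the DoRA update in Eqns 3 and 4, written for a PyTorch linear layer. It is an illustrative reimplementation rather than the authors' code, and the class name DoRALinear is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoRALinear(nn.Module):
    """Magnitude/direction decomposition of a frozen W0 with a LoRA update."""
    def __init__(self, base_linear: nn.Linear, rank: int = 8):
        super().__init__()
        W0 = base_linear.weight.detach().clone()         # shape (d, k)
        self.register_buffer("W0", W0)                   # frozen pre-trained weight
        self.bias = base_linear.bias
        if self.bias is not None:
            self.bias.requires_grad = False
        d, k = W0.shape
        # magnitude m initialised to the column-wise norm of W0 (Eqn 3)
        self.m = nn.Parameter(W0.norm(dim=0, keepdim=True))   # shape (1, k)
        # LoRA matrices that update the directional component
        self.A = nn.Parameter(torch.randn(rank, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, rank))

    def forward(self, x):
        V = self.W0 + self.B @ self.A                    # W0 + BA
        direction = V / V.norm(dim=0, keepdim=True)      # unit columns, ||.||_c
        W = self.m * direction                           # Eqn 4
        return F.linear(x, W, self.bias)
```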

Table 1. Speaker Dataset

Table 2. Model Configuration

Dataset used



A portion of the LibriSpeech [23] test-other data was utilized to create the dataset for the adaptation studies. Since test-other contains somewhat noisy data and is more appropriate for real-world situations, we have chosen it over test-clean. The utterances are grouped by speaker, and the speakers are sorted in decreasing order of their total number of utterances. Four speakers are selected from the top of the list. These four speakers' audio data spans five to ten minutes, with 62 to 130 utterances each. These speaker data sets are divided into train, valid, and test sets. Table 1 presents these details, along with the speaker ID and total duration for each speaker. Testing of the model is performed on the speaker-specific test data (denoted as Spk X).
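As a rough illustration of the speaker selection and splitting described above, the sketch below assumes the test-other utterances are already available as (speaker_id, utterance_id) pairs; the shuffling and the 80/10/10 split ratio are assumptions, since the exact split is not stated here.

```python
from collections import defaultdict
import random

def select_top_speakers(utterances, num_speakers=4):
    """utterances: iterable of (speaker_id, utt_id) tuples from test-other."""
    by_speaker = defaultdict(list)
    for spk, utt in utterances:
        by_speaker[spk].append(utt)
    # sort speakers by utterance count, descending, and keep the top few
    ranked = sorted(by_speaker.items(), key=lambda kv: len(kv[1]), reverse=True)
    return dict(ranked[:num_speakers])

def split_speaker_utts(utts, seed=0):
    """Split one speaker's utterances into train/valid/test (assumed 80/10/10)."""
    utts = list(utts)
    random.Random(seed).shuffle(utts)
    n = len(utts)
    n_train, n_valid = int(0.8 * n), int(0.1 * n)
    return {"train": utts[:n_train],
            "valid": utts[n_train:n_train + n_valid],
            "test":  utts[n_train + n_valid:]}
```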

Experimental details



The LoRA and DoRA weights, as explained in Section 2, are applied to the attention layer of each of the 16 conformer blocks in the causal encoder. For each layer, the $W_0$ matrix is of size 256 x 256, since the encoder cell size is 256. The $\Delta W$ is decomposed into the LoRA matrices. The total number of parameters (#Params) added in each LoRA experiment can be obtained using Equation 5, where N is the number of encoder layers, m is the number of matrices using LoRA weights in each encoder layer, r is the LoRA rank, and C is the encoder cell size (shown in Table 2).

$$\text{\#Params} = N \cdot m \cdot (\text{\#Params of } A, B) = N \cdot m \cdot (2 \cdot C \cdot r) \quad (5)$$

Similarly, for the DoRA-based method, the total number of parameters can be found using Equation 6, where N is the number of encoder layers, m is the number of matrices using DoRA weights in each encoder layer, r is the DoRA rank, and C is the encoder cell size (shown in Table 2).

$$\text{\#Params} = N \cdot m \cdot (\text{\#Params of } A, B, \|V\|_c) = N \cdot m \cdot (2 \cdot C \cdot r + C) \quad (6)$$
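The small helper below simply evaluates Equations 5 and 6 with the configuration described here (N = 16 encoder layers, m = 2 adapted matrices per layer, C = 256); at rank 8 it reproduces the roughly 139k DoRA trainable parameters quoted in the results section.

```python
def lora_params(N=16, m=2, C=256, r=8):
    return N * m * (2 * C * r)          # Eqn 5: A and B per adapted matrix

def dora_params(N=16, m=2, C=256, r=8):
    return N * m * (2 * C * r + C)      # Eqn 6: A, B plus the magnitude vector

print(lora_params(r=8))   # 131072  (~131k)
print(dora_params(r=8))   # 139264  (~139k, as reported for rank 8 below)
```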
Two fine-tuned models are considered when comparing with the proposed approach. In the first case, the entire model is fine-tuned using the speaker-specific data and is denoted as FT. In the second case, only the attention layer is fine-tuned (FTA), as mentioned in [7]. FTA has been shown to perform very well by fine-tuning just the attention module.

Table 3. Comparison of WER (lower is better) for each speaker using the base model, the fine-tuned models, and the models of the proposed approach with different ranks. AVG refers to the average WER across the speakers. RWER refers to the relative WER reduction with respect to the Base model.

Results and Discussion



To find the ideal value, a range of rank (r) values is explored. Given that the encoder weight matrices are 256 x 256, rank values that are powers of 2 are selected for the experiments. Following the original LoRA work [12], the Wq (query) and Wv (value) matrices of the attention layer are selected for adaptation in this set of experiments.
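A hypothetical sketch of this rank sweep is shown below, reusing the LoRALinear class sketched in Section 2.1. The attribute names (encoder.layers, self_attn.linear_q, self_attn.linear_v) and the loader function are illustrative placeholders, not the real model API.

```python
def add_lora_to_attention(encoder, rank):
    for layer in encoder.layers:                         # 16 conformer blocks
        attn = layer.self_attn
        attn.linear_q = LoRALinear(attn.linear_q, rank=rank)   # W_q
        attn.linear_v = LoRALinear(attn.linear_v, rank=rank)   # W_v

for rank in [1, 2, 4, 8, 16, 32]:                        # ranks compared in Table 3
    model = load_pretrained_cascaded_conformer()         # placeholder loader
    for p in model.parameters():                         # freeze the base model
        p.requires_grad = False
    add_lora_to_attention(model.causal_encoder, rank)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"rank={rank}: {trainable} trainable parameters")
```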

The outcomes of the experiments are displayed in Table 3. The first column of Table 3 indicates the model name.

For each of the proposed methods, the number of trainable parameters in each experiment is also indicated in the first column; the WER for each speaker is displayed in the remaining columns. The average and relative WER across speakers are displayed in the final two columns. In the third model, FTA:qv, only the query and value matrices of the attention layer (2M parameters) are fine-tuned, in a manner similar to [7]. The remaining rows represent the proposed models; several rank values are tested in order to determine the ideal rank. The number of parameters rises in tandem with the rank. For all rank values over 64, the number of trainable parameters is more than 1M, i.e., more than that of FTA:qv. Thus, only rank values 1 through 32 are taken into account for comparison in order to ensure fairness.

Table 3 shows that the baseline model, Base, has an average WER of 14.75%. This model has no trainable parameters because it has not been altered. The second model, denoted FT, is the fully fine-tuned model; here, all 27 million model parameters are optimized. The results demonstrate that this model performs the best. However, training 27M parameters on an edge device is difficult. For rank values between 4 and 32, we can observe that the proposed method performs better than FTA:qv. With a 22% relative improvement over the Base model, rank 8 outperforms the others, while the previous state-of-the-art method, FTA:qv, could only achieve a 14% relative improvement. Furthermore, the proposed method has only 139k trainable parameters, whereas FTA:qv has 2M. The proposed approach is thus more effective, achieving better results with less than 10% of the trainable parameters of FTA:qv.

The LoRA method achieves a relative improvement of 10%, while DoRA achieves 20%, which is better than FTA:qv and very close to the fully fine-tuned model (FT). The experiments clearly show that the introduction of the normalisation vector in DoRA helps achieve better performance. The relative WER reduction of DoRA is almost double that of its LoRA counterpart, with minimal difference in trainable parameters. This observation is similar to that of [19].

Conclusion



Fine-tuning ASR models customized for individual speakers has proven to be a challenging task. In this study, we have shown the effectiveness of low-rank adaptation-based approaches for model fine-tuning in a compute- and parameter-efficient way. With a very small amount of speaker data, the baseline model is improved by up to 22% in relative word error rate. The proposed approach works better than the prior state-of-the-art technique in [7]. The original model can also be preserved while using this strategy, and better accuracy can be achieved with very little training infrastructure.

References

[1] Y. Zhao, “An acoustic-phonetic-based speaker adaptation technique for improving speaker-independent continuous speech recognition,” IEEE Transactions on Speech and Audio Processing, 1994.

[2] K. Shinoda, “Speaker adaptation techniques for automatic speech recognition,” APSIPA ASC, 2011.

[3] A. Baby, N. NL, A. L. Thomas, and H. A. Murthy, “A unified parser for developing Indian language text to speech synthesizers,” in TSD, 2016.

[4] A. Baby, P. Jawale, S. Vinnaitherthan, S. Badam, N. Adiga, and S. Adavane, “Non-native English lexicon creation for bilingual speech synthesis,” in SSW 11, 2021.

[5] K. Tomanek, V. Zayats, D. Padfield, K. Vaillancourt, and F. Biadsy, “Residual adapters for parameter-efficient ASR adaptation to atypical and accented speech,” in EMNLP 2021. [Online]. Available: https://aclanthology.org/2021.emnlp-main.541

[6] J. Jia, J. Mahadeokar, W. Zheng, Y. Shangguan, O. Kalinli, and F. Seide, “Federated Domain Adaptation for ASR with Full Self-Supervision,” in Interspeech 2022.

[7] Y. Huang, G. Ye, J. Li, and Y. Gong, “Rapid Speaker Adaptation for Conformer Transducer: Attention and Bias Are All You Need,” in Interspeech 2021.

[8] Q. W. et al., “VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking,” in Interspeech 2019.

[9] A. G. et al., “Voice filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module,” in ICASSP 2022.

[10] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, “Visual prompt tuning,” in ECCV, 2022.

[11] S. Chen, C. Ge, Z. Tong, J. Wang, Y. Song, J. Wang, and P. Luo, “Adaptformer: Adapting vision transformers for scalable visual recognition,” NeurIPS, vol. 35, 2022.

[12] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in ICLR, 2022. [Online]. Available: https://openreview.net/forum?id=nZeVKeeFYf9

[13] D. Lian, D. Zhou, J. Feng, and X. Wang, “Scaling & shifting your features: A new baseline for efficient model tuning,” in NeurIPS, 2022. [Online]. Available: https://proceedings.neurips.cc/paper

[14] G. Luo, M. Huang, Y. Zhou, X. Sun, G. Jiang, Z. Wang, and R. Ji, “Towards efficient visual adaption via structural reparameterization,” arXiv, 2023.

[15] C. Li, H. Farkhoor, R. Liu, and J. Yosinski, “Measuring the intrinsic dimension of objective landscapes,” in ICLR, 2018.

[16] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019.

[17] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv, 2019.

[18] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” NeurIPS, 2020.

[19] S.-Y. Liu, C.-Y. Wang, H. Yin, P. Molchanov, Y.-C. F. Wang, K.-T. Cheng, and M.-H. Chen, “DoRA: Weight-decomposed low-rank adaptation,” 2024.

[20] B. Li, A. Gulati, J. Yu, T. N. Sainath, C.-C. Chiu, A. Narayanan, S.-Y. Chang, R. Pang, Y. He, J. Qin et al., “A better and faster end-to-end model for streaming ASR,” in ICASSP 2021.

[21] Y. Huang, J. Li, L. He, W. Wei, W. Gale, and Y. Gong, “Rapid RNN-T adaptation using personalized speech synthesis and neural language generator,” in Interspeech, October 2020.