
Robust Speaker Personalisation Using Generalized Low-Rank Adaptation for Automatic Speech Recognition

By Arun Baby, Samsung R&D Institute India-Bangalore
By George Joseph, Samsung R&D Institute India-Bangalore
By Shatrughan Singh, Samsung R&D Institute India-Bangalore

The IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is an annual flagship conference organized by the IEEE Signal Processing Society.

ICASSP is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. It offers a comprehensive technical program presenting the latest developments in research and technology in the field, attracting thousands of professionals.

In this blog series, we introduce our research papers presented at ICASSP 2024; here is the list.

#1. MELS-TTS: Multi-Emotion Multi-Lingual Multi-Speaker Text-To-Speech System via Disentangled Style Tokens (Samsung Research)

#2. Latent Filling: Latent Space Data Augmentation for Zero-Shot Speech Synthesis (Samsung Research)

#3. FSPEN: An Ultra-Lightweight Network for Real Time Speech Enhancement (Samsung R&D Institute China-Beijing)

#4. Enabling Device Control Planning Capabilities of Small Language Model (Samsung Research America)

#5. Dynamic Video Frame Interpolator with Integrated Difficulty Pre-Assessment (Samsung R&D Institute China-Nanjing)

#6. Object-Conditioned Bag of Instances for Few-Shot Personalized Instance Recognition (Samsung R&D Institute United Kingdom)

#7. Robust Speaker Personalisation Using Generalized Low-Rank Adaptation for Automatic Speech Recognition (Samsung R&D Institute India-Bangalore)


Introduction

Automatic Speech Recognition (ASR) systems are a critical element in many applications, including voice assistants, transcription services, and speech-to-text technology. The performance of ASR systems, however, frequently suffers when dealing with speakers who have particular traits or when the system has not been trained on a diverse group of speakers. Speaker personalization strategies have been developed to overcome this issue by adapting ASR models to the unique qualities of each speaker. Fine-tuning ASR models typically requires a moderate amount of data, and it is challenging to gather user-specific data for model adaptation, especially given privacy concerns. Furthermore, the majority of these assistants run on-device, which makes it even more challenging to collect data for user customization. Additionally, due to the enormous number of training parameters, this fine-tuning cannot be done on the device itself.

Speaker adaptation traditionally uses acoustic and phonetic techniques to match the target speaker’s characteristics in voice recognition and synthesis [1, 2, 3, 4]. Recent approaches include fine-tuning only the attention and bias of the model [5], performing federated learning on edge devices while adapting only a subset of weights [6], and using residual adapters for fine-tuning [7]. Another branch of speaker customization uses voice filters to create a clean version of the speaker data to be consumed by ASR models [8]; this technique has been successfully employed in speech synthesis as well [9]. The audio is pre-processed to enhance the speaker characteristics and then fed to the ASR model for decoding, which helps ASR models in noisy environments. Parameter-efficient fine-tuning (PEFT) is popular for large language models (LLMs) as it requires less computational overhead. The observation that over-parameterized models have a low intrinsic dimension [10] inspired low-rank adaptation (LoRA) [11]. Being storage- and compute-efficient, LoRA is one of the most prominent PEFT methods. Several studies have explored the application of low-rank adaptation to LLMs: LoRA itself shows how to fine-tune huge models on a single GPU [11], and various LLMs such as GPT-2 [12], RoBERTa [13], and GPT-3 [14] have been fine-tuned to demonstrate its efficacy for large models. A generalised version of LoRA (GLoRA) [15] has been shown to outperform other fine-tuning methods such as VPT [16], AdaptFormer [17], LoRA [11], SSF [18], and RepAdapter [19].

Proposed Method

GLoRA-based Fine-tuning

In the GLoRA method, we add a small set of parameters to the weight matrix. The weight matrix of a layer is modified with the help of a new set of weight matrices, as shown in Figure 1 (GLoRA training). The forward pass in our method is given by Equation 1 [15].

f(x) = (W0 + W0A + B)x + CW0 + Db0 + E + b0        (1)

We hypothesize that the updates to the weights also have a low intrinsic rank during adaptation. So, we constrain the updates by representing the matrices with a low-rank decomposition; the lower the rank, the fewer parameters there are to train. Each of the added matrices A, B, and C is decomposed as in LoRA, while D and E are vectors whose size matches the encoder cell size. Suppose W0 ∈ R^(d×d); then A is decomposed into Ad ∈ R^(d×r) and Au ∈ R^(r×d), B into Bd ∈ R^(d×r) and Bu ∈ R^(r×d), C into Cd ∈ R^(d×r) and Cu ∈ R^(r×1), with D ∈ R^(d×1) and E ∈ R^(d×1), where r is the LoRA rank. We use a random Gaussian initialization for Au, Bu, Cu and zero initialization for Ad, Bd, Cd, so that Ad × Au, Bd × Bu, and Cd × Cu are zero matrices at the beginning of training. D and E are also initialized to zero.
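To make the decomposition concrete, below is a minimal NumPy sketch of Equation 1 for a single projection layer. It is not the authors' TensorFlow implementation; the variable names are ours, and the orientation of the C and D terms (prompt folded through W0, element-wise bias scaling) is an assumption chosen so the shapes stay consistent with the decomposition above.

```python
import numpy as np

d, r = 256, 8                     # encoder cell size and LoRA rank (example values)
rng = np.random.default_rng(0)

# Frozen pretrained weight and bias of one attention projection.
W0 = rng.normal(size=(d, d))
b0 = rng.normal(size=(d,))

# GLoRA factors: A, B, C are low-rank products; D, E are d-dimensional vectors.
# "Down" factors are zero-initialised so the adapted layer equals the base
# layer at the start of training; "up" factors are random Gaussian.
A_d, A_u = np.zeros((d, r)), rng.normal(size=(r, d))
B_d, B_u = np.zeros((d, r)), rng.normal(size=(r, d))
C_d, C_u = np.zeros((d, r)), rng.normal(size=(r, 1))
D = np.zeros(d)
E = np.zeros(d)

def glora_forward(x):
    """Equation 1: f(x) = (W0 + W0*A + B)x + C*W0 + D*b0 + E + b0 (our reading)."""
    A = A_d @ A_u                               # (d, d), rank <= r
    B = B_d @ B_u                               # (d, d), rank <= r
    C = C_d @ C_u                               # (d, 1)
    W = W0 + W0 @ A + B                         # adapted weight
    bias = (W0 @ C).ravel() + D * b0 + E + b0   # adapted bias / prompt terms
    return W @ x + bias

x = rng.normal(size=(d,))
print(np.allclose(glora_forward(x), W0 @ x + b0))   # True: no change before training
```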

Figure 1. Proposed method: GLoRA weights are added to the attention layer of the cascaded conformer transducer

Our main aim is to improve the performance of streaming on-device ASR models. Hence, the experiments are performed on state-of-the-art cascaded conformer transducer (ConfT) models [20]. In order to improve the streaming accuracy, GLoRA is applied only to the causal (encoder) part of the cascaded ConfT model, as proposed by [21]. Each conformer block consists of feed-forward, attention, and convolution sub-modules. Our experiments focus on augmenting the attention layer, similar to [5]. Although the GLoRA weights can be added to any layer, we apply GLoRA to Wq (query), Wk (key), and Wv (value), based on previous works showing the efficacy of adapting these projections [5, 11, 21].
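Only the GLoRA factors attached to the chosen projections are trainable; everything else in the cascaded ConfT stays frozen. The sketch below illustrates this selection by filtering parameter names. The naming scheme (causal_encoder, attention, glora_* suffixes) is purely hypothetical and not taken from the paper's codebase.

```python
# Hypothetical parameter-name filter: train only GLoRA factors attached to the
# query/key/value projections of the causal-encoder attention layers.
GLORA_SUFFIXES = ("glora_A_d", "glora_A_u", "glora_B_d", "glora_B_u",
                  "glora_C_d", "glora_C_u", "glora_D", "glora_E")
TARGET_PROJECTIONS = ("query", "key", "value")   # any subset can be chosen

def select_trainable(parameter_names, targets=TARGET_PROJECTIONS):
    """Return the parameter names that should remain trainable."""
    selected = []
    for name in parameter_names:
        in_causal_attention = "causal_encoder" in name and "attention" in name
        on_target_projection = any(p in name for p in targets)
        is_glora_factor = name.endswith(GLORA_SUFFIXES)
        if in_causal_attention and on_target_projection and is_glora_factor:
            selected.append(name)
    return selected

# Example with illustrative names only:
names = ["causal_encoder/block_0/attention/query/glora_A_d",
         "causal_encoder/block_0/attention/query/kernel",        # frozen base weight
         "non_causal_encoder/block_0/attention/value/glora_A_d"]
print(select_trainable(names))   # only the first name survives
```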

All the experiments are performed with the cascaded ConfT architecture [20]. We use an in-house codebase built on TensorFlow [22] to train and evaluate the models. All the baseline models are trained with the configuration explained in Section 3.2. For the proposed systems, training is performed by plugging the GLoRA weights into the ConfT architecture. All the experiments are run on a single GPU (Nvidia A100). All models are trained for around 100 epochs, based on convergence, and evaluated on the inference set with and without adaptation. The base model is trained on 960 hours of LibriSpeech [23] data. The details of the dataset used are shown in Table 1.

Table 1. Speaker Dataset

The dataset used for the adaptation experiments is a subset of the LibriSpeech [23] test-other data. We choose test-other instead of test-clean because the former consists of slightly noisy data, which is closer to real-world scenarios. The speaker-specific utterances are grouped together and sorted in decreasing order of the number of utterances, and 4 speakers are chosen from the top of this list. Each of these speakers has around 5-10 minutes of audio, with between 62 and 130 utterances. Each speaker's data is split into train, valid, and test sets.
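As a rough illustration of this selection (not the exact script used in the paper), the sketch below groups LibriSpeech test-other utterances by speaker, ranks the speakers by utterance count, and splits each of the top four speakers into train/valid/test portions. The directory layout follows the public LibriSpeech release; the split ratios are our assumption, since the paper does not state them.

```python
import glob
import os
import random
from collections import defaultdict

def build_speaker_splits(root="LibriSpeech/test-other", num_speakers=4,
                         ratios=(0.7, 0.1, 0.2), seed=0):
    """Pick the speakers with the most utterances and split each one's data."""
    by_speaker = defaultdict(list)
    # LibriSpeech layout: <root>/<speaker>/<chapter>/<speaker>-<chapter>-<utt>.flac
    for path in glob.glob(os.path.join(root, "*", "*", "*.flac")):
        speaker_id = path.split(os.sep)[-3]
        by_speaker[speaker_id].append(path)

    # Sort speakers by utterance count (descending) and keep the top few.
    top = sorted(by_speaker, key=lambda s: len(by_speaker[s]), reverse=True)[:num_speakers]

    rng = random.Random(seed)
    splits = {}
    for spk in top:
        utts = sorted(by_speaker[spk])
        rng.shuffle(utts)
        n_train = int(ratios[0] * len(utts))
        n_valid = int(ratios[1] * len(utts))
        splits[spk] = {"train": utts[:n_train],
                       "valid": utts[n_train:n_train + n_valid],
                       "test": utts[n_train + n_valid:]}
    return splits
```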

The training of the model is performed similarly to [20, 26]. 80-dimensional mel filter-bank features are used, with stacking of 4 and skipping of 2 frames. A dropout factor of 0.1 is used while training the model. A kernel size of 5 is used, and the conformer block follows the feed-forward - convolution - attention - feed-forward structure shown in Figure 1.
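The front-end turns the 80-dimensional log-mel frames into stacked, subsampled feature vectors. Below is one common reading of "stacking of 4 and skipping of 2" as a small NumPy sketch; the exact front-end used in the paper may differ in details such as padding.

```python
import numpy as np

def stack_and_skip(feats, stack=4, skip=2):
    """Concatenate `stack` consecutive frames and advance by `skip` frames,
    turning a (T, 80) log-mel sequence into a (T', 80 * stack) sequence."""
    stacked = [feats[t:t + stack].reshape(-1)
               for t in range(0, feats.shape[0] - stack + 1, skip)]
    return np.stack(stacked)

mel = np.random.randn(100, 80)        # 100 frames of 80-dim mel filter-bank features
print(stack_and_skip(mel).shape)      # (49, 320)
```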

The model configuration used for the proposed approach is detailed in Table 2. The model is a modified version of the cascaded ConfT architecture in which the decoder is replaced with the tied and reduced [25] architecture: the LSTM-based decoder (prediction network) is replaced with simple operations and tied with the joint network to reduce the number of parameters. Our focus is mainly on the streaming part, so only the causal encoder is considered in all the experiments. The GLoRA weights, as explained in Section 2, are applied to the attention layer of each of the 16 conformer blocks in the causal encoder. The number of GLoRA parameters added ranges from 58K for rank 1 to 1M for rank 32. For each layer, the W0 matrix is of size 256 x 256, since the encoder cell size is 256, and the ΔW update is decomposed into the GLoRA matrices.
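A back-of-the-envelope check of these parameter counts, assuming d = 256, GLoRA on the query and value projections of all 16 causal-encoder blocks, and D and E kept as d-dimensional vectors. Under these assumptions the totals come out close to the 58K (rank 1) and 345K (rank 8) figures quoted; this is our reconstruction, not the authors' exact accounting.

```python
D_MODEL = 256      # encoder cell size: each W0 is 256 x 256
LAYERS = 16        # conformer blocks in the causal encoder
PROJECTIONS = 2    # GLoRA on the query and value projections

def glora_param_count(rank, d=D_MODEL):
    """Trainable GLoRA parameters, following the decomposition in Section 2."""
    per_projection = (2 * d * rank        # A = Ad (d x r) and Au (r x d)
                      + 2 * d * rank      # B = Bd (d x r) and Bu (r x d)
                      + d * rank + rank   # C = Cd (d x r) and Cu (r x 1)
                      + 2 * d)            # D and E vectors
    return LAYERS * PROJECTIONS * per_projection

for rank in (1, 8, 32):
    print(rank, glora_param_count(rank))   # ~57K, ~344K, ~1.33M
```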

Table 2. Model configuration

Experiments

Rank (r) selection for the proposed approach

Different values of the rank (r) are tried to obtain the optimal value. Rank values ranging from 1 to 256, in powers of 2, are chosen for the experiments, since the encoder cell size is 256. For this set of experiments, both the Wq (query) and Wv (value) matrices of the attention layer are adapted, following the original LoRA work [11]. The results are shown in Table 3. The first column in Table 3 gives the model name, the second column the number of trainable parameters in each experiment, and the remaining columns the WER for each speaker; the last two columns give the average and relative WER across all speakers. The baseline model (Base) has an average WER of 14.75%. This model is not fine-tuned, so the number of trainable parameters is 0. The second model, denoted FT, is the fully fine-tuned model, in which all 27M parameters are updated. The results show that this model performs best; however, training 27M parameters on an edge device is difficult, and unlike the proposed approach there is no flexibility to switch the speaker fine-tuned weights on only when needed. In the case of the third model, FTA:qv, only the query and value matrices of the attention layer (2M parameters) are fine-tuned, similar to [5]. The remaining rows are for the proposed models, where different rank values are tried to find the optimal rank. As the rank increases, the number of parameters also increases. For all rank values above 64, the number of trained parameters is equal to or greater than the 2M of FTA:qv, so for a fair comparison only rank values 1 to 32 are considered. Nevertheless, Table 3 shows a significant improvement in RWER for the proposed approach when the rank is greater than 64, and since the model can be updated by folding the GLoRA weights into the attention weights, higher-rank training can be employed depending on the use case.
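The last point, folding the GLoRA weights back into the attention weights, can be sketched as follows: after adaptation, the low-rank factors collapse into a single weight/bias pair of the original shape, so the personalised projection adds no inference-time cost and can be swapped in or out per speaker. As in the earlier sketch, the orientation of the C and D terms is our assumption.

```python
import numpy as np

d, r = 256, 8
rng = np.random.default_rng(0)

# Frozen base projection and (stand-in) trained GLoRA factors.
W0, b0 = rng.normal(size=(d, d)), rng.normal(size=(d,))
A = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))
B = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))
C = rng.normal(size=(d, r)) @ rng.normal(size=(r, 1))
D, E = rng.normal(size=(d,)), rng.normal(size=(d,))

# Fold the adaptation into one weight and one bias of the original shape.
# Keeping W0/b0 around lets the device switch personalisation on or off.
W_merged = W0 + W0 @ A + B
b_merged = (W0 @ C).ravel() + D * b0 + E + b0

x = rng.normal(size=(d,))
unmerged = (W0 + W0 @ A + B) @ x + (W0 @ C).ravel() + D * b0 + E + b0
print(np.allclose(W_merged @ x + b_merged, unmerged))   # True
```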

Table 3. Comparison of WER for each speaker using the base model, the fine-tuned models, and models of the proposed approach with different ranks. AVG refers to the average WER across the speakers; RWER refers to the relative WER reduction with respect to the Base model.

For the proposed approach, the performance is better than FTA:qv for rank values from 4 to 32. Among these, rank 8 performs the best, obtaining an 18% relative improvement over the Base model, whereas the previous state-of-the-art method FTA:qv obtains only a 14% relative improvement. Moreover, the number of trained parameters in the proposed approach is only 345K, compared to the 2M parameters of FTA:qv. With less than 20% of the trainable parameters of FTA:qv, the proposed approach obtains better results, showing the efficacy of our method.
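For reference, the relative WER reduction (RWER) quoted in the tables is computed against the Base average; plugging in the figures above gives the implied adapted WERs (a rough check using the reported averages, not values taken from Table 3):

```python
base_avg_wer = 14.75                                # Base model average WER (%)

def rwer(adapted_wer, base_wer=base_avg_wer):
    """Relative WER reduction with respect to the Base model, in percent."""
    return 100.0 * (base_wer - adapted_wer) / base_wer

# An 18% RWER (rank-8 GLoRA) and a 14% RWER (FTA:qv) correspond roughly to:
print(base_avg_wer * (1 - 0.18))   # ~12.1% average WER
print(base_avg_wer * (1 - 0.14))   # ~12.7% average WER
```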

Query, Key, and Value projection experiments

In order to understand the contribution of the different sub-modules of the attention layer, we performed experiments on the Query, Key, and Value matrices. Each of these matrices, and their combinations, is fine-tuned with GLoRA and tested on all the speaker data; for the proposed approach, the rank is kept at 8 in this set of experiments. We also fine-tuned the key, value, and query matrices with the approach of [5], denoted FTA:kv (key, value) and FTA:qv (query, value). As seen in Table 4, our proposed approach outperforms the method of [5].

Table 4. Comparison of WER for each speaker using the base model, the fine-tuned models, and models of the proposed approach with Key (K), Query (Q), and Value (V) combinations for rank 8. AVG refers to the average WER across the speakers; RWER refers to the relative WER reduction with respect to the Base model.

As observed in Table 4, fine-tuning only the Key projection (K) or the Query projection (Q) degrades the performance of the system, and even the KQ combination degrades it. This observation differs slightly from [5], which could be due to the difference in datasets. However, fine-tuning only the Value projection (V) improves the performance of the model significantly, similar to [5]. The Value matrix contributes much more to model performance than the Query and Key matrices, in line with the observation in [5]. The combinations that include the Value projection (KV, QV) perform significantly better than those without it. Fine-tuning all three (KQV) performs as well as the KV and QV combinations, but KQV has many more trainable parameters. In all cases, our proposed model performs better than the corresponding model in [5].

Conclusion

Fine-tuning ASR models to customise them for individual speakers has been a difficult challenge. In this work, we have demonstrated the efficiency of GLoRA-based methods for fine-tuning the model in a parameter- and compute-efficient manner. An average relative word error rate improvement of 20% over the baseline model is obtained with a very limited amount of speaker data, and the proposed method outperforms the previous state-of-the-art method of [5]. The method can be used with minimal training infrastructure while keeping the original model intact.

Bibliography

[1] Yunxin Zhao, “An acoustic-phonetic-based speaker adaptation technique for improving speaker-independent continuous speech recognition,” IEEE Transactions on Speech and Audio Processing, 1994.

[2] Koichi Shinoda, “Speaker adaptation techniques for automatic speech recognition,” APSIPA ASC, 2011.

[3] Arun Baby, Nishanthi NL, Anju Leela Thomas, and Hema A Murthy, “A unified parser for developing indian language text to speech synthesizers,” in TSD, 2016.

[4] Arun Baby, Pranav Jawale, Saranya Vinnaitherthan, Sumukh Badam, Nagaraj Adiga, and Sharath Adavane, “Non-native english lexicon creation for bilingual speech synthesis,” in SSW 11, 2021.

[5] Yan Huang, Guoli Ye, Jinyu Li, and Yifan Gong, “Rapid Speaker Adaptation for Conformer Transducer: Attention and Bias Are All You Need,” in Interspeech 2021.

[6] Junteng Jia, Jay Mahadeokar, Weiyi Zheng, Yuan Shangguan, Ozlem Kalinli, and Frank Seide, “Federated Domain Adaptation for ASR with Full Self-Supervision,” in Interspeech 2022.

[7] Katrin Tomanek, Vicky Zayats, Dirk Padfield, Kara Vaillancourt, and Fadi Biadsy, “Residual adapters for parameter-efficient ASR adaptation to atypical and accented speech,” in EMNLP 2021.

[8] Quan Wang et al., “VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking,” in Interspeech 2019.

[9] Adam Gabryś et al., “Voice filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module,” in ICASSP 2022.

[10] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski, “Measuring the intrinsic dimension of objective landscapes,” in ICLR, 2018.

[11] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen, “LoRA: Low-rank adaptation of large language models,” in ICLR, 2022.

[12] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, “Language models are unsupervised multitask learners,” 2019.

[13] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv, 2019.

[14] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al., “Language models are few-shot learners,” NeurIPS, 2020.

[15] Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, and Zhiqiang Shen, “One-for-all: Generalized lora for parameter-efficient fine-tuning,” 2023.

[16] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim, “Visual prompt tuning,” in ECCV, 2022.

[17] Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo, “Adaptformer: Adapting vision transformers for scalable visual recognition,” NeurIPS, vol. 35, 2022.

[18] Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang, “Scaling & shifting your features: A new baseline for efficient model tuning,” in NeurIPS, 2022.

[19] Gen Luo, Minglang Huang, Yiyi Zhou, Xiaoshuai Sun, Guannan Jiang, Zhiyu Wang, and Rongrong Ji, “Towards efficient visual adaption via structural re-parameterization,” arXiv, 2023.

[20] Bo Li, Anmol Gulati, Jiahui Yu, Tara N Sainath, Chung-Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin, et al., “A better and faster end-to-end model for streaming asr,” in ICASSP 2021.

[21] Yan Huang, Jinyu Li, Lei He, Wenning Wei, William Gale, and Yifan Gong, “Rapid rnn-t adaptation using personalized speech synthesis and neural language generator,” in Interspeech, October 2020.

[22] Martín Abadi, Paul Barham, Jianmin Chen, et al., “TensorFlow: A system for large-scale machine learning,” in OSDI, 2016.

[23] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in ICASSP, 2015.

[24] Yusuxke Shibata et al., “Byte pair encoding: A text compression scheme that accelerates pattern matching,” 1999.

[25] Rami Botros, Tara N. Sainath, Robert David, Emmanuel Guzman, Wei Li, and Yanzhang He, “Tied Reduced RNN-T Decoder,” in Interspeech, 2021.

[26] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Interspeech, 2020.

Link to the paper

https://ieeexplore.ieee.org/document/10446630