
[INTERSPEECH 2024 Series #3] High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model

By Joun Yeop Lee, Samsung Research
By Ji-Hyun Lee, Samsung Research
By Hoon-Young Cho, Samsung Research

Interspeech is the world’s leading conference on the science and technology of speech recognition, speech synthesis, speaker recognition and speech and language processing.

The conference plays a crucial role in setting new technology trends and standards, as well as providing direction for future research.

In this blog series, we introduce some of our research papers presented at INTERSPEECH 2024; the full list is below.

#1. Relational Proxy Loss for Audio-Text based Keyword Spotting (Samsung Research)

#2. NL-ITI: Probing optimization for improvement of LLM intervention method (Samsung R&D Institute Poland)

#3. High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model (Samsung Research)

#4. Speaker personalization for automatic speech recognition using Weight-Decomposed Low-Rank Adaptation (Samsung R&D Institute India-Bangalore)

#5. Speech Boosting: developing an efficient on-device live speech enhancement (Samsung Research)

#6. SummaryMixing makes speech technologies faster and cheaper (Samsung AI Center - Cambridge)

#7. A Unified Approach to Multilingual Automatic Speech Recognition with Improved Language Identification for Indic Languages (Samsung R&D Institute India-Bangalore)

Introduction



Recently, there has been a notable shift in Text-to-Speech (TTS) research towards the adoption of discrete speech tokens as intermediate features [1-3], which opens up diverse options for model architectures and inference strategies.

Incorporating discrete tokens into TTS offers numerous advantages over conventional speech modeling in the continuous domain. Above all, targeting discrete tokens simplifies the representation of one-to-many mappings by allowing a categorical distribution over the output space, sidestepping the complex challenges posed by generative modeling in the continuous domain. Moreover, the discrete output space enables the integration of various specialized schemes. Particularly noteworthy is its ability to leverage recent advances from large language models (LLMs), such as masked language models (MLMs) [3, 4]. Furthermore, the discrete output space facilitates robust alignment modeling, simplifying the adoption of transducers [5, 6] within the TTS framework.

In this post, we introduce our work, a high-fidelity TTS framework designed to optimize the use of semantic and acoustic tokens, leveraging the benefits of discrete tokenization. The framework follows a two-stage procedure: converting text into semantic tokens (Interpreting) and then converting semantic tokens into acoustic tokens (Speaking), as shown in Figure 1.

Figure 1. Overall architecture of the proposed model

Proposed method

1. Discrete Tokens

Figure 2. Overall procedure of tokenization

Discrete tokens in TTS are broadly categorized into two types: semantic tokens and acoustic tokens. Semantic tokens are typically derived through quantization applied to speech features containing contextualized linguistic details. These quantized speech features are sourced from various speech encoders like self-supervised speech models [7, 8]. Semantic tokens alleviate the complexities arising from acoustic diversity, thereby enabling a sharper focus on semantic content critical for enhancing intelligibility. In contrast, acoustic tokens represent codewords generated by neural codecs [9-11], which have witnessed significant advancements in recent years. These tokens encapsulate acoustic details of raw waveforms, serving as alternatives to traditional frame-level acoustic features such as mel-spectrograms.

As semantic tokens, we use the index sequence obtained by k-means clustering of wav2vec 2.0 embeddings, akin to [5]. According to [5], semantic tokens mainly capture phonetic information, but they also carry some prosodic information such as speech rate and the overall pitch contour. For acoustic token extraction, we leverage HiFi-Codec [11], which employs Group-Residual Vector Quantization (G-RVQ) to derive multiple streams of discrete token sequences. Specifically, we use a bi-group, bi-depth G-RVQ configuration to balance performance and computational efficiency.
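As a concrete illustration, the snippet below sketches how such semantic tokens could be extracted with a Hugging Face wav2vec 2.0 checkpoint and scikit-learn k-means. The checkpoint name, layer index, and clustering details are illustrative assumptions rather than the exact recipe of the paper (which clusters wav2vec 2.0-XLSR embeddings with k = 512, as described in the experiment settings).

```python
# Hedged sketch: semantic-token extraction by k-means over wav2vec 2.0 features.
# Checkpoint name, layer index, and k are illustrative assumptions.
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_NAME = "facebook/wav2vec2-xls-r-300m"   # assumed checkpoint
LAYER = 12                                    # assumed intermediate layer
K = 512                                       # number of k-means clusters

extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
encoder = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def frame_features(wav: torch.Tensor, sr: int) -> torch.Tensor:
    """Return frame-level embeddings (T, D) from one intermediate layer."""
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    inputs = extractor(wav.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
    hidden = encoder(**inputs, output_hidden_states=True).hidden_states[LAYER]
    return hidden.squeeze(0)                  # (T, D)

# 1) Fit k-means offline on features pooled from the training corpus;
#    `train_feats` would be an (N, D) matrix gathered from many utterances.
# kmeans = KMeans(n_clusters=K, n_init=10).fit(train_feats)

# 2) Semantic tokens for a new utterance = nearest-centroid indices per frame.
def semantic_tokens(wav: torch.Tensor, sr: int, kmeans: KMeans):
    feats = frame_features(wav, sr).cpu().numpy()
    return kmeans.predict(feats)              # (T,) integer token sequence
```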

2. Interpreting

Figure 3. Overall architecture of Interpreting

In the Interpreting stage, we generate a semantic token sequence from the input text, addressing the alignment between text and semantic tokens as well as controlling the prosody embedded in the semantic tokens. While attention-based language models are often considered for this seq2seq translation task, they cannot exploit the inherently monotonic nature of the alignment, making them susceptible to misalignment issues and requiring computationally intensive key-query matching. Instead, we adopt a transducer, referred to as Token Transducer++, a modified version of the Token Transducer of [5]. Transducers are architectures specifically designed for discrete seq2seq modeling under a monotonic alignment constraint, realized through an alignment lattice and a special blank token that indicates the transition to the next input frame.
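To make the role of the blank token concrete, here is a minimal greedy-decoding sketch of a generic token transducer (the module interfaces are ours, not the paper's code): a non-blank emission appends a semantic token and advances the autoregressive prediction network, while a blank emission moves the read head to the next text frame.

```python
# Hedged sketch of greedy transducer decoding. The prediction-network and
# joint-network interfaces (init_state, start_symbol, step) are assumptions.
import torch

BLANK = 0  # assumed index of the special blank token

@torch.no_grad()
def greedy_transducer_decode(text_enc, prediction_net, joint_net, max_tokens=2000):
    """text_enc: (T_text, D) text-encoder output for one utterance."""
    tokens = []                               # generated semantic tokens
    pred_state = prediction_net.init_state()  # assumed state API (e.g. LSTM state)
    pred_out = prediction_net.start_symbol()  # embedding of <sos>, shape (D,)
    t = 0                                     # read head over the text frames
    while t < text_enc.size(0) and len(tokens) < max_tokens:
        logits = joint_net(text_enc[t], pred_out)      # (vocab + 1,) incl. blank
        k = int(logits.argmax())
        if k == BLANK:
            t += 1                            # blank: move to the next text frame
        else:
            tokens.append(k)                  # non-blank: emit a semantic token
            pred_out, pred_state = prediction_net.step(k, pred_state)
    return tokens
```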

However, transducers have two drawbacks: (1) the autoregressive nature of the joint network presents a significant computational bottleneck during inference, and (2) the joint network's frame-wise computation neglects temporal context. To overcome these problems, compared to the original Token Transducer, we greatly reduce the size of the joint network and inject the reference embedding into the prediction network instead of the joint network, so that the reference information is processed with temporal context. These simple modifications not only boost inference speed but also enhance overall performance.
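As a rough sketch of these modifications (layer sizes and interfaces are illustrative, not the paper's exact architecture), the reference embedding can be added to every input step of the prediction network, while the joint network shrinks to a small feed-forward projection:

```python
# Hedged sketch of the Token Transducer++ modification: the reference embedding
# conditions the prediction network, and the joint network is kept small.
import torch
import torch.nn as nn

class SmallJointNet(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Tanh(),
                                  nn.Linear(d_model, vocab_size + 1))  # +1 for blank

    def forward(self, text_frame, pred_frame):
        return self.proj(torch.cat([text_frame, pred_frame], dim=-1))

class PredictionNet(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size + 1, d_model)
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, prev_tokens, reference_emb):
        # The reference embedding is added at every step of the prediction-network
        # input, so it is processed with temporal context rather than frame-wise.
        x = self.embed(prev_tokens) + reference_emb.unsqueeze(1)
        out, _ = self.rnn(x)
        return out
```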

3. Speaking

Figure 4. Overall architecture of Speaking

The Speaking stage aims to translate semantic tokens produced in the Interpreting stage into acoustic tokens, utilizing a prompt for acoustic guidance. We tackle this pre-aligned seq2seq task via Masked Language Modeling (MLM), which effectively incorporates prompt speaker information through in-context learning.

Given the characteristics of RVQ-based acoustic tokens, MLM approaches have been proposed that capture both temporal and RVQ-level-wise conditional dependencies [3]. Among these methods, we employ the Group-MLM (G-MLM) approach, which is specifically designed for G-RVQ acoustic tokens. By simultaneously masking tokens at the same level across different groups, it additionally captures group-wise conditional dependency. Such a masking strategy makes the masked tokens easier for the model to predict, ultimately facilitating efficient decoding. The inference procedure that mirrors this masking strategy is illustrated in Figure 4 (a). First, coarse-grained acoustic tokens from the different groups are obtained through several rounds of iterative parallel decoding. Then, fine-grained acoustic tokens are predicted all at once. Note that this sampling scheme, named Group-Iterative Parallel Decoding (G-IPD), improves audio quality even with a small number of iterations.
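The following is a simplified sketch of G-IPD-style inference, written in the spirit of MaskGIT-like confidence-based decoding; the masking schedule, confidence criterion, and model interface are our assumptions rather than the paper's exact procedure.

```python
# Hedged sketch of Group-Iterative Parallel Decoding (G-IPD): the coarse RVQ
# level of both groups is filled in over a few confidence-based iterations,
# then the fine level of both groups is predicted in a single pass.
# `model` is assumed to return one (T, V) logit matrix per group at a given level.
import math
import torch

MASK = -1  # assumed sentinel for masked positions

@torch.no_grad()
def g_ipd_decode(model, semantic_tokens, prompt, T, n_groups=2, n_iters=8):
    # tokens[g][l]: (T,) acoustic-token sequence for group g, RVQ level l.
    tokens = [[torch.full((T,), MASK, dtype=torch.long) for _ in range(2)]
              for _ in range(n_groups)]

    # Stage 1: coarse level (l = 0) of all groups, iterative parallel decoding.
    for it in range(n_iters):
        logits = model(semantic_tokens, prompt, tokens, level=0)
        for g in range(n_groups):
            conf, pred = logits[g].softmax(-1).max(-1)
            still_masked = tokens[g][0] == MASK
            # Cosine schedule: how many positions stay masked after this step.
            n_masked_next = int(T * math.cos(math.pi / 2 * (it + 1) / n_iters))
            n_unmask = int(still_masked.sum()) - n_masked_next
            if n_unmask <= 0:
                continue
            conf = conf.masked_fill(~still_masked, -float("inf"))  # masked slots only
            chosen = conf.argsort(descending=True)[:n_unmask]
            tokens[g][0][chosen] = pred[chosen]

    # Stage 2: fine level (l = 1) of all groups, predicted all at once.
    logits = model(semantic_tokens, prompt, tokens, level=1)
    for g in range(n_groups):
        tokens[g][1] = logits[g].argmax(-1)
    return tokens
```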

As illustrated in Figure 4 (b), the Speaking part comprises a prompt network and a generator based on a cross-attention mechanism. Each module follows the architecture described in [4]. The prompt network and generator are based on a bidirectional conformer structure [12] to learn the underlying contextual information. The prompt embedding produced by the prompt network serves as the key and value, while the aggregated embeddings of the semantic tokens and the partially masked acoustic tokens serve as the query. This cross-attention-based architecture has proven efficient in terms of computational cost and inference speed, as it allows the key and value to be cached during iterative decoding.
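The computational benefit of this design can be seen in a small sketch: the prompt network runs once, and its output is reused as the cross-attention key and value for every decoding iteration, while only the query changes (module names below are ours, not the paper's).

```python
# Hedged sketch of key/value caching with cross-attention: the prompt network
# output is computed once and reused across all G-IPD iterations; only the
# query (semantic + partially masked acoustic embeddings) is refreshed.
import torch
import torch.nn as nn

d_model = 512
cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

@torch.no_grad()
def decode_with_cached_prompt(prompt_net, generator_step, prompt_tokens, n_iters=8):
    kv = prompt_net(prompt_tokens)            # (B, T_prompt, D), computed once
    query = generator_step.initial_query()    # assumed: (B, T_target, D)
    for _ in range(n_iters):
        ctx, _ = cross_attn(query, kv, kv)    # reuse cached key/value
        query = generator_step.update(query, ctx)  # assumed: refine masked tokens
    return query
```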

4. Discussion on Framework Design

The division into two stages offers several distinct advantages. Firstly, each stage can concentrate on a different aspect of the TTS objective. The Interpreting stage can prioritize alignment modeling and linguistic accuracy, whereas the Speaking stage can focus on high fidelity and on handling the one-to-many mappings caused by acoustic diversity. We design architectures tailored to each role, considering both speech quality and inference efficiency. Additionally, from a data perspective, the Interpreting stage relies on paired text and speech data, whereas the Speaking stage does not require text annotations. This flexibility enables us to leverage unlabeled data, which is far more abundant, resulting in higher speech quality and the ability to represent diverse acoustic conditions. Moreover, regarding zero-shot adaptation, we can control each stage separately, as they govern different aspects of paralinguistic information. The Interpreting stage manages speech rate and global prosodic dynamics, while the Speaking stage handles timbre and acoustic attributes. This separation lets us independently control each component with different speech prompts, thereby enhancing overall controllability across a broader spectrum.

Experiments

1. Experiment settings

Dataset: We conducted zero-shot TTS experiments using the LibriTTS [13] corpus. We used all of the training subsets (train-clean-100, train-clean-360, train-other-500) for training and the test-clean subset for evaluation.

Tokens: We employed semantic tokens obtained through k-means clustering on the official wav2vec 2.0-XLSR model [8], with k set to 512. Acoustic tokenization was conducted using the official pre-trained HiFi-Codec [11], which operates on 24 kHz speech with a 320× down-sampling rate.
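As a quick sanity check on these settings, the implied acoustic token rates work out as follows:

```python
# Acoustic token rate implied by the HiFi-Codec settings above.
sample_rate = 24_000        # Hz
downsample = 320            # codec hop size (samples per codec frame)
groups, depth = 2, 2        # bi-group, bi-depth G-RVQ

frames_per_sec = sample_rate / downsample           # 75 codec frames per second
tokens_per_sec = frames_per_sec * groups * depth    # 300 acoustic tokens per second
print(frames_per_sec, tokens_per_sec)               # 75.0 300.0
```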

Baselines: We built three baseline models for comparison: VITS, VALLE-X [14], and Kim et al. [5]. To adapt the baseline VITS to the zero-shot adaptive scenario, we incorporated the ECAPA-TDNN [15] structure as a reference encoder. We used the open-source implementation for VALLE-X.

Evaluation Metrics: For objective evaluations, we assessed the character error rate (CER) using a pretrained Whisper large model [16], leveraging the official implementation. Additionally, we conducted averaged speaker embedding cosine similarity (SECS) analysis to evaluate speaker similarity between speech prompts and synthesized samples, utilizing the official pre-trained WavLM large speaker verification model [17]. We randomly selected 500 utterances from the test dataset for these objective assessments.
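For illustration, the sketch below shows how these two objective metrics could be computed, assuming the openai-whisper package for transcription, jiwer for CER, and a generic speaker-embedding extractor standing in for the pre-trained WavLM speaker-verification model; it is not the exact evaluation code (text normalization, batching, etc. are omitted).

```python
# Hedged sketch of the objective metrics: CER via a Whisper transcription and
# SECS via cosine similarity of speaker embeddings. `speaker_encoder` is a
# placeholder for the pre-trained WavLM speaker-verification model.
import jiwer
import torch
import torch.nn.functional as F
import whisper

asr = whisper.load_model("large")

def character_error_rate(wav_path: str, reference_text: str) -> float:
    hypothesis = asr.transcribe(wav_path)["text"]
    return jiwer.cer(reference_text, hypothesis)

def secs(speaker_encoder, prompt_wav: torch.Tensor, synth_wav: torch.Tensor) -> float:
    # speaker_encoder: assumed callable returning a fixed-size embedding per utterance.
    e_prompt = speaker_encoder(prompt_wav)
    e_synth = speaker_encoder(synth_wav)
    return F.cosine_similarity(e_prompt, e_synth, dim=-1).item()
```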

Table 1. Results of zero-shot TTS. MOS and SMOS are reported with 95% confidence intervals


2. Results: Zero-Shot Multi-Speaker TTS

According to Table 1, our proposed model exhibited superior performance compared to the baselines across all assessments. Notably, VALLE-X demonstrated the worst CER due to misalignment issues stemming from the lack of a monotonic alignment constraint. In contrast, alignment models based on the transducer approach, including [5] and the proposed model, exhibited higher intelligibility owing to their robust alignment capabilities. Furthermore, compared to the transducer-based model [5], our proposed model showed enhanced speech quality and speaker similarity, attributed to the G-MLM used in the Speaking stage. Despite some inconsistencies in the results, it is noteworthy that our proposed architecture outperformed the baselines from all assessed perspectives.

Table 2. Comparison of SECS between generated samples and the speech prompts. p_rand denotes arbitrary samples from the test set, used as a reference value.


3. Ablation: Prosody Controllability

We investigated controllability by using different speech prompts for the Interpreting and Speaking stages. Here, p_s denotes the speech prompt for Interpreting, i.e., the semantic prompt, which controls the paralinguistic information embedded in the semantic token sequence, such as speech rate and pitch contour. Likewise, p_a denotes the acoustic prompt for Speaking, which provides the detailed acoustic attributes (i.e., timbre and acoustic conditions) of the prompt speaker. Following the approach outlined in [5], we generated samples under two conditions: (1) p_s = p_a, and (2) p_s ≠ p_a (selected from different speakers). The former represents the general zero-shot TTS scenario, while the latter assesses the ability to independently control semantic paralinguistic elements (such as speech rate and prosody) and acoustic conditions (including speaker identity, timbre, and environmental factors). This separation allows for disentangled prosody controllability. In Table 2, we report the speaker embedding cosine similarity (SECS) between the generated speech and the speech prompts p_s and p_a, comparing the proposed method with a variant whose Speaking stage is replaced by the speech generator from [5]. Across all cases, our proposed model exhibited higher scores, indicating that our Speaking approach offers superior speaker similarity in both scenarios. The results also show that speaker similarity is mostly controlled by the Speaking stage, while linguistic prosody is controlled by semantic token generation [5].

Conclusions



In this post, we introduced a two-stage text-to-speech (TTS) system designed to achieve high-fidelity speech synthesis through the use of semantic and acoustic tokens. The first stage, the Interpreting module, converts text and a speech prompt into semantic tokens, ensuring precise pronunciation and alignment. The Speaking module then uses these semantic tokens to generate acoustic tokens that capture the target voice's acoustic attributes (timbre and acoustic conditions), significantly enhancing speech reconstruction. For future work, we plan to extend the proposed framework to multiple languages and to a broader range of speech-generation tasks, including singing voice synthesis. We also plan to enlarge the training data with unlabeled speech to improve the generalization of the Speaking stage.

Link to the paper



https://arxiv.org/abs/2406.17310

Audio Samples



https://srtts.github.io/interpreting-speaking

References

[1] E. Kharitonov, D. Vincent, Z. Borsos, R. Marinier, S. Girgin, O. Pietquin, M. Sharifi, M. Tagliasacchi, and N. Zeghidour, “Speak, read and prompt: High-fidelity text-to-speech with minimal supervision,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 1703–1718, 2023.

[2] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023.

[3] Z. Borsos, M. Sharifi, D. Vincent, E. Kharitonov, N. Zeghidour, and M. Tagliasacchi, “Soundstorm: Efficient parallel audio generation,” arXiv preprint arXiv:2305.09636, 2023.

[4] M. Jeong, M. Kim, J. Y. Lee, and N. S. Kim, “Efficient parallel audio generation using group masked language modeling,” arXiv preprint arXiv:2401.01099, 2024.

[5] M. Kim, M. Jeong, B. J. Choi, S. Kim, J. Y. Lee, and N. S. Kim, “Utilizing neural transducers for two-stage text-to-speech via semantic token prediction,” arXiv preprint arXiv:2401.01498, 2024.

[6] C. Du, Y. Guo, H. Wang, Y. Yang, Z. Niu, S. Wang, H. Zhang, X. Chen, and K. Yu, “Vall-t: Decoder-only generative transducer for robust and decoding-controllable text-to-speech,” arXiv preprint arXiv:2401.14321, 2024.

[7] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.

[8] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.

[9] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.

[10] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” Transactions on Machine Learning Research, 2023.

[11] D. Yang, S. Liu, R. Huang, J. Tian, C. Weng, and Y. Zou, “Hifi-codec: Group-residual vector quantization for high fidelity audio codec,” arXiv preprint arXiv:2305.02765, 2023.

[12] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proc. Interspeech 2020, 2020, pp. 5036–5040.

[13] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “Libritts: A corpus derived from librispeech for text-to-speech,” Interspeech 2019, 2019.

[14] Z. Zhang, L. Zhou, C. Wang, S. Chen, Y. Wu, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Speak foreign languages with your own voice: Cross-lingual neural codec language modeling,” arXiv preprint arXiv:2303.03926, 2023.

[15] B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” in Proc. Interspeech 2020, 2020, pp. 3830–3834.

[16] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28 492–28 518.

[17] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” IEEE JSTSP, 2022.