The IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is an annual flagship conference organized by the IEEE Signal Processing Society. ICASSP is the world’s largest and most comprehensive technical conference focused on signal processing and its applications, offering a technical program that presents the latest developments in research and technology and attracts thousands of professionals. In this blog series, we introduce our research papers at ICASSP 2024; here is a list of them.
#1. MELS-TTS: Multi-Emotion Multi-Lingual Multi-Speaker Text-To-Speech System via Disentangled Style Tokens (Samsung Research)
#2. Latent Filling: Latent Space Data Augmentation for Zero-Shot Speech Synthesis (Samsung Research)
#4. Enabling Device Control Planning Capabilities of Small Language Model (Samsung Research America)
In the swiftly progressing domain of neural text-to-speech (TTS) systems, the quest for creating human-like speech has witnessed remarkable strides. Recent advancements have opened avenues for TTS systems capable of not only mimicking human speech but also encapsulating the nuances of emotions and linguistic diversity. With an escalating demand for more sophisticated TTS capabilities, the pursuit of multi-emotion and multi-lingual TTS systems has gained prominence.
However, the journey towards achieving this objective is riddled with complexities. One of the primary hurdles lies in obtaining speech samples from target speakers exhibiting multiple emotions or languages, which often proves impractical in real-world scenarios. Furthermore, the entanglement of speech attributes, spanning content, speaker identity, emotion, language, and style, poses significant challenges in this endeavor.
Disentangling these intertwined attributes is pivotal for effectively transferring desired speech qualities to target speakers. Among these attributes, emotion presents a particularly daunting challenge due to its inherent complexity and variability. Preliminary experiments revealed that label-based TTS systems often stumble when encountering new combinations of labels. While reference-based TTS systems have been explored to address these challenges, ensuring the exclusive separation and extraction of emotional information from the reference speech remains difficult.
MELS-TTS, our proposed Multi-Emotion, Multi-Lingual, and Multi-Speaker TTS system, confronts these disentanglement challenges head-on. Drawing inspiration from Global Style Tokens (GST) [1], MELS-TTS introduces disentangled style tokens to represent specific speech attributes—speaker, language, emotion, and residual. This enables successful disentanglement of various speech attributes by learning their influences. Through objective and subjective evaluations, MELS-TTS has showcased its superiority over other reference-based TTS systems in multi-lingual and multi-speaker scenarios for emotion transfer.
Figure 1. The architecture of the proposed method. During training, the emotion, speaker, language, and residual token sets are employed for the key and value of the style attention. In inference, only the emotion token set is utilized for the key and value of the style attention, with dotted lines deactivated.
MELS-TTS is based on a Tacotron-variant architecture [2], as depicted in Figure 1. In the following, we introduce the emotion encoder, the pivotal component of MELS-TTS, along with its inference process.
The reference encoder processes the mel-spectrogram of the target speech into a reference embedding. We propose disentangled style tokens to represent four speech attributes: speaker, language, emotion, and residual, facilitating the learning of corresponding attributes from the reference embedding.
Emotion token sets: Based on the emotion of the target mel-spectrogram, the emotion token set T_E is selected from {T_n, T_h, T_s, T_a}, representing the neutral, happy, sad, and angry emotion token sets, which ensures balanced learning on an imbalanced emotion database. Each emotion token set comprises randomly initialized embeddings to capture diverse nuances within an emotion category.
Speaker token: We design the speaker token to aid the emotion embedding by segregating speaker information from the reference embedding. The speaker token T_S is the output embedding of the speaker look-up table.
Language token: The language token serves to separate language details from the reference embedding. The language token T_L is the output embedding of the language look-up table.
Residual token set: Beyond the speaker and language attributes, we acknowledge the presence of additional details within the non-emotion information, referred to as residual information. To accommodate this diversity, the residual token set T_R comprises randomly initialized embeddings.
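To make the token layout concrete, here is a minimal PyTorch sketch of how the four disentangled token sets above could be declared; the module name, token counts, and embedding dimension are our own illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class DisentangledStyleTokens(nn.Module):
    """Illustrative sketch of the four disentangled token sets (all sizes are assumptions)."""
    def __init__(self, num_emotions=4, tokens_per_emotion=10, num_residual_tokens=10,
                 num_speakers=150, num_languages=2, token_dim=256):
        super().__init__()
        # One randomly initialized token set per emotion (neutral, happy, sad, angry).
        self.emotion_tokens = nn.Parameter(
            torch.randn(num_emotions, tokens_per_emotion, token_dim))
        # Speaker and language tokens come from look-up tables (one embedding per ID).
        self.speaker_table = nn.Embedding(num_speakers, token_dim)
        self.language_table = nn.Embedding(num_languages, token_dim)
        # Residual tokens cover non-emotion information beyond speaker and language.
        self.residual_tokens = nn.Parameter(torch.randn(num_residual_tokens, token_dim))

    def forward(self, emotion_id, speaker_id, language_id):
        t_e = self.emotion_tokens[emotion_id]                # selected emotion token set T_E
        t_s = self.speaker_table(speaker_id).unsqueeze(0)    # speaker token T_S
        t_l = self.language_table(language_id).unsqueeze(0)  # language token T_L
        t_r = self.residual_tokens                           # residual token set T_R
        return t_e, t_s, t_l, t_r
```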
For the style attention, we employ multi-head attention [3], where the reference embedding serves as the query and a token set T_x serves as the key and value, as follows:
For training, T_x = T_A: To capture diverse speech information from the reference embedding, we employ all disentangled style tokens as the keys and values of the style attention. The full token set T_A consists of the selected target emotion token set T_E, the speaker token T_S, the language token T_L, and the residual token set T_R.
For inference, T_x = T_E: To make the output emotion-specific, we deactivate the speaker, language, and residual tokens and employ only the selected target emotion token set T_E as the key and value of the style attention.
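A minimal sketch of this style attention, with the reference embedding as the query and the token set T_x as the key and value, could look as follows; the attention wrapper, head count, and dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

token_dim, num_heads = 256, 8  # illustrative sizes
mha = nn.MultiheadAttention(embed_dim=token_dim, num_heads=num_heads, batch_first=True)

def style_attention(ref_emb, t_e, t_s, t_l, t_r, training=True):
    """ref_emb: (batch, token_dim) reference embedding, used as the query.
    t_e, t_r: (n_tokens, token_dim) token sets; t_s, t_l: (1, token_dim) single tokens."""
    if training:
        # Training: T_x = T_A, i.e. all disentangled tokens act as keys and values.
        tokens = torch.cat([t_e, t_s, t_l, t_r], dim=0)
    else:
        # Inference: T_x = T_E only; speaker, language, and residual tokens are deactivated.
        tokens = t_e
    query = ref_emb.unsqueeze(1)                               # (batch, 1, token_dim)
    kv = tokens.unsqueeze(0).expand(ref_emb.size(0), -1, -1)   # (batch, n_tokens, token_dim)
    style_emb, _ = mha(query, kv, kv)                          # attend over the token set
    return style_emb.squeeze(1)                                # (batch, token_dim) emotion embedding
```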
During inference, we employ a representative emotion embedding. First, we compute the emotion embedding of every training utterance and obtain the mean emotion embedding for each emotion. Then, we use the emotion embedding of the utterance closest to this mean as the representative emotion embedding. Empirically, utilizing the emotion embedding of an existing utterance outperformed using the mean emotion embedding directly.
With the representative emotion embedding, the reference speech becomes unnecessary during inference, allowing for stable synthesis of speech with the desired emotion. The desired representative emotion embedding is concatenated with the output of the text encoder, and the desired speaker and language IDs are processed via look-up tables and linear layers before being fed into the decoder.
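The representative-emotion-embedding selection described above can be sketched as follows; the function name, tensor layout, and Euclidean distance are assumptions made for illustration.

```python
import torch

def representative_emotion_embedding(embeddings, labels, target_emotion):
    """Pick the training utterance whose emotion embedding is closest to the
    per-emotion mean embedding.
    embeddings: (N, dim) emotion embeddings of all training utterances
    labels:     (N,) integer emotion label of each utterance"""
    emb = embeddings[labels == target_emotion]       # embeddings of the target emotion
    mean_emb = emb.mean(dim=0, keepdim=True)         # mean emotion embedding
    # Use the embedding of an existing utterance nearest to the mean rather than
    # the mean itself, which worked better empirically.
    dists = torch.cdist(emb, mean_emb).squeeze(1)    # distance of each utterance to the mean
    return emb[dists.argmin()]                       # representative emotion embedding
```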
To assess cross-lingual cross-speaker emotion transfer, we conducted experiments using a multi-lingual speech database containing both Korean and English recordings. For the Korean dataset [4], we utilized approximately 271,000 recordings from 42 native speakers with 4 emotions: neutral, happy, sad, and angry. The English dataset was drawn from the VCTK database [5], featuring 43,000 utterances from 108 speakers.
As target acoustic features, we used 22-dimensional LPCNet features, which are converted to waveforms by the Bunched LPCNet2 neural vocoder [6]. The reference encoder input was an 80-dimensional log mel-spectrogram, extracted with a window size of 1024 and a hop size of 256.
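As a reference for the feature setup, here is a minimal sketch of extracting the 80-dimensional log mel-spectrogram input of the reference encoder with a 1024-sample window and a 256-sample hop, assuming librosa; the sampling rate and the log floor are assumptions, since they are not stated above.

```python
import librosa
import numpy as np

def log_mel_spectrogram(wav_path, sr=22050):
    """80-dim log mel-spectrogram with a 1024 window and 256 hop (sr is an assumption)."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=1024, win_length=1024, hop_length=256, n_mels=80)
    return np.log(np.maximum(mel, 1e-5))  # shape: (80, num_frames)
```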
We compared three baseline systems: the label-based TTS (LB) system, the GST system [1], and the GST system with an emotion classifier (GST-C) [7].
We conducted three Mean Opinion Score (MOS) evaluations, focusing on naturalness, speaker similarity, and emotion similarity across English and Korean speech samples. The assessments were carried out by native experts and crowdsourced participants. As Tables 1 and 2 show, MELS-TTS demonstrated exceptional performance, particularly in conveying the intended emotions and maintaining naturalness across languages.
Table 1. MOS results with 95% confidence intervals (CIs) of synthesized Korean speech samples in four emotions from English source speakers.
Table 2. MOS results with 95% CIs of synthesized English speech samples in four emotions from Korean source speakers.
For the GT of the English naturalness MOS test, only the utterances with neutral emotion were evaluated.
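For readers unfamiliar with how MOS values such as those in Tables 1 and 2 are reported, the sketch below computes a mean opinion score with a 95% confidence interval from raw 1-to-5 listener ratings using a standard t-interval; it illustrates the reporting convention, not the exact evaluation pipeline used here.

```python
import numpy as np
from scipy import stats

def mos_with_ci(scores, confidence=0.95):
    """Mean opinion score with a t-distribution confidence interval."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    sem = stats.sem(scores)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2.0, len(scores) - 1)
    return mean, half_width  # reported as "mean ± half_width"

# Example: mos_with_ci([4, 5, 4, 3, 5, 4]) returns roughly (4.17, 0.79)
```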
To further validate our approach, we conducted an objective evaluation of cross-lingual emotional synthesis. Using an emotion classifier pre-trained on the same Korean emotion database, we assessed the accuracy of emotion recognition on the synthesized speech. The results affirmed that the proposed system closely approaches the performance of the ground-truth (GT) case.
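A minimal sketch of such an objective evaluation loop is given below; the classifier interface (a callable that maps a mel-spectrogram to four emotion logits) is hypothetical and stands in for the pre-trained emotion classifier.

```python
import torch

def emotion_accuracy(classifier, mels, labels, num_emotions=4):
    """Per-emotion and average emotion-recognition accuracy on synthesized speech.
    classifier: hypothetical pre-trained model, mel (1, n_mels, frames) -> (1, num_emotions)
    mels:       iterable of mel-spectrogram tensors of synthesized utterances
    labels:     intended emotion label of each utterance"""
    correct = torch.zeros(num_emotions)
    total = torch.zeros(num_emotions)
    with torch.no_grad():
        for mel, label in zip(mels, labels):
            pred = classifier(mel.unsqueeze(0)).argmax(dim=-1).item()
            total[label] += 1
            correct[label] += int(pred == label)
    per_emotion = correct / total.clamp(min=1)
    return per_emotion, per_emotion.mean()  # per-emotion accuracies and their average
```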
Additionally, an ablation study was conducted to explore the significance of the speaker, language, and residual (SLR) tokens in our approach. We observed a decrease in emotion classification accuracy when the SLR tokens were excluded during training. Visualization of the emotion embeddings further validated this finding, showing that some utterances are poorly clustered without the SLR tokens (a minimal t-SNE sketch follows Figure 2). This implies that the SLR tokens help disentangle the speech attributes, enabling the emotion embedding to focus more effectively on emotion information.
Table 3. Emotion classification accuracy [%] on synthesized Korean speech in four emotions from English source speakers. Avg. indicates the average accuracy over all emotions.
Figure 2. Emotion embedding visualization using t-SNE for the ablation study: (a) the proposed system without the speaker, language, and residual tokens, and (b) the proposed system.
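A minimal sketch of the kind of t-SNE visualization shown in Figure 2, assuming scikit-learn and matplotlib; the perplexity, initialization, and output file name are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_emotion_embeddings(embeddings, labels, out_path="emotion_tsne.png"):
    """Project emotion embeddings to 2-D with t-SNE and color them by emotion label.
    embeddings: (N, dim) array of emotion embeddings; labels: (N,) integer emotion IDs."""
    points = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
    labels = np.asarray(labels)
    for emo_id, name in enumerate(["neutral", "happy", "sad", "angry"]):
        idx = labels == emo_id
        plt.scatter(points[idx, 0], points[idx, 1], s=8, label=name)
    plt.legend()
    plt.savefig(out_path, dpi=200)
```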
In this post, we introduced MELS-TTS, a novel text-to-speech system that utilizes disentangled style tokens to capture speaker, language, emotion, and residual information. MELS-TTS learns to untangle these speech attributes, enabling cross-lingual emotion transfer in synthesized speech. Our evaluations underscore MELS-TTS's ability to generate emotionally expressive speech across languages and speakers.
[1] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, et al., “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” in Proc. Int. Conf. on Machine Learning (ICML), 2018, pp. 5180–5189.
[2] N. Ellinas, G. Vamvoukakis, K. Markopoulos, A. Chalamandaris, G. Maniati, P. Kakoulidis, et al., “High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency,” in Proc. Interspeech, 2020, pp. 2002–2006.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2017.
[4] AIHub website, “Multi-speaker multi-emotion database,” https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=466, 2022. [Online; accessed 27-October-2022].
[5] C. Veaux, J. Yamagishi, and K. MacDonald, “CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit,” 2017.
[6] S. Park, K. Choo, J. Lee, A. V. Porov, K. Osipov, and J. S. Sung, “Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge,” in Proc. Interspeech, 2022, pp. 808–812.
[7] P. Wu, Z. Ling, L. Liu, Y. Jiang, H. Wu, and L. Dai, “End-to-end emotional speech synthesis using style tokens and semisupervised training,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019, pp. 623–627.