
Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis

By Jae-Sung Bae, Samsung Research
By Joun Yeop Lee, Samsung Research
By Ji-Hyun Lee, Samsung Research
By Seongkyu Mun, Samsung Research
By Taehwa Kang, Samsung Research
By Hoon-Young Cho, Samsung Research

The IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is an annual flagship conference organized by the IEEE Signal Processing Society.

ICASSP is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. It offers a comprehensive technical program presenting the latest developments in research and technology across the field, attracting thousands of professionals.

In this blog series, we introduce our research papers presented at ICASSP 2024. Here is the list.

#1. MELS-TTS: Multi-Emotion Multi-Lingual Multi-Speaker Text-To-Speech System via Disentangled Style Tokens (Samsung Research)

#2. Latent Filling: Latent Space Data Augmentation for Zero-Shot Speech Synthesis (Samsung Research)

#3. FSPEN: An Ultra-Lightweight Network for Real Time Speech Enhancement (Samsung R&D Institute China-Beijing)

#4. Enabling Device Control Planning Capabilities of Small Language Model (Samsung Research America)

#5. Dynamic Video Frame Interpolator with Integrated Difficulty Pre-Assessment (Samsung R&D Institute China-Nanjing)

#6. Object-Conditioned Bag of Instances for Few-Shot Personalized Instance Recognition (Samsung R&D Institute United Kingdom)

#7. Robust Speaker Personalisation Using Generalized Low-Rank Adaptation for Automatic Speech Recognition (Samsung R&D Institute India-Bangalore)


Introduction

Recently, personalized AI systems have gained significant attention. In the text-to-speech (TTS) field, zero-shot TTS (ZS-TTS) systems [1-7] enable users to create their own TTS systems that replicate their voices from just one utterance, without further training. To achieve high speaker similarity across various scenarios, ZS-TTS systems require substantial training data encompassing a diverse set of speakers.

Recent TTS systems have utilized crowd-sourced speech data [8-9] or employed data augmentation techniques such as pitch shifting [10] and synthesizing new speech with voice conversion or TTS systems [10-12]. Nevertheless, these data sources can raise privacy concerns and often contain speech with ambiguous pronunciation, background noise, channel artifacts, and artificial distortions, which degrade the overall performance of the TTS system.

In this post, we introduce a novel method called latent filling (LF) to address these challenges. Rather than directly augmenting the input speech data, LF augments data in the latent space of speaker embeddings through latent space data augmentation [13-16]. LF can be easily applied to existing ZS-TTS systems with minimal modifications and improves the speaker similarity of the ZS-TTS system without any degradation in intelligibility or naturalness.

Proposed Methods

Baseline ZS-TTS system

Figure 1. Architecture of the baseline ZS-TTS system with our proposed latent filling method.

Our baseline ZS-TTS system shares a similar architecture with our previous work [7]. It is designed for low-resource and cross-lingual speech generation in a streaming scenario. The overall architecture is illustrated in Fig. 1. For the detailed structure, please refer to our paper.

Latent Filling

Figure 2. Illustration of LF with (a) interpolation and (b) noise addition. The red circle indicates an augmented speaker embedding, while the circles in various colors represent speaker embeddings of different speakers.

Through the latent filling (LF) method, we aim to fill the regions of the speaker embedding space that the training dataset cannot adequately cover. We employ two intuitive latent space augmentation techniques for LF: interpolation [13, 14] and noise addition [13]. Illustrations of these methods are provided in Fig. 2. The interpolation method creates a completely new speaker embedding by combining two different speaker embeddings, while the noise addition method generates a new speaker embedding that is relatively close to an existing one. Please refer to our paper for the complete LF process.
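To make the two augmentations concrete, here is a minimal PyTorch sketch. The function names, the sampling of the mixing coefficient, and the noise scale are our own illustrative assumptions, not the exact settings from the paper.

```python
# Illustrative sketch of the two latent-space augmentations used by LF
# (interpolation and noise addition) applied to speaker embeddings.
import torch

def interpolate_embeddings(s_a: torch.Tensor, s_b: torch.Tensor) -> torch.Tensor:
    """Create a new speaker embedding between two different speakers' embeddings (Fig. 2a)."""
    lam = torch.rand(1, device=s_a.device)      # mixing coefficient in [0, 1]
    return lam * s_a + (1.0 - lam) * s_b

def add_noise(s: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    """Create a new speaker embedding close to an existing one by adding Gaussian noise (Fig. 2b)."""
    return s + sigma * torch.randn_like(s)

# Example: s_a, s_b are (batch, dim) speaker embeddings from the speaker encoder.
s_a, s_b = torch.randn(4, 256), torch.randn(4, 256)
s_tilde_interp = interpolate_embeddings(s_a, s_b)   # "far" augmentation
s_tilde_noise = add_noise(s_a)                      # "near" augmentation
```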

Latent filling consistency loss

Adopting latent space data augmentation for generation tasks has been challenging due to the inherent difficulty of obtaining target data corresponding to augmented latent vectors. Similarly, when the LF method is applied to the speaker embedding, it is impossible to calculate reconstruction losses for the predicted acoustic and duration features, because there is no ground-truth speech containing the speaker information of the augmented speaker embedding s̃.

To address this challenge, we propose the latent filling consistency loss (LFCL), a modified version of the speaker consistency loss (SCL) [1, 17] tailored to augmented speaker embeddings. The LFCL encourages the speaker embedding of the acoustic feature generated from s̃ to be close to s̃. With the LFCL, we can update the ZS-TTS system with the augmented speaker embedding without the need for reconstruction losses. This allows the ZS-TTS system to be trained with speaker embeddings not contained in the training dataset.
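As a rough illustration, the LFCL can be written as a cosine-similarity penalty between s̃ and the speaker embedding re-extracted from the acoustic features generated with s̃. The sketch below is our assumption of how such a loss could look, including the (1 − cosine similarity) formulation and the stand-in speaker encoder; it is not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def latent_filling_consistency_loss(acoustic_pred: torch.Tensor,
                                    s_tilde: torch.Tensor,
                                    speaker_encoder) -> torch.Tensor:
    """Pull the speaker embedding of the generated acoustic feature toward s_tilde.

    acoustic_pred: acoustic features generated by conditioning the TTS model on s_tilde.
    speaker_encoder: module mapping acoustic features to speaker embeddings.
    """
    s_pred = speaker_encoder(acoustic_pred)    # speaker embedding of the generated speech
    # Maximizing cosine similarity == minimizing (1 - cosine similarity)
    return (1.0 - F.cosine_similarity(s_pred, s_tilde, dim=-1)).mean()

# Example with a stand-in speaker encoder (placeholder for the real one):
speaker_encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(80 * 100, 256))
acoustic_pred = torch.randn(4, 80, 100)        # (batch, mel bins, frames); shapes are illustrative
s_tilde = torch.randn(4, 256)
loss = latent_filling_consistency_loss(acoustic_pred, s_tilde, speaker_encoder)
```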

Training procedure

The training procedure for the TTS system with LF and LFCL is as follows. At each training iteration, we make a random decision on whether to perform LF, governed by a probability parameter τ ranging from 0 to 1. When LF is performed, the entire TTS system is updated only with the LFCL. Conversely, when LF is not applied, the TTS system is updated using the reconstruction losses and the SCL. By randomly incorporating LF during training, we seamlessly integrate it into existing TTS systems without requiring additional training stages or degrading performance.
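The following sketch shows how one such training iteration could look, reusing the helper functions from the sketches above. The 50/50 choice between interpolation and noise addition, the L1 reconstruction loss, the equal loss weighting, and the omission of the duration loss are illustrative assumptions, not the paper's exact recipe.

```python
import random
import torch.nn.functional as F

def training_step(batch, model, speaker_encoder, optimizer, tau=0.5):
    """One training iteration following the LF procedure described above (illustrative)."""
    optimizer.zero_grad()
    s = speaker_encoder(batch["reference_mel"])   # speaker embedding of the reference speech

    if random.random() < tau:
        # LF branch: condition on an augmented embedding and update with the LFCL only,
        # since no ground-truth speech exists for the augmented speaker.
        s_tilde = add_noise(s) if random.random() < 0.5 else \
                  interpolate_embeddings(s, s.roll(1, dims=0))  # pair with another speaker in the batch
        acoustic_pred = model(batch["text"], s_tilde)
        loss = latent_filling_consistency_loss(acoustic_pred, s_tilde, speaker_encoder)
    else:
        # Standard branch: reconstruction loss plus the speaker consistency loss (SCL).
        acoustic_pred = model(batch["text"], s)
        loss = F.l1_loss(acoustic_pred, batch["target_mel"]) \
            + latent_filling_consistency_loss(acoustic_pred, s, speaker_encoder)  # SCL w.r.t. s

    loss.backward()
    optimizer.step()
    return loss.item()
```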

Experiments

We used open English and Korean speech-text paired datasets to train the proposed TTS system. Please refer to our paper for detailed information about these datasets.

Comparison systems

For comparison, we used the following systems. GT and GT-re denote the ground-truth speech samples and the re-synthesized version of GT through the vocoder, respectively. SC-GlowTTS [3] and YourTTS [1] are open-sourced ZS-TTS systems. We compared the proposed LF method (Baseline+LF) with input-level data augmentation methods. The Baseline+CS system additionally utilized the LibriLight [18] dataset. We also built the Baseline+PS system, which used pitch shifting to increase the amount and diversity of the training data: we applied a pitch shift to each utterance in the training dataset, doubling the amount of training data.
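For reference, input-level pitch-shift augmentation of the kind used for the Baseline+PS system can be implemented with librosa, as sketched below; the random shift range is an arbitrary illustration, not the setting used in our experiments.

```python
import random
import librosa
import soundfile as sf

def pitch_shift_file(in_path: str, out_path: str, max_semitones: float = 2.0) -> None:
    """Write a pitch-shifted copy of an utterance (illustrative Baseline+PS-style augmentation)."""
    y, sr = librosa.load(in_path, sr=None)                       # keep the original sampling rate
    n_steps = random.uniform(-max_semitones, max_semitones)      # random shift in semitones
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    sf.write(out_path, y_shifted, sr)
```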

Evaluation Metrics

For the objective test, we evaluated the speaker similarity and the intelligibility of each system using the averaged speaker embedding cosine similarity (SECS) and the word error rate (WER), respectively. The SECS ranges from -1 to 1, and a higher score implies better speaker similarity. External speaker verification and automatic speech recognition (ASR) models were used to compute the SECS and WER, respectively.
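As a concrete example, SECS for one utterance pair is simply the cosine similarity of the two speaker embeddings, averaged over the test set. The sketch below uses dummy embeddings in place of those produced by the external speaker verification model.

```python
import torch
import torch.nn.functional as F

def secs(gen_emb: torch.Tensor, ref_emb: torch.Tensor) -> float:
    """Speaker embedding cosine similarity (SECS) for one generated/reference pair.

    Embeddings are assumed to come from an external speaker verification model.
    Returns a value in [-1, 1]; higher implies better speaker similarity.
    """
    return F.cosine_similarity(gen_emb.unsqueeze(0), ref_emb.unsqueeze(0)).item()

# Averaged SECS over a test set of (generated, reference) embedding pairs:
pairs = [(torch.randn(256), torch.randn(256)) for _ in range(10)]   # dummy embeddings
secs_avg = sum(secs(g, r) for g, r in pairs) / len(pairs)
```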

For the subjective evaluation, we conducted two mean opinion score (MOS) tests: a MOS test on the overall naturalness of the speech, and a speaker similarity MOS (SMOS) test focused on the speaker similarity between the generated speech and the reference speech. Participants in both tests were asked to rate the speech from 1 to 5 in increments of 0.5, where 5 is the best.

Results

Table 1. Objective and subjective zero-shot results for the intra-lingual test, which generated English speech from English reference speech (En→En), and the cross-lingual test, which generated English speech from Korean reference speech (Ko→En). NP indicates the number of parameters. The MOS and SMOS are reported with 95% confidence intervals. Note that the speech samples of the GT and GT-re systems in the cross-lingual test were in Korean.

The objective and subjective results are summarized in Table 1. The proposed Baseline+LF system exhibited outstanding performance in terms of SECS and SMOS metrics for both intra-lingual (En→En) and cross-lingual (Ko→En) tests. Specifically, it achieved the highest SMOS and SECS scores in the intra-lingual test, and the best SMOS score along with the second-best SECS score in the cross-lingual test. Compared to the YourTTS system, it achieved 0.55 and 0.46 higher SMOS scores in the intra- and cross-lingual tests, respectively, while employing approximately 60% fewer parameters.

When compared to the baseline system, our proposed system demonstrated SMOS improvements of 0.14 and 0.12 in the intra-lingual and cross-lingual tests, respectively, as well as SECS improvements of 0.009 for both tests.

The input-level data augmentation approaches (Baseline+CS and Baseline+PS) showed mixed results: although the SMOS score improved in the intra-lingual test, all other evaluation metrics deteriorated or improved only slightly compared to the baseline system. We attribute this decline to the low quality of the augmented speech. Conversely, by performing augmentation in the latent space with the LF method, our proposed approach improved speaker similarity without degrading speech quality, in contrast to the input-level data augmentation approaches.

Conclusion

In this work, we proposed the latent filling (LF) method, which applies latent space data augmentation to the speaker embedding space of a ZS-TTS system. By introducing the latent filling consistency loss, we seamlessly integrated LF into an existing ZS-TTS framework. Unlike previous data augmentation methods applied to input speech, our LF method improves speaker similarity without compromising the naturalness and intelligibility of the generated speech.

Link to the paper and demo page

Paper: https://ieeexplore.ieee.org/abstract/document/10446098
Demo page: https://srtts.github.io/latent-filling/

References

[1] E. Casanova, J. Weber, C. D. Shulby, A. C. Junior, E. Gölge, and M. A. Ponti, “YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone,” in Proc. Int. Conf. on Machine Learning (ICML), 2022.

[2] M. Kim, M. Jeong, B. J. Choi, S. Ahn, J. Y. Lee, and N. S. Kim, “Transfer learning framework for low-resource text-to-speech using a large-scale unlabeled speech corpus,” in Proc. Interspeech, 2022.

[3] E. Casanova, C. D. Shulby, E. Gölge, N. M. Müller, F. S. de Oliveira, A. C. Júnior, et al., “SC-GlowTTS: An efficient zero-shot multi-speaker text-to-speech model,” in Proc. Interspeech, 2021.

[4] Y. Wu, X. Tan, B. Li, L. He, S. Zhao, R. Song, T. Qin, and T.-Y. Liu, “AdaSpeech 4: Adaptive text to speech in zero-shot scenarios,” in Proc. Interspeech, 2022.

[5] J.-H. Lee, S.-H. Lee, J.-H. Kim, and S.-W. Lee, “PVAE-TTS: Adaptive text-to-speech via progressive style adaptation,” in Proc. ICASSP, 2022.

[6] B. J. Choi, M. Jeong, J. Y. Lee, and N. S. Kim, “SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech,” IEEE Signal Processing Letters, vol. 29, pp. 2502–2506, 2022.

[7] J. Y. Lee, J.-S. Bae, S. Mun, J. Lee, J.-H. Lee, H.-Y. Cho, et al., “Hierarchical timbre-cadence speaker encoder for zero-shot speech synthesis,” in Proc. Interspeech, 2023.

[8] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, et al., “Neural codec language models are zero-shot text to speech synthesizers,” CoRR, vol. abs/2301.02111, 2023.

[9] E. Kharitonov, D. Vincent, Z. Borsos, R. Marinier, S. Girgin, O. Pietquin, et al., “Speak, read and prompt: High-fidelity text-to-speech with minimal supervision,” CoRR, vol. abs/2302.03540, 2023.

[10] R. Terashima, R. Yamamoto, E. Song, Y. Shirahata, H.-W. Yoon, J.-M. Kim, and K. Tachibana, “Cross-speaker emotion transfer for low-resource text-to-speech using non-parallel voice conversion with pitch-shift data augmentation,” in Proc. Interspeech, 2022.

[11] G. Huybrechts, T. Merritt, G. Comini, B. Perz, R. Shah, and J. Lorenzo-Trueba, “Low-resource expressive text-to-speech using data augmentation,” in Proc. ICASSP, 2021.

[12] E. Song, R. Yamamoto, O. Kwon, C.-H. Song, M.-J. Hwang, S. Oh, et al., “TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking support vector machine with variational autoencoder,” in Proc. Interspeech, 2022.

[13] T. DeVries and G. W. Taylor, “Dataset augmentation in feature space,” in Proc. Int. Conf. on Learning Representations (ICLR), 2017.

[14] H. Zhang, M. Cissé, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” in Proc. Int. Conf. on Learning Representations (ICLR), 2018.

[15] T.-H. Cheung and D.-Y. Yeung, “MODALS: Modality-agnostic automated data augmentation in the latent space,” in Proc. Int. Conf. on Learning Representations (ICLR), 2021.

[16] A. Falcon, G. Serra, and O. Lanz, “A feature-space multimodal data augmentation technique for text-video retrieval,” in Proc. ACM Int. Conf. on Multimedia, 2022.

[17] D. Xin, Y. Saito, S. Takamichi, T. Koriyama, and H. Saruwatari, “Cross-lingual speaker adaptation using domain adaptation and speaker consistency loss for text-to-speech synthesis,” in Proc. Interspeech, 2021.

[18] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P.-E. Mazaré, et al., “Libri-light: A benchmark for ASR with limited or no supervision,” in Proc. ICASSP, 2020.