Hierarchical Timbre-Cadence Speaker Encoder for Zero-shot Speech Synthesis

By Joun Yeop Lee Samsung Research
By Jae-Sung Bae Samsung Research
By Seongkyu Mun Samsung Research
By Ji-Hyun Lee Samsung Research
By Hoon-Young Cho Samsung Research
By Chanwoo Kim Samsung Research

Introduction


In recent years, text-to-speech (TTS) has improved remarkably with the emergence of various end-to-end TTS models [1, 2, 3]. With these advanced models, TTS has expanded from systems built on the voice of a professional voice actor to personalized TTS. To personalize a TTS model, there have been several attempts to fine-tune a pre-trained model with target speaker data [5]. Even though the quantity of target speaker data required for fine-tuning is small, it is troublesome to collect such personal data and perform the fine-tuning. Zero-shot TTS (zs-TTS) [6, 7, 8, 9, 10, 11], which uses a single utterance to clone the voice of the target speaker without additional fine-tuning, resolves these inconveniences. The most common approach in zs-TTS is to condition a typical end-to-end TTS architecture on a speaker embedding.

Figure 1.  Motivation of our work and definition of speaker similarity

As perceiving speaker similarity is a complex and ambiguous problem, an insightful definition of speaker similarity should come first. We therefore assume that the speaker similarity between two speech samples can be viewed from two aspects, and we restrict the definitions of two terms to describe them. First, each speaker has a unique voice identity regardless of the textual context of an utterance; we call this property timbre. The other term is cadence, an inter-utterance variant component related to prosody, style, and variation of tone within the same speaker.

In this blog post, we introduce our work, a timbre-cadence speaker encoder (TiCa) that has a hierarchical structure to model timbre and cadence separately.

In addition, conventional zs-TTS models have shown unstable behavior, such as the timbre switching within an utterance. We therefore introduce an effective augmentation method called speaker mixing augmentation (SMAug) to make the zs-TTS model robust.

Proposed Method

1. Timbre-cadence Speaker Encoder

Figure 2.  The overall architecture of the TiCa. The dashed lines represent the components that are only used during training

To separate timbre and cadence, we propose a hierarchical speaker encoder, shown in Figure 2. Because the speaker encoder encodes gradually from local to global information, the cadence embedding is extracted first, using attention pooling at the lower part of the speaker encoder. Then, the output of the cadence attention pooling layer is subtracted to disentangle the timbre and cadence information. The timbre embedding is extracted after two further convolution blocks followed by attention pooling. Finally, the timbre and cadence embeddings are concatenated to form the speaker embedding on which the zero-shot TTS (zs-TTS) framework is conditioned.
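The data flow described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the per-frame linear map standing in for a convolution block, the single-query attention pooling, the layer sizes, and the choice to subtract the pooled cadence vector from every frame of the lower-level features are all assumptions made for the example, and names such as `tica_encode` are hypothetical.

```python
import numpy as np

def attention_pool(frames, query):
    """Pool (T, D) frames into a single (D,) vector using softmax attention weights."""
    scores = frames @ query                      # (T,)
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ frames                      # (D,)

def conv_block(frames, weight):
    """Stand-in for a convolution block (assumption): per-frame linear map + ReLU."""
    return np.maximum(frames @ weight, 0.0)

def tica_encode(mel, params):
    # Lower part of the encoder: local features, then the cadence embedding.
    lower = conv_block(mel, params["lower"])
    cadence = attention_pool(lower, params["q_cadence"])
    # Assumption: the pooled cadence vector is subtracted from every frame.
    residual = lower - cadence
    # Upper part: two more blocks followed by attention pooling give the timbre.
    upper = conv_block(conv_block(residual, params["upper1"]), params["upper2"])
    timbre = attention_pool(upper, params["q_timbre"])
    # The speaker embedding is the concatenation of timbre and cadence.
    return np.concatenate([timbre, cadence]), timbre, cadence

rng = np.random.default_rng(0)
n_mels, dim = 80, 16                             # hypothetical sizes
params = {
    "lower": rng.normal(scale=0.1, size=(n_mels, dim)),
    "upper1": rng.normal(scale=0.1, size=(dim, dim)),
    "upper2": rng.normal(scale=0.1, size=(dim, dim)),
    "q_cadence": rng.normal(size=dim),
    "q_timbre": rng.normal(size=dim),
}
mel = rng.normal(size=(120, n_mels))             # 120 frames of a reference utterance
speaker_embedding, timbre, cadence = tica_encode(mel, params)
print(speaker_embedding.shape)                   # (32,)
```

The point of the sketch is the ordering: cadence is pooled early from local features, and timbre is pooled later from features that no longer carry the pooled cadence vector.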

Furthermore, we use two supplementary losses to train the timbre-cadence speaker encoder (TiCa). First, to keep the timbre embedding consistent within the same speaker, we apply an L1 loss between the timbre embedding and a speaker ID-based speaker embedding obtained from an embedding table.

Second, we adopt VICReg [12] to regularize the cadence embeddings within a batch so that they are distinct from each other and the variables of each cadence embedding are decorrelated. With these losses, we can restrict the timbre embedding to have a smaller discrepancy within the same speaker and enlarge the diversity of the cadence embedding across different utterances.

To train TiCa stably, we use the speaker ID embedding instead of the timbre embedding in the early stage of training. Then, to reduce the mismatch between training and inference, we train the zs-TTS model conditioned on the timbre embedding while freezing the speaker ID embedding table for a certain number of iterations.
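As a rough illustration of the two supplementary losses, the sketch below computes an L1 loss against an embedding table and the variance and covariance terms of VICReg on a batch of cadence embeddings. The post does not say which VICReg terms or weights are used, so applying only the variance and covariance terms, the margin `gamma`, and the constant `eps` are assumptions here, and all function names are hypothetical.

```python
import numpy as np

def timbre_l1_loss(timbre, speaker_ids, id_table):
    """Mean absolute error between timbre embeddings (B, D) and their ID-table rows."""
    return np.abs(timbre - id_table[speaker_ids]).mean()

def vicreg_variance(z, gamma=1.0, eps=1e-4):
    """Hinge loss on the per-dimension standard deviation across the batch.

    It is large when the embeddings in a batch collapse to similar values.
    """
    std = np.sqrt(z.var(axis=0, ddof=1) + eps)
    return np.maximum(0.0, gamma - std).mean()

def vicreg_covariance(z):
    """Sum of squared off-diagonal covariance entries, divided by the dimension.

    It is large when different embedding variables are correlated.
    """
    centered = z - z.mean(axis=0)
    cov = centered.T @ centered / (z.shape[0] - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return (off_diag ** 2).sum() / z.shape[1]

# Toy check: a collapsed batch is penalized by the variance term only.
collapsed = np.ones((4, 3))
print(vicreg_variance(collapsed))    # close to 0.99 with the defaults above
print(vicreg_covariance(collapsed))  # 0.0

# Two perfectly correlated variables are penalized by the covariance term.
correlated = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
print(vicreg_covariance(correlated))  # 1.0
```

In a training loop these terms would be added to the TTS loss with weights that the post does not specify.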
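The two-stage schedule can be pictured as a simple switch on the training step. The sketch below is a hypothetical helper: the name `select_conditioning` and the threshold `warmup_steps` are assumptions, and the post gives no concrete iteration counts.

```python
def select_conditioning(step, warmup_steps, speaker_id_emb, timbre_emb):
    """Pick the timbre-side conditioning vector for the current training step.

    Returns the vector to condition on and a flag saying whether the speaker ID
    embedding table should still receive gradient updates.
    """
    if step < warmup_steps:
        # Early stage: condition on the speaker ID embedding; the table is trained.
        return speaker_id_emb, True
    # Later stage: condition on the timbre embedding; the ID table is frozen.
    return timbre_emb, False

id_emb = [0.1, 0.2]      # toy stand-ins for real embedding vectors
timbre_emb = [0.3, 0.4]
early = select_conditioning(100, 1000, id_emb, timbre_emb)
late = select_conditioning(5000, 1000, id_emb, timbre_emb)
print(early)  # ([0.1, 0.2], True)
print(late)   # ([0.3, 0.4], False)
```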

2. Speaker Mixing Augmentation

Figure 3.  Example of SMAug

In typical zs-TTS models, a single-vector speaker embedding is broadcast to the length of the phoneme sequence or the acoustic feature sequence and then used as a condition for the TTS model. However, since the speaker embedding is fixed throughout the whole utterance, this can weaken the role of the speaker embedding and result in synthesized speech with unstable speaker similarity during inference.

To overcome this problem, SMAug concatenates two short utterances from different speakers. Using SMAug, the model encounters two different speakers in one integrated utterance, which enhances the robustness of the zs-TTS model.
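A minimal sketch of the idea follows, assuming each utterance is stored as a dictionary with a phoneme list, a mel-spectrogram, and a speaker embedding. The field names, the function name `smaug`, and the frame-level conditioning that switches at the utterance boundary are assumptions for illustration; the post only states that two short utterances from different speakers are concatenated.

```python
import numpy as np

def smaug(utt_a, utt_b):
    """Concatenate two utterances from different speakers into one training sample."""
    if utt_a["speaker"] == utt_b["speaker"]:
        raise ValueError("SMAug expects utterances from two different speakers")
    # Text and acoustic features are joined along the time axis, unmodified.
    phonemes = utt_a["phonemes"] + utt_b["phonemes"]
    mel = np.concatenate([utt_a["mel"], utt_b["mel"]], axis=0)
    # Assumption: each frame is conditioned on the embedding of its own speaker,
    # so the condition changes once, at the boundary between the two utterances.
    frame_cond = np.concatenate([
        np.tile(utt_a["spk_emb"], (len(utt_a["mel"]), 1)),
        np.tile(utt_b["spk_emb"], (len(utt_b["mel"]), 1)),
    ], axis=0)
    return {"phonemes": phonemes, "mel": mel, "frame_cond": frame_cond}

rng = np.random.default_rng(0)
utt_a = {"speaker": "spk_a", "phonemes": ["HH", "AY"],
         "mel": rng.normal(size=(30, 80)), "spk_emb": np.full(4, 1.0)}
utt_b = {"speaker": "spk_b", "phonemes": ["B", "AY"],
         "mel": rng.normal(size=(20, 80)), "spk_emb": np.full(4, -1.0)}
mixed = smaug(utt_a, utt_b)
print(mixed["mel"].shape, mixed["frame_cond"].shape)  # (50, 80) (50, 4)
```

Because the audio frames are only joined and never altered, the sketch reflects the property that the augmentation leaves the individual speech samples untouched.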

Data augmentation in typical TTS is limited, since modifying the target speech can severely degrade naturalness. However, as SMAug does not perturb the speech samples, it can augment the data without harming the naturalness of the TTS model.

Furthermore, as SMAug is highly compatible with other methods, it can easily be extended to other applications.

Experiments


1. Model

To evaluate the proposed method, we performed experiments by replacing the speaker encoder module in a fixed end-to-end model, a non-attentive version of [1] (NALPCTron). The overall architecture of the acoustic model in NALPCTron is shown in Figure 4, and we employed the Bunched LPCNet [13] as the neural vocoder.

Figure 4.  The overall architecture of the NALPCTron

2. Experiment Setup

We compared the timbre-cadence speaker encoder (TiCa) with conventional reference encoder-based speaker encoders. First, we used a vanilla reference encoder [14]-based speaker encoder (REF), which consists of convolution blocks and a recurrent pooling layer. Second, we applied a Meta-StyleSpeech [15]-based speaker encoder (META), which is a self-attention-based method. Third, we adopted the speaker embedding from a pre-trained speaker verification (SV) model (EXTERN) [16]. Also, to show the effect of SMAug, we conducted experiments without SMAug (TiCa-NoAug). All these models were trained on the LibriTTS train-clean-360 set.

3. Evaluation Metrics

For the objective test, we measured the phoneme error rate (PER) of the speech samples with an automatic speech recognition (ASR) model to assess pronunciation accuracy. We employed an ASR model fine-tuned from the XLSR-53 model of wav2vec 2.0 [17]. In addition, the averaged speaker embedding cosine similarity (SECS) was computed to evaluate the speaker similarity between the ground truth (GT) samples and the synthesized samples. For SECS, we adopted the pre-trained ECAPA-TDNN model [18], one of the state-of-the-art SV models. SECS ranges from 0 to 1, and a higher score implies better speaker similarity.

For the subjective evaluations, we measured the mean opinion score (MOS) and the comparative similarity MOS (CSMOS). In the MOS test, testers rated perceptual speech quality on a scale from 1 to 5 in steps of 1, where 5 is the best. For CSMOS, testers listened to one reference speech sample and two synthesized utterances, one of which was from TiCa, and were asked to judge which utterance was more similar to the reference in terms of timbre and cadence, on a scale from -3 (the comparative model is much worse than TiCa) to 3 (the comparative model is much better than TiCa) in steps of 1.
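Both objective metrics reduce to short computations once the ASR transcripts and speaker embeddings are available. The sketch below assumes PER is the total phoneme-level edit distance divided by the total number of reference phonemes, and that SECS is the mean cosine similarity over paired GT and synthesized embeddings. The post does not spell out these details, so treat the exact definitions and the function names as assumptions.

```python
import math

def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def phoneme_error_rate(refs, hyps):
    """Total edit distance divided by the total number of reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def secs(gt_embeddings, synth_embeddings):
    """Average cosine similarity over paired GT and synthesized embeddings."""
    sims = [cosine_similarity(g, s) for g, s in zip(gt_embeddings, synth_embeddings)]
    return sum(sims) / len(sims)

# One substitution (AH -> EH) and one deletion (OW) against 4 reference phonemes.
per = phoneme_error_rate([["HH", "AH", "L", "OW"]], [["HH", "EH", "L"]])
print(per)  # 0.5

# One identical pair and one orthogonal pair.
score = secs([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
print(score)  # 0.5
```

In practice the hypothesis phonemes would come from the ASR model and the embeddings from the SV model; both are replaced by toy values here.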
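For reference, a MOS with a 95% confidence interval and a CSMOS average can be computed from raw listener ratings as sketched below. The post does not say how the intervals were obtained, so the normal approximation (z = 1.96) used here is an assumption, and the ratings are made-up toy values rather than results from the experiments.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and the half-width of its confidence interval.

    Assumption: normal approximation, half-width = z * sample std / sqrt(n).
    """
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

def csmos(scores):
    """Average comparative score on the integer scale from -3 to 3.

    A negative value means the comparative model was judged worse than TiCa.
    """
    for s in scores:
        if s not in range(-3, 4):
            raise ValueError("CSMOS scores must be integers between -3 and 3")
    return statistics.fmean(scores)

toy_mos_ratings = [4, 5, 4, 3, 4, 5, 4, 3]   # hypothetical ratings on the 1-5 scale
mean, half_width = mos_with_ci(toy_mos_ratings)
print(f"MOS = {mean:.2f} +/- {half_width:.2f}")  # MOS = 4.00 +/- 0.52

toy_csmos_scores = [-1, 0, -2, 1]            # hypothetical comparative scores
print(csmos(toy_csmos_scores))               # -0.5
```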

Table 1.  Objective and subjective experiment results on the unseen case. MOS is represented with 95% confidence intervals.

4. Evaluation on Unseen Speakers (zero-shot TTS)

As shown in Table 1, TiCa showed superior speech quality in both pronunciation accuracy (PER) and naturalness (MOS). The PER of GT was worse than the others because the GT contained some noisy speech, while the synthesized speech of the zs-TTS models was typically clean. In the MOS test, however, GT received a higher score because the testers were guided to focus on the perceptual naturalness of speech.

In terms of speaker similarity, the CSMOS results demonstrated that TiCa had better perceptual speaker similarity than the comparison TTS models. However, the overall tendency of the SECS results differed from that of the perceptual tests. One possible reason is that the human sense of speaker similarity combines timbre and cadence. In particular, EXTERN achieved the best SECS results but performed worse than TiCa and META in the CSMOS test. This suggests that EXTERN focused only on encoding the timbre of the speaker.

Comparing the objective results of TiCa and TiCa-NoAug indicates the positive effect of SMAug on pronunciation accuracy and speaker similarity.

Conclusion


In this post, we introduced a timbre-cadence speaker encoder (TiCa) as a novel technique for cloning a target speaker’s voice. This approach assumes that a speaker embedding can be viewed as a combination of timbre and cadence. To model these components, TiCa extracts timbre and cadence with a hierarchical structure and additional losses. Moreover, we introduced a simple but powerful speaker mixing augmentation (SMAug) that concatenates two utterances from different speakers for robust zero-shot TTS. The experimental results showed that our proposed methods outperform the conventional reference encoder-based speaker encoders.

Link to the paper

Audio Samples

References


[1] N. Ellinas, G. Vamvoukakis, K. Markopoulos, A. Chalamandaris, G. Maniati, P. Kakoulidis, S. Raptis, J. S. Sung, H. Park, and P. Tsiakoulis, “High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency,” in Proc. Interspeech, 2020, pp. 2022–2026.

[2] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4779–4783.

[3] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech,” in International Conference on Learning Representations (ICLR), 2021.

[4] Y. Chen, Y. M. Assael, B. Shillingford, D. Budden, S. E. Reed, H. Zen, Q. Wang, L. C. Cobo, A. Trask, B. Laurie, C. Gulcehre, A. V. D. Oord, O. Vinyals, and N. de Freitas, “Sample Efficient Adaptive Text-to-Speech,” in International Conference on Learning Representations (ICLR), 2019.

[5] M. Chen, X. Tan, B. Li, Y. Liu, T. Qin, S. Zhao, and T.-Y. Liu, “AdaSpeech: Adaptive Text to Speech for Custom Voice,” in International Conference on Learning Representations (ICLR), 2021.

[6] E. Casanova, J. Weber, C. D. Shulby, A. C. Junior, E. Golge, and M. A. Ponti, “YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone,” in International Conference on Machine Learning (ICML), 2022, pp. 2709–2720.

[7] M. Kim, M. Jeong, B. J. Choi, S. Ahn, J. Y. Lee, and N. S. Kim, “Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus,” in Proc. Interspeech, 2022, pp. 788–792.

[8] E. Casanova, C. Shulby, E. Golge, N. M. Muller, F. S. de Oliveira, A. Candido Jr., A. da Silva Soares, S. M. Aluisio, and M. A. Ponti, “SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model,” in Proc. Interspeech, 2021, pp. 3645–3649.

[9] Y. Wu, X. Tan, B. Li, L. He, S. Zhao, R. Song, T. Qin, and T.-Y. Liu, “AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios,” in Proc. Interspeech, 2022, pp. 2568–2572.

[10] J.-H. Lee, S.-H. Lee, J.-H. Kim, and S.-W. Lee, “PVAE-TTS: Adaptive Text-to-Speech via Progressive Style Adaptation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6312–6316.

[11] B. J. Choi, M. Jeong, J. Y. Lee, and N. S. Kim, “SNAC: Speaker-Normalized Affine Coupling Layer in Flow-Based Architecture for Zero-Shot Multi-Speaker Text-to-Speech,” IEEE Signal Processing Letters, vol. 29, pp. 2502–2506, 2022.

[12] A. Bardes, J. Ponce, and Y. LeCun, “VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning,” in International Conference on Learning Representations (ICLR), 2022.

[13] S. Park, K. Choo, J. Lee, A. V. Porov, K. Osipov, and J. S. Sung, “Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge,” in Proc. Interspeech, 2022, pp. 808–812.

[14] R. J. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron,” in International Conference on Machine Learning (ICML), 2018, pp. 4700–4709.

[15] D. Min, D. B. Lee, E. Yang, and S. J. Hwang, “Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation,” in International Conference on Machine Learning (ICML), 2021, pp. 7748–7759.

[16] J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han, “In Defence of Metric Learning for Speaker Recognition,” in Proc. Interspeech, 2020.

[17] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” in Advances in Neural Information Processing Systems (NeurIPS), 2020, pp. 12449–12460.

[18] B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” in Proc. Interspeech, 2020, pp. 3830–3834.