OverFlow - Putting flows on top of neural transducers for better TTS
We propose a new approach, OverFlow, that addresses the shortcomings of neural HMM TTS (a type of transducer TTS) by adding flows on top. With this stronger probabilistic model, we can describe the highly non-Gaussian distribution of speech acoustics, obtaining better likelihoods and improvements in pronunciation and naturalness. We show that our model converges to a lower word error rate (WER) faster and achieves higher naturalness scores than comparable methods.
Find reading boring? Try listening to the summary spoken by different voices:
[Audio: the summary spoken by LJ Speech, RyanSpeech, IndicTTS (Female), IndicTTS (Male), and L2 Arctic (Mandarin) voices]
For more information, please read our paper.
Code is available in our GitHub repository, along with pre-trained models.
It is also available in Coqui TTS as a text-to-spectrogram model; the training recipe can be found under recipes/ljspeech/OverFlow.
To synthesise speech with the OverFlow model in Coqui TTS, you can use the following command:
```bash
# Install Coqui TTS
pip install TTS

# Change --text to the desired text prompt
# Change --out_path to the desired output path
tts --text "Hello world!" \
    --model_name tts_models/en/ljspeech/overflow \
    --vocoder_name vocoder_models/en/ljspeech/hifigan_v2 \
    --out_path output.wav
```
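The same model can also be driven from Python. Below is a minimal sketch using the Coqui TTS Python API (`TTS.api.TTS`); the model name is the one from the command above, and the matching default vocoder is loaded automatically when none is specified:

```python
# Synthesise speech with OverFlow through the Coqui TTS Python API.
from TTS.api import TTS

# Downloads the OverFlow checkpoint (and its default vocoder) on first use.
tts = TTS(model_name="tts_models/en/ljspeech/overflow")

# Write the synthesised waveform to disk.
tts.tts_to_file(text="Hello world!", file_path="output.wav")
```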
Dataset: LJ Speech
Training data duration: ~22.8 hours
Conditions: VOC (vocoded speech), OF (proposed OverFlow), OFND (OverFlow without dropout), OFZT (OverFlow sampled at zero temperature), T2 (Tacotron 2), GTTS (Glow-TTS), and NHMM (neural HMM TTS).

[Audio: Sentences 1–7 for each condition]
Dataset: LJ Speech
Training data duration: ~22.8 hours
Conditions: VOC, OF, OFND, GTTS, and NHMM (abbreviations as above).

[Audio: Sentences 1–6 for each condition]
Dataset: RyanSpeech
Training data duration: ~9 hours
Conditions: VOC (vocoded held-out utterance) plus OF, OFND, and OFZT with two sampled realisations each.

[Audio: one vocoded held-out utterance and two realisations per condition]
We finetuned our model from the RyanSpeech 100k checkpoint on several English datasets. The model adapted to the speaker style and accent-specific pronunciations within 5000 updates. This shows that finetuning on low-resource datasets is an effective way to adapt the model to different voices.
Dataset | Indic TTS | L2 Arctic | L2 Arctic | LibriTTS-British | SLR70 (Nigerian English) |
---|---|---|---|---|---|
Duration (hours) | ~6.97 | ~1.20 | ~1.08 | ~0.44 | ~0.45 |
Speaker / ID | Male | L1 Arabic (YBAA) | L1 Mandarin (TXHC) | 7700 | 07508 |
# of finetuning iterations | 5000 | 5000 | 5000 | 5000 | 5000 |

[Audio: Harvard Sentences 001–006 synthesised by each finetuned model]
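For reference, a fine-tuning run like the ones above can be set up with the Coqui trainer by restoring a pre-trained checkpoint. The following is a condensed sketch that follows the structure of the Coqui TTS OverFlow recipe (recipes/ljspeech/OverFlow); all paths and option values are placeholders, the published recipe sets further options (e.g. it precomputes acoustic normalisation statistics), and exact argument names can differ between Coqui TTS versions:

```python
# Hypothetical sketch: fine-tuning OverFlow on a small dataset by restoring
# a pre-trained checkpoint. Follows the pattern of the Coqui TTS recipes;
# all paths and option values below are placeholders.
import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.overflow_config import OverflowConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.overflow import Overflow
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = "finetune_output"

# The new (low-resource) dataset, assumed here to be in LJ Speech layout.
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",  # called "name" in older Coqui TTS versions
    meta_file_train="metadata.csv",
    path="data/new_voice/",
)

config = OverflowConfig(
    run_name="overflow_finetune",
    batch_size=32,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    datasets=[dataset_config],
    output_path=output_path,
)

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

model = Overflow(config, ap, tokenizer)

# restore_path loads the pre-trained weights (e.g. RyanSpeech @ 100k updates)
# so training continues from them rather than starting from scratch.
trainer = Trainer(
    TrainerArgs(restore_path="checkpoints/ryanspeech_100k.pth"),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```

Equivalently, the recipe script itself can be launched with the trainer's `--restore_path` command-line flag.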
We additionally trained VITS and FastPitch models using the recipes and code provided in Coqui-TTS, with 100k updates, a single GPU, and the same batch size as for the systems in our paper. After this training, VITS spoke with a WER of 4.4%. FastPitch only started speaking after continuing training to 140k updates, and even then with low intelligibility (20% WER); after the same wall-clock time, OverFlow had completed 30k updates and achieved a 3.2% WER.
Conditions: OF (OverFlow), VITS, and FP (FastPitch at 140k updates).

[Audio: Sentences 1–7 for each condition]
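The WER values quoted above are obtained by transcribing the synthesised audio with an ASR system and scoring the transcripts against the input text. Below is a minimal sketch of such an evaluation, using openai-whisper as the recogniser and jiwer for scoring; both are illustrative stand-ins, not necessarily the exact tools used in the paper:

```python
# Sketch: ASR-based intelligibility (WER) scoring of synthesised speech.
# whisper and jiwer are illustrative choices; the paper's exact ASR setup
# may differ.
import whisper
from jiwer import wer

# Text prompts paired with the wav files synthesised from them (placeholders).
sentences = {
    "sentence_1.wav": "The quick brown fox jumps over the lazy dog.",
    "sentence_2.wav": "She sells seashells by the seashore.",
}

model = whisper.load_model("base.en")  # small English-only ASR model

references, hypotheses = [], []
for wav_path, reference in sentences.items():
    references.append(reference)
    hypotheses.append(model.transcribe(wav_path)["text"])

# Aggregate word error rate over all utterances.
print(f"WER: {100 * wer(references, hypotheses):.1f}%")
```

In practice, references and transcripts should be normalised (case, punctuation) before scoring; jiwer provides transformations for this.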
To compare the quality of our GTTS baseline to a reference Glow-TTS system, we have synthesised a number of utterances from a pre-trained Glow-TTS model, specifically the checkpoint available in Coqui-TTS trained on LJ Speech. That model was trained with 330k updates using a batch size of 32.
Conditions: PT-GTTS (the pre-trained Glow-TTS checkpoint) and GTTS (our Glow-TTS baseline).

[Audio: Sentences 1–7 for each condition]
You can synthesise additional audio from the PT-GTTS model by running the following command:
```bash
pip install TTS
tts --model_name tts_models/en/ljspeech/glow-tts \
    --out_path output.wav \
    --text "The sentence to be synthesised"
```
If you find our work useful, please cite our paper:

```bibtex
@inproceedings{mehta2023overflow,
  title={OverFlow: Putting flows on top of neural transducers for better TTS},
  author={Mehta, Shivam and Kirkland, Ambika and Lameris, Harm and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
  booktitle={Proc. Interspeech},
  pages={4279--4283},
  doi={10.21437/Interspeech.2023-1996},
  year={2023}
}
```