
OverFlow: Putting flows on top of neural transducers for better TTS

Shivam Mehta, Ambika Kirkland, Harm Lameris, Jonas Beskow, Éva Székely, and Gustav Eje Henter


We propose a new approach, OverFlow, that addresses the shortcomings of neural HMM TTS (a type of transducer TTS) by adding normalising flows on top of it. The resulting stronger probabilistic model can describe the highly non-Gaussian distribution of speech acoustics, which yields better likelihoods and improves pronunciation and naturalness. We show that our model converges to a lower word error rate (WER) faster and achieves higher naturalness scores than comparable methods.

Find reading boring? Try listening to the summary spoken by different voices:

LJ Speech, RyanSpeech, IndicTTS (Female), IndicTTS (Male), L2 Arctic (Mandarin)

For more information, please read our paper.


Architecture of OverFlow
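
As a rough illustration of why placing a flow on top of the neural HMM gives a stronger model, the standard change-of-variables identity for normalising flows is sketched below. The symbols are notation assumed here rather than taken from this page: x is an observed acoustic frame sequence, f is the invertible flow, z = f(x) is the latent representation, and p_Z is the density that the neural HMM assigns in the latent space.

```latex
\log p_X(x) = \log p_Z\bigl(f(x)\bigr) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|
```

Because f is invertible and its Jacobian determinant is tractable, the likelihood remains exact, while the distribution over x is no longer restricted to the Gaussian form of p_Z.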


Code is available in our GitHub repository, along with pre-trained models.

It is also available in Coqui TTS under text to spectrogram models. The training recipe can be found under recipes/ljspeech/OverFlow.

To synthesise speech with the OverFlow model in Coqui TTS, run the following commands:

# Install TTS
pip install tts
# Change --text to the desired text prompt
# Change --out_path to the desired output path
tts --text "Hello world!" --model_name tts_models/en/ljspeech/overflow --vocoder_name vocoder_models/en/ljspeech/hifigan_v2 --out_path output.wav
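
To synthesise several sentences in one go, the same command can be driven from a short Python script. This is a minimal sketch, not part of the OverFlow code base: it only reuses the flags and model names from the command above, and the helper names `build_tts_command` and `synthesise_all` are hypothetical.

```python
import shlex
import subprocess

# Model and vocoder names copied from the command above.
MODEL = "tts_models/en/ljspeech/overflow"
VOCODER = "vocoder_models/en/ljspeech/hifigan_v2"


def build_tts_command(text, out_path, model=MODEL, vocoder=VOCODER):
    """Return the argument list for one call to the `tts` command-line tool."""
    return [
        "tts",
        "--text", text,
        "--model_name", model,
        "--vocoder_name", vocoder,
        "--out_path", out_path,
    ]


def synthesise_all(sentences, prefix="output"):
    """Run the CLI once per sentence, writing <prefix>_1.wav, <prefix>_2.wav, ...

    Requires that `pip install tts` has been run beforehand.
    """
    for index, sentence in enumerate(sentences, start=1):
        command = build_tts_command(sentence, f"{prefix}_{index}.wav")
        print("Running:", shlex.join(command))
        subprocess.run(command, check=True)
```

Calling `synthesise_all(["Hello world!", "A second sentence."])` would then write `output_1.wav` and `output_2.wav`. Passing the arguments as a list avoids shell quoting problems when the text contains quotes or other special characters.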

Stimuli from the listening tests

Dataset: LJ Speech
Training data duration: ~22.8 hours

Condition | System
VOC | Vocoded speech
OF | Proposed (OverFlow)
OFND | Proposed (OverFlow), no dropout
OFZT | Proposed (OverFlow), zero temperature
T2 | Tacotron 2
GTTS | Glow-TTS
NHMM | Neural HMM TTS
Sentence 1
Sentence 2
Sentence 3
Sentence 4
Sentence 5
Sentence 6
Sentence 7

Sampling at different temperatures

Dataset: LJ Speech
Training data duration: ~22.8 hours

Sampling temperature:

Sentence 1
Sentence 2
Sentence 3
Sentence 4
Sentence 5
Sentence 6

Variation in synthesis

Dataset: RyanSpeech
Training data duration: ~9 hours

Held-out utterance Realisation 1 Realisation 2 Realisation 1 Realisation 2 Realisation 1 Realisation 2

Fast finetuning to different voices

We finetuned our model from the RyanSpeech 100k checkpoint on several English datasets. The model adapted to the speaker style and accent-specific pronunciations within 5000 updates. This shows that finetuning on low-resource datasets is an effective way to adapt the model to different voices.

Dataset | Indic TTS | L2 Arctic | L2 Arctic | LibriTTS-British | SLR70 (Nigerian English)
Duration (hours) | ~6.97 | ~1.20 | ~1.08 | ~0.44 | ~0.45
Speaker / ID | Male | L1 Arabic (YBAA) | L1 Mandarin (TXHC) | 7700 | 07508
# of finetuning iterations | 5000 | 5000 | 5000 | 5000 | 5000
Harvard Sentence 001
Harvard Sentence 002
Harvard Sentence 003
Harvard Sentence 004
Harvard Sentence 005
Harvard Sentence 006

Comparison to additional TTS models

We additionally trained VITS and FastPitch models with the recipes and code provided in Coqui-TTS, using 100k updates, a single GPU, and the same batch size as for the systems in our paper. At the end of this training, VITS had learned to speak with a word error rate (WER) of 4.4%. FastPitch only started speaking after training was continued to 140k updates, and then with low intelligibility (20% WER). OverFlow, by contrast, had completed 30k updates and achieved 3.2% WER after the same wall-clock time.

Sentence | OF | VITS | FastPitch (FP) @140k
Sentence 1
Sentence 2
Sentence 3
Sentence 4
Sentence 5
Sentence 6
Sentence 7
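
The WER figures quoted in this section follow the usual definition: the word-level edit distance between a reference text and a transcript, divided by the number of reference words. The sketch below shows that computation on plain strings. It is an illustration only: the function name and the simple lowercase, whitespace-based tokenisation are assumptions made here, and it does not reproduce the speech recognition step that would produce the transcripts.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# One of six reference words is missing from the transcript.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A corpus-level WER is normally obtained by summing the edit distances over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance rates.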

Comparing GTTS to pre-trained Glow-TTS (PT-GTTS)

To compare the quality of our GTTS baseline to a reference Glow-TTS system, we have synthesised a number of utterances from a pre-trained Glow-TTS model, specifically the checkpoint available in Coqui-TTS trained on LJ Speech. That model was trained with 330k updates using a batch size of 32.

Sentence 1
Sentence 2
Sentence 3
Sentence 4
Sentence 5
Sentence 6
Sentence 7

You can synthesise additional audio from the PT-GTTS model by running the following command:

pip install tts
tts --model_name tts_models/en/ljspeech/glow-tts --out_path output.wav --text "The sentence to be synthesised"

Citation information

  title={OverFlow: Putting flows on top of neural transducers for better TTS},
  author={Mehta, Shivam and Kirkland, Ambika and Lameris, Harm and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
  booktitle={Proc. Interspeech},
