OverFlow - Putting flows on top of neural transducers for better TTS
We propose a new approach, OverFlow, that addresses the shortcomings of neural HMM TTS (a type of transducer TTS) by adding flows on top. With this stronger probabilistic model, we can describe the highly non-Gaussian distribution of speech acoustics, obtaining better likelihoods and improvements in pronunciation and naturalness. We show that our model converges to a lower word error rate (WER) faster and achieves higher naturalness scores than comparable methods.
Find reading boring? Try listening to the summary spoken by different voices:
[Audio: the summary spoken by LJ Speech, RyanSpeech, IndicTTS (Female), IndicTTS (Male), and L2 Arctic (Mandarin) voices]
For more information, please read our paper.
Code is available in our GitHub repository, along with pre-trained models.
It is also available in Coqui TTS as a text-to-spectrogram model; the training recipe can be found under recipes/ljspeech/OverFlow.
To synthesise speech with the OverFlow model in Coqui TTS, you can use the following command:
```bash
# Install Coqui TTS
pip install TTS

# Change --text to the desired text prompt
# Change --out_path to the desired output path
tts --text "Hello world!" \
    --model_name tts_models/en/ljspeech/overflow \
    --vocoder_name vocoder_models/en/ljspeech/hifigan_v2 \
    --out_path output.wav
```
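The same model can also be driven from Python. Below is a minimal sketch using the Coqui TTS Python API (`TTS.api.TTS`); the model name is the one from the command above, and the matching default vocoder is loaded automatically when none is specified:

```python
# Synthesise speech with OverFlow through the Coqui TTS Python API.
from TTS.api import TTS

# Downloads the OverFlow checkpoint (and its default vocoder) on first use.
tts = TTS(model_name="tts_models/en/ljspeech/overflow")

# Write the synthesised waveform to disk.
tts.tts_to_file(text="Hello world!", file_path="output.wav")
```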
Dataset: LJ Speech
Training data duration: ~22.8 hours
Conditions: VOC (vocoded speech), OF (proposed OverFlow), OFND (OverFlow without dropout), OFZT (OverFlow sampled at zero temperature), T2 (Tacotron 2), GTTS (Glow-TTS), and NHMM (neural HMM TTS).

[Audio: Sentences 1–7 for each condition]
Dataset: LJ Speech
Training data duration: ~22.8 hours
Conditions: VOC, OF, OFND, GTTS, and NHMM (abbreviations as above).

[Audio: Sentences 1–6 for each condition]
Dataset: RyanSpeech
Training data duration: ~9 hours
Conditions: VOC (vocoded held-out utterance) plus OF, OFND, and OFZT with two sampled realisations each.

[Audio: one vocoded held-out utterance and two realisations per condition]
We finetuned our model from the RyanSpeech 100k checkpoint on several English datasets. The model adapted to the speaker style and accent-specific pronunciations within 5000 updates. This shows that finetuning on low-resource datasets is an effective way to adapt the model to different voices.
Dataset | Indic TTS | L2 Arctic | L2 Arctic | LibriTTS-British | SLR70 (Nigerian English) |
---|---|---|---|---|---|
Duration (hours) | ~6.97 | ~1.20 | ~1.08 | ~0.44 | ~0.45 |
Speaker / ID | Male | L1 Arabic (YBAA) | L1 Mandarin (TXHC) | 7700 | 07508 |
# of finetuning iterations | 5000 | 5000 | 5000 | 5000 | 5000 |

[Audio: Harvard Sentences 001–006 synthesised by each finetuned model]
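For reference, a fine-tuning run like the ones above can be set up with the Coqui trainer by restoring a pre-trained checkpoint. The following is a condensed sketch that follows the structure of the Coqui TTS OverFlow recipe (recipes/ljspeech/OverFlow); all paths and option values are placeholders, the published recipe sets further options (e.g. it precomputes acoustic normalisation statistics), and exact argument names can differ between Coqui TTS versions:

```python
# Hypothetical sketch: fine-tuning OverFlow on a small dataset by restoring
# a pre-trained checkpoint. Follows the pattern of the Coqui TTS recipes;
# all paths and option values below are placeholders.
import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.overflow_config import OverflowConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.overflow import Overflow
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = "finetune_output"

# The new (low-resource) dataset, assumed here to be in LJ Speech layout.
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",  # called "name" in older Coqui TTS versions
    meta_file_train="metadata.csv",
    path="data/new_voice/",
)

config = OverflowConfig(
    run_name="overflow_finetune",
    batch_size=32,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    datasets=[dataset_config],
    output_path=output_path,
)

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

model = Overflow(config, ap, tokenizer)

# restore_path loads the pre-trained weights (e.g. RyanSpeech @ 100k updates)
# so training continues from them rather than starting from scratch.
trainer = Trainer(
    TrainerArgs(restore_path="checkpoints/ryanspeech_100k.pth"),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```

Equivalently, the recipe script itself can be launched with the trainer's `--restore_path` command-line flag.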
We additionally trained VITS and FastPitch models using the recipes and code provided in Coqui-TTS, with 100k updates, a single GPU, and the same batch size as for the systems in our paper. After this training, VITS spoke with a WER of 4.4%. FastPitch only started speaking after continuing training to 140k updates, and even then with low intelligibility (20% WER); after the same wall-clock time, OverFlow had completed 30k updates and achieved a 3.2% WER.
Conditions: OF (OverFlow), VITS, and FP (FastPitch at 140k updates).

[Audio: Sentences 1–7 for each condition]
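The WER values quoted above are obtained by transcribing the synthesised audio with an ASR system and scoring the transcripts against the input text. Below is a minimal sketch of such an evaluation, using openai-whisper as the recogniser and jiwer for scoring; both are illustrative stand-ins, not necessarily the exact tools used in the paper:

```python
# Sketch: ASR-based intelligibility (WER) scoring of synthesised speech.
# whisper and jiwer are illustrative choices; the paper's exact ASR setup
# may differ.
import whisper
from jiwer import wer

# Text prompts paired with the wav files synthesised from them (placeholders).
sentences = {
    "sentence_1.wav": "The quick brown fox jumps over the lazy dog.",
    "sentence_2.wav": "She sells seashells by the seashore.",
}

model = whisper.load_model("base.en")  # small English-only ASR model

references, hypotheses = [], []
for wav_path, reference in sentences.items():
    references.append(reference)
    hypotheses.append(model.transcribe(wav_path)["text"])

# Aggregate word error rate over all utterances.
print(f"WER: {100 * wer(references, hypotheses):.1f}%")
```

In practice, references and transcripts should be normalised (case, punctuation) before scoring; jiwer provides transformations for this.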
To compare the quality of our GTTS baseline to a reference Glow-TTS system, we have synthesised a number of utterances from a pre-trained Glow-TTS model, specifically the checkpoint available in Coqui-TTS trained on LJ Speech. That model was trained with 330k updates using a batch size of 32.
Conditions: PT-GTTS (the pre-trained Glow-TTS checkpoint) and GTTS (our Glow-TTS baseline).

[Audio: Sentences 1–7 for each condition]
You can synthesise additional audio from the PT-GTTS model by running the following command:
```bash
pip install TTS
tts --model_name tts_models/en/ljspeech/glow-tts \
    --out_path output.wav \
    --text "The sentence to be synthesised"
```
If you find our work useful, please cite our paper:

```bibtex
@inproceedings{mehta2023overflow,
  title={OverFlow: Putting flows on top of neural transducers for better TTS},
  author={Mehta, Shivam and Kirkland, Ambika and Lameris, Harm and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
  booktitle={Proc. Interspeech},
  pages={4279--4283},
  doi={10.21437/Interspeech.2023-1996},
  year={2023}
}
```