Neural HMMs are all you need (for high-quality attention-free TTS)
This project is maintained by shivammehta25
We show that classic, HMM-based speech synthesis and modern, neural text-to-speech (TTS) can be combined to obtain the best of both worlds. Concretely, our proposal amounts to replacing conventional attention in neural TTS with so-called neural HMMs. We call this new approach “neural HMM TTS”.
To validate our proposal, we describe a modified version of Tacotron 2 that uses neural HMMs instead of attention. The resulting system:

- needs fewer parameters than the Tacotron 2 baseline,
- learns to speak and align in fewer training updates,
- is fully probabilistic, and
- allows easy control of the speaking rate.

To our knowledge, this is the first time HMM-based speech synthesis has achieved a speech quality on par with neural TTS.
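To make the idea concrete, here is a minimal, hypothetical sketch of synthesis from a left-to-right neural HMM. None of the names below come from the released code, and `decoder` is a stand-in for the actual neural network; the point is the control flow that replaces the attention loop: one (or two) learned state vectors per phone, an autoregressive decoder that emits Gaussian parameters plus a transition probability per frame, and either deterministic or sampled choices for acoustics and durations.

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder(state_vec, prev_frame):
    """Stand-in for the neural decoder network (assumption, not real code).

    The real model maps (state vector, previous acoustic frame) to Gaussian
    emission parameters and a transition probability; here we fake all three
    so the control flow can run on its own.
    """
    mean = 0.1 * state_vec + 0.9 * prev_frame
    std = np.full_like(mean, 0.1)
    trans_prob = 0.3  # probability of advancing to the next state
    return mean, std, trans_prob

def synthesise(states, max_frames=200, tau=0.25,
               sample_acoustics=False, sample_durations=False):
    """Autoregressive generation from a left-to-right neural HMM.

    states: one or two learned vectors per phone, in order.
    tau: quantile threshold for deterministic durations -- the state is
         left as soon as the transition probability exceeds tau.
    """
    frames, s = [], 0
    x = np.zeros_like(states[0])  # initial "previous frame"
    for _ in range(max_frames):
        mean, std, p = decoder(states[s], x)
        # Acoustics: take the mean (deterministic) or sample the Gaussian.
        x = rng.normal(mean, std) if sample_acoustics else mean
        frames.append(x)
        # Durations: sample the transition, or threshold it deterministically.
        advance = (rng.random() < p) if sample_durations else (p > tau)
        if advance:
            s += 1
            if s == len(states):
                break  # left the final state: utterance finished
    return np.stack(frames)
```

With both flags off, generation is deterministic and monotonic by construction (no attention failures are possible); switching the flags on corresponds to sampling durations and acoustics, as demonstrated in the listening examples on this page.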
For more information, please read our ICASSP 2022 paper. Code and a pre-trained model are available in our GitHub repository.
| Type | Proposed neural HMM TTS | | Tacotron 2 baseline | |
| --- | --- | --- | --- | --- |
| Condition | 2 states per phone (NH2) | 1 state per phone (NH1) | w/o post-net (T2-P) | w/ post-net (T2+P) |
| Sentence 1 | (audio) | (audio) | (audio) | (audio) |
| Sentence 2 | (audio) | (audio) | (audio) | (audio) |
| Sentence 3 | (audio) | (audio) | (audio) | (audio) |
| Sentence 4 | (audio) | (audio) | (audio) | (audio) |
| Sentence 5 | (audio) | (audio) | (audio) | (audio) |
| Sentence 6 | (audio) | (audio) | (audio) | (audio) |
| Model | Utterance at different training iterations |
| --- | --- |
| NH2 | (audio) |
| T2-P | (audio) |
| NH2 (500 utterances) | (audio) |
| T2-P (500 utterances) | (audio) |
| | Neural HMM (NH2) | Tacotron 2 (T2-P) | Pre-trained NVIDIA Tacotron 2 |
| --- | --- | --- | --- |
| Example 1 | (audio) | (audio) | (audio) |
| Example 2 | (audio) | (audio) | (audio) |
| Example 3 | (audio) | (audio) | (audio) |
| Example 4 | (audio) | (audio) | (audio) |
| | Neural HMM TTS (model NH2) | | | |
| --- | --- | --- | --- | --- |
| Durations | Quantile | Sampled | Quantile | Sampled |
| Acoustics | Mean | Mean | Sampled | Sampled |
| Sentence 1 | (audio) | (audio) | (audio) | (audio) |
| Sentence 2 | (audio) | (audio) | (audio) | (audio) |
| Sentence 3 | (audio) | (audio) | (audio) | (audio) |
| Sentence 4 | (audio) | (audio) | (audio) | (audio) |
| Sentence 5 | (audio) | (audio) | (audio) | (audio) |
| Sentence 6 | (audio) | (audio) | (audio) | (audio) |
| | Neural HMM TTS (model NH2) | | | |
| --- | --- | --- | --- | --- |
| Pre-net dropout? | ❌ | ✅ | ✅ | ✅ |
| Output | Same every time | Example 1 | Example 2 | Example 3 |
| Sentence 1 | (audio) | (audio) | (audio) | (audio) |
| Sentence 2 | (audio) | (audio) | (audio) | (audio) |
| Sentence 3 | (audio) | (audio) | (audio) | (audio) |
| Sentence 4 | (audio) | (audio) | (audio) | (audio) |
| Sentence 5 | (audio) | (audio) | (audio) | (audio) |
| Sentence 6 | (audio) | (audio) | (audio) | (audio) |
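The variation between examples comes from keeping the dropout in Tacotron 2's pre-net switched on at synthesis time, as Tacotron 2 itself does. A minimal illustration, assuming a standard ReLU-plus-inverted-dropout layer (the function below is a hypothetical sketch, not the released code):

```python
import numpy as np

rng = np.random.default_rng()

def prenet_layer(x, dropout=True, p=0.5):
    """One pre-net layer: ReLU followed by (optional) inverted dropout.

    Leaving dropout active at synthesis time makes every generation pass
    slightly different; disabling it makes the output identical each call.
    """
    h = np.maximum(x, 0.0)          # ReLU
    if dropout:
        mask = rng.random(h.shape) >= p
        h = h * mask / (1.0 - p)    # inverted-dropout scaling
    return h
```

Disabling dropout gives the "Same every time" column; each pass with dropout enabled yields a different example.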
The following examples demonstrate control over the speaking rate (adjustable via a slider on the original demo page).
| Sentence | Neural HMM TTS (model NH2) |
| --- | --- |
| Sentence 1 | (audio) |
| Sentence 2 | (audio) |
| Sentence 3 | (audio) |
| Sentence 4 | (audio) |
| Sentence 5 | (audio) |
| Sentence 6 | (audio) |
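Speaking-rate control falls out of the duration model. Under the simplifying assumption of a constant per-frame transition probability `p`, a state's duration is geometrically distributed, and quantile-based duration generation picks the smallest duration whose CDF reaches the chosen quantile; raising `p` then shortens every state and speeds up the speech. A hypothetical worked example (not the released code):

```python
import math

def quantile_duration(p, tau=0.5):
    """Smallest d with P(duration <= d) >= tau for a Geometric(p) duration,
    where P(duration <= d) = 1 - (1 - p)**d."""
    return math.ceil(math.log(1.0 - tau) / math.log(1.0 - p))

# Raising the per-frame transition probability shortens the generated
# durations, which is one simple handle on the speaking rate:
print(quantile_duration(0.3))  # median duration at the base probability -> 2
print(quantile_duration(0.6))  # doubled transition probability -> 1
```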
These audio samples use a stronger vocoder (HiFi-GAN, version LJ_FT_T2_V1) and also include a hybrid condition, NH2+P, that demonstrates the effect of applying the Tacotron 2 post-net to the output of model NH2.
| Type | Proposed neural HMM TTS | | Tacotron 2 baseline | |
| --- | --- | --- | --- | --- |
| Condition | 2 states per phone (NH2) | NH2 with Tacotron 2's post-net (NH2+P) | w/o post-net (T2-P) | w/ post-net (T2+P) |
| Sentence 1 | (audio) | (audio) | (audio) | (audio) |
| Sentence 2 | (audio) | (audio) | (audio) | (audio) |
| Sentence 3 | (audio) | (audio) | (audio) | (audio) |
| Sentence 4 | (audio) | (audio) | (audio) | (audio) |
| Sentence 5 | (audio) | (audio) | (audio) | (audio) |
| Sentence 6 | (audio) | (audio) | (audio) | (audio) |
| Sentence 7 | (audio) | (audio) | (audio) | (audio) |
| Sentence 8 | (audio) | (audio) | (audio) | (audio) |
| Sentence 9 | (audio) | (audio) | (audio) | (audio) |
| Sentence 10 | (audio) | (audio) | (audio) | (audio) |
@inproceedings{mehta2022neural,
  title={Neural {HMM}s are all you need (for high-quality attention-free {TTS})},
  author={Mehta, Shivam and Sz{\'e}kely, {\'E}va and Beskow, Jonas and Henter, Gustav Eje},
  booktitle={Proc. ICASSP},
  year={2022}
}