Neural HMMs are all you need (for high-quality attention-free TTS)
This project is maintained by shivammehta25
We show that classic, HMM-based speech synthesis and modern, neural text-to-speech (TTS) can be combined to obtain the best of both worlds. Concretely, our proposal amounts to replacing the conventional attention mechanism in neural TTS with so-called neural HMMs. We call this new approach “neural HMM TTS”.
To validate our proposal, we describe a modified version of Tacotron 2 that uses neural HMMs instead of attention. To our knowledge, this is the first time HMM-based speech synthesis has achieved speech quality on par with neural TTS.
For more information, please read our ICASSP 2022 paper.
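Because a neural HMM replaces attention with explicit HMM probabilities, the model can be trained on an exact sequence likelihood computed with the forward algorithm. The following is a minimal, hypothetical sketch (not the authors' code), assuming a left-to-right, no-skip HMM whose emission and self-transition probabilities would in practice be predicted by a neural network:

```python
# Hypothetical sketch: log-domain forward algorithm for a left-to-right,
# no-skip HMM, the kind of likelihood computation that replaces attention
# in neural HMM TTS. In the real model, log_emit and log_stay would come
# from a neural network conditioned on the input text.
import numpy as np

def forward_log_likelihood(log_emit, log_stay):
    """log_emit: (T, N) array of log p(frame_t | state_n).
    log_stay: (N,) array of log P(self-transition in state n).
    Returns log p(all T frames) for an HMM that starts in state 0
    and must end in state N-1."""
    T, N = log_emit.shape
    log_move = np.log1p(-np.exp(log_stay))  # log P(advance) = log(1 - P(stay))
    alpha = np.full(N, -np.inf)
    alpha[0] = log_emit[0, 0]               # must start in the first state
    for t in range(1, T):
        stay = alpha + log_stay             # remain in the same state
        move = np.full(N, -np.inf)
        move[1:] = alpha[:-1] + log_move[:-1]  # advance to the next state
        alpha = np.logaddexp(stay, move) + log_emit[t]
    return alpha[-1]                        # must finish in the last state
```

Unlike soft attention, this recursion enforces strictly monotonic left-to-right alignment by construction, which is why the approach cannot skip or repeat portions of the input.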

Code is available in our GitHub repository, along with a pre-trained model.
| Condition | NH2 (proposed, 2 states per phone) | NH1 (proposed, 1 state per phone) | T2-P (baseline, w/o post-net) | T2+P (baseline, w/ post-net) |
|---|---|---|---|---|
| Sentence 1 | | | | |
| Sentence 2 | | | | |
| Sentence 3 | | | | |
| Sentence 4 | | | | |
| Sentence 5 | | | | |
| Sentence 6 | | | | |
| Model | Utterance at different training iterations |
|---|---|
| NH2 | |
| T2-P | |
| NH2 (500 utterances) | |
| T2-P (500 utterances) |
| | Neural HMM (NH2) | Tacotron 2 (T2-P) | Pre-trained NVIDIA Tacotron 2 |
|---|---|---|---|
| Example 1 | | | |
| Example 2 | | | |
| Example 3 | | | |
| Example 4 | | | |
Neural HMM TTS (model NH2):

| Durations | Quantile | Sampled | Quantile | Sampled |
|---|---|---|---|---|
| Acoustics | Mean | Mean | Sampled | Sampled |
| Sentence 1 | | | | |
| Sentence 2 | | | | |
| Sentence 3 | | | | |
| Sentence 4 | | | | |
| Sentence 5 | | | | |
| Sentence 6 | | | | |
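The table above contrasts taking state durations at a quantile with sampling them. A state with self-transition probability `p_stay` implies a geometric distribution over how many frames the model dwells in that state, so both strategies can be illustrated in a few lines. This is a hedged sketch under that geometric-duration assumption, not the paper's code:

```python
# Hypothetical sketch: a state's self-transition probability p_stay implies
# a geometric duration distribution P(d) = p_stay**(d-1) * (1 - p_stay)
# for d = 1, 2, ...  Deterministic synthesis can take a quantile of this
# distribution; stochastic synthesis samples from it.
import numpy as np

def quantile_duration(p_stay, q=0.5):
    """Smallest d with P(duration <= d) >= q, using CDF(d) = 1 - p_stay**d."""
    return int(np.ceil(np.log1p(-q) / np.log(p_stay)))

def sampled_duration(p_stay, rng):
    """Draw a duration: number of frames until the 'advance' event occurs."""
    return rng.geometric(1.0 - p_stay)
```

For example, with `p_stay = 0.9` the median duration is 7 frames, whereas sampling produces a different duration on each run, which is one source of the output variation heard in the "Sampled" columns.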
Neural HMM TTS (model NH2):

| Pre-net dropout? | ❌ | ✅ | ✅ | ✅ |
|---|---|---|---|---|
| Output | Same every time | Example 1 | Example 2 | Example 3 |
| Sentence 1 | | | | |
| Sentence 2 | | | | |
| Sentence 3 | | | | |
| Sentence 4 | | | | |
| Sentence 5 | | | | |
| Sentence 6 | | | | |
These samples demonstrate control over the speaking rate.
| Sentence | Neural HMM TTS (model NH2) |
|---|---|
| Sentence 1 | |
| Sentence 2 | |
| Sentence 3 | |
| Sentence 4 | |
| Sentence 5 | |
| Sentence 6 |
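The page does not spell out the control mechanism, but in a model with explicit per-state durations, speaking rate has an obvious knob: rescale the number of frames spent in each state. The helper below is a hypothetical illustration of that idea, not the authors' implementation:

```python
# Hypothetical sketch: speaking-rate control by rescaling per-state durations.
# In an HMM-based model, state durations directly determine timing, so
# dividing each duration by a rate factor speeds speech up (rate > 1)
# or slows it down (rate < 1).
def rescale_durations(durations, rate):
    """durations: per-state frame counts; rate=2.0 means roughly twice as fast.
    Each state keeps at least one frame so no phone is dropped entirely."""
    return [max(1, round(d / rate)) for d in durations]
```

Because durations are explicit rather than emerging implicitly from attention, such adjustments cannot cause the skipping or babbling failure modes of attention-based TTS.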
These audio samples use a stronger vocoder (HiFi-GAN version LJ_FT_T2_V1) and also include a hybrid condition NH2+P that demonstrates the effect of applying the post-net from model T2 to the output of model NH2.
| Condition | NH2 (proposed, 2 states per phone) | NH2+P (NH2 with Tacotron 2's post-net) | T2-P (baseline, w/o post-net) | T2+P (baseline, w/ post-net) |
|---|---|---|---|---|
| Sentence 1 | | | | |
| Sentence 2 | | | | |
| Sentence 3 | | | | |
| Sentence 4 | | | | |
| Sentence 5 | | | | |
| Sentence 6 | | | | |
| Sentence 7 | | | | |
| Sentence 8 | | | | |
| Sentence 9 | | | | |
| Sentence 10 | | | | |
@inproceedings{mehta2022neural,
title={Neural {HMM}s are all you need (for high-quality attention-free {TTS})},
author={Mehta, Shivam and Sz{\'e}kely, {\'E}va and Beskow, Jonas and Henter, Gustav Eje},
booktitle={Proc. ICASSP},
year={2022}
}