Neural HMMs are all you need (for high-quality attention-free TTS)
This project is maintained by shivammehta25
We show that classic, HMM-based speech synthesis and modern, neural text-to-speech (TTS) can be combined to obtain the best of both worlds. Concretely, our proposal amounts to replacing conventional attention in neural TTS with so-called neural HMMs. We call this new approach “neural HMM TTS”.
To validate our proposal, we describe a modified version of Tacotron 2 that uses neural HMMs instead of attention. The resulting system:

- needs fewer parameters than the Tacotron 2 baseline,
- learns to speak and align in fewer training updates,
- is fully probabilistic, and
- allows easy control of the speaking rate.

To our knowledge, this is the first time HMM-based speech synthesis has achieved a speech quality on par with neural TTS.
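To make the idea concrete, here is a minimal, hypothetical sketch of synthesis from a left-to-right neural HMM. None of the names below come from the released code, and `decoder` is a stand-in for the actual neural network; the point is the control flow that replaces the attention loop: one (or two) learned state vectors per phone, an autoregressive decoder that emits Gaussian parameters plus a transition probability per frame, and either deterministic or sampled choices for acoustics and durations.

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder(state_vec, prev_frame):
    """Stand-in for the neural decoder network (assumption, not real code).

    The real model maps (state vector, previous acoustic frame) to Gaussian
    emission parameters and a transition probability; here we fake all three
    so the control flow can run on its own.
    """
    mean = 0.1 * state_vec + 0.9 * prev_frame
    std = np.full_like(mean, 0.1)
    trans_prob = 0.3  # probability of advancing to the next state
    return mean, std, trans_prob

def synthesise(states, max_frames=200, tau=0.25,
               sample_acoustics=False, sample_durations=False):
    """Autoregressive generation from a left-to-right neural HMM.

    states: one or two learned vectors per phone, in order.
    tau: quantile threshold for deterministic durations -- the state is
         left as soon as the transition probability exceeds tau.
    """
    frames, s = [], 0
    x = np.zeros_like(states[0])  # initial "previous frame"
    for _ in range(max_frames):
        mean, std, p = decoder(states[s], x)
        # Acoustics: take the mean (deterministic) or sample the Gaussian.
        x = rng.normal(mean, std) if sample_acoustics else mean
        frames.append(x)
        # Durations: sample the transition, or threshold it deterministically.
        advance = (rng.random() < p) if sample_durations else (p > tau)
        if advance:
            s += 1
            if s == len(states):
                break  # left the final state: utterance finished
    return np.stack(frames)
```

With both flags off, generation is deterministic and monotonic by construction (no attention failures are possible); switching the flags on corresponds to sampling durations and acoustics, as demonstrated in the listening examples on this page.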
For more information, please read our ICASSP 2022 paper. Code and a pre-trained model are available in our GitHub repository.
| Type | Proposed neural HMM TTS | | Tacotron 2 baseline | |
| --- | --- | --- | --- | --- |
| Condition | 2 states per phone (NH2) | 1 state per phone (NH1) | w/o post-net (T2-P) | w/ post-net (T2+P) |
| Sentence 1 | (audio) | (audio) | (audio) | (audio) |
| Sentence 2 | (audio) | (audio) | (audio) | (audio) |
| Sentence 3 | (audio) | (audio) | (audio) | (audio) |
| Sentence 4 | (audio) | (audio) | (audio) | (audio) |
| Sentence 5 | (audio) | (audio) | (audio) | (audio) |
| Sentence 6 | (audio) | (audio) | (audio) | (audio) |
| Model | Utterance at different training iterations |
| --- | --- |
| NH2 | (audio) |
| T2-P | (audio) |
| NH2 (500 utterances) | (audio) |
| T2-P (500 utterances) | (audio) |
| | Neural HMM (NH2) | Tacotron 2 (T2-P) | Pre-trained NVIDIA Tacotron 2 |
| --- | --- | --- | --- |
| Example 1 | (audio) | (audio) | (audio) |
| Example 2 | (audio) | (audio) | (audio) |
| Example 3 | (audio) | (audio) | (audio) |
| Example 4 | (audio) | (audio) | (audio) |
| | Neural HMM TTS (model NH2) | | | |
| --- | --- | --- | --- | --- |
| Durations | Quantile | Sampled | Quantile | Sampled |
| Acoustics | Mean | Mean | Sampled | Sampled |
| Sentence 1 | (audio) | (audio) | (audio) | (audio) |
| Sentence 2 | (audio) | (audio) | (audio) | (audio) |
| Sentence 3 | (audio) | (audio) | (audio) | (audio) |
| Sentence 4 | (audio) | (audio) | (audio) | (audio) |
| Sentence 5 | (audio) | (audio) | (audio) | (audio) |
| Sentence 6 | (audio) | (audio) | (audio) | (audio) |
| | Neural HMM TTS (model NH2) | | | |
| --- | --- | --- | --- | --- |
| Pre-net dropout? | ❌ | ✅ | ✅ | ✅ |
| Output | Same every time | Example 1 | Example 2 | Example 3 |
| Sentence 1 | (audio) | (audio) | (audio) | (audio) |
| Sentence 2 | (audio) | (audio) | (audio) | (audio) |
| Sentence 3 | (audio) | (audio) | (audio) | (audio) |
| Sentence 4 | (audio) | (audio) | (audio) | (audio) |
| Sentence 5 | (audio) | (audio) | (audio) | (audio) |
| Sentence 6 | (audio) | (audio) | (audio) | (audio) |
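The variation between examples comes from keeping the dropout in Tacotron 2's pre-net switched on at synthesis time, as Tacotron 2 itself does. A minimal illustration, assuming a standard ReLU-plus-inverted-dropout layer (the function below is a hypothetical sketch, not the released code):

```python
import numpy as np

rng = np.random.default_rng()

def prenet_layer(x, dropout=True, p=0.5):
    """One pre-net layer: ReLU followed by (optional) inverted dropout.

    Leaving dropout active at synthesis time makes every generation pass
    slightly different; disabling it makes the output identical each call.
    """
    h = np.maximum(x, 0.0)          # ReLU
    if dropout:
        mask = rng.random(h.shape) >= p
        h = h * mask / (1.0 - p)    # inverted-dropout scaling
    return h
```

Disabling dropout gives the "Same every time" column; each pass with dropout enabled yields a different example.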
The following examples demonstrate control over the speaking rate (adjustable via a slider on the original demo page).
| Sentence | Neural HMM TTS (model NH2) |
| --- | --- |
| Sentence 1 | (audio) |
| Sentence 2 | (audio) |
| Sentence 3 | (audio) |
| Sentence 4 | (audio) |
| Sentence 5 | (audio) |
| Sentence 6 | (audio) |
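Speaking-rate control falls out of the duration model. Under the simplifying assumption of a constant per-frame transition probability `p`, a state's duration is geometrically distributed, and quantile-based duration generation picks the smallest duration whose CDF reaches the chosen quantile; raising `p` then shortens every state and speeds up the speech. A hypothetical worked example (not the released code):

```python
import math

def quantile_duration(p, tau=0.5):
    """Smallest d with P(duration <= d) >= tau for a Geometric(p) duration,
    where P(duration <= d) = 1 - (1 - p)**d."""
    return math.ceil(math.log(1.0 - tau) / math.log(1.0 - p))

# Raising the per-frame transition probability shortens the generated
# durations, which is one simple handle on the speaking rate:
print(quantile_duration(0.3))  # median duration at the base probability -> 2
print(quantile_duration(0.6))  # doubled transition probability -> 1
```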
These audio samples use a stronger vocoder (HiFi-GAN, version LJ_FT_T2_V1) and also include a hybrid condition, NH2+P, that demonstrates the effect of applying the Tacotron 2 post-net to the output of model NH2.
| Type | Proposed neural HMM TTS | | Tacotron 2 baseline | |
| --- | --- | --- | --- | --- |
| Condition | 2 states per phone (NH2) | NH2 with Tacotron 2's post-net (NH2+P) | w/o post-net (T2-P) | w/ post-net (T2+P) |
| Sentence 1 | (audio) | (audio) | (audio) | (audio) |
| Sentence 2 | (audio) | (audio) | (audio) | (audio) |
| Sentence 3 | (audio) | (audio) | (audio) | (audio) |
| Sentence 4 | (audio) | (audio) | (audio) | (audio) |
| Sentence 5 | (audio) | (audio) | (audio) | (audio) |
| Sentence 6 | (audio) | (audio) | (audio) | (audio) |
| Sentence 7 | (audio) | (audio) | (audio) | (audio) |
| Sentence 8 | (audio) | (audio) | (audio) | (audio) |
| Sentence 9 | (audio) | (audio) | (audio) | (audio) |
| Sentence 10 | (audio) | (audio) | (audio) | (audio) |
@inproceedings{mehta2022neural,
  title={Neural {HMM}s are all you need (for high-quality attention-free {TTS})},
  author={Mehta, Shivam and Sz{\'e}kely, {\'E}va and Beskow, Jonas and Henter, Gustav Eje},
  booktitle={Proc. ICASSP},
  year={2022}
}