Neural HMM TTS

Neural HMMs are all you need (for high-quality attention-free TTS)

Shivam Mehta, Éva Székely, Jonas Beskow, and Gustav Eje Henter

Summary

We show that classic, HMM-based speech synthesis and modern, neural text-to-speech (TTS) can be combined to obtain the best of both worlds. Concretely, our proposal amounts to replacing conventional attention in neural TTS by so-called neural HMMs. We call this new approach “neural HMM TTS”.

To validate our proposal, we describe a modified version of Tacotron 2 that uses neural HMMs instead of attention. As demonstrated by the examples on this page, the resulting system learns to speak and align quickly, can speak sentences that the Tacotron 2 baseline cannot, and allows direct control over the speaking rate.

To our knowledge, this is the first time HMM-based speech synthesis has achieved a speech quality on par with neural TTS.

For more information, please read our ICASSP 2022 paper here.

Architecture

(Figure: synthesising from the neural HMM.)
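
The figure cannot be embedded here, so below is a minimal sketch of what one autoregressive synthesis step looks like in a neural-HMM decoder: conditioned on the previous acoustic frame and the vector for the current left-to-right HMM state, the network predicts a Gaussian over the next frame and a probability of advancing to the next state. This is an illustration under our own simplifying assumptions (single-Gaussian emissions, greedy state advancement); all module and variable names are hypothetical and do not match the actual repository code.

```python
import torch
import torch.nn as nn


class NeuralHMMStep(nn.Module):
    """Simplified sketch of one neural-HMM decoder step (hypothetical names).

    Conditioned on the previous acoustic frame and the encoder vector for the
    current left-to-right HMM state, the network predicts a Gaussian over the
    next frame and a probability of advancing to the next state.
    """

    def __init__(self, n_mels=80, state_dim=512, hidden=256):
        super().__init__()
        self.prenet = nn.Sequential(nn.Linear(n_mels, hidden), nn.ReLU())
        self.rnn = nn.LSTMCell(hidden + state_dim, hidden)
        self.mean = nn.Linear(hidden, n_mels)      # emission mean
        self.log_std = nn.Linear(hidden, n_mels)   # emission log-std
        self.transition = nn.Linear(hidden, 1)     # logit of P(advance)

    def forward(self, prev_frame, state_vec, hc):
        x = torch.cat([self.prenet(prev_frame), state_vec], dim=-1)
        h, c = self.rnn(x, hc)
        p_advance = torch.sigmoid(self.transition(h))
        return self.mean(h), self.log_std(h), p_advance, (h, c)


@torch.no_grad()
def synthesise(step, state_vecs, max_frames=1000):
    """Greedy synthesis: emit the mean frame and advance the left-to-right
    state whenever the predicted transition probability exceeds 0.5."""
    frame = torch.zeros(1, step.mean.out_features)             # "go" frame
    hc = (torch.zeros(1, step.rnn.hidden_size),
          torch.zeros(1, step.rnn.hidden_size))
    frames, state = [], 0
    for _ in range(max_frames):
        mu, _, p_advance, hc = step(frame, state_vecs[state:state + 1], hc)
        frames.append(mu)
        frame = mu                                  # autoregressive feedback
        if p_advance.item() > 0.5:
            state += 1
            if state == state_vecs.size(0):         # left the last state: stop
                break
    return torch.cat(frames, dim=0)                 # (T, n_mels)


# Untrained toy run: 5 phone states, each a 512-dimensional encoder vector.
mel = synthesise(NeuralHMMStep(), torch.randn(5, 512))
```

Unlike attention, the alignment can only move monotonically through the states, one state at a time, so the model cannot skip or repeat parts of the input.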

Code

Code is available in our GitHub repository, along with a pre-trained model.

Listening examples

Stimuli from listening test

Conditions: proposed neural HMM TTS with 2 states per phone (NH2) or 1 state per phone (NH1); Tacotron 2 baseline w/o post-net (T2-P) or w/ post-net (T2+P).
Sentence 1
Sentence 2
Sentence 3
Sentence 4
Sentence 5
Sentence 6

Learning to speak and align quickly

Number of training iterations: (drag the slider)

Utterance synthesised by each model at the selected iteration:
NH2
T2-P
NH2 (500 utterances)
T2-P (500 utterances)

Speaking sentences Tacotron 2 cannot speak

Conditions: neural HMM (NH2); Tacotron 2 (T2-P); pre-trained NVIDIA Tacotron 2.
Example 1
Example 2
Example 3
Example 4

Effect of different output-generation methods

All samples are from neural HMM TTS model NH2. The four conditions pair the duration-generation method (quantile or sampled) with the acoustic-generation method (mean or sampled): quantile durations + mean acoustics, sampled durations + mean acoustics, quantile durations + sampled acoustics, and sampled durations + sampled acoustics. (A minimal sketch of these choices follows the examples below.)
Sentence 1
Sentence 2
Sentence 3
Sentence 4
Sentence 5
Sentence 6
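
The conditions above differ only in how outputs are read from the model's distributions at synthesis time. Below is a minimal sketch of the two choices, assuming single-Gaussian emissions and a per-frame transition probability; the function and parameter names are illustrative, not the repository's.

```python
import torch


def generate_frame(mu, log_std, sample_acoustics: bool) -> torch.Tensor:
    """Acoustics: either output the Gaussian mean or draw a sample from it."""
    if sample_acoustics:
        return mu + torch.exp(log_std) * torch.randn_like(mu)
    return mu


def should_advance(p_advance: float, frames_in_state: int,
                   sample_durations: bool, quantile: float = 0.5) -> bool:
    """Durations: either sample the per-frame transition (random durations) or
    advance deterministically once the probability of having left the state
    exceeds a fixed quantile (deterministic durations)."""
    if sample_durations:
        return torch.rand(()).item() < p_advance
    # If p_advance were constant, the state duration would be geometric:
    # P(duration <= d) = 1 - (1 - p_advance) ** d.
    return 1.0 - (1.0 - p_advance) ** frames_in_state >= quantile
```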

Effect of pre-net dropout during synthesis

All samples are from neural HMM TTS model NH2. Without pre-net dropout the output is the same every time; with pre-net dropout enabled during synthesis, three different example outputs are shown per sentence. (A minimal sketch of this switch follows the examples below.)
Sentence 1
Sentence 2
Sentence 3
Sentence 4
Sentence 5
Sentence 6
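
Like Tacotron 2, the model applies dropout in the decoder pre-net, and keeping that dropout active at synthesis time makes the output stochastic, whereas disabling it (together with mean/quantile output generation) makes synthesis deterministic. A minimal sketch of the switch, with hypothetical module and argument names:

```python
import torch
import torch.nn.functional as F


class Prenet(torch.nn.Module):
    """Tacotron-2-style pre-net: two linear layers with ReLU and dropout."""

    def __init__(self, n_mels=80, hidden=256, p_dropout=0.5):
        super().__init__()
        self.fc1 = torch.nn.Linear(n_mels, hidden)
        self.fc2 = torch.nn.Linear(hidden, hidden)
        self.p_dropout = p_dropout

    def forward(self, x, dropout_at_synthesis: bool = True):
        # Keep dropout active even in eval mode unless the caller turns it
        # off; the dropout mask differs between runs, so the audio varies.
        use_dropout = self.training or dropout_at_synthesis
        x = F.dropout(F.relu(self.fc1(x)), self.p_dropout, training=use_dropout)
        x = F.dropout(F.relu(self.fc2(x)), self.p_dropout, training=use_dropout)
        return x


prenet = Prenet().eval()
x = torch.randn(1, 80)
y1, y2 = prenet(x), prenet(x)                  # differ: dropout is still on
z1, z2 = prenet(x, False), prenet(x, False)    # identical: deterministic
```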

Control over speaking rate

Drag the slider to change the speaking rate.

Duration quantile: (drag the slider)

Sentences synthesised by neural HMM TTS (model NH2) at the selected duration quantile (a worked example of the quantile-to-duration mapping follows the examples below):
Sentence 1
Sentence 2
Sentence 3
Sentence 4
Sentence 5
Sentence 6
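
Quantile-based duration generation makes this control simple: raising the quantile makes the model stay longer in every state, which slows the speech down, while lowering it speeds the speech up. A small worked example, under the simplifying assumption that the per-frame transition probability is roughly constant within a state (so state durations are geometric); the function name and numbers are illustrative only.

```python
import math


def quantile_duration(p_advance: float, quantile: float) -> int:
    """Smallest number of frames d with P(duration <= d) >= quantile, assuming
    geometric state durations with per-frame exit probability p_advance."""
    stay = 1.0 - p_advance
    return max(1, math.ceil(math.log(1.0 - quantile) / math.log(stay)))


# With an exit probability of 0.25 per frame, raising the quantile lengthens
# every state and therefore slows the speech down:
for q in (0.1, 0.5, 0.9):
    print(q, quantile_duration(0.25, q))   # 1, 3 and 9 frames respectively
```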

Effect of using HiFi-GAN and a post-net

These audio samples use a stronger vocoder (HiFi-GAN version LJ_FT_T2_V1) and also include a hybrid condition NH2+P that demonstrates the effect of applying the post-net from model T2 to the output of model NH2.
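
As a rough sketch of the hybrid NH2+P condition: Tacotron 2's post-net predicts a residual that is added to the mel spectrogram, and the corrected spectrogram is then vocoded with HiFi-GAN. The code below is an illustration only; it assumes the post-net and HiFi-GAN generator have already been loaded as PyTorch modules, and the function name is ours.

```python
import torch


@torch.no_grad()
def nh2_plus_p(mel: torch.Tensor,
               postnet: torch.nn.Module,
               hifigan: torch.nn.Module) -> torch.Tensor:
    """mel: (1, n_mels, T) spectrogram synthesised by the neural HMM (NH2);
    postnet: Tacotron 2's convolutional post-net; hifigan: HiFi-GAN generator.

    Tacotron 2's post-net predicts a residual, so the NH2+P condition simply
    adds that residual to NH2's output before vocoding."""
    mel_post = mel + postnet(mel)          # apply the post-net correction
    return hifigan(mel_post).squeeze()     # waveform samples
```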

Conditions: proposed neural HMM TTS with 2 states per phone (NH2) and the same model with Tacotron 2's post-net applied (NH2+P); Tacotron 2 baseline w/o post-net (T2-P) or w/ post-net (T2+P).
Sentence 1
Sentence 2
Sentence 3
Sentence 4
Sentence 5
Sentence 6
Sentence 7
Sentence 8
Sentence 9
Sentence 10

Citation information

@inproceedings{mehta2022neural,
  title={Neural {HMM}s are all you need (for high-quality attention-free {TTS})},
  author={Mehta, Shivam and Sz{\'e}kely, {\'E}va and Beskow, Jonas and Henter, Gustav Eje},
  booktitle={Proc. ICASSP},
  year={2022}
}
