Matcha-TTS: A fast TTS architecture with conditional flow matching

Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter

We propose 🍵 Matcha-TTS, a new approach to non-autoregressive neural TTS that uses conditional flow matching (similar to rectified flows) to speed up ODE-based speech synthesis. Our method:

Is probabilistic
Has a compact memory footprint
Sounds highly natural
Is very fast to synthesise from

See below for audio examples, or read our ICASSP 2024 paper for more details. Code is available in our GitHub repository, along with pre-trained models.

You can also try 🍵 Matcha-TTS in your browser on Hugging Face 🤗 Spaces.

Stimuli from the listening test

Click the buttons in the table to load and play the different stimuli.

Currently loaded stimulus: MAT-10 : Sentence 1

Transcription:

It had established periodic regular review of the status of four hundred individuals;

System	Condition	Sentence 1	Sentence 2	Sentence 3	Sentence 4	Sentence 5	Sentence 6
Vocoded speech	VOC
Matcha-TTS	MAT-10
Matcha-TTS	MAT-4
Matcha-TTS	MAT-2
Grad-TTS	GRAD-10
Grad-TTS	GRAD-4
Grad-TTS + CFM	GCFM-4
FastSpeech 2	FS2
VITS	VITS

Effect of the number of ODE solver steps

Steps (slider range): 1 to 500

System	Sentence 1	Sentence 2	Sentence 3
Matcha-TTS
Grad-TTS
Grad-TTS + CFM
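
To make the role of the step count concrete, here is a toy sketch of fixed-step Euler integration, a common choice of ODE solver. It is not the Matcha-TTS decoder: `euler_solve`, `straight_field`, `curved_field`, `X0`, and `TARGET` are hypothetical names, and the two scalar vector fields are hand-written stand-ins for a learned network. The step counts 2, 4, and 10 echo the condition labels in the listening-test table, and 500 is simply a large value for comparison. The sketch only illustrates a general property: when trajectories are close to straight, a few steps already land near the endpoint, whereas curved trajectories need more.

```python
import math

X0 = 0.5      # hypothetical starting value (stands in for a noise sample)
TARGET = 2.0  # hypothetical endpoint (stands in for a data sample)


def euler_solve(vector_field, x0, n_steps):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with n_steps Euler steps."""
    x = x0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * vector_field(x, t)
    return x


def straight_field(x, t):
    # Constant velocity: the exact path is the straight line from X0 to TARGET.
    return TARGET - X0


def curved_field(x, t):
    # Exponential path x(t) = X0 * (TARGET / X0) ** t, which also ends at TARGET.
    return math.log(TARGET / X0) * x


if __name__ == "__main__":
    for n in (1, 2, 4, 10, 500):
        err_straight = abs(euler_solve(straight_field, X0, n) - TARGET)
        err_curved = abs(euler_solve(curved_field, X0, n) - TARGET)
        print(f"steps={n:3d}  straight error={err_straight:.4f}  curved error={err_curved:.4f}")
```

In this toy setting the straight field reaches the endpoint in a single step, up to floating-point rounding, while the error of the curved field shrinks as the step count grows.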

Citation information

@inproceedings{mehta2024matcha,
  title={Matcha-{TTS}: A fast {TTS} architecture with conditional flow matching},
  author={Mehta, Shivam and Tu, Ruibo and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
  booktitle={Proc. ICASSP},
  year={2024}
}