Unified speech and gesture synthesis using flow matching

Shivam Mehta, Ruibo Tu, Simon Alexanderson, Jonas Beskow, Éva Székely, and Gustav Eje Henter

We introduce a new method, Match-TTSG, for diffusion-like joint synthesis of speech and 3D gestures from text. Our main improvements are:

A new architecture that unifies speech and motion synthesis in a single pathway and decoder.
Training using flow matching, a.k.a. rectified flows.
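To make the training idea concrete, here is a minimal sketch of how one regression example can be built in conditional flow matching with straight-line paths between noise and data. It is an illustration under stated assumptions, not the Match-TTSG training code: the function names, the `sigma_min` parameter, and the toy three-dimensional "data" vector are all hypothetical, and the neural network that would predict the vector field is left out.

```python
import random


def flow_matching_pair(x1, t, rng, sigma_min=0.0):
    """Build one training example for a straight-line probability path.

    x1 is a data sample (here a plain list of floats), t is a time in [0, 1].
    Returns the noise sample x0, the point x_t on the path, and the
    regression target d(x_t)/dt that a vector-field model would be fit to.
    sigma_min is an assumed hyperparameter; 0.0 gives the rectified-flow case.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]           # noise sample
    scale = 1.0 - (1.0 - sigma_min) * t              # weight on the noise
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    target = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x0, x_t, target


def mse(pred, target):
    """Mean squared error between a predicted and a target vector field."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
x1 = [0.3, -1.2, 2.0]                                # toy stand-in for one data frame
x0, x_t, target = flow_matching_pair(x1, t=0.25, rng=rng)
loss_for_zero_prediction = mse([0.0] * len(x1), target)
```

In an actual system, the zero prediction in the last line would be replaced by the output of a network that takes `x_t`, `t`, and the text conditioning as input, and the loss would be averaged over many sampled times and data points.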

Compared to the previous state of the art, our new method:

Improves speech and motion quality
Is smaller
Is 10 times faster
Generates speech and gestures that are a much better fit for each other
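The evaluation tables on this page list the number of solver steps (50 or 500) used at synthesis time. As a rough illustration of what that number controls, the sketch below integrates a vector field from noise to data with a fixed-step Euler solver, so that the number of vector-field evaluations equals the number of steps. The choice of an Euler solver, the function names, and the toy vector field are assumptions made for illustration, not details taken from this page.

```python
def euler_sample(vector_field, x0, n_steps):
    """Integrate dx/dt = vector_field(x, t) from t = 0 to t = 1.

    Uses n_steps equal Euler steps, i.e. exactly n_steps evaluations
    of vector_field. x0 is the starting point (a list of floats).
    """
    x = list(x0)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h
        v = vector_field(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a trained model: the vector field that moves any point
# along a straight line so that it reaches `goal` at t = 1.
goal = [0.5, -1.0, 2.0]
calls = {"count": 0}


def toy_field(x, t):
    calls["count"] += 1
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, goal)]


sample = euler_sample(toy_field, x0=[0.0, 0.0, 0.0], n_steps=50)
```

With a learned vector field, each evaluation is a full forward pass of the decoder, which is why synthesis cost grows roughly in proportion to the number of solver steps.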

To our knowledge, this is the first method synthesising 3D motion using flow matching or rectified flows.

Please check out the examples below and read our arXiv preprint for more details. Code and pre-trained models will be made available in a few weeks.

Example stimuli from the evaluation

Speech-only evaluation

Transcription:

I mean it it's not that I'm against it it's just that I just don't have the time and I just sometimes I'm not bothered and that sort of stuff.

Condition	NAT	DIFF	MA	MA	SM	SM
Solver steps	-	50	50	500	50	500

Text prompts: 1, 2, 3, 4

Gesture-only evaluation (no audio)

I mean it it's not that I'm against it it's just that I just don't have the time and I just sometimes I'm not bothered and that sort of stuff.

Condition	NAT	DIFF	MA	MA	SM	SM
Solver steps	-	500	50	500	50	500

Text prompts: 1, 2, 3, 4

Speech-and-gesture evaluation

Matched	Mismatched

Note: Matched and mismatched stimuli were not labelled in the study and were presented in random order.

Yeah and then obviously there, there's certain choirs that come down to the church. There's a woman called, I can't remember her name. But she has an incredible voice. Like an amazing voice.

Condition	NAT	DIFF	MA	MA	SM
Solver steps	-	50 & 500	50	500	500

Text prompts: 1, 2, 3, 4

Faster synthesis speed for long utterances

Transcription:

The sun slowly rises. Casting a golden hue upon the tranquil landscape. Birds chirp melodiously welcoming the dawn. Nature awakens with a gentle breeze rustling through the leaves creating a harmonious symphony of life. This mesmerizing moment is a reminder that nature's beauty is eternal, an ever-repeating masterpiece that never fails to captivate our senses. As the sun continues its ascent, the world beneath its warm embrace stirs to life. The meandering river, once shrouded in darkness, now glistens like liquid gold, reflecting the radiant morning sky. Each ripple seems to dance to its own rhythm, adding to the symphony of nature's awakening.

Text prompt #	MA-50 RTF	SM-50 RTF
1	0.0221	0.1311
2	0.0213	0.1287
3	0.0242	0.1304
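RTF is not defined on this page. Assuming it denotes the usual real-time factor (synthesis time divided by the duration of the generated audio, so lower is faster), the figures above can be compared as in the following sketch. The helper names are hypothetical; the numbers are copied from the table.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Real-time factor: seconds of compute per second of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def mean(values):
    return sum(values) / len(values)


# RTF values for the three long text prompts, copied from the table above.
rtf_ma_50 = [0.0221, 0.0213, 0.0242]
rtf_sm_50 = [0.1311, 0.1287, 0.1304]

# Ratio of mean RTFs: how many times faster MA-50 is than SM-50 on average.
speedup = mean(rtf_sm_50) / mean(rtf_ma_50)
```

On these three prompts the mean RTFs differ by a factor of roughly 5.8 in favour of MA-50.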