Unified speech and gesture synthesis using flow matching

This project is maintained by shivammehta25

Shivam Mehta, Ruibo Tu, Simon Alexanderson, Jonas Beskow, Éva Székely, and Gustav Eje Henter

We introduce a new method, Match-TTSG, for diffusion-like joint synthesis of speech and 3D gestures from text. Our main improvements are:

  1. A new architecture that unifies speech and motion synthesis into a single pathway and decoder.
  2. Training using flow matching, a.k.a. rectified flows.
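As a rough illustration of what training with flow matching involves, the sketch below builds the regression target for one batch and then samples with a fixed number of solver steps. It assumes the optimal-transport conditional flow matching formulation with a small `sigma_min` and a plain Euler solver; this page does not state those details, so treat them as assumptions. The names `cfm_training_pair`, `cfm_loss`, `euler_sample`, and `vector_field` are hypothetical and are not taken from the Match-TTSG code.

```python
import numpy as np

def cfm_training_pair(x1, t, rng, sigma_min=1e-4):
    """Build the interpolated input x_t and the regression target u_t.

    x1: data batch, shape (batch, ...). t: times in [0, 1], shape (batch,).
    sigma_min is an assumed hyperparameter, not a value from the page.
    """
    x0 = rng.standard_normal(x1.shape)            # noise sample
    t = t.reshape(-1, *([1] * (x1.ndim - 1)))     # broadcast over feature dims
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0             # target velocity along the path
    return x_t, u_t

def cfm_loss(vector_field, x1, rng, sigma_min=1e-4):
    """Mean squared error between the predicted and target vector field."""
    t = rng.uniform(size=x1.shape[0])
    x_t, u_t = cfm_training_pair(x1, t, rng, sigma_min)
    return np.mean((vector_field(x_t, t) - u_t) ** 2)

def euler_sample(vector_field, x0, n_steps):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = np.full(x.shape[0], i * dt)
        x = x + dt * vector_field(x, t)
    return x
```

Under these assumptions, `n_steps` plays the role of the "solver steps" setting (50 or 500) listed in the stimulus tables below: fewer steps mean fewer evaluations of the learned vector field per utterance.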

Compared to the previous state of the art, our new method:

To our knowledge, this is the first method synthesising 3D motion using flow matching or rectified flows.

Please check out the examples below and read our arXiv preprint for more details. Code and pre-trained models will be made available in a few weeks.

Example stimuli from the evaluation

Speech-only evaluation

Transcription:

I mean it it's not that I'm against it it's just that I just don't have the time and I just sometimes I'm not bothered and that sort of stuff.

Condition      NAT   DIFF   MA        SM
Solver steps   -     50     50, 500   50, 500
Text prompts   1, 2, 3, 4

Gesture-only evaluation (no audio)

Transcription:

I mean it it's not that I'm against it it's just that I just don't have the time and I just sometimes I'm not bothered and that sort of stuff.

Condition      NAT   DIFF   MA        SM
Solver steps   -     500    50, 500   50, 500
Text prompts   1, 2, 3, 4

Speech-and-gesture evaluation

Stimuli are provided in matched and mismatched versions.
Note: Matched and mismatched stimuli were not labelled in the study and were presented in random order.

Transcription:

Yeah and then obviously there, there's certain choirs that come down to the church. There's a woman called, I can't remember her name. But she has an incredible voice. Like an amazing voice.

Conditions: NAT, DIFF, MA, SM
Solver steps (as listed): -, 50 & 500, 50, 500, 500
Text prompts: 1, 2, 3, 4

Faster synthesis speed for long utterances

Transcription:

The sun slowly rises. Casting a golden hue upon the tranquil landscape. Birds chirp melodiously welcoming the dawn. Nature awakens with a gentle breeze rustling through the leaves creating a harmonious symphony of life. This mesmerizing moment is a reminder that nature's beauty is eternal, an ever-repeating masterpiece that never fails to captivate our senses. As the sun continues its ascent, the world beneath its warm embrace stirs to life. The meandering river, once shrouded in darkness, now glistens like liquid gold, reflecting the radiant morning sky. Each ripple seems to dance to its own rhythm, adding to the symphony of nature's awakening.

Text prompt #   MA-50 RTF   SM-50 RTF
1               0.0221      0.1311
2               0.0213      0.1287
3               0.0242      0.1304
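To make the speed comparison concrete, the short sketch below turns the RTF values from the table above into per-prompt speed-up ratios. It assumes RTF has its usual meaning of real-time factor (synthesis time divided by the duration of the generated audio, so lower is faster); the page does not define the term. The helper `real_time_factor` is hypothetical and only illustrates that definition.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Assumed definition: time spent synthesising per second of output audio."""
    return synthesis_seconds / audio_seconds

# RTF values copied from the table, one entry per text prompt.
rtf_ma50 = [0.0221, 0.0213, 0.0242]
rtf_sm50 = [0.1311, 0.1287, 0.1304]

# Ratio of the two RTFs: how many times faster MA-50 is than SM-50.
speedups = [sm / ma for ma, sm in zip(rtf_ma50, rtf_sm50)]

for prompt, ratio in enumerate(speedups, start=1):
    print(f"prompt {prompt}: MA-50 is {ratio:.1f}x faster than SM-50")
```

On these three long prompts the ratio falls between roughly 5.4 and 6.0, which matches the section's claim of faster synthesis for long utterances.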
