Enhancing zero-shot timbre conversion using semantic alignment
This project is maintained by shivammehta25
Zero-shot voice conversion (VC) synthesizes speech in a target speaker’s voice while preserving linguistic and paralinguistic content. However, timbre leakage—where source speaker traits persist—remains a challenge, especially in neural codec and LLM-based VC, where quantized representations entangle speaker identity with content. We introduce SemAlignVC, an architecture designed to prevent timbre leakage using SemAlign, a novel method that aligns text and audio representations to ensure speaker-independent semantic encoding. This disentangled representation conditions an autoregressive transformer for high-fidelity conversion without explicit speaker embeddings. Experiments show SemAlignVC significantly reduces timbre leakage, outperforming baselines in speaker timbre similarity, intelligibility, and naturalness, making it a robust, privacy-preserving, and generalizable VC solution.
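The paper's exact training objective is not reproduced on this page; as a minimal illustrative sketch only, the core idea of SemAlign (pulling audio-derived semantic features toward text representations so that the content encoding need not carry speaker timbre) can be expressed as a cosine-alignment loss. The function name, shapes, and the assumption that audio frames and text tokens have already been aligned upstream are all hypothetical:

```python
import numpy as np

def semalign_loss(audio_feats: np.ndarray, text_embs: np.ndarray) -> float:
    """Hypothetical SemAlign-style objective: 1 minus the mean cosine
    similarity between audio semantic features and text embeddings,
    both of shape (T, D). Frame-to-token alignment is assumed to have
    been computed upstream; the real system may differ substantially."""
    a = audio_feats / (np.linalg.norm(audio_feats, axis=-1, keepdims=True) + 1e-8)
    t = text_embs / (np.linalg.norm(text_embs, axis=-1, keepdims=True) + 1e-8)
    cos = np.sum(a * t, axis=-1)    # per-position cosine similarity
    return float(1.0 - cos.mean())  # 0 when representations coincide

# Toy check: identical representations incur (near-)zero loss.
x = np.random.randn(50, 256)
print(semalign_loss(x, x))
```

Minimizing such a loss pushes the audio encoder toward a speaker-independent semantic space, which is what lets the downstream autoregressive transformer take timbre solely from the reference audio.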
Source | Timbre reference | KNNVC | HierSpeech++ | UniAudio | SemAlignVC |
---|---|---|---|---|---|
In this section, we demonstrate that SemAlignVC can reproduce a similar timbre from references in languages not seen during training, further confirming the disentanglement of timbre and content.
 | Timbre reference 1 | Timbre reference 2 |
---|---|---|
Input audio | Output 1 | Output 2 |
To see how SemAlignVC handles non-human audio, we apply it to clips of non-human sounds.
Input | Reference | Generated |
---|---|---|
Human | Dog | Output |
Dog | Human | Output |
Human | Chicken | Output |
Chicken | Human | Output |
Human | Music Beats | Output |
Music Beats | Human | Output |