SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment

Shivam Mehta, Yingru Liu, Zhenyu Tang, Kainan Peng, Vimal Manohar, Shun Zhang, Mike Seltzer, Qing He, Mingbo Ma

Abstract

Zero-shot voice conversion (VC) synthesizes speech in a target speaker’s voice while preserving linguistic and paralinguistic content. However, timbre leakage—where source speaker traits persist—remains a challenge, especially in neural codec and LLM-based VC, where quantized representations entangle speaker identity with content. We introduce SemAlignVC, an architecture designed to prevent timbre leakage using SemAlign, a novel method that aligns text and audio representations to ensure speaker-independent semantic encoding. This disentangled representation conditions an autoregressive transformer for high-fidelity conversion without explicit speaker embeddings. Experiments show SemAlignVC significantly reduces timbre leakage, outperforming baselines in speaker timbre similarity, intelligibility, and naturalness, making it a robust, privacy-preserving, and generalizable VC solution.
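To make the idea of semantic alignment concrete, here is a minimal, hypothetical sketch of an alignment objective that pulls frame-level audio representations toward the corresponding text representations, so the audio encoder is discouraged from retaining speaker timbre. The function name and the plain cosine-distance formulation are illustrative assumptions, not the paper's actual SemAlign loss.

```python
import numpy as np

def semalign_loss(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Illustrative alignment loss between audio and text embeddings.

    Both inputs are (T, D) sequences, assumed already time-aligned.
    A cosine-distance penalty like this is one simple way to push the
    audio encoder toward a speaker-independent semantic space; the
    paper's SemAlign method may differ in detail.
    """
    # L2-normalise each time step's embedding
    a = audio_emb / np.linalg.norm(audio_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    # 1 - cosine similarity, averaged over time steps
    return float(np.mean(1.0 - np.sum(a * t, axis=-1)))

# Perfectly aligned embeddings incur (near-)zero loss
x = np.random.randn(10, 64)
print(semalign_loss(x, x))  # ≈ 0.0
```

In practice such a term would be minimised jointly with the reconstruction loss of the autoregressive transformer, so the semantic bottleneck stays informative about content while shedding timbre.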

Subjective Evaluation

[Audio samples: Source | Timbre reference | KNNVC | HierSpeech++ | UniAudio | SemAlignVC]

Objective Evaluation

[Audio samples: Source | Timbre reference | KNNVC | HierSpeech++ | UniAudio | SemAlignVC]

Timbre conversion using different-language references

In this section, we demonstrate that SemAlignVC can transfer the reference timbre even when the reference is spoken in a language not seen during training, further confirming the disentanglement of timbre and content.

[Audio samples: Timbre reference 1 | Timbre reference 2]
[Audio samples: Input audio | Output 1 | Output 2]

Converting non-speech voices

To see how SemAlignVC handles non-human sounds, we apply it to audio clips of non-speech sounds such as animal vocalisations and music.

Input | Reference | Generated
Human | Dog | [audio]
Dog | Human | [audio]
Human | Chicken | [audio]
Chicken | Human | [audio]
Human | Music beats | [audio]
Music beats | Human | [audio]