Enhancing zero-shot timbre conversion using semantic alignment
This project is maintained by shivammehta25
Zero-shot voice conversion (VC) synthesizes speech in a target speaker’s voice while preserving linguistic and paralinguistic content. However, timbre leakage—where source speaker traits persist—remains a challenge, especially in neural codec and LLM-based VC, where quantized representations entangle speaker identity with content. We introduce SemAlignVC, an architecture designed to prevent timbre leakage using SemAlign, a novel method that aligns text and audio representations to ensure speaker-independent semantic encoding. This disentangled representation conditions an autoregressive transformer for high-fidelity conversion without explicit speaker embeddings. Experiments show SemAlignVC significantly reduces timbre leakage, outperforming baselines in speaker timbre similarity, intelligibility, and naturalness, making it a robust, privacy-preserving, and generalizable VC solution.
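The paper's exact training objective is not reproduced on this page; as a minimal illustrative sketch only, the core idea of SemAlign (pulling audio-derived semantic features toward text representations so that the content encoding need not carry speaker timbre) can be expressed as a cosine-alignment loss. The function name, shapes, and the assumption that audio frames and text tokens have already been aligned upstream are all hypothetical:

```python
import numpy as np

def semalign_loss(audio_feats: np.ndarray, text_embs: np.ndarray) -> float:
    """Hypothetical SemAlign-style objective: 1 minus the mean cosine
    similarity between audio semantic features and text embeddings,
    both of shape (T, D). Frame-to-token alignment is assumed to have
    been computed upstream; the real system may differ substantially."""
    a = audio_feats / (np.linalg.norm(audio_feats, axis=-1, keepdims=True) + 1e-8)
    t = text_embs / (np.linalg.norm(text_embs, axis=-1, keepdims=True) + 1e-8)
    cos = np.sum(a * t, axis=-1)    # per-position cosine similarity
    return float(1.0 - cos.mean())  # 0 when representations coincide

# Toy check: identical representations incur (near-)zero loss.
x = np.random.randn(50, 256)
print(semalign_loss(x, x))
```

Minimizing such a loss pushes the audio encoder toward a speaker-independent semantic space, which is what lets the downstream autoregressive transformer take timbre solely from the reference audio.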
Source | Timbre reference | KNNVC | HierSpeech++ | UniAudio | SemAlignVC |
---|---|---|---|---|---|
In this section, we demonstrate that SemAlignVC can reproduce a similar timbre from references in languages not seen during training, further confirming the disentanglement of timbre and content.
 | Timbre reference 1 | Timbre reference 2 |
---|---|---|
Input audio | Output 1 | Output 2 |
To see how SemAlignVC handles non-human audio, we apply it to clips of non-human sounds.
Input | Reference | Generated |
---|---|---|
Human | Dog | Output |
Dog | Human | Output |
Human | Chicken | Output |
Chicken | Human | Output |
Human | Music Beats | Output |
Music Beats | Human | Output |