Toward Speaker-Preserving Arabic Speech Synthesis: A Bilingual Resource and Comparative Evaluation of Cross-Lingual Voice Cloning
Abstract
Cross-lingual voice cloning (CLVC) aims to synthesize speech in a target language while preserving a speaker's vocal identity, even when the speaker has never been recorded in that language. Despite recent progress in multilingual speech generation, Arabic remains underrepresented because high-quality bilingual and same-speaker multilingual speech resources are scarce. We present a systematic evaluation of zero-shot CLVC for Arabic and five additional target languages (Chinese, French, German, Russian, and Japanese) across recent autoregressive, diffusion-based, and flow-matching architectures. Using reference speakers from the ACL-60/60 dataset who were recorded exclusively in English, we measure the intelligibility of the generated speech with ASR-based transcription error rates and speaker identity preservation with ECAPA-TDNN speaker similarity. Our experiments reveal a persistent trade-off between linguistic intelligibility and speaker identity preservation: models with lower transcription error rates across languages often exhibit degraded speaker similarity, whereas systems that better preserve speaker identity struggle with linguistic consistency, particularly in Arabic. Among the evaluated systems, only a limited subset produces consistently usable Arabic cross-lingual speech. To support reproducible benchmarking and future research in multilingual speech technologies, we publicly release a same-speaker bilingual test set that pairs English reference recordings with Arabic synthesized speech, along with a fully reproducible CLVC evaluation pipeline.
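The two families of metrics named above can be illustrated with a minimal sketch. It is not the released evaluation pipeline: the function names (`edit_distance`, `error_rate`, `cosine_similarity`) are hypothetical, the ASR transcripts and speaker embeddings are assumed to have been produced already by whatever ASR system and ECAPA-TDNN extractor are used, and no text normalization is applied. The sketch assumes that the transcription error rate is an edit distance normalized by reference length (at word or character level) and that speaker similarity is the cosine similarity between a reference embedding and a synthesized-speech embedding.

```python
import math
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """Edit distance normalized by reference length.

    unit="word" gives a word error rate; unit="char" gives a character
    error rate (spaces removed), which is an assumption suited to
    languages written without word boundaries.
    """
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy inputs for illustration only; these are not data from the paper.
    print(error_rate("the cat sat down", "the cat sit"))
    print(cosine_similarity([0.2, 0.5, 0.1], [0.1, 0.6, 0.0]))
```

Under these assumptions, the trade-off described above would appear as systems with a low error rate but a low cosine similarity, or the reverse.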