arXiv Analytics

arXiv:2311.05844 [cs.CV]

Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural Zero-shot Speech Synthesis from a Face Image

Minki Kang, Wooseok Han, Eunho Yang

Published 2023-09-25 (Version 1)

Generating a voice from a face image is crucial for developing virtual humans that can interact using their own unique voices, without relying on pre-recorded human speech. In this paper, we propose Face-StyleSpeech, a zero-shot Text-To-Speech (TTS) synthesis model that generates natural speech conditioned on a face image rather than on reference speech. We hypothesize that learning both speaker identity and prosody from a face image alone poses a significant challenge. To address this, our TTS model incorporates both a face encoder and a prosody encoder. The prosody encoder is specifically designed to model prosodic features that cannot be captured from a face image alone, allowing the face encoder to focus solely on capturing the speaker identity. Experimental results demonstrate that Face-StyleSpeech generates more natural speech from a face image than the baselines, even for face images on which the model has not been trained. Samples are available on our demo page: https://face-stylespeech.github.io.
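The conditioning scheme the abstract describes can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: all names, dimensions, and weight matrices are hypothetical stand-ins for learned components. It shows the key idea that the decoder is conditioned on two separate embeddings, a speaker-identity embedding from a face encoder and a prosody embedding from a prosody encoder, rather than on a single reference-speech embedding.

```python
# Hypothetical sketch of dual conditioning: face encoder for speaker
# identity, prosody encoder for prosodic features the face cannot supply.
import numpy as np

rng = np.random.default_rng(0)
D_FACE, D_PROS_IN, D_STYLE = 512, 80, 128  # illustrative dimensions

# Stand-in weights; a real model would learn these end-to-end.
W_face = rng.standard_normal((D_FACE, D_STYLE)) * 0.02
W_pros = rng.standard_normal((D_PROS_IN, D_STYLE)) * 0.02

def face_encoder(face_feat):
    """Map a face feature vector to a speaker-identity embedding."""
    return np.tanh(face_feat @ W_face)

def prosody_encoder(mel_frames):
    """Summarize prosodic features from mel-spectrogram frames."""
    return np.tanh(mel_frames.mean(axis=0) @ W_pros)

def condition_decoder(phoneme_hidden, face_feat, mel_frames):
    """Add the combined style embedding to every phoneme hidden state."""
    style = face_encoder(face_feat) + prosody_encoder(mel_frames)
    return phoneme_hidden + style  # broadcasts over time steps

# Toy inputs: 20 phoneme hidden states, one face vector, 100 mel frames.
h = rng.standard_normal((20, D_STYLE))
out = condition_decoder(h,
                        rng.standard_normal(D_FACE),
                        rng.standard_normal((100, D_PROS_IN)))
print(out.shape)  # (20, 128)
```

At inference time in a zero-shot setting, the prosody embedding would come from a prior or a reference-free predictor, so that only the face image is required; the sketch uses mel frames purely to show the two conditioning paths.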
