Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture

Xuanchen Li1     Jianyu Wang1     Yuhao Cheng1†
Yikun Zeng1     Xingyu Ren1     Wenhan Zhu2     Weiming Zhao3     Yichao Yan1‡
1MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
2Xueshen AI    3Student Innovation Center, Shanghai Jiao Tong University
†Project leader    ‡Corresponding author
CVPR 2025
Teaser

We present TexTalk4D, a high-precision 4D audio-mesh-texture-aligned dataset consisting of 100 minutes of scan-level meshes with detailed 8K textures. Building on this dataset, we present TexTalker, which generates geometry and aligned dynamic textures from speech simultaneously, advancing towards highly personalized textured facial animation.

Abstract

Significant progress has been made in speech-driven 3D face animation, but most works focus on learning the motion of the mesh/geometry and ignore the impact of dynamic texture. In this work, we reveal that dynamic texture plays a key role in rendering high-fidelity talking avatars, and introduce a high-resolution 4D dataset TexTalk4D, consisting of 100 minutes of audio-synced scan-level meshes with detailed 8K dynamic textures from 100 subjects. Based on the dataset, we explore the inherent correlation between motion and texture, and propose a diffusion-based framework TexTalker to simultaneously generate facial motions and dynamic textures from speech. Furthermore, we propose a novel pivot-based style injection strategy to capture the complexity of different texture and motion styles, which allows disentangled control. TexTalker, as the first method to generate audio-synced facial motion with dynamic texture, not only outperforms prior art in synthesizing facial motions, but also produces realistic textures that are consistent with the underlying facial movements.

Video

Pipeline

Overview of TexTalker. (a) We train quantized autoencoders to unify the representation of geometry and texture with better efficiency. (b) Based on the learned low-dimensional animation primitives, we employ a latent diffusion model (LDM) to jointly diffuse the geometry and texture latent offsets ∆z from the style pivots p, enabling long-term correlation learning. (c) By adding back the style pivots, the motion and wrinkle styles can be controlled independently. Finally, the decoders recover the personalized textured animation assets.
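
To make this flow concrete, below is a minimal, self-contained PyTorch-style sketch of the inference path: geometry and texture latent offsets ∆z are jointly sampled from audio, and the style pivots are added back before decoding. All module names, dimensions, and the plain DDPM loop are illustrative assumptions, not the authors' released implementation.

# A minimal, self-contained PyTorch sketch of the inference path above.
# All module names, dimensions, and the plain DDPM loop are illustrative
# assumptions, not the authors' released implementation.
import torch
import torch.nn as nn

LATENT_DIM, AUDIO_DIM, STEPS = 64, 32, 50

class Denoiser(nn.Module):
    # Toy stand-in for the LDM network: it predicts the noise on the
    # concatenated geometry + texture latent offsets, conditioned on audio.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * LATENT_DIM + AUDIO_DIM + 1, 256),
            nn.SiLU(),
            nn.Linear(256, 2 * LATENT_DIM),
        )

    def forward(self, noisy_offsets, audio_feat, t):
        # Normalized timestep appended to the conditioning signal.
        t_emb = torch.full((noisy_offsets.shape[0], 1), t / STEPS)
        return self.net(torch.cat([noisy_offsets, audio_feat, t_emb], dim=-1))

@torch.no_grad()
def generate(audio_feat, motion_pivot, wrinkle_pivot, denoiser, betas):
    # Jointly sample geometry/texture latent offsets dz, then add back the
    # style pivots so motion and wrinkle styles stay independently swappable.
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    dz = torch.randn(audio_feat.shape[0], 2 * LATENT_DIM)  # start from noise
    for t in reversed(range(STEPS)):                        # standard DDPM reverse loop
        eps = denoiser(dz, audio_feat, t)
        dz = (dz - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            dz = dz + torch.sqrt(betas[t]) * torch.randn_like(dz)
    geo_offset, tex_offset = dz.split(LATENT_DIM, dim=-1)
    # Pivot injection: offset + pivot gives the per-frame latents that the
    # geometry and texture decoders would turn into animation assets.
    return geo_offset + motion_pivot, tex_offset + wrinkle_pivot

# Usage with random stand-in inputs (one row per animation frame).
denoiser, betas = Denoiser(), torch.linspace(1e-4, 0.02, STEPS)
audio = torch.randn(8, AUDIO_DIM)
motion_pivot, wrinkle_pivot = torch.randn(1, LATENT_DIM), torch.randn(1, LATENT_DIM)
geo_latents, tex_latents = generate(audio, motion_pivot, wrinkle_pivot, denoiser, betas)
print(geo_latents.shape, tex_latents.shape)  # torch.Size([8, 64]) torch.Size([8, 64])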

Dataset

TexTalk4D

TexTalk4D consists of 100 minutes of scan-level meshes with detailed 8K textures from 100 identities.
We build the dataset with Topo4D.
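
To make the audio-mesh-texture alignment concrete, the sketch below shows one hypothetical way a single clip could be organized in Python. The field names, shapes, and alignment helper are assumptions for illustration, not the released file format.

# A hypothetical layout for one aligned TexTalk4D clip. Field names, shapes,
# and the alignment helper are assumptions for illustration only, not the
# released file format.
from dataclasses import dataclass
import numpy as np

@dataclass
class TexTalkClip:
    audio: np.ndarray      # (num_audio_samples,) mono waveform
    sample_rate: int       # audio sampling rate, e.g. 16000
    vertices: np.ndarray   # (num_frames, num_vertices, 3), topology shared across frames
    faces: np.ndarray      # (num_faces, 3), fixed mesh connectivity
    textures: np.ndarray   # (num_frames, 8192, 8192, 3), per-frame 8K UV textures
    fps: float             # mesh/texture frame rate used to align with the audio

def frame_to_audio_span(clip: TexTalkClip, frame_idx: int) -> slice:
    # Audio sample range aligned with one mesh/texture frame.
    start = int(frame_idx / clip.fps * clip.sample_rate)
    end = int((frame_idx + 1) / clip.fps * clip.sample_rate)
    return slice(start, end)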

Results

Motion Comparison

Texture Comparison

TexTalker can generate textures with a larger dynamic range while maintaining high consistency with the underlying facial movements.

Disentangled Style Control

The speaking and wrinkling styles can be independently switched.

Generalization to Other Languages

Trained on Chinese data, TexTalker generalizes well to unseen languages.

BibTeX


@article{li2025textalker,
  title={Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture},
  author={Li, Xuanchen and Wang, Jianyu and Cheng, Yuhao and Zeng, Yikun and Ren, Xingyu and Zhu, Wenhan and Zhao, Weiming and Yan, Yichao},
  journal={arXiv preprint arXiv:2503.00495},
  year={2025}
}