FinePOSE

FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models

Jinglin Xu¹, Yijie Guo², Yuxin Peng²*
¹ School of Intelligence Science and Technology, University of Science and Technology Beijing
² Wangxuan Institute of Computer Technology, Peking University
* Corresponding author

Abstract

The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to predict human joint coordinates in 3D space. Despite recent advances, deep learning-based methods mostly ignore the possibility of coupling accessible texts with naturally feasible knowledge of humans, and so miss valuable implicit supervision that could guide the 3D HPE task. Moreover, previous efforts often study this task from the perspective of the whole human body, neglecting the fine-grained guidance hidden in different body parts. To this end, we present FinePOSE, a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model for 3D HPE. It consists of three core blocks that enhance the reverse process of the diffusion model: (1) The Fine-grained Part-aware Prompt learning (FPP) block constructs fine-grained part-aware prompts by coupling accessible texts and naturally feasible knowledge of body parts with learnable prompts to model implicit guidance. (2) The Fine-grained Prompt-pose Communication (FPC) block establishes fine-grained communications between learned part-aware prompts and poses to improve the denoising quality. (3) The Prompt-driven Timestamp Stylization (PTS) block integrates the learned prompt embedding with temporal information related to the noise level to enable adaptive adjustment at each denoising step. Extensive experiments on public single-human pose estimation datasets show that FinePOSE outperforms state-of-the-art methods. We further extend FinePOSE to multi-human pose estimation, where it achieves an average MPJPE of 34.3 mm on the EgoHumans dataset, demonstrating its potential to handle complex multi-human scenarios.
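For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joint positions, reported in the units of the input (millimetres above). The following is a minimal sketch of that definition only; the function name `mpjpe` and the toy joint values are illustrative assumptions, and any alignment step that a benchmark protocol may apply before measuring (for example root-joint alignment) is omitted.

```python
import math


def mpjpe(predicted, target):
    """Mean Euclidean distance between corresponding joints, in input units.

    `predicted` and `target` are equal-length sequences of (x, y, z) tuples.
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("pose lists must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(predicted, target)) / len(predicted)


# Toy example with three joints; coordinates are made-up values in millimetres.
target = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (100.0, 200.0, 0.0)]
predicted = [(3.0, 4.0, 0.0), (100.0, 0.0, 0.0), (100.0, 200.0, 12.0)]

# Per-joint errors are 5, 0 and 12 mm, so the mean is 17 / 3 mm.
error_mm = mpjpe(predicted, target)
```

For a video, the same quantity would be averaged over all frames as well as over joints.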

Pipeline

We propose a new fine-grained part-aware prompt learning mechanism coupled with diffusion models. It provides high-quality generation that is controllable at the level of human body parts, which benefits the 3D human pose estimation task. Our method encodes multi-granularity information about the action class, coarse- and fine-grained human parts, and kinematics, and establishes fine-grained communications between learnable part-aware prompts and poses to enhance the denoising capability.
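One common way to let pose features exchange information with prompt embeddings is scaled dot-product cross-attention, in which each pose token queries the set of prompt tokens. The sketch below illustrates that general idea only and is not the FinePOSE implementation: the single-head form without learned projections, the residual update, and the names `cross_attention`, `pose_tokens` and `prompt_tokens` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(pose_tokens, prompt_tokens):
    """Each pose token (query) attends over all prompt tokens (keys = values).

    Hypothetical simplification: one head, no learned projection matrices.
    Returns (updated_pose_tokens, attention_weights).
    """
    dim = len(prompt_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    updated, weights = [], []
    for query in pose_tokens:
        # Scaled dot product between this pose token and every prompt token.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in prompt_tokens]
        attn = softmax(scores)
        # Attention-weighted sum of prompt tokens, added residually to the pose token.
        context = [sum(a * key[d] for a, key in zip(attn, prompt_tokens))
                   for d in range(dim)]
        updated.append([q + c for q, c in zip(query, context)])
        weights.append(attn)
    return updated, weights


# Toy 2-dimensional tokens (made-up values): two pose tokens, three prompt tokens.
pose_tokens = [[1.0, 0.0], [0.0, 1.0]]
prompt_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
updated, weights = cross_attention(pose_tokens, prompt_tokens)
```

In this toy setup each pose token places the most weight on the prompt token it is most similar to, which is the sense in which prompts can guide individual parts of the pose representation.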