Waymo-3DSkelMo: A Multi-Agent 3D Skeletal Motion Dataset for Pedestrian Interaction Modeling in Autonomous Driving
Abstract
Large-scale, high-quality 3D motion datasets with multi-person interactions are crucial for data-driven models in autonomous driving to achieve fine-grained pedestrian interaction understanding in dynamic urban environments. However, most existing datasets rely on 3D poses estimated from monocular RGB video frames, which suffer from occlusion and a lack of temporal continuity, resulting in unrealistic, low-quality human motion. In this paper, we introduce Waymo-3DSkelMo, the first large-scale dataset providing high-quality, temporally coherent 3D skeletal motions with explicit interaction semantics, derived from the Waymo Perception dataset. Our key insight is to use 3D human body shape and motion priors to enhance the quality of the 3D pose sequences extracted from the raw LiDAR point clouds. The dataset covers over 14,000 seconds across more than 800 real driving scenarios, including rich interactions among an average of 27 agents per scene (with up to 250 agents in the largest scene). Furthermore, we establish 3D pose forecasting benchmarks under varying pedestrian densities; the results demonstrate the dataset's value as a foundational resource for future research on fine-grained human behavior understanding in complex urban environments.
Waymo-3DSkelMo: a high-quality 3D multi-pedestrian motion dataset created by applying human motion and shape priors to LiDAR range images from the Waymo Perception dataset. (Blue) Point clouds of a pedestrian, sampled every 0.5 seconds from the LiDAR range images. For each sample, a 3D body mesh is estimated from the partial LiDAR point cloud using a 3D human shape prior. (Purple) The Waymo dataset comes with only very sparsely annotated 3D skeletal poses. (Yellow) Starting from the skeletal poses extracted from the estimated body meshes, a motion prior is used to enhance the motion quality.
Overview of our pipeline. (a) Point clouds are first extracted from the range images of all five Waymo LiDAR sensors, then transformed into a world coordinate system and fused into a unified point cloud representation. (b) Mesh recovery is performed on all point clouds using a human-body prior, followed by motion generation via a motion prior. (c) Skeletal motions are regressed from the SMPL parameters. (d) An example of point clouds and 3D poses of differing quality.
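Step (a) can be illustrated with a minimal sketch. It assumes each sensor's points are already decoded from its range image into an (N, 3) array in the sensor frame, and that a 4x4 sensor-to-vehicle extrinsic per sensor and a 4x4 vehicle-to-world pose are available. The function and argument names are hypothetical and are not taken from the Waymo tooling or the authors' code.

```python
import numpy as np

def to_homogeneous(points):
    """Append a column of ones so 4x4 rigid transforms can be applied."""
    return np.hstack([points, np.ones((points.shape[0], 1))])

def fuse_point_clouds(sensor_points, sensor_to_vehicle, vehicle_to_world):
    """Map each sensor's points into the world frame and stack them.

    sensor_points: list of (N_i, 3) arrays, one per LiDAR (assumed input).
    sensor_to_vehicle: list of 4x4 extrinsics, one per LiDAR (assumed input).
    vehicle_to_world: 4x4 pose of the vehicle at this frame (assumed input).
    """
    fused = []
    for pts, extrinsic in zip(sensor_points, sensor_to_vehicle):
        # sensor frame -> vehicle frame -> world frame
        world = (vehicle_to_world @ extrinsic @ to_homogeneous(pts).T).T
        fused.append(world[:, :3])
    return np.vstack(fused)
```

Composing the two transforms per sensor keeps every point in a single world frame, so partial views of the same pedestrian from different sensors line up before mesh recovery.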
Example scenes from the Waymo-3DSkelMo dataset.
Statistics
Comparison of statistics between the newly proposed Waymo-3DSkelMo and existing human pose forecasting datasets.
Experiments
Quantitative comparison of motion generation methods with and without Frenet-frame alignment. Metrics marked with ↓ indicate that lower values are better. Within each setting (with/without Frenet), the best result for each metric is highlighted in bold.
Benchmarking
Version 2 results of JPE, APE, and FDE (in mm) under different numbers of persons. We compare short-term predictions using TBIFormer across varying levels of multi-person interaction.
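The three metrics are not defined on this page. The sketch below assumes their common usage in multi-person pose forecasting: JPE as the mean joint position error in global coordinates, APE as the same error after removing each person's root joint, and FDE as the root displacement error at the final predicted frame. The array layout (persons, frames, joints, 3) and the root index 0 are also assumptions.

```python
import numpy as np

def jpe(pred, gt):
    """Mean L2 joint error over all persons, frames, and joints (global coords)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def ape(pred, gt, root=0):
    """Mean L2 joint error after subtracting each person's root joint per frame."""
    pred_local = pred - pred[:, :, root:root + 1]
    gt_local = gt - gt[:, :, root:root + 1]
    return float(np.linalg.norm(pred_local - gt_local, axis=-1).mean())

def fde(pred, gt, root=0):
    """Mean L2 error of the root joint at the final predicted frame."""
    return float(np.linalg.norm(pred[:, -1, root] - gt[:, -1, root], axis=-1).mean())
```

Under these definitions, a prediction that is shifted rigidly from the ground truth has non-zero JPE and FDE but zero APE, which is why the three are usually reported together. If poses are stored in metres, multiply by 1000 to report millimetres.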
Waymo-3DSkelMo Dataset
BibTeX
@inproceedings{zhu2025waymo,
title={Waymo-3DSkelMo: A Multi-Agent 3D Skeletal Motion Dataset for Pedestrian Interaction Modeling in Autonomous Driving},
author={Zhu, Guangxun and Fan, Shiyu and Dai, Hang and Ho, Edmond SL},
booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
pages={13184--13190},
year={2025}
}