Waymo-3DSkelMo: A Multi-Agent 3D Skeletal Motion Dataset for Pedestrian Interaction Modeling in Autonomous Driving

1 University of Glasgow, 2 Wuhan University
ACM Multimedia 2025

*Indicates Corresponding Author

Abstract

Large-scale high-quality 3D motion datasets with multi-person interactions are crucial for data-driven models in autonomous driving to achieve fine-grained pedestrian interaction understanding in dynamic urban environments. However, existing datasets mostly rely on 3D poses estimated from monocular RGB video frames, which suffer from occlusion and a lack of temporal continuity and therefore yield unrealistic, low-quality human motion. In this paper, we introduce Waymo-3DSkelMo, the first large-scale dataset providing high-quality, temporally coherent 3D skeletal motions with explicit interaction semantics, derived from the Waymo Perception dataset. Our key insight is to utilize 3D human body shape and motion priors to enhance the quality of the 3D pose sequences extracted from the raw LiDAR point clouds. The dataset covers over 14,000 seconds across more than 800 real driving scenarios, including rich interactions among an average of 27 agents per scene (with up to 250 agents in the largest scene). Furthermore, we establish 3D pose forecasting benchmarks under varying pedestrian densities, and the results demonstrate the dataset's value as a foundational resource for future research on fine-grained human behavior understanding in complex urban environments.

Statistics

Comparison of statistics between the newly proposed Waymo-3DSkelMo and existing human pose forecasting datasets.

Experiments

Quantitative comparison of motion generation methods with and without Frenet-frame alignment. For metrics marked with ↓, lower values are better. Within each setting (with/without Frenet), the best result for each metric is highlighted in bold.

Version 2 enhances the optimization by incorporating Waymo-annotated 3D bounding boxes as strong geometric constraints and upweighting high-quality LiDAR-based pseudo-labels, resulting in significantly improved joint-position accuracy and overall motion quality.

Benchmarking

Version 2 results for JPE, APE, and FDE (in mm) under different numbers of persons. We compare short-term predictions using TBIFormer across varying levels of multi-person interaction.
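As a rough illustration of how the three benchmark metrics are commonly computed in multi-person pose forecasting, the sketch below uses NumPy. The definitions here are assumptions based on common usage, not taken from this page: JPE is the mean joint error in global coordinates, APE is the mean joint error after subtracting each person's root joint, and FDE is the root-joint error at the final predicted frame. The array layout `(persons, frames, joints, 3)`, the function names, and the choice of joint index 0 as the root are all hypothetical.

```python
import numpy as np

# Assumed layout: arrays of shape (persons, frames, joints, 3), in mm.
# Joint index 0 is assumed to be the root (hip); adjust for your skeleton.

def jpe(pred, gt):
    """Mean L2 error over all persons, frames, and joints (global coordinates)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def ape(pred, gt, root=0):
    """Mean L2 error after removing global translation via the root joint."""
    pred_local = pred - pred[:, :, root:root + 1, :]
    gt_local = gt - gt[:, :, root:root + 1, :]
    return float(np.linalg.norm(pred_local - gt_local, axis=-1).mean())

def fde(pred, gt, root=0):
    """Mean L2 error of the root joint at the final predicted frame."""
    return float(np.linalg.norm(pred[:, -1, root] - gt[:, -1, root], axis=-1).mean())

# Toy check: every predicted joint is shifted by (3, 4, 0) mm from ground truth.
gt = np.zeros((2, 5, 4, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
print(jpe(pred, gt), ape(pred, gt), fde(pred, gt))  # 5.0 0.0 5.0
```

In the toy check, a pure translation produces a nonzero JPE and FDE but a zero APE, which is why pose accuracy and trajectory accuracy are usually reported separately.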

Waymo-3DSkelMo Dataset

BibTeX

@inproceedings{zhu2025waymo,
  title={Waymo-3DSkelMo: A Multi-Agent 3D Skeletal Motion Dataset for Pedestrian Interaction Modeling in Autonomous Driving},
  author={Zhu, Guangxun and Fan, Shiyu and Dai, Hang and Ho, Edmond SL},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={13184--13190},
  year={2025}
}