Synthesizing 3D human motion in a contextual, ecological environment is important for simulating realistic activities people perform in the real world. However, conventional optics-based motion capture systems are not suited for simultaneously capturing human movements and complex scenes. The lack of rich contextual 3D human motion datasets presents a roadblock to creating high-quality generative human motion models. We propose a novel motion acquisition system in which the actor perceives and operates in a highly contextual virtual world while being motion captured in the real world. Our system enables rapid collection of high-quality human motion in highly diverse scenes, without the concern of occlusion or the need for physical scene construction in the real world. We present CIRCLE, a dataset containing 10 hours of full-body reaching motion from 5 subjects across 9 scenes, paired with egocentric information about the environment in various forms, such as RGB-D videos. We use this dataset to train a model that generates human motion conditioned on scene information. Leveraging our dataset, the model learns to use egocentric scene information to achieve nontrivial reaching tasks in the context of complex 3D scenes.
Example poses from CIRCLE captured from real human motion in a virtual environment.
CIRCLE at a Glance
10 hours of data from 5 subjects across 9 scenes.
After post-processing, each sequence in CIRCLE includes the SMPL-X parameters of a body model fit to the mocap data using MoSh, the VR headset trajectory synchronized to the mocap skeleton,
egocentric RGB-D video (rendered with Habitat and Blender), task-specific data such as initial and goal conditions,
and the scene in which the data were collected.
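To make the per-sequence contents concrete, below is a minimal sketch of loading one sequence. The file layout, filename, and field names are hypothetical placeholders for illustration, not the dataset's actual schema.

```python
# Hypothetical per-sequence layout; the field names below are placeholders.
import numpy as np

def load_sequence(path):
    """Load one CIRCLE sequence and return its main components."""
    data = np.load(path, allow_pickle=True)
    return {
        # MoSh-fitted SMPL-X body parameters, one row per frame
        "smplx_params": data["smplx_params"],
        # VR headset 6-DoF trajectory, synchronized to the mocap skeleton
        "headset_traj": data["headset_traj"],
        # task-specific data: initial and goal conditions
        "start_pose": data["start_pose"],
        "goal_position": data["goal_position"],
        # identifier of the scene the sequence was collected in
        "scene_id": str(data["scene_id"]),
    }

seq = load_sequence("circle_sequence_0001.npz")  # hypothetical filename
print(seq["smplx_params"].shape, seq["scene_id"])
```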
Our system enables collection of high-quality human motion in highly diverse scenes, without the concern of
occlusion or the need for physical scene construction. For this dataset, we used the Oculus
Quest 2, though our system is hardware-agnostic.
Wide Variety of Full-Body Motion Data
CIRCLE sequences range in difficulty, from easy sequences with no obstacles to sequences that require
considerable full-body adjustment. The videos below show the live-capture skeleton and the scene rendered within the headset while the subject completes tasks.
Given a start pose, goal position, and 3D scene (a), we first initialize the input poses using constant local
joint rotations and a linearly interpolated root translation (b). For each time step, we extract scene features using BPS or PointNet++, computed from the
initialized poses, a fixed point set sampled from a cylinder, and the scene point cloud (c). We then concatenate the scene features with the initialized
poses and feed them to a transformer-based model (d) to generate the final poses (e).
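As an illustration of the scene-feature step (c), here is a minimal BPS-style sketch: a fixed set of basis points sampled from a cylinder, each encoded by the offset to its nearest scene point. The sampling parameters, point counts, and the use of scipy's cKDTree are assumptions for illustration; in practice the basis would be placed relative to the body at each time step.

```python
# BPS-style scene encoding sketch; parameters and shapes are illustrative.
import numpy as np
from scipy.spatial import cKDTree

def sample_cylinder_basis(n_points, radius=1.0, height=2.0, seed=0):
    """Sample a fixed set of basis points inside a vertical cylinder."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n_points)
    r = radius * np.sqrt(rng.uniform(0.0, 1.0, n_points))
    z = rng.uniform(0.0, height, n_points)
    return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)

def bps_encode(basis, scene_points):
    """For each basis point, return the offset to its nearest scene point."""
    tree = cKDTree(scene_points)
    _, nearest = tree.query(basis)            # index of the closest scene point
    return scene_points[nearest] - basis      # (n_basis, 3) offset vectors

basis = sample_cylinder_basis(1024)              # fixed point set
scene = np.random.rand(50000, 3)                 # placeholder scene point cloud
features = bps_encode(basis, scene).reshape(-1)  # flattened per-frame scene feature
```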
We initialize the input to our model with a constant pose whose position is linearly interpolated between the start and goal wrist
positions. The output is the edited motion.
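A minimal sketch of this initialization, assuming axis-angle joint rotations stored as simple arrays; shapes and names are illustrative, not the exact representation used by the model.

```python
# Sketch of input initialization: hold the start pose's local joint rotations
# constant and linearly interpolate the root translation toward the goal.
import numpy as np

def initialize_motion(start_rotations, start_root, goal_root, num_frames):
    """Repeat the start pose and lerp the root translation from start to goal."""
    rotations = np.repeat(start_rotations[None], num_frames, axis=0)  # (T, J, 3)
    alphas = np.linspace(0.0, 1.0, num_frames)[:, None]               # (T, 1)
    root = (1.0 - alphas) * start_root + alphas * goal_root           # (T, 3)
    return rotations, root

rotations, root = initialize_motion(
    start_rotations=np.zeros((22, 3)),      # placeholder local joint rotations
    start_root=np.array([0.0, 0.0, 0.0]),
    goal_root=np.array([1.0, 0.5, 0.0]),
    num_frames=120,
)
```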
We compare our approach against two baselines, GOAL and NO-SCENE. GOAL is an MLP architecture that, given a
start and goal pose, autoregressively predicts the next pose in the sequence. NO-SCENE uses the same
architecture as ours but without scene encoding.
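For reference, here is a minimal sketch of an autoregressive next-pose MLP in the spirit of the GOAL baseline; the layer sizes, pose dimensionality, and rollout loop are illustrative assumptions rather than the baseline's exact configuration.

```python
# Illustrative autoregressive next-pose MLP; dimensions are assumptions.
import torch
import torch.nn as nn

class NextPoseMLP(nn.Module):
    def __init__(self, pose_dim=165, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, current_pose, goal_pose):
        # predict the next pose from the current pose and the goal pose
        return self.net(torch.cat([current_pose, goal_pose], dim=-1))

def rollout(model, start_pose, goal_pose, num_frames):
    """Autoregressively predict a sequence of poses from start toward goal."""
    poses = [start_pose]
    for _ in range(num_frames - 1):
        poses.append(model(poses[-1], goal_pose))
    return torch.stack(poses, dim=1)  # (batch, num_frames, pose_dim)
```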