[Dataset statistics: hours of data, subjects, scenes, sequences]
After post-processing, each sequence in CIRCLE includes: SMPL-X parameters of a body model fit to the mocap data using MoSh; the VR headset trajectory synchronized to the mocap skeleton; egocentric RGB-D video rendered with Habitat and Blender; task-specific data, such as initial and goal conditions; and the 3D scene in which the motion was captured.
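As an illustration of the per-sequence contents, the sketch below shows how one might load these components from a single archive. The file format and field names (e.g. `smplx_params`, `headset_traj`) are hypothetical placeholders for this sketch, not the released schema.

```python
# Minimal sketch of loading one CIRCLE sequence. All field names below are
# illustrative assumptions; consult the released data for the actual schema.
import numpy as np

def load_sequence(path):
    """Load one sequence archive (assumed to be a .npz file for this sketch)."""
    data = np.load(path, allow_pickle=True)
    smplx_params = data["smplx_params"]    # per-frame SMPL-X body parameters (MoSh fit)
    headset_traj = data["headset_traj"]    # VR headset trajectory, synced to the mocap skeleton
    start_pose = data["start_pose"]        # task-specific initial condition
    goal_position = data["goal_position"]  # task-specific goal condition
    return smplx_params, headset_traj, start_pose, goal_position
```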
Our system enables the collection of high-quality human motion in highly diverse scenes, without concern for occlusion or the need for physical scene construction. For this dataset we used the Oculus Quest 2, though our system is hardware agnostic.
CIRCLE sequences range in difficulty, from easy sequences with no obstacles to sequences that require considerable full-body adjustment. The videos below show the live-capture skeleton and scene rendered within the headset while the subject completes tasks.
Given a start pose, goal position, and 3D scene (a), we first initialize the input poses using constant local joint rotations and linearly interpolated root translation (b). For each time step, we extract scene features using BPS or PointNet++, computed from the initialized poses, a fixed point set sampled from a cylinder, and the scene point cloud (c). We then concatenate the scene features with the initialized poses and feed them to a transformer-based model (d) to generate the final poses (e).
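The sketch below captures steps (b) through (e) in PyTorch, under illustrative assumptions: the pose and scene-feature dimensionalities and all layer sizes are placeholders rather than the paper's configuration, and the BPS/PointNet++ scene encoder is treated as an external input.

```python
# Hedged sketch of the scene-conditioned motion model. Dimensions and layer
# sizes are assumptions for illustration only.
import torch
import torch.nn as nn

class SceneAwareMotionModel(nn.Module):
    def __init__(self, pose_dim=135, scene_feat_dim=256, d_model=512,
                 nhead=8, num_layers=4):
        super().__init__()
        self.input_proj = nn.Linear(pose_dim + scene_feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.output_proj = nn.Linear(d_model, pose_dim)

    def forward(self, init_poses, scene_feats):
        # init_poses:  (B, T, pose_dim)       -- interpolated initialization (b)
        # scene_feats: (B, T, scene_feat_dim) -- per-step BPS/PointNet++ features (c)
        x = torch.cat([init_poses, scene_feats], dim=-1)  # concatenation (d)
        h = self.encoder(self.input_proj(x))
        return self.output_proj(h)                        # final poses (e)
```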
We initialize the input to our model with a constant pose that linearly interpolates between the start and goal wrist positions. The output is the edited motion.
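A minimal sketch of this initialization, assuming poses are represented as a root translation concatenated with local joint rotations (per the pipeline description above); the shapes and parameterization are assumptions.

```python
# Sketch of the initialization: hold the start pose's local joint rotations
# constant across all frames and linearly interpolate the translation between
# the two endpoint positions. Shapes are illustrative assumptions.
import torch

def initialize_motion(start_rotations, start_pos, goal_pos, num_frames):
    # start_rotations: (pose_dim,) local joint rotations of the start pose
    # start_pos, goal_pos: (3,) endpoint positions to interpolate between
    alphas = torch.linspace(0.0, 1.0, num_frames).unsqueeze(-1)       # (T, 1)
    positions = (1.0 - alphas) * start_pos + alphas * goal_pos        # (T, 3)
    rotations = start_rotations.unsqueeze(0).expand(num_frames, -1)   # (T, pose_dim)
    return torch.cat([positions, rotations], dim=-1)                  # (T, 3 + pose_dim)
```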
We compare our approach against two baselines: GOAL and NO-SCENE. GOAL is an MLP architecture that, given a start and goal pose, autoregressively predicts the next pose in the sequence. NO-SCENE uses the same architecture as ours but without the scene encoding.
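For illustration, the following is a hedged sketch of an autoregressive rollout in the style of the GOAL baseline: an MLP maps the current pose and the goal pose to the next pose, applied repeatedly. The layer sizes are placeholders, not the baseline's actual implementation.

```python
# Sketch of an autoregressive pose-prediction baseline. MLP sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

class AutoregressiveBaseline(nn.Module):
    def __init__(self, pose_dim=135, hidden_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * pose_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, pose_dim),
        )

    def rollout(self, start_pose, goal_pose, num_frames):
        # start_pose, goal_pose: (B, pose_dim)
        poses = [start_pose]
        for _ in range(num_frames - 1):
            # Predict the next pose from the current pose and the goal.
            poses.append(self.mlp(torch.cat([poses[-1], goal_pose], dim=-1)))
        return torch.stack(poses, dim=1)  # (B, T, pose_dim)
```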