CIRCLE: Capture in Rich Contextual Environments

We introduce CIRCLE, a dataset of 3D whole-body reaching motion and corresponding first-person RGB-D videos within a fully furnished virtual home.

Joao Pedro Araujo¹, Jiaman Li¹, Karthik Vetrivel¹, Rishi Agarwal¹, Deepak Gopinath², Jiajun Wu¹, Alexander Clegg², C. Karen Liu¹
¹Stanford University  ²Meta AI

Abstract

Synthesizing 3D human motion in a contextual, ecological environment is important for simulating realistic activities people perform in the real world. However, conventional optics-based motion capture systems are not suited for simultaneously capturing human movements and complex scenes. The lack of rich contextual 3D human motion datasets presents a roadblock to creating high-quality generative human motion models. We propose a novel motion acquisition system in which the actor perceives and operates in a highly contextual virtual world while being motion captured in the real world. Our system enables rapid collection of high-quality human motion in highly diverse scenes, without the concern of occlusion or the need for physical scene construction in the real world. We present CIRCLE, a dataset containing 10 hours of full-body reaching motion from 5 subjects across 9 scenes, paired with egocentric information about the environment represented in various forms, such as RGB-D videos. We use this dataset to train a model that generates human motion conditioned on scene information. Leveraging our dataset, the model learns to use egocentric scene information to achieve nontrivial reaching tasks in the context of complex 3D scenes.

Example poses from CIRCLE, captured from real human motion performed in a virtual environment.

CIRCLE at a Glance

10 hours of data · 5 subjects · 9 scenes · 7000+ sequences

After post-processing, each sequence in CIRCLE includes: the SMPL-X parameters of a body model fit to the mocap data with MoSh; the VR headset trajectory synchronized to the mocap skeleton; egocentric RGB-D video (rendered with Habitat and Blender); task-specific data, such as initial and goal conditions; and the scene in which the data were collected.
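For illustration, here is a hypothetical sketch of reading one sequence. The file name and all field names (smplx_params, headset_traj, and so on) are assumptions made for exposition; consult the released dataset for the actual layout.

```python
# Hypothetical sketch of loading one CIRCLE sequence. Field names are
# illustrative assumptions, not the dataset's actual schema.
import numpy as np

seq = np.load("circle_sequence_0001.npz", allow_pickle=True)
smplx_params = seq["smplx_params"]  # per-frame SMPL-X body parameters (MoSh fit)
headset_traj = seq["headset_traj"]  # VR headset trajectory, synced to the mocap skeleton
task = seq["task"]                  # task-specific data: initial and goal conditions
scene_id = seq["scene_id"]          # which of the nine scenes this sequence uses
# Egocentric RGB-D video is rendered separately (Habitat / Blender).
```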

Collection System

Our system enables collection of high-quality human motion in highly diverse scenes, without the concern of occlusion or the need for physical scene construction. For this dataset we used the Oculus Quest 2, though our system is hardware-agnostic.

Real World · Virtual World · Egocentric View

Wide Variety of Full-Body Motion Data

CIRCLE sequences range in difficulty, from easy sequences with no obstacles to sequences that require considerable full-body adjustment. The videos below show the live-capture skeleton and the scene rendered within the headset while the subject completes each task.

Easy · Medium · Hard

Model

Given a start pose, goal position, and 3D scene (a), we first initialize the input poses using constant local joint rotations and a linearly interpolated root translation (b). For each time step, we extract scene features with BPS or PointNet++ from the initialized poses, a fixed point set sampled from a cylinder, and the scene point clouds (c). We then concatenate the scene features with the initialized poses and feed them to a transformer-based model (d) to generate the final poses (e).
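To make the scene-encoding step concrete, here is a minimal sketch of the BPS variant, assuming the scene point cloud has already been cropped around the body at each time step. The basis-point count and cylinder dimensions are illustrative placeholders, not the values used in the paper.

```python
# Minimal BPS (Basis Point Set) sketch: a fixed set of basis points is
# sampled once from a cylinder; the per-step scene feature is the
# distance from each basis point to its nearest scene point.
# n_points, radius, and height are illustrative, not the paper's values.
import numpy as np
from scipy.spatial import cKDTree

def sample_cylinder_basis(n_points=1024, radius=1.0, height=2.0, seed=0):
    """Sample a fixed basis point set inside a vertical cylinder."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n_points)
    r = radius * np.sqrt(rng.uniform(0.0, 1.0, n_points))  # uniform over the disk
    z = rng.uniform(0.0, height, n_points)
    return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)

def bps_features(scene_points, basis_points):
    """Distance from each basis point to the nearest scene point."""
    dists, _ = cKDTree(scene_points).query(basis_points)
    return dists  # (n_points,): one scalar per basis point
```

Per frame, the cylinder would be centered on the initialized body so the features track the motion; the resulting feature vector is concatenated with the initialized pose before entering the transformer.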

Model Input

We initialize the input to our model with a constant pose whose translation linearly interpolates between the start and goal wrist positions, as sketched below. The output is the model's edited motion.
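As a concrete illustration, here is a minimal sketch of this initialization, assuming poses are represented as local joint rotations plus a root translation (per the pipeline description above); shapes and names are illustrative.

```python
# Sketch of the input initialization: hold the start pose's local joint
# rotations fixed and linearly interpolate the translation from start to
# goal over T frames. Representations here are assumptions for exposition.
import numpy as np

def initialize_motion(start_rotations, start_root, goal_root, num_frames):
    """start_rotations: (J, 3) local joint rotations, held constant.
    start_root, goal_root: (3,) translations to interpolate between."""
    alphas = np.linspace(0.0, 1.0, num_frames)[:, None]          # (T, 1)
    roots = (1.0 - alphas) * start_root + alphas * goal_root     # (T, 3)
    rotations = np.repeat(start_rotations[None], num_frames, axis=0)  # (T, J, 3)
    return rotations, roots
```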

Input · Output · Ground Truth

Model Results

We compare our approach against two baselines, GOAL and NO-SCENE. GOAL is an MLP architecture that, given a start and a goal pose, autoregressively predicts the next pose in the sequence. NO-SCENE uses the same architecture as our model, but without scene encoding.
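For clarity, the GOAL-style baseline's autoregressive rollout can be sketched as follows; mlp here is a stand-in for the learned next-pose predictor, not the released implementation.

```python
# Illustrative sketch of an autoregressive rollout like the GOAL baseline:
# an MLP maps (current pose, goal pose) to the next pose, applied repeatedly.
def rollout(mlp, start_pose, goal_pose, num_frames):
    poses = [start_pose]
    for _ in range(num_frames - 1):
        poses.append(mlp(poses[-1], goal_pose))  # feed prediction back in
    return poses
```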

Our Model (BPS) · No Scene · GOAL · Ground Truth