One-Shot Transfer of Long-Horizon Extrinsic Manipulation Through Contact Retargeting

1Stanford University, 2NVIDIA


Abstract

Extrinsic manipulation, the use of environment contacts to achieve manipulation objectives, enables strategies that are otherwise impossible with a parallel-jaw gripper. However, orchestrating a long-horizon sequence of contact interactions between the robot, object, and environment is notoriously challenging due to scene diversity, the large action space, and difficult contact dynamics. We observe that most extrinsic manipulations are combinations of short-horizon primitives, each of which depends strongly on initializing from a desirable contact configuration to succeed. Therefore, we propose to generalize one extrinsic manipulation trajectory to diverse objects and environments by retargeting contact requirements. We prepare a single library of robust short-horizon, goal-conditioned primitive policies, and design a framework to compose state constraints stemming from the contact specifications of each primitive. Given a test scene and a single demo prescribing the primitive sequence, our method enforces the state constraints on the test scene and finds intermediate goal states using inverse kinematics. The goals are then tracked by the primitive policies. Using a 7+1 DoF robotic arm-gripper system, we achieved an overall success rate of 80.5% on hardware over 4 long-horizon extrinsic manipulation tasks, each with up to 4 primitives. Our experiments cover 10 objects and 6 environment configurations. We further show empirically that our method admits a wide range of demonstrations, and that contact retargeting is indeed the key to successfully combining primitives for long-horizon extrinsic manipulation.


Pipeline summary

We prepare a primitive library and define each primitive's contact requirements offline. Given a single demonstration, we identify the relative transforms between the initial and final object states of the primitives. These transforms are first applied directly to the test scene's initial object state via the remap_x subroutine. The outputs are states unlikely to satisfy the contact requirements of the primitives in the test scene. We then perform retarget_x, which modifies the outputs to satisfy the environment-object contacts required by the primitives. The outputs of retarget_x are the intermediate goals for each primitive. Next, we run retarget_q, which finds the robot configuration that satisfies the contact requirements of the subsequent primitive. We thereby obtain a sequence of intermediate goals and robot configurations in the test scene. Finally, we execute the primitive policies using the intermediate goals and robot configurations to achieve the task in the test scene. Please refer to our paper for more details.

Summary of our pipeline.
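To make the data flow concrete, below is a minimal Python sketch of the composition. The subroutine names remap_x, retarget_x, and retarget_q mirror the description above; the primitive interface, pose representation, and helper signatures are our own illustration, not the released code.

```python
import numpy as np

def remap_x(T_rel: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Apply a relative object transform from the demo to a test-scene pose."""
    return T_rel @ x

def retarget_demo(demo_segments, primitives, x0_test):
    """Turn one demonstration into intermediate goals and robot configurations.

    demo_segments: list of (x_init, x_goal) 4x4 object poses, one per primitive.
    primitives:    objects exposing retarget_x / retarget_q (IK-based projections).
    x0_test:       4x4 initial object pose in the test scene.
    """
    goals, configs = [], []
    x = x0_test
    for prim, (x_init_d, x_goal_d) in zip(primitives, demo_segments):
        T_rel = x_goal_d @ np.linalg.inv(x_init_d)  # relative transform from the demo
        x_raw = remap_x(T_rel, x)                   # likely violates contact requirements
        x_goal = prim.retarget_x(x_raw)             # enforce environment-object contacts
        q = prim.retarget_q(x_goal)                 # robot config for the next primitive
        goals.append(x_goal)
        configs.append(q)
        x = x_goal
    return goals, configs
```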



Primitive and contact retargeting implementation

Four primitives, "push," "pull," "pivot," and "grasp," are implemented in this project. Each primitive is a short-horizon, goal-conditioned policy that takes in the current state and a goal state and outputs a sequence of actions. The initial and goal states must satisfy the contact requirements of the primitive. Here we summarize the primitives and their contact requirements. Please refer to our paper and code for implementation details.


Summary of the primitives implemented in this project.

Push primitive

The push primitive is a single reinforcement learning policy trained in Isaac Gym. The policy is trained to push any of the 7 standard objects and 3 short objects from any initial pose to any goal pose in the workspace. We explicitly inform the policy of the object being manipulated via a one-hot encoding.
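As a concrete illustration of the observation design, a one-hot object identifier can be appended to the policy input roughly as follows (a minimal sketch; the actual observation layout used in training may differ):

```python
import numpy as np

# 7 standard + 3 short objects handled by the push policy.
OBJECTS = ["cereal", "cocoa", "cracker", "flapjack", "oat", "seasoning", "wafer",
           "camera", "meat", "onion"]

def one_hot(obj_name: str) -> np.ndarray:
    """Encode the object's identity as a one-hot vector."""
    vec = np.zeros(len(OBJECTS), dtype=np.float32)
    vec[OBJECTS.index(obj_name)] = 1.0
    return vec

def build_observation(proprio, obj_pose, goal_pose, obj_name):
    """Concatenate proprioception, current/goal object poses, and object identity."""
    return np.concatenate([proprio, obj_pose, goal_pose, one_hot(obj_name)])
```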

The robot-object contact is implicitly enforced by the policy, so there is no need to initialize the robot in a specific contact configuration. This showcases the flexibility of our pipeline in handling the diverse contact requirements of each primitive. This design choice also allows emergent behaviors where the policy switches robot-object contacts to correct for tracking errors.

Other than requiring the object to be on the ground, there are no environment-object contact requirements.

Pull primitive

The pull primitive is a two-stage, hand-designed open-loop policy that leverages operational space control (OSC). Starting from an initial robot configuration where the gripper is in the vicinity of the top of the object, the robot first closes its gripper and moves downward toward the object to ensure contact is established. The robot then moves the gripper horizontally to pull the object toward the goal position.

The robot-object contact required by the pull primitive is for the gripper to be approximately in contact with the top of the object. To achieve this, we compute the top rectangle of the object's bounding box and move the gripper to the center of that rectangle.
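A minimal sketch of this computation, assuming the bounding box is available as its 8 world-frame corner vertices with z up:

```python
import numpy as np

def pull_contact_target(bbox_corners: np.ndarray) -> np.ndarray:
    """Center of the top face of the object's bounding box.

    bbox_corners: 8x3 world-frame corners. The gripper is sent to this point,
    closed, and moved downward to establish contact before pulling.
    """
    # The top rectangle consists of the 4 corners with the largest height.
    top4 = bbox_corners[np.argsort(bbox_corners[:, 2])[-4:]]
    return top4.mean(axis=0)
```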

Other than requiring the object to be on the ground, there are no environment-object contact requirements.

Pivot primitive

The pivot primitive is a three-stage, hand-designed feedback policy that uses OSC. A lower edge of the object is required to be in contact with the wall and orthogonal to the wall normal. The gripper fingertips are in contact with the object on the side opposite the wall-object contact. The robot first pushes the object toward the wall using OSC to establish contact. Next, the gripper follows a Trammel of Archimedes path whose parameters are given by the bounding box dimensions of the object. A contact force is maintained in an impedance-control manner by commanding a fixed tracking error in the radial direction of the arc. Once the object has been pivoted, the robot breaks contact and clears the object by lifting the gripper upward. The pose of the object is tracked to ensure the object is pivoted by the correct angle of approximately π/2.
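For intuition, the fingertip waypoints can be generated as a quarter arc of the ellipse traced by a Trammel of Archimedes, with semi-axes taken from the bounding box dimensions. The 2D sketch below lives in the plane orthogonal to the pivot edge, with the origin at the wall-object contact; the radial offset mimics the commanded tracking error that maintains contact force. This is our own simplification of the controller described above.

```python
import numpy as np

def trammel_arc(a: float, b: float, n: int = 50) -> np.ndarray:
    """Fingertip waypoints for pivoting, in the plane orthogonal to the pivot edge.

    A Trammel of Archimedes traces the ellipse (a*cos t, b*sin t); here the
    semi-axes a, b come from the object's bounding box dimensions. The arc
    sweeps the object from flat (t = 0) to upright (t = pi/2).
    """
    t = np.linspace(0.0, np.pi / 2, n)
    return np.stack([a * np.cos(t), b * np.sin(t)], axis=1)

def with_radial_offset(waypoints: np.ndarray, delta: float) -> np.ndarray:
    """Command a fixed radial tracking error so OSC stiffness presses on the object."""
    radial = waypoints / np.linalg.norm(waypoints, axis=1, keepdims=True)
    return waypoints - delta * radial  # offset toward the pivot edge
```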

The robot-object contact is implemented as an intersection of two constraints in Drake: the distances between the object and the fingertips are zero, and the fingertips lie within a cone that is centered at the object's geometric center, has the wall's normal as its axis, and has a half-angle of π/6.
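A sketch of how the two constraints could be posed with pydrake's InverseKinematics, assuming the object goal pose is already fixed by retarget_x so the cone apex and axis are constants in the world frame. The fingertip frame names, nominal contact points, and 5 mm relaxation of the zero-distance requirement are our own assumptions; the released code may formulate the constraints differently.

```python
import numpy as np
from pydrake.multibody.inverse_kinematics import InverseKinematics
from pydrake.solvers import Solve

def solve_pivot_contact_ik(plant, p_W_center, n_W_wall, tip_contacts):
    """Find a robot configuration satisfying the pivot primitive's contacts.

    p_W_center:   object geometric center in world (cone apex).
    n_W_wall:     unit wall normal in world (cone axis).
    tip_contacts: {fingertip frame name: nominal world contact point on the
                   object face opposite the wall-object contact}.
    """
    ik = InverseKinematics(plant)
    for name, p_W_contact in tip_contacts.items():
        tip = plant.GetFrameByName(name)  # hypothetical frame names
        # (1) Zero object-fingertip distance, relaxed to a 5 mm ball.
        ik.AddPointToPointDistanceConstraint(
            tip, np.zeros(3), plant.world_frame(), p_W_contact, 0.0, 0.005)
        # (2) Fingertip within the cone: apex at the object center, axis along
        #     the wall normal, half-angle pi/6.
        ik.AddGazeTargetConstraint(
            plant.world_frame(), p_W_center, n_W_wall, tip, np.zeros(3), np.pi / 6)
    result = Solve(ik.prog())
    return result.GetSolution(ik.q()) if result.is_success() else None
```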

The environment-object contact is implemented using the bounding box of the object: of the 4 bounding-box vertices closest to the wall, the 2 lowest must lie on the wall, and the distance between the wall and the object must be 0.
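In a frame where the wall is the plane x = 0 with inward normal +x, enforcing this requirement reduces to a small computation over the box vertices; a simplified sketch:

```python
import numpy as np

def snap_box_to_wall(bbox_corners: np.ndarray) -> np.ndarray:
    """Translate the box along x so its wall-facing lower edge lies on x = 0.

    bbox_corners: 8x3 world-frame bounding box corners, wall at x = 0 (z up).
    """
    near4 = bbox_corners[np.argsort(bbox_corners[:, 0])[:4]]  # 4 closest to wall
    low2 = near4[np.argsort(near4[:, 2])[:2]]                 # 2 lowest of those
    offset = low2[:, 0].mean()          # lower edge's distance to the wall
    snapped = bbox_corners.copy()
    snapped[:, 0] -= offset             # enforce zero wall-object distance
    return snapped
```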

Grasp primitive

The grasp primitive is a hand-designed policy that uses OSC. The primitive begins with the gripper fully open above the object. First, the robot descends to slot the object between the gripper fingers. To compensate for potential pose estimation error, a hand-designed wiggling motion is performed to increase the success rate. Once the object is between the fingers, the gripper is closed and the object is lifted.

To find the robot configuration that satisfies the contact requirements of the grasp primitive, we compute the bounding box of the object's projection onto the ground plane. The two fingers are aligned with the short side of the box and centered at the box's center. The initial gripper height is set to the height of the object's bounding box plus a small tolerance.
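A simplified sketch using OpenCV's minimum-area rectangle on the projected vertices (the tolerance value is our own, and minAreaRect's angle convention varies across OpenCV versions):

```python
import numpy as np
import cv2

def grasp_target(vertices_W: np.ndarray, tol: float = 0.02):
    """Gripper position and yaw from the object's projection onto the ground.

    vertices_W: Nx3 object vertices in the world frame (z up).
    Fingers straddle the short side of the projected min-area rectangle,
    centered on it; the gripper starts `tol` meters above the object's top.
    """
    pts2d = vertices_W[:, :2].astype(np.float32)
    (cx, cy), (w, h), angle_deg = cv2.minAreaRect(pts2d)
    # Close the fingers across the narrow dimension of the rectangle.
    # NOTE: minAreaRect's (w, h, angle) convention differs across OpenCV versions.
    yaw = np.deg2rad(angle_deg) + (0.0 if w < h else np.pi / 2)
    z0 = vertices_W[:, 2].max() + tol   # just above the bounding box height
    return np.array([cx, cy, z0]), yaw
```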

Other than requiring the object to be on the ground, there are no environment-object contact requirements.



System setup

Object set


Objects used in this project. Standard objects are tested on all tasks. Short and impossible objects are used for additional "occluded grasping" experiments.


Mass and approximate dimensions of the objects used in the experiments.

Pose estimation pipeline

Our pipeline takes in an RGB image, a prespecified text description, and a textured mesh of the object, and outputs the object's 6D pose. To obtain a pose estimate from scratch, we perform the following steps:

  1. The prespecified text description of the object is given to OWL-ViT to obtain a bounding box of the object.
  2. The bounding box is given to Segment Anything to produce a segmentation mask of the object.
  3. Megapose uses the segmented object to produce an initial pose estimate.
  4. Subsequent pose tracking is done using only the "refiner" component of Megapose. The last estimated pose is used as the initial guess.

Steps 1-3 are only run when a guess of the object pose is unavailable, i.e., at pipeline initialization or when the object is lost. On our workstation with an Intel i9-13900K CPU and an NVIDIA GeForce RTX 4090 GPU, steps 1-3 typically take a few seconds to complete, and step 4 runs at a frame rate of 8-12 Hz. The pipeline automatically detects when the object is lost using the Megapose refiner's "pose score": if the score is too low, the entire pipeline is rerun.
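The control flow is summarized below. The detect, segment, estimate_pose, and refine_pose callables stand in for OWL-ViT, Segment Anything, and the Megapose coarse and refiner models, and the score threshold is illustrative; none of these names come from the actual codebase.

```python
def track_pose(get_frame, detect, segment, estimate_pose, refine_pose,
               text_prompt, mesh, score_threshold=0.5):
    """Yield 6D pose estimates: steps 1-3 (re)initialize, step 4 tracks."""
    pose = None
    while True:
        rgb = get_frame()
        if pose is None:                            # startup, or object lost
            box = detect(rgb, text_prompt)          # step 1: OWL-ViT bounding box
            mask = segment(rgb, box)                # step 2: SAM segmentation mask
            pose = estimate_pose(rgb, mask, mesh)   # step 3: Megapose initial estimate
        pose, score = refine_pose(rgb, pose, mesh)  # step 4: refiner-only tracking
        if score < score_threshold:                 # low "pose score": object lost
            pose = None                             # rerun steps 1-3 on the next frame
        else:
            yield pose
```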

Pose estimation pipeline output at 1x speed. Step 3 outputs are shown in blue. Step 4 outputs are shown in green.



Results

We evaluate our framework on 4 real-world extrinsic manipulation tasks: "obstacle avoidance," "object storage," "occluded grasping," and "object retrieval." Various environments are used for the demonstrations and tests to showcase our method's robustness against environment changes. All demonstrations are collected on cracker. Every task is evaluated on the 7 standard objects, each with 5 trials. Additionally, occluded grasping is evaluated on the 3 short objects with an extra "pull" step.

Our method achieved an overall success rate of 80.5% (81.7% for standard objects). Despite not being tailored to "occluded grasping," we outperform the 2022 deep reinforcement learning approach of "Learning to Grasp the Ungraspable with Emergent Extrinsic Dexterity," both when the initial object state is against the wall (88.6% vs. 78%) and away from it (77.1% vs. 56%).

To show that our method is agnostic to the specific demonstration, we collect demos for grasping on oat and the 3 impossible objects, which are unlikely to be graspable by the robot. We then retarget all demos onto cracker from 5 different initial poses, achieving a 100% success rate across 20 trials.

Main results

Summary of experiments on 7 standard objects.

Additional occluded grasping results

Summary of additional "occluded grasping" experiments.

Below we show one successful instance of each task-object combination. All videos are at 1x speed.

Obstacle avoidance

Push the object forward, then switch contact and push again to avoid the obstacle.

Demonstration

Cereal

Cocoa

Cracker

Flapjack

Oat

Seasoning

Wafer

Object storage

Push an object toward the wall, pivot it to align with an opening between the wall and the object, then pull it into the opening for storage.

Demonstration

Cereal

Cocoa

Cracker

Flapjack

Oat

Seasoning

Wafer

Occluded grasping

Push the object in an ungraspable pose toward the wall, pivot it to expose a graspable edge, and grasp it.

Demonstration

Cereal

Cocoa

Cracker

Flapjack

Oat

Seasoning

Wafer

Occluded grasping (short objects)

Push the object in an ungraspable pose toward the wall, pivot it to expose a graspable edge, pull to create space between the wall and the object for inserting the gripper, and grasp it.

Demonstration

Camera

Meat

Onion

Object retrieval

Pull the object from between two obstacles, push it toward the wall, pivot it to expose a graspable edge, and grasp it.

Demonstration

Cereal

Cocoa

Cracker

Flapjack

Oat

Seasoning

Wafer



Conclusion

This work presents a framework for generalizing long-horizon extrinsic manipulation from a single demonstration. Our method retargets the demonstration trajectory to the test scene by enforcing contact constraints with IK at every contact switch. The retargeted trajectory is then tracked with a sequence of short-horizon policies, one for each contact configuration. Our method achieved an overall success rate of 81.7% on real-world objects over 4 challenging long-horizon extrinsic manipulation tasks. Additional experiments show that contact retargeting is crucial to successfully retargeting such long-horizon plans, and that a wide range of demonstrations can be successfully retargeted with our pipeline.

BibTeX

Coming soon