Policies trained to rigidly track a reference motion treat any deviation as an error to correct. When making unexpected contact with the world, they respond with large, uncontrolled forces, leading to brittle and potentially dangerous behavior.
Instead of training an RL agent to strictly track motions under any perturbations, we want it to respond to external forces with a controllable force-displacement relationship. A key challenge is how to balance compliant behaviors with motion imitation objectives. Rather than tuning this balance through competing rewards, we first generate a dataset of feasible and stylistically desirable compliant motions using an offline IK solver, providing a fine-grained specification that simplifies task prioritization. The policy then learns to reproduce these compliant behaviors while only observing the original reference, forcing it to implicitly infer external forces and react appropriately.
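To make this pipeline concrete, below is a minimal, purely illustrative sketch of how one frame of compliant reference data could be generated offline. The virtual-spring displacement model, the `solve_ik` stub, the joint dimension, and all numerical ranges are assumptions for illustration, not the actual SoftMimic implementation.

```python
import numpy as np

def solve_ik(ee_target, seed_qpos):
    """Stand-in for the offline whole-body IK solver mentioned above.
    Here it simply returns the seed pose so the sketch runs end to end."""
    return seed_qpos

def generate_compliant_frame(ref_qpos, ee_ref_pos, stiffness, rng):
    """Displace one reference frame under a sampled external force.

    ref_qpos   : reference joint positions for this frame
    ee_ref_pos : reference end-effector position, shape (3,)
    stiffness  : virtual-spring stiffness k [N/m] for this sample
    """
    # Sample an external force the robot might plausibly encounter at
    # deployment (magnitude range is a made-up placeholder).
    force = rng.normal(scale=20.0, size=3)

    # Virtual-spring compliance: displacement = F / k, so a low stiffness
    # command yields large, gentle deflections and a high one barely moves.
    ee_compliant_pos = ee_ref_pos + force / stiffness

    # Offline IK turns the displaced end-effector target into a feasible,
    # stylistically consistent whole-body pose.
    compliant_qpos = solve_ik(ee_compliant_pos, ref_qpos)

    return {
        "obs_reference": ref_qpos,          # the policy only observes the original reference
        "tracking_target": compliant_qpos,  # the pose it is rewarded for reproducing
        "applied_force": force,             # replayed as a perturbation in sim, never observed
        "stiffness_cmd": stiffness,         # conditioning signal at train and deployment time
    }

# Example: one frame generated with a soft stiffness command.
rng = np.random.default_rng(0)
frame = generate_compliant_frame(np.zeros(29), np.array([0.4, 0.0, 1.0]),
                                 stiffness=200.0, rng=rng)
```

Because the applied force never appears in the observation, the policy must infer it from the mismatch between the observed reference and its own state, which is what produces the compliant reaction at test time.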
Our policy can be commanded to behave with a specific stiffness. At low stiffness, it interacts gently and safely with its environment; at high stiffness, it firmly resists external forces to maintain its posture.
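To make the stiffness command concrete, here is a tiny illustration of the force-displacement relationship it selects, assuming a simple spring model; the learned policy need not realize this mapping exactly, and the numbers are placeholders.

```python
def expected_deflection(force_newtons: float, stiffness_cmd: float) -> float:
    """Deflection [m] a spring-like compliant controller would allow
    under a steady push (assumed model, placeholder numbers)."""
    return force_newtons / stiffness_cmd

# A steady 30 N push on the hand:
print(expected_deflection(30.0, 100.0))    # low stiffness  -> 0.30 m of yield (gentle)
print(expected_deflection(30.0, 2000.0))   # high stiffness -> 0.015 m of yield (firm)
```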
A teleoperator can command the robot to be stiff or compliant at deployment time. Below, the operator adjusts the joystick in the bottom left corner to increase and decrease stiffness while the reference posture remains unchanged.
SoftMimic policies softly absorb unexpected contacts, whereas traditional motion tracking policies apply large, uncontrolled forces that can damage the environment. Below, a traditional motion tracking baseline (left) and SoftMimic (right) are commanded to raise their arms next to a delicate Lego structure.
Compliance enables a single reference motion to generalize to a range of task variations. Here, a reference motion dimensioned for picking a 20cm-wide box lets the robot pick boxes of various sizes by compliantly adjusting its grip. The policy was never trained on boxes, only on generalized external forces.
The same policy can also respond to a variety of failure cases without specialized training. These represent scenarios where, for example, a high-level planner or teleoperator is unaware that a box has been misplaced and attempts to pick it anyway.
A SoftMimic policy trained with a walking reference can comply with a payload and with human interaction while maintaining balance.
A SoftMimic policy trained with a pouring reference maintains a smooth pour while its other hand is significantly displaced. A stiff baseline jitters when forced the same distance, spilling the contents.
Website template adapted from here.