Eliot Xing
Vernon Luk
Jean Oh
Carnegie Mellon University

TL;DR We present a maximum entropy first-order RL algorithm, together with a parallel differentiable multiphysics simulation platform for RL that supports simulating materials beyond rigid bodies, such as soft bodies and fluids.

Results on Rewarped tasks

Figure. Visualizations of trajectories from policies learned by SAPO in Rewarped tasks: AntRun, HandReorient, RollingFlat, SoftJumper, HandFlip, FluidMove.

Soft Analytic Policy Optimization (SAPO)

We observe that existing first-order RL algorithms are prone to converging to suboptimal local minima in the reward landscape. We draw on the maximum entropy RL framework to formulate Soft Analytic Policy Optimization (SAPO), a maximum entropy first-order RL algorithm. SAPO uses first-order analytic gradients from differentiable simulation to train a stochastic actor to maximize both expected return and entropy.
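In standard notation, a maximum entropy actor maximizes the entropy-regularized return (this is the generic framework objective with entropy temperature \alpha, not necessarily the paper's exact formulation):

J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{T} \gamma^{t}\big(r(s_t, a_t) + \alpha\,\mathcal{H}(\pi(\cdot \mid s_t))\big)\Big]

SAPO optimizes this objective with first-order analytic gradients obtained by differentiating through the simulator, rather than with zeroth-order (likelihood-ratio) gradient estimates.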
Figure. Loss surface comparison between APG, SHAC, and SAPO (ours) on DFlex Ant.


Figure. Loss surface comparison between APG, SHAC, and SAPO (ours) on Rewarped HandFlip.


Figure. Computational graph of SAPO.

Rewarped

We introduce Rewarped, a parallel differentiable multiphysics simulation platform that provides GPU-accelerated parallel environments for RL and enables computing batched simulation gradients efficiently.
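To make the idea concrete, the sketch below shows how first-order gradients flow from rewards back to actions through a batched differentiable environment. The DiffEnv class, its dynamics, and the use of PyTorch here are hypothetical stand-ins for illustration, not Rewarped's actual API.

import torch

# Hypothetical batched differentiable environment. Because step() is
# built from differentiable ops, rewards are differentiable in actions.
class DiffEnv:
    def __init__(self, num_envs: int, dim: int):
        self.state = torch.zeros(num_envs, dim)

    def step(self, action: torch.Tensor):
        # toy dynamics: nudge each env's state by a squashed action
        self.state = self.state + 0.1 * torch.tanh(action)
        reward = -self.state.pow(2).sum(dim=-1)  # reward: reach the origin
        return self.state, reward

env = DiffEnv(num_envs=64, dim=3)
action = torch.zeros(64, 3, requires_grad=True)

total_reward = torch.zeros(())
for _ in range(8):
    _, reward = env.step(action)
    total_reward = total_reward + reward.sum()

# first-order (analytic) gradients through all 8 simulation steps
(-total_reward).backward()
print(action.grad.shape)  # torch.Size([64, 3])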

We implement all simulation code in NVIDIA Warp, a differentiable programming library that converts Python code into CUDA kernels via just-in-time (JIT) compilation at runtime. We use gradient checkpointing and CUDA graphs to reduce memory requirements and to efficiently compute batched simulation gradients over multiple time steps.
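As a minimal, self-contained sketch of this pattern (toy particle dynamics, not Rewarped's simulation code; gradient checkpointing and CUDA graph capture are omitted for brevity), Warp's wp.Tape can record a multi-step rollout and then run the adjoint pass to obtain batched gradients:

import warp as wp

wp.init()

num, steps, dt = 1024, 10, 0.01

@wp.kernel
def step(x_in: wp.array(dtype=wp.vec3), v_in: wp.array(dtype=wp.vec3),
         x_out: wp.array(dtype=wp.vec3), v_out: wp.array(dtype=wp.vec3),
         dt: float):
    tid = wp.tid()
    # semi-implicit Euler under gravity
    v_out[tid] = v_in[tid] + wp.vec3(0.0, -9.8, 0.0) * dt
    x_out[tid] = x_in[tid] + v_out[tid] * dt

@wp.kernel
def height_loss(x: wp.array(dtype=wp.vec3), loss: wp.array(dtype=float)):
    tid = wp.tid()
    p = x[tid]
    wp.atomic_add(loss, 0, p[1] * p[1])  # sum of squared heights

# one state buffer per step, so intermediate states stay alive for backward
xs = [wp.array([(0.0, 1.0, 0.0)] * num, dtype=wp.vec3, requires_grad=True)
      for _ in range(steps + 1)]
vs = [wp.zeros(num, dtype=wp.vec3, requires_grad=True) for _ in range(steps + 1)]
loss = wp.zeros(1, dtype=float, requires_grad=True)

tape = wp.Tape()  # records kernel launches for reverse-mode differentiation
with tape:
    for i in range(steps):
        wp.launch(step, dim=num, inputs=[xs[i], vs[i], xs[i + 1], vs[i + 1], dt])
    wp.launch(height_loss, dim=num, inputs=[xs[-1], loss])

tape.backward(loss)          # adjoint pass over the recorded launches
print(xs[0].grad.numpy())    # d(loss)/d(initial positions)

Allocating a buffer per step keeps intermediate states available for the adjoint pass; gradient checkpointing, as used in Rewarped, trades this memory for recomputation over longer horizons.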

Results


Figure. Training curves on Rewarped tasks.

We compare SAPO, our proposed maximum entropy first-order RL algorithm, against baselines on a range of challenging manipulation and locomotion tasks that involve rigid and soft bodies. SAPO (i) trains more stably across random seeds, (ii) is more sample efficient, and (iii) achieves higher overall task performance.

Figure. Visualizations of trajectories from SAPO (ours), SAC, SHAC, PPO, and APG in Rewarped tasks: AntRun, HandReorient, RollingFlat, SoftJumper, HandFlip, FluidMove.

BibTeX
@inproceedings{xing2024stabilizing,
    title={Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation},
    author={Eliot Xing and Vernon Luk and Jean Oh},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2025}
}