Gradient-Based Planning for World Models: Breakthrough in Long-Horizon Control

Gradient-Based Planning for World Models at Longer Horizons

GRASP is a gradient-based planner for learned world models that targets long-horizon tasks, where traditional planning methods become fragile. Developed by researchers at UC Berkeley’s BAIR lab, including Michael Psenka, Mike Rabbat, Aditi Krishnapriyan, Yann LeCun, and Amir Bar, GRASP is designed to sidestep the adversarial sensitivity of high-dimensional state spaces while keeping the efficiency of gradient descent. Conventional gradient-based planners suffer from vanishing gradients over extended sequences and get stuck in poor local minima when a task requires non-greedy behavior. GRASP instead does three things: it lifts the trajectory into virtual states that are optimized alongside the actions, injects stochasticity into the state iterates, and reshapes gradients so that the action signal is isolated from brittle state-input dependencies.

Overcoming Adversarial Fragility in World Model Planning

Modern world models, trained to predict future observations from high-dimensional inputs like pixels, are powerful but vulnerable to adversarial perturbations. As noted in the original research, even minor deviations from the learned data manifold can cause the model to produce wildly inaccurate predictions, a phenomenon rooted in the "dimpled manifold" problem described by Stutz et al. (2019). This instability undermines standard collocation-based planners: because they optimize over states directly, the optimizer can follow sharp, unregularized gradients in the latent space toward states that fool the model rather than states that are actually reachable.

GRASP circumvents this by applying a stop-gradient to the state inputs of the dynamics model F_\theta, which decouples the optimization from the unstable gradients D_s F_\theta (the Jacobian of the model with respect to its state input). Instead, it relies on the more reliable gradients D_a F_\theta (the Jacobian with respect to the action input), which are densely trained and less susceptible to adversarial manipulation. This insight, supported by findings in constrained nonconvex optimization with Markovian data (Roy et al., 2022), lets GRASP maintain stable learning signals as horizons extend to 80 steps.
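To make the stop-gradient idea concrete, here is a minimal sketch on a toy scalar problem. Everything in it is an assumption made for illustration: the dynamics `F`, its hand-written derivatives `dF_ds` and `dF_da`, the quadratic lifted loss, and the helper `lifted_grads` are hypothetical stand-ins, not the objective or code used by GRASP. The sketch only shows which gradient term is dropped when state inputs are treated as constants.

```python
import math

def F(s, a):
    """Toy scalar dynamics standing in for a learned world model F_theta."""
    return s + a + 0.1 * math.sin(5.0 * s)

def dF_ds(s, a):
    """Derivative of F with respect to its state input (the D_s term)."""
    return 1.0 + 0.5 * math.cos(5.0 * s)

def dF_da(s, a):
    """Derivative of F with respect to its action input (the D_a term)."""
    return 1.0

def lifted_grads(s0, states, actions, goal, stop_state_grad=True):
    """Gradients of the illustrative lifted loss
        sum_t (F(s_t, a_t) - s_{t+1})^2 + (s_H - goal)^2
    with respect to the virtual states s_1..s_H and actions a_0..a_{H-1}.
    states[i] holds s_{i+1}; actions[t] holds a_t.
    """
    H = len(actions)
    full = [s0] + list(states)          # full[t] = s_t for t = 0..H
    grad_s = [0.0] * H
    grad_a = [0.0] * H
    for t in range(H):
        r = F(full[t], actions[t]) - full[t + 1]   # dynamics residual
        grad_a[t] += 2.0 * r * dF_da(full[t], actions[t])
        grad_s[t] += -2.0 * r                      # s_{t+1} used as a target
        if t > 0 and not stop_state_grad:
            # Path through the state INPUT of F; dropped under stop-gradient.
            grad_s[t - 1] += 2.0 * r * dF_ds(full[t], actions[t])
    grad_s[H - 1] += 2.0 * (full[H] - goal)        # terminal goal term
    return grad_s, grad_a

s0, goal = 0.0, 1.0
states = [0.2, 0.5, 0.9]
actions = [0.1, 0.1, 0.1]
gs_stop, ga_stop = lifted_grads(s0, states, actions, goal, stop_state_grad=True)
gs_full, ga_full = lifted_grads(s0, states, actions, goal, stop_state_grad=False)
print("action grads (same in both):", ga_stop)
print("state grads with stop-gradient:", gs_stop)
print("state grads without:", gs_full)
```

In this toy setup the action gradients are identical in both variants, and only the state gradients lose the term that depends on `dF_ds`. With a real learned model that is the term the text describes as unstable.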

Complementing this, GRASP adds noise to the state iterates during optimization, allowing the planner to escape local minima without destabilizing the action updates. This hybrid strategy, which pairs deterministic action updates with stochastic state exploration, mirrors principles seen in stochastic approximation under Markovian frameworks, as revisited by Springer Nature (2026), where controlled noise enhances convergence in nonconvex landscapes. The method stops short of full Langevin dynamics but uses noise to sample diverse trajectory basins, a technique reported to significantly outperform pure gradient descent and CEM baselines.
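The split between noisy state updates and deterministic action updates can be sketched as a single update rule. This is an illustrative sketch only: the function name `noisy_lifted_step`, the step size `lr`, the noise scale `sigma`, and the use of plain Gaussian noise are assumptions, and the source does not specify GRASP's actual noise schedule.

```python
import random

def noisy_lifted_step(states, actions, grad_s, grad_a,
                      lr=0.05, sigma=0.1, rng=random):
    """One illustrative update on a lifted trajectory.

    Virtual states take a gradient step plus Gaussian noise;
    actions take a plain, noise-free gradient step.
    """
    new_states = [s - lr * g + sigma * rng.gauss(0.0, 1.0)
                  for s, g in zip(states, grad_s)]
    new_actions = [a - lr * g for a, g in zip(actions, grad_a)]
    return new_states, new_actions

# Hypothetical iterates and gradients, just to exercise the update.
states = [0.2, 0.5, 0.9]
actions = [0.1, 0.1, 0.1]
grad_s = [0.3, -0.1, 0.2]
grad_a = [-0.2, 0.0, 0.1]

s_a, a_a = noisy_lifted_step(states, actions, grad_s, grad_a, rng=random.Random(0))
s_b, a_b = noisy_lifted_step(states, actions, grad_s, grad_a, rng=random.Random(1))
print("actions agree across seeds:", a_a == a_b)
print("states differ across seeds:", s_a != s_b)
```

Because `sigma` is a free parameter here and is not tied to the step size, this is not a Langevin sampler, which is consistent with the description above. Setting `sigma=0` recovers plain gradient descent on both states and actions.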

Periodic "sync" steps further refine the solution by briefly reverting to serial rollout gradients, ensuring trajectory feasibility without sacrificing speed. This hybrid architecture balances the scalability of lifted optimization with the precision of full-model backpropagation. Reported results on the Push-T and BallNav benchmarks favor GRASP: at H=80, GRASP achieves a 10.4% success rate versus 2.8% for CEM and 6.4% for standard gradient descent, which is about 3.7 times the CEM rate and about 1.6 times the gradient descent rate, while reducing median computation time by over 50% compared to alternatives.
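A sync step of the kind described above might look like the following sketch. It rests on several assumptions: the toy dynamics `F` and derivatives `dF_ds` and `dF_da` are hypothetical, the loss is a simple terminal goal loss, and resetting the virtual states to the rolled-out states is one plausible reading of "ensuring trajectory feasibility", not a confirmed detail of GRASP.

```python
import math

def F(s, a):
    """Toy scalar dynamics standing in for a learned world model."""
    return s + a + 0.1 * math.sin(5.0 * s)

def dF_ds(s, a):
    return 1.0 + 0.5 * math.cos(5.0 * s)

def dF_da(s, a):
    return 1.0

def rollout(s0, actions):
    """Serial rollout: apply the model step by step from s0."""
    states, s = [], s0
    for a in actions:
        s = F(s, a)
        states.append(s)
    return states

def sync_step(s0, actions, goal, lr=0.05):
    """Illustrative sync: backpropagate a terminal loss (s_H - goal)^2
    through the serial rollout, update the actions, then return the
    rolled-out states so the virtual states are feasible again."""
    states = rollout(s0, actions)
    inputs = [s0] + states[:-1]            # inputs[t] = s_t fed to F at step t
    adj = 2.0 * (states[-1] - goal)        # dL/ds_H
    grad_a = [0.0] * len(actions)
    for t in reversed(range(len(actions))):
        grad_a[t] = adj * dF_da(inputs[t], actions[t])
        adj = adj * dF_ds(inputs[t], actions[t])   # now dL/ds_t
    new_actions = [a - lr * g for a, g in zip(actions, grad_a)]
    return rollout(s0, new_actions), new_actions

s0, goal = 0.0, 1.0
actions = [0.1, 0.1, 0.1]
loss_before = (rollout(s0, actions)[-1] - goal) ** 2
synced_states, synced_actions = sync_step(s0, actions, goal)
loss_after = (synced_states[-1] - goal) ** 2
print("terminal loss before and after sync:", loss_before, loss_after)
```

The backward loop is the serial part: each step depends on the adjoint from the step after it, which is why the text treats full rollout gradients as something to use only periodically.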

By decoupling state and action optimization, GRASP takes a different approach to world model planning from planners that optimize states directly or backpropagate through every rollout step. Its contribution is not only an incremental improvement over existing methods: it extends the horizon at which gradient-based planning stays computationally practical. As world models grow in scale and complexity, GRASP’s architecture offers a scalable, robust foundation for autonomous systems that must reason over extended time horizons without succumbing to adversarial brittleness. Future extensions may integrate diffusion models or RL policy learning, but the core idea remains: when planning through learned dynamics, trust the actions, not the states.

Sources: link.springer.com • export.arxiv.org • openreview.net • openreview.net • proceedings.mlr.press