Model Predictive Path Integral Control


MPPI: Sampling-Based Optimal Control


Instructor: Hasan A. Poonawala

Mechanical and Aerospace Engineering
University of Kentucky, Lexington, KY, USA

Topics:
From MPC to MPPI
Stochastic optimal control
Information-theoretic derivation
Importance sampling update
Algorithm and properties

A Progression in Optimal Control

Each method relaxes one key assumption of the previous.

Method   Dynamics    Key relaxation
LQR      linear      baseline: analytical solution
iLQR     nonlinear   drop linearity via local linearization
MPC      nonlinear   drop infinite horizon; add constraints; replan online
MPPI     nonlinear   drop differentiability; use sampling
PPO      unknown     drop the model entirely; learn from interaction

LQR → iLQR → MPC all require differentiable dynamics and cost. MPPI breaks that requirement.

When Gradient-Based MPC Struggles

iLQR and SQP-based MPC compute Jacobians and Hessians of $f$ and $\ell$ at every step.

This fails (or becomes fragile) when:

  • The cost is non-smooth: binary collision indicators, distance-to-obstacle, contact forces
  • The dynamics involve contact or discontinuities: legged robots, manipulation with grasping
  • The dynamics are a black-box simulator: no analytical gradients available
  • The cost landscape is highly non-convex: many local minima, gradient descent gets stuck

MPPI’s answer: replace gradient computation with parallel sampling.

Roll out many trajectories, weight them by cost, take the weighted average. No derivatives needed.

Modern GPUs can run thousands of trajectory simulations in parallel — making sampling practical within a real-time control loop.

Stochastic Optimal Control Problem Setup

System and Cost

Work in discrete time with a control-affine stochastic system:

$$x_{t+1} = f(x_t) + G(x_t)(u_t + \varepsilon_t), \qquad \varepsilon_t \sim \mathcal{N}(0, \Sigma)$$

where $G(x)$ is the control matrix (possibly state-dependent) and noise enters through the control channel.

The path cost over a horizon of $T$ steps is:

$$S(\tau) = \phi(x_T) + \sum_{t=0}^{T-1} \ell(x_t, u_t)$$

with the running cost decomposed as:

$$\ell(x_t, u_t) = q(x_t) + \frac{1}{2} u_t^T R\, u_t$$

  • $q(x)$: state-dependent cost, which may be non-smooth, non-differentiable, or even binary
  • $\frac{1}{2} u^T R u$: quadratic control penalty
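
As a concrete reference, here is a minimal sketch of evaluating this path cost along one noisy rollout (a sketch only: the names f_step, q, and phi are illustrative placeholders, not from the lecture):

```python
import numpy as np

def path_cost(x0, u, eps, f_step, q, phi, R):
    """S(tau) = phi(x_T) + sum_t [ q(x_t) + 0.5 u_t^T R u_t ] along one rollout."""
    x, S = np.asarray(x0, dtype=float), 0.0
    for t in range(len(u)):
        S += q(x) + 0.5 * u[t] @ R @ u[t]  # state cost + quadratic control penalty
        x = f_step(x, u[t] + eps[t])       # noise enters through the control channel
    return S + phi(x)                      # terminal cost
```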

Stochastic Optimal Control Problem

The goal is to minimize the expected cost

$$\min_{u_{0:T-1}} \; \mathbb{E}\left[S(\tau)\right]$$

Subject to

$$x_{t+1} = f(x_t) + G(x_t)(u_t + \varepsilon_t), \qquad \varepsilon_t \sim \mathcal{N}(0, \Sigma)$$

  • Dynamics are control-affine, with noise entering through the control channel
  • The cost is a sum of stage costs: $S(\tau) = \phi(x_T) + \sum_{t=0}^{T-1} \ell(x_t, u_t)$
  • The stage cost $\ell$ has the specific structure $\ell(x_t, u_t) = q(x_t) + \frac{1}{2} u_t^T R\, u_t$

Information-Theoretic Reformulation

From Trajectory Optimization to Distribution Optimization

Rather than minimizing $S(\tau)$ directly, search over distributions $q(\tau)$ over trajectories.

Define the passive distribution $p_0(\tau)$: trajectories generated by rolling out the current nominal control sequence $v_{0:T-1}$ plus noise,

$$p_0(\tau) \propto \exp\!\left(-\frac{1}{2}\sum_t \varepsilon_t^T \Sigma^{-1} \varepsilon_t\right)$$

where $u_t = v_t + \varepsilon_t$ and $\varepsilon_t \sim \mathcal{N}(0, \Sigma)$.

Pose the information-theoretic optimal control problem:

$$\min_{q(\tau)} \; \mathbb{E}_q[S(\tau)] + \lambda\, \mathrm{KL}(q \,\|\, p_0)$$

The two terms balance minimizing cost against staying close to the passive distribution.

Why the KL Term Is a Control Cost

For the control-affine system, the KL divergence between $q$ and $p_0$ is:

$$\mathrm{KL}(q \,\|\, p_0) = \frac{1}{2}\,\mathbb{E}_q\!\left[\sum_t \varepsilon_t^T \Sigma^{-1} \varepsilon_t\right]$$

Setting $R = \lambda \Sigma^{-1}$, the KL penalty becomes:

$$\lambda\,\mathrm{KL}(q \,\|\, p_0) = \frac{1}{2}\,\mathbb{E}_q\!\left[\sum_t \varepsilon_t^T R\, \varepsilon_t\right] = \mathbb{E}_q\!\left[\sum_t \frac{1}{2} u_t^T R\, u_t\right]$$

The information-theoretic problem is therefore equivalent to the original stochastic optimal control problem — no approximation yet.

This is the key structural observation: the KL regularizer and the quadratic control cost are the same thing, linked by $R = \lambda \Sigma^{-1}$.
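
The per-step identity behind this equivalence is the standard KL divergence between two Gaussians with equal covariance: if $q$ shifts the mean of $\varepsilon_t$ by $\delta_t$ while keeping covariance $\Sigma$, then

$$\mathrm{KL}\bigl(\mathcal{N}(\delta_t, \Sigma)\,\big\|\,\mathcal{N}(0, \Sigma)\bigr) = \frac{1}{2}\,\delta_t^T \Sigma^{-1} \delta_t$$

and summing over $t$ yields the path-level expression above.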

Optimal Solution

The Gibbs Distribution

The variational problem

$$\min_{q(\tau)} \; \mathbb{E}_q[S(\tau)] + \lambda\, \mathrm{KL}(q \,\|\, p_0)$$

has a closed-form solution. Taking the functional derivative $\delta/\delta q$ (with a Lagrange multiplier enforcing normalization) and setting it to zero:

$$q^*(\tau) = \frac{1}{\eta}\, p_0(\tau) \exp\!\left(-\frac{S(\tau)}{\lambda}\right), \qquad \eta = \mathbb{E}_{p_0}\!\left[\exp\!\left(-\frac{S(\tau)}{\lambda}\right)\right]$$

This is the Gibbs (Boltzmann) distribution: it reweights the passive rollout distribution by an exponential of the negative cost.

  • weight $\propto \exp(-\text{cost}/\lambda)$: lower cost gives exponentially higher weight

The temperature $\lambda$ controls the sharpness:

$\lambda$ small                               $\lambda$ large
Greedy: concentrates on the best trajectory   Diffuse: nearly uniform weights
Sharp selection                               Exploration
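
A tiny sketch of this effect on some illustrative costs (the cost values are assumed for demonstration):

```python
import numpy as np

S = np.array([1.0, 2.0, 5.0, 10.0])   # illustrative sampled trajectory costs
for lam in (0.1, 1.0, 100.0):
    w = np.exp(-S / lam)
    print(f"lambda = {lam:>5}: weights = {np.round(w / w.sum(), 4)}")
# lambda = 0.1 puts essentially all weight on the cheapest trajectory (greedy);
# lambda = 100 gives nearly uniform weights (diffuse, exploratory).
```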

Free Energy

The normalization constant $\eta$ has a direct interpretation. Substituting $q^*$ back:

$$\mathcal{F} = -\lambda \log \mathbb{E}_{p_0}\!\left[\exp\!\left(-\frac{S(\tau)}{\lambda}\right)\right]$$

This is the free energy — the minimum achievable value of the information-theoretic objective.

By Jensen’s inequality applied to the convex function $-\log$:

$$\mathcal{F} \leq \mathbb{E}_{p_0}[S(\tau)]$$
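
Written out, the Jensen step is

$$\mathcal{F} = -\lambda \log \mathbb{E}_{p_0}\!\left[e^{-S/\lambda}\right] \;\leq\; \mathbb{E}_{p_0}\!\left[-\lambda \log e^{-S/\lambda}\right] = \mathbb{E}_{p_0}[S(\tau)],$$

using $-\log \mathbb{E}[X] \leq \mathbb{E}[-\log X]$ for the convex function $-\log$.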

Optimal reweighting always improves over the passive (uncontrolled) rollout.

This mirrors the free energy in statistical mechanics, with $\lambda$ playing the role of the temperature $k_B T$.

Control Update via Importance Sampling

From Optimal Distribution to Control

Given $q^*(\tau)$, the optimal control at time $t$ is:

$$u_t^* = v_t + \mathbb{E}_{q^*}[\varepsilon_t]$$

Using the definition of $q^*$ to convert to an expectation under $p_0$:

$$u_t^* = v_t + \frac{\mathbb{E}_{p_0}\!\left[\varepsilon_t \exp(-S/\lambda)\right]}{\mathbb{E}_{p_0}\!\left[\exp(-S/\lambda)\right]}$$
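
Explicitly, this is a change of measure with the likelihood ratio $q^*(\tau)/p_0(\tau) = e^{-S(\tau)/\lambda}/\eta$:

$$\mathbb{E}_{q^*}[\varepsilon_t] = \mathbb{E}_{p_0}\!\left[\varepsilon_t\,\frac{q^*(\tau)}{p_0(\tau)}\right] = \frac{\mathbb{E}_{p_0}\!\left[\varepsilon_t\, e^{-S/\lambda}\right]}{\mathbb{E}_{p_0}\!\left[e^{-S/\lambda}\right]}$$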

This is importance sampling: sample from $p_0$ (the easy distribution), reweight by $\exp(-S/\lambda)$.

No gradients of $f$ or $\ell$ appear anywhere in this expression. The dynamics and cost are treated as black boxes.

Monte Carlo Approximation

Approximate with $K$ sampled noise sequences $\varepsilon^{(k)} \sim p_0$:

$$u_t^* \approx v_t + \sum_{k=1}^{K} \tilde{w}^{(k)}\, \varepsilon_t^{(k)}, \qquad \tilde{w}^{(k)} = \frac{\exp(-S^{(k)}/\lambda)}{\sum_j \exp(-S^{(j)}/\lambda)}$$

The normalized weights $\tilde{w}^{(k)}$ sum to one and are higher for lower-cost trajectories.

Numerical stability: subtract $\beta = \min_k S^{(k)}$ before exponentiating:

$$w^{(k)} = \exp\!\left(-\frac{S^{(k)} - \beta}{\lambda}\right), \qquad \tilde{w}^{(k)} = \frac{w^{(k)}}{\sum_j w^{(j)}}$$

This is algebraically identical, since $\beta$ cancels in the normalization, but it keeps every exponent nonpositive: the largest weight is exactly 1, so the weights neither overflow nor all underflow to zero.
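
A short sketch of why the shift matters in floating point (cost values assumed large enough to trigger underflow):

```python
import numpy as np

S = np.array([10000.0, 10001.0, 10005.0])  # large costs, small differences
lam = 1.0

naive = np.exp(-S / lam)      # every entry underflows to exactly 0.0
print(naive.sum())            # 0.0 -> normalizing would divide zero by zero

w = np.exp(-(S - S.min()) / lam)   # shift by beta = min_k S^(k) first
print(w / w.sum())                 # approx [0.727, 0.268, 0.005]
```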

When Is the Derivation Exact?

The derivation above is exact when:

  1. Dynamics are control-affine with noise through the control channel: $x_{t+1} = f(x_t) + G(x_t)(u_t + \varepsilon_t)$
  2. The control cost satisfies $R = \lambda \Sigma^{-1}$, so the KL term and the control cost coincide

In practice, MPPI is applied more broadly:

  • Non-affine dynamics (black-box simulators)
  • Binary or discontinuous costs
  • Constraints encoded as large penalties

The free-energy / importance-sampling structure is retained as a principled heuristic, but the optimality guarantee no longer holds exactly.

Even as a heuristic, MPPI consistently outperforms random shooting because the weighted average biases the update toward the low-cost region of trajectory space.

The Algorithm

MPPI

At each time step $t$ with current nominal control sequence $u_{t:t+T-1}$:

Input: current state x_t, nominal controls u[0:T-1], noise covariance Σ, temperature λ

1. Sample K perturbation sequences:
     ε^(k)[s] ~ N(0, Σ)   for k = 1..K, s = 0..T-1

2. Roll out each perturbed trajectory:
     x^(k)[0] = x_t
     x^(k)[s+1] = f(x^(k)[s], u[s] + ε^(k)[s])   for s = 0..T-1

3. Compute trajectory costs:
     S^(k) = Σ_s ℓ(x^(k)[s], u[s] + ε^(k)[s]) + φ(x^(k)[T])

4. Compute normalized weights:
     β = min_k S^(k)
     w^(k) = exp(-(S^(k) - β) / λ)
     w̃^(k) = w^(k) / Σ_j w^(j)

5. Update nominal controls:
     u[s] ← u[s] + Σ_k w̃^(k) ε^(k)[s]   for s = 0..T-1

6. Apply u[0], shift sequence: u[s] ← u[s+1], append u[T-1] = 0

Repeat at t+1 with new measured state.
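
The loop above, as a compact NumPy sketch (a serial reference implementation under assumptions: f_step, stage_cost, and terminal_cost are user-supplied black boxes, and a real controller would parallelize the rollout loop, e.g. on a GPU):

```python
import numpy as np

def mppi_step(x0, u, f_step, stage_cost, terminal_cost, Sigma, lam, K, rng):
    """One MPPI update of the nominal control sequence u, shape (T, m)."""
    T, m = u.shape
    # 1. Sample K perturbation sequences, shape (K, T, m)
    eps = rng.multivariate_normal(np.zeros(m), Sigma, size=(K, T))
    # 2.-3. Roll out each perturbed trajectory and accumulate its cost
    S = np.zeros(K)
    for k in range(K):                 # rollouts are independent: parallelizable
        x = np.asarray(x0, dtype=float)
        for s in range(T):
            v = u[s] + eps[k, s]
            S[k] += stage_cost(x, v)
            x = f_step(x, v)
        S[k] += terminal_cost(x)
    # 4. Normalized weights, stabilized by subtracting the best cost
    w = np.exp(-(S - S.min()) / lam)
    w /= w.sum()
    # 5. Importance-weighted control update
    return u + np.einsum('k,ktm->tm', w, eps)

# 6. Receding horizon: apply the first control, shift, append a zero row:
#    u = np.vstack([u_new[1:], np.zeros((1, u_new.shape[1]))])
```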

Key Properties

What MPPI Gives You

Property             Explanation
No gradients         $f$ and $\ell$ treated as black boxes
Parallelizable       $K$ rollouts are independent → GPU threads
Non-smooth costs     Binary collision, discontinuous rewards work fine
Global search        Samples explore trajectory space, not just a local neighborhood
Temperature tuning   $\lambda$ trades off exploitation vs. exploration

What MPPI still requires: a dynamics model $f$ that can be simulated forward. It need not be differentiable, but it must be fast enough that $K$ rollouts complete within one control cycle. MPPI is still a model-based method.

Tuning Parameters

Parameter                 Effect                                        Typical guidance
$K$ (samples)             More → better approximation, more compute     100–10,000; use as many as the GPU allows
$T$ (horizon)             Longer → better planning, more compute        Match to the task timescale
$\lambda$ (temperature)   Lower → greedier; higher → more exploration   Tune; 0.1–10 typical
$\Sigma$ (noise)          Larger → wider exploration                    Match to control authority

Unlike iLQR, MPPI has no convergence criterion within a single control step; it runs a fixed number of samples per cycle. The quality of the solution improves with $K$.

Demo

MPPI Pendulum Swing-Up

Pendulum swing-up from $\theta_0 = \pi$ (hanging down) to $\theta = 0$ or $2\pi$ (upright).

Energy-shaping running cost: $q(x) = \bigl(E(x) - E^*\bigr)^2$, where $E(x) = \tfrac{1}{2}\dot\theta^2 + gL\cos\theta$ is the total mechanical energy and $E^* = gL$ is the energy at the upright with zero velocity.
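
Sketching the demo's ingredients in the form the mppi_step sketch above expects (the Euler discretization, time step, and unit parameters are assumptions, with $L = 1$ so the dynamics match the energy expression; this is not the lecture's exact demo code):

```python
import numpy as np

g, L, dt = 9.81, 1.0, 0.05   # assumed parameters; L = 1 normalization

def f_step(x, u):
    """Euler step; x = [theta, theta_dot], theta = 0 at the upright."""
    th, thd = x
    return np.array([th + dt * thd,
                     thd + dt * (g * L * np.sin(th) + u[0])])  # gravity + torque

def stage_cost(x, u):
    E = 0.5 * x[1]**2 + g * L * np.cos(x[0])   # total mechanical energy E(x)
    return (E - g * L)**2                      # energy-shaping cost, E* = gL

def terminal_cost(x):
    return stage_cost(x, np.zeros(1))

# Swing-up starts hanging down: x0 = np.array([np.pi, 0.0])
```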

Results

MPPI vs. iLQR / MPC

Side-by-Side Comparison

                       iLQR / MPC                           MPPI
Optimization           Gradient + Hessian (backward pass)   Weighted average of samples
Dynamics requirement   Differentiable $f$, Jacobians        Simulatable $f$ only
Cost requirement       Smooth, differentiable               Arbitrary (non-smooth, binary OK)
Parallelism            Sequential backward pass             Embarrassingly parallel
Global search          Local (Newton steps)                 Stochastic global via samples
Stability guarantees   Lyapunov (with terminal cost)        Not guaranteed in general
Warm-starting          Natural (shift previous solution)    Natural (shift previous controls)

In practice: use iLQR/MPC when gradients are available and the landscape is well-behaved. Use MPPI for contact-rich systems, black-box simulators, or when the cost is non-differentiable.

Summary

MPPI

Core idea: replace gradient-based trajectory optimization with importance-weighted sampling.

Derivation in three steps:

  1. Pose the stochastic OCP as minimizing $\mathbb{E}_q[S(\tau)] + \lambda\,\mathrm{KL}(q \,\|\, p_0)$
  2. Solve variationally → the optimal $q^*$ is the Gibbs distribution
  3. Extract the control via importance sampling → no gradients needed

Key trade-offs vs. iLQR/MPC:

  • + No differentiability required; handles non-smooth costs and contact
  • + Massively parallelizable on GPU
  • − Still needs a fast forward model
  • − No convergence guarantee within one control cycle; quality scales with $K$