Proximal Policy Optimization


PPO: Model-Free Reinforcement Learning


Instructor: Hasan A. Poonawala

Mechanical and Aerospace Engineering
University of Kentucky, Lexington, KY, USA

Topics:
Bridge from MPPI
Policy gradient theorem
Importance sampling & trust regions
Clipped surrogate objective
Advantage estimation
Algorithm and properties

A Progression in Optimal Control

Each method relaxes one key assumption of the previous.

| Method | Dynamics | Key relaxation |
|---|---|---|
| LQR | linear | baseline: analytical solution |
| iLQR | nonlinear | drop linearity via local linearization |
| MPC | nonlinear | drop infinite horizon; add constraints; replan online |
| MPPI | nonlinear | drop differentiability; use sampling |
| PPO | unknown | drop the model entirely; learn from interaction |

Every method above requires access to a dynamics model $f$. PPO removes that requirement.

From MPPI to PPO

MPPI

MPPI moves through these distributions in each planning step:

$$\underbrace{\varepsilon \sim \mathcal{N}(0,\Sigma)}_{\text{1. noise dist}} \;\longrightarrow\; \underbrace{u_t = v_t + \varepsilon_t}_{\text{2. control dist}} \;\longrightarrow\; \underbrace{\tau \text{ via } x_{t+1}=f(x_t,u_t)}_{\text{3. trajectory dist}} \;\longrightarrow\; \underbrace{S(\tau)}_{\text{4. cost distribution}}$$

The parameter being optimized is $v_t$ — the mean of the control distribution — a fixed sequence of vectors.

The update is a cost-weighted mean:

$$v_t \leftarrow v_t + \sum_k \tilde{w}^{(k)} \varepsilon_t^{(k)}, \qquad \tilde{w}^{(k)} \propto \exp\!\left(-S^{(k)}/\lambda\right)$$
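As a quick illustration, here is a minimal NumPy sketch of this update (the names and shapes are assumptions for the example: `v` is the $T \times m$ nominal control sequence, `eps` holds $K$ sampled noise sequences, and `S` their trajectory costs):

```python
import numpy as np

def mppi_update(v, eps, S, lam):
    """Cost-weighted MPPI update of the nominal control sequence.

    v   : (T, m) nominal controls v_t
    eps : (K, T, m) sampled noise sequences eps^(k)
    S   : (K,) trajectory cost of each sample
    lam : temperature lambda
    """
    S = S - S.min()                            # shift costs for numerical stability
    w = np.exp(-S / lam)                       # w^(k) proportional to exp(-S^(k)/lam)
    w /= w.sum()                               # normalize the weights
    return v + np.einsum("k,ktm->tm", w, eps)  # cost-weighted mean of the noise
```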

The optimized object $v_t$ is an open-loop control sequence: it does not depend on the states encountered during execution.

PPO

PPO inserts an additional level — a parameterized state-dependent policy:

$$\underbrace{\theta}_{\text{1. param space}} \;\longrightarrow\; \underbrace{\pi_\theta(a \mid s)}_{\text{2. action dist (per state)}} \;\longrightarrow\; \underbrace{a_t \sim \pi_\theta(\cdot \mid s_t)}_{\text{3. trajectory dist}} \;\longrightarrow\; \underbrace{R(\tau)}_{\text{4. reward distribution}}$$

The parameter being optimized is $\theta$ — the weights of a neural network (or other function approximator) that maps any state to a distribution over actions.

The update adjusts $\theta$ so that actions leading to higher reward become more probable.

| | MPPI | PPO |
|---|---|---|
| Optimized object | control sequence $v_{0:T-1}$ | policy params $\theta$ |
| Action at runtime | fixed per step | $a \sim \pi_\theta(\cdot \mid s)$ |
| Feedback | open-loop (replanned each cycle) | closed-loop (state-dependent) |
| Dynamics model | required | not required |

The Bridge: Same Core Mechanism

Despite the structural difference, MPPI and PPO share the same core update principle:

1. Sample trajectories under the current distribution.
2. Weight the samples by reward (or advantage).
3. Update the parameter toward the higher-weight samples.

The proximal constraint in PPO plays the same role as the KL penalty in MPPI’s information-theoretic derivation: both prevent the update from stepping so far from the current distribution that the importance weights become unreliable.

| MPPI | PPO |
|---|---|
| $\min_q\;\mathbb{E}_q[S] + \lambda\,\mathrm{KL}(q \,\Vert\, p_0)$ | $\max_\theta\;\mathbb{E}[\text{clipped IS ratio} \times A]$ |
| KL term limits policy change | clipping limits policy change |

Policy Gradient Setup

The Reinforcement Learning Objective

An agent interacts with an environment: at each step $t$ it observes state $s_t$, takes action $a_t \sim \pi_\theta(\cdot \mid s_t)$, and receives reward $r_t$.

Goal: find $\theta$ to maximize the expected discounted return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$

where $\gamma \in (0,1]$ is a discount factor and $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$.
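For concreteness, a tiny sketch of the quantity inside the expectation, computed from one sampled trajectory's rewards (the reward values here are made up):

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t r_t for one sampled trajectory."""
    G = 0.0
    for r in reversed(rewards):  # backward (Horner-style) accumulation
        G = r + gamma * G
    return G

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.99 + 0.99^2 = 2.9701
```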

Unlike MPPI, no dynamics model $f$ appears. The reward signal arrives directly from the environment. The only requirement is that $\pi_\theta(a \mid s)$ is differentiable in $\theta$.

Reinforcement learning uses an estimated value function $V(s)$ or $Q$-value $Q(s,a)$ to summarize what will happen, instead of an explicit model $f$.

Value Functions and Advantage

Define:

$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t'=t}^T \gamma^{t'-t} r_{t'} \;\Big|\; s_t = s\right]$$

the state-value function — expected return from state $s$ under policy $\pi$.

$$Q^\pi(s,a) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t'=t}^T \gamma^{t'-t} r_{t'} \;\Big|\; s_t = s,\; a_t = a\right]$$

the action-value function — expected return after taking action $a$ in state $s$.

$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$

the advantage — how much better (or worse) action $a$ is than the average action from state $s$.

$A > 0$: this action is better than average — make it more probable. $A < 0$: worse than average — make it less probable.

Advantages as shortcuts

Suppose $r(s) = -1$ for all states except the goal $s_0$, and the blue policy takes $i$ steps to reach $s_0$ from state $s_i$ (take $\gamma = 1$).

So, under the blue policy, $V(s_i) = -i$.

A shortcut action "up" from $s_7$ reaches the goal in one step, so $Q(s_7,\text{"up"}) = -1$.

$$\begin{aligned} \implies A^{\pi}(s_7, \text{"up"}) &= Q(s_7,\text{"up"}) - V(s_7)\\ &= -1 - (-7) \\ &= 6 \end{aligned}$$

The advantage flags the shortcut as much better than the policy's average behavior, without ever consulting a dynamics model.

Stochastic Policy

A simple and commonly used stochastic state-dependent policy is Gaussian:

$$\pi_\theta(a \mid s) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\left(a - \mu_\theta(s)\right)^2}{2\sigma^2} \right)$$

  • $\mu_\theta(s)$ is a state-dependent ‘mean action’.
    Example: at $x = 1$ m, the average force is $-1$ N, pushing a box back toward $x = 0$.
  • $\sigma$ is independent of $\theta$ and $s$ to simplify computations.
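A minimal PyTorch sketch of such a policy (the network width, state/action dimensions, and $\sigma = 0.5$ are illustrative choices, not prescribed values):

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi_theta(a|s) = N(mu_theta(s), sigma^2) with fixed sigma."""
    def __init__(self, state_dim, action_dim, sigma=0.5):
        super().__init__()
        self.mu = nn.Sequential(                  # state-dependent mean mu_theta(s)
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, action_dim),
        )
        self.sigma = sigma                        # independent of theta and s

    def dist(self, s):
        return torch.distributions.Normal(self.mu(s), self.sigma)

policy = GaussianPolicy(state_dim=4, action_dim=1)
s = torch.randn(1, 4)
a = policy.dist(s).sample()                       # a ~ pi_theta(. | s)
logp = policy.dist(s).log_prob(a).sum(-1)         # log pi_theta(a|s), differentiable in theta
```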

Policy Gradient Theorem

The gradient of $J(\theta)$ with respect to $\theta$ is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R(\tau)\right]$$

This is the policy gradient theorem (PGT). Key observations:

  • Only $\log \pi_\theta$ needs to be differentiable — the environment dynamics need not be.
  • $\nabla_\theta \log \pi_\theta(a \mid s)$: the direction in $\theta$-space that increases the probability of $a$ at $s$.
    • For the Gaussian policy above, $\nabla_\theta \log \pi_\theta(a \mid s) = \dfrac{a - \mu_\theta(s)}{\sigma^2}\, \nabla_\theta \mu_\theta(s)$.
    • This direction brings $\mu_\theta(s)$ closer to $a$, increasing the probability of action $a$.
  • Multiplied by $R(\tau)$: higher rewards imply a greater increase in the probability of that action relative to others.

Algorithms using PGT

REINFORCE: Simulate using stochastic actions and implement the PGT as-is, using actual rewards:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R(\tau)\right]$$

Vanilla policy gradient: Subtract a baseline $b(s)$ from the actual rewards, usually $b(s) = V^{\pi}(s)$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \left( R(\tau) - b(s_t) \right)\right]$$

Why: this reduces variance without introducing any bias.

Actor-Critic: Use a critic function to estimate $R(\tau)$ instead of using actual rewards. A popular critic is the advantage function.

$$\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \hat{A}_t$$

The probability of better actions ($\hat{A}_t > 0$) increases while that of worse actions ($\hat{A}_t < 0$) decreases.
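A minimal sketch of this update in PyTorch, reusing the hypothetical `GaussianPolicy` from above and assuming `states`, `actions`, and `advantages` tensors collected from rollouts:

```python
import torch

optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def actor_step(states, actions, advantages):
    logp = policy.dist(states).log_prob(actions).sum(-1)
    # gradient ascent on sum_t log pi_theta(a_t|s_t) * A_hat_t
    loss = -(logp * advantages).mean()   # negate because optimizers minimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```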

Why Actor-Critic Is Unstable

Actor-Critic methods work in principle but fail in practice due to step-size sensitivity:

  • Too small a step → slow convergence, wasted samples
  • Too large a step → the policy changes drastically, old samples are no longer representative, and the update may push into a region from which recovery is difficult

The fundamental issue: gradient ascent in $\theta$-space does not respect the geometry of the probability distribution $\pi_\theta$. A small step in $\theta$ can produce a large change in $\pi_\theta$.
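A one-line check of this for the Gaussian policy above (assuming both policies share the same fixed $\sigma$):

$$\mathrm{KL}\big(\pi_{\theta_\mathrm{old}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big) = \frac{\big(\mu_{\theta_\mathrm{old}}(s) - \mu_\theta(s)\big)^2}{2\sigma^2},$$

so the same step in mean space is a far larger distributional change when $\sigma$ is small.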

TRPO (Trust Region Policy Optimization) solved this by constraining $\mathrm{KL}(\pi_{\theta_\mathrm{old}} \| \pi_\theta) \leq \delta$, but it requires second-order optimization (conjugate gradient + line search) — expensive.

PPO achieves similar stability with a simple first-order trick: clip the importance ratio.

The PPO Objective

Importance Sampling Enters

Suppose we collected data under $\pi_{\theta_\mathrm{old}}$. We want to optimize $\theta$ using those samples — this is importance sampling.

Define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\mathrm{old}}(a_t \mid s_t)}$$

Then the policy gradient objective can be written as:

$$L^\mathrm{IS}(\theta) = \mathbb{E}_t\!\left[r_t(\theta)\, A_t\right]$$

This is exactly the importance-sampling reweighting from MPPI: we collected samples under $\pi_{\theta_\mathrm{old}}$ (analogous to $p_0$) and reweight by $r_t(\theta)$ (analogous to $\exp(-S/\lambda)$) to estimate expectations under the new distribution.

Problem: $A_t$ comes from $\theta_\mathrm{old}$. Unconstrained maximization of $L^\mathrm{IS}$ can push $r_t$ to extreme values. If $\pi_\theta$ moves far from $\pi_{\theta_\mathrm{old}}$ ($r_t$ far from $1$), the gradient estimate becomes unreliable and the update collapses.

The Clipped Surrogate Objective

PPO replaces $L^\mathrm{IS}$ with a clipped version:

$$L^\mathrm{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,A_t,\;\mathrm{clip}(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon)\,A_t\right)\right]$$

where $\varepsilon \approx 0.2$ is a small hyperparameter.

Intuition for the clip:

When $|r_t(\theta) - 1| > \varepsilon$, the gradient of the clipped term with respect to $\theta$ is zero: there is no update once the probabilities have changed too much in moving from $\theta_\mathrm{old}$ to $\theta$.

Intuition for the min:

Makes the clipping one-sided: the objective is clipped only when

  • $A_t > 0$ and the new probability under $\theta$ is already much higher ($r_t > 1+\varepsilon$), or
  • $A_t < 0$ and the new probability under $\theta$ is already much lower ($r_t < 1-\varepsilon$).

Clipping prevents the policy from taking greedy steps that move it outside the region where the current samples are informative — the same motivation as the KL term in MPPI.
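A minimal PyTorch sketch of $L^\mathrm{CLIP}$, assuming the log-probabilities of the sampled actions under the old and current policies are available:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)            # r_t(theta), stable in log space
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    return -torch.min(unclipped, clipped).mean()      # negate: optimizers minimize
```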

Advantage Estimation

Estimating AtA_t in Practice

The true advantage $A_t = Q^\pi(s_t,a_t) - V^\pi(s_t)$ is unknown. Two practical estimators:

Monte Carlo return:

$$\hat{A}_t^\mathrm{MC} = \left(\sum_{t'=t}^T \gamma^{t'-t} r_{t'}\right) - V_\phi(s_t)$$

Unbiased but high variance — a single sampled return can be far from the mean.

TD residual (1-step):

$$\hat{A}_t^\mathrm{TD} = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

Low variance but biased if $V_\phi$ is inaccurate.

Generalized Advantage Estimation (GAE): interpolate with parameter $\hat\lambda \in [0,1]$:

$$\hat{A}_t^\mathrm{GAE} = \sum_{k=0}^{T-t} (\gamma\hat\lambda)^k \delta_{t+k}, \qquad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

$\hat\lambda = 0$: pure TD (low variance, more bias); $\hat\lambda = 1$: Monte Carlo (unbiased, high variance).
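A minimal sketch of the equivalent backward recursion $\hat{A}_t = \delta_t + \gamma\hat\lambda\,\hat{A}_{t+1}$, assuming rewards `r[0..T-1]` and value estimates `V[0..T]` (with `V[T]` bootstrapping the tail, 0 if the episode terminated):

```python
import numpy as np

def gae(r, V, gamma=0.99, lam=0.95):
    T = len(r)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = r[t] + gamma * V[t + 1] - V[t]   # TD residual delta_t
        running = delta + gamma * lam * running  # sum_k (gamma*lam)^k delta_{t+k}
        adv[t] = running
    return adv
```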

The Critic: Learning $V_\phi$

PPO jointly trains two networks (or one with two heads):

Actor $\pi_\theta(a \mid s)$: the policy — updated via $L^\mathrm{CLIP}$.

Critic $V_\phi(s)$: the value function — updated by minimizing the squared error against a target value:

$$L^\mathrm{VF}(\phi) = \mathbb{E}_t\!\left[\left(V_\phi(s_t) - \hat{V}_t^\mathrm{target}\right)^2\right]$$

The full PPO loss also includes an entropy bonus to encourage exploration:

$$L^\mathrm{PPO}(\theta,\phi) = L^\mathrm{CLIP}(\theta) - c_1 L^\mathrm{VF}(\phi) + c_2\, \mathbb{E}_t[\mathcal{H}(\pi_\theta(\cdot \mid s_t))]$$
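Putting the pieces together, a hedged sketch of the combined loss in PyTorch, reusing the hypothetical `ppo_clip_loss` above; `dist` is the current policy's action distribution at the sampled states and `V_pred` the critic's predictions:

```python
def ppo_total_loss(dist, actions, logp_old, adv, V_pred, V_target,
                   eps=0.2, c1=0.5, c2=0.01):
    logp_new = dist.log_prob(actions).sum(-1)
    l_clip = ppo_clip_loss(logp_new, logp_old, adv, eps)  # already negated
    l_vf = ((V_pred - V_target) ** 2).mean()              # critic regression L^VF
    l_ent = dist.entropy().sum(-1).mean()                 # entropy bonus H
    # minimizing this is gradient ascent on L^CLIP - c1 L^VF + c2 H
    return l_clip + c1 * l_vf - c2 * l_ent
```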

There is no equivalent of the critic in MPPI — MPPI uses the raw trajectory cost $S(\tau)$ directly rather than learning a value function from data.

The Algorithm

PPO

Input: policy π_θ, value network V_φ, clip ε, rollout length T, epochs K

Repeat until convergence:
  1. Collect rollout:
       For t = 0 to T-1:
         a_t ~ π_θ(· | s_t),  step environment,  observe r_t, s_{t+1}

  2. Compute advantages and target values:
       δ_t  = r_t + γ V_φ(s_{t+1}) − V_φ(s_t)
       Â_t  = GAE(δ_0:T, λ̂)      (generalized advantage estimate)
       V̂_t  = Â_t + V_φ(s_t)     (target for critic)

  3. Save old policy:  θ_old ← θ

  4. Optimize for K epochs over the collected batch:
       For each minibatch (s_t, a_t, Â_t, V̂_t):
         r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t)
         L^CLIP  = mean[ min( r_t Â_t,  clip(r_t, 1−ε, 1+ε) Â_t ) ]
         L^VF    = mean[ (V_φ(s_t) − V̂_t)² ]
         L^H     = mean[ H(π_θ(· | s_t)) ]   (entropy bonus)
         Update θ, φ by gradient ascent on  L^CLIP − c₁ L^VF + c₂ L^H

  5. Repeat from step 1.

Steps 1–2 collect experience; step 4 reuses it for multiple gradient updates (multiple epochs over the same batch), which is more sample-efficient than one-step REINFORCE. The clip makes multi-epoch reuse safe.

What PPO Gives You

| Property | Explanation |
|---|---|
| Model-free | No dynamics model needed; learns purely from $(s,a,r,s')$ tuples |
| Closed-loop | Policy $\pi_\theta(a \mid s)$ adapts to any visited state |
| Stable updates | Clipping prevents destructive large steps; robust to learning rate |
| Sample reuse | Multiple gradient epochs per rollout batch |
| Continuous or discrete actions | Gaussian policy for continuous; softmax for discrete |
| GPU-friendly | Parallel environment rollouts; batch backprop through the network |

What PPO still requires: enough environment interactions to learn a good value function and policy. Sample efficiency is better than REINFORCE but far lower than model-based methods like MPPI. PPO may need millions of environment steps for complex tasks.
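In practice one rarely implements PPO from scratch; a hedged usage sketch with Stable-Baselines3 (assumes `pip install stable-baselines3 gymnasium`; the environment and step budget are illustrative):

```python
from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1", verbose=1,
            n_steps=2048, n_epochs=10, clip_range=0.2,
            gamma=0.99, gae_lambda=0.95, ent_coef=0.01)
model.learn(total_timesteps=100_000)   # harder tasks may need millions of steps
model.save("ppo_cartpole")
```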

Hyperparameters

| Parameter | Effect | Typical value |
|---|---|---|
| $\varepsilon$ (clip range) | Larger → more aggressive updates | 0.1–0.3 |
| $T$ (rollout length) | Longer → better return estimates | 128–4096 steps |
| $K$ (update epochs) | More → better use of each batch | 3–10 |
| $\hat\lambda$ (GAE) | Bias-variance tradeoff in the advantage | 0.95 |
| $\gamma$ (discount) | Effective planning horizon $\approx 1/(1-\gamma)$ | 0.99 |
| $c_1, c_2$ (loss weights) | Value vs. entropy tradeoff | $c_1=0.5$, $c_2=0.01$ |

PPO is often described as the “default” deep RL algorithm because its performance is robust across a wide range of hyperparameter settings — the clip replaces careful learning-rate tuning.

MPPI vs. PPO

| | MPPI | PPO |
|---|---|---|
| Dynamics model | Required (must simulate $f$) | Not required |
| Control structure | Open-loop sequence $v_{0:T-1}$ | Closed-loop policy $\pi_\theta(a \mid s)$ |
| Distribution chain | noise → control → trajectory → cost | $\theta$ → policy → action → trajectory → reward |
| Optimization object | Control mean $v_t$ | Policy params $\theta$ |
| “Proximity” mechanism | KL penalty $\lambda\,\mathrm{KL}(q \,\Vert\, p_0)$ | Ratio clip $\mathrm{clip}(r_t, 1\pm\varepsilon)$ |
| Sample efficiency | High (model-based) | Lower (model-free) |

RL Algorithm Landscape

Taxonomy of Key Algorithms

| Algorithm | Policy | On/Off-policy | Action space | Needs critic | Key mechanism |
|---|---|---|---|---|---|
| REINFORCE | stochastic | on | discrete / cont. | no | Monte Carlo returns |
| A2C / A3C | stochastic | on | discrete / cont. | yes ($V$) | parallel workers + TD advantage |
| PPO | stochastic | on | discrete / cont. | yes ($V$) | clipped IS ratio; multi-epoch reuse |
| TD3 | deterministic | off | continuous | yes (twin $Q$) | fixes DDPG overestimation |
| SAC | stochastic | off | continuous† | yes (twin $Q$) | entropy regularization; max-entropy framework |

†Discrete SAC exists but is less common.

On-policy algorithms discard data after each update — samples must come from the current policy. Off-policy algorithms store data in a replay buffer and can reuse old transitions, making them more sample-efficient but harder to stabilize.

Choosing an Algorithm

Action space drives the first cut:

  • Discrete actions (game buttons, token selection): REINFORCE, A2C, PPO
  • Continuous actions (joint torques, velocities): PPO, TD3, SAC, DDPG

Sample budget drives the second cut:

  • Data is cheap (fast simulator, parallelizable): on-policy methods (PPO, ACKTR) are simpler and competitive
  • Data is expensive (real hardware, slow sim): off-policy methods (SAC, TD3) extract more from each transition