Finite Horizon LQR


Linear Quadratic Regulator


Instructor: Hasan A. Poonawala

Mechanical and Aerospace Engineering
University of Kentucky, Lexington, KY, USA

Topics:
Generalized LQR Setup
Dynamic Programming
Backward Riccati Recursion
Double Integrator Demo

Setup

When Does LQR Apply?

LQR solves optimal control problems where

  • time is discrete, $t \in \{0, 1, 2, \dots, T\}$
  • the dynamics are linear (or affine)
  • the cost is quadratic in state and control

These slides follow Sergey Levine's formulation, which generalizes naturally to iLQR and connects to reinforcement learning.

System and Cost

At each time $t$, the dynamics and cost are:

$$\mathbf{x}_{t+1} = \mathbf{A}_t \begin{bmatrix}\mathbf{x}_t \\ \mathbf{u}_t\end{bmatrix} + \mathbf{a}_t$$

$$c_t(\mathbf{x}_t, \mathbf{u}_t) = \frac{1}{2}\begin{bmatrix}\mathbf{x}_t \\ \mathbf{u}_t\end{bmatrix}^{T} \mathbf{C}_t \begin{bmatrix}\mathbf{x}_t \\ \mathbf{u}_t\end{bmatrix} + \begin{bmatrix}\mathbf{x}_t \\ \mathbf{u}_t\end{bmatrix}^{T} \mathbf{c}_t$$

Total cost over the finite horizon:

$$J = \sum_{t=0}^{T} c_t(\mathbf{x}_t, \mathbf{u}_t)$$

The terminal cost $c_T(\mathbf{x}_T, \mathbf{u}_T)$ is included in the sum. A typical choice is $c_T = \frac{1}{2}\mathbf{x}_T^T \mathbf{C}_{\mathbf{x}\mathbf{x},T} \mathbf{x}_T$ (no control at the final step).

Double Integrator: Applying the Notation

A unit-mass particle: $\ddot{p} = u$. State $\mathbf{x}_t = [p_t,\, v_t]^T$, control $\mathbf{u}_t = [a_t]$.

Discrete dynamics (timestep $\Delta t$):

$$\mathbf{A}_t = \begin{bmatrix}1 & \Delta t & \tfrac{\Delta t^2}{2} \\ 0 & 1 & \Delta t\end{bmatrix}, \quad \mathbf{a}_t = \mathbf{0}$$

Quadratic cost with no cross terms and $\mathbf{c}_t = \mathbf{0}$:

$$\mathbf{C}_t = \begin{bmatrix} q_p & 0 & 0 \\ 0 & q_v & 0 \\ 0 & 0 & r \end{bmatrix}, \quad \mathbf{C}_T = \begin{bmatrix} q_f & 0 \\ 0 & q_f \end{bmatrix}$$

$\mathbf{A}_t$ maps the augmented state $[\mathbf{x}_t;\, \mathbf{u}_t]$ to $\mathbf{x}_{t+1}$; its columns correspond to $[p_t,\, v_t,\, a_t]$.
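The augmented-matrix convention can be sanity-checked numerically. A minimal sketch (the values of $\Delta t$, $p$, $v$, $a$ are arbitrary):

```python
import numpy as np

dt = 0.1
# Augmented dynamics for the double integrator: x_{t+1} = A_t [x_t; u_t]
A_aug = np.array([[1.0, dt, 0.5 * dt**2],
                  [0.0, 1.0, dt]])

p, v, a = 3.0, -1.0, 2.0
x_next = A_aug @ np.array([p, v, a])

# Matches the exact constant-acceleration update over one step
assert np.isclose(x_next[0], p + v * dt + 0.5 * a * dt**2)
assert np.isclose(x_next[1], v + a * dt)
```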

Dynamic Programming

The Q-Function

Define the action-value function (Q-function) at time $t$:

$$\mathbf{Q}_t(\mathbf{x}_t, \mathbf{u}_t) = c_t(\mathbf{x}_t,\mathbf{u}_t) + V_{t+1}^*\!\left(\mathbf{A}_t\begin{bmatrix}\mathbf{x}_t\\\mathbf{u}_t\end{bmatrix} + \mathbf{a}_t\right)$$

Since $V_{t+1}^*(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{V}_{t+1} \mathbf{x} + \mathbf{x}^T \mathbf{v}_{t+1}$ is quadratic, $\mathbf{Q}_t$ is also quadratic:

$$\mathbf{Q}_t(\mathbf{x}_t,\mathbf{u}_t) = \mathrm{const} + \frac{1}{2}\begin{bmatrix}\mathbf{x}_t\\\mathbf{u}_t\end{bmatrix}^{T}\hat{\mathbf{Q}}_t\begin{bmatrix}\mathbf{x}_t\\\mathbf{u}_t\end{bmatrix} + \begin{bmatrix}\mathbf{x}_t\\\mathbf{u}_t\end{bmatrix}^{T}\hat{\mathbf{q}}_t$$

where

$$\hat{\mathbf{Q}}_t = \mathbf{C}_t + \mathbf{A}_t^T \mathbf{V}_{t+1} \mathbf{A}_t, \qquad \hat{\mathbf{q}}_t = \mathbf{c}_t + \mathbf{A}_t^T \mathbf{V}_{t+1} \mathbf{a}_t + \mathbf{A}_t^T \mathbf{v}_{t+1}$$

The Q-Function

๐ฬ‚t=๐‚t+๐€tT๐•t+1๐€t=[QtcostStTStRt]+[AtTBtT]๐•t+1[AtBt]=[QtcostStTStRt]+[AtT๐•t+1AtAtT๐•t+1BtBtT๐•t+1AtBtT๐•t+1Bt]=[Qtcost+AtT๐•t+1AtStT+AtT๐•t+1BtSt+BtT๐•t+1AtRt+BtT๐•t+1Bt]=[๐ฬ‚๐ฑ๐ฑ๐ฬ‚๐ฑ๐ฎ๐ฬ‚๐ฎ๐ฑ๐ฬ‚๐ฎ๐ฎ]๐ชฬ‚t=[qtrt]+[AtT๐•t+1๐štBtT๐•t+1๐št]+[AtT๐ฏt+1BtT๐ฏt+1]=[๐ชฬ‚๐ฑ๐ชฬ‚๐ฎ] \begin{aligned} \hat{\mathbf{Q}}_t &= \mathbf{C}_t + \mathbf{A}_{t}^T \mathbf{V}_{t+1} \mathbf{A}_{t}\\ &= \begin{bmatrix} Q_t^{cost} & S_t^{\!T} \\ S_t& R_t \end{bmatrix} + \begin{bmatrix} A_t^{\!T} \\ B_t^{\!T}\end{bmatrix} \mathbf{V}_{t+1} \begin{bmatrix} A_t & B_t\end{bmatrix}\\ &= \begin{bmatrix} Q_t^{cost} & S_t^{\!T} \\ S_t& R_t \end{bmatrix} + \begin{bmatrix} A_t^{\!T} \mathbf{V}_{t+1} A_t & A_t^{\!T} \mathbf{V}_{t+1} B_t\\ B_t^{\!T} \mathbf{V}_{t+1} A_t & B_t^{\!T} \mathbf{V}_{t+1} B_t\end{bmatrix} \\ &= \begin{bmatrix} Q_t^{cost} + A_t^{\!T} \mathbf{V}_{t+1} A_t & S_t^{\!T} + A_t^{\!T} \mathbf{V}_{t+1} B_t\\ S_t + B_t^{\!T} \mathbf{V}_{t+1} A_t & R_t+ B_t^{\!T} \mathbf{V}_{t+1} B_t\end{bmatrix} = \begin{bmatrix} \hat{\mathbf{Q}}_{\mathbf{x}\mathbf{x}} & \hat{\mathbf{Q}}_{\mathbf{x}\mathbf{u}} \\ \hat{\mathbf{Q}}_{\mathbf{u}\mathbf{x}} & \hat{\mathbf{Q}}_{\mathbf{u}\mathbf{u}} \end{bmatrix}\\ \qquad \hat{\mathbf{q}}_t &= \begin{bmatrix} q_t \\ r_t\end{bmatrix} + \begin{bmatrix} A_t^{\!T} \mathbf{V}_{t+1} \mathbf{a}_t \\ B_t^{\!T}\mathbf{V}_{t+1} \mathbf{a}_t\end{bmatrix} + \begin{bmatrix} A_t^{\!T} \mathbf{v}_{t+1} \\ B_t^{\!T}\mathbf{v}_{t+1} \end{bmatrix} = \begin{bmatrix} \hat{\mathbf{q}}_{\mathbf{x}} \\ \hat{\mathbf{q}}_{\mathbf{u}} \end{bmatrix} \end{aligned}

Optimal Control at Each Step

Minimize $\mathbf{Q}_t$ over $\mathbf{u}_t$. Partition the Q-function matrix:

$$\hat{\mathbf{Q}}_t = \begin{bmatrix} \hat{\mathbf{Q}}_{\mathbf{x}\mathbf{x}} & \hat{\mathbf{Q}}_{\mathbf{x}\mathbf{u}} \\ \hat{\mathbf{Q}}_{\mathbf{u}\mathbf{x}} & \hat{\mathbf{Q}}_{\mathbf{u}\mathbf{u}} \end{bmatrix}, \qquad \hat{\mathbf{q}}_t = \begin{bmatrix} \hat{\mathbf{q}}_{\mathbf{x}} \\ \hat{\mathbf{q}}_{\mathbf{u}} \end{bmatrix}$$

Setting $\nabla_{\mathbf{u}_t} \mathbf{Q}_t = 0$:

$$\boxed{\mathbf{u}_t^* = \mathbf{K}_t \mathbf{x}_t + \mathbf{k}_t}$$

$$
\begin{aligned}
\mathbf{K}_t &= -\hat{\mathbf{Q}}_{\mathbf{u}\mathbf{u}}^{-1}\hat{\mathbf{Q}}_{\mathbf{u}\mathbf{x}} = -\left( R_t + B_t^T \mathbf{V}_{t+1} B_t \right)^{-1} \left( S_t + B_t^T \mathbf{V}_{t+1} A_t \right) \\
\mathbf{k}_t &= -\hat{\mathbf{Q}}_{\mathbf{u}\mathbf{u}}^{-1}\hat{\mathbf{q}}_{\mathbf{u}} = -\left( R_t + B_t^T \mathbf{V}_{t+1} B_t \right)^{-1} \left( r_t + B_t^T \mathbf{V}_{t+1} \mathbf{a}_t + B_t^T \mathbf{v}_{t+1} \right)
\end{aligned}
$$

The optimal policy is affine in the state: $\mathbf{u}_t^* = \mathbf{K}_t \mathbf{x}_t + \mathbf{k}_t$. When $\mathbf{c}_t = 0$ and $\mathbf{a}_t = 0$ for all $t$, then $\mathbf{k}_t = 0$ (linear policy).
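One backward step can be sketched in a few lines of NumPy. The system blocks below are illustrative placeholders (a double-integrator-like $A$, $B$ with unit weights), not taken from the slides; note the use of `solve` rather than an explicit inverse:

```python
import numpy as np

# Hypothetical one-step data: 2 states, 1 input; all numbers are assumptions.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
S = np.zeros((1, 2))            # no state-control cross term
R = np.array([[1.0]])
r = np.zeros(1)
V_next, v_next = np.eye(2), np.zeros(2)
a_t = np.zeros(2)               # no affine drift

# Control-related blocks of the Q-function
Quu = R + B.T @ V_next @ B
Qux = S + B.T @ V_next @ A
qu = r + B.T @ (V_next @ a_t + v_next)

# Gains: solve() is preferred over forming Quu^{-1} explicitly
K = -np.linalg.solve(Quu, Qux)  # feedback gain, shape (1, 2)
k = -np.linalg.solve(Quu, qu)   # feedforward term; zero here since a_t, c_t = 0
```

With the drift and linear cost terms zeroed out, the feedforward term vanishes, matching the linear-policy remark above.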

Value Function Update

Substituting ๐ฎt*\mathbf{u}_t^* back into ๐t\mathbf{Q}_t gives the value function:

Vt*(๐ฑt)=12๐ฑtT๐•t๐ฑt+๐ฑtT๐ฏt+constV_t^*(\mathbf{x}_t) = \frac{1}{2}\mathbf{x}_t^T \mathbf{V}_t \mathbf{x}_t + \mathbf{x}_t^T \mathbf{v}_t + \mathrm{const}

where

๐•t=๐ฬ‚๐ฑ๐ฑ+๐ฬ‚๐ฑ๐ฎ๐Št=๐ฬ‚๐ฑ๐ฑโˆ’๐ฬ‚๐ฑ๐ฎ๐ฬ‚๐ฎ๐ฎโˆ’1๐ฬ‚๐ฎ๐ฑ\mathbf{V}_t = \hat{\mathbf{Q}}_{\mathbf{x}\mathbf{x}} + \hat{\mathbf{Q}}_{\mathbf{x}\mathbf{u}}\mathbf{K}_t = \hat{\mathbf{Q}}_{\mathbf{x}\mathbf{x}} - \hat{\mathbf{Q}}_{\mathbf{x}\mathbf{u}} \hat{\mathbf{Q}}_{\mathbf{u}\mathbf{u}}^{-1}\hat{\mathbf{Q}}_{\mathbf{u}\mathbf{x}}

๐ฏt=๐ชฬ‚๐ฑ+๐ฬ‚๐ฑ๐ฎ๐คt=๐ชฬ‚๐ฑโˆ’๐ฬ‚๐ฑ๐ฎ๐ฬ‚๐ฎ๐ฎโˆ’1๐ชฬ‚๐ฎ\mathbf{v}_t = \hat{\mathbf{q}}_{\mathbf{x}} + \hat{\mathbf{Q}}_{\mathbf{x}\mathbf{u}}\mathbf{k}_t = \hat{\mathbf{q}}_{\mathbf{x}} - \hat{\mathbf{Q}}_{\mathbf{x}\mathbf{u}}\hat{\mathbf{Q}}_{\mathbf{u}\mathbf{u}}^{-1}\hat{\mathbf{q}}_{\mathbf{u}}

The cost-to-go at time tt is again quadratic (in ๐ฑt\mathbf{x}_t) โ€” so the structure repeats at tโˆ’1,tโˆ’2,โ€ฆ,0t-1, t-2, \dots, 0.
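The update $\mathbf{V}_t = \hat{\mathbf{Q}}_{\mathbf{x}\mathbf{x}} + \hat{\mathbf{Q}}_{\mathbf{x}\mathbf{u}}\mathbf{K}_t$ is exactly the Schur complement of the $\mathbf{uu}$ block, and it preserves symmetry. A quick numerical check on a random symmetric positive definite $\hat{\mathbf{Q}}_t$ (2 states, 1 input; sizes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Random symmetric positive definite Q_hat, partitioned as (2 states, 1 input)
M = rng.standard_normal((3, 3))
Q_hat = M @ M.T + 3.0 * np.eye(3)
Qxx, Qxu = Q_hat[:2, :2], Q_hat[:2, 2:]
Qux, Quu = Q_hat[2:, :2], Q_hat[2:, 2:]

K = -np.linalg.solve(Quu, Qux)
V = Qxx + Qxu @ K                               # value-function update
schur = Qxx - Qxu @ np.linalg.inv(Quu) @ Qux    # Schur complement of the uu block

assert np.allclose(V, schur)   # the two forms agree
assert np.allclose(V, V.T)     # V_t stays symmetric
```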

Algorithm

Finite Horizon LQR

Inputs: ๐€t\mathbf{A}_{t}, ๐št\mathbf{a}_t, ๐‚t\mathbf{C}_t, ๐œt\mathbf{c}_t for t=0โ€ฆTt=0\dots T

Backward pass โ€” initialize ๐•T=๐‚๐ฑ๐ฑ,T\mathbf{V}_T = \mathbf{C}_{\mathbf{x}\mathbf{x},T}, ๐ฏT=๐œ๐ฑ,T\mathbf{v}_T = \mathbf{c}_{\mathbf{x},T}; for t=Tโˆ’1,โ€ฆ,0t = T-1,\dots,0:

  1. ๐ฬ‚t=๐‚t+๐€tT๐•t+1๐€t\hat{\mathbf{Q}}_t = \mathbf{C}_t + \mathbf{A}_{t}^T \mathbf{V}_{t+1} \mathbf{A}_{t}
  2. ๐ชฬ‚t=๐œt+๐€tT(๐•t+1๐št+๐ฏt+1)\hat{\mathbf{q}}_t = \mathbf{c}_t + \mathbf{A}_{t}^T(\mathbf{V}_{t+1}\mathbf{a}_t + \mathbf{v}_{t+1})
  3. ๐Št=โˆ’๐ฬ‚๐ฎ๐ฎโˆ’1๐ฬ‚๐ฎ๐ฑ\mathbf{K}_t = -\hat{\mathbf{Q}}_{\mathbf{u}\mathbf{u}}^{-1}\hat{\mathbf{Q}}_{\mathbf{u}\mathbf{x}}, ๐คt=โˆ’๐ฬ‚๐ฎ๐ฎโˆ’1๐ชฬ‚๐ฎ\quad\mathbf{k}_t = -\hat{\mathbf{Q}}_{\mathbf{u}\mathbf{u}}^{-1}\hat{\mathbf{q}}_{\mathbf{u}}
  4. ๐•t=๐ฬ‚๐ฑ๐ฑ+๐ฬ‚๐ฑ๐ฎ๐Št\mathbf{V}_t = \hat{\mathbf{Q}}_{\mathbf{x}\mathbf{x}} + \hat{\mathbf{Q}}_{\mathbf{x}\mathbf{u}}\mathbf{K}_t, ๐ฏt=๐ชฬ‚๐ฑ+๐ฬ‚๐ฑ๐ฎ๐คt\quad\mathbf{v}_t = \hat{\mathbf{q}}_{\mathbf{x}} + \hat{\mathbf{Q}}_{\mathbf{x}\mathbf{u}}\mathbf{k}_t

Forward pass โ€” given ๐ฑ0\mathbf{x}_0, for t=0,โ€ฆ,Tโˆ’1t = 0,\dots,T-1:

  1. ๐ฎt*=๐Št๐ฑt+๐คt\mathbf{u}_t^* = \mathbf{K}_t\mathbf{x}_t + \mathbf{k}_t
  2. ๐ฑt+1=๐€t[๐ฑt;๐ฎt*]+๐št\mathbf{x}_{t+1} = \mathbf{A}_{t}[\mathbf{x}_t;\,\mathbf{u}_t^*] + \mathbf{a}_t
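The backward pass above can be sketched as a standalone function. This is a minimal version assuming time-invariant $\mathbf{A}_t$, $\mathbf{C}_t$ (the demo below handles the same case); the function name is illustrative:

```python
import numpy as np

def lqr_backward(A_aug, a, C, c, n, T):
    """Backward Riccati recursion for finite-horizon LQR.

    A_aug: n x (n+m) augmented dynamics, a: (n,) drift,
    C: (n+m) x (n+m) cost matrix, c: (n+m,) cost vector,
    n: state dimension, T: horizon.
    Returns the list of per-step gains [(K_0, k_0), ..., (K_{T-1}, k_{T-1})].
    """
    V = C[:n, :n].copy()    # terminal V_T = C_xx,T
    v = c[:n].copy()        # terminal v_T = c_x,T
    gains = []
    for _ in range(T):      # t = T-1, ..., 0
        Q = C + A_aug.T @ V @ A_aug            # Q_hat_t
        q = c + A_aug.T @ (V @ a + v)          # q_hat_t
        Qxx, Qxu = Q[:n, :n], Q[:n, n:]
        Qux, Quu = Q[n:, :n], Q[n:, n:]
        K = -np.linalg.solve(Quu, Qux)
        k = -np.linalg.solve(Quu, q[n:])
        V = Qxx + Qxu @ K                      # value-function update
        v = q[:n] + Qxu @ k
        gains.append((K, k))
    gains.reverse()         # stored backward; return in forward time order
    return gains
```

The forward pass then just applies $\mathbf{u}_t^* = \mathbf{K}_t \mathbf{x}_t + \mathbf{k}_t$ and propagates the dynamics, as in the demo below.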

Interactive Demo

Double Integrator: LQR Control

Set initial conditions $\mathbf{x}_0 = [p_0,\, v_0]^T$ and observe the optimal trajectory.

#| standalone: true
#| viewerHeight: 580

from shiny import App, render, ui
import numpy as np

app_ui = ui.page_fluid(
    ui.h4("Finite Horizon LQR โ€” Double Integrator"),
    ui.layout_sidebar(
        ui.sidebar(
            ui.h6("Initial State xโ‚€"),
            ui.input_slider("p0", "Position pโ‚€", min=-5.0, max=5.0, value=3.0, step=0.1),
            ui.input_slider("v0", "Velocity vโ‚€", min=-5.0, max=5.0, value=0.0, step=0.1),
            ui.hr(),
            ui.h6("Horizon & Weights"),
            ui.input_slider("T", "Horizon T", min=5, max=100, value=20, step=5),
            ui.input_slider("qp", "Position weight qp", min=0.1, max=10.0, value=1.0, step=0.1),
            ui.input_slider("qv", "Velocity weight qv", min=0.1, max=10.0, value=1.0, step=0.1),
            ui.input_slider("r",  "Control weight r",   min=0.01, max=5.0, value=1.0, step=0.05),
            ui.input_slider("qf", "Terminal weight qf", min=1.0, max=50.0, value=10.0, step=1.0),
            width=280,
        ),
        ui.output_plot("lqr_plot"),
    ),
)

def server(input, output, session):
    @render.plot
    def lqr_plot():
        import matplotlib.pyplot as plt
        import matplotlib.gridspec as gridspec

        dt = 0.1
        T  = input.T()
        qp, qv, r, qf = input.qp(), input.qv(), input.r(), input.qf()
        x0 = np.array([input.p0(), input.v0()])

        # Augmented dynamics: x_{t+1} = A_aug @ [x_t; u_t]
        A_aug = np.array([[1.0, dt, 0.5*dt**2],
                          [0.0, 1.0, dt]])

        # Cost matrix C_t (3x3 augmented), c_t = 0
        C = np.diag([qp, qv, r])

        # Terminal value function V_T, v_T
        V = np.diag([qf, qf])
        v = np.zeros(2)

        # --- Backward pass ---
        Ks, ks = [], []
        for _ in range(T):
            Q_hat = C + A_aug.T @ V @ A_aug          # 3x3
            q_hat = A_aug.T @ v                       # 3  (a_t=0, c_t=0)
            Qxx, Qxu = Q_hat[:2, :2], Q_hat[:2, 2:]
            Qux, Quu = Q_hat[2:, :2], Q_hat[2:, 2:]
            qx,  qu  = q_hat[:2],     q_hat[2:]
            Quu_inv  = np.linalg.inv(Quu)
            K = -Quu_inv @ Qux                        # 1x2
            k = -Quu_inv @ qu                         # 1
            V = Qxx + Qxu @ K
            v = qx  + Qxu @ k
            Ks.append(K); ks.append(k)
        Ks.reverse(); ks.reverse()

        # --- Forward pass ---
        xs = np.zeros((T + 1, 2))
        us = np.zeros(T)
        Js = np.zeros(T+1)
        xs[0] = x0
        J = 0.0
        for t in range(T):
            u = (Ks[t] @ xs[t] + ks[t]).flatten()
            us[t] = u[0]
            z = np.concatenate([xs[t], u])
            xs[t+1] = A_aug @ z
            J += 0.5*(z @ C @ z)
            Js[t] = J
        J += 0.5*(xs[T] @ np.diag([qf, qf]) @ xs[T])
        Js[T] = J

        # --- Plot ---
        time = np.arange(T + 1) * dt
        fig  = plt.figure(figsize=(8, 6))
        gs   = gridspec.GridSpec(3, 1, hspace=0.55)

        ax1 = fig.add_subplot(gs[0])
        ax1.plot(time, Js, 'b-o', markersize=3, lw=1.5)
        ax1.axhline(0, color='k', lw=0.8, ls='--')
        ax1.set_ylabel("Cost $J$")
        ax1.set_title(f"Optimal Trajectory  (J = {J:.3f})", fontsize=10)
        ax1.grid(True, alpha=0.3)

        ax2 = fig.add_subplot(gs[1])
        ax2.plot(xs[:,0], xs[:, 1], 'g-o', markersize=3, lw=1.5)
        ax2.axhline(0, color='k', lw=0.8, ls='--')
        ax2.set_xlabel("Position $p$")
        ax2.set_ylabel("Velocity $v$")
        ax2.grid(True, alpha=0.3)

        ax3 = fig.add_subplot(gs[2])
        ax3.step(time[:-1], us, 'r-', where='post', lw=1.5)
        ax3.axhline(0, color='k', lw=0.8, ls='--')
        ax3.set_ylabel("Control $u$")
        ax3.set_xlabel("Time (s)")
        ax3.grid(True, alpha=0.3)

        fig.tight_layout()
        return fig

app = App(app_ui, server)

Summary

Key Points

  • $\mathbf{A}_t$, $\mathbf{a}_t$: augmented dynamics (maps $[\mathbf{x}_t;\,\mathbf{u}_t]$ to $\mathbf{x}_{t+1}$)
  • $\mathbf{C}_t$, $\mathbf{c}_t$: quadratic and linear cost parameters
  • $\mathbf{Q}_t(\mathbf{x}_t,\mathbf{u}_t)$: Q-function (action-value function)
  • $\hat{\mathbf{Q}}_t$, $\hat{\mathbf{q}}_t$: matrix and vector of the Q-function
  • $\mathbf{V}_t$, $\mathbf{v}_t$: matrix and vector of the value function $V_t^*$
  • $\mathbf{K}_t$, $\mathbf{k}_t$: feedback gain and feedforward term

Connection to RL: $\mathbf{Q}_t$ is the Q-function; $V_t^*$ is the value function.

Next step: iLQR, which linearizes nonlinear dynamics at each iteration and re-runs LQR.