Optimization Theory and Practice


Unconstrained Optimization


Instructor: Hasan A. Poonawala

Mechanical and Aerospace Engineering
University of Kentucky, Lexington, KY, USA

Topics:
Solutions
Necessary and Sufficient Conditions
Overview of Algorithms

Example: Curve Fitting

Suppose we have data in the form of pairs $\{t_{i},y_{i}\}_{i \in \{1,\dots,m\}}$.

We believe that the model is of the form $\phi(t; \theta) = \theta_{1} + \theta_{2} e^{\frac{(\theta_{3}-t)^2}{\theta_{4}}} + \theta_{5} \cos(\theta_{6} t)$.

A standard approach is to find the values of $\theta_{j}$, $j \in \{1,\dots,6\}$, by solving $\min_{\theta \in \mathbb R^6} \sum_{i=1}^{m} r_{i}^{2}(\theta)$, where $r_{i}(\theta) = y_{i} - \phi(t_{i}; \theta)$.

This is a nonlinear least-squares problem.
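As a concrete illustration, the sketch below solves a problem of this form with scipy.optimize.least_squares. The synthetic data, the true parameter values, and the initial guess are assumptions made for the demo; the residual follows the model above (with $\theta_4 < 0$ so the exponential term stays bounded).

import numpy as np
from scipy.optimize import least_squares

def phi(t, theta):
    # Model from the slide: theta1 + theta2*exp((theta3 - t)^2/theta4) + theta5*cos(theta6*t)
    return theta[0] + theta[1] * np.exp((theta[2] - t) ** 2 / theta[3]) + theta[4] * np.cos(theta[5] * t)

def residuals(theta, t, y):
    # r_i(theta) = y_i - phi(t_i; theta)
    return y - phi(t, theta)

# Synthetic data (assumed for the demo)
rng = np.random.default_rng(0)
theta_true = np.array([1.0, 2.0, 1.0, -2.0, 0.5, 3.0])
t = np.linspace(0.0, 4.0, 50)
y = phi(t, theta_true) + 0.05 * rng.standard_normal(t.size)

# Solve min_theta sum_i r_i(theta)^2 from a rough initial guess
theta0 = np.array([0.5, 1.5, 0.5, -1.0, 0.3, 2.5])
result = least_squares(residuals, theta0, args=(t, y))
print("Estimated theta:", result.x)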

Solutions

Global minimizer

A point $x^{*}$ is a global minimizer if $f(x^*) \leq f(x)$ for all $x \in \mathbb R^n$.

Local minimizer

A point $x^{*}$ is a local minimizer if there is a neighborhood $\mathcal N$ of $x^*$ such that $f(x^*) \leq f(x)$ for all $x \in \mathcal N$, where a neighborhood of $y$ is an open set containing $y$.

Strict local minimizer

A point $x^{*}$ is a strict local minimizer if there is a neighborhood $\mathcal N$ of $x^*$ such that $f(x^*) < f(x)$ for all $x \in \mathcal N$ with $x \neq x^*$.

Taylor’s theorem

Taylor’s Theorem

Suppose that $f \colon \mathbb R^n \to \mathbb R$ is continuously differentiable and that $p \in \mathbb{R}^n$. Then we have that $f(x+p) = f(x) + \nabla f(x+t p)^T p$ for some $t \in (0,1)$. Moreover, if $f$ is twice continuously differentiable, we have that $\nabla f(x+p) = \nabla f(x) + \int_{0}^{1} \nabla^2 f(x+t p)\, p \, dt$ and that $f(x+p) = f(x) + \nabla f(x)^T p + \frac{1}{2} p^T \nabla^2 f(x+t p)\, p$ for some $t \in (0,1)$.
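As a quick numerical sanity check, the quadratic model built at a point $x$ (using $\nabla f(x)$ and $\nabla^2 f(x)$) should match $f(x+p)$ up to an error of order $\|p\|^3$. The test function and points below are assumptions chosen for the demo.

import numpy as np

# Assumed test function f(x) = exp(x1) + x1*x2^2, with hand-derived gradient and Hessian
def f(x):
    return np.exp(x[0]) + x[0] * x[1] ** 2

def grad(x):
    return np.array([np.exp(x[0]) + x[1] ** 2, 2 * x[0] * x[1]])

def hess(x):
    return np.array([[np.exp(x[0]), 2 * x[1]],
                     [2 * x[1], 2 * x[0]]])

x = np.array([0.3, -0.7])
p = np.array([0.5, 0.2])

for scale in [1.0, 0.1, 0.01]:
    q = scale * p
    model = f(x) + grad(x) @ q + 0.5 * q @ hess(x) @ q   # quadratic model at the base point
    print(f"||p|| = {np.linalg.norm(q):.3f}, |f(x+p) - model| = {abs(f(x + q) - model):.2e}")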

First-order Necessary Condition

FONC

If $x^{\star}$ is a local minimizer and $f$ is continuously differentiable in an open neighborhood of $x^{\star}$, then $\nabla f (x^{\star}) = 0$.

Notice that this is not the same as $\nabla f (x^{\star}) = 0 \implies x^{\star}$ is a local minimizer. For example, $f(x) = x^3$ has $f'(0) = 0$, but $x = 0$ is not a local minimizer.

Stationary point

A point $x^{\star}$ is a stationary point if it satisfies $\nabla f(x^{\star})=0$.
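To connect stationary points with the second-order conditions that follow, here is a small one-dimensional sketch (the polynomial is an assumption for the demo): find the roots of $f'$, then classify each one by the sign of $f''$.

import numpy as np

# Assumed demo function f(x) = x^4 - 3x^2, so f'(x) = 4x^3 - 6x and f''(x) = 12x^2 - 6
fpp = lambda x: 12 * x ** 2 - 6

stationary = np.roots([4.0, 0.0, -6.0, 0.0])   # roots of f'(x) = 0
for x in np.sort(stationary.real):
    c = fpp(x)
    kind = "strict local minimizer" if c > 0 else ("local maximizer" if c < 0 else "inconclusive")
    print(f"x = {x:+.4f}: f''(x) = {c:+.2f} -> {kind}")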

Second-order Necessary Condition

SONC

If $x^{\star}$ is a local minimizer of $f$ and $\nabla^2 f$ exists and is continuous in an open neighborhood of $x^{\star}$, then $\nabla f (x^{\star}) = 0$ and $\nabla^2 f (x^{\star})$ is positive semidefinite.

Positive semidefinite matrix

A matrix $M$ is positive semidefinite, denoted $M \succeq 0$, if $x^T M x \geq 0$ for all $x \in \mathbb{R}^n$.

If $M = H^T H$ for some real matrix $H$, then $M$ is positive semidefinite, since $x^T M x = x^T H^T H x = \|Hx\|_2^2 \geq 0$.

Second-order Sufficient Condition

SOSC

Suppose that $\nabla^2 f$ is continuous in an open neighborhood of $x^{\star}$, that $\nabla f (x^{\star}) = 0$, and that $\nabla^2 f (x^{\star})$ is positive definite. Then $x^{\star}$ is a strict local minimizer of $f$.

Positive definite matrix

A matrix $M$ is positive definite, denoted $M \succ 0$, if $x^T M x > 0$ for all $x \in \mathbb{R}^n$, $x \neq 0$.
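In computations, definiteness of a symmetric matrix is typically checked through its eigenvalues rather than from the definition. A minimal sketch (the matrices are assumptions for the demo); it also illustrates the $M = H^T H$ fact above.

import numpy as np

def classify(M, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eig = np.linalg.eigvalsh(M)        # eigenvalues of a symmetric matrix
    if np.all(eig > tol):
        return "positive definite"
    if np.all(eig >= -tol):
        return "positive semidefinite"
    return "not positive semidefinite"

H = np.array([[1.0, 2.0],
              [0.0, 1.0]])
print(classify(H.T @ H))                              # H^T H is PSD; here H is nonsingular, so it is PD
print(classify(np.array([[1.0, 0.0], [0.0, -2.0]])))  # indefinite example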

First-order Sufficient Condition

There is no first-order sufficient condition for $f$ when it is merely differentiable.

However, when $f$ is also convex, there is.

Convex functions

A function $f \colon \mathcal D \to \mathbb R$ is convex if $\mathcal D$ is a convex set and $f( \lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda) f(y)$ for all $x,y \in \mathcal D$ and $\lambda \in [0,1]$.

FOSC for convex ff

When $f$ is convex, any local minimizer $x^{\star}$ is a global minimizer of $f$. If in addition $f$ is differentiable, then any stationary point $x^{\star}$ is a global minimizer of $f$.

Summary

| Condition | What we know | What it tells us | Notes |
|---|---|---|---|
| 1st-order necessary | $x^\star$ is a local minimizer; $f$ differentiable | $\nabla f(x^{\star})=0$ | Proof justifies steepest descent |
| 2nd-order necessary | $x^\star$ is a local minimizer; $f$ and $\nabla^2 f$ exist and are continuous on $\mathcal N$ | $\nabla f(x^{\star})=0$ and $\nabla^2 f(x^{\star})$ is PSD | |
| 2nd-order sufficient | $\nabla f(x^{\star})=0$ and $\nabla^2 f(x^{\star})$ is PD | $x^\star$ is a strict local minimizer | Leads to algorithms for finding local minimizers |
| 1st-order sufficient | $f$ is convex; $x^\star$ is a local minimizer | $x^\star$ is a global minimizer | |
| 1st-order sufficient | $f$ is convex and differentiable; $\nabla f(x^{\star})=0$ | $x^\star$ is a global minimizer | Leads to effective algorithms for finding global minima |

Overview of Algorithms

  • Recall that the algorithms are iterative
  • We need a starting point $\mathbf x_0$
  • … and a method to produce iterates $\mathbf{x}_1, \mathbf{x}_2, \dots$
  • Generally two methods:
    • Line Search
    • Trust Region

Note

The words method and algorithm are used interchangeably

Line Search Methods

  • The algorithm chooses a direction $p_k \in \mathbb{R}^n$
  • Then, minimize $f(x)$ on the line defined by $\mathbf{x}_k + \alpha p_k$. That is, solve $\min_{\alpha>0} f(\mathbf{x}_k + \alpha p_k) \qquad(1)$
    • In practice, choose an $\alpha$ that makes $f(\mathbf{x}_{k+1}) < f(\mathbf{x}_{k})$ rather than solving Equation 1 exactly.
    • A practical method uses back-tracking line search to pick $\alpha$ given $p_k$ (see the sketch below)
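A minimal sketch of back-tracking line search with a sufficient-decrease (Armijo) condition; the parameter values alpha0, rho, and c are common defaults assumed for the demo.

import numpy as np

def backtracking_line_search(f, grad_f, x, p, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until f(x + alpha p) <= f(x) + c * alpha * grad_f(x)^T p."""
    alpha = alpha0
    fx = f(x)
    slope = grad_f(x) @ p          # directional derivative; negative for a descent direction
    while f(x + alpha * p) > fx + c * alpha * slope:
        alpha *= rho               # back-track
    return alpha

# Example use (assumed demo function) along the steepest-descent direction
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
x0 = np.array([1.0, 1.0])
p = -grad_f(x0)
print("alpha =", backtracking_line_search(f, grad_f, x0, p))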

Line Search Methods

  • Two common choices for directions (a small numerical sketch follows):
    • Negative gradient (steepest descent): $p_k = - \nabla f(\mathbf{x}_{k})$
    • Newton direction: $p_k = - (\nabla^2 f(\mathbf{x}_{k}))^{-1}\nabla f(\mathbf{x}_{k})$
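For concreteness, here is a sketch computing both directions for an assumed quadratic test function (in practice the Newton system is solved rather than forming the inverse Hessian):

import numpy as np

# Assumed demo function f(x) = 0.5 * x^T A x - b^T x, with A symmetric positive definite
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_f = lambda x: A @ x - b       # gradient of the quadratic
hess_f = lambda x: A               # constant Hessian

x_k = np.array([2.0, -1.0])
g = grad_f(x_k)

p_sd = -g                                    # steepest-descent direction
p_newton = -np.linalg.solve(hess_f(x_k), g)  # Newton direction: solve H p = -g

print("steepest descent direction:", p_sd)
print("Newton direction:", p_newton)         # for a quadratic, x_k + p_newton is the minimizer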

Example: Line Search Algorithms

$$ \begin{align} \operatorname{minimize} \quad & f(x) = (0.5 x_1 - 2)^2 + (x_2 - 3)^2 \end{align} $$

Important

Newton’s method is not always better

Plotting Code

import numpy as np
import matplotlib.pyplot as plt

# Define the objective function and its gradient and Hessian
def f(x, y):
    # f(x) = (0.5*x1 - 2)^2 + (x2 - 3)^2, matching the example above
    return (0.5 * x - 2) ** 2 + (y - 3) ** 2

def grad_f(x, y):
    """Gradient of f"""
    return np.array([0.5 * x - 2, 2 * (y - 3)])

def hess_f(x, y):
    """Hessian of f (constant, since f is quadratic)"""
    return np.array([[0.5, 0.0],
                     [0.0, 2.0]])


# Steepest Descent (one possible completion of the stub, using a fixed step size alpha)
def steepest_descent(start, alpha=0.1, tol=1e-6, max_iter=50):
    x = np.array(start, dtype=float)
    iterates = [x.copy()]
    for _ in range(max_iter):
        g = grad_f(x[0], x[1])
        if np.linalg.norm(g) < tol:   # stop once the gradient is (nearly) zero
            break
        x = x - alpha * g             # steepest-descent update
        iterates.append(x.copy())
    return x, iterates


# Plotting
x = np.linspace(-1, 4, 100)
y = np.linspace(-1, 5, 100)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

plt.figure(figsize=(8, 6))
# Contour plot of the objective function
contour = plt.contour(X, Y, Z, levels=30, cmap="viridis")
plt.clabel(contour, inline=True, fontsize=8)
plt.colorbar(contour, label="Objective Function Value")

plt.plot(4, 3, 'x', color="blue", markersize=10, label="True Optimum (4, 3)")

# Run Steepest Descent
start_point = [-1.0, 1.0]
optimum, iterates = steepest_descent(start_point)
# Extract iterate points for plotting
iterates = np.array(iterates)
x_iterates, y_iterates = iterates[:, 0], iterates[:, 1]
plt.plot(x_iterates, y_iterates, 'o-', color="blue", label="SD Iterates alpha=0.1")

# Annotate start and end points
plt.annotate("Start", (x_iterates[0], y_iterates[0]), textcoords="offset points", xytext=(-10, 10), ha="center", color="red")
plt.annotate("End", (x_iterates[-1], y_iterates[-1]), textcoords="offset points", xytext=(-10, -15), ha="center", color="red")

# Labels and legend
plt.xlabel("x_1")
plt.ylabel("x_2")
plt.title("Steepest Descent vs Newton's Method for Optimization")
plt.legend()
plt.grid(True)
plt.show()
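To illustrate the comparison suggested by the plot title (and the note that Newton's method is not always better), here is a minimal, self-contained Newton iteration for the same objective; its iterates could be overlaid on the contour plot above. This is a sketch, not part of the original script.

import numpy as np

# Same objective as above: f(x) = (0.5*x1 - 2)^2 + (x2 - 3)^2
grad_f = lambda x, y: np.array([0.5 * x - 2, 2 * (y - 3)])
hess_f = lambda x, y: np.array([[0.5, 0.0], [0.0, 2.0]])

def newton_method(start, tol=1e-6, max_iter=50):
    """Newton iteration: x_{k+1} = x_k + p, where hess_f(x_k) p = -grad_f(x_k)."""
    x = np.array(start, dtype=float)
    iterates = [x.copy()]
    for _ in range(max_iter):
        g = grad_f(x[0], x[1])
        if np.linalg.norm(g) < tol:
            break
        x = x + np.linalg.solve(hess_f(x[0], x[1]), -g)
        iterates.append(x.copy())
    return x, iterates

opt, its = newton_method([-1.0, 1.0])
print("Newton reached", opt, "in", len(its) - 1, "step(s)")   # one step, since f is quadratic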

Trust Region Methods

  • Approximate $f$ near $\mathbf{x}_k$ using a model function $m_k$
  • Minimize $m_k$ near $\mathbf{x}_k$ by solving $$ \begin{align} \min_{p} \quad & m_k\left(\mathbf{x}_k+p\right)\\ \text{s.t.} \quad & \mathbf{x}_k+p \in \text{Trust region} \end{align} $$
  • Common choice: local quadratic Taylor series with a spherical trust region (trust-region Newton’s method); a sketch of one approximate step follows
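One standard way to approximately solve the subproblem, for a quadratic model $m_k(\mathbf{x}_k+p)=f_k+\nabla f_k^T p+\tfrac12 p^T B_k p$ and a spherical region $\|p\|_2 \le \Delta$, is the Cauchy point: minimize the model along the steepest-descent direction within the region. A sketch (the gradient, Hessian approximation, and radius are assumptions for the demo):

import numpy as np

def cauchy_point(g, B, delta):
    """Minimizer of g^T p + 0.5 p^T B p along -g, subject to ||p||_2 <= delta."""
    gnorm = np.linalg.norm(g)
    p_boundary = -(delta / gnorm) * g          # step to the trust-region boundary along -g
    gBg = g @ B @ g
    if gBg <= 0:
        tau = 1.0                              # the model decreases all the way to the boundary
    else:
        tau = min(gnorm ** 3 / (delta * gBg), 1.0)
    return tau * p_boundary

# Assumed demo data: gradient, symmetric Hessian approximation, and radius
g = np.array([1.0, -2.0])
B = np.array([[2.0, 0.0],
              [0.0, 1.0]])
print("Cauchy point step:", cauchy_point(g, B, delta=0.5))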

Line Search and Trust Region Methods

In some cases, the two methods overlap: with a linear model, $$ \begin{align} \min_{p} \quad & m_k\left(\mathbf{x}_k+p\right) = f_k + p^T \nabla f_k\\ \text{s.t.} \quad & \| p\|_2 \leq \alpha \| \nabla f_k\|_2 \end{align} $$ yields $p = - \alpha \nabla f_{k}$.

Exercise: Steepest Descent

$$ \begin{align} \operatorname{minimize} \quad & f(x) = (1 - x_1)^2 + 10 (x_2 - x_1^2)^2 \end{align} $$

The optimum is at $(1,1)$. Test the behavior of the steepest descent algorithm $x_{k+1} = x_{k} - \alpha \nabla f_k$ with various constant values of $\alpha$, starting from $(0,1)$.

Solution

$$ \nabla f = \begin{bmatrix} -2(1-x_1) - 40 (x_2 - x_1^2) x_1\\ 20 (x_2-x_1^2)\end{bmatrix} $$
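One possible way to run the experiment (the step sizes and iteration budget below are assumptions; roughly, a constant step converges only if it is smaller than 2 divided by the largest eigenvalue of the Hessian near the optimum):

import numpy as np

def f(x):
    return (1 - x[0]) ** 2 + 10 * (x[1] - x[0] ** 2) ** 2

def grad_f(x):
    return np.array([-2 * (1 - x[0]) - 40 * (x[1] - x[0] ** 2) * x[0],
                     20 * (x[1] - x[0] ** 2)])

def steepest_descent(x0, alpha, max_iter=20000):
    x = np.array(x0, dtype=float)
    for k in range(max_iter):
        if not np.isfinite(x).all() or np.linalg.norm(x) > 1e6:
            return x, k, "diverged"          # crude divergence guard
        g = grad_f(x)
        if np.linalg.norm(g) < 1e-6:
            return x, k, "converged"
        x = x - alpha * g
    return x, max_iter, "iteration limit"

# Start from (0, 1) as stated in the exercise; assumed step sizes
for alpha in [0.005, 0.015, 0.03]:
    x, k, status = steepest_descent([0.0, 1.0], alpha)
    print(f"alpha = {alpha}: {status} after {k} iterations, x = {np.round(x, 4)}")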