ME/AER 647 Systems Optimization I



Instructor: Hasan A. Poonawala

Mechanical and Aerospace Engineering
University of Kentucky, Lexington, KY, USA

Topics:
Wolfe Conditions
Backtracking Line Search
Convergence properties

Step size selection

Recap

  • In Unconstrained Optimization we saw theory that suggests algorithms:
    • In the proof of the First Order Necessary Conditions, we saw that choosing $p_k = -\nabla f_k$ with a sufficiently small $\alpha_k$ implies $f(x_{k+1}) < f(x_k)$
      • However, decreasing $f_k$ does not imply we reach the optimum if the decrease is too small
    • Another strategy, from the Sufficient Conditions, is to solve $\nabla f(x) = 0$
  • Now we look at rules for $\alpha_k$ and $p_k$ in an algorithm $x_{k+1} = x_k + \alpha_k p_k$ that will
    • decrease $f(x_k)$ enough so that
    • we solve $\nabla f(x) = 0$

Lipschitz Function

A function $f \colon \mathbb{R}^m \to \mathbb{R}$ is Lipschitz on a set $S$ if for any $x, y \in S$, $$\|f(x) - f(y)\| \leq L \|x - y\|$$

For twice differentiable functions, $\nabla f$ being $L$-Lipschitz on $S$ implies that $\nabla^2 f(w) \preceq L I$ for all $w \in S$

Steepest Descent

Descent Lemma

Let $f$ be a twice differentiable function whose gradient $\nabla f$ is Lipschitz continuous over some convex set $S$. Then, for all $x, y \in S$, by Taylor’s Theorem, $$f(y) = f(x) + \nabla f(x)^T (y-x) + \frac{1}{2} (y-x)^T \nabla^2 f(z) (y-x)$$ for some $z$ on the segment between $x$ and $y$,

$$\implies f(y) \leq f(x) + \nabla f(x)^T (y-x) + \frac{L}{2} \|y-x\|^2,$$

This Lemma says that we can define a local quadratic function that is an upper bound for $f$ near a point $x$.

If $x = x_k$ and $y = x_k - \alpha_k \nabla f_k$, we get $$f_{k+1} \leq f_k - \frac{\alpha_k (2 - L \alpha_k)}{2} \|\nabla f_k\|^2$$

  • If $0 < \alpha_k < \frac{2}{L}$, then $f_{k+1} < f_k$.
  • $f_k - f_{k+1}$ approaches zero only when $\nabla f_k \to 0$ (a sketch of the resulting iteration follows)
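As a concrete illustration, here is a minimal sketch of steepest descent with a constant step size in $(0, 2/L)$; the function names (`steepest_descent`, `grad_f`) and the use of NumPy are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def steepest_descent(grad_f, x0, L, tol=1e-8, max_iter=10_000):
    """Steepest descent x_{k+1} = x_k - alpha * grad f(x_k) with constant alpha in (0, 2/L)."""
    x = np.asarray(x0, dtype=float)
    alpha = 1.0 / L                       # lies in (0, 2/L), so each step decreases f
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:       # stop once the gradient is (numerically) zero
            break
        x = x - alpha * g
    return x

# Example from the next slide: f(x) = (x - 3)^2, so grad f(x) = 2(x - 3) and L = 2
x_min = steepest_descent(lambda x: 2 * (x - 3), x0=[0.0], L=2.0)   # converges to [3.]
```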

Example

$$\min_{x \in \mathbb{R}} \quad f(x) = (x-3)^2$$

$$\nabla^2 f = 2$$

To ensure decrease, we need $$\alpha < \frac{2}{L} = 1$$

If $\alpha = 1$, then $f_{k+1} = f_k$

If $x_k = x^{\star} + y$, then $x_{k+1} = x^{\star} - y$

Optimal Steepest Descent

Descent Lemma: $$f_{k+1} \leq f_k - \frac{\alpha_k (2 - L \alpha_k)}{2} \|\nabla f_k\|^2$$

The smallest bound on $f_{k+1}$ occurs when $\alpha_k = 1/L$: $$f_{k+1} \leq f_k - \frac{1}{2L} \|\nabla f_k\|^2$$

But what happens when we don’t know $L$?

  • Start with a small $\hat{L}$, check whether $f_{k+1} \leq f_k - \frac{1}{2\hat{L}} \|\nabla f_k\|^2$
  • If not satisfied, double $\hat{L}$, repeat

This is a preview of backtracking
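A minimal sketch of this doubling scheme, assuming user-supplied handles `f` and `grad_f` (the function name and default values are illustrative):

```python
import numpy as np

def step_with_unknown_L(f, grad_f, x, L_hat=1e-3, max_doublings=60):
    """Take one step x - grad_f(x)/L_hat, doubling L_hat until the
    descent-lemma bound f(x_new) <= f(x) - ||grad f(x)||^2 / (2 L_hat) holds."""
    g = grad_f(x)
    fx = f(x)
    for _ in range(max_doublings):
        x_new = x - g / L_hat
        if f(x_new) <= fx - np.dot(g, g) / (2.0 * L_hat):
            return x_new, L_hat     # accepted; L_hat can warm-start the next iteration
        L_hat *= 2.0                # bound violated: the estimate of L was too small
    return x, L_hat                 # give up (should not happen for Lipschitz grad f)
```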

Wolfe Conditions

  • Sufficient decrease condition (Armijo rule) on $\alpha_k$: $f(x_k + \alpha_k p_k) \leq f(x_k) + c_1 \alpha_k \nabla f(x_k)^T p_k$
  • Curvature condition on $\alpha_k$: $\nabla f(x_k + \alpha_k p_k)^T p_k \geq c_2 \nabla f(x_k)^T p_k$, with $0 < c_1 < c_2 < 1$.
  • The sufficient decrease condition prevents large steps that don’t provide enough benefit
    • Through a function that linearly links the decrease in $f$ with the size of $\alpha_k$
  • The curvature condition prevents small steps that don’t provide enough benefit
    • Through a sufficient change in $\nabla f$ (a sketch that checks both conditions follows)
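A minimal sketch of a function that tests whether a candidate step size satisfies both conditions; the handles `f`, `grad_f` and the default constants are assumptions:

```python
import numpy as np

def satisfies_wolfe(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the Wolfe conditions for step alpha along a descent direction p."""
    slope0 = np.dot(grad_f(x), p)      # directional derivative at x; negative if p is a descent direction
    x_new = x + alpha * p
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * slope0
    curvature = np.dot(grad_f(x_new), p) >= c2 * slope0    # slope must have flattened enough
    return sufficient_decrease and curvature
```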

Example Wolfe Conditions

$$\min_{x \in \mathbb{R}} f(x) = (x-3)^2$$

Backtracking

Replaces the curvature condition with a decreasing sequence of $\alpha$.

  • Choose $\bar\alpha > 0$, $c \in (0,1)$, $\rho \in (0,1)$
  • Initialize $\alpha \gets \bar\alpha$    (start with a large step)
  • While $f(x_k + \alpha p_k) > f(x_k) + c \alpha \nabla f(x_k)^T p_k$
    • $\alpha \gets \rho \alpha$ (reduce if too large)

Example values: $\bar\alpha = 1.0$, $c = 0.1$, $\rho = 0.5$
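A direct Python translation of this pseudocode (a sketch: `f` and `grad_f` are assumed handles, and the cap on the number of reductions is an added practical safeguard):

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, p, alpha_bar=1.0, c=0.1, rho=0.5, max_backtracks=50):
    """Shrink alpha from alpha_bar by factor rho until the sufficient decrease condition holds."""
    alpha = alpha_bar
    fx = f(x)
    slope = np.dot(grad_f(x), p)       # negative if p is a descent direction
    for _ in range(max_backtracks):
        if f(x + alpha * p) <= fx + c * alpha * slope:
            break                      # Armijo condition satisfied
        alpha *= rho                   # step too large: reduce it
    return alpha
```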

Backtracking Example

$$f(x) = (x-3)^2$$

Convergence Analysis

Convergence

Informally

Will $x_k \to x^{\star}$?

Weaker

Will $f(x_k) \to f(x^{\star})$?

Still weaker

Will $\|\nabla f(x_k)\| \to 0$?

Zoutendijk Condition

  • Involves the angle between the steepest descent direction and the chosen direction $p_k$: $$\cos \theta_k = \frac{-\nabla f_k^T p_k}{\|\nabla f_k\| \|p_k\|}$$
  • If $p_k = -\nabla f_k$, then $\cos \theta_k = 1$ and $\theta_k = 0$
  • If $\theta_k$ is $\pm 89^{\circ}$, then $\cos \theta_k > 0$ but small
  • If $|\theta_k| \geq 90^{\circ}$, then $\cos \theta_k \leq 0$, and $p_k$ is not a strict descent direction

Theorem

Consider any iteration of the form $x_{k+1} = x_k + \alpha_k p_k$, where $p_k$ is a descent direction and $\alpha_k$ satisfies the Wolfe conditions. Suppose that $f$ is bounded below in $\mathbb{R}^n$ and that $f$ is continuously differentiable in an open set $\mathcal{N}$ containing the level set $\mathcal{L} = \{x \colon f(x) \leq f(x_0)\}$. Assume also that the gradient $\nabla f$ is Lipschitz continuous on $\mathcal{N}$. Then $$\sum_{k=0}^{\infty} \cos^2 \theta_k \|\nabla f_k\|^2 < \infty$$

Convergence from Zoutendijk Condition

If we choose $p_k$ such that $\cos \theta_k \geq \delta > 0$ for all $k \geq 0$, then the Zoutendijk condition implies that $$\lim_{k \to \infty} \|\nabla f_k\| = 0$$

  • If $B_k$ is positive definite with uniformly bounded condition number ($\|B_k\| \|B_k^{-1}\| \leq M < \infty$), then for $p_k = -B_k^{-1} \nabla f_k$, $$\cos \theta_k \geq \frac{1}{M}$$
    • What algorithm does this conclusion apply to when $B_k = I$ or $B_k = \nabla^2 f \succ 0$?

Similar results apply to $\alpha_k$ chosen using backtracking line search.

Convergence Rate

Informally

How fast does $x_k \to x^{\star}$?

Less informally

The convergence rate measures how fast the error in the solution decreases. The error can be in terms of the optimum $\|x_k - x^\star\|$, the optimal value $|f(x_k) - f(x^\star)|$, or $\|\nabla f(x_k)\|$.

Example

If $\|x_{k+1} - x^\star\| \leq \rho \|x_k - x^\star\|$ for a constant $\rho < 1$, then the convergence is linear. Moreover, the number of iterations needed to reach an error of $\epsilon$ is $k = \mathcal{O}\left(\log \frac{1}{\epsilon}\right)$, as shown below.
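To see the iteration count, apply the linear rate recursively:

$$\|x_k - x^\star\| \leq \rho^k \|x_0 - x^\star\| \leq \epsilon \quad \text{whenever} \quad k \geq \frac{\log\left(\|x_0 - x^\star\| / \epsilon\right)}{\log(1/\rho)} = \mathcal{O}\left(\log \frac{1}{\epsilon}\right)$$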

Rosenbrock Function

$$\log \epsilon = -mk + c \iff k = -\frac{1}{m} \log \epsilon + \frac{c}{m} = \frac{1}{m} \log \frac{1}{\epsilon} + \frac{c}{m}$$

Condition Number

$$f(x,y) = x^2 + \kappa y^2$$

Newton’s Method

Motivation

  • We know that any point we are looking for must be a stationary point, i.e., it must satisfy $\nabla f(x) = 0$.

  • Newton’s method applies the Newton root-finding algorithm to $\nabla f$:

    • To solve $g(x) = 0$, iteratively solve the linearization $$g(x_k) + \nabla g(x_k)^T (x_{k+1} - x_k) = 0 \implies x_{k+1} = x_k - \left(\nabla g(x_k)^T\right)^{-1} g(x_k)$$
  • Apply root finding to $\nabla f(x) = 0$: $$x_{k+1} = x_k - \left(\nabla^2 f(x_k)\right)^{-1} \nabla f(x_k)$$ (a sketch follows)
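A minimal sketch of one multivariate Newton step, assuming handles `grad_f` and `hess_f`; the linear system is solved directly rather than forming the inverse:

```python
import numpy as np

def newton_step(grad_f, hess_f, x):
    """One pure Newton step: solve hess_f(x) p = -grad_f(x), then return x + p."""
    p = np.linalg.solve(hess_f(x), -grad_f(x))   # avoids explicitly inverting the Hessian
    return x + p
```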

Example

  • We want to find the minimizer of

$$f(x) = \frac{1}{2}x^2 - \sin{x}, \quad x_0 = \frac{1}{2}.$$

  • We want an accuracy of $\varepsilon = 10^{-5}$, i.e., stop when

$$|x_{k+1} - x_k| < \varepsilon.$$

  • We compute

$$f'(x) = x - \cos{x}, \quad f''(x) = 1 + \sin{x}.$$

\begin{align} x_1 &= \frac{1}{2} - \frac{\frac{1}{2} - \cos{\frac{1}{2}}}{1 + \sin{\frac{1}{2}}} = 0.7552, \\ x_2 &= x_1 - \frac{f'(x_1)}{f''(x_1)} = x_1 - \frac{0.02710}{1.685} = 0.7391, \\ x_3 &= x_2 - \frac{f'(x_2)}{f''(x_2)} = x_2 - \frac{9.461 \times 10^{-5}}{1.673} = 0.7390, \\ x_4 &= x_3 - \frac{f'(x_3)}{f''(x_3)} = x_3 - \frac{1.17 \times 10^{-9}}{1.673} = 0.7390. \end{align}
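The same iteration can be reproduced numerically; this is a sketch (function and variable names are illustrative) using the stopping rule stated above:

```python
import math

def newton_1d(x0=0.5, eps=1e-5, max_iter=50):
    """Newton's method for f(x) = x^2/2 - sin(x): iterate on f'(x) = x - cos(x)."""
    x = x0
    for _ in range(max_iter):
        x_new = x - (x - math.cos(x)) / (1.0 + math.sin(x))   # x - f'(x)/f''(x)
        if abs(x_new - x) < eps:                              # stop when |x_{k+1} - x_k| < eps
            return x_new
        x = x_new
    return x

# newton_1d() returns approximately 0.73909, matching the iterates above
```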

Alternate Motivation

When $\nabla^2 f_k \succ 0$, we are minimizing the quadratic approximation $$f(x) \approx f_k + \nabla f_k^T (x - x_k) + \frac{1}{2} (x - x_k)^T \nabla^2 f_k (x - x_k),$$ which need not have its minimum at $x^\star$

Example: Quadratic Function

Consider $$f(x) = \frac{1}{2} x^T Q x - b^T x \quad \text{for } Q = Q^T \succ 0,$$ where $x^\star = Q^{-1} b$. \begin{align} \text{Ideal update:} && x_{k+1} &= x_k - (x_k - x^\star) = x_k - 1 \cdot (x_k - Q^{-1} b) \\ \text{Newton update:} && x_{k+1} &= x_k + 1 \cdot (-1) \cdot Q^{-1} (Q x_k - b) = x_k - 1 \cdot (x_k - Q^{-1} b) \\ \text{Steepest descent:} && x_{k+1} &= x_k + \alpha_k \cdot (-1) \cdot (Q x_k - b) \end{align}

What happens when $Q = Q^T \prec 0$?

Example: Rosenbrock Function

Convergence Rate

Theorem 3.5

Suppose that $f$ is twice differentiable and that the Hessian $\nabla^2 f$ is Lipschitz continuous in a neighborhood of a solution $x^\star$ at which the second-order sufficient conditions are satisfied. Consider the iteration $x_{k+1} = x_k + p_k$ where $p_k = -\nabla^2 f_k^{-1} \nabla f_k$. Then

  • If $x_0$ is sufficiently close to $x^\star$, then $x_k \to x^\star$
  • The rate of convergence of $\{x_k\}$ is quadratic
  • The sequence of gradient norms $\{\|\nabla f_k\|\}$ converges quadratically to zero

Proof Outline

Here, we directly look at $\|x_{k+1} - x^\star\|$, using $\nabla f(x^\star) = 0$:

\begin{aligned} x_{k+1} - x^\star &= x_k - \nabla^2 f_k^{-1} \nabla f_k - x^\star \\ &= \nabla^2 f_k^{-1} \left( \nabla^2 f_k (x_k - x^\star) - \left( \nabla f_k - \nabla f(x^\star) \right) \right) \end{aligned}

Under the assumption that $\nabla^2 f$ is $L$-Lipschitz, using Taylor’s theorem, we have $$\|\nabla^2 f_k (x_k - x^\star) - (\nabla f_k - \nabla f(x^\star))\| \leq \|x_k - x^\star\|^2 \int_0^1 L t \, dt$$

Combining the two relations above gives

$$\|x_{k+1} - x^\star\| \leq \|\nabla^2 f_k^{-1}\| \frac{L}{2} \|x_k - x^\star\|^2$$

For $\|x_k - x^\star\| \leq r$ with $r$ sufficiently small, we have $\|\nabla^2 f_k^{-1}\| \leq 2 \|\nabla^2 f(x^\star)^{-1}\|$, so defining $\tilde{L} = L \|\nabla^2 f(x^\star)^{-1}\|$ we can rewrite this as

$$\|x_{k+1} - x^\star\| \leq \tilde{L} \|x_k - x^\star\|^2$$

If $\|x_0 - x^\star\| < \min\left(r, \frac{1}{\tilde{L}}\right)$, then we get quadratic convergence.

Newton’s Revenge

Modifications

Although Newton’s method is very attractive in terms of its convergence properties near the solution, it requires modification before it can be used at points that are far from the solution.

The following modifications are typically used:

  • Step-size reduction (damping)
  • Modifying Hessian to be positive definite
  • Approximation of Hessian

Damping

A search parameter $\alpha$ is introduced: $$\bm{x}_{k+1} = \bm{x}_k - \alpha_k \nabla^2 f(\bm{x}_k)^{-1} \nabla f(\bm{x}_k),$$ where $\alpha_k$ is selected to (approximately) minimize $f$ along the Newton direction.

A popular selection method is backtracking line search, as sketched below.
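A minimal sketch of a damped Newton step, reusing the `backtracking_line_search` sketch from earlier; all function handles are assumptions:

```python
import numpy as np

def damped_newton_step(f, grad_f, hess_f, x):
    """Newton direction with a backtracked (damped) step size."""
    p = np.linalg.solve(hess_f(x), -grad_f(x))          # Newton direction
    alpha = backtracking_line_search(f, grad_f, x, p)   # damping via the Armijo condition
    return x + alpha * p
```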

Positive Definiteness and Scaling

The general class of algorithms is given by $$\bm{x}_{k+1} = \bm{x}_k + \alpha p_k = \bm{x}_k - \alpha B_k \nabla f_k, \qquad(1)$$

  • SD: $B_k = \bm{I}$, Newton: $B_k = \nabla^2 f(\bm{x}_k)^{-1}$.

For small $\alpha$, it can be shown that $$f(\bm{x}_{k+1}) = f(\bm{x}_k) - \alpha \nabla f_k^T B_k \nabla f_k + O(\alpha^2).$$

  • As $\alpha \rightarrow 0$, the second term on the rhs dominates the third.
  • To guarantee a decrease in $f$, we must have $\nabla f_k^T B_k \nabla f_k > 0$.
    • The simplest way to ensure this is to require $B_k \succ \bm{0}$.

Positive Definiteness and Scaling

  • In practice, Newton’s method must be modified to accommodate the possible non-positive definiteness of $\nabla^2 f$ at regions far from the solution.

  • Common approach: $B_k = [\mu_k \bm{I} + \nabla^2 f(\bm{x}_k)]^{-1}$ for some $\mu_k > 0$.

  • This can be regarded as a compromise between SD ($\mu_k$ very large) and Newton’s method ($\mu_k = 0$).

Levenberg-Marquardt performs a Cholesky factorization for a given value of $\mu_k$ as follows: $$\mu_k \bm{I} + \nabla^2 f(\bm{x}_k) = \bm{G}^T \bm{G}.$$

  • This factorization checks implicitly for positive definiteness.

  • If the factorization fails (matrix not PD), $\mu_k$ is increased.

  • The step direction is found by solving $\bm{G}^T \bm{G} p_k = -\nabla f_k$ (a sketch follows).
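A minimal sketch of this modification in Python. Note that NumPy's `cholesky` returns a lower-triangular factor, so the two triangular solves are written for $G G^T p_k = -\nabla f_k$; the growth factor and starting shift are illustrative assumptions:

```python
import numpy as np

def modified_newton_direction(grad, hess, mu=0.0, growth=10.0, max_tries=20):
    """Solve (mu*I + hess) p = -grad, increasing mu until the shifted matrix is PD."""
    n = hess.shape[0]
    for _ in range(max_tries):
        try:
            G = np.linalg.cholesky(mu * np.eye(n) + hess)   # raises LinAlgError if not PD
            z = np.linalg.solve(G, -grad)                   # forward solve  G z = -grad
            return np.linalg.solve(G.T, z)                  # backward solve G^T p = z
        except np.linalg.LinAlgError:
            mu = growth * max(mu, 1e-4)                     # not PD: increase the shift
    raise RuntimeError("failed to obtain a positive definite matrix")
```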

Newton’s Revenge Avenged

Quasi-Newton Method

Computes $B_k \approx \nabla^2 f_k$ as a solution to the secant equation: $$B_{k+1} \underbrace{(x_{k+1} - x_k)}_{s_k} = \underbrace{\nabla f_{k+1} - \nabla f_k}_{y_k}$$

Then solve $p_k = -B_k^{-1} \nabla f_k$

One dimension

$$\frac{\partial^2 f}{\partial x^2} \approx \frac{\frac{\partial}{\partial x} f(x_{k+1}) - \frac{\partial}{\partial x} f(x_k)}{x_{k+1} - x_k}$$

Quasi-Newton Method

Algorithms for updating $B_k$:

  1. Broyden, Fletcher, Goldfarb, and Shanno (BFGS): $$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$$

  2. Symmetric Rank-One (SR1): $$B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}$$

Algorithms for updating $H_k = B_k^{-1}$ (a sketch using this update follows):

$$H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \quad \rho_k = \frac{1}{y_k^T s_k}$$ $$p_k = -H_k \nabla f_k$$
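A minimal BFGS sketch built around the inverse-Hessian update above. The line search reuses the earlier `backtracking_line_search` sketch (in practice a strong-Wolfe line search is preferred, as noted on the next slide), and all handles, tolerances, and the curvature safeguard are illustrative assumptions:

```python
import numpy as np

def bfgs(f, grad_f, x0, tol=1e-6, max_iter=200):
    """Quasi-Newton iteration with the BFGS update of H_k (inverse-Hessian approximation)."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                      # H_0 = I
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                     # p_k = -H_k grad f_k
        alpha = backtracking_line_search(f, grad_f, x, p)
        x_new = x + alpha * p
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g
        ys = float(y @ s)
        if ys > 1e-12:                 # skip the update if curvature information is unreliable
            rho = 1.0 / ys
            I = np.eye(n)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x
```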

Quasi-Newton Method: Details

  • Initialization: Start with $H_0 = I$

  • Step size: $\alpha_k = 1$ for all $k$ may interfere with $H_k$ approximating the inverse Hessian

    • Backtracking isn’t enough to choose a good $\alpha_k$
    • Choose $\alpha_k$ to satisfy the strong Wolfe conditions
    • Details are very technical (numerical computing challenges)

Example: Rosenbrock Function

Summary

Summary

You can now minimize $f(x)$ (when unconstrained) iteratively. Find a local minimizer $x^\star$ by

  • Choosing direction $p_k$
    • Steepest descent with a constant step size is easy and reliable but slow
    • Newton’s method is fast but expensive and unreliable
      • needs modifications for reliable convergence
    • Quasi-Newton methods are a balance, but tricky
  • Choosing step size $\alpha_k$
    • Backtracking for steepest descent and modified Newton’s method improves performance
    • Strong Wolfe conditions for Quasi-Newton methods
  • Not tackled: constraints, expensive $f$/$\nabla f$, discrete $x$, non-smooth $f$