Optimization Theory and Practice


Nonlinear Least Squares


Instructor: Hasan A. Poonawala

Mechanical and Aerospace Engineering
University of Kentucky, Lexington, KY, USA

Topics:
Nonlinear Least Squares
Necessary and Sufficient Conditions
Overview of Algorithms

Forward Problems

We are often trying to work with a model that turns input x \in \mathbb{R}^{n} into output y \in \mathbb{R}^{m}:

y \gets \phi_{\theta}(x),

where \theta \in \mathbb{R}^p are the parameters of the model \phi_{\theta}.

Example: Given joint angles q, where is the robot arm's 'hand'? (Forward Kinematics)

Inverse Problems

The inverse problem is to find x given y and parameters \theta:

x \gets \text{ solution of } y = \phi_{\theta}(x),

where \theta \in \mathbb{R}^p are the parameters of the model \phi_{\theta}.

Example: Given where we want the robot arm's 'hand' to be, what values should we set the joint angles q to? (Inverse Kinematics)

Nonlinear Least Squares

Residual (or error): r(x) = e(x) = y - \phi_{\theta}(x)

Loss: Squared Error:

\begin{aligned} L(x) &= e^T(x) e(x) \end{aligned}

\begin{aligned} \implies \nabla_x L(x) &= -2 e^T(x) J(x)\\ \implies \nabla_x^2 L(x) &= 2 J^T(x) J(x) - 2 \sum_i e_i(x) H_i(x) \end{aligned}

Where J(x) is the Jacobian \nabla_x \phi_{\theta}(x)
and H_i(x) is the Hessian \nabla_x^2 \phi_{i,\theta}(x) of the ith output component
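As a sanity check, the gradient formula above can be verified against finite differences; the model \phi below is a made-up illustrative choice, not one from the slides:

```python
import numpy as np

# Hypothetical model phi: R^2 -> R^3 (illustrative choice only)
def phi(x):
    return np.array([x[0]**2, np.sin(x[1]), x[0]*x[1]])

def jacobian(x):
    # J = d(phi)/dx, shape 3x2
    return np.array([[2*x[0], 0.0],
                     [0.0, np.cos(x[1])],
                     [x[1], x[0]]])

y = np.array([1.0, 0.5, 2.0])
x = np.array([0.7, 1.2])

e = y - phi(x)
L = e @ e

# Analytic gradient from the slide: grad L = -2 e^T J
grad_analytic = -2 * e @ jacobian(x)

# Forward-difference gradient for comparison
eps = 1e-6
grad_fd = np.zeros(2)
for i in range(2):
    xp = x.copy(); xp[i] += eps
    ep = y - phi(xp)
    grad_fd[i] = (ep @ ep - L) / eps

assert np.allclose(grad_analytic, grad_fd, atol=1e-4)
```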

Nonlinear Least Squares

Residual (or error): r(x) = e(x) = y - \phi_{\theta}(x)

Loss: Squared Error:

\begin{aligned} L(x) &= e^T(x) e(x) \end{aligned}

\begin{aligned} \implies \frac{\mathrm{d} L(x)}{\mathrm{d} t} &= 2 e^T(x) \frac{\partial e(x)}{\partial x} \dot x(t) = -2 e^T(x) J(x) \dot x(t) \end{aligned}

Choosing \dot x(t) produces some \frac{\mathrm{d} L(x(t))}{\mathrm{d} t}. What values should we want for these?

Nonlinear Least Squares

We want \frac{\mathrm{d} L(x(t))}{\mathrm{d} t} < 0 so as to minimize L(x(t)) as t \to \infty

We can achieve that if \frac{\mathrm{d} L(x)}{\mathrm{d} t} = -2 e^T(x) B e(x)

\implies solve J(x) \dot x = B e(x) for \dot x, where B = B^T \succ 0

Options:

  1. Steepest descent: \dot x(t) = J^T(x) e(x)
  2. Gauss-Newton direction: \dot x(t) = J^{+} e(x)
  3. Levenberg–Marquardt direction: \dot x(t) = J^{+}_{\text{damped}} e(x)

Dimensions

x \in \mathbb{R}^n and \phi_{\theta}(x) \in \mathbb R^{m},

\implies J(x) \in \mathbb R^{m \times n}, \implies e^T J(x) \in \mathbb R^{1 \times n}

\nabla_x L(x) \in \mathbb{R}^{1 \times n}, \nabla_x^{2} L(x) \in \mathbb{R}^{n \times n}

Assume J(x) = J has rank \min(m,n)

If m < n then J^{+} = J^T ( J J^T )_{m \times m}^{-1} \in \mathbb{R}^{n \times m}

If m > n then J^{+} = ( J^T J )_{n \times n}^{-1} J^T \in \mathbb{R}^{n \times m}

Compute J^{+} using a singular value decomposition of J(x)
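A minimal check that the pseudoinverse formula for the wide case agrees with the SVD-based computation; the matrix J here is an arbitrary full-rank example:

```python
import numpy as np

# Wide case m < n (m=2, n=3), full row rank: J^+ = J^T (J J^T)^{-1}
J = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])
Jp_formula = J.T @ np.linalg.inv(J @ J.T)

# Via SVD: J = U S V^T  =>  J^+ = V S^{-1} U^T (invert the nonzero singular values)
U, s, Vt = np.linalg.svd(J, full_matrices=False)
Jp_svd = Vt.T @ np.diag(1.0 / s) @ U.T

assert np.allclose(Jp_formula, Jp_svd)
assert np.allclose(Jp_formula, np.linalg.pinv(J))   # NumPy's built-in pinv agrees
```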

Aside: Supervised Machine Learning

Suppose we have data in the form of pairs \{x_{i}, y_{i}\}_{i \in \{1,\dots,m\}}.

We believe that the model is of the form \phi(x; \theta) = \theta_{1} + \theta_{2} e^{\frac{(\theta_{3}-x)^2}{\theta_{4}}} + \theta_{5} \cos(\theta_{6} x).

A standard approach is to find the values of \theta_{j}, j \in \{1,\dots,6\}, by solving \min_{\theta \in \mathbb R^6} \sum_{i=1}^{m} r_{i}^{2}(\theta) where r_{i}(\theta) = y_{i} - \phi(x_i; \theta)

This is a nonlinear least-squares problem.

However, the x_i are now fixed data and we search over \theta; this is the opposite of the previous slides.

Nonlinear Least Squares

Residual (or error): r_{i}(\theta) = e_{i}(\theta) = y_{i} - \phi(x_{i};\theta)

Loss: Mean Square Error:

\begin{aligned} L(\theta) &= \frac{1}{m} \sum_{i=1}^{m} r_{i}^{2}(\theta) = \frac{1}{m} e^T(\theta) e(\theta) \end{aligned}

\begin{aligned} \implies \nabla_\theta L(\theta) &= -\frac{2}{m} e^T(\theta) J(\theta)\\ \implies \nabla_\theta^2 L(\theta) &= \frac{2}{m} J^T(\theta) J(\theta) - \frac{2}{m} \sum_i e_i(\theta) H_i(\theta) \end{aligned}

Where J(\theta) is the Jacobian whose ith row is \nabla_\theta \phi(x_i;\theta)
and H_i(\theta) is the Hessian \nabla_\theta^2 \phi(x_i;\theta)

Dimensions

\theta \in \mathbb{R}^p and y_i, \phi(x_i;\theta), e_i \in \mathbb R,

\implies e \in \mathbb R^m, \implies J(\theta) \in \mathbb R^{m \times p}, \implies e^T J(\theta) \in \mathbb R^{1 \times p}

\nabla_\theta L(\theta) \in \mathbb{R}^{1 \times p}, \nabla_\theta^{2} L(\theta) \in \mathbb{R}^{p \times p}

Assume J(\theta) = J has rank \min(m,p)

If m < p then J^{+} = J^T ( J J^T )^{-1} \in \mathbb{R}^{p \times m}

If m > p then J^{+} = ( J^T J )^{-1} J^T \in \mathbb{R}^{p \times m}

Compute J^{+} using a singular value decomposition of J(\theta)

Solutions

Global minimizer

A point x^{*} is a global minimizer if f(x^*) \leq f(x) for all x \in \mathbb R^n.

Local minimizer

A point x^{*} is a local minimizer if there is a neighborhood \mathcal N of x^* where f(x^*) \leq f(x) for all x \in \mathcal N, where a neighborhood of y is an open set containing y.

Strict local minimizer

A point x^{*} is a strict local minimizer if there is a neighborhood \mathcal N of x^* where f(x^*) < f(x) for all x \in \mathcal N with x \neq x^*.

Summary

| Condition | What we know | What it tells us | Notes |
|---|---|---|---|
| 1st Ord Necessary | x^\star is a local minimizer (LM), f differentiable | \nabla f(x^{\star}) = 0 | Proof justifies steepest descent |
| 2nd Ord Necessary | x^\star is LM, f and \nabla^2 f exist, are continuous on \mathcal N | \nabla f(x^{\star}) = 0 and \nabla^2 f(x^{\star}) is PSD | |
| 2nd Ord Sufficient | \nabla f(x^{\star}) = 0 and \nabla^2 f(x^{\star}) is PD | x^\star is a strict LM | Leads to algorithms for finding LMs |
| 1st Ord Sufficient | f is convex, x^\star is LM | x^\star is a global minimizer (GM) | |
| 1st Ord Sufficient | f is convex, differentiable, \nabla f(x^{\star}) = 0 | x^\star is GM | Leads to effective algorithms for finding global minima |
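The conditions in the table can be checked numerically. A quick sketch using the quadratic objective f(x) = (0.5 x_1 - 2)^2 + (x_2 - 3)^2 that appears in a later example, whose minimizer is (4, 3):

```python
import numpy as np

# f(x) = (0.5 x1 - 2)^2 + (x2 - 3)^2, minimized at x* = (4, 3)
def grad_f(x):
    # d/dx1 = 2*(0.5 x1 - 2)*0.5, d/dx2 = 2*(x2 - 3)
    return np.array([0.5*x[0] - 2.0, 2.0*(x[1] - 3.0)])

def hess_f(x):
    # Constant Hessian for this quadratic
    return np.array([[0.5, 0.0], [0.0, 2.0]])

x_star = np.array([4.0, 3.0])

# 1st-order necessary: the gradient vanishes at x*
assert np.allclose(grad_f(x_star), 0.0)

# 2nd-order sufficient: the Hessian at x* is positive definite
assert np.all(np.linalg.eigvalsh(hess_f(x_star)) > 0.0)
```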

Overview of Algorithms

  • Recall that the algorithms are iterative
  • We need a starting point \mathbf x_0
  • … and a method to produce iterates \mathbf{x}_1, \mathbf{x}_2, \dots
  • Generally two methods:
    • Line Search
    • Trust Region

Note

The words method and algorithm are used interchangeably

Line Search Methods

  • The algorithm chooses a direction p_k \in \mathbb{R}^n
  • Then, minimize f(x) on the line defined by \mathbf{x}_k + \alpha p_k. That is, solve \min_{\alpha>0} \quad f(\mathbf{x}_k + \alpha p_k) \qquad(1)
    • In practice, choose an \alpha that makes f(\mathbf{x}_{k+1}) < f(\mathbf{x}_{k}) rather than solving Equation 1 exactly.
    • A practical method uses back-tracking line search to pick \alpha given p_k
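A minimal sketch of back-tracking line search using the standard Armijo sufficient-decrease test; the constants rho and c are conventional choices, not values from the slides:

```python
import numpy as np

def backtracking(f, grad, x, p, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until the Armijo sufficient-decrease condition holds:
    f(x + alpha p) <= f(x) + c * alpha * grad(x)^T p."""
    alpha = alpha0
    fx, gx = f(x), grad(x)
    while f(x + alpha * p) > fx + c * alpha * gx @ p:
        alpha *= rho
    return alpha

# Example: one steepest-descent step on f(x) = (0.5 x1 - 2)^2 + (x2 - 3)^2
f = lambda x: (0.5*x[0] - 2)**2 + (x[1] - 3)**2
grad = lambda x: np.array([0.5*x[0] - 2.0, 2.0*(x[1] - 3.0)])

x = np.array([-1.0, 1.0])
p = -grad(x)                      # descent direction
alpha = backtracking(f, grad, x, p)
assert f(x + alpha * p) < f(x)    # the accepted step decreases f
```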

Line Search Methods

  • Two common choices for directions.
    • Negative gradient (steepest descent): p_k = - \nabla f(\mathbf{x}_{k})
    • Newton direction: p_k = - (\nabla^2 f(\mathbf{x}_{k}))^{-1} \nabla f(\mathbf{x}_{k})

Example: Line Search Algorithms

\begin{align} \operatorname{minimize} & \quad f(x) = (0.5 x_1 - 2)^2 + (x_2 - 3)^2 \end{align}

Important

Newton’s method is not always better

Plotting Code

import numpy as np
import matplotlib.pyplot as plt

# Define the objective function and its gradient and Hessian
def f(x, y):
    return (0.5 * x - 2)**2 + (y - 3)**2

def grad_f(x, y):
    """Gradient of f: [2*(0.5x - 2)*0.5, 2*(y - 3)]"""
    return np.array([0.5 * x - 2, 2 * (y - 3)])

def hess_f(x, y):
    """Hessian of f (constant for this quadratic)"""
    return np.array([[0.5, 0.0], [0.0, 2.0]])


# Steepest Descent
def steepest_descent(start, alpha=0.1, tol=1e-6, max_iter=50):
    x = np.array(start, dtype=float)
    iterates = [x.copy()]
    for _ in range(max_iter):
        g = grad_f(x[0], x[1])
        if np.linalg.norm(g) < tol:
            break
        x = x - alpha * g
        iterates.append(x.copy())
    return x, iterates


# Plotting
x = np.linspace(-1, 4, 100)
y = np.linspace(-1, 5, 100)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

plt.figure(figsize=(8, 6))
# Contour plot of the objective function
contour = plt.contour(X, Y, Z, levels=30, cmap="viridis")
plt.clabel(contour, inline=True, fontsize=8)
plt.colorbar(contour, label="Objective Function Value")

plt.plot(4, 3, 'x', color="blue", markersize=10, label="True Optimum (4, 3)")

# Run Steepest Descent
start_point = [-1.0, 1.0]
optimum, iterates = steepest_descent(start_point)
# Extract iterate points for plotting
iterates = np.array(iterates)
x_iterates, y_iterates = iterates[:, 0], iterates[:, 1]
plt.plot(x_iterates, y_iterates, 'o-', color="blue", label="SD Iterates alpha=0.1")

# Annotate start and end points
plt.annotate("Start", (x_iterates[0], y_iterates[0]), textcoords="offset points", xytext=(-10, 10), ha="center", color="red")
plt.annotate("End", (x_iterates[-1], y_iterates[-1]), textcoords="offset points", xytext=(-10, -15), ha="center", color="red")

# Labels and legend
plt.xlabel("x_1")
plt.ylabel("x_2")
plt.title("Steepest Descent vs Newton's Method for Optimization")
plt.legend()
plt.grid(True)
plt.show()

Trust Region Methods

  • Approximate f near \mathbf{x}_k using a model function m_k
  • Minimize m_k near \mathbf{x}_k by solving \begin{align} \min_{p} & \quad m_k\left(\mathbf{x}_k+p\right)\\ \text{s.t.} & \quad \mathbf{x}_k+p \in \text{Trust region} \end{align}
  • Common choice: local quadratic Taylor series, with a spherical trust region
    (trust-region Newton's method)

Line Search and Trust Region Methods

In some cases, the two methods overlap: \begin{align} \min_{p} & \quad m_k\left(\mathbf{x}_k+p\right) = f_k + p^T \nabla f_k\\ \text{s.t.} & \quad \| p\|_2 \leq \alpha \| \nabla f_k\|_2 \end{align} yields p = -\alpha \nabla f_{k}

Exercise: Steepest Descent

\begin{align} \operatorname{minimize} & \quad f(x) = (1 - x_1)^2 + 10 (x_2 - x_1^2)^2 \end{align}

The optimum is at (1,1). Test the behavior of the steepest descent algorithm x_{k+1} = x_{k} - \alpha \nabla f_k with various constant values of \alpha, starting from (0,1).

Solution

\nabla f = \begin{bmatrix} -2(1-x_1) - 40 (x_2 - x_1^2) x_1\\ 20 (x_2 - x_1^2)\end{bmatrix}
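The experiment can be sketched as follows; the step sizes 0.01 and 0.1 are illustrative choices meant to show one convergent and one divergent run:

```python
import numpy as np

f = lambda x: (1 - x[0])**2 + 10*(x[1] - x[0]**2)**2
grad = lambda x: np.array([-2*(1 - x[0]) - 40*x[0]*(x[1] - x[0]**2),
                           20*(x[1] - x[0]**2)])

def sd(alpha, x0=(0.0, 1.0), iters=5000):
    """Constant-step steepest descent; flags divergence."""
    x = np.array(x0)
    for _ in range(iters):
        x = x - alpha * grad(x)
        if not np.all(np.isfinite(x)) or np.linalg.norm(x) > 1e6:
            return x, False          # diverged
    return x, True

x_small, ok_small = sd(alpha=0.01)
x_large, ok_large = sd(alpha=0.1)

assert ok_small and f(x_small) < f(np.array([0.0, 1.0]))   # small step: decreases f
assert not ok_large   # large step exceeds 2/L near the optimum and diverges
```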

Step size selection

Recap

  • In Unconstrained Optimization we saw theory that suggests algorithms:
    • In the proof of the First Order Necessary Conditions, we saw that choosing p_k = -\nabla f_k and a suitable \alpha_k implies f(x_{k+1}) < f(x_k)
      • However, decreasing f_k does not imply we reach the optimum if the decrease is too small
    • Another strategy, from the Sufficient Conditions, is to solve \nabla f(x) = 0
  • Now we look at rules for \alpha_k and p_k in an algorithm x_{k+1} = x_k + \alpha_k p_k that will
    • decrease f(x_k) enough so that
    • we solve \nabla f(x) = 0

Lipschitz Function

A function f \colon \mathbb{R}^m \to \mathbb{R} is Lipschitz on a set S if for any x, y \in S, \|f(x) - f(y)\| \leq L \|x - y\|

For twice differentiable functions, \nabla f being Lipschitz with constant L implies \nabla^2 f(w) \preceq L I for all w \in S

Steepest Descent

Descent Lemma

Let f be a twice differentiable function whose gradient \nabla f is Lipschitz continuous over some convex set S. Then, for all x, y \in S, by Taylor's Theorem, f(y) = f(x) + \nabla f(x)^T (y-x) + \frac{1}{2} (y-x)^T \nabla^2 f(z) (y-x)

\implies f(y) \leq f(x) + \nabla f(x)^T (y-x) + \frac{L}{2} \|y-x\|^2,

This Lemma says that we can define a local quadratic function that is an upper bound for f near a point x.

If x = x_k and y = x_k - \alpha_k \nabla f_k, we get f_{k+1} \leq f_k - \frac{\alpha_{k} (2 - L \alpha_{k})}{2} \|\nabla f_k\|^2

  • If 0 < \alpha_k < \frac{2}{L}, then f_{k+1} < f_{k}.
  • f_{k} - f_{k+1} approaches zero only when \nabla f_k \to 0

Example

\min_{x \in \mathbb{R}} \quad f(x) = (x-3)^2

\nabla^2 f = 2

To ensure decrease, we need \alpha < \frac{2}{L} = 1

If \alpha = 1, then f_{k+1} = f_{k}

If x_k = x^{\star} + y, then x_{k+1} = x^{\star} - y
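A short check of both claims: \alpha = 1 makes the iterates reflect about x^\star = 3 with no decrease in f, while any \alpha < 1 converges:

```python
f = lambda x: (x - 3)**2
grad = lambda x: 2*(x - 3)

# alpha = 1 = 2/L: the iterate reflects about x* = 3, f does not decrease
x = 5.0                      # x = x* + 2
x1 = x - 1.0*grad(x)         # x* - 2
assert x1 == 1.0 and f(x1) == f(x)

# alpha < 1: guaranteed decrease and convergence
x = 5.0
for _ in range(100):
    x = x - 0.9*grad(x)
assert abs(x - 3.0) < 1e-6
```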

Rosenbrock Function

\log \epsilon = -m k + c \iff k = -\frac{1}{m} \log \epsilon + \frac{c}{m} \iff k = \frac{1}{m} \log \frac{1}{\epsilon} + \frac{c}{m}

Condition Number

f(x,y)=x2+κy2 f(x,y) = x^2 + \kappa y^2

Newton’s Method

Motivation

  • We know that whatever point we are looking for must be stationary; it must satisfy \nabla f(x) = 0.

  • Newton's method applies the Newton root-finding algorithm to \nabla f:

    • To solve g(x) = 0, iteratively solve g(x_k) + \nabla g(x_k) (x_{k+1} - x_k) = 0 \implies x_{k+1} = x_k - (\nabla g(x_k))^{\sharp} g(x_k)
  • Apply root finding to \nabla f(x) = 0: x_{k+1} = x_k - (\nabla^2 f(x_k))^{-1} \nabla f(x_k)

Example

  • We want to find the minimizer of

f(x) = \frac{1}{2} x^2 - \sin{x}, \quad x_0 = \frac{1}{2}.

  • We want an accuracy of \varepsilon = 10^{-5}, i.e., stop when

|x_{k+1} - x_k| < \varepsilon.

  • We compute

f'(x) = x - \cos{x}, \quad f''(x) = 1 + \sin{x}.

\begin{align} x_1 &= \frac{1}{2} - \frac{\frac{1}{2} - \cos{\frac{1}{2}}}{1 + \sin{\frac{1}{2}}} = 0.7552, \\ x_2 &= x_1 - \frac{f'(x_1)}{f''(x_1)} = x_1 - \frac{0.02710}{1.685} = 0.7391, \\ x_3 &= x_2 - \frac{f'(x_2)}{f''(x_2)} = x_2 - \frac{9.461 \times 10^{-5}}{1.673} = 0.7390, \\ x_4 &= x_3 - \frac{f'(x_3)}{f''(x_3)} = x_3 - \frac{1.17 \times 10^{-9}}{1.673} = 0.7390. \end{align}
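The iteration above can be reproduced in a few lines; the stopping test matches the \varepsilon = 10^{-5} criterion:

```python
import math

fp  = lambda x: x - math.cos(x)      # f'(x)
fpp = lambda x: 1 + math.sin(x)      # f''(x)

x = 0.5
for _ in range(100):
    x_new = x - fp(x) / fpp(x)       # Newton step on f'
    if abs(x_new - x) < 1e-5:        # stop when |x_{k+1} - x_k| < eps
        x = x_new
        break
    x = x_new

assert abs(x - 0.7391) < 1e-4        # matches the iterates above
assert abs(fp(x)) < 1e-8             # x is (numerically) stationary
```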

Alternate Motivation

When \nabla^2 f_k \succ 0, we are minimizing the quadratic approximation f(x) \approx f_k + \nabla f_k^T (x-x_k) + \frac{1}{2} (x-x_k)^T \nabla^2 f_k (x-x_k), whose minimizer need not be x^\star

Example: Quadratic Function

Consider f(x) = \frac{1}{2} x^T Q x - b^T x \quad \text{ for } Q = Q^T \succ 0, where x^\star = Q^{-1} b

\begin{align} \text{Ideal update: } && x_{k+1} &= x_k - (x_k - x^\star) = x_k - 1 \cdot (x_k - Q^{-1} b)\\ \text{Newton update:} && x_{k+1} &= x_k + 1 \cdot (-1) \cdot Q^{-1} (Q x_k - b) = x_k - 1 \cdot (x_k - Q^{-1} b)\\ \text{Steepest descent:} && x_{k+1} &= x_k + \alpha_k \cdot (-1) \cdot (Q x_k - b) \end{align}

What happens when Q = Q^T \prec 0?

Example: Rosenbrock Function

Convergence Rate

Theorem 3.5

Suppose that f is twice differentiable and that the Hessian \nabla^2 f is Lipschitz continuous in a neighborhood of a solution x^\star at which the second-order sufficient conditions are satisfied. Consider the iteration x_{k+1} = x_k + p_k where p_k = -(\nabla^2 f_k)^{-1} \nabla f_k. Then

  • If x_0 is sufficiently close to x^\star, then x_k \to x^\star
  • The rate of convergence of \{x_k\} is quadratic
  • The sequence of gradient norms \{\|\nabla f_k\|\} converges quadratically to zero

Newton’s Revenge

Modifications

Although Newton's method has very attractive convergence properties near the solution, it requires modification before it can be used at points that are far from the solution.

The following modifications are typically used:

  • Step-size reduction (damping)
  • Modifying Hessian to be positive definite
  • Approximation of Hessian

Damping

A search parameter \alpha is introduced: \bm{x}_{k+1} = \bm{x}_k - \alpha_k \nabla^2 f(\bm{x}_k)^{-1} \nabla f(\bm{x}_k), where \alpha_k is selected to minimize f.

A popular selection method is backtracking line search.

Positive Definiteness and Scaling

General class of algorithms is given by \bm{x}_{k+1} = \bm{x}_k + \alpha p_k = \bm{x}_k - \alpha B_k \nabla f_k, \qquad(2)

  • SD: B_k = \bm{I}; Newton: B_k = \nabla^2 f(\bm{x}_k)^{-1}.

For small \alpha, it can be shown that f(\bm{x}_{k+1}) = f(\bm{x}_k) - \alpha \nabla f_k^T B_k \nabla f_k + O(\alpha^2).

  • As \alpha \rightarrow 0, the second term on the rhs dominates the third.
  • To guarantee a decrease in f, we must have \nabla f_k^T B_k \nabla f_k > 0.
    • The simplest way to ensure this is to require B_k \succ \bm{0}.

Positive Definiteness and Scaling

  • In practice, Newton's method must be modified to accommodate the possible non-positive definiteness of \nabla^2 f at regions far from the solution.

  • Common approach: B_k = [\mu_k \bm{I} + \nabla^2 f(\bm{x}_k)]^{-1} for some \mu_k > 0.

  • This can be regarded as a compromise between SD (\mu_k very large) and Newton's method (\mu_k = 0).

Levenberg-Marquardt performs a Cholesky factorization for a given value of \mu_k as follows: \mu_k \bm{I} + \nabla^2 f(\bm{x}_k) = \bm{G}^T \bm{G}.

  • This factorization checks implicitly for positive definiteness.

  • If the factorization fails (matrix not PD), \mu_k is increased.

  • The step direction is found by solving \bm{G}^T \bm{G} p_k = -\nabla f_k.
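A sketch of this factor-and-retry loop; the tenfold increase of \mu_k on failure is an assumed schedule, not one specified on the slides:

```python
import numpy as np

def lm_direction(hess, grad_f, mu0=1e-3):
    """Solve (mu I + hess) p = -grad_f, increasing mu tenfold until the
    Cholesky factorization succeeds (i.e., the matrix is PD)."""
    mu = mu0
    n = hess.shape[0]
    while True:
        try:
            # np.linalg.cholesky raises LinAlgError if the matrix is not PD
            G = np.linalg.cholesky(mu*np.eye(n) + hess)
            break
        except np.linalg.LinAlgError:
            mu *= 10.0                       # assumed update schedule
    # Solve G G^T p = -grad via two triangular solves
    y = np.linalg.solve(G, -grad_f)
    return np.linalg.solve(G.T, y), mu

# Indefinite Hessian: the plain Newton direction would not be a descent direction
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
p, mu = lm_direction(H, g)
assert g @ p < 0     # damped direction is a descent direction
```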

Newton’s Revenge Avenged

Nonlinear Least Squares: Localization Example

Problem Definition

Range-Based Localization: Find position x = (x_1, x_2) given noisy range measurements to known beacons.


Given:

  • Beacon positions b_1, b_2, b_3, b_4 \in \mathbb{R}^2
  • Measured distances d_1, d_2, d_3, d_4

Residuals: e_i(x) = \|x - b_i\| - d_i

Objective: \min_x \sum_{i=1}^4 e_i(x)^2 = \min_x \|e(x)\|^2

Jacobian: Row i is J_i = \frac{(x - b_i)^T}{\|x - b_i\|}

This is a unit vector pointing from beacon i toward x.

This is a common robotics problem: GPS, UWB localization, acoustic positioning

Algorithm Comparison

Three methods to solve \min_x \|e(x)\|^2:

| Method | Update Rule | Direction |
|---|---|---|
| Gradient Descent | x_{k+1} = x_k - \alpha J^T e | Steepest descent on \|e\|^2 |
| Gauss-Newton | x_{k+1} = x_k - (J^T J)^{-1} J^T e | Newton with H \approx J^T J |
| Levenberg-Marquardt | x_{k+1} = x_k - (J^T J + \lambda I)^{-1} J^T e | Damped Gauss-Newton |

Key Insights:

  • GN approximates the Hessian as J^T J, ignoring the second-order residual terms \sum e_i H_i
  • LM blends GD (large \lambda) and GN (small \lambda) via the damping parameter
  • Adaptive \lambda: increase when a step is rejected, decrease when accepted

Code: Localization Example

Show setup code
import numpy as np
import matplotlib.pyplot as plt

# Beacon positions and true position (example values; not specified on the slide)
beacons = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
true_pos = np.array([3.0, 4.0])

# Generate noisy measurements
np.random.seed(42)
noise_std = 0.1
true_distances = np.linalg.norm(beacons - true_pos, axis=1)
measured_distances = true_distances + np.random.randn(4) * noise_std

def residuals(x):
    """Compute residual vector e(x) = ||x - b_i|| - d_i"""
    return np.linalg.norm(beacons - x, axis=1) - measured_distances

def jacobian(x):
    """Compute Jacobian matrix J(x) where J_i = (x - b_i)^T / ||x - b_i||"""
    J = np.zeros((4, 2))
    for i, b in enumerate(beacons):
        diff = x - b
        norm = np.linalg.norm(diff)
        if norm > 1e-10:
            J[i] = diff / norm
    return J

def cost(x):
    """Compute sum of squared residuals"""
    e = residuals(x)
    return np.sum(e**2)

Gradient Descent for NLS

Update: x_{k+1} = x_k - \alpha J^T e

  • The direction -J^T e is the negative gradient of \frac{1}{2}\|e\|^2
  • Requires step size \alpha selection
  • Slow but reliable convergence

Gauss-Newton Method

Update: x_{k+1} = x_k - (J^T J)^{-1} J^T e

  • Solves the linear system (J^T J) p = -J^T e
  • No step size needed (full step)
  • Fast quadratic convergence near the solution

Levenberg-Marquardt Method

Update: x_{k+1} = x_k - (J^T J + \lambda I)^{-1} J^T e

  • Adaptive \lambda: blend GD and GN
  • Large \lambda \to gradient descent (robust)
  • Small \lambda \to Gauss-Newton (fast)
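The three updates can be compared on a small self-contained version of the localization problem. The beacon layout, starting point, step size, and damping value below are assumed example values, and the measurements are taken noiseless so that every method should reach the true position:

```python
import numpy as np

# Assumed beacon layout and true position (illustrative values)
beacons = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
true_pos = np.array([3.0, 4.0])
d = np.linalg.norm(beacons - true_pos, axis=1)   # noiseless measurements

def e(x):   # residuals e_i = ||x - b_i|| - d_i
    return np.linalg.norm(beacons - x, axis=1) - d

def J(x):   # row i = (x - b_i)^T / ||x - b_i||
    return (x - beacons) / np.linalg.norm(beacons - x, axis=1)[:, None]

def solve(method, x0=(8.0, 1.0), iters=100, alpha=0.1, lam=1e-2):
    x = np.array(x0)
    for _ in range(iters):
        Jk, ek = J(x), e(x)
        if method == "gd":
            x = x - alpha * Jk.T @ ek                               # gradient descent
        elif method == "gn":
            x = x - np.linalg.solve(Jk.T @ Jk, Jk.T @ ek)           # Gauss-Newton
        else:
            x = x - np.linalg.solve(Jk.T @ Jk + lam*np.eye(2), Jk.T @ ek)  # LM
    return x

for m in ("gd", "gn", "lm"):
    assert np.linalg.norm(solve(m) - true_pos) < 1e-3, m
```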

Visualization: Iterate Paths

Results Discussion

Observations:

Method Iterations Convergence
Gradient Descent Many (~50) Linear rate
Gauss-Newton Few (~5) Quadratic near solution
Levenberg-Marquardt Few (~5) Adaptive, robust

Key Takeaways:

  1. GD: Slow but reliable, many iterations
  2. GN: Fast quadratic convergence, may fail far from solution
  3. LM: Best of both worlds
    • Large \lambda \to GD-like (robust)
    • Small \lambda \to GN-like (fast)

When to use which?

  • GD: When Jacobian is expensive or problem is poorly conditioned
  • GN: Well-posed problems with good initial guess
  • LM: Default choice for most NLS problems

Summary


You can now minimize f(x) (when unconstrained) iteratively. Find a local minimizer x^\star by

  • Choosing direction p_k
    • Steepest descent with constant step size is easy and reliable but slow
    • Newton's method is fast but expensive and unreliable
      • needs modifications for reliable convergence
    • Quasi-Newton methods are a balance, but tricky
  • Choosing step size \alpha_k
    • Back-tracking for SD and modified NM improves performance
    • Strong Wolfe conditions for Quasi-Newton methods
  • Not tackled: constraints, expensive f / \nabla f, discrete x, non-smooth f