In the proof of the First Order Necessary Conditions, we saw that choosing $p_k = -\nabla f(x_k)$ and $\alpha_k > 0$ sufficiently small implies $f(x_k + \alpha_k p_k) < f(x_k)$.
However, decreasing $f$ does not imply we reach the optimum if the decrease is too small.
Another strategy, from the Sufficient Conditions, is to solve $\nabla f(x) = 0$.
Now we look at rules for $p_k$ and $\alpha_k$ in an algorithm $x_{k+1} = x_k + \alpha_k p_k$ that will either

- decrease $f$ enough so that $f(x_k) \to f(x^*)$, or
- solve $\nabla f(x) = 0$.
Lipschitz Function
A function $f$ is Lipschitz continuous on a set $\mathcal{D}$ if there exists $L > 0$ such that for any $x, y \in \mathcal{D}$,
$$\|f(x) - f(y)\| \le L \|x - y\|.$$
For twice differentiable functions, $\nabla f$ being Lipschitz continuous with constant $L$ is equivalent to $\|\nabla^2 f(x)\| \le L$ for all $x$.
Steepest Descent
Descent Lemma
Let $f$ be a twice differentiable function whose gradient is Lipschitz continuous with constant $L$ over some convex set $\mathcal{D}$. Then, for all $x, y \in \mathcal{D}$, by Taylor's Theorem,
$$f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|^2.$$
This lemma says that we can define a local quadratic function that is an upper bound for $f$ near a point $x$.
If $y = x - \alpha \nabla f(x)$ and $\alpha = 1/L$, we get
$$f(y) \le f(x) - \frac{1}{2L} \|\nabla f(x)\|^2.$$
If $\nabla f(x) \ne 0$, then $f(y) < f(x)$.
The guaranteed decrease approaches zero only when $\|\nabla f(x)\| \to 0$.
Example
To ensure a decrease, we need $\alpha - \frac{L\alpha^2}{2} > 0$, i.e., $0 < \alpha < 2/L$.
If $\alpha < 2/L$, then the iteration converges.
If $\alpha > 2/L$, then the iteration can diverge.
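The lecture's own example function was not preserved in these notes, but the step-size threshold can be checked on a minimal hypothetical 1-D quadratic: for $f(x) = \frac{L}{2}x^2$, steepest descent gives $x_{k+1} = (1 - \alpha L)x_k$, which converges for $\alpha < 2/L$ and diverges for $\alpha > 2/L$.

```python
import numpy as np

# Hypothetical 1-D quadratic f(x) = (L/2) x^2, so f'(x) = L*x and the
# gradient is Lipschitz with constant L.
L = 4.0
grad = lambda x: L * x

def sd(x0, alpha, iters=50):
    """Run steepest descent x <- x - alpha * f'(x); return |x_final|."""
    x = x0
    for _ in range(iters):
        x = x - alpha * grad(x)
    return abs(x)

print(sd(1.0, 1.0 / L))   # alpha = 1/L: converges (here in one step)
print(sd(1.0, 1.9 / L))   # alpha just below 2/L: converges slowly
print(sd(1.0, 2.1 / L))   # alpha above 2/L: diverges
```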
Rosenbrock Function
Condition Number
Newton’s Method
Motivation
We know that whatever point we are looking for must be a stationary one; it must satisfy $\nabla f(x^*) = 0$.
Newton's method applies the Newton root-finding algorithm to $g(x) = \nabla f(x)$:
To solve $g(x) = 0$, iteratively solve
$$x_{k+1} = x_k - [g'(x_k)]^{-1} g(x_k).$$
Applying root finding to $g(x) = \nabla f(x)$:
$$x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k).$$
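A minimal sketch of this iteration on a hypothetical 1-D objective (not the lecture's example): $f(x) = x^2 + e^x$, whose minimizer solves $g(x) = 2x + e^x = 0$.

```python
import numpy as np

# Hypothetical smooth objective f(x) = x^2 + exp(x)
g  = lambda x: 2 * x + np.exp(x)   # gradient f'(x)
gp = lambda x: 2 + np.exp(x)       # Hessian f''(x), positive everywhere

x = 0.0
for k in range(6):
    x = x - g(x) / gp(x)           # Newton step on g
    print(k, x, abs(g(x)))         # |g(x)| roughly squares each iteration
```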
Example
We want to find the minimizer of
We want an accuracy of $\epsilon$, i.e., stop when $\|\nabla f(x_k)\| \le \epsilon$
We compute
Alternate Motivation
When we take the Newton step, we are minimizing the quadratic approximation $m_k(p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T \nabla^2 f(x_k) p$, which need not have a minimum at all if $\nabla^2 f(x_k)$ is not positive definite.
Example: Quadratic Function
Consider $f(x) = \frac{1}{2} x^T Q x - b^T x$, where $Q$ is symmetric.
What happens when $Q$ is not positive definite?
Example: Rosenbrock Function
Convergence Rate
Theorem 3.5
Suppose that $f$ is twice differentiable and that the Hessian $\nabla^2 f(x)$ is Lipschitz continuous in a neighborhood of a solution $x^*$ at which the sufficient conditions are satisfied. Consider the iteration $x_{k+1} = x_k + p_k$, where $p_k = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$. Then:

- If $x_0$ is sufficiently close to $x^*$, then the sequence of iterates converges to $x^*$
- The rate of convergence of $\{x_k\}$ is quadratic
- The sequence of gradient norms $\{\|\nabla f(x_k)\|\}$ converges quadratically to zero
Newton’s Revenge
Modifications
Although Newton's method is very attractive in terms of its convergence properties near the solution, it requires modification before it can be used at points that are far from the solution.
The following modifications are typically used:
Step-size reduction (damping)
Modifying Hessian to be positive definite
Approximation of Hessian
Damping
A search parameter $\alpha_k$ is introduced,
$$x_{k+1} = x_k - \alpha_k [\nabla^2 f(x_k)]^{-1} \nabla f(x_k),$$
where $\alpha_k$ is selected to minimize $f(x_{k+1})$.
A popular selection method is backtracking line search.
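A minimal backtracking sketch using the Armijo sufficient-decrease condition (the parameter values `rho` and `c` are illustrative defaults, not values from the lecture):

```python
import numpy as np

def backtracking(f, gradfx, x, p, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until the Armijo sufficient-decrease condition
    f(x + alpha p) <= f(x) + c * alpha * grad(x)^T p holds."""
    alpha = alpha0
    fx = f(x)
    slope = gradfx @ p          # directional derivative; negative for descent
    while f(x + alpha * p) > fx + c * alpha * slope:
        alpha *= rho
    return alpha

# Hypothetical test problem: f(x) = 0.5 ||x||^2, damped Newton direction -x
f = lambda x: 0.5 * x @ x
x = np.array([3.0, -4.0])
g = x                           # gradient of f at x
p = -g
alpha = backtracking(f, g, x, p)
print(alpha)                    # the full step alpha = 1 already satisfies Armijo
```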
Positive Definiteness and Scaling
A general class of algorithms is given by $x_{k+1} = x_k - \alpha_k S_k \nabla f(x_k)$, where $S_k$ is a symmetric matrix.
SD: $S_k = I$; Newton: $S_k = [\nabla^2 f(x_k)]^{-1}$.
For small $\alpha_k$, it can be shown that
$$f(x_{k+1}) \approx f(x_k) - \alpha_k \nabla f(x_k)^T S_k \nabla f(x_k) + O(\alpha_k^2).$$
As $\alpha_k \to 0$, the second term on the rhs dominates the third.
To guarantee a decrease in $f$, we must have $\nabla f(x_k)^T S_k \nabla f(x_k) > 0$.
The simplest way to ensure this is to require $S_k \succ 0$ (positive definite).
In practice, Newton's method must be modified to accommodate the possible non-positive definiteness of $\nabla^2 f(x_k)$ at regions far from the solution.
Common approach: replace $\nabla^2 f(x_k)$ with $\nabla^2 f(x_k) + \mu I$ for some $\mu > 0$.
This can be regarded as a compromise between SD ($\mu$ very large) and Newton's method ($\mu = 0$).
Levenberg-Marquardt performs a Cholesky factorization $\nabla^2 f(x_k) + \mu I = LL^T$ for a given value of $\mu$.
This factorization checks implicitly for positive definiteness.
If the factorization fails (matrix not PD), $\mu$ is increased.
The step direction is found by solving $(\nabla^2 f(x_k) + \mu I)\, p_k = -\nabla f(x_k)$.
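A minimal sketch of the factor-and-increase-$\mu$ loop. The multiply-by-10 update and the starting value are illustrative choices; the notes do not specify a schedule.

```python
import numpy as np

def lm_direction(H, g, mu=0.0, factor=10.0):
    """Attempt Cholesky on H + mu*I, increasing mu until the shifted
    matrix is positive definite, then solve (H + mu I) p = -g."""
    n = H.shape[0]
    while True:
        try:
            Lc = np.linalg.cholesky(H + mu * np.eye(n))
            break
        except np.linalg.LinAlgError:     # factorization failed: not PD
            mu = max(factor * mu, 1e-3)   # raise mu and try again
    # Solve with the triangular factors: L y = -g, then L^T p = y
    y = np.linalg.solve(Lc, -g)
    return np.linalg.solve(Lc.T, y), mu

# Hypothetical indefinite Hessian, as might occur far from the solution
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
p, mu = lm_direction(H, g)
print(p, mu)        # mu was raised until H + mu*I became PD
print(g @ p < 0)    # p is a descent direction
```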
Newton’s Revenge Avenged
Nonlinear Least Squares: Localization Example
Problem Definition
Range-Based Localization: Find a position $x \in \mathbb{R}^2$ given noisy range measurements to known beacons.
Given:

- Beacon positions $b_i \in \mathbb{R}^2$, $i = 1, \dots, m$
- Measured distances $d_i$

Residuals: $e_i(x) = \|x - b_i\| - d_i$

Objective: $\min_x f(x) = \frac{1}{2} \sum_{i=1}^{m} e_i(x)^2$

Jacobian: Row $i$ is $J_i(x) = \dfrac{(x - b_i)^T}{\|x - b_i\|}$

This is a unit vector pointing from beacon $b_i$ toward $x$.
This is a common robotics problem: GPS, UWB localization, acoustic positioning
Algorithm Comparison
Three methods to solve $\min_x \frac{1}{2}\|e(x)\|^2$:

| Method | Update Rule | Direction |
|---|---|---|
| Gradient Descent | $x_{k+1} = x_k - \alpha J^T e$ | Steepest descent on $f$ |
| Gauss-Newton | $x_{k+1} = x_k - (J^T J)^{-1} J^T e$ | Newton with $\nabla^2 f \approx J^T J$ |
| Levenberg-Marquardt | $x_{k+1} = x_k - (J^T J + \mu I)^{-1} J^T e$ | Damped Gauss-Newton |

Key Insights:

- GN approximates the Hessian as $J^T J$, ignoring the second-order residual terms $\sum_i e_i(x) \nabla^2 e_i(x)$
- LM blends GD (large $\mu$) and GN (small $\mu$) via the damping parameter $\mu$
- Adaptive $\mu$: increase when a step is rejected, decrease when accepted
Code: Localization Example
```python
import numpy as np
import matplotlib.pyplot as plt

# Beacon layout and true position (illustrative values; the folded setup
# code defining `beacons` and `true_pos` was not preserved in these notes)
beacons = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
true_pos = np.array([3.0, 4.0])

# Generate noisy measurements
np.random.seed(42)
noise_std = 0.1
true_distances = np.linalg.norm(beacons - true_pos, axis=1)
measured_distances = true_distances + np.random.randn(4) * noise_std

def residuals(x):
    """Compute residual vector e(x): e_i = ||x - b_i|| - d_i."""
    return np.linalg.norm(beacons - x, axis=1) - measured_distances

def jacobian(x):
    """Compute Jacobian J(x) with rows J_i = (x - b_i)^T / ||x - b_i||."""
    J = np.zeros((4, 2))
    for i, b in enumerate(beacons):
        diff = x - b
        norm = np.linalg.norm(diff)
        if norm > 1e-10:
            J[i] = diff / norm
    return J

def cost(x):
    """Sum of squared residuals."""
    e = residuals(x)
    return np.sum(e**2)
```
Gradient Descent for NLS
Update: $x_{k+1} = x_k - \alpha J(x_k)^T e(x_k)$

- Direction is the negative gradient of $f(x) = \frac{1}{2}\|e(x)\|^2$, since $\nabla f = J^T e$
- Requires step size selection
- Slow but reliable convergence
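A gradient-descent sketch on a small hypothetical localization instance; the beacon layout, range values, step size, and iteration count below are illustrative, not the lecture's numbers.

```python
import numpy as np

# Hypothetical instance: four beacons, ranges measured from position (3, 4)
beacons = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
d = np.array([5.0, 8.0623, 6.7082, 9.2195])

def e(x):   # residuals e_i = ||x - b_i|| - d_i
    return np.linalg.norm(x - beacons, axis=1) - d

def J(x):   # Jacobian rows (x - b_i)^T / ||x - b_i||
    diff = x - beacons
    return diff / np.linalg.norm(diff, axis=1)[:, None]

x = np.array([5.0, 5.0])        # initial guess
alpha = 0.05                    # fixed step size
for _ in range(500):            # many iterations: linear rate
    x = x - alpha * J(x).T @ e(x)   # gradient of 0.5 ||e||^2 is J^T e
print(x)                        # approaches the true position (3, 4)
```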
Gauss-Newton Method
Update: $x_{k+1} = x_k - (J^T J)^{-1} J^T e$

- Solves the linear system $(J^T J)\, \Delta x = J^T e$
- No step size needed (full step)
- Fast convergence near the solution (quadratic when the residuals are small)
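The same hypothetical instance (illustrative beacon layout and ranges) solved with Gauss-Newton steps; note how few iterations it needs compared to gradient descent.

```python
import numpy as np

# Hypothetical instance: four beacons, ranges measured from position (3, 4)
beacons = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
d = np.array([5.0, 8.0623, 6.7082, 9.2195])

def e(x):   # residuals e_i = ||x - b_i|| - d_i
    return np.linalg.norm(x - beacons, axis=1) - d

def J(x):   # Jacobian rows (x - b_i)^T / ||x - b_i||
    diff = x - beacons
    return diff / np.linalg.norm(diff, axis=1)[:, None]

x = np.array([5.0, 5.0])
for _ in range(10):
    Jx, ex = J(x), e(x)
    x = x - np.linalg.solve(Jx.T @ Jx, Jx.T @ ex)   # full GN step
print(x)    # approaches the true position (3, 4) in a handful of steps
```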
Levenberg-Marquardt Method
Update: $x_{k+1} = x_k - (J^T J + \mu I)^{-1} J^T e$

- Adaptive $\mu$: blend GD and GN
- Large $\mu$: gradient descent (robust)
- Small $\mu$: Gauss-Newton (fast)
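An LM sketch on the same hypothetical instance, with a simple accept/reject rule for adapting $\mu$; the multiply/divide-by-10 schedule and starting $\mu$ are illustrative choices.

```python
import numpy as np

# Hypothetical instance: four beacons, ranges measured from position (3, 4)
beacons = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
d = np.array([5.0, 8.0623, 6.7082, 9.2195])

def e(x):   # residuals e_i = ||x - b_i|| - d_i
    return np.linalg.norm(x - beacons, axis=1) - d

def J(x):   # Jacobian rows (x - b_i)^T / ||x - b_i||
    diff = x - beacons
    return diff / np.linalg.norm(diff, axis=1)[:, None]

x = np.array([5.0, 5.0])
mu = 1.0
fval = 0.5 * e(x) @ e(x)
for _ in range(30):
    Jx, ex = J(x), e(x)
    A = Jx.T @ Jx + mu * np.eye(2)          # damped normal equations
    x_new = x - np.linalg.solve(A, Jx.T @ ex)
    f_new = 0.5 * e(x_new) @ e(x_new)
    if f_new < fval:                        # accepted: trust the model more
        x, fval, mu = x_new, f_new, mu / 10
    else:                                   # rejected: damp harder (GD-like)
        mu *= 10
print(x)    # approaches the true position (3, 4)
```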
Visualization: Iterate Paths
Results Discussion
Observations:

| Method | Iterations | Convergence |
|---|---|---|
| Gradient Descent | Many (~50) | Linear rate |
| Gauss-Newton | Few (~5) | Quadratic near solution |
| Levenberg-Marquardt | Few (~5) | Adaptive, robust |
Key Takeaways:

- GD: Slow but reliable, many iterations
- GN: Fast quadratic convergence, may fail far from solution
- LM: Best of both worlds
  - Large $\mu$: GD-like (robust)
  - Small $\mu$: GN-like (fast)
When to use which?

- GD: When the Jacobian is expensive or the problem is poorly conditioned
- GN: Well-posed problems with a good initial guess
- LM: Default choice for most NLS problems
Summary
You can now minimize $f(x)$ (when unconstrained) iteratively. Find a local minimum by:

- Choosing direction $p_k$
  - Steepest descent with constant step size is easy and reliable but slow
  - Newton's method is fast but expensive and unreliable; it needs modifications for reliable convergence
  - Quasi-Newton methods are a balance, but tricky
- Choosing step size $\alpha_k$
  - Backtracking for SD and modified Newton's method improves performance
  - Strong Wolfe conditions for quasi-Newton methods

Not tackled: constraints, expensive $f$ / $\nabla f$, discrete $x$, non-smooth $f$.