ME/AER 647 Systems Optimization I


Applications


Instructor: Hasan A. Poonawala

Mechanical and Aerospace Engineering
University of Kentucky, Lexington, KY, USA

Topics:
Scan Matching
Inverse Kinematics
Machine Learning

Scan Matching


Laser scan taken at two different positions can be aligned to estimate robot motion


Scan Matching: Observe Same Points

  • Scan 1: $\mathbf x_i = \{x_i, y_i\}$
  • Scan 2: $\mathbf x_j = \{x_j, y_j\} \leftrightarrow \{x_{i(j)}, y_{i(j)}\}$
  • Perfect association:
    We can map each point in the second scan to a unique point in the first scan: $\{x_j, y_j\} \leftrightarrow \{x_{i(j)}, y_{i(j)}\}$
  • Solve $\min_{\Delta x, \Delta y, \theta} \sum_{j} \|\mathbf x_j - T_{\Delta x, \Delta y, \theta}(\mathbf x_{i(j)})\|^2$, where $T_{\Delta x, \Delta y, \theta}(\mathbf x) = \begin{bmatrix}\cos \theta & -\sin \theta \\ \sin \theta & \cos \theta\end{bmatrix} \mathbf x + \begin{bmatrix} \Delta x \\ \Delta y\end{bmatrix}$ (a numerical sketch follows)
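A minimal numerical sketch of this least-squares fit, assuming the correspondences are already known (rows of P2 matched to rows of P1); the synthetic data and the use of scipy.optimize.least_squares are illustrative assumptions, not part of the slides.

import numpy as np
from scipy.optimize import least_squares

# T applies the rotation-then-translation above to each row of P.
def make_T(dx, dy, th):
    R = np.array([[np.cos(th), -np.sin(th)],
                  [np.sin(th),  np.cos(th)]])
    return lambda P: P @ R.T + np.array([dx, dy])

# Stacked residuals x_j - T(x_{i(j)}) over all matched pairs.
def residuals(params, P1, P2):
    dx, dy, th = params
    return (P2 - make_T(dx, dy, th)(P1)).ravel()

# Synthetic check: transform a random scan by a known motion, then recover it.
rng = np.random.default_rng(0)
P1 = rng.standard_normal((40, 2))
P2 = make_T(0.5, -0.2, 0.3)(P1)
sol = least_squares(residuals, x0=[0.0, 0.0, 0.0], args=(P1, P2))
print(sol.x)  # should approach [0.5, -0.2, 0.3]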


Scan Matching: Different Points

Inverse Kinematics

Forward & Inverse Kinematics

Given joint angles $\mathbf q = \{q_i\}$, we can predict the end-effector pose $\mathbf x = (x, y, \theta)$


  • The Forward Kinematics problem $\mathbf x = f(\mathbf q)$ combines known closed-form expressions for the individual homogeneous transformations
  • Computing the inverse $\mathbf q = f^{-1}(\mathbf x)$, however, is not as easy
  • The inverse kinematics problem often does not even have a unique solution, which has algorithmic implications

IK Approaches

Since we know how to build $f(\mathbf q)$, we arrive at two approaches to inverse kinematics:

  • Analytic approaches: Build the closed-form expression $f(\mathbf q)$ and derive a closed-form inverse $f^{-1}(\mathbf x)$
  • Numerical approaches: Numerically search for values of $\mathbf q$ such that $f(\mathbf q) = \mathbf x$, where $f$ is either closed-form or evaluated numerically

Numerical IK Approach

  • Solve the optimization problem $\min_{\mathbf q} \lVert \mathbf x - f(\mathbf q) \rVert_2^2$ (see the sketch after the example below)
  • We can add constraints that make the solution unique, or that provide other benefits

Example: Planar Elbow Manipulator

Frame $\{2\}$ has pose $(x, y, \theta)$ given by

$$\begin{align} x &= L_1 \cos q_1 + L_{c2} \cos (q_1+q_2), \\ y &= L_1 \sin q_1 + L_{c2} \sin (q_1+q_2), \\ \theta &= q_1 + q_2, \end{align}$$

or, compactly, $\mathbf x = f(\mathbf q)$.
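A sketch of the numerical approach applied to this manipulator; the link lengths L1 and Lc2, the solver choice (scipy.optimize.minimize), and the target pose are assumptions for illustration.

import numpy as np
from scipy.optimize import minimize

L1, Lc2 = 1.0, 0.5  # assumed link lengths

# Forward kinematics f(q) from the equations above.
def f(q):
    q1, q2 = q
    return np.array([L1 * np.cos(q1) + Lc2 * np.cos(q1 + q2),
                     L1 * np.sin(q1) + Lc2 * np.sin(q1 + q2),
                     q1 + q2])

x_des = f(np.array([0.3, 0.8]))  # a pose that is reachable by construction

# Numerical IK: minimize || x_des - f(q) ||^2 over q.
res = minimize(lambda q: np.sum((x_des - f(q)) ** 2), x0=np.zeros(2))
print(res.x)  # one solution; IK solutions need not be unique in general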

Logistic Regression


  • We have vectors $\bm{a}_i \in \mathbb{R}^d$ for $i = 1, 2, \ldots, n_1$ in a class and vectors $\bm{b}_j \in \mathbb{R}^d$ for $j = 1, 2, \ldots, n_2$ not in that class.

Two clusters of points $\bm{a}_i$ (red) and $\bm{b}_j$ (green) in $\mathbb{R}^2$ ($d = 2$)

Approach

  • We wish to classify the points into one of the two clusters

  • We convert classification into regression by requiring that, for some function $f$: $$f(\bm{a}_i) = 1 \text{ and } f(\bm{b}_j) = 0$$

  • One solution is to use the logistic function after linearly mapping inputs $\bm{x}$ to a scalar: $$\mathrm{logistic}(s) = \frac{e^s}{1+e^{s}}; \quad s(\bm{x}) = \bm{x}^\top \bm{w} + \beta$$

  • New goal: find a vector $\bm{w} \in \mathbb{R}^d$ and a number $\beta$ such that

$$\frac{\exp(\bm{a}_i^\top \bm{w} + \beta)}{1 + \exp(\bm{a}_i^\top \bm{w} + \beta)} \approx 1, \;\; \forall i, \quad \text{ and } \quad \frac{\exp(\bm{b}_j^\top \bm{w} + \beta)}{1 + \exp(\bm{b}_j^\top \bm{w} + \beta)} \approx 0, \;\; \forall j.$$

Approach

  • This problem can be cast as an unconstrained optimization problem

$$\operatorname{maximize}_{\bm{w}, \beta} \left(\prod_i \frac{\exp(\bm{a}_i^\top \bm{w} + \beta)}{1 + \exp(\bm{a}_i^\top \bm{w} + \beta)}\right) \left(\prod_j \left(1 - \frac{\exp(\bm{b}_j^\top \bm{w} + \beta)}{1 + \exp(\bm{b}_j^\top \bm{w} + \beta)} \right) \right)$$

which may equivalently be expressed using a log transformation as

$$\operatorname{minimize}_{\bm{w}, \beta} \; \sum_i \log\left(1 + \exp(-\bm{a}_i^\top \bm{w} - \beta) \right) + \sum_j \log\left(1 + \exp(\bm{b}_j^\top \bm{w} + \beta) \right).$$

โˆ(e๐šiโŠค๐ฐ+ฮฒ1+e๐šiโŠค๐ฐ+ฮฒ)=โˆ(11+eโˆ’๐šiโŠค๐ฐโˆ’ฮฒ) \prod \left( \frac{e^{\bm{a}_i^\top \bm{w} + \beta}}{1 + e^{\bm{a}_i^\top \bm{w} + \beta}} \right) = \prod \left( \frac{1}{1 + e^{-\bm{a}_i^\top \bm{w} - \beta}} \right)

โˆ(1โˆ’e๐›jโŠค๐ฐ+ฮฒ1+e๐›jโŠค๐ฐ+ฮฒ)=โˆ(11+e๐›jโŠค๐ฐ+ฮฒ) \prod \left( 1 - \frac{e^{\bm{b}_j^\top \bm{w} + \beta}}{1 + e^{\bm{b}_j^\top \bm{w} + \beta}} \right) = \prod \left( \frac{1}{1 + e^{\bm{b}_j^\top \bm{w} + \beta}} \right)

Is this an easy or a hard problem?

Example: Logistic Regression


The optimal value is 12.37578346960808
A solution (w, beta) is
[2.35589697 2.25825204]
[-6.84717488]

Example: Logistic Regression

# Import packages.
import cvxpy as cp
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['text.usetex'] = True

# Generate two random point clouds: class A centered at (3, 3), class B at the origin.
m = 100
n = 2
np.random.seed(1)
B = np.random.randn(m, n)
A = np.random.randn(m, n) + np.array([3, 3])

plt.figure(figsize=(8, 6))
plt.scatter(A[:, 0], A[:, 1])
plt.scatter(B[:, 0], B[:, 1])

# Define and solve the CVXPY problem: the negative log-likelihood,
# written with cp.logistic(s) = log(1 + exp(s)).
w = cp.Variable(n)
beta = cp.Variable(1)
prob = cp.Problem(cp.Minimize(cp.sum(cp.logistic(-A @ w - beta))
                              + cp.sum(cp.logistic(B @ w + beta))))
prob.solve()

# Print result.
print("\nThe optimal value is", prob.value)
print("A solution (w, beta) is")
print(w.value)
print(beta.value)




x = np.linspace(-2, 5, 100)
y = np.linspace(-2, 5, 100)
X, Y = np.meshgrid(x, y)
Z = (w.value[0]*X+w.value[1]*Y)+beta.value


# Contour plot of the decision score w^T x + beta
contour = plt.contour(X, Y, Z, levels=20, cmap="viridis")
plt.colorbar(contour, label=r"$w^\top x + \beta$")

## Superimpose the line w^T x + beta = 0
x = np.linspace(-2, 5, 100)
y = -(w.value[0]/w.value[1])*x - beta.value/w.value[1]
plt.plot(x,y,color="red",label="class boundary")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.legend()

plt.show()

Parametric Estimation (Nonconvex)

Parametric Estimation (Nonconvex)

  • Estimating the parameters of a neural network is typically nonconvex.
  • This network has $6$ layers, where the initial layer is the input vector $\bm{x} = \bm{f}^0$ and the last layer is the function output $\bm{f}(\bm{x}) = \bm{f}^5$.
  • The vector function $\bm{f}^\ell$, $\ell = 1, \ldots, 5$, is defined recursively from the parameter weights $w_{ij}^{\ell-1}$ between two consecutive layers as a piecewise linear/affine function

$$f_j^\ell = \max\left\{0, \sum_i w_{ij}^{\ell-1} f_i^{\ell - 1}\right\}, \quad \forall j.$$
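A small NumPy sketch of this recursion, together with the squared-error loss introduced below; the layer sizes, random weights, and random data are all assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 8, 8, 8, 8, 2]  # f^0 is the input layer, f^5 the output layer
W = [rng.standard_normal((sizes[l], sizes[l + 1])) for l in range(5)]

# f^l_j = max{0, sum_i w_ij^{l-1} f_i^{l-1}}, applied layer by layer.
def forward(x):
    f = x
    for Wl in W:
        f = np.maximum(0.0, f @ Wl)
    return f

# sum_k || f(x^k) - g(x^k) ||^2 over a batch of input/observation pairs.
def total_loss(xs, gs):
    return sum(np.sum((forward(xk) - gk) ** 2) for xk, gk in zip(xs, gs))

xs = [rng.standard_normal(3) for _ in range(10)]
gs = [rng.standard_normal(2) for _ in range(10)]
print(total_loss(xs, gs))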


  • Similarly, for a sequence of input vectors $\bm{x}^k$ and observed function values $\bm{g}(\bm{x}^k)$,
    • we would like to find the weights $\left(w_{ij}^\ell\right)$ that minimize the total difference between $\bm{f}(\bm{x}^k)$ and $\bm{g}(\bm{x}^k)$ over all $k$: $$\sum_k \left\| \bm{f}(\bm{x}^k) - \bm{g}(\bm{x}^k) \right\|^2.$$
  • Challenges:
    • Non-convexity
    • Large amounts of data ($f$ and $\nabla f$ are expensive to evaluate)
  • Solutions:

Some History: CNNs

Sequential Decision Making


Finite Markov Decision Processes

  • A finite MDP is described by the tuple $(\mathcal{S}, \mathcal{A}, T, \gamma, \mathcal{R})$:
    • $\mathcal{S}$: A finite set of states
    • $\mathcal{A}$: A finite set of actions that can be taken in each state
    • $T$: A (probabilistic) transition map
    • $\gamma$: The discount factor
    • $\mathcal{R}$: The reward function
  • The agent and environment interact at each of a sequence of discrete time steps, $t = 0, 1, 2, \ldots$.
    • At each time step $t$, the agent receives some representation of the environment's state, $S_t \in \mathcal{S}$, and on that basis selects an action $A_t \in \mathcal{A}(s)$.
    • One time step later, in part as a consequence of its action, the agent receives a numerical reward, $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$, and finds itself in a new state, $S_{t+1}$.
  • The MDP and agent together give rise to a trajectory that begins like this: $$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots \qquad(1)$$

Transition Model

  • For finite MDPs, the random variables $R_t$ and $S_t$ have well-defined discrete probability distributions that depend on the preceding state and action.

$$T(s', r \mid s, a) \triangleq \mathbb{P}\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\} \qquad(2)$$

  • This function $T$ defines the dynamics of the MDP.
    • It specifies a probability distribution over next states and rewards for each choice of $s$ and $a$, i.e.,

$$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} T(s', r \mid s, a) = 1, \quad \forall s \in \mathcal{S}, \; a \in \mathcal{A}(s). \qquad(3)$$

Linear Programming Methods

We solve the MDP by finding a policy $\pi \colon \mathcal{S} \to \mathcal{A}$ that maximizes the expected discounted sum of rewards $V(s)$ obtained from any state $s \in \mathcal{S}$.

Through dynamic programming, we can solve for $V(s)$ using the following linear program:

$$\begin{align} \operatorname{minimize} \quad & \sum_{s} V(s) \\ \text{subject to} \quad & V(s) \geq r(s,a) + \gamma \sum_{s'} T(s'|s,a) V(s'), \quad s, s' \in \mathcal{S}, \;\; a \in \mathcal{A}(s). \end{align} \qquad(4)$$
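A CVXPY sketch of this linear program on a randomly generated toy MDP; the state and action counts, rewards, transition probabilities, and discount factor are arbitrary assumptions.

import cvxpy as cp
import numpy as np

nS, nA = 3, 2
gamma = 0.9
rng = np.random.default_rng(0)
T = rng.random((nS, nA, nS))
T /= T.sum(axis=2, keepdims=True)  # each T[s, a] is a distribution over s'
r = rng.random((nS, nA))           # reward r(s, a)

V = cp.Variable(nS)
constraints = [V[s] >= r[s, a] + gamma * T[s, a] @ V
               for s in range(nS) for a in range(nA)]
prob = cp.Problem(cp.Minimize(cp.sum(V)), constraints)
prob.solve()
print("V* =", V.value)

# The greedy policy takes the argmax action of the recovered Q-values.
Q = r + gamma * np.tensordot(T, V.value, axes=([2], [0]))
print("policy =", Q.argmax(axis=1))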

Equality Constrained Optimization Examples

Example 1 โ€“ Geometric Prog.: Max Volume

We seek to construct a cardboard box of maximum volume, given a fixed area of the cardboard.


$$\begin{align} \operatorname{maximize} \quad & xyz \\ \text{subject to} \quad & xy + yz + xz = \frac{c}{2}, \quad c > 0 \; (\text{area}). \end{align}$$

First-Order Necessary Conditions

$$\begin{align} yz - \lambda (y+z) &= 0, \\ xz - \lambda (x+z) &= 0, \\ xy - \lambda (x+y) &= 0. \end{align}$$

Since no variable can be zero (as the steps below show), we have $$x = y = z = \sqrt{\frac{c}{6}} \quad \text{and} \quad \lambda = \frac{\sqrt{6c}}{12}.$$

  • Summing the FONC gives $(xy + yz + xz) - 2\lambda(x+y+z) = 0$.
  • Using the constraint with this implies $$\frac{c}{2} - 2\lambda(x+y+z) = 0.$$
    • From this it is clear that $\lambda \neq 0$.
  • Next, we see that none of $x$, $y$, and $z$ is zero.
    • This is because if, say, $x = 0$, then $z$ becomes zero (second equation), which implies $y = 0$ from the first equation.
  • Multiply the first equation by $x$ and the second by $y$, then subtract, to obtain $$\lambda(x-y)z = 0.$$
  • Similarly operate on the second and third to obtain $\lambda (y-z)x = 0$.
  • Since $\lambda \neq 0$ and none of the variables vanish, these force $x = y = z$, and the constraint then gives the values above.
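A quick numerical check of this conclusion using scipy.optimize.minimize with an equality constraint; the area value c = 12 is an arbitrary assumption.

import numpy as np
from scipy.optimize import minimize

c = 12.0
obj = lambda v: -v[0] * v[1] * v[2]  # maximize volume = minimize its negative
con = {"type": "eq",
       "fun": lambda v: v[0]*v[1] + v[1]*v[2] + v[0]*v[2] - c / 2}
res = minimize(obj, x0=[1.0, 1.0, 1.0], constraints=[con], method="SLSQP")
print(res.x, np.sqrt(c / 6))  # all three sides should approach sqrt(c/6)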

Example 2 โ€“ Hanging Chain

A chain is suspended from two thin hooks that are $16$ ft apart on a horizontal line. Each link is one foot in length (measured inside). We wish to formulate the problem of determining the equilibrium shape of the chain.

The solution can be found by minimizing the potential energy of the chain.

Example 2 โ€“ Hanging Chain

  • Let link $i$ span an $x$ distance of $x_i$ and a $y$ distance of $y_i$.
  • Then $x_i^2 + y_i^2 = 1$.
  • The potential energy of a link is its weight times its height.
  • The total potential energy of the chain is the sum of those of each link.
  • With the top of the chain as reference and assuming the mass of each link is concentrated at its center

$$\begin{align} P &= \frac{y_1}{2} + \left(y_1 + \frac{y_2}{2}\right) + \left(y_1 + y_2 + \frac{y_3}{2}\right) + \cdots + \left(y_1 + y_2 + \cdots + y_{n-1} + \frac{y_n}{2}\right) \\ &= \sum_{i=1}^n \left(n-i+\frac{1}{2}\right)y_i, \end{align}$$

where $n = 20$ in our example.

Constraints: The total $y$ displacement is zero and the total $x$ displacement is $16$.

Formulation

$$\begin{align} \operatorname{minimize} \quad & \sum_{i=1}^n \left(n-i + \frac{1}{2}\right)y_i \\ \text{subject to} \quad & \sum_{i=1}^n y_i = 0, \quad \sum_{i=1}^n \sqrt{1 - y_i^2} = 16. \end{align}$$

First-Order Necessary Conditions

$$\left(n-i + \frac{1}{2}\right) - \lambda + \frac{\mu y_i}{\sqrt{1-y_i^2}} = 0, \quad i = 1, 2, \ldots, n.$$

Example 2 โ€“ Hanging Chain

  • FONC directly leads to

$$y_i = -\frac{n - i + \frac{1}{2} - \lambda}{\sqrt{\mu^2 + \left(n - i + \frac{1}{2} - \lambda\right)^2}}.$$

  • The solution is determined once the Lagrange multipliers are known.
    • They must be selected so that the solution satisfies the two constraints (as sketched below).
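A sketch of that last step: solve the two constraint equations for lambda and mu with scipy.optimize.root; the initial guess is an assumption (lambda near n/2 by the symmetry of the chain).

import numpy as np
from scipy.optimize import root

n = 20
idx = np.arange(1, n + 1)

# y_i(lambda, mu) from the FONC expression above.
def y(lmbda, mu):
    c = n - idx + 0.5 - lmbda
    return -c / np.sqrt(mu**2 + c**2)

# The two constraints: total y displacement 0, total x displacement 16.
def residuals(p):
    yi = y(p[0], p[1])
    return [yi.sum(), np.sqrt(1 - yi**2).sum() - 16.0]

sol = root(residuals, [n / 2, 5.0])
print(sol.success, sol.x)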

Example 3 โ€“ Compressed Sensing

  • We often want to find the sparsest solution that fits exact data measurements in regression.
  • That is, we want to minimize the number of nonzero entries in $\bm{x}$ subject to a system of linear equations $\bm{Ax} = \bm{b}$.
    • This discrete cardinality function is not continuous, so we approximate it by a continuous and mostly differentiable pseudo-norm function: $$\left(\|\bm{x}\|_p \right)^p = \sum_{j=1}^n |x_j|^p, \quad 0 < p \leq 1.$$
    • This becomes the $L_1$-norm function, which is convex, when $p = 1$ (see the sketch below).
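For the convex case $p = 1$, the equality-constrained problem on the next slide can be solved directly; here is a CVXPY sketch on synthetic sparse data (the dimensions, sparsity level, and seed are arbitrary assumptions).

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 30, 100, 5
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
b = A @ x_true

# Basis pursuit: minimize ||x||_1 subject to Ax = b.
x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm1(x)), [A @ x == b])
prob.solve()
print("recovery error:", np.linalg.norm(x.value - x_true))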

Example 3 โ€“ Compressed Sensing

We want to solve the linear equality constrained minimization problem:

$$\begin{align} \operatorname{minimize} \quad & \sum_{j=1}^n |x_j|^p \\ \text{subject to} \quad & \bm{Ax} - \bm{b} = \bm{0}. \end{align}$$

  • The derivative of $|x_j|^p$, when $x_j \neq 0$, is $p \, |x_j|^{p-1} \operatorname{sign}(x_j)$.

  • Let us remove the zero entries of $\bm{x}$; the remaining nonzero variables must still meet the FONC: for the $j^{\text{th}}$ column $\bm{a}_j$ of $\bm{A}$ and some $\bm{\lambda}$,

$$p \left( |x_j|^{p-1} \operatorname{sign}(x_j) \right) - \bm{\lambda}^\top\bm{a}_j = 0, \quad \forall x_j \neq 0.$$

  • Multiplying each equation by $x_j$ and summing them up, we have

$$p \sum_{j: x_j \neq 0} |x_j|^p = \bm{\lambda}^\top \left(\sum_{j: x_j \neq 0} \bm{a}_j x_j \right) = \bm{\lambda}^\top\bm{b} \leq |\bm{\lambda}| \, |\bm{b}|.$$

This means that the sum of the $p^{\text{th}}$ power of the absolute values of the nonzero entries is bounded above. For $p = \frac{1}{2}$, we have $\sum \sqrt{|x_j|} \leq 2 |\bm{\lambda}| \, |\bm{b}|$. Moreover,

$$|x_j|^{-\frac{1}{2}} \operatorname{sign}(x_j) = 2\bm{\lambda}^\top \bm{a}_j \;\; \Longrightarrow \;\; \frac{1}{\sqrt{|x_j|}} \leq 2 |\bm{\lambda}| \, |\bm{a}_j|.$$

  • This establishes a lower bound on the absolute value of each nonzero entry of any possible local minimizer of the problem.

Examples of Linear Programming Problems

Example 1 โ€“ The Diet Problem

Determine the most economical diet that satisfies the basic minimum nutritional requirements for good health

  • There are available $n$ different foods.
  • There are $m$ basic nutritional ingredients.
  • Each unit of food $j$ contains $a_{ij}$ units of the $i^{\text{th}}$ nutrient.
  • The $j^{\text{th}}$ food sells at a price $c_j$ per unit.
  • Each individual must receive at least $b_i$ units of the $i^{\text{th}}$ nutrient per day.

If we denote by $x_j$ the number of units of food $j$ in the diet, the problem is to select the $x_j$'s to minimize the total cost

$$c_1x_1 + c_2x_2 + \cdots + c_nx_n$$

subject to the nutritional constraints

$$a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{in}x_n \geq b_i, \quad i = 1, \ldots, m,$$

and the nonnegativity constraints

$$x_1 \geq 0, \; x_2 \geq 0, \; \ldots, \; x_n \geq 0$$

on the food quantities.

This problem can be converted to standard form by subtracting a nonnegative surplus variable from the left side of each of the $m$ linear inequalities (a sketch of the inequality form follows).
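A CVXPY sketch of the diet problem; all prices, nutrient contents, and requirements below are hypothetical.

import cvxpy as cp
import numpy as np

c = np.array([2.0, 3.5, 1.0, 4.0])      # price c_j per unit of food j
A = np.array([[10.0, 5.0, 1.0, 8.0],
              [4.0,  9.0, 2.0, 1.0],
              [1.0,  3.0, 6.0, 2.0]])   # a_ij: nutrient i per unit of food j
b = np.array([20.0, 18.0, 12.0])        # daily requirement b_i

x = cp.Variable(4, nonneg=True)         # units of each food
prob = cp.Problem(cp.Minimize(c @ x), [A @ x >= b])
prob.solve()
print("cost:", prob.value, "diet:", x.value.round(3))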

Example 2โ€“ The Resource-Allocation Problem

  • A facility is capable of manufacturing $n$ different products.
  • Each product can be produced at any level $x_j \geq 0$, $j = 1, 2, \ldots, n$.
  • Each unit of the $j^{\text{th}}$ product needs $a_{ij}$ units of the $i^{\text{th}}$ resource, $i = 1, 2, \ldots, m$.
  • Each product may require various amounts of the $m$ different resources.
  • Each unit of the $j^{\text{th}}$ product can sell for $\pi_j$ dollars.
  • The quantities $b_i$, $i = 1, 2, \ldots, m$, describe the available amounts of the $m$ resources.

We wish to manufacture products at maximum revenue:

$$\operatorname{maximize} \quad \pi_1x_1 + \pi_2x_2 + \cdots + \pi_nx_n$$

subject to the resource constraints

$$a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{in}x_n \leq b_i, \quad i = 1, \ldots, m,$$

and the nonnegativity constraints on all production variables.

  • The problem can also be interpreted as
    • funding $n$ different activities, where
    • $\pi_j$ is the full reward from the $j^{\text{th}}$ activity, and
    • $x_j$ is restricted to $0 \leq x_j \leq 1$, representing the funding level from $0\%$ to $100\%$.

Example 3 โ€“ The Transportation Problem

  • Quantities $a_1, a_2, \ldots, a_m$ of a certain product are to be shipped from $m$ locations.
  • Shipping a unit of product from origin $i$ to destination $j$ costs $c_{ij}$.
  • These products will be received in amounts $b_1, b_2, \ldots, b_n$ at each of $n$ destinations.
  • We want to determine the amounts $x_{ij}$ to be shipped between each origin-destination pair $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$:
$$\begin{array}{cccc|c} x_{11} & x_{12} & \cdots & x_{1n} & a_1 \\ x_{21} & x_{22} & \cdots & x_{2n} & a_2 \\ \vdots & \vdots & & \vdots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} & a_m \\ \hline b_1 & b_2 & \cdots & b_n & \end{array}$$
  • The $i^{\text{th}}$ row in this array defines the variables associated with the $i^{\text{th}}$ origin.
  • The $j^{\text{th}}$ column defines the variables associated with the $j^{\text{th}}$ destination.
  • Problem: find the nonnegative variables $x_{ij}$ so that the sum across the $i^{\text{th}}$ row is $a_i$, the sum down the $j^{\text{th}}$ column is $b_j$, and the weighted sum $\sum_{j=1}^n\sum_{i=1}^m c_{ij}x_{ij}$ is minimized.
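A CVXPY sketch with hypothetical data: two origins and three destinations, with supplies and demands balanced by construction.

import cvxpy as cp
import numpy as np

a = np.array([30.0, 20.0])            # supply a_i at each origin
b = np.array([10.0, 25.0, 15.0])      # demand b_j at each destination
c = np.array([[8.0, 6.0, 10.0],
              [9.0, 12.0, 13.0]])     # unit shipping cost c_ij

X = cp.Variable((2, 3), nonneg=True)  # shipped amounts x_ij
prob = cp.Problem(cp.Minimize(cp.sum(cp.multiply(c, X))),
                  [cp.sum(X, axis=1) == a,   # row sums: ship all supply
                   cp.sum(X, axis=0) == b])  # column sums: meet all demand
prob.solve()
print("cost:", prob.value)
print(X.value.round(2))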

Example 4 โ€“ The Maximal Flow Problem


Determine the maximal flow that can be established in such a network.

$$\begin{align} \operatorname{maximize} \quad & f \\ \text{subject to} \quad & \sum_{j=1}^n x_{1j} - \sum_{j=1}^n x_{j1} - f = 0, \\ & \sum_{j=1}^n x_{ij} - \sum_{j=1}^n x_{ji} = 0, \quad i \neq 1, m, \\ & \sum_{j=1}^n x_{mj} - \sum_{j=1}^n x_{jm} + f = 0, \\ & 0 \leq x_{ij} \leq k_{ij}, \quad \forall i, j, \end{align}$$

where $k_{ij} = 0$ for pairs $(i,j)$ with no arc.

  • Capacitated network in which two special nodes, called the source (node $1$) and the sink (node $m$), are distinguished.
  • All other nodes must satisfy the conservation requirement: the net flow into these nodes must be zero.
    • The source may have a net outflow,
    • the sink may have a net inflow.
  • The outflow $f$ of the source will equal the inflow of the sink (a small numerical sketch follows).
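A CVXPY sketch of this formulation on a small hypothetical network; the capacity matrix is an arbitrary assumption, with zero entries meaning no arc.

import cvxpy as cp
import numpy as np

# Node 0 is the source, node 3 the sink; k[i, j] is the arc capacity.
k = np.array([[0, 3, 2, 0],
              [0, 0, 1, 2],
              [0, 0, 0, 3],
              [0, 0, 0, 0]], dtype=float)
n = k.shape[0]

x = cp.Variable((n, n), nonneg=True)
f = cp.Variable()
outflow = cp.sum(x, axis=1)  # sum_j x_ij
inflow = cp.sum(x, axis=0)   # sum_j x_ji
prob = cp.Problem(cp.Maximize(f),
                  [x <= k,
                   outflow[0] - inflow[0] == f,          # source
                   outflow[1:n-1] - inflow[1:n-1] == 0,  # conservation
                   outflow[n-1] - inflow[n-1] == -f])    # sink
prob.solve()
print("max flow:", f.value)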

Example 5 โ€“ A Supply-Chain Problem

A warehouse is buying and selling stock of a certain commodity in order to maximize profit over a certain length of time.

  • The warehouse has a fixed capacity $C$.
  • The price $p_i$ of the commodity is known to fluctuate over a number of time periods, say months, indexed by $i$.
  • The warehouse is originally empty and is required to be empty at the end of the last period.
  • There is a cost $r$ per unit of holding stock for one period.
  • In any period the same price holds for both purchase and sale.
  • $x_i$: level of stock in the warehouse at the beginning of period $i$; $u_i$: amount bought during this period; $s_i$: amount sold during this period.

$$\begin{align} \operatorname{maximize} \quad & \sum_{i=1}^n \left(p_i(s_i - u_i) - rx_i \right) \\ \text{subject to} \quad & x_{i+1} = x_i + u_i - s_i, \quad i = 1, 2, \ldots, n-1, \\ & 0 = x_n + u_n - s_n, \\ & x_i + z_i = C, \quad i = 2, \ldots, n, \\ & x_1 = 0, \quad x_i \geq 0, \; u_i \geq 0, \; s_i \geq 0, \; z_i \geq 0, \end{align}$$

where the $z_i$ are slack variables.

Example 6 โ€“ Linear Classifier and Support Vector Machine

$d$-dimensional data points are to be classified into two distinct classes.

  • In general, we have vectors $\bm{a}_i \in \mathbb{R}^d$ for $i = 1, 2, \ldots, n_1$ and vectors $\bm{b}_j \in \mathbb{R}^d$ for $j = 1, 2, \ldots, n_2$.
  • We wish to find a hyperplane that separates the $\bm{a}_i$'s from the $\bm{b}_j$'s, i.e., find a slope vector $\bm{y} \in \mathbb{R}^d$ and an intercept $\beta$ such that

$$\begin{align} \bm{a}_i^\top \bm{y} + \beta &\geq 1, \quad \forall i, \\ \bm{b}_j^\top \bm{y} + \beta &\leq -1, \quad \forall j, \end{align}$$

where $\{\bm{x}: \bm{x}^\top\bm{y} + \beta = 0\}$ is the desired hyperplane.

  • The separation is defined by the fixed margins $+1$ and $-1$, which could be made soft or variable later.
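A CVXPY sketch of these separation constraints on synthetic (assumed separable) data; minimizing $\|\bm{y}\|$ among the feasible hyperplanes is an added assumption here, and it yields the maximum-margin (support vector machine) choice.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
Apts = rng.standard_normal((50, 2)) + np.array([5.0, 5.0])  # class a points
Bpts = rng.standard_normal((50, 2))                         # class b points

y = cp.Variable(2)
beta = cp.Variable()
constraints = [Apts @ y + beta >= 1,
               Bpts @ y + beta <= -1]
# Any feasible (y, beta) separates the data; the minimum-norm choice
# maximizes the margin between the two classes.
prob = cp.Problem(cp.Minimize(cp.norm2(y)), constraints)
prob.solve()
print(y.value, beta.value)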

Example

  • Two-dimensional data points may be grade averages in science and humanities for different students.
  • We also know the academic major of each student, as being science or humanities, which serves as the classification.