With the Perceptron, we learned that there can be multiple separating solutions. To decide which one is best, we define the optimal hyperplane.
For a sample set $(y_{1}, \mathbf{x}_{1}),\dots,(y_{l}, \mathbf{x}_{l})$, $y \in \{-1, 1\}$, that can be separated by a hyperplane $(\mathbf{w}\cdot \mathbf{x}) + b = 0$, the optimal hyperplane is the one with maximal margin between the samples of the two classes (basically the largest distance between the two classes) that separates the two classes without error.
Formalizing this, we fix the scale by requiring the samples closest to the hyperplane to satisfy $|(\mathbf{w}\cdot \mathbf{x}_{i})+b| = 1$. We can do this because a scalar can be freely absorbed into $\mathbf{w}$ and $b$.
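To see why the scale can be fixed: rescaling both $\mathbf{w}$ and $b$ by the same constant does not change the hyperplane,

$$
(\mathbf{w}\cdot \mathbf{x}) + b = 0 \;\Longleftrightarrow\; \left(\frac{\mathbf{w}}{c}\cdot \mathbf{x}\right) + \frac{b}{c} = 0 \quad \text{for any } c > 0,
$$

so choosing $c = \min_{i} |(\mathbf{w}\cdot \mathbf{x}_{i})+b|$ gives an equivalent pair $(\mathbf{w}, b)$ for which the closest samples satisfy $|(\mathbf{w}\cdot \mathbf{x}_{i})+b| = 1$.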
Optimal Hyperplane
$$
\begin{aligned}
&\min_{\mathbf{w},\, b}\ \Phi(\mathbf{w}) = \frac{1}{2}(\mathbf{w}\cdot \mathbf{w}) \\
&\text{s.t. } y_{i}\left[(\mathbf{w} \cdot \mathbf{x}_{i})+b\right] \geq 1, \quad i=1, 2,\dots,l \\
&\text{for training samples } (y_{1}, \mathbf{x}_{1}),\dots,(y_{l}, \mathbf{x}_{l}), \quad y \in \{-1, 1\}
\end{aligned}
$$
In plain English, minimizing $\Phi(\mathbf{w}) = \frac{1}{2}(\mathbf{w}\cdot \mathbf{w})$ is equivalent to maximizing the margin, and the constraint $y_{i}\left[(\mathbf{w} \cdot \mathbf{x}_{i})+b\right] \geq 1$ is the requirement that there is no misclassification.
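To see the equivalence: the distance from a sample $\mathbf{x}_{i}$ to the hyperplane is $|(\mathbf{w}\cdot \mathbf{x}_{i})+b| / \lVert\mathbf{w}\rVert$, and for the closest samples of each class the canonical constraint holds with equality, so the margin between the two classes is

$$
\rho(\mathbf{w}, b) = \frac{1}{\lVert\mathbf{w}\rVert} + \frac{1}{\lVert\mathbf{w}\rVert} = \frac{2}{\lVert\mathbf{w}\rVert}.
$$

Maximizing $\rho$ is therefore the same as minimizing $\lVert\mathbf{w}\rVert$, or equivalently $\frac{1}{2}(\mathbf{w}\cdot \mathbf{w})$.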
We use the Lagrangian to get the solution. This is a convex optimization problem; we must find the saddle point of the Lagrangian.
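In its standard form, the Lagrangian for this problem introduces non-negative multipliers $\alpha_{i} \geq 0$, one per constraint:

$$
L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}(\mathbf{w}\cdot \mathbf{w}) - \sum_{i=1}^{l} \alpha_{i}\left\{ y_{i}\left[(\mathbf{w}\cdot \mathbf{x}_{i})+b\right] - 1 \right\},
$$

which has to be minimized over $\mathbf{w}$ and $b$ and maximized over the $\alpha_{i}$ (the saddle point).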
Differentiating the Lagrangian with respect to $\mathbf{w}$ and $b$ and setting the derivatives to zero at the saddle point gives us:

$$
\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{l} \alpha_{i} y_{i} \mathbf{x}_{i} = 0, \qquad
\frac{\partial L}{\partial b} = -\sum_{i=1}^{l} \alpha_{i} y_{i} = 0
$$
From here, we get three observations:
- For the optimal hyperplane, the coefficients must satisfy $\sum_{i=1}^{l} \alpha_{i} y_{i} = 0$.
- $\mathbf{w}$ must be a linear combination of the training samples: $\mathbf{w} = \sum_{i=1}^{l} \alpha_{i} y_{i} \mathbf{x}_{i}$, with $\alpha_{i} \geq 0$.
- Only support vectors have a non-zero coefficient $\alpha_{i}$ in the expansion of $\mathbf{w}$.
- This is based on the Kuhn-Tucker theorem; see the complementarity condition below.
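At the saddle point the Kuhn-Tucker conditions require, for every training sample,

$$
\alpha_{i}\left\{ y_{i}\left[(\mathbf{w}\cdot \mathbf{x}_{i})+b\right] - 1 \right\} = 0, \quad i = 1,\dots,l,
$$

so $\alpha_{i} > 0$ is possible only when $y_{i}\left[(\mathbf{w}\cdot \mathbf{x}_{i})+b\right] = 1$, i.e. when $\mathbf{x}_{i}$ lies exactly on the margin. These samples are the support vectors.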
Optimal Hyperplane (Solution)
$$
\mathbf{w} = \sum_{\text{support vectors}} \alpha_{i} y_{i} \mathbf{x}_{i}
$$

We sum over the support vectors only because they are the only samples with non-zero $\alpha_{i}$ (third observation).
Plugging this back into the hyperplane equation $(\mathbf{w}\cdot \mathbf{x}) + b = 0$, we get the decision function

$$
f(\mathbf{x}) = \operatorname{sign}\left( \sum_{\text{support vectors}} \alpha_{i} y_{i} (\mathbf{x}_{i}\cdot \mathbf{x}) + b \right)
$$
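As a concrete sketch (the dataset and variable names are made up for illustration), the snippet below fits a linear SVM on a toy separable dataset with scikit-learn, whose `dual_coef_` attribute stores $\alpha_{i} y_{i}$ for the support vectors, and reproduces the library's decision values using only the formula above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made-up example).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],   # class +1
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.2]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A large C approximates the hard-margin optimal hyperplane.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_y = clf.dual_coef_[0]      # alpha_i * y_i for each support vector
sv = clf.support_vectors_        # the support vectors x_i
b = clf.intercept_[0]            # the threshold b

def decision(x):
    # Sum over support vectors of alpha_i y_i (x_i . x), plus b.
    return float(np.sum(alpha_y * (sv @ x)) + b)

# The hand-rolled decision function matches the library's.
for x in X:
    assert np.isclose(decision(x), clf.decision_function([x])[0])
print([round(decision(x), 3) for x in X])
```

Only the support vectors enter `decision`, which is the third observation at work.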
The threshold $b$ can be obtained from the support vectors of the two classes. We pick a support vector $\mathbf{x}^{*}(1)$ from the $+1$ class and $\mathbf{x}^{*}(-1)$ from the $-1$ class.
Then we take the average (to reduce noise) to find $b$:

$$
b = -\frac{1}{2}\left[ (\mathbf{w}\cdot \mathbf{x}^{*}(1)) + (\mathbf{w}\cdot \mathbf{x}^{*}(-1)) \right]
$$
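To check the formula: the two chosen support vectors lie exactly on their margins, so

$$
(\mathbf{w}\cdot \mathbf{x}^{*}(1)) + b = +1, \qquad (\mathbf{w}\cdot \mathbf{x}^{*}(-1)) + b = -1.
$$

Adding the two equations gives $(\mathbf{w}\cdot \mathbf{x}^{*}(1)) + (\mathbf{w}\cdot \mathbf{x}^{*}(-1)) + 2b = 0$, which rearranges to the averaged threshold above.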