With the Perceptron, we learned that there can be multiple separating solutions. To decide which one is best, we define the optimal hyperplane.
For a sample set $(y_{1}, \mathbf{x}_{1}),\dots,(y_{l}, \mathbf{x}_{l})$, $y \in \{-1, 1\}$, that can be separated by a hyperplane $(\mathbf{w}\cdot \mathbf{x}) + b = 0$, the optimal hyperplane is the one with maximal margin between the samples of the two classes (basically the largest distance between the two classes) that separates the two classes without error.
Formalizing this, we fix the scale by requiring the samples closest to the hyperplane to satisfy $|(\mathbf{w}\cdot \mathbf{x}_{i})+b| = 1$. We can do this because a scalar can be freely absorbed into $\mathbf{w}$ and $b$.
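To see why the scale can be fixed: rescaling both $\mathbf{w}$ and $b$ by the same constant does not change the hyperplane,

$$
(\mathbf{w}\cdot \mathbf{x}) + b = 0 \;\Longleftrightarrow\; \left(\frac{\mathbf{w}}{c}\cdot \mathbf{x}\right) + \frac{b}{c} = 0 \quad \text{for any } c > 0,
$$

so choosing $c = \min_{i} |(\mathbf{w}\cdot \mathbf{x}_{i})+b|$ gives an equivalent pair $(\mathbf{w}, b)$ for which the closest samples satisfy $|(\mathbf{w}\cdot \mathbf{x}_{i})+b| = 1$.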
Optimal Hyperplane
$$
\begin{aligned}
&\min_{\mathbf{w},\, b}\ \Phi(\mathbf{w}) = \frac{1}{2}(\mathbf{w}\cdot \mathbf{w}) \\
&\text{s.t. } y_{i}\left[(\mathbf{w} \cdot \mathbf{x}_{i})+b\right] \geq 1, \quad i=1, 2,\dots,l \\
&\text{for training samples } (y_{1}, \mathbf{x}_{1}),\dots,(y_{l}, \mathbf{x}_{l}), \quad y \in \{-1, 1\}
\end{aligned}
$$
In plain English, minimizing $\Phi(\mathbf{w}) = \frac{1}{2}(\mathbf{w}\cdot \mathbf{w})$ is equivalent to maximizing the margin, and the constraint $y_{i}\left[(\mathbf{w} \cdot \mathbf{x}_{i})+b\right] \geq 1$ is the requirement that there is no misclassification.
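To see the equivalence: the distance from a sample $\mathbf{x}_{i}$ to the hyperplane is $|(\mathbf{w}\cdot \mathbf{x}_{i})+b| / \lVert\mathbf{w}\rVert$, and for the closest samples of each class the canonical constraint holds with equality, so the margin between the two classes is

$$
\rho(\mathbf{w}, b) = \frac{1}{\lVert\mathbf{w}\rVert} + \frac{1}{\lVert\mathbf{w}\rVert} = \frac{2}{\lVert\mathbf{w}\rVert}.
$$

Maximizing $\rho$ is therefore the same as minimizing $\lVert\mathbf{w}\rVert$, or equivalently $\frac{1}{2}(\mathbf{w}\cdot \mathbf{w})$.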
We use the Lagrangian to get the solution. This is a convex optimization problem; we must find the saddle point of the Lagrangian.
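In its standard form, the Lagrangian for this problem introduces non-negative multipliers $\alpha_{i} \geq 0$, one per constraint:

$$
L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}(\mathbf{w}\cdot \mathbf{w}) - \sum_{i=1}^{l} \alpha_{i}\left\{ y_{i}\left[(\mathbf{w}\cdot \mathbf{x}_{i})+b\right] - 1 \right\},
$$

which has to be minimized over $\mathbf{w}$ and $b$ and maximized over the $\alpha_{i}$ (the saddle point).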
Differentiating the Lagrangian with respect to $\mathbf{w}$ and $b$ and setting the derivatives to zero at the saddle point gives us:

$$
\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{l} \alpha_{i} y_{i} \mathbf{x}_{i} = 0, \qquad
\frac{\partial L}{\partial b} = -\sum_{i=1}^{l} \alpha_{i} y_{i} = 0
$$
From here, we get three observations:
- For the optimal hyperplane, the coefficients must satisfy $\sum_{i=1}^{l} \alpha_{i} y_{i} = 0$.
- $\mathbf{w}$ must be a linear combination of the training samples: $\mathbf{w} = \sum_{i=1}^{l} \alpha_{i} y_{i} \mathbf{x}_{i}$, with $\alpha_{i} \geq 0$.
- Only support vectors have a non-zero coefficient $\alpha_{i}$ in the expansion of $\mathbf{w}$.
- This is based on the Kuhn-Tucker theorem; see the complementarity condition below.
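At the saddle point the Kuhn-Tucker conditions require, for every training sample,

$$
\alpha_{i}\left\{ y_{i}\left[(\mathbf{w}\cdot \mathbf{x}_{i})+b\right] - 1 \right\} = 0, \quad i = 1,\dots,l,
$$

so $\alpha_{i} > 0$ is possible only when $y_{i}\left[(\mathbf{w}\cdot \mathbf{x}_{i})+b\right] = 1$, i.e. when $\mathbf{x}_{i}$ lies exactly on the margin. These samples are the support vectors.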
Optimal Hyperplane (Solution)
$$
\mathbf{w} = \sum_{\text{support vectors}} \alpha_{i} y_{i} \mathbf{x}_{i}
$$

We sum over the support vectors only because they are the only samples with non-zero $\alpha_{i}$ (third observation).
Plugging this back into the hyperplane equation $(\mathbf{w}\cdot \mathbf{x}) + b = 0$, we get the decision function

$$
f(\mathbf{x}) = \operatorname{sign}\left( \sum_{\text{support vectors}} \alpha_{i} y_{i} (\mathbf{x}_{i}\cdot \mathbf{x}) + b \right)
$$
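As a concrete sketch (the dataset and variable names are made up for illustration), the snippet below fits a linear SVM on a toy separable dataset with scikit-learn, whose `dual_coef_` attribute stores $\alpha_{i} y_{i}$ for the support vectors, and reproduces the library's decision values using only the formula above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made-up example).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],   # class +1
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.2]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A large C approximates the hard-margin optimal hyperplane.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_y = clf.dual_coef_[0]      # alpha_i * y_i for each support vector
sv = clf.support_vectors_        # the support vectors x_i
b = clf.intercept_[0]            # the threshold b

def decision(x):
    # Sum over support vectors of alpha_i y_i (x_i . x), plus b.
    return float(np.sum(alpha_y * (sv @ x)) + b)

# The hand-rolled decision function matches the library's.
for x in X:
    assert np.isclose(decision(x), clf.decision_function([x])[0])
print([round(decision(x), 3) for x in X])
```

Only the support vectors enter `decision`, which is the third observation at work.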
The threshold $b$ can be obtained from the support vectors of the two classes. We pick a support vector $\mathbf{x}^{*}(1)$ from the $+1$ class and $\mathbf{x}^{*}(-1)$ from the $-1$ class.
Then we take the average (to reduce noise) to find $b$:

$$
b = -\frac{1}{2}\left[ (\mathbf{w}\cdot \mathbf{x}^{*}(1)) + (\mathbf{w}\cdot \mathbf{x}^{*}(-1)) \right]
$$
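To check the formula: the two chosen support vectors lie exactly on their margins, so

$$
(\mathbf{w}\cdot \mathbf{x}^{*}(1)) + b = +1, \qquad (\mathbf{w}\cdot \mathbf{x}^{*}(-1)) + b = -1.
$$

Adding the two equations gives $(\mathbf{w}\cdot \mathbf{x}^{*}(1)) + (\mathbf{w}\cdot \mathbf{x}^{*}(-1)) + 2b = 0$, which rearranges to the averaged threshold above.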