This is mainly about how we deal with nonlinear classification.

The idea is that we apply a non-linear transformation $\phi$ so that data that are not linearly separable in the input space become linearly separable in the (higher-dimensional) feature space.

Let’s say we want to model a 2nd-order polynomial (quadratic) decision function. We have to do the transformation $\mathbf{x} \mapsto \phi(\mathbf{x})$, where $\mathbf{x} \in \mathbb{R}^d$ and $\phi(\mathbf{x}) \in \mathbb{R}^D$, where $D = 2d + \frac{d(d-1)}{2}$.

We would have to map three different kinds of terms:

  1. Linear → $x_i$ (total of $d$ coordinates): to model straight lines
  2. Square → $x_i^2$ (total of $d$ coordinates): to model circles / ellipses
  3. Cross → $x_i x_j,\ i < j$ (total of $\frac{d(d-1)}{2}$ coordinates): to model more general curves

This is bad because (1) a lot of compute is needed, and (2) the dimensionality grows quickly (here from $d$ to $O(d^2)$, and even faster for higher-order polynomials).
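To make the cost concrete, here is a minimal NumPy sketch (not from the original notes) of the explicit quadratic feature map described above; the function name `phi_quadratic` is just illustrative.

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    """Map x in R^d to its explicit quadratic feature vector in R^D,
    D = 2d + d(d-1)/2: linear terms, squared terms, and cross terms."""
    x = np.asarray(x, dtype=float)
    linear = x                                        # d coordinates
    square = x ** 2                                   # d coordinates
    cross = np.array([x[i] * x[j]                     # d(d-1)/2 coordinates
                      for i, j in combinations(range(len(x)), 2)])
    return np.concatenate([linear, square, cross])

x = np.array([1.0, 2.0, 3.0])      # d = 3
print(phi_quadratic(x).shape)      # (9,) = 2*3 + 3*2/2
```

The feature vector has to be built and stored for every training point, which is exactly the compute/memory blow-up noted above.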

So we have the kernel trick. Following the Optimal Hyperplane (dual SVM) solution, we have:

$$\max_{\alpha}\ \sum_{i} \alpha_i - \frac{1}{2}\sum_{i}\sum_{j} \alpha_i \alpha_j y_i y_j \,\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle, \qquad f(\mathbf{x}) = \operatorname{sign}\!\Big(\sum_{i} \alpha_i y_i \,\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle + b\Big)$$

Notice that we only ever use inner products of the transformed vectors. We can just define

$$K(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$$

and then compute $K(\mathbf{x}_i, \mathbf{x}_j)$ directly from the inputs (for the quadratic case, $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^\top \mathbf{x}_j + 1)^2$). In this way, we don’t have to store any of the mapped vectors; we just take the inner product!
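As a sanity check on the trick, the sketch below (my own illustration, with coordinates rescaled by $\sqrt{2}$ so the two quantities agree exactly) compares the polynomial kernel $(1 + \mathbf{x}^\top \mathbf{z})^2$ against the inner product of an explicitly built quadratic feature map.

```python
import numpy as np
from itertools import combinations

def phi_scaled(x):
    """Explicit feature map whose inner product matches (1 + x.z)^2."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(d), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

def k_poly2(x, z):
    """Kernel evaluation: O(d) work, no D-dimensional vectors stored."""
    return (1.0 + np.dot(x, z)) ** 2

x, z = np.array([1.0, 2.0, 3.0]), np.array([-1.0, 0.5, 2.0])
print(np.isclose(k_poly2(x, z), phi_scaled(x) @ phi_scaled(z)))  # True
```

The kernel evaluation costs $O(d)$ per pair, while the explicit map costs $O(d^2)$ per point; that is the whole point of the trick.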

We transform the decision function into:

$$f(\mathbf{x}) = \operatorname{sign}\!\Big(\sum_{i \in \mathrm{SV}} \alpha_i y_i \, K(\mathbf{x}_i, \mathbf{x}) + b\Big)$$
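A hypothetical sketch of evaluating this kernelized decision function; the dual variables `alphas`, support vectors `X_sv`, labels `y_sv`, and bias `b` are assumed to come from an already-trained SVM (the toy values below are illustrative only).

```python
import numpy as np

def k_poly2(x, z):
    return (1.0 + np.dot(x, z)) ** 2

def decision(x, X_sv, y_sv, alphas, b, kernel):
    """f(x) = sign( sum_i alpha_i y_i K(x_i, x) + b ), dual form."""
    scores = np.array([kernel(x_i, x) for x_i in X_sv])
    return np.sign(np.dot(alphas * y_sv, scores) + b)

# Stand-in values for the output of SVM training (illustrative only).
X_sv = np.array([[1.0, 2.0], [-1.0, -0.5]])
y_sv = np.array([1.0, -1.0])
alphas = np.array([0.3, 0.3])
b = 0.1
print(decision(np.array([0.8, 1.5]), X_sv, y_sv, alphas, b, k_poly2))
```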

To determine whether a kernel is a proper (valid) kernel, it has to satisfy Mercer’s theorem, i.e. it must yield symmetric positive semi-definite Gram matrices. Commonly used kernels include the linear kernel $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{z}$, the polynomial kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z} + c)^p$, and the RBF (Gaussian) kernel $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \lVert \mathbf{x} - \mathbf{z} \rVert^2)$.
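One cheap empirical check motivated by Mercer’s condition (a necessary, not sufficient, test, and the helper below is my own sketch) is to verify that the Gram matrix on a sample of points is symmetric with no negative eigenvalues.

```python
import numpy as np

def looks_psd(kernel, X, tol=1e-9):
    """Build the Gram matrix G_ij = K(x_i, x_j) and check symmetry + PSD."""
    G = np.array([[kernel(a, b) for b in X] for a in X])
    return np.allclose(G, G.T) and np.linalg.eigvalsh(G).min() >= -tol

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))   # Gaussian/RBF kernel
print(looks_psd(rbf, X))                           # True
```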