This is mainly about how we deal with nonlinear classification.

The idea is that we apply a non-linear transformation $\phi$ so that data that are not linearly separable in the input space become linearly separable in the (higher-dimensional) feature space.

Let’s say we want to model a 2nd-order polynomial (quadratic) decision function. We have to do the transformation $\mathbf{x} \mapsto \phi(\mathbf{x})$, where $\mathbf{x} \in \mathbb{R}^d$ and $\phi(\mathbf{x}) \in \mathbb{R}^D$, where $D = 2d + \frac{d(d-1)}{2}$.

We would have to map three different kinds of terms:

  1. Linear → $x_i$ (total of $d$ coordinates): to model straight lines
  2. Square → $x_i^2$ (total of $d$ coordinates): to model circles / ellipses
  3. Cross → $x_i x_j,\ i < j$ (total of $\frac{d(d-1)}{2}$ coordinates): to model more general curves

This is bad because (1) a lot of compute is needed, and (2) the dimensionality grows quickly (here from $d$ to $O(d^2)$, and even faster for higher-order polynomials).
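To make the cost concrete, here is a minimal NumPy sketch (not from the original notes) of the explicit quadratic feature map described above; the function name `phi_quadratic` is just illustrative.

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    """Map x in R^d to its explicit quadratic feature vector in R^D,
    D = 2d + d(d-1)/2: linear terms, squared terms, and cross terms."""
    x = np.asarray(x, dtype=float)
    linear = x                                        # d coordinates
    square = x ** 2                                   # d coordinates
    cross = np.array([x[i] * x[j]                     # d(d-1)/2 coordinates
                      for i, j in combinations(range(len(x)), 2)])
    return np.concatenate([linear, square, cross])

x = np.array([1.0, 2.0, 3.0])      # d = 3
print(phi_quadratic(x).shape)      # (9,) = 2*3 + 3*2/2
```

The feature vector has to be built and stored for every training point, which is exactly the compute/memory blow-up noted above.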

So we have the kernel trick. Following the Optimal Hyperplane (dual SVM) solution, we have:

$$\max_{\alpha}\ \sum_{i} \alpha_i - \frac{1}{2}\sum_{i}\sum_{j} \alpha_i \alpha_j y_i y_j \,\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle, \qquad f(\mathbf{x}) = \operatorname{sign}\!\Big(\sum_{i} \alpha_i y_i \,\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle + b\Big)$$

Notice that we only ever use inner products of the transformed vectors. We can just define

$$K(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$$

and then compute $K(\mathbf{x}_i, \mathbf{x}_j)$ directly from the inputs (for the quadratic case, $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^\top \mathbf{x}_j + 1)^2$). In this way, we don’t have to store any of the mapped vectors; we just take the inner product!
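As a sanity check on the trick, the sketch below (my own illustration, with coordinates rescaled by $\sqrt{2}$ so the two quantities agree exactly) compares the polynomial kernel $(1 + \mathbf{x}^\top \mathbf{z})^2$ against the inner product of an explicitly built quadratic feature map.

```python
import numpy as np
from itertools import combinations

def phi_scaled(x):
    """Explicit feature map whose inner product matches (1 + x.z)^2."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(d), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

def k_poly2(x, z):
    """Kernel evaluation: O(d) work, no D-dimensional vectors stored."""
    return (1.0 + np.dot(x, z)) ** 2

x, z = np.array([1.0, 2.0, 3.0]), np.array([-1.0, 0.5, 2.0])
print(np.isclose(k_poly2(x, z), phi_scaled(x) @ phi_scaled(z)))  # True
```

The kernel evaluation costs $O(d)$ per pair, while the explicit map costs $O(d^2)$ per point; that is the whole point of the trick.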

We transform the decision function into:

$$f(\mathbf{x}) = \operatorname{sign}\!\Big(\sum_{i \in \mathrm{SV}} \alpha_i y_i \, K(\mathbf{x}_i, \mathbf{x}) + b\Big)$$
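A hypothetical sketch of evaluating this kernelized decision function; the dual variables `alphas`, support vectors `X_sv`, labels `y_sv`, and bias `b` are assumed to come from an already-trained SVM (the toy values below are illustrative only).

```python
import numpy as np

def k_poly2(x, z):
    return (1.0 + np.dot(x, z)) ** 2

def decision(x, X_sv, y_sv, alphas, b, kernel):
    """f(x) = sign( sum_i alpha_i y_i K(x_i, x) + b ), dual form."""
    scores = np.array([kernel(x_i, x) for x_i in X_sv])
    return np.sign(np.dot(alphas * y_sv, scores) + b)

# Stand-in values for the output of SVM training (illustrative only).
X_sv = np.array([[1.0, 2.0], [-1.0, -0.5]])
y_sv = np.array([1.0, -1.0])
alphas = np.array([0.3, 0.3])
b = 0.1
print(decision(np.array([0.8, 1.5]), X_sv, y_sv, alphas, b, k_poly2))
```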

To determine whether a kernel is a proper (valid) kernel, it has to satisfy Mercer’s theorem, i.e. it must yield symmetric positive semi-definite Gram matrices. Commonly used kernels include the linear kernel $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{z}$, the polynomial kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z} + c)^p$, and the RBF (Gaussian) kernel $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \lVert \mathbf{x} - \mathbf{z} \rVert^2)$.
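One cheap empirical check motivated by Mercer’s condition (a necessary, not sufficient, test, and the helper below is my own sketch) is to verify that the Gram matrix on a sample of points is symmetric with no negative eigenvalues.

```python
import numpy as np

def looks_psd(kernel, X, tol=1e-9):
    """Build the Gram matrix G_ij = K(x_i, x_j) and check symmetry + PSD."""
    G = np.array([[kernel(a, b) for b in X] for a in X])
    return np.allclose(G, G.T) and np.linalg.eigvalsh(G).min() >= -tol

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))   # Gaussian/RBF kernel
print(looks_psd(rbf, X))                           # True
```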