This is mainly about how we deal with nonlinear classification.
The idea is that we apply a non-linear transformation so that data that is not linearly separable in the input space becomes linearly separable in the feature space.
Let's say we want to model a 2nd-order polynomial (quadratic) decision function. We have to apply a transformation $\mathbf{z} = \phi(\mathbf{x})$, where $\mathbf{x} \in \mathbb{R}^n$ and $\mathbf{z} \in \mathbb{R}^N$, with $N = 2n + \frac{n(n-1)}{2}$.
We would have to map three different kinds of terms:
- Linear → $x_i$ (total of $n$ coordinates): to model straight lines
- Square → $x_i^2$ (total of $n$ coordinates): to model circles / ellipses
- Cross → $x_i x_j,\ i < j$ (total of $\frac{n(n-1)}{2}$ coordinates): to model curves
This is bad because (1) a lot of compute is needed to build and store the mapped vectors, and (2) the dimensionality grows quadratically in $n$ (and even faster for higher-order polynomials).
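To make the blow-up concrete, here is a minimal sketch (not taken from these notes) of the explicit quadratic feature map in NumPy; the function name `quadratic_features` is a hypothetical helper.

```python
import numpy as np
from itertools import combinations

def quadratic_features(x):
    """Map x in R^n to the explicit 2nd-order feature vector:
    linear terms, squared terms, and pairwise cross terms."""
    x = np.asarray(x, dtype=float)
    linear = x                      # n coordinates
    squares = x ** 2                # n coordinates
    crosses = np.array([x[i] * x[j]
                        for i, j in combinations(range(len(x)), 2)])  # n(n-1)/2 coordinates
    return np.concatenate([linear, squares, crosses])

print(len(quadratic_features(np.zeros(3))))     # 3 + 3 + 3 = 9 features for n = 3
print(len(quadratic_features(np.zeros(100))))   # 100 + 100 + 4950 = 5150 features for n = 100
```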
So we have the kernel trick. Following the Optimal Hyperplane (dual SVM) solution, we have:

$$ W(\boldsymbol{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j \, y_i y_j \, \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j), \quad \text{s.t. } \sum_{i=1}^{m} \alpha_i y_i = 0,\ \alpha_i \ge 0 $$
Notice that we only use the inner products of the transformed vectors. We can just define $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$ and then replace every inner product $\phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$ with $K(\mathbf{x}_i, \mathbf{x}_j)$. This way, we don't have to store each mapped vector explicitly; we just evaluate the inner product directly in the input space!
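As a quick numerical check, here is a sketch assuming the degree-2 polynomial kernel $K(\mathbf{x}, \mathbf{y}) = (1 + \mathbf{x}^\top \mathbf{y})^2$ and an explicit map `phi` (with $\sqrt{2}$ scalings) chosen so that its inner product reproduces exactly that kernel; both names are illustrative, not from the notes.

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit feature map whose inner product equals (1 + x.y)^2."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    parts = [np.ones(1),
             np.sqrt(2) * x,
             x ** 2,
             np.array([np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(n), 2)])]
    return np.concatenate(parts)

def K(x, y):
    """Kernel trick: same value as phi(x).phi(y), but O(n) work and no mapping stored."""
    return (1.0 + np.dot(x, y)) ** 2

x, y = np.array([1.0, 2.0, 3.0]), np.array([-1.0, 0.5, 2.0])
print(np.dot(phi(x), phi(y)))   # explicit mapping: ~49.0
print(K(x, y))                  # kernel trick:      49.0
```

The kernel evaluation costs $O(n)$ regardless of how many coordinates $\phi$ would have produced.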
Dual Problem with Kernel Trick
We transform the decision function into:

$$ f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{m} \alpha_i y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \right) $$
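Below is a minimal sketch of evaluating this kernelised decision function. The support vectors, dual coefficients $\alpha_i$, labels, and bias are made-up illustrative values, not learned parameters from the notes.

```python
import numpy as np

def K(x, y):
    return (1.0 + np.dot(x, y)) ** 2             # degree-2 polynomial kernel

support_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # assumed x_i
alphas = np.array([0.5, 0.5, 1.0])               # assumed dual coefficients alpha_i
labels = np.array([+1, +1, -1])                  # assumed labels y_i
b = 0.2                                          # assumed bias term

def decision(x):
    """f(x) = sign( sum_i alpha_i * y_i * K(x_i, x) + b )"""
    s = sum(a * y * K(sv, x) for a, y, sv in zip(alphas, labels, support_vectors))
    return np.sign(s + b)

print(decision(np.array([0.2, 0.9])))
```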
To determine whether a kernel is a proper kernel, it must satisfy Mercer's Theorem. Here is a list of Commonly Used Kernels.
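For reference, here is a sketch of a few standard kernels written as plain NumPy functions; the hyperparameter values (`gamma`, `degree`, `coef0`) are illustrative defaults, not values prescribed by the notes.

```python
import numpy as np

def linear_kernel(x, y):
    # K(x, y) = x.y
    return np.dot(x, y)

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    # K(x, y) = (x.y + c)^d
    return (np.dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

def sigmoid_kernel(x, y, gamma=0.1, coef0=0.0):
    # K(x, y) = tanh(gamma * x.y + c); note: not Mercer-valid for all parameter choices
    return np.tanh(gamma * np.dot(x, y) + coef0)
```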