If we want to do feature selection, we can’t just use trial and error to find the features that yield the lowest error: that would be too expensive!

So, we have to determine the separability of the features. The more separable they are, the better classification works, and therefore the higher the accuracy. Doing this requires metrics, and there are several families of them.

Metrics based on distributions

To measure the overlap of two distributions:

$$J = \int f\big(p(\mathbf{x} \mid \omega_1),\, p(\mathbf{x} \mid \omega_2),\, P_1,\, P_2\big)\, d\mathbf{x}$$

where $\mathbf{x}$ is the feature vector; $\omega_1, \omega_2$ are the two classes; $P_i$ is the prior probability of class $\omega_i$; and $f$ is the function that computes the overlap between the distributions.

There are many such separability measures, but they are unnecessary for now.
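As a concrete instance of an overlap measure, here is a minimal sketch of the Bhattacharyya distance between two univariate Gaussian class-conditional distributions (the Gaussian assumption and the function name are choices for this illustration, not from the notes): the larger the distance, the less the distributions overlap, so the more separable the feature.

```python
import numpy as np

def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    # Bhattacharyya distance between two univariate Gaussians:
    # larger distance -> less overlap -> a more separable feature.
    return (0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
            + 0.5 * np.log((var1 + var2) / (2 * np.sqrt(var1 * var2))))

# A well-separated feature scores higher than a heavily overlapping one.
print(bhattacharyya_gaussian(0.0, 1.0, 3.0, 1.0))  # → 1.125
print(bhattacharyya_gaussian(0.0, 1.0, 0.5, 1.0))  # → 0.03125
```

Ranking features by a score like this is the basic idea behind distribution-based (filter-style) feature selection: compute the score per feature and keep the top-scoring ones.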

Metrics based on information theory

We have the Shannon entropy for a feature $x$: