If we want to do feature selection, we can’t just do trial and error to find the features that correspond to the lowest error ⇒ too expensive!
So, we have to determine the separability of the features. The more separable they are, the better classification works and therefore the higher the accuracy. Measuring this requires some metrics, and there are a bunch of them.
Metrics based on distributions
To measure the overlap of two distributions:

$$J = \int f\big(p(\mathbf{x} \mid \omega_1),\; p(\mathbf{x} \mid \omega_2),\; P(\omega_1),\; P(\omega_2)\big)\, d\mathbf{x}$$

where $\mathbf{x}$ is the feature vector; $\omega_1, \omega_2$ are the two classes; $P(\omega_i)$ is the probability of class $\omega_i$; and $f$ is the function that computes the overlap between the distributions.
There are many possible choices for $f$, but their equations are unnecessary for now.
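To make the idea concrete, here is a minimal sketch of one possible $f$: the Bayes-error overlap $\min\big(p(x \mid \omega_1)P(\omega_1),\, p(x \mid \omega_2)P(\omega_2)\big)$, evaluated for two 1-D Gaussian class-conditional densities. Both the choice of $f$ and the Gaussians are assumptions for illustration, not fixed by these notes.

```python
# A minimal sketch, assuming f is the Bayes-error overlap
# f = min(p(x|w1) P(w1), p(x|w2) P(w2)) and 1-D Gaussian class-conditional
# densities -- both are illustrative assumptions, not fixed by these notes.
import numpy as np
from scipy.stats import norm

def overlap(p1, p2, prior1=0.5, prior2=0.5, lo=-10.0, hi=10.0, n=10_000):
    """Numerically approximate J = integral of min(p1(x)*P1, p2(x)*P2) dx."""
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    return float(np.sum(np.minimum(p1(x) * prior1, p2(x) * prior2)) * dx)

# Well-separated class distributions => small overlap => a good feature.
print(overlap(norm(0, 1).pdf, norm(5, 1).pdf))  # ~0.006
# Heavily overlapping distributions => large overlap => a weak feature.
print(overlap(norm(0, 1).pdf, norm(1, 1).pdf))  # ~0.31
```

The smaller the integral, the less the two class distributions overlap along that feature, so the feature is more separable.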
Metrics based on information theory
We have the Shannon entropy, for feature $x$:

$$H(x) = -\sum_{i} P(\omega_i \mid x)\, \log_2 P(\omega_i \mid x)$$
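As a quick illustration, here is a minimal sketch of evaluating this entropy, assuming the class posteriors $P(\omega_i \mid x)$ for a given feature value have already been estimated; the function name and the example probabilities are made up for illustration.

```python
# A minimal sketch, assuming the usual definition H = -sum_i p_i * log2(p_i),
# applied to class probabilities P(w_i | x) already estimated for one feature
# value -- the inputs below are made-up illustrative numbers.
import numpy as np

def shannon_entropy(p):
    """Entropy in bits of a discrete distribution p (zero entries skipped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(shannon_entropy([0.5, 0.5]))    # 1.0 bit: classes maximally mixed
print(shannon_entropy([0.99, 0.01]))  # ~0.08 bits: nearly pure => good feature
```

Low entropy means the feature value almost determines the class; high entropy means the classes are mixed, so the feature tells us little.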