A generalization of the Nearest-Neighbor Method.
We have the same setup. Let's say we have a sample set $\mathcal{X} = \{x_1, x_2, \dots, x_N\}$, where $x_i$ is the $i$-th sample and $x_i \in \mathbb{R}^d$.
We have the formula

$$ g_i(x) = k_i, \qquad i = 1, 2, \dots, c, $$

where $k_i$ is the number of samples belonging to class $\omega_i$ among the $k$ nearest neighbors of $x$. Note again that $i$ is the class number and $\omega_i$ is the class name.
The decision is

$$ x \in \omega_m \quad \text{if} \quad g_m(x) = \max_{i = 1, \dots, c} g_i(x). $$
Basically, each neighbor gets one vote. Whichever class has the most votes wins.
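To make the voting rule concrete, here is a minimal sketch of a brute-force KNN classifier. It assumes Euclidean distance and NumPy arrays; the function name `knn_classify` and the toy data are illustrative, not from the original notes.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=5):
    """Classify x by majority vote among its k nearest training samples."""
    # Euclidean distance from the query x to every stored training sample
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # Each neighbor casts exactly one vote; the most common class wins
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy usage: two small clusters, query point near the class-1 cluster
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.95, 1.05]), X_train, y_train, k=3))  # -> 1
```

Note that the classifier keeps the full training set and scans all of it for every query, which is exactly the memory and computation issue raised below.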
What's the issue with this method?
- May not work well if the number of samples, $N$, is small.
- Is still risky if the data is noisy.
- To help with this, we have the Editing Nearest Neighbor Method.
- Huge memory (from storing all the training samples) and computation (from comparing the query with all stored samples) are required.
- To help with this, we have the Branch-and-Bound Algorithm to help with computation and the Condensed Nearest Neighbor Method to help with memory (see the sketch after this list).
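As one illustration of the memory-saving idea, below is a rough sketch of the Condensed Nearest Neighbor reduction: keep only a subset ("store") of the training samples such that 1-NN on that subset still classifies every discarded sample correctly. The function name `condensed_nn` and the use of Euclidean distance are assumptions for the example.

```python
import numpy as np

def condensed_nn(X, y):
    """Condensed Nearest Neighbor reduction (sketch).

    Keeps a subset of (X, y) such that every sample left out is still
    classified correctly by 1-NN on the kept subset.
    """
    store_idx = [0]                      # seed the store with the first sample
    changed = True
    while changed:                       # repeat until a full pass adds nothing
        changed = False
        for i in range(len(X)):
            if i in store_idx:
                continue
            # 1-NN prediction using only the current store
            dists = np.linalg.norm(X[store_idx] - X[i], axis=1)
            pred = y[store_idx][np.argmin(dists)]
            if pred != y[i]:             # misclassified -> must keep this sample
                store_idx.append(i)
                changed = True
    return X[store_idx], y[store_idx]
```

The retained subset can then be used in place of the full training set by the voting classifier sketched earlier.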
In what scenario should we use KNN?
- Dimensionality is not too high: the higher the dimension, the sparser the data, and the less accurate KNN becomes.
- Sample size is not too small: a larger sample keeps the feature space from being too sparse.
How do we choose the best value of $k$?
- Avoid ties in voting (for a two-class problem, an odd $k$ does this).
- Balance the sample size and complexity (the bias-variance trade-off); in practice this is often done by cross-validation, sketched after this list.
- You sometimes need to choose between your model being overfitted (low bias, high variance) or underfitted (high bias, low variance); a small $k$ leans toward overfitting, while a large $k$ leans toward underfitting.
- Bias here means error or misclassification.
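Since the best $k$ depends on the data, a common practical choice (not prescribed in the notes above) is to compare a few candidate values by cross-validation. This is a self-contained sketch; the helper name `choose_k`, the fold count, and the brute-force distance computation are all assumptions.

```python
import numpy as np

def choose_k(X, y, candidate_ks, n_folds=5, seed=0):
    """Pick k by n-fold cross-validation accuracy (sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        correct = 0
        for f in range(n_folds):
            test_idx = folds[f]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != f])
            for i in test_idx:
                # brute-force KNN vote using only the training folds
                dists = np.linalg.norm(X[train_idx] - X[i], axis=1)
                nearest = train_idx[np.argsort(dists)[:k]]
                labels, counts = np.unique(y[nearest], return_counts=True)
                if labels[np.argmax(counts)] == y[i]:
                    correct += 1
        acc = correct / len(X)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```

Restricting `candidate_ks` to odd values is one simple way to avoid ties in two-class problems.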