A generalization of Nearest-Neighbor Method.

We have the same setup. Let’s say we have a sample set $\{(x_1, \theta_1), \dots, (x_N, \theta_N)\}$, where $x_i$ is the sample while $\theta_i \in \{\omega_1, \dots, \omega_c\}$ is its class label.

We have the formula

$$P(\omega_i \mid x) = \frac{k_i}{k},$$

where $k_i$ is the number of samples belonging to $\omega_i$ among the $k$ nearest neighbors of $x$.

Note again that $i$ is the class number and $\omega_i$ is the class name.

The decision rule is

$$x \in \omega_j \quad \text{if} \quad k_j = \max_i k_i.$$

Basically, each neighbor gets one vote; whichever class collects the most votes wins.
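The voting rule above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance; the function name and data layout are my own, not from the notes:

```python
from collections import Counter
import math

def knn_classify(train, labels, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    # Sort training indices by Euclidean distance to the query point x.
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))
    # votes[c] is k_i from the formula: samples of class c among the k nearest.
    votes = Counter(labels[i] for i in order[:k])
    # Decide the class with the maximum vote count k_i.
    return votes.most_common(1)[0][0]
```

For example, with two well-separated clusters, a query near the first cluster is labeled accordingly:

```python
train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["a", "a", "a", "b", "b", "b"]
knn_classify(train, labels, [0.5, 0.5], k=3)  # → "a"
```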

What’s the issue with this method?

In what scenario should we use KNN?

  • Dimensionality is not too high → the higher the dimension, the sparser the samples become, and the less accurate KNN is.
  • Sample size is not too small → prevents sparsity.

How do we choose the best value of $k$?

  • Avoid ties in voting (e.g., use an odd $k$ for a two-class problem).
  • Balancing the sample size and complexity (bias-variance trade-off)
    • You sometimes need to choose between your model being overfitted (low bias, high variance) or underfitted (high bias, low variance).
    • Bias here means error or misclassification.
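One common way to balance these concerns is to pick $k$ by validation accuracy on held-out data. The sketch below assumes a simple hold-out split and odd candidate values of $k$ to avoid voting ties; all names here are illustrative:

```python
from collections import Counter
import math

def knn_classify(train, labels, x, k):
    # Majority vote among the k nearest training samples (Euclidean distance).
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def choose_k(train, labels, val, val_labels, candidates=(1, 3, 5, 7)):
    """Return the candidate k with the best hold-out accuracy."""
    def accuracy(k):
        hits = sum(knn_classify(train, labels, x, k) == y
                   for x, y in zip(val, val_labels))
        return hits / len(val)
    # Among equally accurate candidates, max() keeps the first (smallest) k.
    return max(candidates, key=accuracy)
```

In practice, k-fold cross-validation gives a more reliable estimate than a single hold-out split, especially when the sample size is small.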