Instead of finding weights that correctly classify all samples, we look for weights that correctly classify as many samples as possible.
One way to do this is to replace the Perceptron's discrete misclassification-based cost with a continuous error term. The equations below are the standard form of this derivation. The Perceptron minimizes

$$J(w) = -\sum_{i \in M} y_i \, w^\top x_i$$

(where $M$ is the set of misclassified samples), whereas here we change to the squared error of the linear output:

$$J(w) = \frac{1}{2} \sum_i \left( y_i - w^\top x_i \right)^2.$$

We then differentiate for gradient descent to get

$$\frac{\partial J}{\partial w} = -\sum_i \left( y_i - w^\top x_i \right) x_i,$$

and so we update the weights for every example $i$ with

$$w \leftarrow w + \eta \left( y_i - w^\top x_i \right) x_i,$$

where $\eta$ is the learning rate.
This is the ADALINE / LMS algorithm.
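The per-example update above can be sketched in code. This is a minimal NumPy illustration, not a reference implementation; the learning rate, epoch count, toy data, and the separate bias term are illustrative assumptions.

```python
import numpy as np

def lms_train(X, y, lr=0.01, epochs=50):
    """ADALINE / LMS: per-example gradient descent on squared error.

    X: (n_samples, n_features) inputs; y: (n_samples,) targets in {-1, +1}.
    lr (eta) and epochs are assumed hyperparameters for this sketch.
    """
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        # Visit samples in a shuffled order each epoch
        for i in rng.permutation(len(X)):
            error = y[i] - (X[i] @ w + b)  # y_i - w^T x_i (continuous error)
            w += lr * error * X[i]         # w <- w + eta * error * x_i
            b += lr * error                # bias updated with the same rule
    return w, b

# Toy linearly separable data (assumed for illustration)
X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, 0.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = lms_train(X, y)
preds = np.sign(X @ w + b)
```

Note that, unlike the Perceptron, every example contributes to the update here, not just the misclassified ones, because the error is measured on the linear output rather than the thresholded prediction.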