The main breakthrough here, compared to previous methods, is the addition of a nonlinear activation function. Here, the activation function is the sigmoid:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
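As a quick illustration (a minimal sketch in plain Python, not part of the original notes), the sigmoid and its derivative, which will be needed later for gradient computations:

```python
import math

def sigmoid(z):
    """Sigmoid activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)
```

Note that $\sigma(0) = 0.5$ and that $\sigma$ saturates toward 0 and 1 for large $|z|$, where its derivative vanishes.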
Based on the Kolmogorov superposition theorem:

Kolmogorov Theorem

Any continuous function $f(x_1, \dots, x_n)$ defined on the unit hypercube $[0, 1]^n$, where $n \geq 2$, can be represented in the form

$$f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\!\left( \sum_{p=1}^{n} \psi_{pq}(x_p) \right)$$

for properly chosen functions $\Phi_q$ and $\psi_{pq}$.
As such, we have the

Universal Approximation Theorem

A feed-forward network with a single hidden layer of sigmoidal units can approximate any continuous function on a compact subset of $\mathbb{R}^n$ to within any $\epsilon > 0$, given enough hidden units.

This means an MLP is capable of “implementing” any continuous function from input to output, given a sufficient number of hidden units, a proper activation function, and suitable weights.
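To make the “implementing a function” claim concrete, here is a hand-constructed sketch (the weights below are picked by hand for illustration, not learned, and are not from the original notes): a one-hidden-layer sigmoid MLP that computes XOR, a function no single linear unit can represent.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_mlp(x1, x2):
    """2-2-1 MLP with hand-picked weights; the large weights make
    each sigmoid behave almost like a step function."""
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)    # ~ OR(x1, x2)
    h2 = sigmoid(20 * x1 + 20 * x2 - 30)    # ~ AND(x1, x2)
    return sigmoid(20 * h1 - 20 * h2 - 10)  # ~ OR but not AND = XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor_mlp(a, b)))
```

Rounding the output recovers the exact XOR truth table: the hidden layer carves the input space into the two linearly separable pieces that XOR needs.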
Definition of MLP

The model: layers of weighted sums, each followed by the activation function. For a network with one hidden layer,

$$h_j = \sigma\!\left( \sum_i w^{(1)}_{ji} x_i + b^{(1)}_j \right), \qquad \hat{y}_k = \sigma\!\left( \sum_j w^{(2)}_{kj} h_j + b^{(2)}_k \right)$$

The training data: $N$ labeled pairs,

$$\mathcal{D} = \left\{ \left( x^{(i)}, y^{(i)} \right) \right\}_{i=1}^{N}$$

Objective function: the squared error over the training set,

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{N} \left\| y^{(i)} - \hat{y}^{(i)} \right\|^2$$

And finally, the learning algorithm: gradient descent on $J$,

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$$
Of course, to get $\nabla_\theta J$ we need the partial derivative of the loss with respect to every weight; backpropagation computes these via the chain rule, layer by layer. Writing $z$ for a unit's pre-activation and $\delta = \partial J / \partial z$ for its error term:

For the output layer, we have

$$\delta_k = (\hat{y}_k - y_k)\, \sigma'(z_k), \qquad \frac{\partial J}{\partial w^{(2)}_{kj}} = \delta_k \, h_j$$

For the hidden layers, we have

$$\delta_j = \sigma'(z_j) \sum_k w^{(2)}_{kj} \, \delta_k, \qquad \frac{\partial J}{\partial w^{(1)}_{ji}} = \delta_j \, x_i$$

where $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$ for the sigmoid.
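The whole procedure, forward pass, output and hidden deltas, and the gradient-descent update, can be sketched end to end in plain Python. This is a toy 2-3-1 network trained on XOR with the squared-error loss; the network size, learning rate, seed, and epoch count are illustrative choices of mine, not from the original notes:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
H = 3  # number of hidden units (illustrative choice)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
W2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
eta = 0.5  # learning rate

def forward(x):
    h = [sigmoid(sum(W1[j][i] * x[i] for i in range(2)) + b1[j]) for j in range(H)]
    y_hat = sigmoid(sum(W2[j] * h[j] for j in range(H)) + b2)
    return h, y_hat

def loss():
    return 0.5 * sum((forward(x)[1] - y) ** 2 for x, y in data)

loss_before = loss()
for epoch in range(10000):
    for x, y in data:
        h, y_hat = forward(x)
        # output layer: delta = (y_hat - y) * sigma'(z), with sigma' = y_hat(1 - y_hat)
        d_out = (y_hat - y) * y_hat * (1 - y_hat)
        # hidden layer: delta_j = sigma'(z_j) * sum_k w_kj * delta_k
        d_hid = [h[j] * (1 - h[j]) * W2[j] * d_out for j in range(H)]
        # stochastic gradient-descent step on every weight and bias
        for j in range(H):
            W2[j] -= eta * d_out * h[j]
            b1[j] -= eta * d_hid[j]
            for i in range(2):
                W1[j][i] -= eta * d_hid[j] * x[i]
        b2 -= eta * d_out

print(loss_before, loss())
```

After training, the loss drops well below its initial value; each weight update is exactly one application of the two delta equations above followed by the gradient-descent rule.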
Note that in a multi-class classification setting, we usually use the cross-entropy loss as the objective function instead:

$$J(\theta) = -\sum_{i=1}^{N} \sum_{k} y^{(i)}_k \log \hat{y}^{(i)}_k$$

where the outputs are produced by a softmax,

$$\hat{y}_k = \frac{e^{z_k}}{\sum_{k'} e^{z_{k'}}}$$
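A small sketch of softmax plus cross-entropy in plain Python (the function names are my own; not from the original notes). A useful aside: with softmax outputs and cross-entropy loss, the output-layer error term simplifies to just $\hat{y}_k - y_k$, with no $\sigma'$ factor.

```python
import math

def softmax(z):
    """Turn raw scores z_k into a probability distribution.
    Subtracting max(z) first avoids overflow in exp."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_hat, y):
    """Cross-entropy loss -sum_k y_k * log(y_hat_k)
    for a one-hot target vector y."""
    return -sum(t * math.log(p) for p, t in zip(y_hat, y) if t > 0)

scores = [2.0, 1.0, 0.1]   # raw pre-softmax outputs z_k for 3 classes
probs = softmax(scores)
target = [1, 0, 0]         # one-hot: the true class is class 0
print(probs, cross_entropy(probs, target))
```

The softmax guarantees the outputs sum to 1, so they can be read as class probabilities; the loss is small when the probability assigned to the true class is close to 1.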