The main breakthrough here, compared to previous methods, is the addition of an activation function.
Here, the activation function is the sigmoid function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$
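As a minimal sketch (NumPy, with helper names of my own choosing), the sigmoid and its derivative, which we will need later for backpropagation:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # Derivative of the sigmoid: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)
```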
Based on the Kolmogorov Theorem,
Kolmogorov Theorem
Any continuous function $f(x_1, \dots, x_n)$ defined on the unit hypercube $[0,1]^n$, where $n \ge 2$, can be represented in the form
$$f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \psi_{q,p}(x_p) \right)$$
for properly chosen continuous one-dimensional functions $\Phi_q$ and $\psi_{q,p}$.
As such,
Universal Approximation Theorem
A feedforward network with a single hidden layer containing a finite number of units can approximate any continuous function on a compact subset of $\mathbb{R}^n$ to arbitrary accuracy, provided the activation function is non-constant, bounded, and continuous.
This means an MLP is capable of "implementing" any continuous function from input to output, given a sufficient number of hidden units, a proper activation function, and proper weights.
Definition of MLP
The model: a composition of affine maps and activations,
$$a^{(0)} = x, \qquad z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma\!\left(z^{(l)}\right), \qquad \hat{y} = a^{(L)}.$$
The training data:
$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}.$$
Objective function (e.g., the squared error):
$$J(W, b) = \frac{1}{2} \sum_{i=1}^{N} \left\| \hat{y}_i - y_i \right\|^2.$$
And finally, the learning algorithm: gradient descent with learning rate $\eta$,
$$W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial J}{\partial W^{(l)}}, \qquad b^{(l)} \leftarrow b^{(l)} - \eta \frac{\partial J}{\partial b^{(l)}}.$$
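To make these definitions concrete, here is a minimal NumPy sketch of a one-hidden-layer MLP; the layer sizes, variable names, and the squared-error objective are illustrative assumptions, and it reuses the `sigmoid` helper defined above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes for a one-hidden-layer MLP: n_in -> n_hidden -> n_out
n_in, n_hidden, n_out = 2, 8, 1
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros((n_hidden, 1))
W2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros((n_out, 1))

def forward(x):
    # x is a column vector of shape (n_in, 1)
    z1 = W1 @ x + b1      # pre-activation of the hidden layer
    a1 = sigmoid(z1)      # hidden activation
    z2 = W2 @ a1 + b2     # pre-activation of the output layer
    a2 = sigmoid(z2)      # network output y_hat
    return z1, a1, z2, a2

def objective(X, Y):
    # Squared-error objective J summed over the training pairs (x_i, y_i)
    return 0.5 * sum(np.sum((forward(x)[-1] - y) ** 2) for x, y in zip(X, Y))
```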
Of course, to get $\frac{\partial J}{\partial W^{(l)}}$ we need to differentiate $J$ with respect to $W^{(l)}$. In the following equations, $L$ refers to the output layer, while $z^{(l)}$ refers to the value before the activation function and $a^{(l)} = \sigma(z^{(l)})$ is the value after.
For the output layer, we have
$$\delta^{(L)} = \left(a^{(L)} - y\right) \odot \sigma'\!\left(z^{(L)}\right),$$
which is the value of $\frac{\partial J}{\partial z^{(L)}}$. We assume $\sigma$ is the sigmoid, so $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$.
For the hidden layers, we have
$$\delta^{(l)} = \left(W^{(l+1)}\right)^{\top} \delta^{(l+1)} \odot \sigma'\!\left(z^{(l)}\right), \qquad \frac{\partial J}{\partial W^{(l)}} = \delta^{(l)} \left(a^{(l-1)}\right)^{\top}, \qquad \frac{\partial J}{\partial b^{(l)}} = \delta^{(l)}.$$
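Continuing the sketch above (same assumptions, plus an illustrative learning rate), the backward pass computes these deltas and applies one gradient-descent update:

```python
eta = 0.1  # learning rate (illustrative choice)

def backward(x, y):
    # Forward pass, caching pre-activations z and activations a
    z1, a1, z2, a2 = forward(x)
    # Output-layer delta: dJ/dz2 = (a2 - y) * sigma'(z2)  (squared-error objective)
    delta2 = (a2 - y) * sigmoid_prime(z2)
    # Hidden-layer delta: dJ/dz1 = W2^T delta2 * sigma'(z1)
    delta1 = (W2.T @ delta2) * sigmoid_prime(z1)
    # Gradients with respect to weights and biases
    dW2, db2 = delta2 @ a1.T, delta2
    dW1, db1 = delta1 @ x.T, delta1
    return dW1, db1, dW2, db2

# One gradient-descent step on a single training example (x, y)
x = np.array([[0.5], [-1.0]])
y = np.array([[1.0]])
dW1, db1, dW2, db2 = backward(x, y)
W1 -= eta * dW1; b1 -= eta * db1
W2 -= eta * dW2; b2 -= eta * db2
```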
Note that in the multi-class classification setting, we usually use the cross-entropy loss as the objective function instead:
$$J = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik},$$
where $\hat{y}_{ik}$ is the softmax output of the last layer,
$$\hat{y}_{ik} = \frac{\exp\!\left(z_{ik}^{(L)}\right)}{\sum_{j=1}^{K} \exp\!\left(z_{ij}^{(L)}\right)}.$$
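For the multi-class case, here is a minimal NumPy sketch of the softmax output and the cross-entropy objective (the max-subtraction is a standard numerical-stability trick, not part of the original notes):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1 over classes
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

def cross_entropy(y_hat, y):
    # y is a one-hot column vector; J = -sum_k y_k * log(y_hat_k)
    return -np.sum(y * np.log(y_hat + 1e-12))

# A convenient fact: with softmax plus cross-entropy, the output-layer delta
# simplifies to dJ/dz_L = y_hat - y.
z = np.array([[2.0], [0.5], [-1.0]])
y = np.array([[1.0], [0.0], [0.0]])
print(cross_entropy(softmax(z), y))
```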