The main breakthrough here, compared to previous methods, is the addition of an activation function.

Here, the activation function is the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
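As a quick sanity check, here is a minimal NumPy sketch of the sigmoid (the function name `sigmoid` is mine, not from the original):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Equals 0.5 at the origin and saturates toward 0 and 1 at the tails.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067, 0.5, 0.9933]
```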

This capability is based on the Kolmogorov Theorem:

Kolmogorov Theorem

Any continuous function $f(x_1, \dots, x_n)$ defined on the unit hypercube $[0,1]^n$, where $n \ge 2$, can be represented in the form

$$f(x_1, \dots, x_n) = \sum_{j=1}^{2n+1} \Phi_j\!\left( \sum_{i=1}^{n} \psi_{ij}(x_i) \right)$$

for properly chosen continuous functions $\Phi_j$ and $\psi_{ij}$.

As such, we obtain:

Universal Approximation Theorem

Let $\sigma$ be a non-constant, bounded, and continuous activation function. For any continuous function $f$ on $[0,1]^n$ and any $\varepsilon > 0$, there exist a number of hidden units $N$ and parameters $v_j, b_j \in \mathbb{R}$, $w_j \in \mathbb{R}^n$ such that

$$F(x) = \sum_{j=1}^{N} v_j \, \sigma\!\left(w_j^\top x + b_j\right)$$

satisfies $|F(x) - f(x)| < \varepsilon$ for all $x \in [0,1]^n$.

This means an MLP is capable of “implementing” any continuous function from input to output, given a sufficient number of hidden units, a proper activation function, and proper weights.
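To make this concrete, here is a small sketch under assumed settings (random hidden weights, output weights fitted by least squares, target $\sin(2\pi x)$; none of this is from the original) showing a one-hidden-layer sigmoid network driving the approximation error down on $[0,1]$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)      # target continuous function on [0, 1]

# F(x) = sum_j v_j * sigmoid(w_j x + b_j): fix random hidden weights,
# then solve for the output weights v by linear least squares.
N = 50                                    # number of hidden units
w = rng.normal(0.0, 10.0, size=N)         # hidden weights
b = rng.uniform(-10.0, 10.0, size=N)      # hidden biases

x = np.linspace(0.0, 1.0, 200)
H = sigmoid(np.outer(x, w) + b)           # hidden activations, shape (200, N)
v, *_ = np.linalg.lstsq(H, f(x), rcond=None)

print("max |F(x) - f(x)| =", np.max(np.abs(H @ v - f(x))))  # small on the grid
```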

Definition of MLP

The model: an $L$-layer network that maps input $x$ to output $\hat{y}$ by alternating affine maps and activations,

$$a^{(0)} = x, \qquad z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma\!\left(z^{(l)}\right), \qquad \hat{y} = a^{(L)}.$$

The training data: $N$ input–output pairs

$$\mathcal{D} = \left\{ (x_i, y_i) \right\}_{i=1}^{N}.$$

Objective function: the squared error over the training set,

$$J(W, b) = \frac{1}{2} \sum_{i=1}^{N} \left\| \hat{y}_i - y_i \right\|^2.$$

And finally, the learning algorithm: gradient descent with learning rate $\eta$,

$$W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial J}{\partial W^{(l)}}, \qquad b^{(l)} \leftarrow b^{(l)} - \eta \frac{\partial J}{\partial b^{(l)}}.$$
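Putting the four pieces together, here is a minimal NumPy sketch of the forward pass (layer sizes, names, and initialization scale are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(sizes, rng):
    """One (W, b) pair per layer; sizes = [n_in, n_hidden, ..., n_out]."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros((m, 1)))
            for n, m in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Return the cached a^(0)..a^(L), with a^(l) = sigma(W^(l) a^(l-1) + b^(l))."""
    activations = [x]
    for W, b in params:
        z = W @ activations[-1] + b     # pre-activation z^(l)
        activations.append(sigmoid(z))  # post-activation a^(l)
    return activations

rng = np.random.default_rng(0)
params = init_params([2, 4, 1], rng)              # 2 inputs, 4 hidden, 1 output
activations = forward(params, np.array([[0.5], [-1.0]]))
print(activations[-1])                            # y_hat = a^(L)
```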

Of course, to get $\frac{\partial J}{\partial W^{(l)}}$ we need to differentiate $J$ with respect to $W^{(l)}$. In the following equations, $L$ refers to the output layer, while $z^{(l)}$ refers to the value before the activation function and $a^{(l)} = \sigma(z^{(l)})$ is the value after.

For the output layer, we have

$$\delta^{(L)} = \left(a^{(L)} - y\right) \odot \sigma'\!\left(z^{(L)}\right),$$

which is the value of $\frac{\partial J}{\partial z^{(L)}}$. We assume $\sigma$ is sigmoid, so $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$.

For the hidden layers, we have

$$\delta^{(l)} = \left( \left(W^{(l+1)}\right)^\top \delta^{(l+1)} \right) \odot \sigma'\!\left(z^{(l)}\right),$$

and the gradients then follow layer by layer:

$$\frac{\partial J}{\partial W^{(l)}} = \delta^{(l)} \left(a^{(l-1)}\right)^\top, \qquad \frac{\partial J}{\partial b^{(l)}} = \delta^{(l)}.$$
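Continuing the forward-pass sketch above (it reuses `forward` and `params` from there), the backward pass implements exactly these recurrences; this is an illustrative sketch under the squared-error objective, not the author's implementation:

```python
def backprop(params, x, y):
    """Gradients of J = 0.5 * ||a^(L) - y||^2 for a single example."""
    activations = forward(params, x)
    grads = [None] * len(params)
    # Output layer: delta^(L) = (a^(L) - y) * sigma'(z^(L)),
    # using sigma'(z) = a (1 - a) for the sigmoid.
    a = activations[-1]
    delta = (a - y) * a * (1 - a)
    for l in reversed(range(len(params))):
        # dJ/dW^(l) = delta^(l) a^(l-1)^T,  dJ/db^(l) = delta^(l)
        grads[l] = (delta @ activations[l].T, delta)
        if l > 0:
            W, _ = params[l]
            a = activations[l]
            # Hidden layers: delta^(l) = (W^(l+1)^T delta^(l+1)) * sigma'(z^(l))
            delta = (W.T @ delta) * a * (1 - a)
    return grads

# One gradient-descent step with learning rate eta.
eta = 0.5
x, y = np.array([[0.5], [-1.0]]), np.array([[1.0]])
grads = backprop(params, x, y)
params = [(W - eta * dW, b - eta * db)
          for (W, b), (dW, db) in zip(params, grads)]
```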

Note that in the multi-class classification setting, we usually use the cross-entropy loss as the objective function instead,

$$J = -\sum_{i=1}^{N} \sum_{k} y_{ik} \log \hat{y}_{ik},$$

where $\hat{y}_i = \operatorname{softmax}\!\left(z_i^{(L)}\right)$ is the softmax output of the last layer,

$$\hat{y}_{ik} = \frac{\exp\!\left(z_{ik}^{(L)}\right)}{\sum_{j} \exp\!\left(z_{ij}^{(L)}\right)}.$$
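A short sketch of these two pieces (the max-subtraction in `softmax` is a standard numerical-stability trick, my addition rather than the original's):

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis; subtracting max(z) avoids overflow."""
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(y_hat, y):
    """J = -sum_k y_k log(y_hat_k) for a one-hot target y."""
    return -np.sum(y * np.log(y_hat + 1e-12))

z = np.array([2.0, 1.0, 0.1])   # last-layer pre-activation z^(L)
y = np.array([1.0, 0.0, 0.0])   # one-hot target
y_hat = softmax(z)
print(y_hat, cross_entropy(y_hat, y))
# With softmax + cross-entropy, the output-layer error conveniently
# simplifies to delta^(L) = y_hat - y.
```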