The main breakthrough here, compared to previous methods, is the addition of a nonlinear activation function. Here, the activation function is the sigmoid:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
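As a quick illustration (a minimal sketch in plain Python, not part of the original notes), the sigmoid and its derivative, which will be needed later for gradient computations:

```python
import math

def sigmoid(z):
    """Sigmoid activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)
```

Note that $\sigma(0) = 0.5$ and that $\sigma$ saturates toward 0 and 1 for large $|z|$, where its derivative vanishes.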
Based on the Kolmogorov superposition theorem:

Kolmogorov Theorem

Any continuous function $f(x_1, \dots, x_n)$ defined on the unit hypercube $[0, 1]^n$, where $n \geq 2$, can be represented in the form

$$f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\!\left( \sum_{p=1}^{n} \psi_{pq}(x_p) \right)$$

for properly chosen functions $\Phi_q$ and $\psi_{pq}$.
As such, we have the

Universal Approximation Theorem

A feed-forward network with a single hidden layer of sigmoidal units can approximate any continuous function on a compact subset of $\mathbb{R}^n$ to within any $\epsilon > 0$, given enough hidden units.

This means an MLP is capable of “implementing” any continuous function from input to output, given a sufficient number of hidden units, a proper activation function, and suitable weights.
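To make the “implementing a function” claim concrete, here is a hand-constructed sketch (the weights below are picked by hand for illustration, not learned, and are not from the original notes): a one-hidden-layer sigmoid MLP that computes XOR, a function no single linear unit can represent.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_mlp(x1, x2):
    """2-2-1 MLP with hand-picked weights; the large weights make
    each sigmoid behave almost like a step function."""
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)    # ~ OR(x1, x2)
    h2 = sigmoid(20 * x1 + 20 * x2 - 30)    # ~ AND(x1, x2)
    return sigmoid(20 * h1 - 20 * h2 - 10)  # ~ OR but not AND = XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor_mlp(a, b)))
```

Rounding the output recovers the exact XOR truth table: the hidden layer carves the input space into the two linearly separable pieces that XOR needs.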
Definition of MLP

The model: layers of weighted sums, each followed by the activation function. For a network with one hidden layer,

$$h_j = \sigma\!\left( \sum_i w^{(1)}_{ji} x_i + b^{(1)}_j \right), \qquad \hat{y}_k = \sigma\!\left( \sum_j w^{(2)}_{kj} h_j + b^{(2)}_k \right)$$

The training data: $N$ labeled pairs,

$$\mathcal{D} = \left\{ \left( x^{(i)}, y^{(i)} \right) \right\}_{i=1}^{N}$$

Objective function: the squared error over the training set,

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{N} \left\| y^{(i)} - \hat{y}^{(i)} \right\|^2$$

And finally, the learning algorithm: gradient descent on $J$,

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$$
Of course, to get $\nabla_\theta J$ we need the partial derivative of the loss with respect to every weight; backpropagation computes these via the chain rule, layer by layer. Writing $z$ for a unit's pre-activation and $\delta = \partial J / \partial z$ for its error term:

For the output layer, we have

$$\delta_k = (\hat{y}_k - y_k)\, \sigma'(z_k), \qquad \frac{\partial J}{\partial w^{(2)}_{kj}} = \delta_k \, h_j$$

For the hidden layers, we have

$$\delta_j = \sigma'(z_j) \sum_k w^{(2)}_{kj} \, \delta_k, \qquad \frac{\partial J}{\partial w^{(1)}_{ji}} = \delta_j \, x_i$$

where $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$ for the sigmoid.
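The whole procedure, forward pass, output and hidden deltas, and the gradient-descent update, can be sketched end to end in plain Python. This is a toy 2-3-1 network trained on XOR with the squared-error loss; the network size, learning rate, seed, and epoch count are illustrative choices of mine, not from the original notes:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
H = 3  # number of hidden units (illustrative choice)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
W2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
eta = 0.5  # learning rate

def forward(x):
    h = [sigmoid(sum(W1[j][i] * x[i] for i in range(2)) + b1[j]) for j in range(H)]
    y_hat = sigmoid(sum(W2[j] * h[j] for j in range(H)) + b2)
    return h, y_hat

def loss():
    return 0.5 * sum((forward(x)[1] - y) ** 2 for x, y in data)

loss_before = loss()
for epoch in range(10000):
    for x, y in data:
        h, y_hat = forward(x)
        # output layer: delta = (y_hat - y) * sigma'(z), with sigma' = y_hat(1 - y_hat)
        d_out = (y_hat - y) * y_hat * (1 - y_hat)
        # hidden layer: delta_j = sigma'(z_j) * sum_k w_kj * delta_k
        d_hid = [h[j] * (1 - h[j]) * W2[j] * d_out for j in range(H)]
        # stochastic gradient-descent step on every weight and bias
        for j in range(H):
            W2[j] -= eta * d_out * h[j]
            b1[j] -= eta * d_hid[j]
            for i in range(2):
                W1[j][i] -= eta * d_hid[j] * x[i]
        b2 -= eta * d_out

print(loss_before, loss())
```

After training, the loss drops well below its initial value; each weight update is exactly one application of the two delta equations above followed by the gradient-descent rule.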
Note that in a multi-class classification setting, we usually use the cross-entropy loss as the objective function instead:

$$J(\theta) = -\sum_{i=1}^{N} \sum_{k} y^{(i)}_k \log \hat{y}^{(i)}_k$$

where the outputs are produced by a softmax,

$$\hat{y}_k = \frac{e^{z_k}}{\sum_{k'} e^{z_{k'}}}$$
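A small sketch of softmax plus cross-entropy in plain Python (the function names are my own; not from the original notes). A useful aside: with softmax outputs and cross-entropy loss, the output-layer error term simplifies to just $\hat{y}_k - y_k$, with no $\sigma'$ factor.

```python
import math

def softmax(z):
    """Turn raw scores z_k into a probability distribution.
    Subtracting max(z) first avoids overflow in exp."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_hat, y):
    """Cross-entropy loss -sum_k y_k * log(y_hat_k)
    for a one-hot target vector y."""
    return -sum(t * math.log(p) for p, t in zip(y_hat, y) if t > 0)

scores = [2.0, 1.0, 0.1]   # raw pre-softmax outputs z_k for 3 classes
probs = softmax(scores)
target = [1, 0, 0]         # one-hot: the true class is class 0
print(probs, cross_entropy(probs, target))
```

The softmax guarantees the outputs sum to 1, so they can be read as class probabilities; the loss is small when the probability assigned to the true class is close to 1.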