The main breakthrough here, compared to previous methods, is the addition of an activation function.

Here, the activation function is the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
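As a quick sanity check, here is a minimal NumPy sketch of the sigmoid (the function name `sigmoid` is mine, not from the original):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Equals 0.5 at the origin and saturates toward 0 and 1 at the tails.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067, 0.5, 0.9933]
```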

This capability is based on the Kolmogorov Theorem:

Kolmogorov Theorem

Any continuous function $f(x_1, \dots, x_n)$ defined on the unit hypercube $[0,1]^n$, where $n \ge 2$, can be represented in the form

$$f(x_1, \dots, x_n) = \sum_{j=1}^{2n+1} \Phi_j\!\left( \sum_{i=1}^{n} \psi_{ij}(x_i) \right)$$

for properly chosen continuous functions $\Phi_j$ and $\psi_{ij}$.

As such, we obtain:

Universal Approximation Theorem

Let $\sigma$ be a non-constant, bounded, and continuous activation function. For any continuous function $f$ on $[0,1]^n$ and any $\varepsilon > 0$, there exist a number of hidden units $N$ and parameters $v_j, b_j \in \mathbb{R}$, $w_j \in \mathbb{R}^n$ such that

$$F(x) = \sum_{j=1}^{N} v_j \, \sigma\!\left(w_j^\top x + b_j\right)$$

satisfies $|F(x) - f(x)| < \varepsilon$ for all $x \in [0,1]^n$.

This means an MLP is capable of “implementing” any continuous function from input to output, given a sufficient number of hidden units, a proper activation function, and proper weights.
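To make this concrete, here is a small sketch under assumed settings (random hidden weights, output weights fitted by least squares, target $\sin(2\pi x)$; none of this is from the original) showing a one-hidden-layer sigmoid network driving the approximation error down on $[0,1]$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)      # target continuous function on [0, 1]

# F(x) = sum_j v_j * sigmoid(w_j x + b_j): fix random hidden weights,
# then solve for the output weights v by linear least squares.
N = 50                                    # number of hidden units
w = rng.normal(0.0, 10.0, size=N)         # hidden weights
b = rng.uniform(-10.0, 10.0, size=N)      # hidden biases

x = np.linspace(0.0, 1.0, 200)
H = sigmoid(np.outer(x, w) + b)           # hidden activations, shape (200, N)
v, *_ = np.linalg.lstsq(H, f(x), rcond=None)

print("max |F(x) - f(x)| =", np.max(np.abs(H @ v - f(x))))  # small on the grid
```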

Definition of MLP

The model: an $L$-layer network that maps input $x$ to output $\hat{y}$ by alternating affine maps and activations,

$$a^{(0)} = x, \qquad z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma\!\left(z^{(l)}\right), \qquad \hat{y} = a^{(L)}.$$

The training data: $N$ input–output pairs

$$\mathcal{D} = \left\{ (x_i, y_i) \right\}_{i=1}^{N}.$$

Objective function: the squared error over the training set,

$$J(W, b) = \frac{1}{2} \sum_{i=1}^{N} \left\| \hat{y}_i - y_i \right\|^2.$$

And finally, the learning algorithm: gradient descent with learning rate $\eta$,

$$W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial J}{\partial W^{(l)}}, \qquad b^{(l)} \leftarrow b^{(l)} - \eta \frac{\partial J}{\partial b^{(l)}}.$$
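Putting the four pieces together, here is a minimal NumPy sketch of the forward pass (layer sizes, names, and initialization scale are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(sizes, rng):
    """One (W, b) pair per layer; sizes = [n_in, n_hidden, ..., n_out]."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros((m, 1)))
            for n, m in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Return the cached a^(0)..a^(L), with a^(l) = sigma(W^(l) a^(l-1) + b^(l))."""
    activations = [x]
    for W, b in params:
        z = W @ activations[-1] + b     # pre-activation z^(l)
        activations.append(sigmoid(z))  # post-activation a^(l)
    return activations

rng = np.random.default_rng(0)
params = init_params([2, 4, 1], rng)              # 2 inputs, 4 hidden, 1 output
activations = forward(params, np.array([[0.5], [-1.0]]))
print(activations[-1])                            # y_hat = a^(L)
```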

Of course, to get $\frac{\partial J}{\partial W^{(l)}}$ we need to differentiate $J$ with respect to $W^{(l)}$. In the following equations, $L$ refers to the output layer, while $z^{(l)}$ refers to the value before the activation function and $a^{(l)} = \sigma(z^{(l)})$ is the value after.

For the output layer, we have

$$\delta^{(L)} = \left(a^{(L)} - y\right) \odot \sigma'\!\left(z^{(L)}\right),$$

which is the value of $\frac{\partial J}{\partial z^{(L)}}$. We assume $\sigma$ is sigmoid, so $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$.

For the hidden layers, we have

$$\delta^{(l)} = \left( \left(W^{(l+1)}\right)^\top \delta^{(l+1)} \right) \odot \sigma'\!\left(z^{(l)}\right),$$

and the gradients then follow layer by layer:

$$\frac{\partial J}{\partial W^{(l)}} = \delta^{(l)} \left(a^{(l-1)}\right)^\top, \qquad \frac{\partial J}{\partial b^{(l)}} = \delta^{(l)}.$$
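Continuing the forward-pass sketch above (it reuses `forward` and `params` from there), the backward pass implements exactly these recurrences; this is an illustrative sketch under the squared-error objective, not the author's implementation:

```python
def backprop(params, x, y):
    """Gradients of J = 0.5 * ||a^(L) - y||^2 for a single example."""
    activations = forward(params, x)
    grads = [None] * len(params)
    # Output layer: delta^(L) = (a^(L) - y) * sigma'(z^(L)),
    # using sigma'(z) = a (1 - a) for the sigmoid.
    a = activations[-1]
    delta = (a - y) * a * (1 - a)
    for l in reversed(range(len(params))):
        # dJ/dW^(l) = delta^(l) a^(l-1)^T,  dJ/db^(l) = delta^(l)
        grads[l] = (delta @ activations[l].T, delta)
        if l > 0:
            W, _ = params[l]
            a = activations[l]
            # Hidden layers: delta^(l) = (W^(l+1)^T delta^(l+1)) * sigma'(z^(l))
            delta = (W.T @ delta) * a * (1 - a)
    return grads

# One gradient-descent step with learning rate eta.
eta = 0.5
x, y = np.array([[0.5], [-1.0]]), np.array([[1.0]])
grads = backprop(params, x, y)
params = [(W - eta * dW, b - eta * db)
          for (W, b), (dW, db) in zip(params, grads)]
```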

Note that in the multi-class classification setting, we usually use the cross-entropy loss as the objective function instead,

$$J = -\sum_{i=1}^{N} \sum_{k} y_{ik} \log \hat{y}_{ik},$$

where $\hat{y}_i = \operatorname{softmax}\!\left(z_i^{(L)}\right)$ is the softmax output of the last layer,

$$\hat{y}_{ik} = \frac{\exp\!\left(z_{ik}^{(L)}\right)}{\sum_{j} \exp\!\left(z_{ij}^{(L)}\right)}.$$
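A short sketch of these two pieces (the max-subtraction in `softmax` is a standard numerical-stability trick, my addition rather than the original's):

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis; subtracting max(z) avoids overflow."""
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(y_hat, y):
    """J = -sum_k y_k log(y_hat_k) for a one-hot target y."""
    return -np.sum(y * np.log(y_hat + 1e-12))

z = np.array([2.0, 1.0, 0.1])   # last-layer pre-activation z^(L)
y = np.array([1.0, 0.0, 0.0])   # one-hot target
y_hat = softmax(z)
print(y_hat, cross_entropy(y_hat, y))
# With softmax + cross-entropy, the output-layer error conveniently
# simplifies to delta^(L) = y_hat - y.
```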