Backpropagation Derivation for 3-Layer MLP

ReLU Hidden Layer + Softmax Output + Cross-Entropy Loss

Network Architecture

Layers:

  • Input layer: $\mathbf{x} \in \mathbb{R}^d$ (d features)
  • Hidden layer: $\mathbf{h} \in \mathbb{R}^m$ (m hidden units with ReLU)
  • Output layer: $\hat{\mathbf{y}} \in \mathbb{R}^k$ (k classes with Softmax)

Parameters:

| Parameter | Dimension | Description |
|---|---|---|
| $W^{(1)}$ | $m \times d$ | Input-to-hidden weights |
| $\mathbf{b}^{(1)}$ | $m \times 1$ | Hidden layer biases |
| $W^{(2)}$ | $k \times m$ | Hidden-to-output weights |
| $\mathbf{b}^{(2)}$ | $k \times 1$ | Output layer biases |
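
As a concrete illustration of these shapes, here is a minimal NumPy sketch. The variable names (`W1`, `b1`, `W2`, `b2`) and the sizes `d, m, k = 4, 5, 3` are arbitrary choices for this example, using the column-vector convention $\mathbf{x} \in \mathbb{R}^d$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 4, 5, 3                          # illustrative sizes: input, hidden, output

# Parameter shapes match the table above (column-vector convention).
W1 = rng.normal(scale=0.1, size=(m, d))    # input-to-hidden weights
b1 = np.zeros(m)                           # hidden layer biases
W2 = rng.normal(scale=0.1, size=(k, m))    # hidden-to-output weights
b2 = np.zeros(k)                           # output layer biases
```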

Forward Pass

Hidden Layer (ReLU Activation)

Pre-activation:

$$\mathbf{z}^{(1)} = W^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$$

Activation (ReLU):

$$\mathbf{h} = \mathrm{ReLU}(\mathbf{z}^{(1)})$$

Element-wise: $h_i = \max(0, z_i^{(1)})$ for $i = 1, \dots, m$

ReLU Function:

$$\mathrm{ReLU}(z) = \max(0, z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$$

Output Layer (Softmax Activation)

Pre-activation:

$$\mathbf{z}^{(2)} = W^{(2)} \mathbf{h} + \mathbf{b}^{(2)}$$

Activation (Softmax):

$$\hat{y}_j = \frac{e^{z_j^{(2)}}}{\sum_{i=1}^{k} e^{z_i^{(2)}}}, \quad j = 1, \dots, k$$

Properties:

  • $0 < \hat{y}_j < 1$ for all $j$, and $\sum_{j=1}^{k} \hat{y}_j = 1$
  • Interpretable as class probabilities
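
A small NumPy sketch of the softmax (the subtraction of the maximum is a standard numerical-stability trick, not part of the derivation above; it leaves the output unchanged):

```python
import numpy as np

def softmax(z2):
    """Softmax over the output pre-activations z2 (a length-k vector)."""
    z_shift = z2 - np.max(z2)   # stability shift; cancels in the ratio
    exp_z = np.exp(z_shift)
    return exp_z / exp_z.sum()  # positive entries that sum to 1
```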

Loss Function (Cross-Entropy)

Given true label $\mathbf{t} \in \{0, 1\}^k$ (one-hot encoded vector):

$$L = -\sum_{j=1}^{k} t_j \log \hat{y}_j$$

For a single sample with true class $c$ (i.e., $t_c = 1$):

$$L = -\log \hat{y}_c$$
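
A minimal NumPy sketch of this loss (the small `eps` guard against `log(0)` is an implementation detail, not part of the formula):

```python
import numpy as np

def cross_entropy(y_hat, t):
    """L = -sum_j t_j * log(y_hat_j); with one-hot t this equals -log(y_hat_c)."""
    eps = 1e-12                         # avoid log(0) for numerically zero probabilities
    return -np.sum(t * np.log(y_hat + eps))
```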


Backward Pass (Backpropagation)

Goal: Compute $\dfrac{\partial L}{\partial W^{(2)}}$, $\dfrac{\partial L}{\partial \mathbf{b}^{(2)}}$, $\dfrac{\partial L}{\partial W^{(1)}}$, $\dfrac{\partial L}{\partial \mathbf{b}^{(1)}}$

Softmax + Cross-Entropy Gradient

Theorem: For softmax output with cross-entropy loss:

$$\frac{\partial L}{\partial z_j^{(2)}} = \hat{y}_j - t_j, \qquad \text{i.e.} \qquad \frac{\partial L}{\partial \mathbf{z}^{(2)}} = \hat{\mathbf{y}} - \mathbf{t}$$

Proof

We need to compute $\frac{\partial L}{\partial z_j^{(2)}}$ for each output $j$.

Gradient of the loss w.r.t. the softmax output:

$$\frac{\partial L}{\partial \hat{y}_i} = -\frac{t_i}{\hat{y}_i}$$

Gradient of the softmax w.r.t. the pre-activation: the Jacobian of the softmax has two cases:

$$\frac{\partial \hat{y}_i}{\partial z_j^{(2)}} = \begin{cases} \hat{y}_i (1 - \hat{y}_i) & \text{if } i = j \\ -\hat{y}_i \hat{y}_j & \text{if } i \neq j \end{cases}$$

Applying the chain rule:

$$\frac{\partial L}{\partial z_j^{(2)}} = \sum_{i=1}^{k} \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z_j^{(2)}}$$

For component $j$:

\begin{align}
\frac{\partial L}{\partial z_j^{(2)}} &= \left(-\frac{t_j}{\hat{y}_j}\right) \cdot \hat{y}_j(1 - \hat{y}_j) + \sum_{i \neq j} \left(-\frac{t_i}{\hat{y}_i}\right) \cdot (-\hat{y}_i \hat{y}_j) \\
&= -t_j(1 - \hat{y}_j) + \sum_{i \neq j} t_i \hat{y}_j \\
&= -t_j + t_j\hat{y}_j + \sum_{i \neq j} t_i \hat{y}_j \\
&= -t_j + \hat{y}_j \sum_{i=1}^k t_i \\
&= -t_j + \hat{y}_j \quad \text{(since } \textstyle\sum_i t_i = 1\text{)} \\
&= \hat{y}_j - t_j
\end{align}

Therefore:

$$\frac{\partial L}{\partial \mathbf{z}^{(2)}} = \hat{\mathbf{y}} - \mathbf{t}$$
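
This result is easy to sanity-check numerically. The sketch below (illustrative only; the helper names and sizes are not from the text) compares the analytic gradient $\hat{\mathbf{y}} - \mathbf{t}$ against central finite differences of the loss:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, t):
    return -np.sum(t * np.log(softmax(z) + 1e-12))

rng = np.random.default_rng(0)
k = 5
z = rng.normal(size=k)
t = np.zeros(k)
t[2] = 1.0                                  # one-hot target, true class c = 2

analytic = softmax(z) - t                   # the theorem: dL/dz = y_hat - t

eps = 1e-6                                  # central finite differences
numeric = np.array([
    (loss(z + eps * np.eye(k)[j], t) - loss(z - eps * np.eye(k)[j], t)) / (2 * eps)
    for j in range(k)
])
print(np.max(np.abs(analytic - numeric)))   # agreement to roughly 1e-9
```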

Output Layer Gradients

Define the error signal:

$$\boldsymbol{\delta}^{(2)} = \frac{\partial L}{\partial \mathbf{z}^{(2)}} = \hat{\mathbf{y}} - \mathbf{t} \in \mathbb{R}^k$$

Gradient w.r.t. $W^{(2)}$:

$$\frac{\partial L}{\partial W^{(2)}} = \boldsymbol{\delta}^{(2)} \mathbf{h}^\top$$

Element-wise: $\dfrac{\partial L}{\partial W_{ji}^{(2)}} = \delta_j^{(2)} h_i$

Gradient w.r.t. $\mathbf{b}^{(2)}$:

$$\frac{\partial L}{\partial \mathbf{b}^{(2)}} = \boldsymbol{\delta}^{(2)}$$

Element-wise: $\dfrac{\partial L}{\partial b_j^{(2)}} = \delta_j^{(2)}$
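
A minimal NumPy sketch of these two gradients (the helper name `output_layer_grads` is illustrative, not from the text):

```python
import numpy as np

def output_layer_grads(delta2, h):
    """Gradients of L w.r.t. W2 and b2, given delta2 = y_hat - t and hidden activation h."""
    dW2 = np.outer(delta2, h)   # shape (k, m): dL/dW2[j, i] = delta2[j] * h[i]
    db2 = delta2                # shape (k,):   dL/db2[j]    = delta2[j]
    return dW2, db2
```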

Hidden Layer Gradients

Backpropagate the error to the hidden layer:

$$\frac{\partial L}{\partial \mathbf{h}} = (W^{(2)})^\top \boldsymbol{\delta}^{(2)}$$

Applying the ReLU derivative, the error at the pre-activation of the hidden layer is:

$$\boldsymbol{\delta}^{(1)} = \frac{\partial L}{\partial \mathbf{z}^{(1)}} = \left( (W^{(2)})^\top \boldsymbol{\delta}^{(2)} \right) \odot \mathbb{1}[\mathbf{z}^{(1)} > 0]$$

where $\odot$ denotes element-wise multiplication (Hadamard product) and $\mathbb{1}[\mathbf{z}^{(1)} > 0]$ is the ReLU derivative applied element-wise.
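
A minimal NumPy sketch of this backpropagation step (the helper name `hidden_error` is illustrative; the boolean mask `(z1 > 0)` implements the ReLU-derivative indicator):

```python
import numpy as np

def hidden_error(delta2, W2, z1):
    """delta1 = (W2^T delta2) ⊙ 1[z1 > 0]: backprop through W2, then gate by the ReLU derivative."""
    return (W2.T @ delta2) * (z1 > 0).astype(z1.dtype)
```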

Gradient w.r.t. $W^{(1)}$:

$$\frac{\partial L}{\partial W^{(1)}} = \boldsymbol{\delta}^{(1)} \mathbf{x}^\top$$

Element-wise: $\dfrac{\partial L}{\partial W_{ij}^{(1)}} = \delta_i^{(1)} x_j$

Gradient w.r.t. $\mathbf{b}^{(1)}$:

$$\frac{\partial L}{\partial \mathbf{b}^{(1)}} = \boldsymbol{\delta}^{(1)}$$
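
Putting the whole backward pass together, the following sketch (illustrative names and sizes, not a reference implementation) computes all four gradients exactly as derived above and checks $\partial L / \partial W^{(1)}$ against finite differences:

```python
import numpy as np

def forward(x, params):
    """Full forward pass; returns the intermediates needed for backprop."""
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    h = np.maximum(0.0, z1)                 # ReLU
    z2 = W2 @ h + b2
    e = np.exp(z2 - z2.max())               # numerically stable softmax
    y_hat = e / e.sum()
    return z1, h, y_hat

def backward(x, t, params):
    """All four gradients from the derivation above."""
    W1, b1, W2, b2 = params
    z1, h, y_hat = forward(x, params)
    delta2 = y_hat - t                      # output-layer error
    dW2 = np.outer(delta2, h)
    db2 = delta2
    delta1 = (W2.T @ delta2) * (z1 > 0)     # hidden-layer error (ReLU gate)
    dW1 = np.outer(delta1, x)
    db1 = delta1
    return dW1, db1, dW2, db2

def loss(x, t, params):
    _, _, y_hat = forward(x, params)
    return -np.sum(t * np.log(y_hat + 1e-12))

# ---- finite-difference check on W1 (illustrative sizes) ----
rng = np.random.default_rng(0)
d, m, k = 4, 5, 3
x = rng.normal(size=d)
t = np.zeros(k); t[1] = 1.0
params = [rng.normal(scale=0.5, size=(m, d)), rng.normal(size=m),
          rng.normal(scale=0.5, size=(k, m)), rng.normal(size=k)]

dW1, db1, dW2, db2 = backward(x, t, params)

eps = 1e-6
num_dW1 = np.zeros_like(dW1)
for i in range(m):
    for j in range(d):
        p_plus = [p.copy() for p in params];  p_plus[0][i, j] += eps
        p_minus = [p.copy() for p in params]; p_minus[0][i, j] -= eps
        num_dW1[i, j] = (loss(x, t, p_plus) - loss(x, t, p_minus)) / (2 * eps)

print(np.max(np.abs(dW1 - num_dW1)))        # small (on the order of 1e-9)
```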