Backpropagation Derivation for 3-Layer MLP
ReLU Hidden Layer + Softmax Output + Cross-Entropy Loss
Network Architecture
Layers:
- Input layer: $\mathbf{x} \in \mathbb{R}^d$ ($d$ features)
- Hidden layer: $\mathbf{h} \in \mathbb{R}^m$ ($m$ hidden units with ReLU)
- Output layer: $\hat{\mathbf{y}} \in \mathbb{R}^k$ ($k$ classes with Softmax)
Parameters:
| Parameter | Dimension | Description |
|---|---|---|
| $W^{(1)}$ | $m \times d$ | Input-to-hidden weights |
| $\mathbf{b}^{(1)}$ | $m \times 1$ | Hidden layer biases |
| $W^{(2)}$ | $k \times m$ | Hidden-to-output weights |
| $\mathbf{b}^{(2)}$ | $k \times 1$ | Output layer biases |
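As a quick shape check, here is a minimal NumPy sketch of these parameters. The concrete dimensions ($d=4$, $m=5$, $k=3$), the variable names `W1, b1, W2, b2`, and the small-Gaussian initialization are illustrative assumptions, not part of the derivation.

```python
import numpy as np

# Illustrative dimensions (assumed, not from the text): d=4 inputs, m=5 hidden units, k=3 classes.
d, m, k = 4, 5, 3
rng = np.random.default_rng(0)

W1 = rng.normal(0, 0.1, size=(m, d))   # input-to-hidden weights, m x d
b1 = np.zeros(m)                       # hidden layer biases, length m
W2 = rng.normal(0, 0.1, size=(k, m))   # hidden-to-output weights, k x m
b2 = np.zeros(k)                       # output layer biases, length k

print(W1.shape, b1.shape, W2.shape, b2.shape)   # (5, 4) (5,) (3, 5) (3,)
```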
Forward Pass
Hidden Layer (ReLU Activation)
Pre-activation: $\mathbf{z}^{(1)} = W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}$
Activation (ReLU): $\mathbf{h} = \mathrm{ReLU}(\mathbf{z}^{(1)})$
Element-wise: $h_i = \max(0, z_i^{(1)})$ for $i = 1, \dots, m$
ReLU Function: $\mathrm{ReLU}(z) = \max(0, z)$
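A minimal NumPy sketch of this hidden-layer forward pass; the helper names `relu` and `hidden_forward` and the random test values are assumptions for illustration.

```python
import numpy as np

def relu(z):
    # Element-wise max(0, z).
    return np.maximum(0.0, z)

def hidden_forward(x, W1, b1):
    # Hidden-layer forward pass for a single sample x.
    z1 = W1 @ x + b1   # pre-activation, shape (m,)
    h = relu(z1)       # ReLU activation, shape (m,)
    return z1, h

rng = np.random.default_rng(0)
z1, h = hidden_forward(rng.normal(size=4), rng.normal(size=(5, 4)), np.zeros(5))
print(z1.shape, h.shape, (h >= 0).all())   # (5,) (5,) True
```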
Output Layer (Softmax Activation)
Pre-activation: $\mathbf{z}^{(2)} = W^{(2)}\mathbf{h} + \mathbf{b}^{(2)}$
Activation (Softmax): $\hat{y}_j = \dfrac{e^{z_j^{(2)}}}{\sum_{i=1}^{k} e^{z_i^{(2)}}}$ for $j = 1, \dots, k$
Properties:
- $0 < \hat{y}_j < 1$ for all $j$, and $\sum_{j=1}^{k} \hat{y}_j = 1$
- Interpretable as class probabilities
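The softmax above can be computed as in the sketch below. Subtracting the maximum before exponentiating is a standard numerical-stability trick not discussed in the derivation; the function name `softmax` and the sample input are illustrative.

```python
import numpy as np

def softmax(z2):
    # Softmax is invariant to adding a constant to all inputs,
    # so subtracting the max avoids overflow without changing the result.
    e = np.exp(z2 - np.max(z2))
    return e / e.sum()

# The outputs are positive and sum to 1.
y_hat = softmax(np.array([2.0, 1.0, 0.1]))
print(y_hat, y_hat.sum())   # approx [0.659 0.242 0.099] 1.0
```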
Loss Function (Cross-Entropy)
Given true label $\mathbf{t} \in \{0, 1\}^k$ (one-hot encoded vector): $L = -\sum_{i=1}^{k} t_i \log \hat{y}_i$
For a single sample with true class $c$ (i.e., $t_c = 1$): $L = -\log \hat{y}_c$
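A small sketch of this loss, assuming a one-hot NumPy vector `t`; the sample probabilities are made up for illustration.

```python
import numpy as np

def cross_entropy(y_hat, t):
    # L = -sum_i t_i * log(y_hat_i); for one-hot t this reduces to -log(y_hat_c).
    return -np.sum(t * np.log(y_hat))

y_hat = np.array([0.7, 0.2, 0.1])
t = np.array([1.0, 0.0, 0.0])        # true class c = 0
print(cross_entropy(y_hat, t))       # -log(0.7) ≈ 0.357
```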
Backward Pass (Backpropagation)
Goal: Compute $\dfrac{\partial L}{\partial W^{(2)}}$, $\dfrac{\partial L}{\partial \mathbf{b}^{(2)}}$, $\dfrac{\partial L}{\partial W^{(1)}}$, $\dfrac{\partial L}{\partial \mathbf{b}^{(1)}}$
Softmax + Cross-Entropy Gradient
Theorem: For softmax output with cross-entropy loss:
$$\frac{\partial L}{\partial z_j^{(2)}} = \hat{y}_j - t_j$$
Proof
We need to compute $\frac{\partial L}{\partial z_j^{(2)}}$ for each output $j$.
Gradient of loss w.r.t. softmax output:
$$\frac{\partial L}{\partial \hat{y}_i} = -\frac{t_i}{\hat{y}_i}$$
Gradient of softmax w.r.t. pre-activation: the Jacobian of softmax has two cases:
$$\frac{\partial \hat{y}_i}{\partial z_j^{(2)}} = \begin{cases} \hat{y}_i(1 - \hat{y}_i) & \text{if } i = j \\ -\hat{y}_i \hat{y}_j & \text{if } i \neq j \end{cases}$$
Applying the chain rule:
$$\frac{\partial L}{\partial z_j^{(2)}} = \sum_{i=1}^{k} \frac{\partial L}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial z_j^{(2)}}$$
For component $j$:
\begin{align}
\frac{\partial L}{\partial z_j^{(2)}} &= \left(-\frac{t_j}{\hat{y}_j}\right) \cdot \hat{y}_j(1 - \hat{y}_j) + \sum_{i \neq j} \left(-\frac{t_i}{\hat{y}_i}\right) \cdot (-\hat{y}_i \hat{y}_j) \\
&= -t_j(1 - \hat{y}_j) + \sum_{i \neq j} t_i \hat{y}_j \\
&= -t_j + t_j\hat{y}_j + \sum_{i \neq j} t_i \hat{y}_j \\
&= -t_j + \hat{y}_j \sum_{i=1}^k t_i \\
&= -t_j + \hat{y}_j \quad \text{(since } \sum_i t_i = 1\text{)} \\
&= \hat{y}_j - t_j
\end{align}
Therefore:
$$\frac{\partial L}{\partial z_j^{(2)}} = \hat{y}_j - t_j$$
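The theorem can be spot-checked numerically with central finite differences. The random test point, the step size, and the helper names below are assumptions for illustration, not part of the proof.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z2, t):
    # Cross-entropy of softmax(z2) against one-hot target t.
    return -np.sum(t * np.log(softmax(z2)))

rng = np.random.default_rng(0)
z2 = rng.normal(size=5)
t = np.zeros(5); t[2] = 1.0                 # one-hot target

analytic = softmax(z2) - t                  # the theorem: dL/dz2 = y_hat - t
numeric = np.zeros_like(z2)
eps = 1e-6
for j in range(len(z2)):
    zp, zm = z2.copy(), z2.copy()
    zp[j] += eps; zm[j] -= eps
    numeric[j] = (loss(zp, t) - loss(zm, t)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # tiny (~1e-10): the two agree
```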
Output Layer Gradients
Define the error signal: $\boldsymbol{\delta}^{(2)} = \hat{\mathbf{y}} - \mathbf{t} \in \mathbb{R}^k$
Gradient w.r.t. $W^{(2)}$: $\dfrac{\partial L}{\partial W^{(2)}} = \boldsymbol{\delta}^{(2)} \mathbf{h}^\top$. Element-wise: $\dfrac{\partial L}{\partial W_{ji}^{(2)}} = \delta_j^{(2)} h_i$
Gradient w.r.t. $\mathbf{b}^{(2)}$: $\dfrac{\partial L}{\partial \mathbf{b}^{(2)}} = \boldsymbol{\delta}^{(2)}$. Element-wise: $\dfrac{\partial L}{\partial b_j^{(2)}} = \delta_j^{(2)}$
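In code, these two gradients reduce to an outer product and a copy of the error signal. The function name `output_layer_grads` and the sample vectors are illustrative assumptions.

```python
import numpy as np

def output_layer_grads(y_hat, t, h):
    # delta2 = y_hat - t has shape (k,); h has shape (m,).
    delta2 = y_hat - t
    dW2 = np.outer(delta2, h)    # shape (k, m): dL/dW2[j, i] = delta2[j] * h[i]
    db2 = delta2                 # shape (k,)
    return delta2, dW2, db2

y_hat = np.array([0.7, 0.2, 0.1])
t = np.array([1.0, 0.0, 0.0])
h = np.array([0.5, 0.0, 1.2, 0.3, 0.9])
delta2, dW2, db2 = output_layer_grads(y_hat, t, h)
print(dW2.shape, db2.shape)      # (3, 5) (3,)
```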
Hidden Layer Gradients
Backpropagate the error to the hidden layer:
$$\frac{\partial L}{\partial \mathbf{h}} = (W^{(2)})^\top \boldsymbol{\delta}^{(2)}$$
Applying the ReLU derivative, the error at the pre-activation of the hidden layer is:
$$\boldsymbol{\delta}^{(1)} = \left((W^{(2)})^\top \boldsymbol{\delta}^{(2)}\right) \odot \mathbb{1}[\mathbf{z}^{(1)} > 0]$$
where $\odot$ denotes element-wise multiplication (Hadamard product) and $\mathbb{1}[\cdot]$ is the element-wise indicator.
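A sketch of this step, assuming `W2`, `delta2`, and `z1` are NumPy arrays from the forward and output-layer computations; `hidden_error` is an illustrative helper name.

```python
import numpy as np

def hidden_error(W2, delta2, z1):
    # delta1 = (W2^T delta2) ⊙ 1[z1 > 0]
    dL_dh = W2.T @ delta2                    # error at the hidden activations, shape (m,)
    relu_mask = (z1 > 0).astype(z1.dtype)    # ReLU derivative, 1 where z1 > 0 else 0
    return dL_dh * relu_mask                 # Hadamard product

rng = np.random.default_rng(0)
delta1 = hidden_error(rng.normal(size=(3, 5)), np.array([0.3, -0.1, -0.2]), rng.normal(size=5))
print(delta1.shape)   # (5,)
```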
Gradient w.r.t. $W^{(1)}$: $\dfrac{\partial L}{\partial W^{(1)}} = \boldsymbol{\delta}^{(1)} \mathbf{x}^\top$
Element-wise: $\dfrac{\partial L}{\partial W_{ij}^{(1)}} = \delta_i^{(1)} x_j$
Gradient w.r.t. $\mathbf{b}^{(1)}$: $\dfrac{\partial L}{\partial \mathbf{b}^{(1)}} = \boldsymbol{\delta}^{(1)}$
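Putting all four gradients together, here is a minimal end-to-end backward pass with a finite-difference check on $W^{(1)}$. The dimensions, random initialization, and helper names (`forward`, `backward`, `loss`) are illustrative assumptions, not prescribed by the derivation.

```python
import numpy as np

def forward(x, params):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1                 # hidden pre-activation
    h = np.maximum(0.0, z1)          # ReLU
    z2 = W2 @ h + b2                 # output pre-activation
    e = np.exp(z2 - z2.max())
    y_hat = e / e.sum()              # softmax
    return z1, h, y_hat

def backward(x, t, params):
    W1, b1, W2, b2 = params
    z1, h, y_hat = forward(x, params)
    delta2 = y_hat - t                        # output-layer error
    dW2 = np.outer(delta2, h)
    db2 = delta2
    delta1 = (W2.T @ delta2) * (z1 > 0)       # ReLU-gated hidden error
    dW1 = np.outer(delta1, x)
    db1 = delta1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
d, m, k = 4, 5, 3
params = [rng.normal(0, 0.5, (m, d)), rng.normal(0, 0.5, m),
          rng.normal(0, 0.5, (k, m)), rng.normal(0, 0.5, k)]
x = rng.normal(size=d)
t = np.zeros(k); t[1] = 1.0

def loss(params):
    _, _, y_hat = forward(x, params)
    return -np.sum(t * np.log(y_hat))

dW1, db1, dW2, db2 = backward(x, t, params)

# Central finite differences on W1, entry by entry.
eps = 1e-6
num_dW1 = np.zeros_like(params[0])
for i in range(m):
    for j in range(d):
        p_plus = [p.copy() for p in params];  p_plus[0][i, j] += eps
        p_minus = [p.copy() for p in params]; p_minus[0][i, j] -= eps
        num_dW1[i, j] = (loss(p_plus) - loss(p_minus)) / (2 * eps)

print(np.max(np.abs(dW1 - num_dW1)))   # small (~1e-9): analytic and numeric gradients agree
```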