Backpropagation Derivation for 3-Layer MLP
ReLU Hidden Layer + Softmax Output + Cross-Entropy Loss
Network Architecture
Layers:
- Input layer: $\mathbf{x} \in \mathbb{R}^d$ ($d$ features)
- Hidden layer: $\mathbf{h} \in \mathbb{R}^m$ ($m$ hidden units with ReLU)
- Output layer: $\hat{\mathbf{y}} \in \mathbb{R}^k$ ($k$ classes with Softmax)
**Parameters:**
| Parameter | Dimension | Description |
|---|---|---|
| $W^{(1)}$ | $m \times d$ | Input-to-hidden weights |
| $\mathbf{b}^{(1)}$ | $m$ | Hidden layer biases |
| $W^{(2)}$ | $k \times m$ | Hidden-to-output weights |
| $\mathbf{b}^{(2)}$ | $k$ | Output layer biases |
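As a shape-level illustration of these parameters, here is a minimal NumPy sketch; the variable names `W1`, `b1`, `W2`, `b2` and the small sizes are hypothetical choices for this example, not fixed by the derivation:

```python
import numpy as np

# Hypothetical sizes: d input features, m hidden units, k classes.
d, m, k = 4, 5, 3

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(m, d))  # input-to-hidden weights W^(1), shape (m, d)
b1 = np.zeros(m)                         # hidden biases b^(1), shape (m,)
W2 = rng.normal(scale=0.1, size=(k, m))  # hidden-to-output weights W^(2), shape (k, m)
b2 = np.zeros(k)                         # output biases b^(2), shape (k,)
```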
Forward Pass
Hidden Layer (ReLU Activation)
Pre-activation:

$$\mathbf{z}^{(1)} = W^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$$

Activation (ReLU):

$$\mathbf{h} = \mathrm{ReLU}(\mathbf{z}^{(1)})$$

Element-wise:

$$h_j = \max(0, z^{(1)}_j), \quad j = 1, \dots, m$$

ReLU Function:

$$\mathrm{ReLU}(z) = \max(0, z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$$
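As a small sketch, the hidden-layer forward pass can be written in NumPy as follows (the function name `hidden_forward` is a hypothetical choice for this illustration):

```python
import numpy as np

def hidden_forward(W1, b1, x):
    """Hidden layer: z1 = W^(1) x + b^(1), then h = ReLU(z1)."""
    z1 = W1 @ x + b1              # pre-activation, shape (m,)
    h = np.maximum(0.0, z1)       # element-wise ReLU: h_j = max(0, z1_j)
    return z1, h
```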
Output Layer (Softmax Activation)
Pre-activation:

$$\mathbf{z}^{(2)} = W^{(2)} \mathbf{h} + \mathbf{b}^{(2)}$$

Activation (Softmax):

$$\hat{y}_i = \mathrm{softmax}(\mathbf{z}^{(2)})_i = \frac{e^{z^{(2)}_i}}{\sum_{j=1}^{k} e^{z^{(2)}_j}}$$

Properties of the softmax output:
- $\hat{y}_i > 0$ for all $i$
- $\sum_{i=1}^{k} \hat{y}_i = 1$
- Interpretable as class probabilities
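A corresponding NumPy sketch of the output layer (the function name `output_forward` is hypothetical); subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the softmax value:

```python
import numpy as np

def output_forward(W2, b2, h):
    """Output layer: z2 = W^(2) h + b^(2), then y_hat = softmax(z2)."""
    z2 = W2 @ h + b2
    shifted = z2 - np.max(z2)     # stability shift; softmax is invariant to it
    exp_z = np.exp(shifted)
    y_hat = exp_z / exp_z.sum()   # positive entries that sum to 1
    return z2, y_hat
```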
Loss Function (Cross-Entropy)
Given the true one-hot label $\mathbf{y} \in \{0, 1\}^k$, the cross-entropy loss is:

$$L = -\sum_{i=1}^{k} y_i \log \hat{y}_i$$

For a single sample with true class $c$ (i.e., $y_c = 1$ and $y_i = 0$ for $i \neq c$), this reduces to:

$$L = -\log \hat{y}_c$$
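A minimal sketch of the loss, assuming a one-hot label vector `y`; the numeric values below are made up for illustration:

```python
import numpy as np

def cross_entropy(y_hat, y):
    """L = -sum_i y_i * log(y_hat_i); for one-hot y this is -log(y_hat[c])."""
    return -np.sum(y * np.log(y_hat))

y_hat = np.array([0.2, 0.7, 0.1])   # example softmax output
y = np.array([0.0, 1.0, 0.0])       # one-hot label, true class c = 1
print(cross_entropy(y_hat, y))      # equals -log(0.7), about 0.357
```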
Backward Pass (Backpropagation)
Goal: Compute the gradients $\dfrac{\partial L}{\partial W^{(2)}}$, $\dfrac{\partial L}{\partial \mathbf{b}^{(2)}}$, $\dfrac{\partial L}{\partial W^{(1)}}$, and $\dfrac{\partial L}{\partial \mathbf{b}^{(1)}}$.
Softmax + Cross-Entropy Gradient
Theorem: For softmax output with cross-entropy loss:

$$\frac{\partial L}{\partial \mathbf{z}^{(2)}} = \hat{\mathbf{y}} - \mathbf{y}$$
Proof
We need to compute $\frac{\partial L}{\partial z^{(2)}_j}$ for each component $j$. First, the gradient of the loss w.r.t. the softmax output:

$$\frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i}$$

Gradient of softmax w.r.t. pre-activation: the Jacobian of the softmax has two cases:

$$\frac{\partial \hat{y}_i}{\partial z^{(2)}_j} = \begin{cases} \hat{y}_i (1 - \hat{y}_i) & \text{if } i = j \\ -\hat{y}_i \hat{y}_j & \text{if } i \neq j \end{cases} \;=\; \hat{y}_i (\delta_{ij} - \hat{y}_j)$$

Applying the chain rule for component $j$:

$$\frac{\partial L}{\partial z^{(2)}_j} = \sum_{i=1}^{k} \frac{\partial L}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial z^{(2)}_j} = \sum_{i=1}^{k} \left(-\frac{y_i}{\hat{y}_i}\right) \hat{y}_i (\delta_{ij} - \hat{y}_j) = -\sum_{i=1}^{k} y_i (\delta_{ij} - \hat{y}_j) = -y_j + \hat{y}_j \sum_{i=1}^{k} y_i = \hat{y}_j - y_j$$

using $\sum_i y_i = 1$ for a one-hot label. Therefore:

$$\frac{\partial L}{\partial \mathbf{z}^{(2)}} = \hat{\mathbf{y}} - \mathbf{y} \qquad \blacksquare$$
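The theorem is easy to check numerically: a central finite-difference approximation of $\partial L / \partial \mathbf{z}^{(2)}$ should match $\hat{\mathbf{y}} - \mathbf{y}$. A sketch, with made-up values for $\mathbf{z}^{(2)}$ and $\mathbf{y}$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z2, y):
    return -np.sum(y * np.log(softmax(z2)))

z2 = np.array([1.0, -0.5, 2.0])     # hypothetical pre-activations
y = np.array([0.0, 0.0, 1.0])       # one-hot label, true class c = 2

analytic = softmax(z2) - y          # the theorem: dL/dz2 = y_hat - y

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(z2)
for j in range(len(z2)):
    step = np.zeros_like(z2)
    step[j] = eps
    numeric[j] = (loss(z2 + step, y) - loss(z2 - step, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # tiny (~1e-10): the two gradients agree
```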
Output Layer Gradients
Define the error signal:

$$\boldsymbol{\delta}^{(2)} := \frac{\partial L}{\partial \mathbf{z}^{(2)}} = \hat{\mathbf{y}} - \mathbf{y}$$

Gradient w.r.t. $W^{(2)}$ (using $\mathbf{z}^{(2)} = W^{(2)} \mathbf{h} + \mathbf{b}^{(2)}$):

$$\frac{\partial L}{\partial W^{(2)}} = \boldsymbol{\delta}^{(2)} \mathbf{h}^\top$$

Gradient w.r.t. $\mathbf{b}^{(2)}$:

$$\frac{\partial L}{\partial \mathbf{b}^{(2)}} = \boldsymbol{\delta}^{(2)}$$
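In code, both output-layer gradients follow directly from the error signal; a sketch (the name `output_layer_grads` is hypothetical):

```python
import numpy as np

def output_layer_grads(y_hat, y, h):
    """dL/dW2 = delta2 h^T (outer product), dL/db2 = delta2, with delta2 = y_hat - y."""
    delta2 = y_hat - y            # error signal, shape (k,)
    dW2 = np.outer(delta2, h)     # shape (k, m), matches W^(2)
    db2 = delta2                  # shape (k,)
    return dW2, db2
```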
Hidden Layer Gradients
Backpropagate the error to the hidden layer:

$$\frac{\partial L}{\partial \mathbf{h}} = (W^{(2)})^\top \boldsymbol{\delta}^{(2)}$$

Applying the ReLU derivative:

$$\boldsymbol{\delta}^{(1)} := \frac{\partial L}{\partial \mathbf{z}^{(1)}} = \left( (W^{(2)})^\top \boldsymbol{\delta}^{(2)} \right) \odot \mathrm{ReLU}'(\mathbf{z}^{(1)})$$

where $\odot$ is the element-wise (Hadamard) product and

$$\mathrm{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$$

Gradient w.r.t. $W^{(1)}$:

$$\frac{\partial L}{\partial W^{(1)}} = \boldsymbol{\delta}^{(1)} \mathbf{x}^\top$$

Element-wise:

$$\frac{\partial L}{\partial W^{(1)}_{jl}} = \delta^{(1)}_j \, x_l$$

Gradient w.r.t. $\mathbf{b}^{(1)}$:

$$\frac{\partial L}{\partial \mathbf{b}^{(1)}} = \boldsymbol{\delta}^{(1)}$$
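Putting the whole derivation together, here is a self-contained NumPy sketch of the forward and backward passes with a finite-difference check on $\partial L / \partial W^{(1)}$; all names and the tiny dimensions are hypothetical, and the check should agree to roughly $10^{-8}$ or better:

```python
import numpy as np

def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    h = np.maximum(0.0, z1)                  # ReLU
    z2 = W2 @ h + b2
    e = np.exp(z2 - np.max(z2))
    y_hat = e / e.sum()                      # softmax
    return z1, h, z2, y_hat

def backward(params, x, y):
    """Backprop as derived: delta2 = y_hat - y, delta1 = (W2^T delta2) elementwise ReLU'(z1)."""
    W1, b1, W2, b2 = params
    z1, h, z2, y_hat = forward(params, x)
    delta2 = y_hat - y                       # dL/dz2
    dW2 = np.outer(delta2, h)                # dL/dW2 = delta2 h^T
    db2 = delta2                             # dL/db2
    delta1 = (W2.T @ delta2) * (z1 > 0)      # dL/dz1, with ReLU'(z1) as a 0/1 mask
    dW1 = np.outer(delta1, x)                # dL/dW1 = delta1 x^T
    db1 = delta1                             # dL/db1
    return dW1, db1, dW2, db2

def loss_fn(params, x, y):
    y_hat = forward(params, x)[3]
    return -np.sum(y * np.log(y_hat))

# Hypothetical tiny network and data for the check.
rng = np.random.default_rng(0)
d, m, k = 3, 4, 2
params = [rng.normal(size=(m, d)), rng.normal(size=m),
          rng.normal(size=(k, m)), rng.normal(size=k)]
x = rng.normal(size=d)
y = np.array([1.0, 0.0])

dW1 = backward(params, x, y)[0]

# Central finite differences over every entry of W1.
eps = 1e-6
numeric = np.zeros_like(dW1)
for j in range(m):
    for l in range(d):
        plus = [p.copy() for p in params]
        minus = [p.copy() for p in params]
        plus[0][j, l] += eps
        minus[0][j, l] -= eps
        numeric[j, l] = (loss_fn(plus, x, y) - loss_fn(minus, x, y)) / (2 * eps)

print(np.max(np.abs(dW1 - numeric)))   # small (~1e-9): analytic and numeric gradients agree
```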