Backpropagation Derivation for 3-Layer MLP

ReLU Hidden Layer + Softmax Output + Cross-Entropy Loss

Network Architecture

Layers:

  • Input layer: $\mathbf{x} \in \mathbb{R}^d$ (d features)
  • Hidden layer: $\mathbf{h} \in \mathbb{R}^m$ (m hidden units with ReLU)
  • Output layer: $\hat{\mathbf{y}} \in \mathbb{R}^k$ (k classes with Softmax)

Parameters:

| Parameter | Dimension | Description |
|---|---|---|
| $W^{(1)}$ | $m \times d$ | Input-to-hidden weights |
| $\mathbf{b}^{(1)}$ | $m \times 1$ | Hidden layer biases |
| $W^{(2)}$ | $k \times m$ | Hidden-to-output weights |
| $\mathbf{b}^{(2)}$ | $k \times 1$ | Output layer biases |
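
As a concrete illustration of these shapes, here is a minimal NumPy sketch. The variable names (`W1`, `b1`, `W2`, `b2`) and the sizes `d, m, k = 4, 5, 3` are arbitrary choices for this example, using the column-vector convention $\mathbf{x} \in \mathbb{R}^d$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 4, 5, 3                          # illustrative sizes: input, hidden, output

# Parameter shapes match the table above (column-vector convention).
W1 = rng.normal(scale=0.1, size=(m, d))    # input-to-hidden weights
b1 = np.zeros(m)                           # hidden layer biases
W2 = rng.normal(scale=0.1, size=(k, m))    # hidden-to-output weights
b2 = np.zeros(k)                           # output layer biases
```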

Forward Pass

Hidden Layer (ReLU Activation)

Pre-activation:

$$\mathbf{z}^{(1)} = W^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$$

Activation (ReLU):

$$\mathbf{h} = \mathrm{ReLU}(\mathbf{z}^{(1)})$$

Element-wise: $h_i = \max(0, z_i^{(1)})$ for $i = 1, \dots, m$

ReLU Function:

$$\mathrm{ReLU}(z) = \max(0, z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$$

Output Layer (Softmax Activation)

Pre-activation:

$$\mathbf{z}^{(2)} = W^{(2)} \mathbf{h} + \mathbf{b}^{(2)}$$

Activation (Softmax):

$$\hat{y}_j = \frac{e^{z_j^{(2)}}}{\sum_{i=1}^{k} e^{z_i^{(2)}}}, \quad j = 1, \dots, k$$

Properties:

  • $0 < \hat{y}_j < 1$ for all $j$, and $\sum_{j=1}^{k} \hat{y}_j = 1$
  • Interpretable as class probabilities
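
A small NumPy sketch of the softmax (the subtraction of the maximum is a standard numerical-stability trick, not part of the derivation above; it leaves the output unchanged):

```python
import numpy as np

def softmax(z2):
    """Softmax over the output pre-activations z2 (a length-k vector)."""
    z_shift = z2 - np.max(z2)   # stability shift; cancels in the ratio
    exp_z = np.exp(z_shift)
    return exp_z / exp_z.sum()  # positive entries that sum to 1
```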

Loss Function (Cross-Entropy)

Given true label $\mathbf{t} \in \{0, 1\}^k$ (one-hot encoded vector):

$$L = -\sum_{j=1}^{k} t_j \log \hat{y}_j$$

For a single sample with true class $c$ (i.e., $t_c = 1$):

$$L = -\log \hat{y}_c$$
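
A minimal NumPy sketch of this loss (the small `eps` guard against `log(0)` is an implementation detail, not part of the formula):

```python
import numpy as np

def cross_entropy(y_hat, t):
    """L = -sum_j t_j * log(y_hat_j); with one-hot t this equals -log(y_hat_c)."""
    eps = 1e-12                         # avoid log(0) for numerically zero probabilities
    return -np.sum(t * np.log(y_hat + eps))
```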


Backward Pass (Backpropagation)

Goal: Compute $\dfrac{\partial L}{\partial W^{(2)}}$, $\dfrac{\partial L}{\partial \mathbf{b}^{(2)}}$, $\dfrac{\partial L}{\partial W^{(1)}}$, $\dfrac{\partial L}{\partial \mathbf{b}^{(1)}}$

Softmax + Cross-Entropy Gradient

Theorem: For softmax output with cross-entropy loss:

$$\frac{\partial L}{\partial z_j^{(2)}} = \hat{y}_j - t_j, \qquad \text{i.e.} \qquad \frac{\partial L}{\partial \mathbf{z}^{(2)}} = \hat{\mathbf{y}} - \mathbf{t}$$

Proof

We need to compute $\frac{\partial L}{\partial z_j^{(2)}}$ for each output $j$.

Gradient of the loss w.r.t. the softmax output:

$$\frac{\partial L}{\partial \hat{y}_i} = -\frac{t_i}{\hat{y}_i}$$

Gradient of the softmax w.r.t. the pre-activation: the Jacobian of the softmax has two cases:

$$\frac{\partial \hat{y}_i}{\partial z_j^{(2)}} = \begin{cases} \hat{y}_i (1 - \hat{y}_i) & \text{if } i = j \\ -\hat{y}_i \hat{y}_j & \text{if } i \neq j \end{cases}$$

Applying the chain rule:

$$\frac{\partial L}{\partial z_j^{(2)}} = \sum_{i=1}^{k} \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z_j^{(2)}}$$

For component $j$:

\begin{align}
\frac{\partial L}{\partial z_j^{(2)}} &= \left(-\frac{t_j}{\hat{y}_j}\right) \cdot \hat{y}_j(1 - \hat{y}_j) + \sum_{i \neq j} \left(-\frac{t_i}{\hat{y}_i}\right) \cdot (-\hat{y}_i \hat{y}_j) \\
&= -t_j(1 - \hat{y}_j) + \sum_{i \neq j} t_i \hat{y}_j \\
&= -t_j + t_j\hat{y}_j + \sum_{i \neq j} t_i \hat{y}_j \\
&= -t_j + \hat{y}_j \sum_{i=1}^k t_i \\
&= -t_j + \hat{y}_j \quad \text{(since } \textstyle\sum_i t_i = 1\text{)} \\
&= \hat{y}_j - t_j
\end{align}

Therefore:

$$\frac{\partial L}{\partial \mathbf{z}^{(2)}} = \hat{\mathbf{y}} - \mathbf{t}$$
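
This result is easy to sanity-check numerically. The sketch below (illustrative only; the helper names and sizes are not from the text) compares the analytic gradient $\hat{\mathbf{y}} - \mathbf{t}$ against central finite differences of the loss:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, t):
    return -np.sum(t * np.log(softmax(z) + 1e-12))

rng = np.random.default_rng(0)
k = 5
z = rng.normal(size=k)
t = np.zeros(k)
t[2] = 1.0                                  # one-hot target, true class c = 2

analytic = softmax(z) - t                   # the theorem: dL/dz = y_hat - t

eps = 1e-6                                  # central finite differences
numeric = np.array([
    (loss(z + eps * np.eye(k)[j], t) - loss(z - eps * np.eye(k)[j], t)) / (2 * eps)
    for j in range(k)
])
print(np.max(np.abs(analytic - numeric)))   # agreement to roughly 1e-9
```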

Output Layer Gradients

Define the error signal:

$$\boldsymbol{\delta}^{(2)} = \frac{\partial L}{\partial \mathbf{z}^{(2)}} = \hat{\mathbf{y}} - \mathbf{t} \in \mathbb{R}^k$$

Gradient w.r.t. $W^{(2)}$:

$$\frac{\partial L}{\partial W^{(2)}} = \boldsymbol{\delta}^{(2)} \mathbf{h}^\top$$

Element-wise: $\dfrac{\partial L}{\partial W_{ji}^{(2)}} = \delta_j^{(2)} h_i$

Gradient w.r.t. $\mathbf{b}^{(2)}$:

$$\frac{\partial L}{\partial \mathbf{b}^{(2)}} = \boldsymbol{\delta}^{(2)}$$

Element-wise: $\dfrac{\partial L}{\partial b_j^{(2)}} = \delta_j^{(2)}$
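
A minimal NumPy sketch of these two gradients (the helper name `output_layer_grads` is illustrative, not from the text):

```python
import numpy as np

def output_layer_grads(delta2, h):
    """Gradients of L w.r.t. W2 and b2, given delta2 = y_hat - t and hidden activation h."""
    dW2 = np.outer(delta2, h)   # shape (k, m): dL/dW2[j, i] = delta2[j] * h[i]
    db2 = delta2                # shape (k,):   dL/db2[j]    = delta2[j]
    return dW2, db2
```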

Hidden Layer Gradients

Backpropagate the error to the hidden layer:

$$\frac{\partial L}{\partial \mathbf{h}} = (W^{(2)})^\top \boldsymbol{\delta}^{(2)}$$

Applying the ReLU derivative, the error at the pre-activation of the hidden layer is:

$$\boldsymbol{\delta}^{(1)} = \frac{\partial L}{\partial \mathbf{z}^{(1)}} = \left( (W^{(2)})^\top \boldsymbol{\delta}^{(2)} \right) \odot \mathbb{1}[\mathbf{z}^{(1)} > 0]$$

where $\odot$ denotes element-wise multiplication (Hadamard product) and $\mathbb{1}[\mathbf{z}^{(1)} > 0]$ is the ReLU derivative applied element-wise.
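
A minimal NumPy sketch of this backpropagation step (the helper name `hidden_error` is illustrative; the boolean mask `(z1 > 0)` implements the ReLU-derivative indicator):

```python
import numpy as np

def hidden_error(delta2, W2, z1):
    """delta1 = (W2^T delta2) ⊙ 1[z1 > 0]: backprop through W2, then gate by the ReLU derivative."""
    return (W2.T @ delta2) * (z1 > 0).astype(z1.dtype)
```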

Gradient w.r.t. $W^{(1)}$:

$$\frac{\partial L}{\partial W^{(1)}} = \boldsymbol{\delta}^{(1)} \mathbf{x}^\top$$

Element-wise: $\dfrac{\partial L}{\partial W_{ij}^{(1)}} = \delta_i^{(1)} x_j$

Gradient w.r.t. $\mathbf{b}^{(1)}$:

$$\frac{\partial L}{\partial \mathbf{b}^{(1)}} = \boldsymbol{\delta}^{(1)}$$
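
Putting the whole backward pass together, the following sketch (illustrative names and sizes, not a reference implementation) computes all four gradients exactly as derived above and checks $\partial L / \partial W^{(1)}$ against finite differences:

```python
import numpy as np

def forward(x, params):
    """Full forward pass; returns the intermediates needed for backprop."""
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    h = np.maximum(0.0, z1)                 # ReLU
    z2 = W2 @ h + b2
    e = np.exp(z2 - z2.max())               # numerically stable softmax
    y_hat = e / e.sum()
    return z1, h, y_hat

def backward(x, t, params):
    """All four gradients from the derivation above."""
    W1, b1, W2, b2 = params
    z1, h, y_hat = forward(x, params)
    delta2 = y_hat - t                      # output-layer error
    dW2 = np.outer(delta2, h)
    db2 = delta2
    delta1 = (W2.T @ delta2) * (z1 > 0)     # hidden-layer error (ReLU gate)
    dW1 = np.outer(delta1, x)
    db1 = delta1
    return dW1, db1, dW2, db2

def loss(x, t, params):
    _, _, y_hat = forward(x, params)
    return -np.sum(t * np.log(y_hat + 1e-12))

# ---- finite-difference check on W1 (illustrative sizes) ----
rng = np.random.default_rng(0)
d, m, k = 4, 5, 3
x = rng.normal(size=d)
t = np.zeros(k); t[1] = 1.0
params = [rng.normal(scale=0.5, size=(m, d)), rng.normal(size=m),
          rng.normal(scale=0.5, size=(k, m)), rng.normal(size=k)]

dW1, db1, dW2, db2 = backward(x, t, params)

eps = 1e-6
num_dW1 = np.zeros_like(dW1)
for i in range(m):
    for j in range(d):
        p_plus = [p.copy() for p in params];  p_plus[0][i, j] += eps
        p_minus = [p.copy() for p in params]; p_minus[0][i, j] -= eps
        num_dW1[i, j] = (loss(x, t, p_plus) - loss(x, t, p_minus)) / (2 * eps)

print(np.max(np.abs(dW1 - num_dW1)))        # small (on the order of 1e-9)
```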