Backpropagation Derivation for 3-Layer MLP

ReLU Hidden Layer + Softmax Output + Cross-Entropy Loss

Network Architecture

Layers:

  • Input layer: $\mathbf{x} \in \mathbb{R}^d$ (d features)
  • Hidden layer: $\mathbf{h} \in \mathbb{R}^m$ (m hidden units with ReLU)
  • Output layer: $\hat{\mathbf{y}} \in \mathbb{R}^k$ (k classes with Softmax)

Parameters:

| Parameter | Dimension | Description |
| --- | --- | --- |
| $W^{(1)}$ | $\mathbb{R}^{m \times d}$ | Input-to-hidden weights |
| $\mathbf{b}^{(1)}$ | $\mathbb{R}^{m}$ | Hidden layer biases |
| $W^{(2)}$ | $\mathbb{R}^{k \times m}$ | Hidden-to-output weights |
| $\mathbf{b}^{(2)}$ | $\mathbb{R}^{k}$ | Output layer biases |
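As a concrete sketch of these shapes (the sizes $d = 4$, $m = 8$, $k = 3$ and the small-random-weights initialization below are illustrative choices, not from the text), the parameters can be set up as NumPy arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 4, 8, 3  # illustrative sizes: input features, hidden units, classes

# Small random weights and zero biases (a simple, common initialization)
W1 = rng.normal(0.0, 0.1, size=(m, d))  # input-to-hidden weights, shape (m, d)
b1 = np.zeros(m)                        # hidden layer biases, shape (m,)
W2 = rng.normal(0.0, 0.1, size=(k, m))  # hidden-to-output weights, shape (k, m)
b2 = np.zeros(k)                        # output layer biases, shape (k,)
```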

Forward Pass

Hidden Layer (ReLU Activation)

Pre-activation:

$$\mathbf{z}^{(1)} = W^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$$

Activation (ReLU):

$$\mathbf{h} = \mathrm{ReLU}(\mathbf{z}^{(1)})$$

Element-wise: $h_j = \max(0,\, z_j^{(1)})$ for $j = 1, \dots, m$

ReLU Function:

$$\mathrm{ReLU}(z) = \max(0, z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$$


Output Layer (Softmax Activation)

Pre-activation:

$$\mathbf{z}^{(2)} = W^{(2)} \mathbf{h} + \mathbf{b}^{(2)}$$

Activation (Softmax):

$$\hat{y}_i = \mathrm{softmax}(\mathbf{z}^{(2)})_i = \frac{e^{z_i^{(2)}}}{\sum_{j=1}^{k} e^{z_j^{(2)}}}$$

Properties:

  • $0 < \hat{y}_i < 1$ for all $i$
  • $\sum_{i=1}^{k} \hat{y}_i = 1$
  • Interpretable as class probabilities
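The full forward pass above can be sketched in NumPy as follows (function name, shapes, and the max-subtraction stability trick are my choices, not from the text):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward pass: ReLU hidden layer followed by a softmax output."""
    z1 = W1 @ x + b1               # hidden pre-activation: z1 = W1 x + b1
    h = np.maximum(0.0, z1)        # ReLU: h_j = max(0, z1_j)
    z2 = W2 @ h + b2               # output pre-activation: z2 = W2 h + b2
    e = np.exp(z2 - z2.max())      # subtract max for numerical stability
    y_hat = e / e.sum()            # softmax: positive entries summing to 1
    return z1, h, z2, y_hat
```

Subtracting `z2.max()` before exponentiating leaves the softmax output unchanged (numerator and denominator scale by the same factor) but avoids overflow for large pre-activations.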

Loss Function (Cross-Entropy)

Given true label $\mathbf{y} \in \{0, 1\}^k$ (one-hot encoded vector):

$$L = -\sum_{i=1}^{k} y_i \log \hat{y}_i$$

For a single sample with true class $c$:

$$L = -\log \hat{y}_c$$
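A minimal sketch of this loss (the function name is mine; for a one-hot label only the true-class term survives the sum):

```python
import numpy as np

def cross_entropy(y_hat, y):
    """L = -sum_i y_i * log(y_hat_i).

    For a one-hot y with true class c, this reduces to -log(y_hat[c]).
    """
    return -np.sum(y * np.log(y_hat))
```

For example, with predicted probabilities `[0.7, 0.2, 0.1]` and true class 0, the loss is $-\log 0.7$.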


Backward Pass (Backpropagation)

Goal: Compute $\frac{\partial L}{\partial W^{(1)}}$, $\frac{\partial L}{\partial \mathbf{b}^{(1)}}$, $\frac{\partial L}{\partial W^{(2)}}$, $\frac{\partial L}{\partial \mathbf{b}^{(2)}}$

Softmax + Cross-Entropy Gradient

Theorem: For softmax output with cross-entropy loss:

$$\frac{\partial L}{\partial \mathbf{z}^{(2)}} = \hat{\mathbf{y}} - \mathbf{y}$$

Proof

We need to compute $\frac{\partial L}{\partial z_j^{(2)}}$ for each output $j$.

Gradient of loss w.r.t. softmax output:

$$\frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i}$$

Gradient of softmax w.r.t. pre-activation: the Jacobian of softmax has two cases:

$$\frac{\partial \hat{y}_i}{\partial z_j^{(2)}} = \begin{cases} \hat{y}_i (1 - \hat{y}_i) & \text{if } i = j \\ -\hat{y}_i \hat{y}_j & \text{if } i \neq j \end{cases} \;=\; \hat{y}_i (\delta_{ij} - \hat{y}_j)$$

Applying the chain rule, for component $j$:

$$\frac{\partial L}{\partial z_j^{(2)}} = \sum_{i=1}^{k} \frac{\partial L}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial z_j^{(2)}} = \sum_{i=1}^{k} \left( -\frac{y_i}{\hat{y}_i} \right) \hat{y}_i (\delta_{ij} - \hat{y}_j) = -y_j + \hat{y}_j \sum_{i=1}^{k} y_i = \hat{y}_j - y_j$$

where the last step uses $\sum_i y_i = 1$ for a one-hot label.

Therefore:

$$\frac{\partial L}{\partial \mathbf{z}^{(2)}} = \hat{\mathbf{y}} - \mathbf{y} \quad \blacksquare$$
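The theorem is easy to verify numerically with a central finite-difference check (the sizes, seed, and class index below are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=5)      # arbitrary pre-activation vector
c = 2                       # illustrative true class index
loss = lambda zz: -np.log(softmax(zz)[c])  # cross-entropy for one-hot label

# Central finite differences: dL/dz_j ~ (L(z + eps e_j) - L(z - eps e_j)) / (2 eps)
eps = 1e-6
grad_fd = np.array([
    (loss(z + eps * np.eye(5)[j]) - loss(z - eps * np.eye(5)[j])) / (2 * eps)
    for j in range(5)
])

y = np.eye(5)[c]                    # one-hot label
grad_analytic = softmax(z) - y      # the theorem: y_hat - y
assert np.allclose(grad_fd, grad_analytic, atol=1e-6)
```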

Output Layer Gradients

Define the error signal:

$$\boldsymbol{\delta}^{(2)} = \frac{\partial L}{\partial \mathbf{z}^{(2)}} = \hat{\mathbf{y}} - \mathbf{y}$$

Gradient w.r.t. $W^{(2)}$:

$$\frac{\partial L}{\partial W^{(2)}} = \boldsymbol{\delta}^{(2)} \mathbf{h}^\top$$

Element-wise: $\frac{\partial L}{\partial W^{(2)}_{ij}} = \delta_i^{(2)} h_j$

Gradient w.r.t. $\mathbf{b}^{(2)}$:

$$\frac{\partial L}{\partial \mathbf{b}^{(2)}} = \boldsymbol{\delta}^{(2)}$$

Element-wise: $\frac{\partial L}{\partial b_i^{(2)}} = \delta_i^{(2)}$
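In NumPy, the outer product $\boldsymbol{\delta}^{(2)} \mathbf{h}^\top$ is a single call (the concrete values of `h`, `y_hat`, and `y` below are illustrative, not from the text):

```python
import numpy as np

# Illustrative values: hidden activation, softmax output, one-hot label
h = np.array([0.5, 0.0, 1.2])
y_hat = np.array([0.7, 0.2, 0.1])
y = np.array([1.0, 0.0, 0.0])

delta2 = y_hat - y            # error signal at the output pre-activation
dW2 = np.outer(delta2, h)     # dL/dW2 = delta2 h^T, shape (k, m)
db2 = delta2                  # dL/db2 = delta2
```

Note that `dW2[i, j]` is exactly $\delta_i^{(2)} h_j$, matching the element-wise formula above.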

Hidden Layer Gradients

Backpropagate the error to the hidden layer:

$$\frac{\partial L}{\partial \mathbf{h}} = (W^{(2)})^\top \boldsymbol{\delta}^{(2)}$$

Applying the ReLU derivative $\mathrm{ReLU}'(z) = \mathbb{1}[z > 0]$, the error at the pre-activation of the hidden layer is:

$$\boldsymbol{\delta}^{(1)} = \left( (W^{(2)})^\top \boldsymbol{\delta}^{(2)} \right) \odot \mathbb{1}[\mathbf{z}^{(1)} > 0]$$

where $\odot$ denotes element-wise multiplication (Hadamard product).

Gradient w.r.t. $W^{(1)}$:

$$\frac{\partial L}{\partial W^{(1)}} = \boldsymbol{\delta}^{(1)} \mathbf{x}^\top$$

Element-wise: $\frac{\partial L}{\partial W^{(1)}_{jl}} = \delta_j^{(1)} x_l$

Gradient w.r.t. $\mathbf{b}^{(1)}$:

$$\frac{\partial L}{\partial \mathbf{b}^{(1)}} = \boldsymbol{\delta}^{(1)}$$
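The whole derivation can be collected into a forward/backward pair; the sketch below is a minimal single-sample implementation under the same notation (function names and shapes are my choices). The boolean mask `(z1 > 0)` plays the role of $\mathbb{1}[\mathbf{z}^{(1)} > 0]$, and `*` is the Hadamard product:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """ReLU hidden layer + softmax output; returns intermediates for backprop."""
    z1 = W1 @ x + b1
    h = np.maximum(0.0, z1)          # ReLU
    z2 = W2 @ h + b2
    e = np.exp(z2 - z2.max())        # stable softmax
    y_hat = e / e.sum()
    return z1, h, y_hat

def backward(x, y, z1, h, y_hat, W2):
    """Backward pass for the 3-layer MLP with cross-entropy loss."""
    delta2 = y_hat - y                   # dL/dz2 = y_hat - y (the theorem)
    dW2 = np.outer(delta2, h)            # delta2 h^T
    db2 = delta2
    delta1 = (W2.T @ delta2) * (z1 > 0)  # (W2^T delta2) ⊙ 1[z1 > 0]
    dW1 = np.outer(delta1, x)            # delta1 x^T
    db1 = delta1
    return dW1, db1, dW2, db2
```

A finite-difference check on any single weight (perturb it, recompute the loss, compare the slope against the analytic gradient) is a good way to validate an implementation like this.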