Backpropagation Derivation for 3-Layer MLP

ReLU Hidden Layer + Softmax Output + Cross-Entropy Loss

Network Architecture

Layers:

  • Input layer: $\mathbf{x} \in \mathbb{R}^d$ (d features)
  • Hidden layer: $\mathbf{h} \in \mathbb{R}^m$ (m hidden units with ReLU)
  • Output layer: $\hat{\mathbf{y}} \in \mathbb{R}^k$ (k classes with Softmax)

Parameters:

| Parameter | Dimension | Description |
| --- | --- | --- |
| $W^{(1)}$ | $\mathbb{R}^{m \times d}$ | Input-to-hidden weights |
| $\mathbf{b}^{(1)}$ | $\mathbb{R}^{m}$ | Hidden layer biases |
| $W^{(2)}$ | $\mathbb{R}^{k \times m}$ | Hidden-to-output weights |
| $\mathbf{b}^{(2)}$ | $\mathbb{R}^{k}$ | Output layer biases |
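As a concrete sketch of these shapes (the sizes $d = 4$, $m = 8$, $k = 3$ and the small-random-weights initialization below are illustrative choices, not from the text), the parameters can be set up as NumPy arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 4, 8, 3  # illustrative sizes: input features, hidden units, classes

# Small random weights and zero biases (a simple, common initialization)
W1 = rng.normal(0.0, 0.1, size=(m, d))  # input-to-hidden weights, shape (m, d)
b1 = np.zeros(m)                        # hidden layer biases, shape (m,)
W2 = rng.normal(0.0, 0.1, size=(k, m))  # hidden-to-output weights, shape (k, m)
b2 = np.zeros(k)                        # output layer biases, shape (k,)
```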

Forward Pass

Hidden Layer (ReLU Activation)

Pre-activation:

$$\mathbf{z}^{(1)} = W^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$$

Activation (ReLU):

$$\mathbf{h} = \mathrm{ReLU}(\mathbf{z}^{(1)})$$

Element-wise: $h_j = \max(0,\, z_j^{(1)})$ for $j = 1, \dots, m$

ReLU Function:

$$\mathrm{ReLU}(z) = \max(0, z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$$


Output Layer (Softmax Activation)

Pre-activation:

$$\mathbf{z}^{(2)} = W^{(2)} \mathbf{h} + \mathbf{b}^{(2)}$$

Activation (Softmax):

$$\hat{y}_i = \mathrm{softmax}(\mathbf{z}^{(2)})_i = \frac{e^{z_i^{(2)}}}{\sum_{j=1}^{k} e^{z_j^{(2)}}}$$

Properties:

  • $0 < \hat{y}_i < 1$ for all $i$
  • $\sum_{i=1}^{k} \hat{y}_i = 1$
  • Interpretable as class probabilities
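The full forward pass above can be sketched in NumPy as follows (function name, shapes, and the max-subtraction stability trick are my choices, not from the text):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward pass: ReLU hidden layer followed by a softmax output."""
    z1 = W1 @ x + b1               # hidden pre-activation: z1 = W1 x + b1
    h = np.maximum(0.0, z1)        # ReLU: h_j = max(0, z1_j)
    z2 = W2 @ h + b2               # output pre-activation: z2 = W2 h + b2
    e = np.exp(z2 - z2.max())      # subtract max for numerical stability
    y_hat = e / e.sum()            # softmax: positive entries summing to 1
    return z1, h, z2, y_hat
```

Subtracting `z2.max()` before exponentiating leaves the softmax output unchanged (numerator and denominator scale by the same factor) but avoids overflow for large pre-activations.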

Loss Function (Cross-Entropy)

Given true label $\mathbf{y} \in \{0, 1\}^k$ (one-hot encoded vector):

$$L = -\sum_{i=1}^{k} y_i \log \hat{y}_i$$

For a single sample with true class $c$:

$$L = -\log \hat{y}_c$$
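A minimal sketch of this loss (the function name is mine; for a one-hot label only the true-class term survives the sum):

```python
import numpy as np

def cross_entropy(y_hat, y):
    """L = -sum_i y_i * log(y_hat_i).

    For a one-hot y with true class c, this reduces to -log(y_hat[c]).
    """
    return -np.sum(y * np.log(y_hat))
```

For example, with predicted probabilities `[0.7, 0.2, 0.1]` and true class 0, the loss is $-\log 0.7$.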


Backward Pass (Backpropagation)

Goal: Compute $\frac{\partial L}{\partial W^{(1)}}$, $\frac{\partial L}{\partial \mathbf{b}^{(1)}}$, $\frac{\partial L}{\partial W^{(2)}}$, $\frac{\partial L}{\partial \mathbf{b}^{(2)}}$

Softmax + Cross-Entropy Gradient

Theorem: For softmax output with cross-entropy loss:

$$\frac{\partial L}{\partial \mathbf{z}^{(2)}} = \hat{\mathbf{y}} - \mathbf{y}$$

Proof

We need to compute $\frac{\partial L}{\partial z_j^{(2)}}$ for each output $j$.

Gradient of loss w.r.t. softmax output:

$$\frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i}$$

Gradient of softmax w.r.t. pre-activation: the Jacobian of softmax has two cases:

$$\frac{\partial \hat{y}_i}{\partial z_j^{(2)}} = \begin{cases} \hat{y}_i (1 - \hat{y}_i) & \text{if } i = j \\ -\hat{y}_i \hat{y}_j & \text{if } i \neq j \end{cases} \;=\; \hat{y}_i (\delta_{ij} - \hat{y}_j)$$

Applying the chain rule, for component $j$:

$$\frac{\partial L}{\partial z_j^{(2)}} = \sum_{i=1}^{k} \frac{\partial L}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial z_j^{(2)}} = \sum_{i=1}^{k} \left( -\frac{y_i}{\hat{y}_i} \right) \hat{y}_i (\delta_{ij} - \hat{y}_j) = -y_j + \hat{y}_j \sum_{i=1}^{k} y_i = \hat{y}_j - y_j$$

where the last step uses $\sum_i y_i = 1$ for a one-hot label.

Therefore:

$$\frac{\partial L}{\partial \mathbf{z}^{(2)}} = \hat{\mathbf{y}} - \mathbf{y} \quad \blacksquare$$
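The theorem is easy to verify numerically with a central finite-difference check (the sizes, seed, and class index below are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=5)      # arbitrary pre-activation vector
c = 2                       # illustrative true class index
loss = lambda zz: -np.log(softmax(zz)[c])  # cross-entropy for one-hot label

# Central finite differences: dL/dz_j ~ (L(z + eps e_j) - L(z - eps e_j)) / (2 eps)
eps = 1e-6
grad_fd = np.array([
    (loss(z + eps * np.eye(5)[j]) - loss(z - eps * np.eye(5)[j])) / (2 * eps)
    for j in range(5)
])

y = np.eye(5)[c]                    # one-hot label
grad_analytic = softmax(z) - y      # the theorem: y_hat - y
assert np.allclose(grad_fd, grad_analytic, atol=1e-6)
```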

Output Layer Gradients

Define the error signal:

$$\boldsymbol{\delta}^{(2)} = \frac{\partial L}{\partial \mathbf{z}^{(2)}} = \hat{\mathbf{y}} - \mathbf{y}$$

Gradient w.r.t. $W^{(2)}$:

$$\frac{\partial L}{\partial W^{(2)}} = \boldsymbol{\delta}^{(2)} \mathbf{h}^\top$$

Element-wise: $\frac{\partial L}{\partial W^{(2)}_{ij}} = \delta_i^{(2)} h_j$

Gradient w.r.t. $\mathbf{b}^{(2)}$:

$$\frac{\partial L}{\partial \mathbf{b}^{(2)}} = \boldsymbol{\delta}^{(2)}$$

Element-wise: $\frac{\partial L}{\partial b_i^{(2)}} = \delta_i^{(2)}$
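In NumPy, the outer product $\boldsymbol{\delta}^{(2)} \mathbf{h}^\top$ is a single call (the concrete values of `h`, `y_hat`, and `y` below are illustrative, not from the text):

```python
import numpy as np

# Illustrative values: hidden activation, softmax output, one-hot label
h = np.array([0.5, 0.0, 1.2])
y_hat = np.array([0.7, 0.2, 0.1])
y = np.array([1.0, 0.0, 0.0])

delta2 = y_hat - y            # error signal at the output pre-activation
dW2 = np.outer(delta2, h)     # dL/dW2 = delta2 h^T, shape (k, m)
db2 = delta2                  # dL/db2 = delta2
```

Note that `dW2[i, j]` is exactly $\delta_i^{(2)} h_j$, matching the element-wise formula above.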

Hidden Layer Gradients

Backpropagate the error to the hidden layer:

$$\frac{\partial L}{\partial \mathbf{h}} = (W^{(2)})^\top \boldsymbol{\delta}^{(2)}$$

Applying the ReLU derivative $\mathrm{ReLU}'(z) = \mathbb{1}[z > 0]$, the error at the pre-activation of the hidden layer is:

$$\boldsymbol{\delta}^{(1)} = \left( (W^{(2)})^\top \boldsymbol{\delta}^{(2)} \right) \odot \mathbb{1}[\mathbf{z}^{(1)} > 0]$$

where $\odot$ denotes element-wise multiplication (Hadamard product).

Gradient w.r.t. $W^{(1)}$:

$$\frac{\partial L}{\partial W^{(1)}} = \boldsymbol{\delta}^{(1)} \mathbf{x}^\top$$

Element-wise: $\frac{\partial L}{\partial W^{(1)}_{jl}} = \delta_j^{(1)} x_l$

Gradient w.r.t. $\mathbf{b}^{(1)}$:

$$\frac{\partial L}{\partial \mathbf{b}^{(1)}} = \boldsymbol{\delta}^{(1)}$$
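The whole derivation can be collected into a forward/backward pair; the sketch below is a minimal single-sample implementation under the same notation (function names and shapes are my choices). The boolean mask `(z1 > 0)` plays the role of $\mathbb{1}[\mathbf{z}^{(1)} > 0]$, and `*` is the Hadamard product:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """ReLU hidden layer + softmax output; returns intermediates for backprop."""
    z1 = W1 @ x + b1
    h = np.maximum(0.0, z1)          # ReLU
    z2 = W2 @ h + b2
    e = np.exp(z2 - z2.max())        # stable softmax
    y_hat = e / e.sum()
    return z1, h, y_hat

def backward(x, y, z1, h, y_hat, W2):
    """Backward pass for the 3-layer MLP with cross-entropy loss."""
    delta2 = y_hat - y                   # dL/dz2 = y_hat - y (the theorem)
    dW2 = np.outer(delta2, h)            # delta2 h^T
    db2 = delta2
    delta1 = (W2.T @ delta2) * (z1 > 0)  # (W2^T delta2) ⊙ 1[z1 > 0]
    dW1 = np.outer(delta1, x)            # delta1 x^T
    db1 = delta1
    return dW1, db1, dW2, db2
```

A finite-difference check on any single weight (perturb it, recompute the loss, compare the slope against the analytic gradient) is a good way to validate an implementation like this.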