Problem 1: Derivation of Fisher Linear Discriminant

Problem Setup

We are given two classes of data: Class 1 ($\mathcal{X}_1$) with $N_1$ samples and Class 2 ($\mathcal{X}_2$) with $N_2$ samples. Each data point lives in a d-dimensional space, $\mathbf{x}_j \in \mathbb{R}^d$. Our goal is to find a projection direction $\mathbf{w} \in \mathbb{R}^d$ that best separates the two classes when we project the data onto a one-dimensional space using

$$y = \mathbf{w}^T\mathbf{x}$$

This projection maps our d-dimensional data to a scalar value $y$.
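To make the setup concrete, here is a minimal NumPy sketch of the projection step; the synthetic Gaussian samples and the names `X1`, `X2`, `w` are my own illustrative assumptions, not part of the problem statement.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                                   # dimension of the original space

# Synthetic stand-ins for the two classes (assumed data, for illustration only)
X1 = rng.normal(loc=0.0, size=(30, d))  # N1 = 30 samples of class 1
X2 = rng.normal(loc=2.0, size=(40, d))  # N2 = 40 samples of class 2

w = rng.normal(size=d)                  # some projection direction

# y = w^T x for every sample: each d-dimensional point becomes a scalar
y1, y2 = X1 @ w, X2 @ w
print(y1.shape, y2.shape)               # (30,) (40,)
```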

Quantities in the Original Space

Before projecting, we define several important quantities in the original d-dimensional space.

The class means are:

$$\mathbf{m}_1 = \frac{1}{N_1}\sum_{\mathbf{x}_j \in \mathcal{X}_1}\mathbf{x}_j \qquad \mathbf{m}_2 = \frac{1}{N_2}\sum_{\mathbf{x}_j \in \mathcal{X}_2}\mathbf{x}_j$$

The within-class scatter matrices measure how spread out the data is within each class:

$$\mathbf{S}_1 = \sum_{\mathbf{x}_j \in \mathcal{X}_1}(\mathbf{x}_j - \mathbf{m}_1)(\mathbf{x}_j - \mathbf{m}_1)^T$$

$$\mathbf{S}_2 = \sum_{\mathbf{x}_j \in \mathcal{X}_2}(\mathbf{x}_j - \mathbf{m}_2)(\mathbf{x}_j - \mathbf{m}_2)^T$$

The total within-class scatter matrix combines both classes:

$$\mathbf{S}_w = \mathbf{S}_1 + \mathbf{S}_2$$

The between-class scatter matrix measures the separation between class means:

$$\mathbf{S}_b = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$$

These definitions are all taken from the lecture notes.
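The quantities above translate directly into a few lines of NumPy. This is a sketch on assumed toy data; the variable names (`X1`, `X2`, `Sw`, `Sb`) are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(30, 5))    # class 1 samples (assumed toy data)
X2 = rng.normal(2.0, 1.0, size=(40, 5))    # class 2 samples (assumed toy data)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)  # class means

def scatter(X, m):
    """Sum of the outer products (x_j - m)(x_j - m)^T over the rows of X."""
    D = X - m
    return D.T @ D

S1, S2 = scatter(X1, m1), scatter(X2, m2)  # per-class within-class scatter
Sw = S1 + S2                               # total within-class scatter S_w
Sb = np.outer(m1 - m2, m1 - m2)            # between-class scatter S_b (rank 1)
```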

Quantities in the Projected Space

After projection, we need to express these quantities in terms of the one-dimensional projected values. The projected class means are:

$$\tilde{m}_1 = \frac{1}{N_1}\sum_{\mathbf{x}_j \in \mathcal{X}_1}\mathbf{w}^T\mathbf{x}_j = \mathbf{w}^T\mathbf{m}_1 \qquad \tilde{m}_2 = \frac{1}{N_2}\sum_{\mathbf{x}_j \in \mathcal{X}_2}\mathbf{w}^T\mathbf{x}_j = \mathbf{w}^T\mathbf{m}_2$$

For the projected within-class scatter, we compute:

$$\tilde{S}_1 = \sum_{\mathbf{x}_j \in \mathcal{X}_1}(\mathbf{w}^T\mathbf{x}_j - \tilde{m}_1)^2$$

Factoring out $\mathbf{w}$:

$$\tilde{S}_1 = \sum_{\mathbf{x}_j \in \mathcal{X}_1}\mathbf{w}^T(\mathbf{x}_j - \mathbf{m}_1)(\mathbf{x}_j - \mathbf{m}_1)^T\mathbf{w}$$

Bringing $\mathbf{w}$ outside the summation:

$$\tilde{S}_1 = \mathbf{w}^T\left[\sum_{\mathbf{x}_j \in \mathcal{X}_1}(\mathbf{x}_j - \mathbf{m}_1)(\mathbf{x}_j - \mathbf{m}_1)^T\right]\mathbf{w} = \mathbf{w}^T\mathbf{S}_1\mathbf{w}$$

The total projected within-class scatter is:

$$\tilde{S}_w = \tilde{S}_1 + \tilde{S}_2 = \mathbf{w}^T\mathbf{S}_1\mathbf{w} + \mathbf{w}^T\mathbf{S}_2\mathbf{w} = \mathbf{w}^T\mathbf{S}_w\mathbf{w}$$

The projected between-class scatter is the squared distance between projected means:

\begin{align} \tilde{S}_b &= (\tilde{m}_1 - \tilde{m}_2)^2 \\ &= (\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{m}_2)^2 \\ &= [\mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)]^2 \\ &= \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w} \\ &= \mathbf{w}^T\mathbf{S}_b\mathbf{w} \end{align}
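These two identities are easy to verify numerically. The following sketch, again on assumed toy data, computes the projected scatters directly from the scalars $y = \mathbf{w}^T\mathbf{x}$ and compares them with the quadratic forms derived above.

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(0.0, 1.0, size=(30, 5))    # assumed toy data
X2 = rng.normal(2.0, 1.0, size=(40, 5))
w = rng.normal(size=5)                     # an arbitrary projection direction

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
Sb = np.outer(m1 - m2, m1 - m2)

# Projected scatters computed directly from the one-dimensional values
y1, y2 = X1 @ w, X2 @ w
S_tilde_w = ((y1 - y1.mean())**2).sum() + ((y2 - y2.mean())**2).sum()
S_tilde_b = (y1.mean() - y2.mean())**2

# ... and via the quadratic forms in w
print(np.isclose(S_tilde_w, w @ Sw @ w))   # True
print(np.isclose(S_tilde_b, w @ Sb @ w))   # True
```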

Fisher’s Criterion

Fisher proposed finding the projection direction $\mathbf{w}$ that maximizes the ratio of between-class variance to within-class variance:

$$J(\mathbf{w}) = \frac{\tilde{S}_b}{\tilde{S}_w} = \frac{\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\mathbf{w}^T\mathbf{S}_w\mathbf{w}}$$

Intuitively, we want the between-class variance to be as large as possible (large numerator) and the within-class variance to be as small as possible (small denominator).

An important observation is that $J(\mathbf{w})$ is scale-invariant. If we replace $\mathbf{w}$ with $\alpha\mathbf{w}$ for any non-zero constant $\alpha$:

$$J(\alpha\mathbf{w}) = \frac{\alpha^2\,\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\alpha^2\,\mathbf{w}^T\mathbf{S}_w\mathbf{w}} = J(\mathbf{w})$$

Since only the direction of $\mathbf{w}$ matters, not its magnitude, we can fix the denominator to any positive constant $c$ and simply maximize the numerator. This transforms our problem into:

$$\max_{\mathbf{w}} \ \mathbf{w}^T\mathbf{S}_b\mathbf{w} \quad \text{subject to} \quad \mathbf{w}^T\mathbf{S}_w\mathbf{w} = c$$
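A quick numerical check of this scale invariance, on the same kind of assumed toy setup (the function name `J` simply mirrors the criterion above):

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal(0.0, 1.0, size=(30, 5))    # assumed toy data
X2 = rng.normal(2.0, 1.0, size=(40, 5))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
Sb = np.outer(m1 - m2, m1 - m2)

def J(w):
    """Fisher's criterion: between-class over within-class scatter."""
    return (w @ Sb @ w) / (w @ Sw @ w)

w = rng.normal(size=5)
print(np.isclose(J(w), J(7.3 * w)))        # True: only the direction of w matters
```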

Lagrangian Optimization

We solve this constrained optimization problem using Lagrange multipliers. The Lagrangian is¹:

$$L(\mathbf{w}, \lambda) = \mathbf{w}^T\mathbf{S}_b\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{S}_w\mathbf{w} - c)$$

Taking the derivative with respect to $\mathbf{w}$ and setting it to zero:

$$\frac{\partial L}{\partial \mathbf{w}} = 2\mathbf{S}_b\mathbf{w} - 2\lambda\mathbf{S}_w\mathbf{w} = 0$$

This simplifies to:

$$\mathbf{S}_b\mathbf{w} = \lambda\mathbf{S}_w\mathbf{w}$$

Assuming $\mathbf{S}_w$ is invertible, we multiply both sides by $\mathbf{S}_w^{-1}$:

$$\mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w} = \lambda\mathbf{w}$$

This is a generalized eigenvalue problem where $\mathbf{w}$ is an eigenvector of $\mathbf{S}_w^{-1}\mathbf{S}_b$ with eigenvalue $\lambda$.
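In practice this eigenvalue problem can be solved directly. A sketch on assumed toy data; taking `np.linalg.eig` of $\mathbf{S}_w^{-1}\mathbf{S}_b$ is just one of several ways to do this:

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.normal(0.0, 1.0, size=(30, 5))    # assumed toy data
X2 = rng.normal(2.0, 1.0, size=(40, 5))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
Sb = np.outer(m1 - m2, m1 - m2)

# Eigenvectors of S_w^{-1} S_b; the one with the largest eigenvalue
# maximizes Fisher's criterion
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w_star = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
print(np.round(np.real(eigvals), 6))       # only one eigenvalue is non-zero
```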

The between-class scatter matrix has a special structure:

$$\mathbf{S}_b = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$$

This is an outer product, making $\mathbf{S}_b$ a rank-1 matrix. When we apply $\mathbf{S}_b$ to any vector $\mathbf{w}$:

$$\mathbf{S}_b\mathbf{w} = (\mathbf{m}_1 - \mathbf{m}_2)\underbrace{(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}}_{\text{a scalar}}$$

The result is always in the direction of $\mathbf{m}_1 - \mathbf{m}_2$, scaled by the scalar $(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}$.
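This rank-1 behaviour is easy to see numerically as well; in the sketch below the mean vectors are arbitrary assumed values, not computed from data.

```python
import numpy as np

rng = np.random.default_rng(4)
m1, m2 = rng.normal(size=5), rng.normal(size=5)  # arbitrary "class means" (assumed)
Sb = np.outer(m1 - m2, m1 - m2)                  # rank-1 between-class scatter

v = rng.normal(size=5)                           # any vector
scale = (m1 - m2) @ v                            # the scalar (m1 - m2)^T v

print(np.linalg.matrix_rank(Sb))                 # 1
print(np.allclose(Sb @ v, scale * (m1 - m2)))    # True: S_b v points along m1 - m2
```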

Substituting this into our eigenvalue equation:

$$(\mathbf{m}_1 - \mathbf{m}_2)\left[(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}\right] = \lambda\mathbf{S}_w\mathbf{w} \quad\Longrightarrow\quad \mathbf{w} = \frac{(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}}{\lambda}\,\mathbf{S}_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$

This shows that $\mathbf{w}$ is proportional to $\mathbf{S}_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$. Since we only care about the direction of $\mathbf{w}$, we can ignore the scalar constants $\lambda$ and $(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}$.

As such, the optimal projection direction (illustrated numerically in the sketch after the list below) is:

$$\mathbf{w}^* \propto \mathbf{S}_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$

where:

  • $\mathbf{S}_w = \mathbf{S}_1 + \mathbf{S}_2$ is the total within-class scatter matrix.
  • $\mathbf{m}_1$ and $\mathbf{m}_2$ are the means of the two classes.
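As a cross-check, the closed-form direction can be compared against the eigenvector obtained from the generalized eigenvalue problem; a sketch on assumed toy data:

```python
import numpy as np

rng = np.random.default_rng(5)
X1 = rng.normal(0.0, 1.0, size=(30, 5))    # assumed toy data
X2 = rng.normal(2.0, 1.0, size=(40, 5))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
Sb = np.outer(m1 - m2, m1 - m2)

# Closed-form Fisher direction: w* proportional to S_w^{-1} (m1 - m2)
w_fisher = np.linalg.solve(Sw, m1 - m2)

# Direction from the generalized eigenvalue problem, for comparison
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w_eig = np.real(eigvecs[:, np.argmax(np.real(eigvals))])

# The two directions agree up to scale: |cosine of the angle| = 1
cos = w_fisher @ w_eig / (np.linalg.norm(w_fisher) * np.linalg.norm(w_eig))
print(np.isclose(abs(cos), 1.0))           # True
```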

Footnotes

  1. This is shamelessly taken from the lecture notes.