Problem 1: Derivation of Fisher Linear Discriminant

Problem Setup

We are given two classes of data: Class 1 ($\mathcal{X}_1$) with $N_1$ samples and Class 2 ($\mathcal{X}_2$) with $N_2$ samples. Each data point lives in a d-dimensional space, $\mathbf{x}_j \in \mathbb{R}^d$. Our goal is to find a projection direction $\mathbf{w} \in \mathbb{R}^d$ that best separates the two classes when we project the data onto a one-dimensional space using

$$y = \mathbf{w}^T\mathbf{x}$$

This projection maps our d-dimensional data to a scalar value $y$.
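To make the setup concrete, here is a minimal NumPy sketch of the projection step; the synthetic Gaussian samples and the names `X1`, `X2`, `w` are my own illustrative assumptions, not part of the problem statement.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                                   # dimension of the original space

# Synthetic stand-ins for the two classes (assumed data, for illustration only)
X1 = rng.normal(loc=0.0, size=(30, d))  # N1 = 30 samples of class 1
X2 = rng.normal(loc=2.0, size=(40, d))  # N2 = 40 samples of class 2

w = rng.normal(size=d)                  # some projection direction

# y = w^T x for every sample: each d-dimensional point becomes a scalar
y1, y2 = X1 @ w, X2 @ w
print(y1.shape, y2.shape)               # (30,) (40,)
```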

Quantities in the Original Space

Before projecting, we define several important quantities in the original d-dimensional space.

The class means are:

$$\mathbf{m}_1 = \frac{1}{N_1}\sum_{\mathbf{x}_j \in \mathcal{X}_1}\mathbf{x}_j \qquad \mathbf{m}_2 = \frac{1}{N_2}\sum_{\mathbf{x}_j \in \mathcal{X}_2}\mathbf{x}_j$$

The within-class scatter matrices measure how spread out the data is within each class:

$$\mathbf{S}_1 = \sum_{\mathbf{x}_j \in \mathcal{X}_1}(\mathbf{x}_j - \mathbf{m}_1)(\mathbf{x}_j - \mathbf{m}_1)^T$$

$$\mathbf{S}_2 = \sum_{\mathbf{x}_j \in \mathcal{X}_2}(\mathbf{x}_j - \mathbf{m}_2)(\mathbf{x}_j - \mathbf{m}_2)^T$$

The total within-class scatter matrix combines both classes:

$$\mathbf{S}_w = \mathbf{S}_1 + \mathbf{S}_2$$

The between-class scatter matrix measures the separation between class means:

$$\mathbf{S}_b = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$$

These definitions are all taken from the lecture notes.
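The quantities above translate directly into a few lines of NumPy. This is a sketch on assumed toy data; the variable names (`X1`, `X2`, `Sw`, `Sb`) are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(30, 5))    # class 1 samples (assumed toy data)
X2 = rng.normal(2.0, 1.0, size=(40, 5))    # class 2 samples (assumed toy data)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)  # class means

def scatter(X, m):
    """Sum of the outer products (x_j - m)(x_j - m)^T over the rows of X."""
    D = X - m
    return D.T @ D

S1, S2 = scatter(X1, m1), scatter(X2, m2)  # per-class within-class scatter
Sw = S1 + S2                               # total within-class scatter S_w
Sb = np.outer(m1 - m2, m1 - m2)            # between-class scatter S_b (rank 1)
```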

Quantities in the Projected Space

After projection, we need to express these quantities in terms of the one-dimensional projected values. The projected class means are:

$$\tilde{m}_1 = \frac{1}{N_1}\sum_{\mathbf{x}_j \in \mathcal{X}_1}\mathbf{w}^T\mathbf{x}_j = \mathbf{w}^T\mathbf{m}_1 \qquad \tilde{m}_2 = \frac{1}{N_2}\sum_{\mathbf{x}_j \in \mathcal{X}_2}\mathbf{w}^T\mathbf{x}_j = \mathbf{w}^T\mathbf{m}_2$$

For the projected within-class scatter, we compute:

$$\tilde{S}_1 = \sum_{\mathbf{x}_j \in \mathcal{X}_1}(\mathbf{w}^T\mathbf{x}_j - \tilde{m}_1)^2$$

Factoring out $\mathbf{w}$:

$$\tilde{S}_1 = \sum_{\mathbf{x}_j \in \mathcal{X}_1}\mathbf{w}^T(\mathbf{x}_j - \mathbf{m}_1)(\mathbf{x}_j - \mathbf{m}_1)^T\mathbf{w}$$

Bringing $\mathbf{w}$ outside the summation:

$$\tilde{S}_1 = \mathbf{w}^T\left[\sum_{\mathbf{x}_j \in \mathcal{X}_1}(\mathbf{x}_j - \mathbf{m}_1)(\mathbf{x}_j - \mathbf{m}_1)^T\right]\mathbf{w} = \mathbf{w}^T\mathbf{S}_1\mathbf{w}$$

The total projected within-class scatter is:

$$\tilde{S}_w = \tilde{S}_1 + \tilde{S}_2 = \mathbf{w}^T\mathbf{S}_1\mathbf{w} + \mathbf{w}^T\mathbf{S}_2\mathbf{w} = \mathbf{w}^T\mathbf{S}_w\mathbf{w}$$

The projected between-class scatter is the squared distance between projected means:

\begin{align} \tilde{S}_b &= (\tilde{m}_1 - \tilde{m}_2)^2 \\ &= (\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{m}_2)^2 \\ &= [\mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)]^2 \\ &= \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w} \\ &= \mathbf{w}^T\mathbf{S}_b\mathbf{w} \end{align}
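These two identities are easy to verify numerically. The following sketch, again on assumed toy data, computes the projected scatters directly from the scalars $y = \mathbf{w}^T\mathbf{x}$ and compares them with the quadratic forms derived above.

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(0.0, 1.0, size=(30, 5))    # assumed toy data
X2 = rng.normal(2.0, 1.0, size=(40, 5))
w = rng.normal(size=5)                     # an arbitrary projection direction

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
Sb = np.outer(m1 - m2, m1 - m2)

# Projected scatters computed directly from the one-dimensional values
y1, y2 = X1 @ w, X2 @ w
S_tilde_w = ((y1 - y1.mean())**2).sum() + ((y2 - y2.mean())**2).sum()
S_tilde_b = (y1.mean() - y2.mean())**2

# ... and via the quadratic forms in w
print(np.isclose(S_tilde_w, w @ Sw @ w))   # True
print(np.isclose(S_tilde_b, w @ Sb @ w))   # True
```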

Fisher’s Criterion

Fisher proposed finding the projection direction $\mathbf{w}$ that maximizes the ratio of between-class variance to within-class variance:

$$J(\mathbf{w}) = \frac{\tilde{S}_b}{\tilde{S}_w} = \frac{\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\mathbf{w}^T\mathbf{S}_w\mathbf{w}}$$

Intuitively, we want the between-class variance to be as large as possible (large numerator) and the within-class variance to be as small as possible (small denominator).

An important observation is that $J(\mathbf{w})$ is scale-invariant. If we replace $\mathbf{w}$ with $\alpha\mathbf{w}$ for any non-zero constant $\alpha$:

$$J(\alpha\mathbf{w}) = \frac{\alpha^2\,\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\alpha^2\,\mathbf{w}^T\mathbf{S}_w\mathbf{w}} = J(\mathbf{w})$$

Since only the direction of $\mathbf{w}$ matters, not its magnitude, we can fix the denominator to any positive constant $c$ and simply maximize the numerator. This transforms our problem into:

$$\max_{\mathbf{w}} \ \mathbf{w}^T\mathbf{S}_b\mathbf{w} \quad \text{subject to} \quad \mathbf{w}^T\mathbf{S}_w\mathbf{w} = c$$
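A quick numerical check of this scale invariance, on the same kind of assumed toy setup (the function name `J` simply mirrors the criterion above):

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal(0.0, 1.0, size=(30, 5))    # assumed toy data
X2 = rng.normal(2.0, 1.0, size=(40, 5))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
Sb = np.outer(m1 - m2, m1 - m2)

def J(w):
    """Fisher's criterion: between-class over within-class scatter."""
    return (w @ Sb @ w) / (w @ Sw @ w)

w = rng.normal(size=5)
print(np.isclose(J(w), J(7.3 * w)))        # True: only the direction of w matters
```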

Lagrangian Optimization

We solve this constrained optimization problem using Lagrange multipliers. The Lagrangian is¹:

$$L(\mathbf{w}, \lambda) = \mathbf{w}^T\mathbf{S}_b\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{S}_w\mathbf{w} - c)$$

Taking the derivative with respect to $\mathbf{w}$ and setting it to zero:

$$\frac{\partial L}{\partial \mathbf{w}} = 2\mathbf{S}_b\mathbf{w} - 2\lambda\mathbf{S}_w\mathbf{w} = 0$$

This simplifies to:

$$\mathbf{S}_b\mathbf{w} = \lambda\mathbf{S}_w\mathbf{w}$$

Assuming $\mathbf{S}_w$ is invertible, we multiply both sides by $\mathbf{S}_w^{-1}$:

$$\mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w} = \lambda\mathbf{w}$$

This is a generalized eigenvalue problem where $\mathbf{w}$ is an eigenvector of $\mathbf{S}_w^{-1}\mathbf{S}_b$ with eigenvalue $\lambda$.
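In practice this eigenvalue problem can be solved directly. A sketch on assumed toy data; taking `np.linalg.eig` of $\mathbf{S}_w^{-1}\mathbf{S}_b$ is just one of several ways to do this:

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.normal(0.0, 1.0, size=(30, 5))    # assumed toy data
X2 = rng.normal(2.0, 1.0, size=(40, 5))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
Sb = np.outer(m1 - m2, m1 - m2)

# Eigenvectors of S_w^{-1} S_b; the one with the largest eigenvalue
# maximizes Fisher's criterion
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w_star = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
print(np.round(np.real(eigvals), 6))       # only one eigenvalue is non-zero
```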

The between-class scatter matrix has a special structure:

$$\mathbf{S}_b = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$$

This is an outer product, making $\mathbf{S}_b$ a rank-1 matrix. When we apply $\mathbf{S}_b$ to any vector $\mathbf{w}$:

$$\mathbf{S}_b\mathbf{w} = (\mathbf{m}_1 - \mathbf{m}_2)\underbrace{(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}}_{\text{a scalar}}$$

The result is always in the direction of $\mathbf{m}_1 - \mathbf{m}_2$, scaled by the scalar $(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}$.
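This rank-1 behaviour is easy to see numerically as well; in the sketch below the mean vectors are arbitrary assumed values, not computed from data.

```python
import numpy as np

rng = np.random.default_rng(4)
m1, m2 = rng.normal(size=5), rng.normal(size=5)  # arbitrary "class means" (assumed)
Sb = np.outer(m1 - m2, m1 - m2)                  # rank-1 between-class scatter

v = rng.normal(size=5)                           # any vector
scale = (m1 - m2) @ v                            # the scalar (m1 - m2)^T v

print(np.linalg.matrix_rank(Sb))                 # 1
print(np.allclose(Sb @ v, scale * (m1 - m2)))    # True: S_b v points along m1 - m2
```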

Substituting this into our eigenvalue equation:

$$(\mathbf{m}_1 - \mathbf{m}_2)\left[(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}\right] = \lambda\mathbf{S}_w\mathbf{w} \quad\Longrightarrow\quad \mathbf{w} = \frac{(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}}{\lambda}\,\mathbf{S}_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$

This shows that $\mathbf{w}$ is proportional to $\mathbf{S}_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$. Since we only care about the direction of $\mathbf{w}$, we can ignore the scalar constants $\lambda$ and $(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}$.

As such, the optimal projection direction (illustrated numerically in the sketch after the list below) is:

$$\mathbf{w}^* \propto \mathbf{S}_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$

where:

  • $\mathbf{S}_w = \mathbf{S}_1 + \mathbf{S}_2$ is the total within-class scatter matrix.
  • $\mathbf{m}_1$ and $\mathbf{m}_2$ are the means of the two classes.
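As a cross-check, the closed-form direction can be compared against the eigenvector obtained from the generalized eigenvalue problem; a sketch on assumed toy data:

```python
import numpy as np

rng = np.random.default_rng(5)
X1 = rng.normal(0.0, 1.0, size=(30, 5))    # assumed toy data
X2 = rng.normal(2.0, 1.0, size=(40, 5))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
Sb = np.outer(m1 - m2, m1 - m2)

# Closed-form Fisher direction: w* proportional to S_w^{-1} (m1 - m2)
w_fisher = np.linalg.solve(Sw, m1 - m2)

# Direction from the generalized eigenvalue problem, for comparison
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w_eig = np.real(eigvecs[:, np.argmax(np.real(eigvals))])

# The two directions agree up to scale: |cosine of the angle| = 1
cos = w_fisher @ w_eig / (np.linalg.norm(w_fisher) * np.linalg.norm(w_eig))
print(np.isclose(abs(cos), 1.0))           # True
```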

Footnotes

  1. This is shamelessly taken from the lecture notes.