Problem 1: Derivation of Fisher Linear Discriminant
Problem Setup
We are given two classes of data: Class 1 ($\mathcal{X}_1$) with $n_1$ samples and Class 2 ($\mathcal{X}_2$) with $n_2$ samples. Each data point lives in a d-dimensional space, $\mathbf{x}_j \in \mathbb{R}^d$. Our goal is to find a projection direction $\mathbf{w}$ that best separates the two classes when we project the data onto a one-dimensional space using:

$$y = \mathbf{w}^T\mathbf{x}$$

This projection maps our d-dimensional data to a scalar value $y$.
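As a quick illustration, here is a minimal NumPy sketch of this projection. The toy data, the dimensionality $d = 2$, and the candidate direction `w` are made-up values for illustration only, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)

X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(20, 2))   # toy Class 1 samples (d = 2)
X2 = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(25, 2))   # toy Class 2 samples
w = np.array([1.0, 0.5])             # an arbitrary candidate direction

y1 = X1 @ w                          # scalar projections y = w^T x for Class 1
y2 = X2 @ w                          # scalar projections for Class 2
print(y1.shape, y2.shape)            # (20,) (25,): each sample becomes a single number
```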
Quantities in the Original Space
Before projecting, we define several important quantities in the original d-dimensional space.
The class means are:

$$\mathbf{m}_1 = \frac{1}{n_1}\sum_{\mathbf{x}_j \in \mathcal{X}_1}\mathbf{x}_j, \qquad \mathbf{m}_2 = \frac{1}{n_2}\sum_{\mathbf{x}_j \in \mathcal{X}_2}\mathbf{x}_j$$

The within-class scatter matrices measure how spread out the data is within each class:

$$\mathbf{S}_1 = \sum_{\mathbf{x}_j \in \mathcal{X}_1}(\mathbf{x}_j - \mathbf{m}_1)(\mathbf{x}_j - \mathbf{m}_1)^T$$

$$\mathbf{S}_2 = \sum_{\mathbf{x}_j \in \mathcal{X}_2}(\mathbf{x}_j - \mathbf{m}_2)(\mathbf{x}_j - \mathbf{m}_2)^T$$

The total within-class scatter matrix combines both classes:

$$\mathbf{S}_w = \mathbf{S}_1 + \mathbf{S}_2$$

The between-class scatter matrix measures the separation between class means:

$$\mathbf{S}_b = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$$

These definitions are all taken from the lecture notes.
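These quantities are straightforward to compute numerically. A minimal sketch on made-up toy data (the arrays `X1`, `X2` and all variable names are my own, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], size=(20, 2))   # toy Class 1 samples
X2 = rng.normal(loc=[3.0, 2.0], size=(25, 2))   # toy Class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)       # class means

# Within-class scatter matrices: sums of outer products of centered samples.
S1 = (X1 - m1).T @ (X1 - m1)
S2 = (X2 - m2).T @ (X2 - m2)
Sw = S1 + S2                                    # total within-class scatter

# Between-class scatter: outer product of the mean difference.
diff = (m1 - m2).reshape(-1, 1)
Sb = diff @ diff.T
```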
Quantities in the Projected Space
After projection, we need to express these quantities in terms of the one-dimensional projected values. The projected class means are:

$$\tilde{m}_1 = \frac{1}{n_1}\sum_{\mathbf{x}_j \in \mathcal{X}_1}\mathbf{w}^T\mathbf{x}_j = \mathbf{w}^T\mathbf{m}_1, \qquad \tilde{m}_2 = \mathbf{w}^T\mathbf{m}_2$$

For the projected within-class scatter of Class 1, we compute:

$$\tilde{S}_1 = \sum_{\mathbf{x}_j \in \mathcal{X}_1}(\mathbf{w}^T\mathbf{x}_j - \tilde{m}_1)^2$$

Factoring out $\mathbf{w}$:

$$\tilde{S}_1 = \sum_{\mathbf{x}_j \in \mathcal{X}_1}\mathbf{w}^T(\mathbf{x}_j - \mathbf{m}_1)(\mathbf{x}_j - \mathbf{m}_1)^T\mathbf{w}$$

Bringing $\mathbf{w}$ outside the summation:

$$\tilde{S}_1 = \mathbf{w}^T\left[\sum_{\mathbf{x}_j \in \mathcal{X}_1}(\mathbf{x}_j - \mathbf{m}_1)(\mathbf{x}_j - \mathbf{m}_1)^T\right]\mathbf{w} = \mathbf{w}^T\mathbf{S}_1\mathbf{w}$$

The total projected within-class scatter is:

$$\tilde{S}_w = \tilde{S}_1 + \tilde{S}_2 = \mathbf{w}^T\mathbf{S}_1\mathbf{w} + \mathbf{w}^T\mathbf{S}_2\mathbf{w} = \mathbf{w}^T\mathbf{S}_w\mathbf{w}$$

The projected between-class scatter is the squared distance between projected means:

\begin{align} \tilde{S}_b &= (\tilde{m}_1 - \tilde{m}_2)^2 \\ &= (\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{m}_2)^2 \\ &= [\mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)]^2 \\ &= \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w} \\ &= \mathbf{w}^T\mathbf{S}_b\mathbf{w} \end{align}
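The identities $\tilde{S}_w = \mathbf{w}^T\mathbf{S}_w\mathbf{w}$ and $\tilde{S}_b = \mathbf{w}^T\mathbf{S}_b\mathbf{w}$ can be checked numerically. A small sketch, again on made-up toy data and an arbitrary direction of my choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], size=(20, 2))
X2 = rng.normal(loc=[3.0, 2.0], size=(25, 2))
w = np.array([1.0, 0.5])                          # arbitrary direction

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)
S2 = (X2 - m2).T @ (X2 - m2)
Sw, Sb = S1 + S2, np.outer(m1 - m2, m1 - m2)

# Scatter computed directly in the projected (scalar) space.
y1, y2 = X1 @ w, X2 @ w
S_w_tilde = ((y1 - y1.mean())**2).sum() + ((y2 - y2.mean())**2).sum()
S_b_tilde = (y1.mean() - y2.mean())**2

# Scatter computed via the quadratic forms derived above: they agree.
assert np.isclose(S_w_tilde, w @ Sw @ w)
assert np.isclose(S_b_tilde, w @ Sb @ w)
```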
Fisher's Criterion
Fisher proposed finding the projection direction that maximizes the ratio of between-class variance to within-class variance:

$$J(\mathbf{w}) = \frac{\tilde{S}_b}{\tilde{S}_w} = \frac{\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\mathbf{w}^T\mathbf{S}_w\mathbf{w}}$$

Intuitively, we want the between-class variance to be as large as possible (large numerator) and the within-class variance to be as small as possible (small denominator).
An important observation is that $J(\mathbf{w})$ is scale-invariant. If we replace $\mathbf{w}$ with $\alpha\mathbf{w}$ for any non-zero constant $\alpha$:

$$J(\alpha\mathbf{w}) = \frac{(\alpha\mathbf{w})^T\mathbf{S}_b(\alpha\mathbf{w})}{(\alpha\mathbf{w})^T\mathbf{S}_w(\alpha\mathbf{w})} = \frac{\alpha^2\,\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\alpha^2\,\mathbf{w}^T\mathbf{S}_w\mathbf{w}} = J(\mathbf{w})$$

Since only the direction of $\mathbf{w}$ matters, not its magnitude, we can fix the denominator to any positive constant $c$ and simply maximize the numerator. This transforms our problem into the constrained problem:

$$\max_{\mathbf{w}}\ \mathbf{w}^T\mathbf{S}_b\mathbf{w} \quad \text{subject to} \quad \mathbf{w}^T\mathbf{S}_w\mathbf{w} = c$$
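A small sketch of the criterion and its scale invariance, using the same kind of toy data as above (all names and values are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], size=(20, 2))
X2 = rng.normal(loc=[3.0, 2.0], size=(25, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
Sb = np.outer(m1 - m2, m1 - m2)

def J(w):
    """Fisher's criterion: ratio of projected between- to within-class scatter."""
    return (w @ Sb @ w) / (w @ Sw @ w)

w = np.array([1.0, 0.5])
# Same value (up to floating-point rounding): J depends only on the direction of w.
print(J(w), J(5.0 * w), J(-0.1 * w))
```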
Lagrangian Optimization
We solve this constrained optimization problem using Lagrange multipliers. The Lagrangian is[^1]:

$$\mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^T\mathbf{S}_b\mathbf{w} - \lambda\left(\mathbf{w}^T\mathbf{S}_w\mathbf{w} - c\right)$$

Taking the derivative with respect to $\mathbf{w}$ and setting it to zero:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\mathbf{S}_b\mathbf{w} - 2\lambda\mathbf{S}_w\mathbf{w} = 0$$

This simplifies to:

$$\mathbf{S}_b\mathbf{w} = \lambda\mathbf{S}_w\mathbf{w}$$

Assuming $\mathbf{S}_w$ is invertible, we multiply both sides by $\mathbf{S}_w^{-1}$:

$$\mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w} = \lambda\mathbf{w}$$

This is a generalized eigenvalue problem where $\mathbf{w}$ is an eigenvector of $\mathbf{S}_w^{-1}\mathbf{S}_b$ with eigenvalue $\lambda$.
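The condition $\mathbf{S}_b\mathbf{w} = \lambda\mathbf{S}_w\mathbf{w}$ can also be solved numerically. A sketch using plain NumPy on the same toy data, forming $\mathbf{S}_w^{-1}\mathbf{S}_b$ via a linear solve rather than an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], size=(20, 2))
X2 = rng.normal(loc=[3.0, 2.0], size=(25, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
Sb = np.outer(m1 - m2, m1 - m2)

# Eigen-decompose S_w^{-1} S_b (computed with a solve instead of an explicit inverse).
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
w_opt = eigvecs[:, np.argmax(eigvals.real)].real   # eigenvector of the largest eigenvalue
print(eigvals.real)   # only one eigenvalue is (numerically) non-zero, since S_b has rank 1
```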
The between-class scatter matrix has a special structure:

$$\mathbf{S}_b = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$$

This is an outer product, making $\mathbf{S}_b$ a rank-1 matrix. When we apply $\mathbf{S}_b$ to any vector $\mathbf{w}$:

$$\mathbf{S}_b\mathbf{w} = (\mathbf{m}_1 - \mathbf{m}_2)\underbrace{(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}}_{\text{a scalar}}$$

The result is always in the direction of $(\mathbf{m}_1 - \mathbf{m}_2)$, scaled by the scalar $(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}$.
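A quick numerical check of this rank-1 structure, with made-up class means chosen only for illustration:

```python
import numpy as np

m1, m2 = np.array([0.1, -0.2]), np.array([2.9, 1.8])   # made-up class means
Sb = np.outer(m1 - m2, m1 - m2)                         # between-class scatter

print(np.linalg.matrix_rank(Sb))                        # 1: S_b is an outer product

w = np.array([1.0, 0.5])                                # any vector
lhs = Sb @ w                                            # S_b w
rhs = (m1 - m2) * ((m1 - m2) @ w)                       # (m1 - m2) scaled by a scalar
print(np.allclose(lhs, rhs))                            # True
```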
Substituting this into our eigenvalue equation:

$$\lambda\mathbf{S}_w\mathbf{w} = \mathbf{S}_b\mathbf{w} = (\mathbf{m}_1 - \mathbf{m}_2)\left[(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}\right]$$

and solving for $\mathbf{w}$:

$$\mathbf{w} = \frac{(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}}{\lambda}\,\mathbf{S}_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$

This shows that $\mathbf{w}$ is proportional to $\mathbf{S}_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$. Since we only care about the direction of $\mathbf{w}$, we can ignore the scalar constants $\lambda$ and $(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}$.
As such, the optimal projection direction is:

$$\mathbf{w}^* \propto \mathbf{S}_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$

where (see the numerical sketch after the list below):

- $\mathbf{S}_w = \mathbf{S}_1 + \mathbf{S}_2$ is the total within-class scatter matrix.
- $\mathbf{m}_1$ and $\mathbf{m}_2$ are the means of the two classes.
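Putting the result to work, a minimal sketch that computes $\mathbf{w} \propto \mathbf{S}_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$ directly and checks it against the leading eigenvector of $\mathbf{S}_w^{-1}\mathbf{S}_b$ (toy data and variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], size=(20, 2))
X2 = rng.normal(loc=[3.0, 2.0], size=(25, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
Sb = np.outer(m1 - m2, m1 - m2)

# Closed-form Fisher direction: w* is proportional to S_w^{-1} (m1 - m2).
w_closed = np.linalg.solve(Sw, m1 - m2)
w_closed /= np.linalg.norm(w_closed)

# Direction from the generalized eigenvalue problem, for comparison.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
w_eig = eigvecs[:, np.argmax(eigvals.real)].real
w_eig /= np.linalg.norm(w_eig)

print(np.isclose(abs(w_closed @ w_eig), 1.0))   # True: same direction up to sign
```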
Footnotes
[^1]: This is shamelessly taken from the lecture notes.