Instead of trusting the training loss alone as a metric for model performance, we want a theoretically rigorous way to ensure that learning happens.

Learning is defined as finding a function that maps inputs $x \in \mathcal{X}$ to outputs $y \in \mathcal{Y}$. The function is $f_\theta: \mathcal{X} \to \mathcal{Y}$, where $\theta$ is the model parameter.

Some important terms:

  1. Loss Function $L(y, f(x))$: measures how wrong your prediction is.
    • The value $L(y, f(x))$ is the penalty of predicting $f(x)$ instead of $y$.
  2. Risk Functional $R[f] = \mathbb{E}_{(x, y) \sim P}\left[ L(y, f(x)) \right]$: measures the expected loss over the entire data distribution $P(x, y)$.
    • Note that $R[f]$ is the objective function to minimize.
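To make these two terms concrete, here is a minimal sketch that estimates the risk $R[f]$ by Monte Carlo sampling. The squared loss, the linear model $f_\theta(x) = \theta x$, and the data distribution ($x \sim \mathcal{N}(0,1)$, $y = 2x + \text{noise}$) are all illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def squared_loss(y, y_hat):
    """L(y, f(x)): the penalty of predicting y_hat instead of y."""
    return (y - y_hat) ** 2

def f(x, theta):
    """A hypothetical linear model f_theta(x) = theta * x."""
    return theta * x

# Approximate R[f] = E[L(y, f(x))] by sampling from an assumed
# data distribution: x ~ N(0, 1), y = 2x + Gaussian noise.
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(scale=0.1, size=x.shape)

risk_estimate = squared_loss(y, f(x, theta=2.0)).mean()
print(risk_estimate)  # ≈ 0.01, the noise variance, since theta matches the truth
```

In practice we never have access to the distribution like this; that is exactly the problem the empirical risk below addresses.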

So, we want to find the best function $f^*$ that minimizes the expected loss:

$$f^* = \arg\min_{f} R[f]$$

Basically, we want to search across the whole hypothesis space $\mathcal{F}$. But we can't do this, since we don't know the true distribution $P(x, y)$. So we have to approximate the risk with the empirical risk over a finite sample $\{(x_i, y_i)\}_{i=1}^{n}$:

$$R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i))$$

We will then minimize $R_{emp}[f]$. This is called empirical risk minimization (ERM).
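A minimal ERM sketch, under the same illustrative assumptions as before (squared loss, linear models $f_\theta(x) = \theta x$): we only see a finite sample, and we pick the parameter that minimizes the empirical risk over a simple grid of candidates:

```python
import numpy as np

rng = np.random.default_rng(1)

# A finite training sample; the true distribution is never observed.
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.1, size=x.shape)

def empirical_risk(theta):
    """R_emp[f_theta] = (1/n) * sum_i L(y_i, f_theta(x_i))."""
    return np.mean((y - theta * x) ** 2)

# ERM: choose the theta in our (grid-discretized) hypothesis space
# that minimizes the empirical risk.
thetas = np.linspace(-5, 5, 1001)
best_theta = thetas[np.argmin([empirical_risk(t) for t in thetas])]
print(best_theta)  # close to the true slope 2.0
```

The grid search is only for transparency; any optimizer (e.g. gradient descent on $R_{emp}$) plays the same role.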

Furthermore, the Vapnik-Chervonenkis inequality gives an upper bound on the true risk $R[f]$: with probability at least $1 - \eta$,

$$R[f] \le R_{emp}[f] + \Phi\!\left(\frac{n}{h}\right)$$

where $\Phi$ is a monotonic function of the ratio of sample size $n$ to VC dimension $h$, and we can expand it to

$$R[f] \le R_{emp}[f] + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}$$

So now, we need to minimize both the empirical risk $R_{emp}[f]$ and the capacity term $\Phi(n/h)$.
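The capacity term is easy to evaluate numerically. A small sketch, assuming the standard expanded form of the VC confidence term with confidence level $\eta = 0.05$ and VC dimension $h = 10$ (both illustrative choices):

```python
import numpy as np

def vc_confidence(n, h, eta=0.05):
    """Capacity term of the VC bound:
    sqrt((h * (ln(2n/h) + 1) - ln(eta/4)) / n)."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n)

# The bound tightens as the sample size n grows and loosens
# as the VC dimension h (model capacity) grows.
for n in (100, 1_000, 10_000):
    print(n, vc_confidence(n, h=10))
```

This makes the trade-off visible: a richer model class (larger $h$) can lower $R_{emp}$ but inflates the capacity term, which is the tension that structural risk minimization manages.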