Instead of trusting the training loss alone as a metric for model performance, we want a theoretically rigorous way to ensure that learning happens.

Learning is defined as finding a function that maps inputs $x \in \mathcal{X}$ to outputs $y \in \mathcal{Y}$. The function is $f_\theta: \mathcal{X} \to \mathcal{Y}$, where $\theta$ is the model parameter.

Some important terms:

  1. Loss Function $L(y, f(x))$: measures how wrong your prediction is.
    • The value $L(y, f(x))$ is the penalty of predicting $f(x)$ instead of $y$.
  2. Risk Functional $R[f] = \mathbb{E}_{(x, y) \sim P}\left[ L(y, f(x)) \right]$: measures the expected loss over the entire data distribution $P(x, y)$.
    • Note that $R[f]$ is the objective function to minimize.
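To make these two terms concrete, here is a minimal sketch that estimates the risk $R[f]$ by Monte Carlo sampling. The squared loss, the linear model $f_\theta(x) = \theta x$, and the data distribution ($x \sim \mathcal{N}(0,1)$, $y = 2x + \text{noise}$) are all illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def squared_loss(y, y_hat):
    """L(y, f(x)): the penalty of predicting y_hat instead of y."""
    return (y - y_hat) ** 2

def f(x, theta):
    """A hypothetical linear model f_theta(x) = theta * x."""
    return theta * x

# Approximate R[f] = E[L(y, f(x))] by sampling from an assumed
# data distribution: x ~ N(0, 1), y = 2x + Gaussian noise.
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(scale=0.1, size=x.shape)

risk_estimate = squared_loss(y, f(x, theta=2.0)).mean()
print(risk_estimate)  # ≈ 0.01, the noise variance, since theta matches the truth
```

In practice we never have access to the distribution like this; that is exactly the problem the empirical risk below addresses.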

So, we want to find the best function $f^*$ that minimizes the expected loss:

$$f^* = \arg\min_{f} R[f]$$

Basically, we want to search across the whole hypothesis space $\mathcal{F}$. But we can't do this, since we don't know the true distribution $P(x, y)$. So we have to approximate the risk with the empirical risk over a finite sample $\{(x_i, y_i)\}_{i=1}^{n}$:

$$R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i))$$

We will then minimize $R_{emp}[f]$. This is called empirical risk minimization (ERM).
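A minimal ERM sketch, under the same illustrative assumptions as before (squared loss, linear models $f_\theta(x) = \theta x$): we only see a finite sample, and we pick the parameter that minimizes the empirical risk over a simple grid of candidates:

```python
import numpy as np

rng = np.random.default_rng(1)

# A finite training sample; the true distribution is never observed.
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.1, size=x.shape)

def empirical_risk(theta):
    """R_emp[f_theta] = (1/n) * sum_i L(y_i, f_theta(x_i))."""
    return np.mean((y - theta * x) ** 2)

# ERM: choose the theta in our (grid-discretized) hypothesis space
# that minimizes the empirical risk.
thetas = np.linspace(-5, 5, 1001)
best_theta = thetas[np.argmin([empirical_risk(t) for t in thetas])]
print(best_theta)  # close to the true slope 2.0
```

The grid search is only for transparency; any optimizer (e.g. gradient descent on $R_{emp}$) plays the same role.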

Furthermore, the Vapnik-Chervonenkis inequality gives an upper bound on the true risk $R[f]$: with probability at least $1 - \eta$,

$$R[f] \le R_{emp}[f] + \Phi\!\left(\frac{n}{h}\right)$$

where $\Phi$ is a monotonic function of the ratio of sample size $n$ to VC dimension $h$, and we can expand it to

$$R[f] \le R_{emp}[f] + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}$$

So now, we need to minimize both the empirical risk $R_{emp}[f]$ and the capacity term $\Phi(n/h)$.
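The capacity term is easy to evaluate numerically. A small sketch, assuming the standard expanded form of the VC confidence term with confidence level $\eta = 0.05$ and VC dimension $h = 10$ (both illustrative choices):

```python
import numpy as np

def vc_confidence(n, h, eta=0.05):
    """Capacity term of the VC bound:
    sqrt((h * (ln(2n/h) + 1) - ln(eta/4)) / n)."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n)

# The bound tightens as the sample size n grows and loosens
# as the VC dimension h (model capacity) grows.
for n in (100, 1_000, 10_000):
    print(n, vc_confidence(n, h=10))
```

This makes the trade-off visible: a richer model class (larger $h$) can lower $R_{emp}$ but inflates the capacity term, which is the tension that structural risk minimization manages.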