Linear Regression

Crucial components:

  • Input vector: $\mathbf{x} \in \mathbb{R}^d$ (one entry per feature)
  • Weight vector (one per feature): $\mathbf{w} \in \mathbb{R}^d$, plus a bias $b$
    • Note that the weight vector tells how important each feature is.
  • And it outputs a predicted value $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$ (a continuous number, not a class)
  • Objective function: minimize the residual sum of squares (RSS), $\mathrm{RSS}(\mathbf{w}) = \sum_{i=1}^{N} (y_i - \mathbf{w}^\top \mathbf{x}_i - b)^2$

TODO: I should look through the derivation of the derivative and try it by hand.
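For reference, a sketch of that derivation in matrix form (absorbing the bias by appending a constant-1 feature to each $\mathbf{x}_i$, so $X$ is the $N \times (d+1)$ design matrix):

$$\mathrm{RSS}(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|^2 \quad\Longrightarrow\quad \nabla_{\mathbf{w}} \mathrm{RSS} = -2 X^\top (\mathbf{y} - X\mathbf{w})$$

Setting the gradient to zero gives the normal equations $X^\top X \mathbf{w} = X^\top \mathbf{y}$, so $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$ when $X^\top X$ is invertible.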

Logistic Regression (Single Class)

  • We do the same as above, but pass the linear output through a sigmoid: $\hat{y} = \sigma(\mathbf{w}^\top \mathbf{x} + b)$, where $\sigma(z) = 1 / (1 + e^{-z})$, so the output is a probability in $(0, 1)$
  • We set the classification threshold at 0.5 (predict the positive class if $\hat{y} \ge 0.5$)
  • Objective function: cross-entropy error, $E(\mathbf{w}) = -\sum_{i=1}^{N} \left[ y_i \ln \hat{y}_i + (1 - y_i) \ln(1 - \hat{y}_i) \right]$
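A minimal NumPy sketch of the forward pass, loss, and thresholding (function and variable names are my own, not from these notes):

```python
import numpy as np

def sigmoid(z):
    # Squash the linear score into a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # Linear model followed by the sigmoid.
    return sigmoid(X @ w + b)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Average negative log-likelihood; eps guards against log(0).
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Threshold the probability at 0.5 to get a hard class label.
X = np.array([[0.5, 1.2], [-1.0, 0.3]])
w = np.array([0.8, -0.4])
b = 0.1
labels = (predict_proba(X, w, b) >= 0.5).astype(int)
```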

Logistic Regression (Multi-Class)

  • Instead of the sigmoid function, we use softmax: $\hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$, where $z_k = \mathbf{w}_k^\top \mathbf{x} + b_k$
  • So our objective function is the multi-class cross entropy, $E = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \ln \hat{y}_{ik}$, with $y_{ik}$ the one-hot target
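A small NumPy sketch of a numerically stable softmax; subtracting the max logit before exponentiating is a standard trick to avoid overflow and does not change the result:

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the output is unchanged
    # because softmax is invariant to adding a constant to all logits.
    z = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)  # sums to 1; largest logit gets largest probability
```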

Gradient Descent

  1. Batch gradient descent: uses the entire dataset to compute each gradient update.
  2. Stochastic gradient descent: uses a single data point per update, so updates are cheap but noisy.
  3. Mini-batch gradient descent: combines both by using a small random subset per update, trading the stability of batch GD against the speed of SGD (see the sketch below).
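A minimal sketch of mini-batch gradient descent on the linear-regression RSS from above; the learning rate, batch size, and epoch count here are arbitrary example values:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, batch_size=32, epochs=100):
    # Mini-batch gradient descent on the mean squared residuals.
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the batch-mean squared error:
            # (2 / |B|) * Xb^T (Xb w - yb)
            grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)
            w -= lr * grad
    return w
```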

Backpropagation

  • Work backward from the last layer, using the chain rule to propagate the error gradient layer by layer
  • Remember that the weights sit between layers: $W^{(l)}$ connects layer $l$ to layer $l+1$, so its gradient involves the activations of layer $l$ and the error of layer $l+1$
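In the usual notation (my conventions, not from these notes): let $a^{(l)}$ be the activations of layer $l$, $z^{(l+1)} = W^{(l)} a^{(l)}$ the pre-activations, $f$ the activation function, and $\delta^{(l)}$ the error at layer $l$. Then backprop is just the chain rule applied layer by layer:

$$\delta^{(l)} = \big(W^{(l)}\big)^\top \delta^{(l+1)} \odot f'\big(z^{(l)}\big), \qquad \frac{\partial E}{\partial W^{(l)}} = \delta^{(l+1)} \big(a^{(l)}\big)^\top$$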

Regularization

  1. Dropout — makes the model learn a redundant representation, and has an ensemble effect (each mini-batch trains a different sub-network)
  2. L1/L2 regularization — discourages the weights from being too large by adding a penalty to the objective: $E(\mathbf{w}) + \lambda \sum_j |w_j|^p$, where $\lambda$ is the penalty hyperparameter and $p = 2$ (L2/ridge) or $p = 1$ (L1/lasso, which also pushes weights to exactly zero)
  3. Gradient clipping, where we put a threshold on the gradient update. If it is above the threshold, we scale it down (see the sketch below).
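A minimal NumPy sketch of clip-by-norm gradient clipping (the threshold of 5.0 is an arbitrary example):

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    # If the gradient's L2 norm exceeds max_norm, rescale it so the
    # norm equals max_norm; otherwise leave it untouched.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])   # norm = 50
g_clipped = clip_by_norm(g)  # rescaled to norm 5, direction preserved
```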