Linear Regression

Crucial components:

  • Input vector: $\mathbf{x} \in \mathbb{R}^d$ (one entry per feature)
  • Weight vector (one per feature): $\mathbf{w} \in \mathbb{R}^d$, plus a bias $b$
    • Note that the weight vector tells how important each feature is.
  • And it outputs a predicted value $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$ (a continuous number, not a class)
  • Objective function: minimize the residual sum of squares (RSS), $\mathrm{RSS}(\mathbf{w}) = \sum_{i=1}^{N} (y_i - \mathbf{w}^\top \mathbf{x}_i - b)^2$

TODO: I should look through the derivation of the derivative and try it by hand.
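For reference, a sketch of that derivation in matrix form (absorbing the bias by appending a constant-1 feature to each $\mathbf{x}_i$, so $X$ is the $N \times (d+1)$ design matrix):

$$\mathrm{RSS}(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|^2 \quad\Longrightarrow\quad \nabla_{\mathbf{w}} \mathrm{RSS} = -2 X^\top (\mathbf{y} - X\mathbf{w})$$

Setting the gradient to zero gives the normal equations $X^\top X \mathbf{w} = X^\top \mathbf{y}$, so $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$ when $X^\top X$ is invertible.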

Logistic Regression (Single Class)

  • We do the same as above, but pass the linear output through a sigmoid: $\hat{y} = \sigma(\mathbf{w}^\top \mathbf{x} + b)$, where $\sigma(z) = 1 / (1 + e^{-z})$, so the output is a probability in $(0, 1)$
  • We set the classification threshold at 0.5 (predict the positive class if $\hat{y} \ge 0.5$)
  • Objective function: cross-entropy error, $E(\mathbf{w}) = -\sum_{i=1}^{N} \left[ y_i \ln \hat{y}_i + (1 - y_i) \ln(1 - \hat{y}_i) \right]$
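A minimal NumPy sketch of the forward pass, loss, and thresholding (function and variable names are my own, not from these notes):

```python
import numpy as np

def sigmoid(z):
    # Squash the linear score into a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # Linear model followed by the sigmoid.
    return sigmoid(X @ w + b)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Average negative log-likelihood; eps guards against log(0).
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Threshold the probability at 0.5 to get a hard class label.
X = np.array([[0.5, 1.2], [-1.0, 0.3]])
w = np.array([0.8, -0.4])
b = 0.1
labels = (predict_proba(X, w, b) >= 0.5).astype(int)
```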

Logistic Regression (Multi-Class)

  • Instead of the sigmoid function, we use softmax: $\hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$, where $z_k = \mathbf{w}_k^\top \mathbf{x} + b_k$
  • So our objective function is the multi-class cross entropy, $E = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \ln \hat{y}_{ik}$, with $y_{ik}$ the one-hot target
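A small NumPy sketch of a numerically stable softmax; subtracting the max logit before exponentiating is a standard trick to avoid overflow and does not change the result:

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the output is unchanged
    # because softmax is invariant to adding a constant to all logits.
    z = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)  # sums to 1; largest logit gets largest probability
```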

Gradient Descent

  1. Batch gradient descent: uses the entire dataset to compute each gradient update.
  2. Stochastic gradient descent: uses a single data point per update, so updates are cheap but noisy.
  3. Mini-batch gradient descent: combines both by using a small random subset per update, trading the stability of batch GD against the speed of SGD (see the sketch below).
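A minimal sketch of mini-batch gradient descent on the linear-regression RSS from above; the learning rate, batch size, and epoch count here are arbitrary example values:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, batch_size=32, epochs=100):
    # Mini-batch gradient descent on the mean squared residuals.
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the batch-mean squared error:
            # (2 / |B|) * Xb^T (Xb w - yb)
            grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)
            w -= lr * grad
    return w
```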

Backpropagation

  • Work backward from the last layer, using the chain rule to propagate the error gradient layer by layer
  • Remember that the weights sit between layers: $W^{(l)}$ connects layer $l$ to layer $l+1$, so its gradient involves the activations of layer $l$ and the error of layer $l+1$
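In the usual notation (my conventions, not from these notes): let $a^{(l)}$ be the activations of layer $l$, $z^{(l+1)} = W^{(l)} a^{(l)}$ the pre-activations, $f$ the activation function, and $\delta^{(l)}$ the error at layer $l$. Then backprop is just the chain rule applied layer by layer:

$$\delta^{(l)} = \big(W^{(l)}\big)^\top \delta^{(l+1)} \odot f'\big(z^{(l)}\big), \qquad \frac{\partial E}{\partial W^{(l)}} = \delta^{(l+1)} \big(a^{(l)}\big)^\top$$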

Regularization

  1. Dropout — makes the model learn a redundant representation, and has an ensemble effect (each mini-batch trains a different sub-network)
  2. L1/L2 regularization — discourages the weights from being too large by adding a penalty to the objective: $E(\mathbf{w}) + \lambda \sum_j |w_j|^p$, where $\lambda$ is the penalty hyperparameter and $p = 2$ (L2/ridge) or $p = 1$ (L1/lasso, which also pushes weights to exactly zero)
  3. Gradient clipping, where we put a threshold on the gradient update. If it is above the threshold, we scale it down (see the sketch below).
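A minimal NumPy sketch of clip-by-norm gradient clipping (the threshold of 5.0 is an arbitrary example):

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    # If the gradient's L2 norm exceeds max_norm, rescale it so the
    # norm equals max_norm; otherwise leave it untouched.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])   # norm = 50
g_clipped = clip_by_norm(g)  # rescaled to norm 5, direction preserved
```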