RNN

Training an RNN

From feedforward,

$$h = \sigma(W x + b)$$

To recurrent unit,

$$h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$$

To find the loss function, we use cross-entropy, where $y^{(t)}$ is a one-hot vector for the ground truth,

$$J^{(t)}(\theta) = -\sum_{w \in V} y^{(t)}_w \log \hat{y}^{(t)}_w = -\log \hat{y}^{(t)}_{x_{t+1}}$$

We average it to get the overall loss of the entire training set,

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)$$

Computing the loss and gradient for the entire corpus is too expensive, so in practice we compute them at the sentence / document level.
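
A minimal numpy sketch of the recurrent step and the per-step cross-entropy loss above; the parameter names (W_h, W_x, W_out), shapes, and initialization are illustrative assumptions, not something specified in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_in, d_h = 10, 8, 16                # vocab size, input dim, hidden dim (toy sizes)

# Illustrative parameters (names and shapes are assumptions for this sketch)
W_h = rng.normal(0, 0.1, (d_h, d_h))    # hidden-to-hidden
W_x = rng.normal(0, 0.1, (d_h, d_in))   # input-to-hidden
b = np.zeros(d_h)
W_out = rng.normal(0, 0.1, (V, d_h))    # hidden-to-vocab

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(h_prev, x_t):
    """One recurrent unit: h_t = sigma(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

def step_loss(h_t, target_id):
    """Cross-entropy against a one-hot ground truth: -log y_hat[target]."""
    y_hat = softmax(W_out @ h_t)
    return -np.log(y_hat[target_id])

# Average the per-step losses over a toy sequence
xs = rng.normal(size=(5, d_in))         # 5 time steps of inputs
targets = rng.integers(0, V, size=5)    # ground-truth next-word ids
h = np.zeros(d_h)
losses = []
for x_t, t_id in zip(xs, targets):
    h = rnn_step(h, x_t)
    losses.append(step_loss(h, t_id))
print("J(theta) =", np.mean(losses))    # overall loss = average of the J^(t)
```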

Generating text using RNN

  • We can just sample a word from the output distribution $\hat{y}^{(t)}$ to get the chosen word at step $t$, then feed it back in as the input for the next step
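
A tiny sketch of this sampling loop, assuming (as an illustration, not from the notes) that each word id indexes a column of an embedding-like matrix W_e:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_h = 10, 16
W_h = rng.normal(0, 0.1, (d_h, d_h))
W_e = rng.normal(0, 0.1, (d_h, V))      # one embedding-like column per word id (assumption)
W_out = rng.normal(0, 0.1, (V, d_h))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h, word = np.zeros(d_h), 0              # start from some initial word id
generated = [word]
for _ in range(8):
    h = np.tanh(W_h @ h + W_e[:, word])  # recurrent step on the previous word
    y_hat = softmax(W_out @ h)           # distribution over the vocab
    word = rng.choice(V, p=y_hat)        # sample the next word from y_hat
    generated.append(int(word))
print(generated)
```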

Problems with RNNs

  1. Vanishing gradient
    • When we do backprop of the loss $J^{(t)}$ w.r.t. an early hidden state $h_1$, we need to multiply the derivatives $\frac{\partial h_i}{\partial h_{i-1}}$ throughout the previous time steps
    • When the values of $\frac{\partial h_i}{\partial h_{i-1}}$ are small, then the resulting gradient will also be small
    • This also means that gradient signals from far away are effectively lost (see the sketch after this list)
      • RNN-LMs are better at learning from sequential recency rather than syntactic recency
  2. Recurrent computation is slow since it is sequential: step $t$ depends on step $t-1$, so it cannot be parallelized across time
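
A toy numpy demonstration of the vanishing gradient: the gradient w.r.t. an early hidden state is a product of per-step Jacobians $\frac{\partial h_i}{\partial h_{i-1}}$, and when their entries are small the product shrinks exponentially. The Jacobians here are random stand-ins, not from a trained RNN.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, T = 16, 30

# Per-step Jacobians dh_i/dh_{i-1}; small entries => small singular values
jacobians = [rng.normal(0, 0.05, (d_h, d_h)) for _ in range(T)]

grad = np.ones(d_h)              # pretend this is dJ^(T)/dh_T
norms = []
for J in reversed(jacobians):
    grad = J.T @ grad            # chain rule: multiply by each Jacobian going back in time
    norms.append(np.linalg.norm(grad))

# The gradient norm collapses toward 0 as we go further back in time
print([f"{n:.2e}" for n in norms[::5]])
```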

Benefits of RNNs

  1. Can process any length input
  2. Model size does not increase for longer input context
  3. Same weights applied at each step, so it is easy to plug and play

LSTM

Three gates:

  1. Forget gate: $f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f)$
  2. Input gate: $i_t = \sigma(W_i h_{t-1} + U_i x_t + b_i)$
  3. Output gate: $o_t = \sigma(W_o h_{t-1} + U_o x_t + b_o)$

Now, we compute how much to forget, update and output,

  1. New cell content, $\tilde{c}_t = \tanh(W_c h_{t-1} + U_c x_t + b_c)$
  2. Cell state, $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
  3. Hidden state, $h_t = o_t \odot \tanh(c_t)$
  • Note that the LSTM does not guarantee the absence of vanishing / exploding gradients; it just helps the model learn long-distance dependencies (see the sketch below)
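
A minimal numpy sketch of one LSTM cell step implementing the gate and state equations above; the parameter names, shapes, and initialization are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One (W, U, b) set per gate plus the candidate cell content (illustrative shapes)
def init():
    return rng.normal(0, 0.1, (d_h, d_h)), rng.normal(0, 0.1, (d_h, d_x)), np.zeros(d_h)

(W_f, U_f, b_f), (W_i, U_i, b_i), (W_o, U_o, b_o), (W_c, U_c, b_c) = init(), init(), init(), init()

def lstm_step(h_prev, c_prev, x_t):
    f_t = sigmoid(W_f @ h_prev + U_f @ x_t + b_f)      # forget gate
    i_t = sigmoid(W_i @ h_prev + U_i @ x_t + b_i)      # input gate
    o_t = sigmoid(W_o @ h_prev + U_o @ x_t + b_o)      # output gate
    c_tilde = np.tanh(W_c @ h_prev + U_c @ x_t + b_c)  # new cell content
    c_t = f_t * c_prev + i_t * c_tilde                 # cell state: keep some old, write some new
    h_t = o_t * np.tanh(c_t)                           # hidden state: read some of the cell
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):                  # run over a toy sequence
    h, c = lstm_step(h, c, x_t)
print(h.shape, c.shape)
```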

Against vanishing gradient

  • A possible solution is to add more direct connections, e.g. as in ResNet
  • Instead of calculating $\frac{\partial F(x)}{\partial x}$, which might be very very very small
  • We calculate $\frac{\partial (F(x) + x)}{\partial x} = \frac{\partial F(x)}{\partial x} + 1$ (a numeric check follows this list)
  • This is because the forward pass is $F(x) + x$ instead of $F(x)$
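
A tiny numeric check of this point, using a made-up scalar layer F so the two gradients can be compared directly.

```python
import numpy as np

# Toy "layer": F(x) = 0.01 * tanh(x), so dF/dx is tiny everywhere
def F(x):
    return 0.01 * np.tanh(x)

def num_grad(f, x, eps=1e-6):
    """Central finite-difference gradient."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 0.5
print("dF/dx        =", num_grad(F, x))                    # ~0.008, nearly vanishes
print("d(F(x)+x)/dx =", num_grad(lambda x: F(x) + x, x))   # ~1.008, the +1 keeps the signal
```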

Sequence modelling with RNNs

  1. Sequence tagging (POS tagging / NER tagging), just apply softmax at each step: $\hat{y}_t = \mathrm{softmax}(W h_t)$
  2. Sequence classification
    • We can take the final hidden state $h_T$ as the summary of the sentence → $\hat{y} = \mathrm{softmax}(W h_T)$
    • Or we can take max/average pooling over $h_1, \dots, h_T$ → $\hat{y} = \mathrm{softmax}(W h_{\mathrm{pooled}})$ (see the sketch after this list)
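
A short numpy sketch contrasting the two setups; the hidden states here are random placeholders standing in for the output of any RNN, and the weight names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_h, n_tags, n_classes = 6, 16, 5, 3
H = rng.normal(size=(T, d_h))            # pretend these are RNN hidden states h_1..h_T

def softmax(Z, axis=-1):
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

# 1. Sequence tagging: one softmax per time step
W_tag = rng.normal(0, 0.1, (d_h, n_tags))
tag_probs = softmax(H @ W_tag)           # shape (T, n_tags): a tag distribution per token

# 2. Sequence classification: summarize the sentence, then one softmax
W_cls = rng.normal(0, 0.1, (d_h, n_classes))
h_final = H[-1]                          # option (a): final hidden state
h_mean = H.mean(axis=0)                  # option (b): average pooling
h_max = H.max(axis=0)                    # option (b'): element-wise max pooling
print(softmax(h_final @ W_cls), softmax(h_mean @ W_cls), softmax(h_max @ W_cls))
```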

Bidirectional RNNs

  • We have two RNNs with their own weight matrices: a forward one producing $\overrightarrow{h_t}$ and a backward one producing $\overleftarrow{h_t}$
  • To get the hidden state, we concatenate the backward and forward hidden states: $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$ (sketched after this list)
  • Only applicable when we have access to the entire input sequence
  • Bidirectionality is powerful for encoding
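
A small sketch of a bidirectional pass using a toy tanh RNN: run one RNN left-to-right, another right-to-left, and concatenate the hidden states. All weights and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_x, d_h = 5, 8, 16
X = rng.normal(size=(T, d_x))

def init():
    return rng.normal(0, 0.1, (d_h, d_h)), rng.normal(0, 0.1, (d_h, d_x)), np.zeros(d_h)

def run_rnn(X, W_h, W_x, b):
    """Simple tanh RNN over the whole sequence; returns all hidden states."""
    h, hs = np.zeros(d_h), []
    for x_t in X:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hs.append(h)
    return np.stack(hs)

fwd = run_rnn(X, *init())                 # forward pass: left to right
bwd = run_rnn(X[::-1], *init())[::-1]     # backward pass: right to left, re-aligned to time order
H = np.concatenate([fwd, bwd], axis=-1)   # h_t = [forward h_t ; backward h_t]
print(H.shape)                            # (T, 2 * d_h)
```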

Stacked RNNs

  • Basically multi-layer RNNs
  • The hidden states from RNN layer $i$ are the inputs to RNN layer $i+1$ (see the sketch after this list)
  • In deeper layers (e.g. 8 layers), we need skip-connections
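
A sketch of stacking, where each layer's hidden states feed the next layer and a skip-connection is added between layers of equal width; the toy RNN and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_x, d_h, n_layers = 5, 8, 16, 3
X = rng.normal(size=(T, d_x))

def run_rnn(X, d_in):
    """Toy tanh RNN layer returning all T hidden states."""
    W_h = rng.normal(0, 0.1, (d_h, d_h))
    W_x = rng.normal(0, 0.1, (d_h, d_in))
    h, hs = np.zeros(d_h), []
    for x_t in X:
        h = np.tanh(W_h @ h + W_x @ x_t)
        hs.append(h)
    return np.stack(hs)

inputs = X
for layer in range(n_layers):
    outputs = run_rnn(inputs, inputs.shape[-1])   # layer i's hidden states...
    if layer > 0:
        outputs = outputs + inputs                # skip-connection (same width after layer 0)
    inputs = outputs                              # ...become layer i+1's inputs
print(inputs.shape)                               # (T, d_h)
```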

Machine Translation

Given a foreign sentence $x$, find the English sentence $\hat{y} = \arg\max_y P(y \mid x)$

Faithfulness modelling

  • Goal is to compute $P(x \mid y)$ from a bitext corpus
  1. Consider sentence-pairs to compute $P(x \mid y)$?
    • Same problem with n-grams, sparsity
  2. Consider word-pairs of the sentences and then take a conditional independence assumption to compute $P(x \mid y)$? (a toy count-based sketch follows this list)
    • This is what we do in word alignment
  3. Consider phrase-pairs to compute $P(x \mid y)$?
    • Phrasal alignment
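
A toy count-based sketch of the word-pair idea: estimate $P(x_{\text{word}} \mid y_{\text{word}})$ from sentence-pair co-occurrences in a made-up bitext. Real word alignment (e.g. IBM Model 1) refines such counts with EM; this only shows the intuition.

```python
from collections import defaultdict

# Toy bitext (made up for illustration): (foreign sentence, English sentence) pairs
bitext = [
    ("la maison bleue", "the blue house"),
    ("la maison", "the house"),
    ("la fleur bleue", "the blue flower"),
]

# Crude estimate of P(x_word | y_word) by counting co-occurrences within sentence pairs
pair_counts = defaultdict(float)
y_counts = defaultdict(float)
for x_sent, y_sent in bitext:
    for y_w in y_sent.split():
        for x_w in x_sent.split():
            pair_counts[(x_w, y_w)] += 1.0
            y_counts[y_w] += 1.0

def p_x_given_y(x_w, y_w):
    return pair_counts[(x_w, y_w)] / y_counts[y_w]

print(p_x_given_y("maison", "house"))   # relatively high
print(p_x_given_y("bleue", "house"))    # lower

# With a conditional independence assumption, a sentence-level faithfulness score
# could then be approximated as a product of such word-pair probabilities.
```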

Seq2Seq

  • Two models are put together: an encoder and a decoder
  • Encoder RNN produces an encoding of the source sentence, which provides the initial hidden state for the decoder
  • Decoder RNN is a conditional language model that generates target sentence, conditioned on the encoding
  • Training objective: $J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log \hat{y}_t[y^*_t]$, where $\hat{y}_t$ is the softmax-ed output of the decoder RNN and $y^*_t$ is the ground-truth target word at step $t$
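
A minimal PyTorch sketch of the encoder-decoder setup; the GRU choice, the sizes, and the teacher-forced toy batch are assumptions for illustration, since the notes only specify an encoder RNN, a decoder RNN, and the cross-entropy objective above.

```python
import torch
import torch.nn as nn

V_src, V_tgt, d_emb, d_h = 50, 60, 32, 64   # toy vocab and layer sizes (assumptions)

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(V_src, d_emb)
        self.tgt_emb = nn.Embedding(V_tgt, d_emb)
        self.encoder = nn.GRU(d_emb, d_h, batch_first=True)
        self.decoder = nn.GRU(d_emb, d_h, batch_first=True)
        self.out = nn.Linear(d_h, V_tgt)

    def forward(self, src_ids, tgt_in_ids):
        _, h_enc = self.encoder(self.src_emb(src_ids))               # encoding of the source
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in_ids), h_enc)   # decoder conditioned on it
        return self.out(dec_out)                                     # logits over target vocab

model = Seq2Seq()
src = torch.randint(0, V_src, (2, 7))       # toy batch: 2 source sentences of length 7
tgt = torch.randint(0, V_tgt, (2, 6))       # toy target sentences of length 6
logits = model(src, tgt[:, :-1])            # teacher forcing: feed the gold prefix
# Cross-entropy of the softmax-ed outputs against the shifted gold targets = objective above
loss = nn.functional.cross_entropy(logits.reshape(-1, V_tgt), tgt[:, 1:].reshape(-1))
print(loss.item())
```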

Evaluating MT

  • BLEU compares machine-written translation to one or several human-written translations
  • It gives a similarity score based on n-gram overlap, plus a penalty on system translations that are too short
  • Problem: n-gram overlap is not a good measure of translation quality…
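
A simplified single-reference BLEU-style score to make the n-gram plus brevity-penalty idea concrete; real BLEU is computed at the corpus level and supports multiple references, so this is only an approximation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=4):
    """Clipped n-gram precision (geometric mean over n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())   # clipped matches
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)            # smooth zeros for the toy demo
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

ref = "the cat sat on the mat".split()
print(simple_bleu("the cat sat on the mat".split(), ref))   # 1.0: exact match
print(simple_bleu("the cat".split(), ref))                  # heavily penalized for being too short
```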

Pros and Cons of MT

Pros

  1. Better performance, in terms of fluency, context, and phrase similarities
  2. Single neural network, optimized end to end
  3. No need for feature engineering, same for all language pairs

Cons

  1. Less interpretable = hard to debug
  2. Difficult to control in terms of safety

Other Seq2Seq problems

  1. Summarization (long text → short text)
  2. Dialogue (previous utterance → next utterance)
  3. Parsing (input text → output parse tree)
  4. Code generation (natural language → Python code)
  5. Segmentation (input text → output tag sequence)