RNN
Training an RNN
From a feedforward unit, $h = f(Wx + b)$,
to a recurrent unit, $h_t = f(W_h h_{t-1} + W_e x_t + b)$, where the same weights $W_h, W_e$ are applied at every time step.
To find the loss function, we use cross-entropy at each step $t$, $J^{(t)}(\theta) = -\sum_{w \in V} y_t^{(w)} \log \hat{y}_t^{(w)} = -\log \hat{y}_t^{(x_{t+1})}$, i.e. the negative log-probability the model assigns to the true next word.
We average it to get the overall loss over the entire training set, $J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)$.
Computing the loss and gradient for the entire corpus at once is too expensive, so in practice we compute them over a batch of sentences and update the parameters with SGD.
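These steps can be made concrete with a small sketch. Below is a minimal NumPy illustration of one sequence (all sizes, the random weights, and the toy token ids are assumptions for the example, not anything from the notes): it runs the recurrent unit over the sequence and averages the per-step cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, T = 10, 8, 5                      # vocab size, hidden size, sequence length

# parameters (randomly initialised just for the sketch)
W_h = rng.normal(0, 0.1, (d, d))        # hidden-to-hidden weights
W_e = rng.normal(0, 0.1, (d, V))        # input-to-hidden weights (one-hot inputs)
W_o = rng.normal(0, 0.1, (V, d))        # hidden-to-output weights
b_h, b_o = np.zeros(d), np.zeros(V)

tokens = rng.integers(0, V, T + 1)      # toy token ids; tokens[t+1] is the target at step t

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

h = np.zeros(d)
losses = []
for t in range(T):
    x = np.eye(V)[tokens[t]]                        # one-hot input at step t
    h = np.tanh(W_h @ h + W_e @ x + b_h)            # recurrent unit: h_t = f(W_h h_{t-1} + W_e x_t + b)
    y_hat = softmax(W_o @ h + b_o)                  # output distribution over the vocab
    losses.append(-np.log(y_hat[tokens[t + 1]]))    # J^(t): cross-entropy against the true next token

print("average loss J(theta):", np.mean(losses))    # J = (1/T) * sum_t J^(t)
```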
Generating text using RNN
- We can just sample a word from the output distribution $\hat{y}_t$ to get the chosen word at step $t$, then feed it back in as the input at step $t+1$
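A sampling step might look like the following NumPy sketch (the stand-in Dirichlet distribution simply plays the role of the RNN's softmax output here, it is not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 10
y_hat = rng.dirichlet(np.ones(V))        # stand-in for the RNN's softmax output at step t

next_token = rng.choice(V, p=y_hat)      # sample the chosen word at step t
next_input = np.eye(V)[next_token]       # feed it back in as the input at step t+1
print("sampled token id:", next_token)
```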
Problems with RNNs
- Vanishing gradient
    - When we do backprop of the loss at step $t$ w.r.t. an earlier hidden state $h_k$, we need to multiply the derivatives $\frac{\partial h_i}{\partial h_{i-1}}$ through all the intermediate time steps
    - When the values of these derivatives are small, the resulting gradient will also be small (it shrinks exponentially with the distance $t - k$)
    - This also means that gradient signals from faraway steps are effectively lost (a numeric sketch follows this list)
    - As a result, RNN-LMs are better at learning from sequential recency rather than syntactic recency
- Recurrent computation is slow since it is sequential: each step must wait for the previous one, so it cannot be parallelized across time steps
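To see the vanishing-gradient point numerically: if each step contributes a factor smaller than 1 to the chain rule, the product shrinks exponentially with the distance between time steps. A toy sketch (the 0.9 per-step factor is an arbitrary assumption):

```python
# each backprop step multiplies in a factor |dh_t / dh_{t-1}|; if it is < 1,
# the contribution from `distance` steps back decays exponentially
per_step = 0.9
for distance in (1, 10, 50, 100):
    print(distance, per_step ** distance)
# 1 -> 0.9, 10 -> ~0.35, 50 -> ~0.005, 100 -> ~2.7e-05: faraway signals are effectively lost
```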
Benefits of RNNs
- Can process any length input
- Model size does not increase for longer input context
- Same weights applied at each step, so it is easy to plug and play
LSTM
Three gates:
- Forget gate: $f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f)$
- Input gate: $i_t = \sigma(W_i h_{t-1} + U_i x_t + b_i)$
- Output gate: $o_t = \sigma(W_o h_{t-1} + U_o x_t + b_o)$
Now, we compute how much to forget, update and output,
- New cell content: $\tilde{c}_t = \tanh(W_c h_{t-1} + U_c x_t + b_c)$
- Cell state: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
- Hidden state: $h_t = o_t \odot \tanh(c_t)$ (a NumPy sketch of one step follows this list)
- Note that the LSTM does not guarantee there is no vanishing / exploding gradient; it just helps the model learn long-distance dependencies
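Putting the gate equations together, one LSTM step could be sketched like this in NumPy (shapes, the random initialisation, and the concatenated-input form of the weights are assumptions for the example; a real implementation would learn the weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 6, 8

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one weight matrix and bias per gate / candidate, acting on [h_{t-1}; x_t]
W_f, W_i, W_o, W_c = (rng.normal(0, 0.1, (d_hid, d_hid + d_in)) for _ in range(4))
b_f, b_i, b_o, b_c = (np.zeros(d_hid) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)              # forget gate
    i_t = sigmoid(W_i @ z + b_i)              # input gate
    o_t = sigmoid(W_o @ z + b_o)              # output gate
    c_tilde = np.tanh(W_c @ z + b_c)          # new cell content
    c_t = f_t * c_prev + i_t * c_tilde        # cell state: keep some old, write some new
    h_t = o_t * np.tanh(c_t)                  # hidden state: expose part of the cell
    return h_t, c_t

h, c = np.zeros(d_hid), np.zeros(d_hid)
for _ in range(5):                            # run a few steps on random inputs
    h, c = lstm_step(rng.normal(size=d_in), h, c)
print(h.shape, c.shape)
```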
Against vanishing gradient
- A possible solution is to add more direct connections, e.g. residual (skip) connections as in ResNet
- Instead of backpropagating through $F(x)$ alone, where $\frac{\partial F(x)}{\partial x}$ might be very, very small
- We calculate $\frac{\partial (F(x) + x)}{\partial x} = \frac{\partial F(x)}{\partial x} + 1$
- This is because the forward pass is $x + F(x)$ instead of just $F(x)$, so the identity term keeps the gradient from vanishing
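A quick autograd check of this argument: gradients through a deep stack of tanh layers shrink, while adding the identity ($x + F(x)$ at every layer) keeps them alive. This is only an illustrative PyTorch sketch; the layer count, sizes, and default initialisation are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)
depth, d = 30, 16
layers = [torch.nn.Linear(d, d) for _ in range(depth)]

def run(x, residual):
    for layer in layers:
        fx = torch.tanh(layer(x))
        x = x + fx if residual else fx   # with residual: forward pass is x + F(x)
    return x.sum()

for residual in (False, True):
    x = torch.randn(d, requires_grad=True)
    run(x, residual).backward()
    print("residual" if residual else "plain   ", x.grad.norm().item())
# the plain stack's input gradient is typically far smaller than the residual stack's
```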
Sequence modelling with RNNs
- Sequence tagging (POS tagging / NER tagging): just apply a softmax over the tag set at every time step, $\hat{y}_t = \mathrm{softmax}(W h_t + b)$
- Sequence classification
    - We can take the final hidden state, $h_T$, as the summary of the sentence → classify it with $\hat{y} = \mathrm{softmax}(W h_T + b)$
    - Or we can take max/average pooling over all the hidden states → classify the pooled vector instead (both options are sketched after this list)
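Both the tagging and the classification heads can be sketched on top of a vanilla RNN. The sizes, tag set, toy input, and the choice of max pooling below are assumptions for illustration only.

```python
import torch

torch.manual_seed(0)
d_in, d_hid, n_tags, n_classes, T = 12, 16, 5, 3, 7

rnn = torch.nn.RNN(input_size=d_in, hidden_size=d_hid, batch_first=True)
tag_head = torch.nn.Linear(d_hid, n_tags)       # per-step head for POS/NER tagging
cls_head = torch.nn.Linear(d_hid, n_classes)    # sentence-level head for classification

x = torch.randn(1, T, d_in)                     # one toy sentence of T steps
hs, h_last = rnn(x)                             # hs: (1, T, d_hid), h_last: (1, 1, d_hid)

# Sequence tagging: softmax over tags at every time step
tag_probs = torch.softmax(tag_head(hs), dim=-1)            # (1, T, n_tags)

# Sequence classification, option 1: final hidden state as the sentence summary
cls_from_last = torch.softmax(cls_head(h_last[-1]), dim=-1)

# Sequence classification, option 2: max pooling over all hidden states
pooled = hs.max(dim=1).values
cls_from_pool = torch.softmax(cls_head(pooled), dim=-1)

print(tag_probs.shape, cls_from_last.shape, cls_from_pool.shape)
```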
Bidirectional RNNs
- We have two new weight matrices: one for the RNN that reads the input left-to-right and one for the RNN that reads it right-to-left
- To get the hidden state at each step, we concatenate the backward and forward hidden states: $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$ (sketched after this list)
- Only applicable when we have access to the entire input sequence
- Bidirectionality is powerful for encoding
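One way to sketch the two directions and the concatenation is with two separate RNNs, as below (an illustrative PyTorch snippet with arbitrary sizes; `torch.nn.RNN(..., bidirectional=True)` does the same wiring internally):

```python
import torch

torch.manual_seed(0)
d_in, d_hid, T = 12, 16, 7

fwd = torch.nn.RNN(d_in, d_hid, batch_first=True)    # forward RNN: its own weights
bwd = torch.nn.RNN(d_in, d_hid, batch_first=True)    # backward RNN: separate weights

x = torch.randn(1, T, d_in)                           # toy input sequence
h_fwd, _ = fwd(x)                                     # reads the sentence left to right
h_bwd, _ = bwd(torch.flip(x, dims=[1]))               # reads it right to left
h_bwd = torch.flip(h_bwd, dims=[1])                   # re-align so both cover the same step t

h = torch.cat([h_fwd, h_bwd], dim=-1)                 # concatenated hidden states, (1, T, 2 * d_hid)
print(h.shape)
```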
Stacked RNNs
- Basically multi-layer RNNs
- The hidden states from RNN layer $i$ are the inputs to RNN layer $i+1$
- In deeper stacks (e.g. 8 layers), we need skip-connections
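A rough sketch of the stacking (dimensions and the toy input are arbitrary; the skip-connections mentioned above are not shown):

```python
import torch

torch.manual_seed(0)
d_in, d_hid, T = 12, 16, 7

# layer 1 consumes the input; layer 2 consumes layer 1's hidden states
layer1 = torch.nn.RNN(d_in, d_hid, batch_first=True)
layer2 = torch.nn.RNN(d_hid, d_hid, batch_first=True)

x = torch.randn(1, T, d_in)
h1, _ = layer1(x)
h2, _ = layer2(h1)          # hidden states of layer i are the inputs to layer i+1
print(h1.shape, h2.shape)

# the same stack via the built-in num_layers argument
stacked = torch.nn.LSTM(d_in, d_hid, num_layers=2, batch_first=True)
out, _ = stacked(x)
print(out.shape)
```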
Machine Translation
Given a foreign sentence $x$, the goal is to find the best translation $y$, i.e. $\arg\max_y P(y \mid x)$.
Faithfulness modelling
- Goal is to compute the translation model $P(x \mid y)$ from a bitext (parallel) corpus
- Consider sentence-pairs $(x, y)$ to compute $P(x \mid y)$?
    - Same problem as with n-grams: sparsity
- Consider word-pairs of the two sentences and then take a conditional independence assumption to compute $P(x \mid y)$?
    - This is what we do in word alignment
- Consider phrase-pairs to compute $P(x \mid y)$?
    - Phrasal alignment
Seq2Seq
- Two models are put together: an encoder and a decoder
- Encoder RNN produces an encoding of the source sentence, provides initial hidden state for decoder
- Decoder RNN is a conditional language model that generates target sentence, conditioned on the encoding
- Training objective: $J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log \hat{y}_t^{(y_t)}$, where $\hat{y}_t$ is the softmax-ed output of the decoder RNN at step $t$
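A minimal encoder-decoder sketch with teacher forcing, matching the objective above. The vocabulary sizes, dimensions, toy sentences, and the absence of attention, batching, and special tokens are all simplifications assumed for the example.

```python
import torch

torch.manual_seed(0)
src_V, tgt_V, d = 20, 22, 32

src_emb = torch.nn.Embedding(src_V, d)
tgt_emb = torch.nn.Embedding(tgt_V, d)
encoder = torch.nn.LSTM(d, d, batch_first=True)
decoder = torch.nn.LSTM(d, d, batch_first=True)
out_head = torch.nn.Linear(d, tgt_V)
loss_fn = torch.nn.CrossEntropyLoss()            # cross-entropy over the target vocab

src = torch.randint(0, src_V, (1, 6))            # toy source sentence
tgt = torch.randint(0, tgt_V, (1, 7))            # toy target sentence

# Encoder: encode the source; its final state initialises the decoder
_, (h, c) = encoder(src_emb(src))

# Decoder: conditional LM over the target, fed the gold previous words (teacher forcing)
dec_out, _ = decoder(tgt_emb(tgt[:, :-1]), (h, c))
logits = out_head(dec_out)                       # (1, T-1, tgt_V); softmax happens inside the loss

# J(theta) = -(1/T) * sum_t log P(y_t | y_<t, x)
loss = loss_fn(logits.reshape(-1, tgt_V), tgt[:, 1:].reshape(-1))
print(loss.item())
```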
Evaluating MT
- BLEU compares machine-written translation to one or several human-written translations
- It gives a similarity score based on n-gram precision, plus a brevity penalty for system translations that are too short
- Problem: n-gram overlap is a crude measure of translation quality; a good translation can still get a low BLEU score if it uses different wording from the references
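As a rough illustration of the n-gram precision + brevity penalty recipe, here is a simplified single-reference, uniform-weight score in plain Python (real BLEU implementations such as sacreBLEU differ in smoothing, multi-reference handling, and other details):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    # clipped n-gram precisions for n = 1..max_n
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)   # floor avoids log(0) in this toy version
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # brevity penalty: punish system translations that are too short
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

ref = "the cat sat on the mat".split()
hyp = "the cat is on the mat".split()
print(simple_bleu(hyp, ref))
```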
Pros and Cons of Neural MT
Pros
- Better performance, in terms of fluency, context, and phrase similarities
- Single neural network, optimized end to end
- No need for feature engineering, same for all language pairs
Cons
- Less interpretable = hard to debug
- Difficult to control in terms of safety
Other Seq2Seq problems
- Summarization (long text → short text)
- Dialogue (previous utterance → next utterance)
- Parsing (input text → output parse tree)
- Code generation (natural language → python code)
- Segmentation (input text → output tag sequence)