RNN
Training an RNN
From a feedforward unit, $h = f(Wx + b)$,
to a recurrent unit, $h_t = f(W_h h_{t-1} + W_e x_t + b)$, where the same weights $W_h, W_e$ are applied at every time step.
To find the loss function, we use cross-entropy at each step $t$, $J^{(t)}(\theta) = -\sum_{w \in V} y_t^{(w)} \log \hat{y}_t^{(w)} = -\log \hat{y}_t^{(x_{t+1})}$, i.e. the negative log-probability the model assigns to the true next word.
We average it to get the overall loss over the entire training set, $J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)$.
Computing the loss and gradient for the entire corpus at once is too expensive, so in practice we compute them over a batch of sentences and update the parameters with SGD.
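These steps can be made concrete with a small sketch. Below is a minimal NumPy illustration of one sequence (all sizes, the random weights, and the toy token ids are assumptions for the example, not anything from the notes): it runs the recurrent unit over the sequence and averages the per-step cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, T = 10, 8, 5                      # vocab size, hidden size, sequence length

# parameters (randomly initialised just for the sketch)
W_h = rng.normal(0, 0.1, (d, d))        # hidden-to-hidden weights
W_e = rng.normal(0, 0.1, (d, V))        # input-to-hidden weights (one-hot inputs)
W_o = rng.normal(0, 0.1, (V, d))        # hidden-to-output weights
b_h, b_o = np.zeros(d), np.zeros(V)

tokens = rng.integers(0, V, T + 1)      # toy token ids; tokens[t+1] is the target at step t

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

h = np.zeros(d)
losses = []
for t in range(T):
    x = np.eye(V)[tokens[t]]                        # one-hot input at step t
    h = np.tanh(W_h @ h + W_e @ x + b_h)            # recurrent unit: h_t = f(W_h h_{t-1} + W_e x_t + b)
    y_hat = softmax(W_o @ h + b_o)                  # output distribution over the vocab
    losses.append(-np.log(y_hat[tokens[t + 1]]))    # J^(t): cross-entropy against the true next token

print("average loss J(theta):", np.mean(losses))    # J = (1/T) * sum_t J^(t)
```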
Generating text using RNN
- We can just sample a word from the output distribution $\hat{y}_t$ to get the chosen word at step $t$, then feed it back in as the input at step $t+1$
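A sampling step might look like the following NumPy sketch (the stand-in Dirichlet distribution simply plays the role of the RNN's softmax output here, it is not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 10
y_hat = rng.dirichlet(np.ones(V))        # stand-in for the RNN's softmax output at step t

next_token = rng.choice(V, p=y_hat)      # sample the chosen word at step t
next_input = np.eye(V)[next_token]       # feed it back in as the input at step t+1
print("sampled token id:", next_token)
```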
Problems with RNNs
- Vanishing gradient
    - When we do backprop of the loss at step $t$ w.r.t. an earlier hidden state $h_k$, we need to multiply the derivatives $\frac{\partial h_i}{\partial h_{i-1}}$ through all the intermediate time steps
    - When the values of these derivatives are small, the resulting gradient will also be small (it shrinks exponentially with the distance $t - k$)
    - This also means that gradient signals from faraway steps are effectively lost (a numeric sketch follows this list)
    - As a result, RNN-LMs are better at learning from sequential recency rather than syntactic recency
- Recurrent computation is slow since it is sequential: each step must wait for the previous one, so it cannot be parallelized across time steps
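To see the vanishing-gradient point numerically: if each step contributes a factor smaller than 1 to the chain rule, the product shrinks exponentially with the distance between time steps. A toy sketch (the 0.9 per-step factor is an arbitrary assumption):

```python
# each backprop step multiplies in a factor |dh_t / dh_{t-1}|; if it is < 1,
# the contribution from `distance` steps back decays exponentially
per_step = 0.9
for distance in (1, 10, 50, 100):
    print(distance, per_step ** distance)
# 1 -> 0.9, 10 -> ~0.35, 50 -> ~0.005, 100 -> ~2.7e-05: faraway signals are effectively lost
```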
Benefits of RNNs
- Can process any length input
- Model size does not increase for longer input context
- Same weights applied at each step, so it is easy to plug and play
LSTM
Three gates:
- Forget gate: $f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f)$
- Input gate: $i_t = \sigma(W_i h_{t-1} + U_i x_t + b_i)$
- Output gate: $o_t = \sigma(W_o h_{t-1} + U_o x_t + b_o)$
Now, we compute how much to forget, update and output,
- New cell content: $\tilde{c}_t = \tanh(W_c h_{t-1} + U_c x_t + b_c)$
- Cell state: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
- Hidden state: $h_t = o_t \odot \tanh(c_t)$ (a NumPy sketch of one step follows this list)
- Note that the LSTM does not guarantee there is no vanishing / exploding gradient; it just helps the model learn long-distance dependencies
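Putting the gate equations together, one LSTM step could be sketched like this in NumPy (shapes, the random initialisation, and the concatenated-input form of the weights are assumptions for the example; a real implementation would learn the weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 6, 8

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one weight matrix and bias per gate / candidate, acting on [h_{t-1}; x_t]
W_f, W_i, W_o, W_c = (rng.normal(0, 0.1, (d_hid, d_hid + d_in)) for _ in range(4))
b_f, b_i, b_o, b_c = (np.zeros(d_hid) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)              # forget gate
    i_t = sigmoid(W_i @ z + b_i)              # input gate
    o_t = sigmoid(W_o @ z + b_o)              # output gate
    c_tilde = np.tanh(W_c @ z + b_c)          # new cell content
    c_t = f_t * c_prev + i_t * c_tilde        # cell state: keep some old, write some new
    h_t = o_t * np.tanh(c_t)                  # hidden state: expose part of the cell
    return h_t, c_t

h, c = np.zeros(d_hid), np.zeros(d_hid)
for _ in range(5):                            # run a few steps on random inputs
    h, c = lstm_step(rng.normal(size=d_in), h, c)
print(h.shape, c.shape)
```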
Against vanishing gradient
- A possible solution is to add more direct connections, e.g. residual (skip) connections as in ResNet
- Instead of backpropagating through $F(x)$ alone, where $\frac{\partial F(x)}{\partial x}$ might be very, very small
- We calculate $\frac{\partial (F(x) + x)}{\partial x} = \frac{\partial F(x)}{\partial x} + 1$
- This is because the forward pass is $x + F(x)$ instead of just $F(x)$, so the identity term keeps the gradient from vanishing
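A quick autograd check of this argument: gradients through a deep stack of tanh layers shrink, while adding the identity ($x + F(x)$ at every layer) keeps them alive. This is only an illustrative PyTorch sketch; the layer count, sizes, and default initialisation are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)
depth, d = 30, 16
layers = [torch.nn.Linear(d, d) for _ in range(depth)]

def run(x, residual):
    for layer in layers:
        fx = torch.tanh(layer(x))
        x = x + fx if residual else fx   # with residual: forward pass is x + F(x)
    return x.sum()

for residual in (False, True):
    x = torch.randn(d, requires_grad=True)
    run(x, residual).backward()
    print("residual" if residual else "plain   ", x.grad.norm().item())
# the plain stack's input gradient is typically far smaller than the residual stack's
```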
Sequence modelling with RNNs
- Sequence tagging (POS tagging / NER tagging): just apply a softmax over the tag set at every time step, $\hat{y}_t = \mathrm{softmax}(W h_t + b)$
- Sequence classification
    - We can take the final hidden state, $h_T$, as the summary of the sentence → classify it with $\hat{y} = \mathrm{softmax}(W h_T + b)$
    - Or we can take max/average pooling over all the hidden states → classify the pooled vector instead (both options are sketched after this list)
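Both the tagging and the classification heads can be sketched on top of a vanilla RNN. The sizes, tag set, toy input, and the choice of max pooling below are assumptions for illustration only.

```python
import torch

torch.manual_seed(0)
d_in, d_hid, n_tags, n_classes, T = 12, 16, 5, 3, 7

rnn = torch.nn.RNN(input_size=d_in, hidden_size=d_hid, batch_first=True)
tag_head = torch.nn.Linear(d_hid, n_tags)       # per-step head for POS/NER tagging
cls_head = torch.nn.Linear(d_hid, n_classes)    # sentence-level head for classification

x = torch.randn(1, T, d_in)                     # one toy sentence of T steps
hs, h_last = rnn(x)                             # hs: (1, T, d_hid), h_last: (1, 1, d_hid)

# Sequence tagging: softmax over tags at every time step
tag_probs = torch.softmax(tag_head(hs), dim=-1)            # (1, T, n_tags)

# Sequence classification, option 1: final hidden state as the sentence summary
cls_from_last = torch.softmax(cls_head(h_last[-1]), dim=-1)

# Sequence classification, option 2: max pooling over all hidden states
pooled = hs.max(dim=1).values
cls_from_pool = torch.softmax(cls_head(pooled), dim=-1)

print(tag_probs.shape, cls_from_last.shape, cls_from_pool.shape)
```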
Bidirectional RNNs
- We have two new weight matrices: one for the RNN that reads the input left-to-right and one for the RNN that reads it right-to-left
- To get the hidden state at each step, we concatenate the backward and forward hidden states: $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$ (sketched after this list)
- Only applicable when we have access to the entire input sequence
- Bidirectionality is powerful for encoding
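One way to sketch the two directions and the concatenation is with two separate RNNs, as below (an illustrative PyTorch snippet with arbitrary sizes; `torch.nn.RNN(..., bidirectional=True)` does the same wiring internally):

```python
import torch

torch.manual_seed(0)
d_in, d_hid, T = 12, 16, 7

fwd = torch.nn.RNN(d_in, d_hid, batch_first=True)    # forward RNN: its own weights
bwd = torch.nn.RNN(d_in, d_hid, batch_first=True)    # backward RNN: separate weights

x = torch.randn(1, T, d_in)                           # toy input sequence
h_fwd, _ = fwd(x)                                     # reads the sentence left to right
h_bwd, _ = bwd(torch.flip(x, dims=[1]))               # reads it right to left
h_bwd = torch.flip(h_bwd, dims=[1])                   # re-align so both cover the same step t

h = torch.cat([h_fwd, h_bwd], dim=-1)                 # concatenated hidden states, (1, T, 2 * d_hid)
print(h.shape)
```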
Stacked RNNs
- Basically multi-layer RNNs
- The hidden states from RNN layer $i$ are the inputs to RNN layer $i+1$
- In deeper stacks (e.g. 8 layers), we need skip-connections
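A rough sketch of the stacking (dimensions and the toy input are arbitrary; the skip-connections mentioned above are not shown):

```python
import torch

torch.manual_seed(0)
d_in, d_hid, T = 12, 16, 7

# layer 1 consumes the input; layer 2 consumes layer 1's hidden states
layer1 = torch.nn.RNN(d_in, d_hid, batch_first=True)
layer2 = torch.nn.RNN(d_hid, d_hid, batch_first=True)

x = torch.randn(1, T, d_in)
h1, _ = layer1(x)
h2, _ = layer2(h1)          # hidden states of layer i are the inputs to layer i+1
print(h1.shape, h2.shape)

# the same stack via the built-in num_layers argument
stacked = torch.nn.LSTM(d_in, d_hid, num_layers=2, batch_first=True)
out, _ = stacked(x)
print(out.shape)
```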
Machine Translation
Given a foreign sentence $x$, the goal is to find the best translation $y$, i.e. $\arg\max_y P(y \mid x)$.
Faithfulness modelling
- Goal is to compute the translation model $P(x \mid y)$ from a bitext (parallel) corpus
- Consider sentence-pairs $(x, y)$ to compute $P(x \mid y)$?
    - Same problem as with n-grams: sparsity
- Consider word-pairs of the two sentences and then take a conditional independence assumption to compute $P(x \mid y)$?
    - This is what we do in word alignment
- Consider phrase-pairs to compute $P(x \mid y)$?
    - Phrasal alignment
Seq2Seq
- Two models are put together: an encoder and a decoder
- Encoder RNN produces an encoding of the source sentence, provides initial hidden state for decoder
- Decoder RNN is a conditional language model that generates target sentence, conditioned on the encoding
- Training objective: $J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log \hat{y}_t^{(y_t)}$, where $\hat{y}_t$ is the softmax-ed output of the decoder RNN at step $t$
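A minimal encoder-decoder sketch with teacher forcing, matching the objective above. The vocabulary sizes, dimensions, toy sentences, and the absence of attention, batching, and special tokens are all simplifications assumed for the example.

```python
import torch

torch.manual_seed(0)
src_V, tgt_V, d = 20, 22, 32

src_emb = torch.nn.Embedding(src_V, d)
tgt_emb = torch.nn.Embedding(tgt_V, d)
encoder = torch.nn.LSTM(d, d, batch_first=True)
decoder = torch.nn.LSTM(d, d, batch_first=True)
out_head = torch.nn.Linear(d, tgt_V)
loss_fn = torch.nn.CrossEntropyLoss()            # cross-entropy over the target vocab

src = torch.randint(0, src_V, (1, 6))            # toy source sentence
tgt = torch.randint(0, tgt_V, (1, 7))            # toy target sentence

# Encoder: encode the source; its final state initialises the decoder
_, (h, c) = encoder(src_emb(src))

# Decoder: conditional LM over the target, fed the gold previous words (teacher forcing)
dec_out, _ = decoder(tgt_emb(tgt[:, :-1]), (h, c))
logits = out_head(dec_out)                       # (1, T-1, tgt_V); softmax happens inside the loss

# J(theta) = -(1/T) * sum_t log P(y_t | y_<t, x)
loss = loss_fn(logits.reshape(-1, tgt_V), tgt[:, 1:].reshape(-1))
print(loss.item())
```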
Evaluating MT
- BLEU compares machine-written translation to one or several human-written translations
- It gives a similarity score based on n-gram precision, plus a brevity penalty for system translations that are too short
- Problem: n-gram overlap is a crude measure of translation quality; a good translation can still get a low BLEU score if it uses different wording from the references
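As a rough illustration of the n-gram precision + brevity penalty recipe, here is a simplified single-reference, uniform-weight score in plain Python (real BLEU implementations such as sacreBLEU differ in smoothing, multi-reference handling, and other details):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    # clipped n-gram precisions for n = 1..max_n
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)   # floor avoids log(0) in this toy version
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # brevity penalty: punish system translations that are too short
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

ref = "the cat sat on the mat".split()
hyp = "the cat is on the mat".split()
print(simple_bleu(hyp, ref))
```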
Pros and Cons of Neural MT
Pros
- Better performance, in terms of fluency, context, and phrase similarities
- Single neural network, optimized end to end
- No need for feature engineering, same for all language pairs
Cons
- Less interpretable = hard to debug
- Difficult to control in terms of safety
Other Seq2Seq problems
- Summarization (long text → short text)
- Dialogue (previous utterance → next utterance)
- Parsing (input text → output parse tree)
- Code generation (natural language → python code)
- Segmentation (input text → output tag sequence)