RNN

Training an RNN

From feedforward,

$$h = \sigma(W x + b)$$

To recurrent unit,

$$h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$$

To find the loss function, we use cross-entropy, where $y^{(t)}$ is a one-hot vector for the ground truth,

$$J^{(t)}(\theta) = -\sum_{w \in V} y^{(t)}_w \log \hat{y}^{(t)}_w = -\log \hat{y}^{(t)}_{x_{t+1}}$$

We average it to get the overall loss of the entire training set,

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)$$

Computing the loss and gradient for the entire corpus is too expensive, so in practice we compute them at the sentence / document level.
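
A minimal numpy sketch of the recurrent step and the per-step cross-entropy loss above; the parameter names (W_h, W_x, W_out), shapes, and initialization are illustrative assumptions, not something specified in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_in, d_h = 10, 8, 16                # vocab size, input dim, hidden dim (toy sizes)

# Illustrative parameters (names and shapes are assumptions for this sketch)
W_h = rng.normal(0, 0.1, (d_h, d_h))    # hidden-to-hidden
W_x = rng.normal(0, 0.1, (d_h, d_in))   # input-to-hidden
b = np.zeros(d_h)
W_out = rng.normal(0, 0.1, (V, d_h))    # hidden-to-vocab

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(h_prev, x_t):
    """One recurrent unit: h_t = sigma(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

def step_loss(h_t, target_id):
    """Cross-entropy against a one-hot ground truth: -log y_hat[target]."""
    y_hat = softmax(W_out @ h_t)
    return -np.log(y_hat[target_id])

# Average the per-step losses over a toy sequence
xs = rng.normal(size=(5, d_in))         # 5 time steps of inputs
targets = rng.integers(0, V, size=5)    # ground-truth next-word ids
h = np.zeros(d_h)
losses = []
for x_t, t_id in zip(xs, targets):
    h = rnn_step(h, x_t)
    losses.append(step_loss(h, t_id))
print("J(theta) =", np.mean(losses))    # overall loss = average of the J^(t)
```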

Generating text using RNN

  • We can just sample a word from the output distribution $\hat{y}^{(t)}$ to get the chosen word at step $t$, then feed it back in as the input for the next step
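
A tiny sketch of this sampling loop, assuming (as an illustration, not from the notes) that each word id indexes a column of an embedding-like matrix W_e:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_h = 10, 16
W_h = rng.normal(0, 0.1, (d_h, d_h))
W_e = rng.normal(0, 0.1, (d_h, V))      # one embedding-like column per word id (assumption)
W_out = rng.normal(0, 0.1, (V, d_h))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h, word = np.zeros(d_h), 0              # start from some initial word id
generated = [word]
for _ in range(8):
    h = np.tanh(W_h @ h + W_e[:, word])  # recurrent step on the previous word
    y_hat = softmax(W_out @ h)           # distribution over the vocab
    word = rng.choice(V, p=y_hat)        # sample the next word from y_hat
    generated.append(int(word))
print(generated)
```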

Problems with RNNs

  1. Vanishing gradient
    • When we do backprop of the loss $J^{(t)}$ w.r.t. an early hidden state $h_1$, we need to multiply the derivatives $\frac{\partial h_i}{\partial h_{i-1}}$ throughout the previous time steps
    • When the values of $\frac{\partial h_i}{\partial h_{i-1}}$ are small, then the resulting gradient will also be small
    • This also means that gradient signals from far away are effectively lost (see the sketch after this list)
      • RNN-LMs are better at learning from sequential recency rather than syntactic recency
  2. Recurrent computation is slow since it is sequential: step $t$ depends on step $t-1$, so it cannot be parallelized across time
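
A toy numpy demonstration of the vanishing gradient: the gradient w.r.t. an early hidden state is a product of per-step Jacobians $\frac{\partial h_i}{\partial h_{i-1}}$, and when their entries are small the product shrinks exponentially. The Jacobians here are random stand-ins, not from a trained RNN.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, T = 16, 30

# Per-step Jacobians dh_i/dh_{i-1}; small entries => small singular values
jacobians = [rng.normal(0, 0.05, (d_h, d_h)) for _ in range(T)]

grad = np.ones(d_h)              # pretend this is dJ^(T)/dh_T
norms = []
for J in reversed(jacobians):
    grad = J.T @ grad            # chain rule: multiply by each Jacobian going back in time
    norms.append(np.linalg.norm(grad))

# The gradient norm collapses toward 0 as we go further back in time
print([f"{n:.2e}" for n in norms[::5]])
```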

Benefits of RNNs

  1. Can process any length input
  2. Model size does not increase for longer input context
  3. Same weights applied at each step, so it is easy to plug and play

LSTM

Three gates:

  1. Forget gate: $f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f)$
  2. Input gate: $i_t = \sigma(W_i h_{t-1} + U_i x_t + b_i)$
  3. Output gate: $o_t = \sigma(W_o h_{t-1} + U_o x_t + b_o)$

Now, we compute how much to forget, update and output,

  1. New cell content, $\tilde{c}_t = \tanh(W_c h_{t-1} + U_c x_t + b_c)$
  2. Cell state, $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
  3. Hidden state, $h_t = o_t \odot \tanh(c_t)$
  • Note that the LSTM does not guarantee the absence of vanishing / exploding gradients; it just helps the model learn long-distance dependencies (see the sketch below)
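
A minimal numpy sketch of one LSTM cell step implementing the gate and state equations above; the parameter names, shapes, and initialization are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One (W, U, b) set per gate plus the candidate cell content (illustrative shapes)
def init():
    return rng.normal(0, 0.1, (d_h, d_h)), rng.normal(0, 0.1, (d_h, d_x)), np.zeros(d_h)

(W_f, U_f, b_f), (W_i, U_i, b_i), (W_o, U_o, b_o), (W_c, U_c, b_c) = init(), init(), init(), init()

def lstm_step(h_prev, c_prev, x_t):
    f_t = sigmoid(W_f @ h_prev + U_f @ x_t + b_f)      # forget gate
    i_t = sigmoid(W_i @ h_prev + U_i @ x_t + b_i)      # input gate
    o_t = sigmoid(W_o @ h_prev + U_o @ x_t + b_o)      # output gate
    c_tilde = np.tanh(W_c @ h_prev + U_c @ x_t + b_c)  # new cell content
    c_t = f_t * c_prev + i_t * c_tilde                 # cell state: keep some old, write some new
    h_t = o_t * np.tanh(c_t)                           # hidden state: read some of the cell
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):                  # run over a toy sequence
    h, c = lstm_step(h, c, x_t)
print(h.shape, c.shape)
```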

Against vanishing gradient

  • A possible solution is to add more direct connections, e.g. as in ResNet
  • Instead of calculating $\frac{\partial F(x)}{\partial x}$, which might be very very very small
  • We calculate $\frac{\partial (F(x) + x)}{\partial x} = \frac{\partial F(x)}{\partial x} + 1$ (a numeric check follows this list)
  • This is because the forward pass is $F(x) + x$ instead of $F(x)$
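
A tiny numeric check of this point, using a made-up scalar layer F so the two gradients can be compared directly.

```python
import numpy as np

# Toy "layer": F(x) = 0.01 * tanh(x), so dF/dx is tiny everywhere
def F(x):
    return 0.01 * np.tanh(x)

def num_grad(f, x, eps=1e-6):
    """Central finite-difference gradient."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 0.5
print("dF/dx        =", num_grad(F, x))                    # ~0.008, nearly vanishes
print("d(F(x)+x)/dx =", num_grad(lambda x: F(x) + x, x))   # ~1.008, the +1 keeps the signal
```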

Sequence modelling with RNNs

  1. Sequence tagging (POS tagging / NER tagging), just apply softmax at each step: $\hat{y}_t = \mathrm{softmax}(W h_t)$
  2. Sequence classification
    • We can take the final hidden state $h_T$ as the summary of the sentence → $\hat{y} = \mathrm{softmax}(W h_T)$
    • Or we can take max/average pooling over $h_1, \dots, h_T$ → $\hat{y} = \mathrm{softmax}(W h_{\mathrm{pooled}})$ (see the sketch after this list)
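
A short numpy sketch contrasting the two setups; the hidden states here are random placeholders standing in for the output of any RNN, and the weight names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_h, n_tags, n_classes = 6, 16, 5, 3
H = rng.normal(size=(T, d_h))            # pretend these are RNN hidden states h_1..h_T

def softmax(Z, axis=-1):
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

# 1. Sequence tagging: one softmax per time step
W_tag = rng.normal(0, 0.1, (d_h, n_tags))
tag_probs = softmax(H @ W_tag)           # shape (T, n_tags): a tag distribution per token

# 2. Sequence classification: summarize the sentence, then one softmax
W_cls = rng.normal(0, 0.1, (d_h, n_classes))
h_final = H[-1]                          # option (a): final hidden state
h_mean = H.mean(axis=0)                  # option (b): average pooling
h_max = H.max(axis=0)                    # option (b'): element-wise max pooling
print(softmax(h_final @ W_cls), softmax(h_mean @ W_cls), softmax(h_max @ W_cls))
```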

Bidirectional RNNs

  • We have two RNNs with their own weight matrices: a forward one producing $\overrightarrow{h_t}$ and a backward one producing $\overleftarrow{h_t}$
  • To get the hidden state, we concatenate the backward and forward hidden states: $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$ (sketched after this list)
  • Only applicable when we have access to the entire input sequence
  • Bidirectionality is powerful for encoding
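
A small sketch of a bidirectional pass using a toy tanh RNN: run one RNN left-to-right, another right-to-left, and concatenate the hidden states. All weights and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_x, d_h = 5, 8, 16
X = rng.normal(size=(T, d_x))

def init():
    return rng.normal(0, 0.1, (d_h, d_h)), rng.normal(0, 0.1, (d_h, d_x)), np.zeros(d_h)

def run_rnn(X, W_h, W_x, b):
    """Simple tanh RNN over the whole sequence; returns all hidden states."""
    h, hs = np.zeros(d_h), []
    for x_t in X:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hs.append(h)
    return np.stack(hs)

fwd = run_rnn(X, *init())                 # forward pass: left to right
bwd = run_rnn(X[::-1], *init())[::-1]     # backward pass: right to left, re-aligned to time order
H = np.concatenate([fwd, bwd], axis=-1)   # h_t = [forward h_t ; backward h_t]
print(H.shape)                            # (T, 2 * d_h)
```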

Stacked RNNs

  • Basically multi-layer RNNs
  • The hidden states from RNN layer $i$ are the inputs to RNN layer $i+1$ (see the sketch after this list)
  • In deeper layers (e.g. 8 layers), we need skip-connections
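
A sketch of stacking, where each layer's hidden states feed the next layer and a skip-connection is added between layers of equal width; the toy RNN and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_x, d_h, n_layers = 5, 8, 16, 3
X = rng.normal(size=(T, d_x))

def run_rnn(X, d_in):
    """Toy tanh RNN layer returning all T hidden states."""
    W_h = rng.normal(0, 0.1, (d_h, d_h))
    W_x = rng.normal(0, 0.1, (d_h, d_in))
    h, hs = np.zeros(d_h), []
    for x_t in X:
        h = np.tanh(W_h @ h + W_x @ x_t)
        hs.append(h)
    return np.stack(hs)

inputs = X
for layer in range(n_layers):
    outputs = run_rnn(inputs, inputs.shape[-1])   # layer i's hidden states...
    if layer > 0:
        outputs = outputs + inputs                # skip-connection (same width after layer 0)
    inputs = outputs                              # ...become layer i+1's inputs
print(inputs.shape)                               # (T, d_h)
```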

Machine Translation

Given a foreign sentence $x$, find the English sentence $\hat{y} = \arg\max_y P(y \mid x)$

Faithfulness modelling

  • Goal is to compute $P(x \mid y)$ from a bitext corpus
  1. Consider sentence-pairs to compute $P(x \mid y)$?
    • Same problem with n-grams, sparsity
  2. Consider word-pairs of the sentences and then take a conditional independence assumption to compute $P(x \mid y)$? (a toy count-based sketch follows this list)
    • This is what we do in word alignment
  3. Consider phrase-pairs to compute $P(x \mid y)$?
    • Phrasal alignment
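
A toy count-based sketch of the word-pair idea: estimate $P(x_{\text{word}} \mid y_{\text{word}})$ from sentence-pair co-occurrences in a made-up bitext. Real word alignment (e.g. IBM Model 1) refines such counts with EM; this only shows the intuition.

```python
from collections import defaultdict

# Toy bitext (made up for illustration): (foreign sentence, English sentence) pairs
bitext = [
    ("la maison bleue", "the blue house"),
    ("la maison", "the house"),
    ("la fleur bleue", "the blue flower"),
]

# Crude estimate of P(x_word | y_word) by counting co-occurrences within sentence pairs
pair_counts = defaultdict(float)
y_counts = defaultdict(float)
for x_sent, y_sent in bitext:
    for y_w in y_sent.split():
        for x_w in x_sent.split():
            pair_counts[(x_w, y_w)] += 1.0
            y_counts[y_w] += 1.0

def p_x_given_y(x_w, y_w):
    return pair_counts[(x_w, y_w)] / y_counts[y_w]

print(p_x_given_y("maison", "house"))   # relatively high
print(p_x_given_y("bleue", "house"))    # lower

# With a conditional independence assumption, a sentence-level faithfulness score
# could then be approximated as a product of such word-pair probabilities.
```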

Seq2Seq

  • Two models are put together: an encoder and a decoder
  • Encoder RNN produces an encoding of the source sentence, which provides the initial hidden state for the decoder
  • Decoder RNN is a conditional language model that generates target sentence, conditioned on the encoding
  • Training objective: $J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log \hat{y}_t[y^*_t]$, where $\hat{y}_t$ is the softmax-ed output of the decoder RNN and $y^*_t$ is the ground-truth target word at step $t$
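
A minimal PyTorch sketch of the encoder-decoder setup; the GRU choice, the sizes, and the teacher-forced toy batch are assumptions for illustration, since the notes only specify an encoder RNN, a decoder RNN, and the cross-entropy objective above.

```python
import torch
import torch.nn as nn

V_src, V_tgt, d_emb, d_h = 50, 60, 32, 64   # toy vocab and layer sizes (assumptions)

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(V_src, d_emb)
        self.tgt_emb = nn.Embedding(V_tgt, d_emb)
        self.encoder = nn.GRU(d_emb, d_h, batch_first=True)
        self.decoder = nn.GRU(d_emb, d_h, batch_first=True)
        self.out = nn.Linear(d_h, V_tgt)

    def forward(self, src_ids, tgt_in_ids):
        _, h_enc = self.encoder(self.src_emb(src_ids))               # encoding of the source
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in_ids), h_enc)   # decoder conditioned on it
        return self.out(dec_out)                                     # logits over target vocab

model = Seq2Seq()
src = torch.randint(0, V_src, (2, 7))       # toy batch: 2 source sentences of length 7
tgt = torch.randint(0, V_tgt, (2, 6))       # toy target sentences of length 6
logits = model(src, tgt[:, :-1])            # teacher forcing: feed the gold prefix
# Cross-entropy of the softmax-ed outputs against the shifted gold targets = objective above
loss = nn.functional.cross_entropy(logits.reshape(-1, V_tgt), tgt[:, 1:].reshape(-1))
print(loss.item())
```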

Evaluating MT

  • BLEU compares machine-written translation to one or several human-written translations
  • It gives a similarity score based on n-gram overlap, plus a penalty on system translations that are too short
  • Problem: n-gram overlap is not a good measure of translation quality…
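
A simplified single-reference BLEU-style score to make the n-gram plus brevity-penalty idea concrete; real BLEU is computed at the corpus level and supports multiple references, so this is only an approximation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=4):
    """Clipped n-gram precision (geometric mean over n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())   # clipped matches
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)            # smooth zeros for the toy demo
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

ref = "the cat sat on the mat".split()
print(simple_bleu("the cat sat on the mat".split(), ref))   # 1.0: exact match
print(simple_bleu("the cat".split(), ref))                  # heavily penalized for being too short
```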

Pros and Cons of MT

Pros

  1. Better performance, in terms of fluency, context, and phrase similarities
  2. Single neural network, optimized end to end
  3. No need for feature engineering, same for all language pairs

Cons

  1. Less interpretable = hard to debug
  2. Difficult to control in terms of safety

Other Seq2Seq problems

  1. Summarization (long text → short text)
  2. Dialogue (previous utterance → next utterance)
  3. Parsing (input text → output parse tree)
  4. Code generation (natural language → Python code)
  5. Segmentation (input text → output tag sequence)