• In RNNs, we have a bottleneck in the encoder.
  • This is because the decoder never sees the input text, but only the final vector
  • This many-to-one compression loses a lot of information

Attention

Core idea: on each decoder step, we use direct connections to the encoder to focus on the relevant parts of the source sentence

  • We have encoder hidden states $h_1, \dots, h_N \in \mathbb{R}^h$
  • On timestep $t$, we have decoder hidden state $s_t \in \mathbb{R}^h$
  • We get the attention scores for this timestep: $e^t = [s_t^\top h_1, \dots, s_t^\top h_N] \in \mathbb{R}^N$
  • We take softmax to get the attention distribution for this step: $\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^N$
  • We use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the attention output $a_t = \sum_{i=1}^{N} \alpha_i^t h_i \in \mathbb{R}^h$
  • Concatenate the attention output with the decoder hidden state, $[a_t; s_t] \in \mathbb{R}^{2h}$, then compute as in the non-attention model (see the sketch after this list)
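A minimal NumPy sketch of one decoder timestep of dot-product attention; the names (`h_enc`, `s_t`), dimensions, and random inputs are illustrative assumptions, not taken from any particular implementation:

```python
import numpy as np

def softmax(x):
    x = x - x.max()                      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

src_len, hidden_dim = 5, 8
h_enc = np.random.randn(src_len, hidden_dim)  # encoder hidden states h_1..h_N
s_t = np.random.randn(hidden_dim)             # decoder hidden state at timestep t

scores = h_enc @ s_t                    # e^t: dot-product attention scores, shape (src_len,)
alpha = softmax(scores)                 # alpha^t: attention distribution
a_t = alpha @ h_enc                     # attention output: weighted sum of encoder states
concat = np.concatenate([a_t, s_t])     # [a_t; s_t] is then used to predict the next word
```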

Why attention?

  1. Attention improves MT performance significantly
  2. It solves the bottleneck problem
  3. It helps with the vanishing gradient problem
  4. It provides interpretability in terms of attention patterns

Attention as QKV computation

  • From the above example,
    • The encoder hidden states are the keys/values
    • The decoder hidden state is the query
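Written out for the dot-product case above (notation assumed: query $q = s_t$, keys and values $k_i = v_i = h_i$):

$$
\mathrm{Attn}(q, \{k_i\}, \{v_i\})
= \sum_{i=1}^{N} \frac{\exp(q^\top k_i)}{\sum_{j=1}^{N} \exp(q^\top k_j)}\, v_i
= \sum_{i=1}^{N} \alpha_i^t h_i = a_t
$$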

Self-attention

  • Usually attention is applied between encoder and decoder for seq2seq learning
  • When attention is applied within a single sequence, we call it self-attention
  • In RNNs, we do things in sequence, which is unparallelizable
  • With self-attention, we can parallelize per layer, as all words interact at every layer

Let $w_{1:n}$ be a sequence of words in vocabulary $V$. For each $w_i$, let $x_i = E w_i$, where $E \in \mathbb{R}^{d \times |V|}$ is an embedding matrix.

  1. Transform each word embedding with weight matrices $Q, K, V$, each in $\mathbb{R}^{d \times d}$: $q_i = Q x_i$, $k_i = K x_i$, $v_i = V x_i$
  2. Compute pairwise similarities between keys and queries, normalize with softmax: $e_{ij} = q_i^\top k_j$, $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}$
  3. Compute the output for each word as a weighted sum of values: $o_i = \sum_j \alpha_{ij} v_j$ (a sketch follows)
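A minimal NumPy sketch of this single-head self-attention computation; the names (`X`, `W_q`, `W_k`, `W_v`) and dimensions are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d = 6, 16                              # sequence length, embedding dimension
X = np.random.randn(n, d)                 # word embeddings x_1..x_n as rows

W_q = np.random.randn(d, d)               # query projection Q
W_k = np.random.randn(d, d)               # key projection K
W_v = np.random.randn(d, d)               # value projection V

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # step 1: project every embedding at once
E = Q @ K.T                               # step 2: all pairwise similarities e_ij, shape (n, n)
A = softmax(E, axis=-1)                   #         normalized per query with softmax
O = A @ V                                 # step 3: each output row is a weighted sum of values
```

Note that every word attends to every other word in one matrix multiply, which is what makes the per-layer computation parallelizable.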

Transformer

To build the full model from self-attention, we need to add:

  1. Positional embeddings
  2. Non-linearity
  3. Single to multi-head self attention
  4. Multiple layers

Positional embeddings

  • We have positional embeddings $p_i$ for positions $i \in \{1, \dots, n\}$, and we just add them to the word embeddings: $\tilde{x}_i = x_i + p_i$
  • We can make these positional embeddings learned parameters, but in doing so, we can’t extrapolate to indices outside $1, \dots, n$.
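A short sketch of the learned-positional-embedding case; the maximum length `max_len` and the initialization scale are assumptions:

```python
import numpy as np

n, d, max_len = 6, 16, 512
X = np.random.randn(n, d)                 # word embeddings x_1..x_n
P = np.random.randn(max_len, d) * 0.02    # learned positional embeddings p_1..p_max_len
X_pos = X + P[:n]                         # x_i <- x_i + p_i; positions beyond max_len
                                          # have no embedding, hence no extrapolation
```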

Non-linearity

  • Self-attention just re-averages the value vectors; there is no non-linearity in it
  • We add a feed-forward network to post-process each output vector
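A sketch of this position-wise feed-forward step; the hidden size `d_ff` and the ReLU non-linearity are common choices assumed here:

```python
import numpy as np

d, d_ff = 16, 64
W1, b1 = np.random.randn(d, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d), np.zeros(d)

def ffn(o):
    # o: attention output for one position, shape (d,)
    return np.maximum(0, o @ W1 + b1) @ W2 + b2   # ReLU provides the non-linearity
```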

Multi-head self-attention

  • We just do self-attention multiple times, but with different projection matrices
  • After each head computes its attention output, we concatenate the heads’ outputs and project them to get the overall output
  • Each head should focus on a different “feature” of the sentence
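A sketch of multi-head self-attention built from the single-head computation above; the number of heads, the per-head size `d_head = d / n_heads`, and the output matrix `W_o` are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def one_head(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T, axis=-1) @ V          # single-head output, shape (n, d_head)

n, d, n_heads = 6, 16, 4
d_head = d // n_heads
X = np.random.randn(n, d)

heads = []
for _ in range(n_heads):
    # each head gets its own projection matrices
    W_q, W_k, W_v = (np.random.randn(d, d_head) for _ in range(3))
    heads.append(one_head(X, W_q, W_k, W_v))

W_o = np.random.randn(d, d)
output = np.concatenate(heads, axis=-1) @ W_o     # concat head outputs, project back to d
```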

Add & Norm

  • We add the layer input as a residual connection to help gradients flow better
    • We let $X^{(i)} = X^{(i-1)} + \mathrm{Layer}(X^{(i-1)})$
  • We also do layer norm to help models train faster
    • Idea: cut down uninformative variation in hidden vector values by normalizing to zero mean and unit standard deviation within each layer
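A sketch of the residual-plus-layer-norm step; the epsilon and the learnable gain/bias (`gamma`, `beta`) are standard, but their names and shapes here are assumptions:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)             # per-position mean
    sigma = x.std(axis=-1, keepdims=True)           # per-position standard deviation
    return gamma * (x - mu) / (sigma + eps) + beta  # zero mean, unit std, then rescale

d = 16
gamma, beta = np.ones(d), np.zeros(d)
x = np.random.randn(6, d)                           # layer input X^(i-1)
sublayer_out = np.random.randn(6, d)                # stand-in for the attention/FFN output
y = layer_norm(x + sublayer_out, gamma, beta)       # X^(i) = LayerNorm(X^(i-1) + Layer(X^(i-1)))
```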