- In RNNs, we have a bottleneck in the encoder.
- This is because the decoder never sees the input text, only the final encoder vector
- This many-to-one compression loses a lot of information
Attention
Core idea: We use direct connections to the encoder to focus on the relevant parts of the source sentence
- We have encoder hidden states $h_1, \dots, h_N \in \mathbb{R}^h$
- On timestep $t$, we have decoder hidden state $s_t \in \mathbb{R}^h$
- We get the attention scores $e^t = [s_t^\top h_1, \dots, s_t^\top h_N] \in \mathbb{R}^N$ for this timestep
- We take softmax to get the attention distribution $\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^N$ for this step
- We use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the attention output $a_t = \sum_{i=1}^N \alpha_i^t h_i \in \mathbb{R}^h$
- Concatenate the attention output $a_t$ with the decoder hidden state $s_t$ to get $[a_t; s_t] \in \mathbb{R}^{2h}$, then compute as before
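The steps above can be sketched in a few lines of NumPy; the dimensions here ($N = 4$ encoder states, hidden size $h = 3$) and the random values are made up for illustration:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))       # encoder hidden states h_1..h_N, one per row
s_t = rng.normal(size=(3,))       # decoder hidden state at timestep t

e_t = H @ s_t                     # attention scores e^t, one per encoder state
alpha_t = softmax(e_t)            # attention distribution alpha^t (sums to 1)
a_t = alpha_t @ H                 # attention output: weighted sum of encoder states
out = np.concatenate([a_t, s_t])  # [a_t; s_t], used to compute the next prediction

print(out.shape)  # (6,)
```

Note that `a_t` is a convex combination of the rows of `H`, so the decoder gets a direct, differentiable path back to every encoder state.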
Why attention?
- Attention improves MT performance significantly
- It solves the bottleneck problem
- It helps with the vanishing gradient problem by providing a shortcut to faraway states
- It provides interpretability in terms of attention patterns
Attention as QKV computation
- From the above example:
- The encoder hidden states $h_1, \dots, h_N$ are the keys and values
- The decoder hidden state $s_t$ is the query
Self-attention
- Usually attention is applied between the encoder and decoder for seq2seq learning
- For a single sequence, we have self-attention
- In RNNs, we do things in sequence, which is unparallelizable
- With self-attention, we can parallelize per layer, as all words interact at every layer
Let $w_1, \dots, w_n$ be the words in the sequence, with embeddings $x_1, \dots, x_n \in \mathbb{R}^d$
- Transform each word embedding with weight matrices $Q, K, V$, each in $\mathbb{R}^{d \times d}$: $q_i = Qx_i$, $k_i = Kx_i$, $v_i = Vx_i$
- Compute pairwise similarities between keys and queries, normalize with softmax: $e_{ij} = q_i^\top k_j$, $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}$
- Compute output for each word as a weighted sum of values: $o_i = \sum_j \alpha_{ij} v_j$
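In matrix form, all three steps parallelize across positions. A minimal NumPy sketch, with toy sizes ($n = 5$ words, $d = 4$) and random weights standing in for learned ones:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))          # word embeddings x_1..x_n, one per row
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv     # queries, keys, values for all words at once
scores = Q @ K.T                     # pairwise similarities e_ij = q_i . k_j
alpha = softmax(scores)              # row i is a distribution over positions j
O = alpha @ V                        # outputs o_i = sum_j alpha_ij v_j

print(O.shape)  # (5, 4): one output vector per word
```

Every position attends to every other in one matrix multiply, which is what makes self-attention parallelizable per layer, unlike an RNN's sequential recurrence.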
Transformer
Self-attention alone is not enough; we need to add:
- Positional embeddings
- Non-linearity
- Single to multi-head self attention
- Multiple layers
Positional embeddings
- We have positional embeddings $p_1, \dots, p_n$, and we just add them to the word embeddings: $x_i = \tilde{x}_i + p_i$
- We can make these positional embeddings learned parameters, but in doing so, we can't extrapolate to indices beyond the maximum length seen in training.
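One common non-learned alternative (an assumption here, since the notes don't pin down a specific scheme) is fixed sinusoidal positional embeddings, which are computed from a formula and so extend to any index:

```python
import numpy as np

def sinusoidal_positions(n, d):
    # Fixed sinusoidal positional embeddings: sin on even dims, cos on odd dims.
    # Being formula-based (not learned), they work for any position index.
    pos = np.arange(n)[:, None]            # positions 0..n-1, as a column
    dim = np.arange(0, d, 2)[None, :]      # even dimension indices, as a row
    angles = pos / (10000 ** (dim / d))    # frequencies decrease with dimension
    P = np.zeros((n, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

X = np.ones((6, 8))                       # toy word embeddings (made-up values)
X_pos = X + sinusoidal_positions(6, 8)    # just add position info to each embedding
print(X_pos.shape)  # (6, 8)
```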
Non-linearity
- Self-attention just re-averages the value vectors; there's no non-linearity in it
- We add a feed-forward network to post-process each output vector
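A sketch of such a position-wise feed-forward network, applied independently to each output vector; the sizes and random weights are made up (real models typically use a much larger hidden dimension):

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_ff = 4, 16
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

def ffn(x):
    # Two linear maps with a ReLU in between: this is where the
    # non-linearity comes from, since attention alone only averages values.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

O = rng.normal(size=(5, d))   # 5 toy attention output vectors
print(ffn(O).shape)  # (5, 4)
```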
Multi-head self-attention
- We just do self-attention multiple times in parallel, with different projection matrices for each head
- After we get each head's attention output, we concatenate them (and apply an output projection) to get the overall output
- Each head should focus on a different "feature" of the sentence
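A NumPy sketch of this: two heads, each with its own projections, whose outputs are concatenated and then projected. All sizes and weights are made up for illustration; real implementations compute all heads in one batched multiply rather than a loop.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
n, d, n_heads = 5, 8, 2
d_head = d // n_heads                      # each head works in a smaller subspace
X = rng.normal(size=(n, d))

head_outputs = []
for _ in range(n_heads):
    # Separate projection matrices per head, so each can attend differently
    Wq, Wk, Wv = (rng.normal(size=(d, d_head)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    alpha = softmax(Q @ K.T)               # this head's attention pattern
    head_outputs.append(alpha @ V)         # this head's output, shape (n, d_head)

Wo = rng.normal(size=(d, d))
O = np.concatenate(head_outputs, axis=-1) @ Wo   # concat heads, then project
print(O.shape)  # (5, 8)
```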
Add & Norm
- We add a residual connection to help the gradients flow better
- Instead of $X^{(i)} = \mathrm{Layer}(X^{(i-1)})$
- We let $X^{(i)} = X^{(i-1)} + \mathrm{Layer}(X^{(i-1)})$
- We also do layer norm to help models train faster
- Idea: cut down uninformative variation in hidden vector values by normalizing to zero mean and unit standard deviation within each layer
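Both pieces together, as a minimal sketch; the sublayer is a made-up stand-in for self-attention or the feed-forward network, and the learned gain/bias that real layer norm adds after normalizing is omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each hidden vector to zero mean and unit standard deviation
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def sublayer(x):
    # Toy stand-in for self-attention or the FFN (made-up weights)
    return x @ np.full((4, 4), 0.1)

X = np.arange(12.0).reshape(3, 4)       # toy hidden vectors
out = layer_norm(X + sublayer(X))       # residual connection, then layer norm
print(out.shape)  # (3, 4)
```

The residual term `X + sublayer(X)` means the identity path always survives, so gradients can flow straight through even if the sublayer's gradients are small.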