- In RNN-based seq2seq models, we have a bottleneck in the encoder.
- This is because the decoder never sees the input text directly, only the final encoder hidden state (a single fixed-length vector)
- This many-to-one compression loses a lot of information
Attention
Core idea: We use direct connections to the encoder to focus on the relevant parts of the source sentence
- We have encoder hidden states $h_1, \dots, h_N \in \mathbb{R}^h$
- On timestep $t$, we have decoder hidden state $s_t \in \mathbb{R}^h$
- We get the attention scores for this timestep: $e^t = [s_t^\top h_1, \dots, s_t^\top h_N] \in \mathbb{R}^N$
- We take softmax to get the attention distribution $\alpha^t = \operatorname{softmax}(e^t) \in \mathbb{R}^N$ for this step
- We use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the attention output $a_t = \sum_{i=1}^N \alpha^t_i h_i \in \mathbb{R}^h$
- Concatenate the attention output with the decoder hidden state, $[a_t; s_t] \in \mathbb{R}^{2h}$, then compute as before (a minimal sketch follows this list)
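A minimal NumPy sketch of one such attention step with dot-product scoring; the function and variable names are illustrative, not from any particular library:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_step(s_t, H):
    """One decoder timestep of dot-product attention.

    s_t: decoder hidden state, shape (h,)
    H:   encoder hidden states stacked as rows, shape (N, h)
    """
    scores = H @ s_t              # e^t: one score per source position, shape (N,)
    alpha = softmax(scores)       # attention distribution over source positions
    a_t = alpha @ H               # weighted sum of encoder hidden states, shape (h,)
    return a_t, alpha

# Toy usage: N = 5 source positions, hidden size h = 8
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
s_t = rng.normal(size=(8,))
a_t, alpha = attention_step(s_t, H)
combined = np.concatenate([a_t, s_t])  # [a_t; s_t], used to compute the output as before
```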
Why attention?
- Attention improves MT performance significantly
- It solves the bottleneck problem
- It helps with the vanishing gradient problem by providing shortcut connections to far-away encoder states
- It provides interpretability in terms of attention patterns
Attention as QKV computation
- From the above example:
- The encoder hidden states $h_1, \dots, h_N$ are the keys and values
- The decoder hidden state $s_t$ is the query
Self-attention
- Usually attention is applied between an encoder and a decoder for seq2seq learning
- When we apply attention within one single sequence, we get self-attention
- In RNNs, we process the sequence one timestep at a time, which is unparallelizable
- With self-attention, we can parallelize per layer, as all words interact at every layer
Let $w_{1:n}$ be a sequence of words in vocabulary $V$. For each $w_i$, let $x_i = E w_i$, where $E \in \mathbb{R}^{d \times |V|}$ is an embedding matrix.
- Transform each word embedding with weight matrices $Q, K, V$, each in $\mathbb{R}^{d \times d}$: queries $q_i = Q x_i$, keys $k_i = K x_i$, values $v_i = V x_i$
- Compute pairwise similarities between queries and keys, and normalize with softmax: $e_{ij} = q_i^\top k_j$, $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}$
- Compute the output for each word as a weighted sum of values: $o_i = \sum_j \alpha_{ij} v_j$ (see the sketch after this list)
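A minimal NumPy sketch of this basic single-head self-attention recipe; the matrix names follow the notes, everything else is illustrative:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over one sequence.

    X:          word embeddings, shape (n, d)
    Wq, Wk, Wv: projection matrices, each shape (d, d)
    """
    Q = X @ Wq                      # queries q_i
    K = X @ Wk                      # keys    k_i
    V = X @ Wv                      # values  v_i
    scores = Q @ K.T                # pairwise similarities e_ij = q_i . k_j, shape (n, n)
    alpha = softmax(scores)         # row-wise attention distributions
    return alpha @ V                # outputs o_i = sum_j alpha_ij v_j, shape (n, d)

# Toy usage: n = 4 words, embedding size d = 6
rng = np.random.default_rng(0)
n, d = 4, 6
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)  # shape (4, 6)
```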
Transformer
To go from plain self-attention to the Transformer, we need to add:
- Positional embeddings
- Non-linearity
- Single to multi-head self attention
- Multiple layers
Positional embeddings
- We have a positional embedding $p_i \in \mathbb{R}^d$ for each position $i$, and we just add it to the word embedding: $\tilde{x}_i = x_i + p_i$
- We can make these positional embeddings learned parameters, but in doing so, we can't extrapolate to indices outside $1, \dots, n$, the maximum sequence length seen in training (a small sketch follows)
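A small sketch of learned (lookup-table) positional embeddings added to word embeddings; shapes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, max_len, vocab = 6, 16, 100

E = rng.normal(size=(vocab, d))      # word embedding matrix (would be learned)
P = rng.normal(size=(max_len, d))    # positional embeddings p_1..p_max_len (would be learned)

word_ids = np.array([4, 17, 42, 7])  # a toy input sequence of length n = 4
X = E[word_ids]                      # word embeddings x_i, shape (n, d)
X_pos = X + P[: len(word_ids)]       # add p_i to each x_i; no p_i exists beyond max_len
```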
Non-linearity
- Self-attention just re-averages the value vectors; there is no non-linearity in it
- We add a feed-forward network to post-process each output vector (sketch below)
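A sketch of a position-wise feed-forward network applied to each output vector independently (two linear layers with a ReLU in between; the hidden size is illustrative):

```python
import numpy as np

def feed_forward(O, W1, b1, W2, b2):
    """Apply the same 2-layer MLP to each position's output vector o_i."""
    return np.maximum(0, O @ W1 + b1) @ W2 + b2   # ReLU gives the non-linearity

# Toy usage: n = 4 positions, model size d = 6, hidden size d_ff = 24
rng = np.random.default_rng(0)
n, d, d_ff = 4, 6, 24
O = rng.normal(size=(n, d))                       # self-attention outputs
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
H = feed_forward(O, W1, b1, W2, b2)               # shape (4, 6)
```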
Multi-head self-attention
- We just do self-attention multiple times in parallel, each head with its own projection matrices
- After computing each head's attention output, we concatenate the head outputs (and apply a final linear projection) to get the overall output (sketch after this list)
- Each head should focus on a different "feature" of the sentence
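A minimal NumPy sketch of multi-head self-attention (per-head projections, concatenate the head outputs, then one output projection); all names and sizes are illustrative:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, heads, Wo):
    """X: (n, d); heads: list of (Wq, Wk, Wv), each (d, d_head); Wo: (num_heads*d_head, d)."""
    outputs = []
    for Wq, Wk, Wv in heads:                       # each head has its own projection matrices
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        alpha = softmax(Q @ K.T)                   # this head's attention distribution
        outputs.append(alpha @ V)                  # this head's attention output, shape (n, d_head)
    return np.concatenate(outputs, axis=-1) @ Wo   # concatenate heads, then project back to (n, d)

# Toy usage: n = 4 words, d = 8, 2 heads with d_head = d / num_heads = 4
rng = np.random.default_rng(0)
n, d, num_heads = 4, 8, 2
d_head = d // num_heads
X = rng.normal(size=(n, d))
heads = [tuple(rng.normal(size=(d, d_head)) for _ in range(3)) for _ in range(num_heads)]
Wo = rng.normal(size=(num_heads * d_head, d))
out = multi_head_self_attention(X, heads, Wo)      # shape (4, 8)
```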
Add & Norm
- We add a layer's input back to its output as a residual connection, to help gradients flow better
- We let $X^{(i)} = X^{(i-1)} + \mathrm{Layer}(X^{(i-1)})$
- We also do layer norm to help models train faster
- Idea: cut down uninformative variation in hidden vector values by normalizing each vector to zero mean and unit standard deviation within each layer (sketch below)
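A sketch of the residual connection plus layer normalization around a sublayer; the learned gain and bias of layer norm are omitted for brevity, and all names are illustrative:

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalize each position's vector to zero mean and unit standard deviation."""
    mean = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    return (X - mean) / (std + eps)     # learned gain/bias parameters omitted for brevity

def add_and_norm(X, sublayer):
    """LayerNorm(X + Layer(X)): residual connection followed by layer norm."""
    return layer_norm(X + sublayer(X))

# Toy usage with a linear map standing in for a self-attention or FFN sublayer
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Y = add_and_norm(X, lambda Z: Z @ rng.normal(size=(8, 8)))
```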