• In RNNs, we have a bottleneck in the encoder.
  • This is because the decoder never sees the input text, but only the final vector
  • This many-to-one compression loses a lot of information

Attention

Core idea: on each decoder step, we use direct connections to the encoder to focus on the relevant parts of the source sentence

  • We have encoder hidden states $h_1, \dots, h_N \in \mathbb{R}^h$
  • On timestep $t$, we have decoder hidden state $s_t \in \mathbb{R}^h$
  • We get the attention scores for this timestep: $e^t = [s_t^\top h_1, \dots, s_t^\top h_N] \in \mathbb{R}^N$
  • We take softmax to get the attention distribution for this step: $\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^N$
  • We use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the attention output $a_t = \sum_{i=1}^{N} \alpha_i^t h_i \in \mathbb{R}^h$
  • Concatenate the attention output with the decoder hidden state, $[a_t; s_t] \in \mathbb{R}^{2h}$, then compute as in the non-attention model (see the sketch after this list)
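A minimal NumPy sketch of one decoder timestep of dot-product attention; the names (`h_enc`, `s_t`), dimensions, and random inputs are illustrative assumptions, not taken from any particular implementation:

```python
import numpy as np

def softmax(x):
    x = x - x.max()                      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

src_len, hidden_dim = 5, 8
h_enc = np.random.randn(src_len, hidden_dim)  # encoder hidden states h_1..h_N
s_t = np.random.randn(hidden_dim)             # decoder hidden state at timestep t

scores = h_enc @ s_t                    # e^t: dot-product attention scores, shape (src_len,)
alpha = softmax(scores)                 # alpha^t: attention distribution
a_t = alpha @ h_enc                     # attention output: weighted sum of encoder states
concat = np.concatenate([a_t, s_t])     # [a_t; s_t] is then used to predict the next word
```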

Why attention?

  1. Attention improves MT performance significantly
  2. It solves the bottleneck problem
  3. It helps with the vanishing gradient problem
  4. It provides interpretability in terms of attention patterns

Attention as QKV computation

  • From the above example,
    • The encoder hidden states are the keys/values
    • The decoder hidden state is the query
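Written out for the dot-product case above (notation assumed: query $q = s_t$, keys and values $k_i = v_i = h_i$):

$$
\mathrm{Attn}(q, \{k_i\}, \{v_i\})
= \sum_{i=1}^{N} \frac{\exp(q^\top k_i)}{\sum_{j=1}^{N} \exp(q^\top k_j)}\, v_i
= \sum_{i=1}^{N} \alpha_i^t h_i = a_t
$$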

Self-attention

  • Usually attention is applied between encoder and decoder for seq2seq learning
  • When attention is applied within a single sequence, we call it self-attention
  • In RNNs, we do things in sequence, which is unparallelizable
  • With self-attention, we can parallelize per layer, as all words interact at every layer

Let $w_{1:n}$ be a sequence of words in vocabulary $V$. For each $w_i$, let $x_i = E w_i$, where $E \in \mathbb{R}^{d \times |V|}$ is an embedding matrix.

  1. Transform each word embedding with weight matrices $Q, K, V$, each in $\mathbb{R}^{d \times d}$: $q_i = Q x_i$, $k_i = K x_i$, $v_i = V x_i$
  2. Compute pairwise similarities between keys and queries, normalize with softmax: $e_{ij} = q_i^\top k_j$, $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}$
  3. Compute the output for each word as a weighted sum of values: $o_i = \sum_j \alpha_{ij} v_j$ (a sketch follows)
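A minimal NumPy sketch of this single-head self-attention computation; the names (`X`, `W_q`, `W_k`, `W_v`) and dimensions are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d = 6, 16                              # sequence length, embedding dimension
X = np.random.randn(n, d)                 # word embeddings x_1..x_n as rows

W_q = np.random.randn(d, d)               # query projection Q
W_k = np.random.randn(d, d)               # key projection K
W_v = np.random.randn(d, d)               # value projection V

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # step 1: project every embedding at once
E = Q @ K.T                               # step 2: all pairwise similarities e_ij, shape (n, n)
A = softmax(E, axis=-1)                   #         normalized per query with softmax
O = A @ V                                 # step 3: each output row is a weighted sum of values
```

Note that every word attends to every other word in one matrix multiply, which is what makes the per-layer computation parallelizable.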

Transformer

To build the full model from self-attention, we need to add:

  1. Positional embeddings
  2. Non-linearity
  3. Single to multi-head self attention
  4. Multiple layers

Positional embeddings

  • We have positional embeddings $p_i$ for positions $i \in \{1, \dots, n\}$, and we just add them to the word embeddings: $\tilde{x}_i = x_i + p_i$
  • We can make these positional embeddings learned parameters, but in doing so, we can’t extrapolate to indices outside $1, \dots, n$.
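A short sketch of the learned-positional-embedding case; the maximum length `max_len` and the initialization scale are assumptions:

```python
import numpy as np

n, d, max_len = 6, 16, 512
X = np.random.randn(n, d)                 # word embeddings x_1..x_n
P = np.random.randn(max_len, d) * 0.02    # learned positional embeddings p_1..p_max_len
X_pos = X + P[:n]                         # x_i <- x_i + p_i; positions beyond max_len
                                          # have no embedding, hence no extrapolation
```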

Non-linearity

  • Self-attention just re-averages the value vectors; there is no non-linearity in it
  • We add a feed-forward network to post-process each output vector
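A sketch of this position-wise feed-forward step; the hidden size `d_ff` and the ReLU non-linearity are common choices assumed here:

```python
import numpy as np

d, d_ff = 16, 64
W1, b1 = np.random.randn(d, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d), np.zeros(d)

def ffn(o):
    # o: attention output for one position, shape (d,)
    return np.maximum(0, o @ W1 + b1) @ W2 + b2   # ReLU provides the non-linearity
```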

Multi-head self-attention

  • We just do self-attention multiple times, but with different projection matrices
  • After each head computes its attention output, we concatenate the heads’ outputs and project them to get the overall output
  • Each head should focus on a different “feature” of the sentence
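A sketch of multi-head self-attention built from the single-head computation above; the number of heads, the per-head size `d_head = d / n_heads`, and the output matrix `W_o` are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def one_head(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T, axis=-1) @ V          # single-head output, shape (n, d_head)

n, d, n_heads = 6, 16, 4
d_head = d // n_heads
X = np.random.randn(n, d)

heads = []
for _ in range(n_heads):
    # each head gets its own projection matrices
    W_q, W_k, W_v = (np.random.randn(d, d_head) for _ in range(3))
    heads.append(one_head(X, W_q, W_k, W_v))

W_o = np.random.randn(d, d)
output = np.concatenate(heads, axis=-1) @ W_o     # concat head outputs, project back to d
```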

Add & Norm

  • We add the layer input as a residual connection to help gradients flow better
    • We let $X^{(i)} = X^{(i-1)} + \mathrm{Layer}(X^{(i-1)})$
  • We also do layer norm to help models train faster
    • Idea: cut down uninformative variation in hidden vector values by normalizing to zero mean and unit standard deviation within each layer
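A sketch of the residual-plus-layer-norm step; the epsilon and the learnable gain/bias (`gamma`, `beta`) are standard, but their names and shapes here are assumptions:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)             # per-position mean
    sigma = x.std(axis=-1, keepdims=True)           # per-position standard deviation
    return gamma * (x - mu) / (sigma + eps) + beta  # zero mean, unit std, then rescale

d = 16
gamma, beta = np.ones(d), np.zeros(d)
x = np.random.randn(6, d)                           # layer input X^(i-1)
sublayer_out = np.random.randn(6, d)                # stand-in for the attention/FFN output
y = layer_norm(x + sublayer_out, gamma, beta)       # X^(i) = LayerNorm(X^(i-1) + Layer(X^(i-1)))
```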