- In RNNs, we have a bottleneck in the encoder.
- This is because the decoder never sees the input text, only the final encoder vector
- This many-to-one compression loses a lot of information
Attention
Core idea: We use direct connections to the encoder to focus on the relevant parts of the source sentence
- We have encoder hidden states $h_1, \dots, h_N \in \mathbb{R}^h$
- On timestep $t$, we have decoder hidden state $s_t \in \mathbb{R}^h$
- We get the attention scores $e^t = [s_t^\top h_1, \dots, s_t^\top h_N] \in \mathbb{R}^N$ for this timestep
- We take softmax to get the attention distribution $\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^N$ for this step
- We use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the attention output $a_t = \sum_{i=1}^N \alpha_i^t h_i \in \mathbb{R}^h$
- Concatenate the attention output $a_t$ with the decoder hidden state $s_t$ to get $[a_t; s_t] \in \mathbb{R}^{2h}$, then compute as before
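The steps above can be sketched in a few lines of NumPy; the dimensions here ($N = 4$ encoder states, hidden size $h = 3$) and the random values are made up for illustration:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))       # encoder hidden states h_1..h_N, one per row
s_t = rng.normal(size=(3,))       # decoder hidden state at timestep t

e_t = H @ s_t                     # attention scores e^t, one per encoder state
alpha_t = softmax(e_t)            # attention distribution alpha^t (sums to 1)
a_t = alpha_t @ H                 # attention output: weighted sum of encoder states
out = np.concatenate([a_t, s_t])  # [a_t; s_t], used to compute the next prediction

print(out.shape)  # (6,)
```

Note that `a_t` is a convex combination of the rows of `H`, so the decoder gets a direct, differentiable path back to every encoder state.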
Why attention?
- Attention improves MT performance significantly
- It solves the bottleneck problem
- It helps with the vanishing gradient problem by providing a shortcut to faraway states
- It provides interpretability in terms of attention patterns
Attention as QKV computation
- From the above example:
- The encoder hidden states $h_1, \dots, h_N$ are the keys and values
- The decoder hidden state $s_t$ is the query
Self-attention
- Usually attention is applied between the encoder and decoder for seq2seq learning
- For a single sequence, we have self-attention
- In RNNs, we do things in sequence, which is unparallelizable
- With self-attention, we can parallelize per layer, as all words interact at every layer
Let $w_1, \dots, w_n$ be the words in the sequence, with embeddings $x_1, \dots, x_n \in \mathbb{R}^d$
- Transform each word embedding with weight matrices $Q, K, V$, each in $\mathbb{R}^{d \times d}$: $q_i = Qx_i$, $k_i = Kx_i$, $v_i = Vx_i$
- Compute pairwise similarities between keys and queries, normalize with softmax: $e_{ij} = q_i^\top k_j$, $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}$
- Compute output for each word as a weighted sum of values: $o_i = \sum_j \alpha_{ij} v_j$
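In matrix form, all three steps parallelize across positions. A minimal NumPy sketch, with toy sizes ($n = 5$ words, $d = 4$) and random weights standing in for learned ones:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))          # word embeddings x_1..x_n, one per row
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv     # queries, keys, values for all words at once
scores = Q @ K.T                     # pairwise similarities e_ij = q_i . k_j
alpha = softmax(scores)              # row i is a distribution over positions j
O = alpha @ V                        # outputs o_i = sum_j alpha_ij v_j

print(O.shape)  # (5, 4): one output vector per word
```

Every position attends to every other in one matrix multiply, which is what makes self-attention parallelizable per layer, unlike an RNN's sequential recurrence.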
Transformer
Self-attention alone is not enough; we need to add:
- Positional embeddings
- Non-linearity
- Single to multi-head self attention
- Multiple layers
Positional embeddings
- We have positional embeddings $p_1, \dots, p_n$, and we just add them to the word embeddings: $x_i = \tilde{x}_i + p_i$
- We can make these positional embeddings learned parameters, but in doing so, we can't extrapolate to indices beyond the maximum length seen in training.
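One common non-learned alternative (an assumption here, since the notes don't pin down a specific scheme) is fixed sinusoidal positional embeddings, which are computed from a formula and so extend to any index:

```python
import numpy as np

def sinusoidal_positions(n, d):
    # Fixed sinusoidal positional embeddings: sin on even dims, cos on odd dims.
    # Being formula-based (not learned), they work for any position index.
    pos = np.arange(n)[:, None]            # positions 0..n-1, as a column
    dim = np.arange(0, d, 2)[None, :]      # even dimension indices, as a row
    angles = pos / (10000 ** (dim / d))    # frequencies decrease with dimension
    P = np.zeros((n, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

X = np.ones((6, 8))                       # toy word embeddings (made-up values)
X_pos = X + sinusoidal_positions(6, 8)    # just add position info to each embedding
print(X_pos.shape)  # (6, 8)
```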
Non-linearity
- Self-attention just re-averages the value vectors; there's no non-linearity in it
- We add a feed-forward network to post-process each output vector
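A sketch of such a position-wise feed-forward network, applied independently to each output vector; the sizes and random weights are made up (real models typically use a much larger hidden dimension):

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_ff = 4, 16
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

def ffn(x):
    # Two linear maps with a ReLU in between: this is where the
    # non-linearity comes from, since attention alone only averages values.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

O = rng.normal(size=(5, d))   # 5 toy attention output vectors
print(ffn(O).shape)  # (5, 4)
```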
Multi-head self-attention
- We just do self-attention multiple times in parallel, with different projection matrices for each head
- After we get each head's attention output, we concatenate them (and apply an output projection) to get the overall output
- Each head should focus on a different "feature" of the sentence
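A NumPy sketch of this: two heads, each with its own projections, whose outputs are concatenated and then projected. All sizes and weights are made up for illustration; real implementations compute all heads in one batched multiply rather than a loop.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
n, d, n_heads = 5, 8, 2
d_head = d // n_heads                      # each head works in a smaller subspace
X = rng.normal(size=(n, d))

head_outputs = []
for _ in range(n_heads):
    # Separate projection matrices per head, so each can attend differently
    Wq, Wk, Wv = (rng.normal(size=(d, d_head)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    alpha = softmax(Q @ K.T)               # this head's attention pattern
    head_outputs.append(alpha @ V)         # this head's output, shape (n, d_head)

Wo = rng.normal(size=(d, d))
O = np.concatenate(head_outputs, axis=-1) @ Wo   # concat heads, then project
print(O.shape)  # (5, 8)
```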
Add & Norm
- We add a residual connection to help the gradients flow better
- Instead of $X^{(i)} = \mathrm{Layer}(X^{(i-1)})$
- We let $X^{(i)} = X^{(i-1)} + \mathrm{Layer}(X^{(i-1)})$
- We also do layer norm to help models train faster
- Idea: cut down uninformative variation in hidden vector values by normalizing to zero mean and unit standard deviation within each layer
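Both pieces together, as a minimal sketch; the sublayer is a made-up stand-in for self-attention or the feed-forward network, and the learned gain/bias that real layer norm adds after normalizing is omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each hidden vector to zero mean and unit standard deviation
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def sublayer(x):
    # Toy stand-in for self-attention or the FFN (made-up weights)
    return x @ np.full((4, 4), 0.1)

X = np.arange(12.0).reshape(3, 4)       # toy hidden vectors
out = layer_norm(X + sublayer(X))       # residual connection, then layer norm
print(out.shape)  # (3, 4)
```

The residual term `X + sublayer(X)` means the identity path always survives, so gradients can flow straight through even if the sublayer's gradients are small.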