Problems with a word dictionary (as in Word2Vec)

  1. Can’t deal with unknown (UNK) words. Solution: BPE (see the toy sketch below)
  2. The same word always gets the same vector, regardless of context. Solution: BERT
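
A toy illustration of the two failure modes above (hand-written vocabularies, not a trained BPE merge table): a word-level dictionary must fall back to [UNK] for unseen words, while a subword vocabulary can still cover them via greedy longest-match segmentation.

```python
# Toy vocabularies for illustration only; real BPE/WordPiece vocabularies
# are learned from corpus statistics.
word_vocab = {"the", "cat", "sat"}
subword_vocab = {"the", "cat", "un", "happi", "ness", "s", "at"}

def word_lookup(word):
    # Word-level dictionary: anything unseen collapses to [UNK].
    return word if word in word_vocab else "[UNK]"

def subword_segment(word):
    # Greedy longest-match segmentation over the subword vocabulary.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in subword_vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("[UNK]")  # no subword covers this character
            i += 1
    return pieces

print(word_lookup("unhappiness"))      # [UNK]
print(subword_segment("unhappiness"))  # ['un', 'happi', 'ness']
```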

BERT

  • Uses WordPiece tokenization (similar to BPE)
  • BERT gives contextualized word representations via multi-layer self-attention
  • The result is a words-in-context representation
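
A quick way to see the contextualization, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available: the same surface word "bank" gets a different vector in each sentence.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, hidden_dim)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = word_vector("I deposited cash at the bank.", "bank")
v2 = word_vector("We sat on the grassy bank of the river.", "bank")
# Cosine similarity well below 1: the context changed the representation.
print(torch.cosine_similarity(v1, v2, dim=0))
```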

Transfer Learning

Two types of transfer learning:

  1. Transductive (same task; labeled data available only for the source task)
    • Domain adaptation (different domain)
    • Cross-lingual learning (different languages)
  2. Inductive (different task; labeled data available for the target task)
    • Multi-task learning (tasks learned simultaneously)
    • Sequential transfer learning (tasks learned sequentially)

Sequential Transfer Learning

  • We start with pre-trained embeddings and then fine-tune on a specific task
  • May or may not freeze the embeddings; if we leave them unfrozen, we can specialize them to the target task and domain (see the sketch below)
  • May or may not have a task-specific model on top (task-unified models go without one)
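
A minimal PyTorch sketch of the freeze-or-fine-tune choice; pretrained_matrix is a stand-in for real pre-trained vectors.

```python
import torch
import torch.nn as nn

# Stand-in for a real pre-trained embedding matrix (vocab_size x dim).
pretrained_matrix = torch.randn(10_000, 300)

# freeze=True keeps the embeddings fixed during fine-tuning;
# freeze=False lets them specialize to the target task and domain.
frozen = nn.Embedding.from_pretrained(pretrained_matrix, freeze=True)
tuned = nn.Embedding.from_pretrained(pretrained_matrix, freeze=False)

print(frozen.weight.requires_grad)  # False
print(tuned.weight.requires_grad)   # True
```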

Pre-training

  • We used to pre-train only the word embeddings; the rest of the network was initialized randomly
  • But now almost all of the parameters in NLP networks are initialized via pre-training
  • This is useful for:
    1. Building strong representations of language
    2. Parameter initializations for strong NLP models
    3. Probability distribution over language that we can sample from

Masked Language Modelling

  • Idea: replace some fraction of the words in the input with a special [MASK] token; predict these words
  • Let $\tilde{x}$ be the masked version of $x$; then we want to learn $p_\theta(x \mid \tilde{x})$
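
A minimal sketch of the objective, assuming a model that maps a (seq_len,) tensor of token ids to (seq_len, vocab_size) logits and a given mask_id: corrupt a fraction of positions to get $\tilde{x}$ and compute the loss only there.

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, token_ids, mask_id, mask_frac=0.15):
    corrupted = token_ids.clone()
    is_masked = torch.rand(token_ids.shape) < mask_frac
    corrupted[is_masked] = mask_id            # x~ : the masked version of x
    logits = model(corrupted)                 # (seq_len, vocab_size), assumed interface
    # Learn p(x | x~) by predicting the original tokens at the masked positions.
    return F.cross_entropy(logits[is_masked], token_ids[is_masked])
```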

Causal Language Modelling

  • Given a sequence of words $x_1, \dots, x_t$, compute the probability distribution of the next word: $p_\theta(x_{t+1} \mid x_1, \dots, x_t)$
  • where $x_{t+1}$ can be any word in the vocabulary
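
A minimal sketch under the same assumed model interface: the distribution over the next word given a prefix, and the standard teacher-forced training loss.

```python
import torch.nn.functional as F

def next_word_distribution(model, prefix_ids):
    logits = model(prefix_ids)                # (seq_len, vocab_size), assumed interface
    return F.softmax(logits[-1], dim=-1)      # p(x_{t+1} = w | x_1..x_t) for every w

def causal_lm_loss(model, token_ids):
    # Each position predicts the token that follows it (teacher forcing).
    logits = model(token_ids[:-1])
    return F.cross_entropy(logits, token_ids[1:])
```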

Pre-training Paradigms

Encoders (BERT)

  • We are doing MLM here
  • Predict a random 15% of (sub)word tokens
    • Replace input word with [MASK] 80% of the time
    • Replace input word with a random token 10% of the time
    • Leave input word unchanged 10% of the time (but it still has to be predicted; see the corruption sketch after this list)
  • To fine-tune on sentence classification, we use the hidden representation of the [CLS] token, then add a sentiment classification head
  • For token classification (e.g., NER), add a classification head on top of every token’s hidden representation
  • Improvements on BERT
    1. RoBERTa: train BERT for longer and remove next sentence prediction
    2. SpanBERT: masking contiguous span of words makes a harder, more useful pre-training task
  • But they don’t naturally lead to nice autoregressive generation methods
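
A sketch of the 80/10/10 corruption scheme from the list above; the model interface, mask_id, and vocab_size are assumptions, and -100 follows PyTorch’s default ignore_index for cross-entropy.

```python
import torch

def bert_corrupt(token_ids, mask_id, vocab_size, select_frac=0.15):
    labels = token_ids.clone()
    selected = torch.rand(token_ids.shape) < select_frac
    labels[~selected] = -100                  # only selected positions enter the loss

    corrupted = token_ids.clone()
    roll = torch.rand(token_ids.shape)
    corrupted[selected & (roll < 0.8)] = mask_id            # 80%: replace with [MASK]
    random_ids = torch.randint(vocab_size, token_ids.shape)
    swap = selected & (roll >= 0.8) & (roll < 0.9)          # 10%: random token
    corrupted[swap] = random_ids[swap]
    # Remaining 10%: left unchanged, but the model must still predict it.
    return corrupted, labels
```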

Decoders

  • We fine-tune them by training a classifier on the last word’s hidden state
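
A minimal sketch of that setup, assuming a decoder that returns (batch, seq_len, hidden_dim) hidden states.

```python
import torch.nn as nn

class DecoderClassifier(nn.Module):
    def __init__(self, decoder, hidden_dim, num_classes):
        super().__init__()
        self.decoder = decoder
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        hidden = self.decoder(token_ids)      # (batch, seq_len, hidden_dim), assumed
        return self.head(hidden[:, -1, :])    # classify from the last token's state
```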

Encoder-Decoders

  • The encoder portion benefits from bidirectional context
  • The decoder portion is used to train the whole model through language modelling

Decoding Strategies

Greedy

  • Idea: always select the next token with the highest probability
  • Problem: cannot correct previous mistakes
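
A minimal greedy-decoding sketch; model, bos_id, and eos_id are assumptions.

```python
import torch

def greedy_decode(model, bos_id, eos_id, max_len=50):
    ids = [bos_id]
    for _ in range(max_len):
        logits = model(torch.tensor(ids))     # (seq_len, vocab_size), assumed interface
        next_id = int(logits[-1].argmax())    # always take the most probable token
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```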

Beam search

  • Idea: on each step of the decoder, keep track of the k most probable partial translations (hypotheses)
  • Then we backtrack to get the path with the highest score divided by the number of words (length normalization)
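
A compact beam-search sketch with the length-normalized final score, under the same assumed model interface.

```python
import torch
import torch.nn.functional as F

def beam_search(model, bos_id, eos_id, k=4, max_len=50):
    beams = [([bos_id], 0.0)]                 # (token ids, sum of log-probs)
    finished = []
    for _ in range(max_len):
        candidates = []
        for ids, score in beams:
            log_probs = F.log_softmax(model(torch.tensor(ids))[-1], dim=-1)
            top_lp, top_id = log_probs.topk(k)
            for lp, tok in zip(top_lp.tolist(), top_id.tolist()):
                candidates.append((ids + [tok], score + lp))
        # Keep only the k most probable partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for ids, score in candidates[:k]:
            (finished if ids[-1] == eos_id else beams).append((ids, score))
        if not beams:
            break
    finished = finished or beams
    # Final score: total log-probability divided by the number of words.
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]
```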

Top-k sampling

  • Only sample from the top k tokens in the probability distribution
  • Increasing k yields more diverse but riskier outputs
  • Decreasing k yields safer but more generic outputs
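
A sketch of top-k sampling over a (vocab_size,) logits vector.

```python
import torch
import torch.nn.functional as F

def sample_top_k(logits, k=50):
    top_logits, top_ids = logits.topk(k)              # keep only the k most probable tokens
    probs = F.softmax(top_logits, dim=-1)             # renormalize over those k
    choice = torch.multinomial(probs, num_samples=1)  # sample one of them
    return int(top_ids[choice])
```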

Top-p sampling

  • Sample only from the smallest set of tokens whose cumulative probability mass reaches p
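
A sketch of top-p (nucleus) sampling over the same kind of logits vector.

```python
import torch
import torch.nn.functional as F

def sample_top_p(logits, p=0.9):
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = probs.sort(descending=True)
    # Smallest prefix of tokens whose cumulative mass reaches p.
    cutoff = int((sorted_probs.cumsum(dim=-1) < p).sum()) + 1
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(nucleus, num_samples=1)
    return int(sorted_ids[choice])
```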

Decoding with temperature

  • Recall that on timestep $t$, the model applies a softmax to the scores $s$ to produce the probability distribution $P_t(y_t = w) = \frac{\exp(s_w)}{\sum_{w' \in V} \exp(s_{w'})}$
  • We can apply a temperature hyperparameter $\tau$ to the softmax to rebalance it: $P_t(y_t = w) = \frac{\exp(s_w / \tau)}{\sum_{w' \in V} \exp(s_{w'} / \tau)}$
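
A small sketch of temperature applied to the scores before the softmax: raising $\tau$ flattens the distribution (more diverse samples), lowering it sharpens the distribution (closer to greedy).

```python
import torch
import torch.nn.functional as F

def softmax_with_temperature(scores, tau=1.0):
    return F.softmax(scores / tau, dim=-1)

scores = torch.tensor([2.0, 1.0, 0.5])
print(softmax_with_temperature(scores, tau=0.5))  # sharper, closer to greedy
print(softmax_with_temperature(scores, tau=2.0))  # flatter, closer to uniform
```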

Evaluation for NLG

  • Content overlap metrics provide a good starting point, but not good enough on their own
  • Model-based metrics can be more correlated with human judgement, but behavior is non-interpretable
  • Human judgements are critical, but humans are inconsistent