Problems with a word dictionary (as in Word2Vec)

  1. Can’t deal with unknown (UNK) words. Solution: BPE (see the toy sketch below)
  2. The same word always gets the same vector, regardless of context. Solution: BERT
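
A toy illustration of the two failure modes above (hand-written vocabularies, not a trained BPE merge table): a word-level dictionary must fall back to [UNK] for unseen words, while a subword vocabulary can still cover them via greedy longest-match segmentation.

```python
# Toy vocabularies for illustration only; real BPE/WordPiece vocabularies
# are learned from corpus statistics.
word_vocab = {"the", "cat", "sat"}
subword_vocab = {"the", "cat", "un", "happi", "ness", "s", "at"}

def word_lookup(word):
    # Word-level dictionary: anything unseen collapses to [UNK].
    return word if word in word_vocab else "[UNK]"

def subword_segment(word):
    # Greedy longest-match segmentation over the subword vocabulary.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in subword_vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("[UNK]")  # no subword covers this character
            i += 1
    return pieces

print(word_lookup("unhappiness"))      # [UNK]
print(subword_segment("unhappiness"))  # ['un', 'happi', 'ness']
```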

BERT

  • Uses WordPiece tokenization (similar to BPE)
  • BERT gives contextualized word representations via multi-layer self-attention
  • The result is a words-in-context representation
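
A quick way to see the contextualization, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available: the same surface word "bank" gets a different vector in each sentence.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, hidden_dim)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = word_vector("I deposited cash at the bank.", "bank")
v2 = word_vector("We sat on the grassy bank of the river.", "bank")
# Cosine similarity well below 1: the context changed the representation.
print(torch.cosine_similarity(v1, v2, dim=0))
```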

Transfer Learning

Two types of transfer learning:

  1. Transductive (same task; labeled data available only for the source task)
    • Domain adaptation (different domain)
    • Cross-lingual learning (different languages)
  2. Inductive (different task; labeled data available for the target task)
    • Multi-task learning (tasks learned simultaneously)
    • Sequential transfer learning (tasks learned sequentially)

Sequential Transfer Learning

  • We start with pre-trained embeddings and then fine-tune on a specific task
  • May or may not freeze the embeddings; if we leave them unfrozen, we can specialize them to the target task and domain (see the sketch below)
  • May or may not have a task-specific model on top (task-unified models go without one)
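
A minimal PyTorch sketch of the freeze-or-fine-tune choice; pretrained_matrix is a stand-in for real pre-trained vectors.

```python
import torch
import torch.nn as nn

# Stand-in for a real pre-trained embedding matrix (vocab_size x dim).
pretrained_matrix = torch.randn(10_000, 300)

# freeze=True keeps the embeddings fixed during fine-tuning;
# freeze=False lets them specialize to the target task and domain.
frozen = nn.Embedding.from_pretrained(pretrained_matrix, freeze=True)
tuned = nn.Embedding.from_pretrained(pretrained_matrix, freeze=False)

print(frozen.weight.requires_grad)  # False
print(tuned.weight.requires_grad)   # True
```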

Pre-training

  • We used to pre-train only the word embeddings; the rest of the network was initialized randomly
  • But now almost all of the parameters in NLP networks are initialized via pre-training
  • This is useful for:
    1. Building strong representations of language
    2. Parameter initializations for strong NLP models
    3. Probability distribution over language that we can sample from

Masked Language Modelling

  • Idea: replace some fraction of the words in the input with a special [MASK] token; predict these words
  • Let $\tilde{x}$ be the masked version of $x$; then we want to learn $p_\theta(x \mid \tilde{x})$
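
A minimal sketch of the objective, assuming a model that maps a (seq_len,) tensor of token ids to (seq_len, vocab_size) logits and a given mask_id: corrupt a fraction of positions to get $\tilde{x}$ and compute the loss only there.

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, token_ids, mask_id, mask_frac=0.15):
    corrupted = token_ids.clone()
    is_masked = torch.rand(token_ids.shape) < mask_frac
    corrupted[is_masked] = mask_id            # x~ : the masked version of x
    logits = model(corrupted)                 # (seq_len, vocab_size), assumed interface
    # Learn p(x | x~) by predicting the original tokens at the masked positions.
    return F.cross_entropy(logits[is_masked], token_ids[is_masked])
```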

Causal Language Modelling

  • Given a sequence of words $x_1, \dots, x_t$, compute the probability distribution of the next word: $p_\theta(x_{t+1} \mid x_1, \dots, x_t)$
  • where $x_{t+1}$ can be any word in the vocabulary
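
A minimal sketch under the same assumed model interface: the distribution over the next word given a prefix, and the standard teacher-forced training loss.

```python
import torch.nn.functional as F

def next_word_distribution(model, prefix_ids):
    logits = model(prefix_ids)                # (seq_len, vocab_size), assumed interface
    return F.softmax(logits[-1], dim=-1)      # p(x_{t+1} = w | x_1..x_t) for every w

def causal_lm_loss(model, token_ids):
    # Each position predicts the token that follows it (teacher forcing).
    logits = model(token_ids[:-1])
    return F.cross_entropy(logits, token_ids[1:])
```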

Pre-training Paradigms

Encoders (BERT)

  • We are doing MLM here
  • Predict a random 15% of (sub)word tokens
    • Replace input word with [MASK] 80% of the time
    • Replace input word with a random token 10% of the time
    • Leave input word unchanged 10% of the time (but it still has to be predicted; see the corruption sketch after this list)
  • To fine-tune on sentence classification, we use the hidden representation of the [CLS] token, then add a sentiment classification head
  • For token classification (e.g., NER), add a classification head on top of every token’s hidden representation
  • Improvements on BERT
    1. RoBERTa: train BERT for longer and remove next sentence prediction
    2. SpanBERT: masking contiguous span of words makes a harder, more useful pre-training task
  • But they don’t naturally lead to nice autoregressive generation methods
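
A sketch of the 80/10/10 corruption scheme from the list above; the model interface, mask_id, and vocab_size are assumptions, and -100 follows PyTorch’s default ignore_index for cross-entropy.

```python
import torch

def bert_corrupt(token_ids, mask_id, vocab_size, select_frac=0.15):
    labels = token_ids.clone()
    selected = torch.rand(token_ids.shape) < select_frac
    labels[~selected] = -100                  # only selected positions enter the loss

    corrupted = token_ids.clone()
    roll = torch.rand(token_ids.shape)
    corrupted[selected & (roll < 0.8)] = mask_id            # 80%: replace with [MASK]
    random_ids = torch.randint(vocab_size, token_ids.shape)
    swap = selected & (roll >= 0.8) & (roll < 0.9)          # 10%: random token
    corrupted[swap] = random_ids[swap]
    # Remaining 10%: left unchanged, but the model must still predict it.
    return corrupted, labels
```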

Decoders

  • We fine-tune them by training a classifier on the last word’s hidden state
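
A minimal sketch of that setup, assuming a decoder that returns (batch, seq_len, hidden_dim) hidden states.

```python
import torch.nn as nn

class DecoderClassifier(nn.Module):
    def __init__(self, decoder, hidden_dim, num_classes):
        super().__init__()
        self.decoder = decoder
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        hidden = self.decoder(token_ids)      # (batch, seq_len, hidden_dim), assumed
        return self.head(hidden[:, -1, :])    # classify from the last token's state
```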

Encoder-Decoders

  • The encoder portion benefits from bidirectional context
  • The decoder portion is used to train the whole model through language modelling

Decoding Strategies

Greedy

  • Idea: always select the next token with the highest probability
  • Problem: cannot correct previous mistakes
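
A minimal greedy-decoding sketch; model, bos_id, and eos_id are assumptions.

```python
import torch

def greedy_decode(model, bos_id, eos_id, max_len=50):
    ids = [bos_id]
    for _ in range(max_len):
        logits = model(torch.tensor(ids))     # (seq_len, vocab_size), assumed interface
        next_id = int(logits[-1].argmax())    # always take the most probable token
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```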

Beam search

  • Idea: on each step of the decoder, keep track of the k most probable partial translations (hypotheses)
  • Then we backtrack to get the path with the highest score divided by the number of words (length normalization)
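
A compact beam-search sketch with the length-normalized final score, under the same assumed model interface.

```python
import torch
import torch.nn.functional as F

def beam_search(model, bos_id, eos_id, k=4, max_len=50):
    beams = [([bos_id], 0.0)]                 # (token ids, sum of log-probs)
    finished = []
    for _ in range(max_len):
        candidates = []
        for ids, score in beams:
            log_probs = F.log_softmax(model(torch.tensor(ids))[-1], dim=-1)
            top_lp, top_id = log_probs.topk(k)
            for lp, tok in zip(top_lp.tolist(), top_id.tolist()):
                candidates.append((ids + [tok], score + lp))
        # Keep only the k most probable partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for ids, score in candidates[:k]:
            (finished if ids[-1] == eos_id else beams).append((ids, score))
        if not beams:
            break
    finished = finished or beams
    # Final score: total log-probability divided by the number of words.
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]
```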

Top-k sampling

  • Only sample from the top k tokens in the probability distribution
  • Increasing k yields more diverse but riskier outputs
  • Decreasing k yields safer but more generic outputs
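
A sketch of top-k sampling over a (vocab_size,) logits vector.

```python
import torch
import torch.nn.functional as F

def sample_top_k(logits, k=50):
    top_logits, top_ids = logits.topk(k)              # keep only the k most probable tokens
    probs = F.softmax(top_logits, dim=-1)             # renormalize over those k
    choice = torch.multinomial(probs, num_samples=1)  # sample one of them
    return int(top_ids[choice])
```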

Top-p sampling

  • Sample only from the smallest set of tokens whose cumulative probability mass reaches p
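
A sketch of top-p (nucleus) sampling over the same kind of logits vector.

```python
import torch
import torch.nn.functional as F

def sample_top_p(logits, p=0.9):
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = probs.sort(descending=True)
    # Smallest prefix of tokens whose cumulative mass reaches p.
    cutoff = int((sorted_probs.cumsum(dim=-1) < p).sum()) + 1
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(nucleus, num_samples=1)
    return int(sorted_ids[choice])
```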

Decoding with temperature

  • Recall that on timestep $t$, the model applies a softmax to the scores $s$ to produce the probability distribution $P_t(y_t = w) = \frac{\exp(s_w)}{\sum_{w' \in V} \exp(s_{w'})}$
  • We can apply a temperature hyperparameter $\tau$ to the softmax to rebalance it: $P_t(y_t = w) = \frac{\exp(s_w / \tau)}{\sum_{w' \in V} \exp(s_{w'} / \tau)}$
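
A small sketch of temperature applied to the scores before the softmax: raising $\tau$ flattens the distribution (more diverse samples), lowering it sharpens the distribution (closer to greedy).

```python
import torch
import torch.nn.functional as F

def softmax_with_temperature(scores, tau=1.0):
    return F.softmax(scores / tau, dim=-1)

scores = torch.tensor([2.0, 1.0, 0.5])
print(softmax_with_temperature(scores, tau=0.5))  # sharper, closer to greedy
print(softmax_with_temperature(scores, tau=2.0))  # flatter, closer to uniform
```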

Evaluation for NLG

  • Content overlap metrics provide a good starting point, but not good enough on their own
  • Model-based metrics can be more correlated with human judgement, but behavior is non-interpretable
  • Human judgements are critical, but humans are inconsistent