History

  • We had WordNet, which contains synonym sets and hypernyms (“is a” relationships)
    • Problems: misses nuance and new meanings of words, impossible to keep up-to-date, subjective
  • We had discrete representations, where each word is a one-hot vector
    • Problem: one-hot vectors are orthogonal, so we cannot calculate word similarity
  • Word vectors: dense vectors, chosen such that a word’s vector is similar to the vectors of words that appear in similar contexts

Representing words by their context

  • When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window)
  • Idea: Use many contexts of w to build up a representation of w
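A minimal sketch of what “context” means here, assuming a toy sentence and a window size of 2 (both are illustrative, not from the notes):

```python
# Sketch: collect the context words around each position in a sentence,
# using a fixed-size window (here, 2 words on each side).
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

for i, w in enumerate(sentence):
    left = sentence[max(0, i - window):i]
    right = sentence[i + 1:i + 1 + window]
    print(f"word={w!r:10} context={left + right}")
```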

Word2Vec

  1. Continuous Bag of Words (CBOW): look at the words before and after (the context) to predict the center word and build the embedding
  2. Skipgram: instead of guessing a word from its context, we guess the neighboring words from the current (center) word — see the example pairs below
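A rough sketch of how the same window produces training examples for both variants (the example sentence and window size are assumptions for illustration):

```python
# Sketch: CBOW and skip-gram use the same windows, with inputs and targets swapped.
sentence = "we guess the neighboring words".split()
window = 2

cbow_examples, skipgram_examples = [], []
for i, center in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    cbow_examples.append((context, center))                   # context -> center word
    skipgram_examples.extend((center, o) for o in context)    # center word -> each context word

print(cbow_examples[1])
print(skipgram_examples[:4])
```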

How to use skipgram

  • We can calculate how “likely” the surrounding words are, given the current (center) word.
  • The objective function is the negative log likelihood: J(θ) = −(1/T) Σ_t Σ_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t)
  • We still need to calculate P(o | c), where we use two vectors per word w:
    1. v_w when w is the center word
    2. u_w when w is the context word
  • Then, for a center word c and a context word o (it is just a softmax!): P(o | c) = exp(u_o · v_c) / Σ_w exp(u_w · v_c) — see the sketch below
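A minimal numpy sketch of this softmax probability, using separate center (V) and context (U) embedding matrices; the toy vocabulary, dimension, and random initialization are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
d = 8                                    # embedding dimension (illustrative)
V = rng.normal(size=(len(vocab), d))     # v_w: used when w is the center word
U = rng.normal(size=(len(vocab), d))     # u_w: used when w is a context word

def p_o_given_c(o: int, c: int) -> float:
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c): a softmax over the vocabulary."""
    scores = U @ V[c]                    # u_w . v_c for every word w
    scores -= scores.max()               # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

print(p_o_given_c(vocab.index("cat"), vocab.index("sat")))
```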

Skipgram w/ negative sampling

  • Change the task from predicting the neighboring words (which is a softmax problem) into checking if two words are neighbors (which is a binary classification problem)
  • We use logistic regression, which is simpler and much faster to calculate
  • We need to introduce negative samples into the dataset — randomly sampled from the vocabulary
  • Maximize the probability that the real outside word appears, and minimize the probability that random words appear around the center word (the negative samples w_1, …, w_k are random words)
  • Hyperparameters: window size (default = 5) and number of negative samples (5–20, but usually also 5); a sketch of the loss follows below
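A sketch of the negative-sampling loss for one (center, outside) pair with k random negatives; the matrix sizes and the uniform sampling here are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d, k = 10_000, 100, 5                  # k = number of negative samples (illustrative)
V = rng.normal(scale=0.1, size=(vocab_size, d))    # center-word vectors v_w
U = rng.normal(scale=0.1, size=(vocab_size, d))    # context-word vectors u_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(c: int, o: int) -> float:
    """Binary classification: the real (c, o) pair gets label 1, k random words get label 0."""
    negatives = rng.integers(0, vocab_size, size=k)    # word2vec samples from a unigram^0.75 distribution
    pos = np.log(sigmoid(U[o] @ V[c]))                 # maximize prob. the real outside word appears
    neg = np.log(sigmoid(-U[negatives] @ V[c])).sum()  # minimize prob. random words appear
    return -(pos + neg)                                # negative log likelihood to minimize

print(neg_sampling_loss(42, 1337))
```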

Problems with Word2Vec / GloVe

  1. Out-of-vocabulary (OOV) words
  2. Morphology: words with the same root, such as “eat” and “eaten”, don’t share any parameters. Solution: FastText builds embeddings of subwords (character n-grams) instead, and then sums them up (see the sketch below)
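A rough sketch of the FastText idea of summing character n-gram embeddings; the plain dictionary lookup and sizes are simplifications (real FastText hashes n-grams into a fixed-size table):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

def char_ngrams(word: str, n_min: int = 3, n_max: int = 6):
    """Character n-grams of the word, with boundary markers, plus the full word itself."""
    w = f"<{word}>"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]
    return grams + [w]

# Toy subword embedding table (assumption: a dict instead of a hashed table).
subword_vectors: dict[str, np.ndarray] = {}

def embed(word: str) -> np.ndarray:
    vecs = [subword_vectors.setdefault(g, rng.normal(size=d)) for g in char_ngrams(word)]
    return np.sum(vecs, axis=0)

# "eat" and "eaten" share n-grams like "<ea" and "eat", so their vectors share parameters.
print(char_ngrams("eaten"))
print(embed("eat").shape)
```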

Evaluation of word vectors

Intrinsic evaluation

  • Make sure that word vector analogies make sense
    • Take the cosine similarity of the word vectors and check whether the results make sense (see the sketch below)
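A minimal sketch of cosine similarity and the analogy check; the embeddings here are random placeholders, whereas in practice they would be the trained vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(emb: dict, a: str, b: str, c: str) -> str:
    """a : b :: c : ?  Return the word whose vector is closest to (b - a + c)."""
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(emb[w], target))

# Toy embeddings (random placeholders, for shape only).
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["king", "queen", "man", "woman", "apple"]}
print(analogy(emb, "man", "king", "woman"))   # with real trained vectors this should be "queen"
```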

Extrinsic evaluation

  • Use the word vectors in a downstream task (e.g., as input to a neural network) and measure performance on that task
  • Note that the window size is fixed at a certain number.

Language Modelling

  • A language model takes a list of words and attempts to predict the word that follows them

Perplexity

  • This is equal to the exponential of the cross-entropy loss: perplexity = exp(−(1/T) Σ_t log P(w_t | w_1, …, w_{t−1}))
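A small sketch of that relation, using made-up per-token probabilities:

```python
import numpy as np

# Made-up probabilities the model assigned to each actual next word.
token_probs = np.array([0.2, 0.05, 0.4, 0.1])

cross_entropy = -np.mean(np.log(token_probs))   # average negative log likelihood per token
perplexity = np.exp(cross_entropy)              # exponential of the cross-entropy loss
print(cross_entropy, perplexity)
```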

N-grams language model

Problems & Solutions

  1. Unknown sequences (sparsity); solved with smoothing and backoff (see the bigram sketch below)
  2. Storage: increasing n or increasing the corpus increases the model size
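A bigram model sketch with add-one (Laplace) smoothing, to make the sparsity issue concrete; the corpus and counts are toy assumptions:

```python
from collections import Counter

corpus = "the cat sat on the mat . the cat ate .".split()
vocab = set(corpus)

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2: str, w1: str, alpha: float = 1.0) -> float:
    """P(w2 | w1) with add-one smoothing, so unseen bigrams still get nonzero probability."""
    return (bigram[(w1, w2)] + alpha) / (unigram[w1] + alpha * len(vocab))

print(p_bigram("sat", "cat"))   # seen bigram
print(p_bigram("mat", "cat"))   # unseen bigram: nonzero only because of smoothing
```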

Neural language model

  • Similar to NER, but instead of predicting the class of the center word, we want to predict the next word
  • With this, we don’t have the sparsity problem anymore, and we don’t have to store all n-grams
  • Problems (a sketch of the fixed-window architecture follows the list):
  1. Fixed window size
  2. Enlarging the window enlarges the weight matrix W
  3. The window can never be large enough
  4. The input words x^(1), x^(2), … are multiplied by completely different weights in W, so there is no symmetry in how the inputs are processed
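A fixed-window neural LM sketch in numpy that makes these problems visible: the window is hard-coded, W grows linearly with the window, and each position hits its own slice of W. The shapes and initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d, window, hidden = 5_000, 64, 4, 128

E = rng.normal(scale=0.1, size=(vocab_size, d))        # word embeddings
W = rng.normal(scale=0.1, size=(window * d, hidden))   # grows linearly with the window size
U = rng.normal(scale=0.1, size=(hidden, vocab_size))   # output projection

def next_word_logits(context_ids: list[int]) -> np.ndarray:
    """Concatenate the window's embeddings, apply a hidden layer, score every vocabulary word."""
    assert len(context_ids) == window                   # the window size is fixed
    x = np.concatenate([E[i] for i in context_ids])     # each position lands on a different part of W
    h = np.tanh(x @ W)
    return h @ U                                        # unnormalized scores over the vocabulary

print(next_word_logits([1, 2, 3, 4]).shape)
```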