Matrix Multiplications

I use einsum.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = np.einsum('ij, jk -> ik', A, B)  # same as A @ B
```

Softmax
In math, $\mathrm{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$.

```python
def softmax(x, axis=-1):
    # We calculate e^x for each element; subtracting the max does not
    # change the result but keeps np.exp from overflowing (for stability)
    e_x = np.exp(x - x.max(axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)
```

Attention
Beware of the dimensions. Let $x_1, \dots, x_n$ be the sequence of words in vocabulary $V$. For each $x_i$, let $e_i = x_i E$, where $E \in \mathbb{R}^{|V| \times d}$ is an embedding matrix. Remember that $X$ is a one-hot matrix in $\mathbb{R}^{n \times |V|}$, so $XE$ is a look-up to produce $\mathbb{R}^{n \times d}$.
Therefore, $XE \in \mathbb{R}^{n \times d}$, where $n$ is the seq len (or ctx window), and $d$ is the dimension, or d_model.
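A minimal sketch of the look-up view (the vocab size, d_model, and token ids here are made-up toy values):

```python
import numpy as np

vocab_size, n, d_model = 10, 4, 8   # toy sizes, illustrative only
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d_model))  # embedding matrix, (|V|, d)

token_ids = np.array([3, 1, 4, 1])          # a sequence of n token ids
X = np.eye(vocab_size)[token_ids]           # one-hot matrix, (n, |V|)

# The matmul X @ E is exactly a row look-up into E
emb_matmul = X @ E                          # (n, d)
emb_lookup = E[token_ids]                   # same result, no matmul
assert np.allclose(emb_matmul, emb_lookup)
```

In practice, the look-up form is what frameworks implement, since multiplying by a one-hot matrix wastes work.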
Then, the attention mechanism.
Initially, the matrices $W_Q, W_K, W_V$ are of the same size, $\mathbb{R}^{d \times d_k}$. Only $W_V$ may be different, at $\mathbb{R}^{d \times d_v}$.
Once we multiply with the QKV weights, we get $Q, K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$. We then calculate the attention scores with $QK^\top / \sqrt{d_k}$. Since $\sqrt{d_k}$ is a scalar, the softmax-ed matrix is of $\mathbb{R}^{n \times n}$.
Softmax is applied row-wise: every row in the matrix now sums up to 1. The resulting matrix is still $\mathbb{R}^{n \times n}$.
Finally, we do matrix multiplication with $V$, to get $\mathbb{R}^{n \times d_v}$.
```python
def attn(Q, K, V, mask=None):
    d_k = K.shape[-1]
    # QK^T: contract the last axis of Q against the last axis of K
    scores = np.einsum('...ij,...kj->...ik', Q, K)
    scores /= np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions ~0 after softmax
    weights = softmax(scores, axis=-1)
    return np.einsum('...ik,...kj->...ij', weights, V)
```

Read Lecture 4 — Attention & Transformer for more details.
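A quick shape check of the dimension walk-through above, with made-up toy sizes (the weight names `W_Q`, `W_K`, `W_V` are illustrative; the definitions are repeated so the snippet runs standalone):

```python
import numpy as np

def softmax(x, axis=-1):
    e_x = np.exp(x - x.max(axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

def attn(Q, K, V, mask=None):
    d_k = K.shape[-1]
    scores = np.einsum('...ij,...kj->...ik', Q, K) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores, axis=-1) @ V

n, d, d_k, d_v = 4, 8, 8, 8          # toy seq len and dims
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))          # embedded sequence, (n, d)
W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V  # (n, d_k), (n, d_k), (n, d_v)
causal = np.tril(np.ones((n, n), dtype=bool))  # only attend to past positions
out = attn(Q, K, V, mask=causal)
print(out.shape)  # (4, 8), i.e. (n, d_v)
```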
- Need to write about multi-headed attention and FFN, along with the dimensional changes.
- The dimensionality of the QKV weights will be slightly different, as they need one extra dimension for the num_heads, or $h$. But $d_k$ and $d_v$ can be smaller!
What’s important is that everything must return to $\mathbb{R}^{n \times d}$, until the end, where you multiply with the unembedding matrix $W_U \in \mathbb{R}^{d \times |V|}$ to get $\mathbb{R}^{n \times |V|}$, the logits, ready for you to do sampling on.
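A minimal sketch of that last step (the hidden states `H` and the unembedding matrix `W_U` are assumed names, with toy sizes):

```python
import numpy as np

n, d, vocab_size = 4, 8, 10             # toy sizes, illustrative only
rng = np.random.default_rng(0)
H = rng.normal(size=(n, d))             # final hidden states, (n, d)
W_U = rng.normal(size=(d, vocab_size))  # unembedding matrix, (d, |V|)

logits = H @ W_U                        # (n, |V|)

# Sampling the next token uses only the last position's logits
probs = np.exp(logits[-1] - logits[-1].max())
probs /= probs.sum()
next_token = rng.choice(vocab_size, p=probs)
```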