Matrix Multiplications

I use einsum.

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# 'ij, jk -> ik' contracts over the shared index j: standard matrix multiplication
C = np.einsum('ij, jk -> ik', A, B)
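
As a quick sanity check (nothing assumed beyond NumPy here), the einsum contraction should match the plain @ product:

assert np.allclose(C, A @ B)   # C == [[19, 22], [43, 50]]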

Softmax

In math, $\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$.

def softmax(x, axis=-1):
    # Exponentiate after subtracting the max along the axis, for numerical stability
    e_x = np.exp(x - x.max(axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)
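
A quick usage check with a made-up score matrix (a minimal sketch, nothing assumed beyond the function above): applied row-wise, every row comes out summing to 1.

scores = np.array([[1.0, 2.0, 3.0],
                   [10.0, 10.0, 10.0]])
probs = softmax(scores, axis=-1)
print(probs.sum(axis=-1))  # [1. 1.]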

Attention

Beware of the dimensions. Let $x_1, \dots, x_n$ be the sequence of words in vocabulary $V$. For each $x_i$, let $e_i = x_i W_E$, where $W_E \in \mathbb{R}^{|V| \times d}$ is an embedding matrix. Remember that $X$ is a one-hot matrix in $\{0, 1\}^{n \times |V|}$, so $X W_E$ is a look-up to produce $E$.

Therefore, $E \in \mathbb{R}^{n \times d}$, where $n$ is the seq len (or ctx window), and $d$ is the dimension, or d_model.
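
A tiny sketch of that look-up with toy sizes (the numbers, and the use of np.eye to build the one-hot matrix, are my own illustration):

vocab, n, d = 5, 3, 4                  # |V|, seq len, d_model (toy values)
W_E = np.random.randn(vocab, d)        # embedding matrix, |V| x d
ids = np.array([2, 0, 4])              # token indices of the sequence
X = np.eye(vocab)[ids]                 # one-hot matrix, n x |V|
E = X @ W_E                            # n x d
assert np.allclose(E, W_E[ids])        # the matmul really is just a row look-up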

Then, the attention mechanism.

Initially, the weight matrices $W_Q$, $W_K$, $W_V$ are of the same size, $\mathbb{R}^{d \times d_k}$. Only $W_V$ may be different, at $\mathbb{R}^{d \times d_v}$.

Once we multiply $E$ with the QKV weights, we get $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$, and $V \in \mathbb{R}^{n \times d_v}$. We then calculate the attention scores, with $\frac{Q K^\top}{\sqrt{d_k}}$. Since $\sqrt{d_k}$ is a scalar, the softmax-ed matrix is of $\mathbb{R}^{n \times n}$.

Softmax is applied row-wise: every row in the matrix now sums up to 1. The resulting matrix is still $\mathbb{R}^{n \times n}$.

Finally, we do matrix multiplication with $V \in \mathbb{R}^{n \times d_v}$, to get an output in $\mathbb{R}^{n \times d_v}$.

def attn(Q, K, V, mask=None):
    d_k = K.shape[-1]
    # Q K^T: contract over the feature dimension, keeping any leading batch dims
    scores = np.einsum('...ij,...kj->...ik', Q, K)
    scores /= np.sqrt(d_k)
    if mask is not None:
        # Where mask is False, push the score to -1e9 so it softmaxes to ~0
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return np.einsum('...ik,...kj->...ij', weights, V)
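
A usage sketch with random inputs and a causal mask (the sizes and the np.tril mask construction are my own assumptions here):

n, d_k, d_v = 6, 8, 8
Q = np.random.randn(n, d_k)
K = np.random.randn(n, d_k)
V = np.random.randn(n, d_v)
causal = np.tril(np.ones((n, n), dtype=bool))  # True = position may be attended to
out = attn(Q, K, V, mask=causal)
print(out.shape)  # (6, 8), i.e. n x d_v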

Read Lecture 4 — Attention & Transformer for more details.

  • Need to write about multi-headed attention and FFN, along with the dimensional changes.
    • The dimensionality of the QKV weights will be slightly different, as they need one extra dimension for the num_heads, or $h$. But $d_k$ and $d_v$ can be smaller! (See the sketch below.)
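
As a stopgap until that section is written, here is a minimal sketch of the multi-head shape bookkeeping. It assumes the common convention $d_k = d_v = d_{\text{model}} / h$; the variable names are mine, not from the lecture.

n, d_model, num_heads = 6, 32, 4
d_k = d_model // num_heads                        # each head is smaller: d_model / h
X = np.random.randn(n, d_model)
W_Q = np.random.randn(num_heads, d_model, d_k)    # extra leading dimension for the heads
Q = np.einsum('nd,hdk->hnk', X, W_Q)              # (num_heads, n, d_k)
heads = np.random.randn(num_heads, n, d_k)        # stand-in for the per-head attn outputs
concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenating heads returns to n x d_model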

What’s important is that everything must return to $\mathbb{R}^{n \times d}$, until the end, where you multiply with the unembedding matrix $W_U \in \mathbb{R}^{d \times |V|}$ to get $\mathbb{R}^{n \times |V|}$, the logits, ready for you to do sampling on.
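
A minimal sketch of that final step (the name $W_U$ and all sizes here are illustrative assumptions):

n, d, vocab = 6, 32, 100
H = np.random.randn(n, d)             # final hidden states, n x d
W_U = np.random.randn(d, vocab)       # unembedding matrix, d x |V|
logits = H @ W_U                      # n x |V|
probs = softmax(logits[-1])           # next-token distribution from the last position
next_token = np.random.choice(vocab, p=probs)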