2.3 Single attention head

Building a context-specific embedding of a token.


Diagram 2.3.0: The Transformer, Vaswani et al. (2017)

Purpose

During utilisation, attention is concerned with determining how strongly tokens are interrelated, based on specific types of relations.
For example, in the context of translation, words cannot be translated word-by-word independently, due to differences in sentence structures across languages, such as English’s subject-verb-object pattern:
“The quick brown fox jumped over the lazy dog.”

Compare this to the subject-object-verb pattern found in other languages (e.g. Japanese):
“素早い茶色のキツネが怠け者の犬を飛び越えました。”

Note that the が particle marks the subject, を marks the object (which precedes the verb), and the sentence ends in the verb, ました being the polite past-tense conjugation.

As a result, the grammatical relations between the words must be captured, so that they can be encoded into each token's vector.

Theory

This encoding is performed by multiplying the vector embedding of each token, x_i, as it exists up to this point (note the arrows in diagram 2.3.0), by three weight matrices:

\textbf{q}_i = \textbf{x}_i \textbf{W}^Q

\textbf{k}_i = \textbf{x}_i \textbf{W}^K

\textbf{v}_i = \textbf{x}_i \textbf{W}^V

The vector q stands for query, k for key, and v for value.
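
As a rough sketch of these three projections (using NumPy, with random placeholder weights rather than trained values), the computation for a single token is just three small matrix-vector products:

```python
import numpy as np

# A minimal sketch of the q/k/v projections for one token (weights are random
# placeholders here, not trained values).
d_model, d_k = 4, 4                      # toy dimensionalities for illustration

rng = np.random.default_rng(0)
x_i = rng.normal(size=(d_model,))        # the token's embedding up to this point
W_Q = rng.normal(size=(d_model, d_k))    # query weight matrix W^Q
W_K = rng.normal(size=(d_model, d_k))    # key weight matrix W^K
W_V = rng.normal(size=(d_model, d_k))    # value weight matrix W^V

q_i = x_i @ W_Q                          # q_i = x_i W^Q
k_i = x_i @ W_K                          # k_i = x_i W^K
v_i = x_i @ W_V                          # v_i = x_i W^V
print(q_i, k_i, v_i)
```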

Training

Note that Transformers can typically have multiple tokens input at once.

Therefore, during training, the above process can be scaled to a large quantity of vector embeddings (one embedding per token) in the training input by fitting them into one matrix X, which results in matrices Q, K, and V.

Mathematically, in the Vaswani et al. Transformer, a single self-attention head (with the decoder's masking matrix M included) calculates its output as follows:

attention(\textbf{Q}, \textbf{K}, \textbf{V}) = softmax\left( \frac{\textbf{Q} \textbf{K}^{T}}{\sqrt{d_{k}}} + \textbf{M} \right) \textbf{V}

When tokens are eventually predicted (e.g. after a large quantity of input has been processed), the prediction error is used to adjust the weights, including W^Q, W^K, and W^V.

Fitting the vectors into matrices allows the computation, during training, to scale well to the available hardware (SIMD-style parallel processors, such as GPUs), since the operations can be performed concurrently via matrix algebra.

As a result of the above, the matrix M, seen in the attention(Q, K, V) function, becomes relevant. M stands for masking; it causes tokens that follow the token being queried (q_i) to be ignored during the matrix algebra, by setting their attention scores to negative infinity, which in turn causes the softmax() function to assign those following tokens a weight of zero. An example of M can be seen in the upcoming step-by-step example.
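
As an illustration only (not the GPT-2 implementation referenced later), masked single-head attention over a whole input matrix could be sketched in NumPy as follows; note that the softmax is applied across each row, i.e. over each query's scores:

```python
import numpy as np

def softmax(x):
    # subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Single-head masked self-attention: softmax(QK^T / sqrt(d_k) + M) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # raw attention scores, shape (n, n)
    # M is 0 on and below the diagonal and -inf above it, so each token can only
    # attend to itself and to the tokens that precede it.
    n = scores.shape[0]
    M = np.triu(np.full((n, n), -np.inf), k=1)
    weights = softmax(scores + M)                  # row-wise probability distributions
    return weights @ V                             # weighted sums of the value vectors

# toy usage: 6 tokens, 4 dimensions, random Q, K, V
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
print(attention(Q, K, V).shape)                    # (6, 4)
```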

Utilisation

During utilisation, tokens are generated one at a time. Although the initial full prompt can be run through the Transformer in a single pass, once tokens begin to be generated, each new token must be run through individually. Considering each vector individually (i.e. q_i instead of Q), with x_i as the token under focus and x_j as each token that precedes it, the utilisation variant of attention() can instead be thought of as repeated instances of the following computation occurring:

strength(\textbf{q}_i, \textbf{k}_j) = \frac{\textbf{q}_i \cdot \textbf{k}_j^{T}}{\sqrt{d_{k}}}

\alpha_{ij} = \frac{e^{strength(\textbf{q}_i, \textbf{k}_j)}}{\sum_{m=0}^{i} e^{strength(\textbf{q}_i, \textbf{k}_m)}} \qquad \forall\, j \leq i

head(\textbf{x}_i) = \sum_{j=0}^{i} \alpha_{ij} \textbf{v}_j

where strength() measures the strength of the relation between the vector embeddings of the tokens, and d_k is the dimensionality of the k vectors.
Consider that the dot product between q_i and k_j indicates how closely these two vectors are related.
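
A minimal sketch of this per-token (utilisation-time) computation, assuming the keys and values of the token and all preceding tokens have been kept around (the function name and caching scheme here are illustrative, not taken from a particular implementation):

```python
import numpy as np

def head_for_token(q_i, K_prefix, V_prefix):
    """Attention output for one token, given keys/values for tokens 0..i.

    Mirrors strength(q_i, k_j), alpha_ij, and head(x_i) from the formulas above.
    """
    d_k = K_prefix.shape[-1]
    strengths = (q_i @ K_prefix.T) / np.sqrt(d_k)   # strength(q_i, k_j) for each j <= i
    alphas = np.exp(strengths)
    alphas /= alphas.sum()                           # normalised weights alpha_ij
    return alphas @ V_prefix                         # sum_j alpha_ij * v_j

# toy usage: token i = 3 attends to itself and the 3 tokens before it
rng = np.random.default_rng(0)
K_prefix = rng.normal(size=(4, 4))   # k_0 .. k_3
V_prefix = rng.normal(size=(4, 4))   # v_0 .. v_3
q_3 = rng.normal(size=(4,))
print(head_for_token(q_3, K_prefix, V_prefix))       # one context-aware vector
```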

Meaning

You may be wondering: what relation strength, exactly, does the dot product represent? Different values of W^Q, W^K, and W^V will emphasise and de-emphasise different parts of the vector embedding x_i and of the preceding vector embeddings x_j. For example, the attention head may be oriented around identifying the subject and verb of the input sequence, with the query q_i of a verb token probing the preceding keys k_j for metadata relating to nouns and verbs.[1] The normalised strength() scores are then multiplied by the corresponding v_j and summed, which essentially embeds data from the vector embeddings of preceding tokens into the vector embedding of the token under focus.

It is difficult to say exactly what data is extracted from a vector embedding via W^V without debugging a trained Transformer which uses a statistics-based dictionary of vector embeddings. During training, weights are frequently initialised at random and adjusted solely to improve model output, so they may not hold intrinsic meaning independently.


Diagram 2.3.1: a non-numerical example of how a weight matrix may extract relevant metadata from vectors.

Application of theory

Training

The training process of the self-attention mechanism can be broken down into a step-by-step sequence. Note that the numbers in the following example are randomly generated (it is difficult to construct weights by hand that encode specific, meaningful relations).

Example 1
  1. All the input vectors x_i (such as those output from the positional encoding step) are set into a matrix X; for example, 6 input vectors of dimension 4 would be arranged as a 6x4 matrix.
x_0 0.1 -0.2 0.1 0.2
x_1 0.8 0.8 -0.9 0.5
x_2 -0.6 0.4 0.5 -0.9
x_3 0.9 -0.7 0.8 0.7
x_4 0.4 0.3 0.4 0.4
x_5 0.8 0.2 0.9 -0.2

Sample matrix X, above, could represent an input sequence such as the sentence “List new ideas for a song”, one word per x_i row, in the simple case of 4 dimensions per token.

  2. The matrix X is multiplied by 3 weight matrices of equal dimensions (4x4 in this example): one matrix W^Q representing query weights, one matrix W^K representing key weights, and one matrix W^V representing value weights, to generate three new matrices Q, K, and V.

Q = XW^Q

K = XW^K

V = XW^V

3.4 -2.1 0.8 1.9
-1.2 0.5 3.7 -0.8
2.9 1.1 -3.4 0.6
0.7 2.8 1.3 -2.5

Sample matrix W^Q, above.

q_0 1.01 0.36 -0.74 -0.09
q_1 -0.50 -0.87 4.49 -0.23
q_2 2.17 -1.99 -3.92 1.71
q_3 1.43 2.59 1.97 -3.19
q_4 0.67 0.98 0.61 -0.31
q_5 1.97 0.52 4.01 -0.51

Sample matrix Q, above.

  3. The dot product of each query vector q_i (a query vector being a row of Q) with every key vector k_j is calculated via the transposed K^T, to generate attention scores: QK^T. A higher attention score between a given q_i and a given k_j indicates that the query q_i is more similar to that k_j, and therefore that there is a stronger relation.

attention scores = QK^{T}

0.15 1.94 -0.98 2.53 1.24 2.18
0.16 1.44 0.64 -2.33 0.81 0.44
0.25 0.35 1.25 2.24 1.14 3.25
0.28 1.15 -0.81 1.93 -0.81 -0.28

Sample matrix K^T, above.

-0.01 2.12 -1.62 3.45 1.83 3.29
-1.37 0.51 7.15 -3.39 2.35 2.99
3.49 -6.19 -9.39 11.19 -3.19 -5.67
2.23 3.19 2.35 -0.67 1.35 4.37
0.83 1.42 0.95 0.27 0.69 1.83
3.35 1.35 10.93 -0.95 3.35 5.95

Sample matrix QK^T, above.

  4. These attention scores are then scaled by dividing by the square root of the key dimensionality d_k (in this example, √4 = 2): \frac{QK^{T}}{\sqrt{d_{k}}}.
-0.005 1.06 -0.81 1.725 0.915 1.645
-0.685 0.255 3.575 -1.695 1.175 1.495
1.745 -3.095 -4.695 5.595 -1.595 -2.835
1.115 1.595 1.175 -0.335 0.675 2.185
0.415 0.71 0.475 0.135 0.345 0.915
1.675 0.675 5.465 -0.475 1.675 2.975

Sample matrix QK^T / √d_k, above, where all cells of QK^T have been divided by √4 = 2.

  5. In the decoder, a masking matrix M is used to force infinitely negative attention scores between tokens being queried and future tokens, such that only attention scores between a given token and previously generated tokens are used: \frac{QK^{T}}{\sqrt{d_{k}}} + M.
0 -∞ -∞ -∞ -∞ -∞
0 0 -∞ -∞ -∞ -∞
0 0 0 -∞ -∞ -∞
0 0 0 0 -∞ -∞
0 0 0 0 0 -∞
0 0 0 0 0 0

Sample matrix M, above.

  6. Softmax is run on the attention scores, once per row (i.e. once per query token), nullifying the infinitely negative scores and turning each row of attention scores into a probability distribution: softmax\left( \frac{QK^{T}}{\sqrt{d_{k}}} + M \right).
1 0 0 0 0 0
0.281 0.719 0 0 0 0
0.991 0.008 0.002 0 0 0
0.256 0.413 0.271 0.060 0 0
0.196 0.264 0.208 0.148 0.183 0
0.020 0.007 0.878 0.002 0.020 0.073

Sample matrix after softmax() has been applied; note that the cells in each row add up to 1, and that masked (future) positions have a weight of 0.

  7. Each row of probabilities is then multiplied by the value matrix V, so that each token's output is a weighted sum of the value vectors of the input tokens with the largest influence on it: softmax\left( \frac{QK^{T}}{\sqrt{d_{k}}} + M \right)V. A short code snippet reproducing steps 4 to 7 follows this example.
0.43 0.29 0.64 0.83
1.15 1.19 -0.28 0.67
-0.21 0.64 0.59 -0.82
1.19 -0.45 0.86 0.99
0.51 0.41 0.51 0.51
1.19 0.28 0.95 -0.15

Sample matrix V, above.

head_0 0.43 0.29 0.64 0.83
head_1 0.95 0.94 -0.02 0.71
head_2 0.43 0.30 0.63 0.83
head_3 0.60 0.71 0.26 0.33
head_4 0.61 0.51 0.40 0.41
head_5 -0.07 0.60 0.61 -0.70

Sample matrix of the final output of one self-attention head (values rounded to 2 decimal places); at this point in the Transformer, each token now has both positional data and data regarding its relations to other tokens embedded into one vector. Note that head_0 is simply v_0, since the first token can only attend to itself.
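
To verify the arithmetic, the following NumPy snippet (a checking aid added alongside the example, not part of the original pipeline) reproduces steps 4 to 7 from the sample QK^T and V matrices above; the printed weights and outputs match the last two tables up to rounding:

```python
import numpy as np

# Values copied from the sample QK^T and V tables above.
QKT = np.array([
    [-0.01,  2.12, -1.62,  3.45,  1.83,  3.29],
    [-1.37,  0.51,  7.15, -3.39,  2.35,  2.99],
    [ 3.49, -6.19, -9.39, 11.19, -3.19, -5.67],
    [ 2.23,  3.19,  2.35, -0.67,  1.35,  4.37],
    [ 0.83,  1.42,  0.95,  0.27,  0.69,  1.83],
    [ 3.35,  1.35, 10.93, -0.95,  3.35,  5.95],
])
V = np.array([
    [ 0.43,  0.29,  0.64,  0.83],
    [ 1.15,  1.19, -0.28,  0.67],
    [-0.21,  0.64,  0.59, -0.82],
    [ 1.19, -0.45,  0.86,  0.99],
    [ 0.51,  0.41,  0.51,  0.51],
    [ 1.19,  0.28,  0.95, -0.15],
])

d_k = 4
scaled = QKT / np.sqrt(d_k)                           # step 4: divide by sqrt(d_k) = 2
M = np.triu(np.full(QKT.shape, -np.inf), k=1)         # step 5: causal masking matrix
masked = scaled + M
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)           # step 6: row-wise softmax
head = weights @ V                                     # step 7: weighted sum of values
print(np.round(weights, 3))
print(np.round(head, 2))
```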

More algebraic, implicit examples of the matrix transformations are also available.

The GPT-2 implementation of the attention mechanism, which should be slightly familiar at this point, can be seen here. Note that the weights are not fixed to the model architecture, but are instead passed in via a function parameter.

Note that, in LLM implementations, the various dimensionalities (e.g. model, key, and value dimensions) do not have to match as neatly as in the above example (Llama-4 provides optional hyperparameters for these).[1][2]

Practice questions

1.

Answer

References

[1] Jurafsky, D. and Martin, J. H., Speech and Language Processing, Chapter 8: Transformers
[2] Llama-4 documentation, Hugging Face