2.10 Practice questions

1. Can an LLM accurately explain how a Transformer works? You may want to experiment with this first, such as by querying an LLM at duck.ai.

Answer

Theoretically, yes, provided that the Transformer the LLM is based on has been trained thoroughly enough, on enough data, and that the training data meets the following conditions:

  1. the data comprehensively covers enough of the target language that the Transformer can comprehend and generate coherent sentences
  2. the data is of high enough quality, and accurate, with respect to the topic of Transformers

And the same is potentially true of any topic.

Realistically, the above two conditions may be in conflict.

For example, condition 1 can be satisfied by scraping a large quantity of content from the internet. However, the lack of quality control over scraped content may then cause problems with condition 2.

Similarly, the quantity of data available that satisfies condition 2 may not be enough to satisfy condition 1.

2. Devise an 8-dimensional (or more) vector embedding for the following dictionary:

  • “Theorem”
  • “Yo”
  • “Fire”
  • “Water”

What attributes do you expect would be most important, if you were trying to categorise a larger quantity of words?

Note that this is an open-ended question.

Answer

This is a topic I have been thinking about a lot, as a way to assess more closely how the attention head weights are generated (in an attempt to give a good teaching example).

The expectation is that, if all the major classifications of words traditionally known to humans were fitted into an n-dimensional vector, and every word in the dictionary were assigned a value for each dimension, then, after training a Transformer on these word representations, it could be observed exactly which relations the attention head weight matrices are picking up on.
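
As a rough sketch of how such a dictionary might look, here is a hand-crafted 8-dimensional assignment; the attribute names and values are purely illustrative assumptions for teaching purposes, not something a trained Transformer would produce:

```python
import math

# A toy, hand-crafted 8-dimensional embedding dictionary.
# Dimensions (assumed for illustration): [formality, abstractness, is_noun,
# is_interjection, physical_object, temperature, wetness, danger]
embeddings = {
    "Theorem": [0.9, 0.9, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "Yo":      [0.1, 0.1, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "Fire":    [0.4, 0.2, 1.0, 0.0, 1.0, 1.0, 0.0, 0.9],
    "Water":   [0.4, 0.2, 1.0, 0.0, 1.0, 0.3, 1.0, 0.2],
}

def cosine(a, b):
    # Cosine similarity: higher means the two vectors point in similar directions.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# "Fire" and "Water" share physical attributes, so they score closer to each
# other than either does to "Theorem".
print(cosine(embeddings["Fire"], embeddings["Water"]))   # ~0.73
print(cosine(embeddings["Fire"], embeddings["Theorem"])) # ~0.48
```

In a real model, of course, the useful dimensions are discovered during training rather than chosen by hand.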

3. After a vector is run through an Add & Norm layer, would you expect its values to increase or decrease?

Answer
The add part would of course add to the input, but the layer normalisation then subtracts the mean and divides by the standard deviation. So unfortunately the answer is ambiguous and depends on the context. Better luck next time! 😸
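
A minimal NumPy sketch of the idea (the vectors are made up, and the learned scale and shift parameters of a real layer normalisation are omitted):

```python
import numpy as np

def add_and_norm(x, sublayer_output, eps=1e-5):
    # "Add": the residual connection adds the sublayer's output to its input.
    added = x + sublayer_output
    # "Norm": layer normalisation subtracts the mean and divides by the
    # standard deviation across the vector's dimensions.
    return (added - added.mean()) / (added.std() + eps)

x = np.array([1.0, 2.0, 3.0, 4.0])
sublayer_output = np.array([0.5, -0.5, 2.0, -1.0])

print(add_and_norm(x, sublayer_output))
# Whether individual values end up larger or smaller than they started
# depends entirely on the numbers involved.
```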

4. How many probabilities does the final Linear layer generate?

Answer
The quantity of probabilities is equal to the size of the Transformer’s output vocabulary.
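
A sketch in NumPy, assuming a model dimension of 8 and a made-up vocabulary of only 5 tokens:

```python
import numpy as np

d_model, vocab_size = 8, 5                    # tiny, illustrative sizes

hidden = np.random.randn(d_model)             # the decoder's final hidden vector
W = np.random.randn(d_model, vocab_size)      # the final Linear layer's weights

logits = hidden @ W                            # one score ("logit") per vocabulary entry
probs = np.exp(logits) / np.exp(logits).sum()  # softmax turns the logits into probabilities

print(len(probs))    # 5 -- one probability per token in the output vocabulary
print(probs.sum())   # 1.0 (up to floating-point error)
```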

5. Why are SIMD processors relevant to Transformers?

Answer

SIMD stands for Single Instruction, Multiple Data. Transformers rely heavily on matrix multiplication, and the ability to apply one instruction to many data elements at once, so that whole rows and columns are processed concurrently, brings a significant speed improvement during both training and use.

The most common alternative to an SIMD processor is an MIMD processor, which is found in all common computers. This stands for Multiple Instruction, Multiple Data, which means the processor is oriented around running different assembly instructions simultaneously (e.g. via a pipeline, or multiple cores).
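
A rough illustration of the speed difference: the vectorised multiply below is handed to optimised routines that can exploit SIMD instructions, whereas the explicit Python loop works through one multiply-add at a time (exact timings will vary by machine):

```python
import time
import numpy as np

n = 128
A = np.random.randn(n, n)
B = np.random.randn(n, n)

# One multiply-add at a time, in a plain Python loop.
start = time.perf_counter()
C_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        for k in range(n):
            C_loop[i, j] += A[i, k] * B[k, j]
loop_time = time.perf_counter() - start

# Vectorised matrix multiply: NumPy dispatches to SIMD-capable BLAS routines.
start = time.perf_counter()
C_vec = A @ B
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s, vectorised: {vec_time:.5f}s")
print(np.allclose(C_loop, C_vec))  # same result, very different speed
```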

6. Are Transformers always trained from scratch?

Answer
Not always. Because training a model from scratch can require huge amounts of computational resources and energy, a process known as 'transfer learning' involves taking a set of pretrained weights (finely tuned numerical values for the parameters), loading them into a model, and training onwards from that stage (known as a checkpoint). For example, the original GPT-2 implementation contains download links to a pretrained model.
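
A minimal PyTorch sketch of the idea; the tiny model and the checkpoint filename are hypothetical stand-ins, not GPT-2 itself:

```python
import torch
import torch.nn as nn

# A tiny stand-in model; in practice this would be a full Transformer.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Transfer learning: start from weights somebody has already trained,
# rather than from a random initialisation.
state_dict = torch.load("checkpoint.pt")   # hypothetical checkpoint file
model.load_state_dict(state_dict)

# Optionally freeze the earliest layer and only train the rest onwards.
for param in model[0].parameters():
    param.requires_grad = False

optimiser = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
# Training then continues as usual, but starting from the checkpoint's weights.
```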

7. How do programmers know how a numerical assignment to a model's weights affects the loss function during training?

Answer
Machine learning libraries typically handle data in the form of O.O.P. objects known as tensors, which are akin to matrices. Tensors are arithmetically manipulated as they flow through a model, and each tensor object records what its initial state was and which operations were performed on it. Using this recorded history, automatic differentiation can deduce how a specific weight has affected the output of the model, and therefore the loss.
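
A minimal PyTorch example: the tensors record the operations applied to them, and backward() walks back through that record to compute how the weight affected the loss:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)   # a single "weight" tracked by autograd

x = torch.tensor(3.0)          # input
target = torch.tensor(12.0)    # desired output

prediction = w * x                   # the result tensor remembers this operation
loss = (prediction - target) ** 2    # squared-error loss

loss.backward()   # automatic differentiation through the recorded operations

# d(loss)/dw = 2 * (w*x - target) * x = 2 * (6 - 12) * 3 = -36
print(w.grad)     # tensor(-36.)
```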

8. What has been the open-source/closed-source approach to the GPT family of models?

Answer
GPT-2 was open-source, but GPT-3 and GPT-4 were closed source. Sam Altman has openly talked of the benefits of an open-source model, and in August 2025 OpenAI released a separate, open-source model, GPT-OSS. Its reference implementation uses nightly (beta) versions of PyTorch and Triton kernels.