2.10 Practice questions
1. Can an LLM accurately explain how a Transformer works? You may want to experiment with this first, such as by querying an LLM at duck.ai.
Answer
Theoretically, yes, if the Transformer (that the LLM is based on) has been trained thoroughly enough, on enough data, and the training data meets the following conditions:
- the data comprehensively covers enough of the target language that the Transformer can comprehend and generate coherent sentences
- the data is accurate, and of high enough quality, with respect to the topic of Transformers
And the same is potentially true of any topic.
Realistically, the above two conditions may be in conflict.
For example, condition 1 can be satisfied by scraping a large quantity of content from the internet. However, the uneven quality of scraped content may then cause problems with condition 2.
Conversely, the quantity of high-quality data available to satisfy condition 2 may not be enough to satisfy condition 1.
2. Devise an 8-dimensional (or more) vector embedding for the following dictionary:
- “Theorem”
- “Yo”
- “Fire”
- “Water”
What attributes do you expect would be most important, if you were trying to categorise a larger quantity of words?
Note that this is an open-ended question.
Answer
This is a topic I have been thinking about a lot, as a way to assess more closely how the attention head weights are generated (in an attempt to produce a good teaching example).
That said, suppose all of the major classifications of words traditionally known to humans were fitted into an n-dimensional vector, and every word in the dictionary were assigned a value for each dimension. After training a Transformer on these word representations, it could then potentially be observed exactly which relations the attention head weight matrices are capturing.
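For a concrete toy version of this idea, here is a hand-crafted 8-dimensional embedding for the four words above. The attribute choices and values are entirely illustrative guesses on my part (a real model learns its dimensions rather than having them assigned), but they show how shared attributes translate into geometric closeness:

```python
import math

# A toy, hand-crafted 8-dimensional embedding. The attributes and values
# are illustrative guesses, not a standard scheme.
# Dimensions: [formality, is_noun, is_interjection, abstractness,
#              physical_substance, temperature, danger, everyday_usage]
embedding = {
    "Theorem": [0.9, 1.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.2],
    "Yo":      [0.0, 0.0, 1.0, 0.1, 0.0, 0.0, 0.0, 0.8],
    "Fire":    [0.3, 1.0, 0.0, 0.2, 1.0, 0.9, 0.8, 0.7],
    "Water":   [0.3, 1.0, 0.0, 0.2, 1.0, 0.3, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity: near 1.0 for similar directions, near 0.0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Words sharing attributes end up pointing in similar directions.
print(cosine(embedding["Fire"], embedding["Water"]))   # high: both physical substances
print(cosine(embedding["Theorem"], embedding["Yo"]))   # low: little in common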
3. After a vector is run through an Add &amp; Norm layer, would you expect its magnitude to increase or decrease?
Answer
Neither, consistently: it depends on the input. The normalisation step rescales the summed vector to zero mean and unit variance (before any learned scale and shift is applied), so whatever magnitude the residual addition produced, the output is brought back to a standardised scale.
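A minimal sketch of this in NumPy, with the learned scale and shift omitted (i.e. gamma = 1 and beta = 0):

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-5):
    """Residual addition followed by layer normalisation (gamma=1, beta=0)."""
    summed = x + sublayer_out                    # the "Add" step
    mean = summed.mean()
    var = summed.var()
    return (summed - mean) / np.sqrt(var + eps)  # the "Norm" step

x = np.array([2.0, -1.0, 4.0, 0.5])      # vector entering the layer
sub = np.array([1.5, 0.5, -2.0, 3.0])    # output of the preceding sub-layer

out = add_and_norm(x, sub)
print(np.linalg.norm(x), np.linalg.norm(out))
# The output's norm is always ~sqrt(d) (here sqrt(4) = 2), whatever the
# input's scale, because normalisation standardises the values.
```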
4. How many probabilities does the final Linear layer generate?
Answer
One for each token in the model's vocabulary. Strictly speaking, the final Linear layer produces a vector of unnormalised scores (logits), with one entry per vocabulary token; it is the softmax that follows which converts these into probabilities summing to 1.
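A minimal sketch in NumPy, using toy (hypothetical) sizes; real vocabularies run to tens of thousands of tokens:

```python
import numpy as np

d_model, vocab_size = 8, 10                    # toy sizes; real vocabularies are far larger
rng = np.random.default_rng(0)

hidden = rng.normal(size=d_model)              # final decoder output for one position
W = rng.normal(size=(d_model, vocab_size))     # the Linear layer's weight matrix

logits = hidden @ W                            # one unnormalised score per vocabulary token
probs = np.exp(logits) / np.exp(logits).sum()  # softmax turns scores into probabilities

print(len(probs), probs.sum())                 # vocab_size values summing to 1.0
```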
5. Why are SIMD processors relevant to Transformers?
Answer
SIMD stands for Single Instruction Multiple Data. Transformers rely heavily on matrix multiplication, and a processor that can apply the same instruction to many data elements at once can compute those multiplications concurrently, bringing a speed improvement during both training and utilisation.
The most common alternative to a SIMD processor is a MIMD processor, found in all common computers. This stands for Multiple Instruction Multiple Data, meaning the processor is oriented around running different instructions simultaneously (e.g. via a pipeline, or multiple cores).
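A rough illustration of the difference, assuming NumPy is available. NumPy's `@` operator dispatches to a compiled BLAS routine that uses the CPU's SIMD units, so the gap below also includes the removal of Python interpreter overhead, but it conveys the principle:

```python
import time
import numpy as np

n = 128
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# One multiplication at a time, in interpreted Python.
start = time.perf_counter()
C_loop = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
loop_time = time.perf_counter() - start

# NumPy's @ dispatches to a compiled BLAS routine that uses the CPU's
# SIMD units to multiply many elements per instruction.
start = time.perf_counter()
C_blas = A @ B
blas_time = time.perf_counter() - start

print(f"scalar loop: {loop_time:.3f}s  BLAS/SIMD: {blas_time:.6f}s")
```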
6. Are Transformers always trained from scratch?
Answer
No. It is common to take an existing pre-trained model, whose expensive training on a large general corpus has already been done, and fine-tune it on a smaller, task-specific dataset. This reuse of learned weights is known as transfer learning.
7. How do programmers know how a numerical assignment to a model's weights affects the loss function during training?
Answer
Via backpropagation. Automatic differentiation applies the chain rule to compute the gradient of the loss with respect to every weight, i.e. how much a small change to that weight would raise or lower the loss. The optimiser then adjusts each weight in the direction that reduces the loss.
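A minimal sketch of the idea, using a hypothetical one-weight "model" (not any framework's real API): nudging a weight estimates its effect on the loss numerically, and the chain rule gives the same value analytically, which is what backpropagation computes at scale:

```python
def loss(w, x=3.0, target=12.0):
    prediction = w * x               # the "model": a single multiplication
    return (prediction - target) ** 2

w = 2.0
eps = 1e-6

# Numerical estimate: nudge the weight and see how the loss responds.
numerical = (loss(w + eps) - loss(w)) / eps

# Analytic gradient via the chain rule: dL/dw = 2 * (w*x - target) * x.
# Backpropagation computes exactly this, for millions of weights at once.
analytic = 2 * (w * 3.0 - 12.0) * 3.0

print(numerical, analytic)           # both approximately -36.0
```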
8. What has been the open-source/closed-source approach to the GPT family of models?
Answer
The early models were open: the weights of GPT-1 and GPT-2 were published (GPT-2 in stages during 2019, owing to stated concerns about misuse). From GPT-3 onwards, OpenAI switched to a closed approach, with the models accessible only through a paid API rather than as downloadable weights.