2.5 Add & Norm
Maintaining consistency between training and utilisation stages.
Diagram 2.5.0: The Transformer, Vaswani et al. (2017)
There is a problem known as “covariate shift”, in which the data in the training environment differs from the data in the utilisation (after-training) environment significantly enough that the underlying distributions of the data in the two stages are not equivalent. This reduces the accuracy of an LLM’s predictions. The ‘add & norm’ layer attempts to address this in the following ways.
The ‘add’ within ‘add & norm’ refers to a residual connection that adds the input of each layer to the output: $x + F(x)$, where $x$ is the input to a layer $F$.
To understand the purpose of the residual connection, we first need to understand the ‘vanishing gradient problem’.
Vanishing gradient problem
Consider backpropagation: the product of a series of partial derivatives expresses how a loss function changes with respect to a specific weight within a machine learning model. These partial derivatives feature numerical factors from the following sources:
- differences between the model output and the expected output during supervised training;
- trained weights;
- initial inputs to the model.
All of the above may be very small positive/negative real numbers:

$|f_k| < 1$

where $f_k \in \mathbb{R}$ denotes one factor in the chain of partial derivatives. Therefore, as the quantity of the model’s layers increases, the product of these factors, $\prod_k f_k$, may become very small.
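A minimal numerical sketch of this effect (the function name `chained_gradient` and the factor value 0.5 are illustrative, not from the text): multiplying many small partial-derivative factors, as the chain rule does, drives the overall gradient toward zero as depth grows.

```python
def chained_gradient(factors):
    """Product of per-layer partial-derivative factors, as in the chain rule."""
    product = 1.0
    for f in factors:
        product *= f
    return product

shallow = chained_gradient([0.5] * 4)   # 4 layers
deep = chained_gradient([0.5] * 32)     # 32 layers
print(shallow)  # 0.0625
print(deep)     # ≈ 2.3e-10
```

With only four layers the gradient is still usable; with thirty-two it is ten orders of magnitude smaller, which is the vanishing gradient problem in miniature.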
Specifically, when training a given weight via backpropagation, the final partial derivative may be very small, and, therefore, running gradient descent to train weights may become extremely slow:
$w_{i+1} = w_i - \eta \dfrac{\partial L}{\partial w}$

Where:
- $\eta$ is the learning rate hyperparameter
- $w$ is a specific weight within a neural network, or weight matrix, of the Transformer
- $w_i$ represents iteration $i$ of the specific weight
- $w_{i+1}$ represents the next iteration of the same specific weight
- $\dfrac{\partial L}{\partial w}$ represents how the loss function changes with respect to the given weight
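The slowdown can be seen directly in code. This is a toy sketch (the function name and values are illustrative): one gradient descent update, first with an ordinary gradient and then with a vanishingly small one.

```python
def gradient_descent_step(w, grad, learning_rate=0.01):
    """One update of w_{i+1} = w_i - eta * dL/dw."""
    return w - learning_rate * grad

# An ordinary gradient moves the weight noticeably:
print(gradient_descent_step(1.0, grad=0.5, learning_rate=0.1))  # 0.95

# A vanishingly small gradient barely moves it at all:
print(gradient_descent_step(0.8, grad=2.3e-10))  # ≈ 0.8
```

When the gradient has vanished, each update changes the weight by a negligible amount, so training effectively stalls.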
Other layers in a Transformer may suffer from similar issues during training, as the quantity of layers increases and the products of derivatives become small.
Add; residual connections
The idea of adding residual connections (a.k.a. skip connections) to a neural network was first formalised in 2016.[1] If a layer within a Transformer is represented as $F(x)$, for example the multi-head attention layer, where $x$ is the input to the multi-head attention, a residual connection can be represented as $F(x) + x$.
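As a minimal sketch (the helper name `residual` is illustrative), $F(x) + x$ is a one-line operation: the sub-layer output is added element-wise to its own input.

```python
def residual(sublayer, x):
    """Output of a layer wrapped in a residual (skip) connection: F(x) + x."""
    return [xi + fi for xi, fi in zip(x, sublayer(x))]

# Even if the sub-layer's output is ~0, the input still passes through,
# so gradients can flow along the skip path during backpropagation:
tiny_layer = lambda v: [0.0 for _ in v]
print(residual(tiny_layer, [1.0, 2.0]))  # [1.0, 2.0]
```

Because the identity term $x$ contributes a derivative of 1, the skip path gives gradients a route around each sub-layer, which is how residual connections mitigate the vanishing gradient problem.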
Residual connections are also said to have advantages in retaining the original input data whilst it passes through layers that need not change it, for example in the following cases:
- some inputs follow a linear distribution, whilst the layer is designed to split up the data in a way that is not needed;
- an input sequence is randomly generated, so there is no relation between the tokens for the multi-head attention layer to deduce.
Norm; layer normalisation
The ‘norm’ within ‘add & norm’ refers to layer normalisation. Layer normalisation is a progression from a method known as batch normalisation[2]; both attempt to address the issue of the weights in one layer of a neural network being heavily affected by the output of the previous layer within the same network (recall that all layers cumulatively affect the difference between the expected and actual output in the loss function).
Batch normalisation, for each neuron within a layer of a neural network, seeks to rescale the inputs based on the mean and variance of the whole training data distribution. In practice, these statistics are estimated from each small batch of training data being processed, as the computational load of using the full training dataset would be too high.[3] Then, during the utilisation stage, the inputs are normalised via a mean and variance aggregated from all the means and variances generated during training.[2]
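A minimal sketch of that two-stage behaviour, assuming a single neuron and exponential-moving-average running statistics (the function names, the `momentum` value, and the dictionary layout are illustrative, not from the batch normalisation paper):

```python
import math

def batch_norm_train(batch, running, momentum=0.1, eps=1e-5):
    """Normalise one neuron's activations using the current mini-batch's
    mean/variance, and update running estimates for the utilisation stage."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    running["mean"] = (1 - momentum) * running["mean"] + momentum * mean
    running["var"] = (1 - momentum) * running["var"] + momentum * var
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

def batch_norm_infer(x, running, eps=1e-5):
    """After training, normalise with the stored running estimates instead."""
    return (x - running["mean"]) / math.sqrt(running["var"] + eps)

stats = {"mean": 0.0, "var": 1.0}
normed = batch_norm_train([2.0, 4.0, 6.0], stats)
print(normed)  # ≈ [-1.22, 0.0, 1.22]
print(batch_norm_infer(4.0, stats))
```

Note the dependence on the batch: with a batch size of 1 the variance is zero and the scheme breaks down, which is one motivation for layer normalisation below.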
Essentially, both batch normalisation and layer normalisation go back to the classical statistical means of normalising a value, but with the mean $\mu$ and variance $\sigma^2$ derived from different sources:

$\hat{x} = \dfrac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$

where $\epsilon$ is a small constant added for numerical stability.
Layer normalisation, specifically, computes the mean and variance across the outputs of each layer within a Transformer, for a single example. Mini-batches may still be used during training; however, no limits are placed on the minimum size of the batch (a batch size of 1 is permitted).
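The per-example statistics make layer normalisation a short, self-contained computation. A minimal sketch (the function name `layer_norm` is illustrative; learnable gain and bias parameters are omitted for brevity):

```python
import math

def layer_norm(features, eps=1e-5):
    """Normalise one example's features across a layer's outputs.
    No batch statistics are needed, so a batch size of 1 works."""
    mean = sum(features) / len(features)
    var = sum((x - mean) ** 2 for x in features) / len(features)
    return [(x - mean) / math.sqrt(var + eps) for x in features]

print(layer_norm([1.0, 2.0, 3.0]))  # ≈ [-1.22, 0.0, 1.22]
```

Because the mean and variance come from the single example itself, the same computation is used identically at the training and utilisation stages, which is what keeps the two stages consistent.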
References
[1] He, K., Zhang, X., Ren, S. and Sun, J. (2016). Deep Residual Learning for Image Recognition.
[2] Ioffe, S. and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
[3] Ba, J. L., Kiros, J. R. and Hinton, G. E. (2016). Layer Normalization, section 2.