1.1 Backpropagation

Before discussing backpropagation (short for backward propagation), forward propagation needs to be understood.

Forward propagation means inputting data into a feed-forward neural network, running it through the functions that make up the neural network based on the current set of values of the weights, and generating output.


Diagram 1.1.1: Example of a basic feed-forward neural network, with only 3 inputs, 1 hidden layer, and 1 output. A vector is input into the input layer (one dimension per neuron), run through the functions that make up the feed-forward neural network, and then a vector is output by the output layer (one dimension per neuron).

An example that could be associated with Diagram 1.1.1 is that the input layer could take, as input, 3 of the Big Five personality trait scores of a person, as a 3-dimensional vector: agreeableness, conscientiousness, and neuroticism. After forward propagation of the data through the functions that make up the feed-forward neural network, the output layer could output a prediction of that person's age, as a 1-dimensional vector.
Note that psychological studies have already drawn general conclusions about how these specific traits usually change with age (agreeableness and conscientiousness increase with age, neuroticism decreases with age, and, as an aside, openness to new experiences/ideas also decreases).
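
As a concrete illustration of forward propagation through a network shaped like Diagram 1.1.1, the following is a minimal sketch in Python. The number of hidden neurons, the tanh activation, and the randomly generated weights are all assumptions made purely for illustration; the source does not specify them.

import numpy as np

# Made-up weights for illustration: 3 inputs -> 4 hidden neurons -> 1 output.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # hidden-layer weights
b1 = np.zeros(4)               # hidden-layer biases
W2 = rng.normal(size=(1, 4))   # output-layer weights
b2 = np.zeros(1)               # output-layer bias

def forward(x):
    # Hidden layer: weighted sum of the inputs, then a tanh activation.
    h = np.tanh(W1 @ x + b1)
    # Output layer: weighted sum of the hidden activations.
    return W2 @ h + b2

# The 3 trait scores (agreeableness, conscientiousness, neuroticism), scaled to [0, 1].
x = np.array([0.7, 0.6, 0.4])
print(forward(x))  # a 1-dimensional vector: the (untrained, so arbitrary) predicted age

The network is untrained, so the output is meaningless; the point is only the mechanical flow of a vector through the layers.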

If the prediction that the feed-forward neural network makes is unexpected, for example if it predicts an age of 21 instead of 29, then it is possible, once the output has been derived, to find out how each specific weight within the feed-forward neural network affected that output. This process is known as backpropagation.
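
One intuitive, though inefficient, way to see how a single weight affects the output is simply to nudge that weight slightly and run forward propagation again. The toy one-weight model and squared-error score below are assumptions made for illustration only; they are not taken from the source.

# Toy model with a single weight w: prediction = w * x (an assumed form, for illustration).
def predict(w, x):
    return w * x

def loss(w, x, expected):
    # Squared error between the expected and predicted output (an assumed scoring function).
    return (expected - predict(w, x)) ** 2

w, x, expected = 0.7, 30.0, 29.0
eps = 1e-6
# Finite-difference approximation of dL/dw: nudge w, measure how much the score changes.
print((loss(w + eps, x, expected) - loss(w, x, expected)) / eps)

Backpropagation arrives at the same quantity analytically, and for every weight at once, rather than by re-running the network once per weight.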

After selecting a function, $L$, that takes as input the expected output and the predicted output, and outputs a numerical score, the question of how a specific weight $w_i$ affects the value of $L$ is represented as follows:

\frac{\partial L}{\partial w_i}
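
The source does not fix a particular choice of $L$; one common choice, assumed here purely for illustration, is the squared error between the expected output and the predicted output:

L(y_1, \hat{y}_1) = (y_1 - \hat{y}_1)^2

\frac{\partial L}{\partial \hat{y}_1} = -2 (y_1 - \hat{y}_1)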

A partial derivative, unlike a total derivative, depicts how a function changes with respect to one specific variable, and is generally more suitable for functions of multiple variables.
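
For example (an illustration, not taken from the source), for a function of two variables, the partial derivative with respect to one variable treats the other as a constant:

f(x, z) = x^2 z

\frac{\partial f}{\partial x} = 2xz, \qquad \frac{\partial f}{\partial z} = x^2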

By considering the calculus chain rule, it is possible to expand $\frac{\partial L}{\partial w_i}$ out.

Diagram 1.1.2: Duplication of Diagram 1.0.2, previously described.

Consider again the example of neuron $h_1$, in Diagram 1.1.2, and the feed-forward neural network in Diagram 1.1.1. The aim is to find how the weight $w_1$ in $h_1$ affects the difference, $L$, between the predicted age, $\hat{y}_1$, and the expected age, $y_1$:


\frac{\partial L}{\partial w_1}

= \frac{\partial L}{\partial \hat{y}_1} \cdot \frac{\partial \hat{y}_1}{\partial w_1}

= \frac{\partial L}{\partial \hat{y}_1} \cdot \frac{\partial \hat{y}_1}{\partial h_1} \cdot \frac{\partial h_1}{\partial w_1}
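
The following Python sketch makes the three factors concrete for a single path through the network. It assumes a single input into $h_1$, a sigmoid activation inside $h_1$, an output neuron that is a plain weighted sum, and a squared-error $L$; none of these choices is specified in the source.

import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Made-up values for illustration.
x = 0.6        # the single input into h1
w1 = 0.8       # the target weight inside h1
v1 = 1.5       # the weight connecting h1 to the output neuron
y1 = 29.0      # the expected output (expected age)

def forward(w):
    h1 = sigmoid(w * x)      # hidden neuron h1
    y_hat1 = v1 * h1         # output neuron (plain weighted sum)
    return h1, y_hat1

h1, y_hat1 = forward(w1)
L = (y1 - y_hat1) ** 2       # assumed squared-error choice of L

# The three chain-rule factors.
dL_dyhat1 = -2.0 * (y1 - y_hat1)       # dL/dy_hat1
dyhat1_dh1 = v1                        # dy_hat1/dh1
dh1_dw1 = h1 * (1.0 - h1) * x          # dh1/dw1 (derivative of the sigmoid, times the input)

dL_dw1 = dL_dyhat1 * dyhat1_dh1 * dh1_dw1

# Cross-check against a finite-difference approximation of dL/dw1.
eps = 1e-6
_, y_hat1_eps = forward(w1 + eps)
print(dL_dw1, ((y1 - y_hat1_eps) ** 2 - L) / eps)  # the two values should be very close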

On the general topic of backpropagation, it is necessary to know how each neuron changes with respect to inputs from the previous layer, and finally how the functions within the neuron containing the target weight change as the target weight changes. This can be handled algebraically; however, in practice it is not, as feed-forward neural networks can contain thousands of neurons.
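
To give a sense of how this is handled in practice, the sketch below applies the same chain rule to every weight of a small network at once, using matrix operations rather than per-weight algebra. The network shape, the tanh activation, and the squared-error loss are again assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))     # hidden-layer weights (4 hidden neurons, 3 inputs)
W2 = rng.normal(size=(1, 4))     # output-layer weights

x = np.array([0.7, 0.6, 0.4])    # input vector (e.g. the three trait scores)
y = np.array([29.0])             # expected output

# Forward propagation, keeping the intermediate values needed later.
z1 = W1 @ x                          # pre-activations of the hidden layer
h = np.tanh(z1)                      # hidden activations
y_hat = W2 @ h                       # predicted output
L = np.sum((y - y_hat) ** 2)         # assumed squared-error loss

# Backpropagation: work backwards, reusing each intermediate derivative.
dL_dyhat = -2.0 * (y - y_hat)                 # dL/dy_hat
dL_dW2 = np.outer(dL_dyhat, h)                # dL/dW2, one entry per output-layer weight
dL_dh = W2.T @ dL_dyhat                       # dL/dh
dL_dz1 = dL_dh * (1.0 - np.tanh(z1) ** 2)     # dL/dz1 (derivative of tanh)
dL_dW1 = np.outer(dL_dz1, x)                  # dL/dW1, one entry per hidden-layer weight

Each layer's gradients are obtained in a single matrix operation, which is why this scales to networks with thousands of neurons where expanding every weight's derivative algebraically by hand would not.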

Practice questions

1. Why might it be desirable to know how all the different weights within the neural network affect the difference between the predicted output of the neural network and the expected output?

Answer
Knowing this makes it possible to iteratively and individually adjust each weight so as to refine the predictive capabilities of the neural network.
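
A common way of making such an adjustment, stated here only as an assumed example and not described in the source, is to move each weight a small step against its derivative:

w_i \leftarrow w_i - \eta \, \frac{\partial L}{\partial w_i}

where $\eta$ is a small positive step size.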

2. Consider the function $y = x^2 + z^3 + w^4 + p^5$, where $x$, $z$, $w$, and $p$ are variables. Is the point at which $\frac{\partial y}{\partial x} = 0$ a turning point (i.e. the gradient of $y$ goes from negative to positive, or vice versa)?

Answer
Not necessarily. $\frac{\partial y}{\partial x} = 0$ specifically means that, at a point where the value of $x$ meets this condition, $y$ will not change as $x$ changes. However, the other variables in the equation may still cause $y$ to change. Consider also that this is a 5-dimensional equation, which does not map to a human perception of visual reality.
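
For concreteness, the partial derivative in question can be computed directly from the given function:

\frac{\partial y}{\partial x} = 2x

which equals zero only when $x = 0$. At such a point, $y$ still varies with $z$, $w$, and $p$ through the terms $z^3 + w^4 + p^5$.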