Chapter 18: Backpropagation
A preview chapter from
**Deep Learning: From Basics to Systems**
by Andrew Glassner
**This is an early look at Chapter 18.**
Contents may change in the final book.
Items marked TK will be resolved in the final.
Follow the book's progress on Twitter.
Questions? Reach me by email.
Why This Chapter Is Here
====

This chapter is about training a neural network. The very basic idea is appealingly simple. Suppose we're training a categorizer, which will tell us which of several given labels should be assigned to a given input. It might tell us what animal is featured in a photo, or whether a bone in an image is broken or not, or what song a particular bit of audio belongs to. Training this neural network involves handing it a sample, and asking it to **predict** that sample's label. If the prediction matches the label that we previously determined for it, we move on to the next sample. If the prediction is wrong, we change the network to help it do better next time.

Easily said, but not so easily done. This chapter is about how we "change the network" so that it **learns**, or improves its ability to make correct predictions. This approach works beautifully not just for classifiers, but for almost any kind of neural network.

Contrast a feed-forward network of neurons to the dedicated classifiers we saw in Chapter TK. Each of those had a customized, built-in learning algorithm that measured the incoming data to provide the information that classifier needed to know. But a neural network is just a giant collection of neurons, each doing its own little calculation and then passing on its results to other neurons. Even when we organize them into layers, there's no inherent learning algorithm. How can we train such a thing to produce the results we want? And how can we do it efficiently?

The answer is called **back-propagation**, or more commonly **backpropagation**, or simply **backprop**. Without backprop, we wouldn't have today's widespread use of deep learning, because we wouldn't be able to train our models in reasonable amounts of time. With backprop, deep learning algorithms are practical and plentiful.

Backprop is a low-level algorithm. When we use libraries to build and train deep learning systems, we'll use their finely-tuned routines to get the best speed and accuracy. Except as an educational exercise, we're likely to never write our own code to perform backprop. So why is this chapter here? Why should we bother knowing about this low-level algorithm at all? There are at least four good reasons to have a general knowledge of backpropagation.

First, it's important to understand backprop because knowledge of one's tools is part of becoming a master in any field. Sailors at sea understand how ropes work and why specific knots are used in specific situations, photographers understand the basics of how lenses work, and airplane pilots understand why their plane turns when they tilt the wings in a certain way. A basic knowledge of the core techniques of any field is part of the process of gaining proficiency and developing mastery. In this case, knowing something about backprop lets us read the literature, talk to other people about deep learning ideas, and better understand the algorithms and libraries we use.

Second, and more practically, knowing about backprop can help us design networks that learn. When a network learns slowly, or not at all, it can be because something is preventing backprop from running properly. Backprop is a versatile and robust algorithm, but it's not bulletproof. We can unknowingly build networks where backprop won't work properly, resulting in a network that stubbornly refuses to learn. For those times when something's going wrong with backprop, understanding the algorithm helps us fix things [#Karpathy16].
Third, many important advances in neural networks rely on backprop intimately. To learn these new ideas, and understand why they work the way they do, it's important to know the algorithms they're building on.

Finally, backprop is an elegant algorithm. It efficiently solves a problem that would otherwise require a prohibitive amount of time and computer resources. It's one of the conceptual treasures of the field. As curious, thoughtful people, it's well worth our time to understand this beautiful algorithm.

For these reasons and others, this chapter provides an introduction to backprop. Generally speaking, introductions to backprop are presented mathematically, as a collection of equations with associated discussion [#Fullér10]. As usual, we'll skip the mathematics and focus instead on the concepts. The mechanics are common-sense at their core, and don't require any tools beyond basic arithmetic and the ideas of a derivative and gradient, which we discussed in Chapter TK.

A Word On Subtlety
----

The backpropagation algorithm is not complicated. In fact, it's remarkably simple, which is why it works so well and can be implemented so efficiently. But simple does not always mean easy. The backprop algorithm is subtle. In the discussion below, the algorithm will take shape through a process of observations and reasoning, and these steps may take some thought. We'll try to be clear about every step, but making the leap from reading to understanding may require some work. It's worth the effort.

A Very Slow Way to Learn
===

Let's begin with a very slow way to train a neural network. This will give us a good starting point, which we'll then improve.

Suppose we've been given a brand-new neural network consisting of hundreds or even tens of thousands of interconnected neurons. The network was designed to classify each input into one of 5 categories. So it has 5 outputs, numbered 1 to 5, and whichever one has the largest output is the network's prediction for an input's category. Figure [fig-block-diagram-classifier] shows the idea.

![Figure [fig-block-diagram-classifier]: A neural network predicting the class of an input sample. Starting at the bottom, we have a sample with four features and a label. The label tells us that the sample belongs to category 3. The features go into a neural network which has been designed to provide 5 outputs, one for each class. In this example, the network has incorrectly decided that the input belongs to class 1, because the largest output, 0.9, is from output number 1. ](../resources/Backpropagation/Images/block-diagram-classifier.jpg width="350px")

Consider the state of our brand-new network, before it has seen any inputs. As we know from Chapter TK, each input to each neuron has an associated weight. There could easily be hundreds of thousands, or many millions, of weights in our network. Typically, all of these weights will have been initialized with small random numbers.

Let's now run one piece of labeled training data through the net, as in Figure [fig-block-diagram-classifier]. The sample's features go into the first layer of neurons, and the outputs of those neurons go into more neurons, and so on, until they finally arrive at the output neurons, whose values then become the output of the network. The index of the output neuron with the largest value is the predicted class for this sample. Since we're starting with random numbers for our weights, we're likely to get essentially random outputs.
So there's a 1 in 5 chance the network will happen to predict the right label for this sample. But there's a 4 in 5 chance it'll get it wrong, so let's assume that the network predicts the wrong category.

When the prediction doesn't match the label, we can measure the discrepancy numerically, coming up with a single number to tell us just how wrong this answer is. We call this number the **error score**, or **error**, or sometimes the **loss** (if the word "loss" seems like a strange synonym for "error," it may help to think of it as describing how much information is "lost" if we categorize a sample using the output of the classifier, rather than the label). The error (or loss) is a floating-point number that can take on any value, though often we set things up so that it's always positive. The larger the error, the more "wrong" our network's prediction is for the label of this input. An error of 0 means that the network predicted this sample's label correctly. In a perfect world, we'd get the error down to 0 for every sample in the training set. In practice, we usually settle for getting as close as we can.

When we speak of "the network's error" with respect to a training set, we usually mean some kind of overall average that tells us how the network is doing when taking all the training samples into consideration. We call this the **training error**, since it's the overall error we get from predicting results from the training set. Similarly, the error from the test (or validation) data is called the **test error** or **validation error**. When the system is deployed, a measure of the mistakes it makes on new data is called the **generalization error**, because it represents how well (or poorly) the system manages to "generalize" from its training data to new, real-world data.

A nice way to think about the whole training process is to anthropomorphize the network. We can say that it "wants" to get its error down to zero, and the whole point of the learning process is to help it achieve that goal. One advantage of this way of thinking is that we can make the network do anything we want, just by setting up the error to "punish" any quality or behavior that we don't want. Since the algorithms we'll see in this chapter are designed to minimize the error, we know that anything about the network's behavior that contributes to the error will get minimized.

The most natural thing to punish is getting the wrong answer, so the error almost always includes a term that measures how far the output is from the correct label. The worse the match between the prediction and the label, the bigger this term will be. Since the network wants to minimize the error, it will naturally minimize such mistakes.

This approach of "punishing" the network through the error score means we can choose to include terms in the error for anything we can measure and want to suppress. For example, another popular measure to add into the error is a **regularization term**, where we look at the magnitude of all the weights in the network. As we'll see later in this chapter, we usually want those weights to be "small", which often means between -1 and 1. As the weights move beyond this range, we add a larger number to the error. Since the network "wants" the smallest error possible, it will try to keep the weights small so that this term remains small.

All of this raises the natural question of how on Earth the network is able to accomplish this goal of minimizing the error. That's the point of this chapter.
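To make the error score a little more concrete, here's a minimal Python sketch of one way such a score could be computed: a mismatch term that punishes disagreement between the prediction and the label, plus a regularization term that punishes large weights. The function name, the squared-difference mismatch, and the `reg_strength` knob are illustrative assumptions, not the specific formulas used later in the book.

```python
import numpy as np

def network_error(predictions, label, weights, reg_strength=0.01):
    """A sketch of an error (loss) score: punish mismatch with the label,
    and optionally punish large weights (a regularization term)."""
    predictions = np.asarray(predictions, dtype=float)
    label = np.asarray(label, dtype=float)

    # Term 1: how far the predictions are from the label (squared differences).
    mismatch = np.sum((predictions - label) ** 2)

    # Term 2: grows as the weights get large, nudging the network to keep them small.
    regularization = reg_strength * sum(np.sum(w ** 2) for w in weights)

    return mismatch + regularization

# Example: a wrong prediction for a sample whose label is category 3 of 5.
weights = [np.random.uniform(-1, 1, size=(4, 5))]   # made-up weight array
print(network_error([0.9, 0.1, 0.2, 0.0, 0.1], [0, 0, 1, 0, 0], weights))
```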
Let's start with a basic error measure that only punishes a mismatch between the network's prediction and the label. Our first algorithm for teaching the network will be just a thought experiment, since it would be absurdly slow on today's computers. But the motivation is right, and this slow algorithm will form the conceptual basis for the more efficient techniques we'll see later in this chapter.

A Slow Way to Learn
---

Let's stick with our running example of a classifier. We'll give the network a sample and compare the system's prediction with the sample's label. If the network got it right and predicted the correct label, we won't change anything and we'll move on to the next sample. As the wise man said, "If it ain't broke, don't fix it" [#Seung05]. But if the result for a particular sample is incorrect (that is, the category with the highest value does not match our label), we will try to improve things. That is, we'll learn from our mistakes.

How do we learn from this mistake? Let's stick with this sample for a while and try to help the network do a better job with it. First, we'll pick a small random number (which might be positive or negative). Now we'll pick one weight at random from the thousands or millions of weights in the network, and we'll add our small random value to that weight.

Now we'll evaluate our sample again. Everything up to that change will be the same as before. But there will be a chain reaction of changes in the outputs of the neurons starting at the weight we modified. The new weight will produce a new input for the neuron that uses that input, which will change that neuron's output value, which will change the output of every neuron that uses that output, which will change the output of every neuron that uses any of *those* outputs, and so on. Figure [fig-mod-one-weight] shows this idea graphically.

![Figure [fig-mod-one-weight]: Updating a single weight causes a chain reaction that ultimately can change the network's outputs. Here we show a network of 5 layers with 3 neurons each. Data flows from the inputs at the left to the outputs at the right. For simplicity, not every neuron uses the output of every neuron on the previous layer. (a) We select one weight at random, here shown in red and marked *w*. (b) We modify the weight by adding a value *m* to it, so the weight is now *w+m*. (c) When we run the sample through the network again, the new weight causes a change in the output of the neuron it feeds into (in red). The output of that neuron (in green) changes as a result, which causes the neurons it feeds into (in green) to change their outputs, and the changes cascade all the way to the output layer. ](../resources/Backpropagation/Images/mod-one-weight.jpg width="300px")

Now that we have a new output, we can compare it to the label and measure the new error. If the new error is less than the previous error, then we've made things better! We'll keep this change, and move on to the next sample. But if the results didn't get better then we'll undo this change, restoring the weight back to its previous value. We'll then pick a new random weight, change it by a newly-selected small random amount, and evaluate the network again. We can continue this process of picking and nudging weights until the results improve, or we decide we've tried enough times, or for any other reason we decide to stop. Then we just move on to the next sample.
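As a thought experiment only, here's roughly what this nudge-and-test loop might look like in Python for a single sample. Every name here (`predict_fn`, `error_fn`, the nudge size, and the one-weight demo at the end) is a made-up stand-in for illustration, not part of any real library.

```python
import random

def train_by_random_nudges(weights, sample, label, error_fn, predict_fn,
                           nudge_size=0.01, tries=100):
    """Pick one weight at random, nudge it by a small random amount, and
    keep the change only if this sample's error goes down; otherwise undo it."""
    best_error = error_fn(predict_fn(weights, sample), label)
    for _ in range(tries):
        i = random.randrange(len(weights))               # pick a weight at random
        nudge = random.uniform(-nudge_size, nudge_size)  # small positive or negative value
        weights[i] += nudge                              # try the change
        new_error = error_fn(predict_fn(weights, sample), label)
        if new_error < best_error:
            best_error = new_error                       # improvement: keep it
        else:
            weights[i] -= nudge                          # no improvement: undo it
    return weights, best_error

# Tiny demo with a fake one-weight "network" whose prediction is weight * input.
w, err = train_by_random_nudges(
    [0.1], sample=2.0, label=1.0,
    error_fn=lambda p, y: (p - y) ** 2,
    predict_fn=lambda ws, x: ws[0] * x)
print(w, err)
```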
When we've used all the samples in our training set, we'll just go through them all again (maybe in a different order), over and over. The idea is that we'll improve a little bit from every mistake. We can continue this process until the network classifies every input correctly, or we've come close enough, or our patience is exhausted.

With this technique, we would expect the network to slowly improve, though there may be setbacks along the way. For example, adjusting a weight to improve one sample's prediction might ruin the prediction for one or more other samples. If so, when those samples come along they will cause their own changes to improve their performance.

This thought experiment isn't perfect, because things could get stuck. For example, there might be times when we need to adjust more than one weight simultaneously. To fix that, we can imagine extending our algorithm to assign multiple random changes to multiple random weights. But let's stick with the simpler version for now.

Given enough time and resources, the network would eventually find a value for every weight that either predicts the right answer for every sample, or comes as close as that network possibly can. The important word in that last sentence is *eventually*. As in, "The water will boil, eventually," or "The Andromeda galaxy will collide with our Milky Way galaxy, eventually" [#NASA12]. This technique, while a valid way to teach a network, is definitely not practical. Modern networks can have millions of weights. Trying to find the best values for all those weights with this algorithm is just not realistic.

But this *is* the core idea. To train our network, we'll watch its output, and when it makes mistakes, we'll adjust the weights to make those mistakes less likely. Our goal in this chapter will be to take this rough idea and re-structure it into a vastly more practical algorithm.

Before we move on, it's worth noting that we've been talking about weights, but not the bias term belonging to every neuron. We know that every neuron's bias gets added in along with the neuron's weighted inputs, so changing the bias would also change the output. Doesn't that mean that we want to adjust the bias values as well? We sure do. But thanks to the **bias trick** we saw in Chapter TK, we don't have to think about the bias explicitly. That little bit of re-labeling sets up the bias to look like an input with its own weight, just like all the other inputs. The beauty of this arrangement is that it means that as far as our training algorithm is concerned, the bias is just another weight to adjust. In other words, all we need to think about is adjusting weights, and the bias weights will automatically get adjusted along the way with all the other weights.

Let's now consider how we might improve our incredibly slow weight-changing algorithm.

A Faster Way to Learn
----

The algorithm of the last section would improve our network, but at a glacial pace. One big source of inefficiency is that half of our adjustments to the weights are in the wrong direction: we add a value when we should instead have subtracted it, and vice-versa. That's why we had to undo our changes when the error went up. Another problem is that we tuned each weight one by one, requiring us to evaluate an immense number of samples. Let's solve these problems.

We could avoid making mistakes if we knew beforehand whether we wanted to nudge each weight along the number line to the right (that is, make it more positive) or to the left (and make it more negative).
We can get exactly that information from the **gradient** of the error with respect to that weight. Recall that we met the gradient in Chapter TK, where it told us how the height of a surface changes as each of its parameters changes. Let's narrow that down for the present case. In 1D (where the gradient is also called the **derivative**), the gradient is the slope of a curve above a specific point. Our curve describes the network's error, and our point is the value of a weight. If the slope of the error (the gradient) above the weight is positive (that is, the line goes up as we move to the right), then moving the point to the right will cause the error to go up. More useful to us is that moving the point to the left will cause the error to go down. If the slope of the error is negative, the situations are reversed. Figure [fig-number-line-gradients] shows a few examples. ![Figure [fig-number-line-gradients]: The gradient tells us what will happen to the error (the black curves) if we move a weight to the right. The gradient is given by the slope of the curve directly above the point we're interested in. Lines that go up as we move right have a positive slope, otherwise they are negative. (a) If we move the round weight to the right, the error will increase, because the slope of the error is positive. To reduce the error, we need to move the round point left. The square point's gradient is negative, so we reduce the error by moving that point right. (b) The gradient for the round point is negative here, so moving to the right will reduce the error. The square point's gradient is positive, so we reduce the error by moving that point to the left. ](../resources/Backpropagation/Images/number-line-gradients.jpg width="700px") If we had the gradient for a weight, we could always adjust it exactly as needed to make the error go down. Using the gradients wouldn't be much of an advantage if they were time-consuming to compute, so as our second improvement let's suppose that we can calculate the gradients for the weights very efficiently. In fact, let's suppose that we could quickly calculate the gradient for *every* weight in the whole network. Then we could update *all of the weights simultaneously* by adding a small value (positive or negative) to each weight in the direction given by its own individual gradient. That would be an immense time-saver. Putting these together gives us a plan where we'll run a sample through the network, measure the output, compute the gradient for every weight, and then use the gradient at each weight to move that weight to the right or the left. This is exactly what we're going to do. This plan makes knowing the gradient an important issue. Finding the gradient efficiently is the main goal of this chapter. Before we continue, it's worth noticing that this algorithm makes the assumption that tweaking all the weights independently and simultaneously will lead to a reduction in the error. This is a bold assumption, because we've already seen how changing one weight can cause ripple effects through the rest of the network. Those effects could change the values of other neurons, which in turn would change their gradients. We won't get into the details now, but we'll see later that if we make the changes to the weights small enough, that assumption will generally hold true, and the error will indeed go down. 
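Here's a minimal Python sketch of that plan, assuming for the moment that something hands us a gradient value for every weight. The function name and the fixed `step_size` are illustrative assumptions; finding the gradients themselves is the subject of the rest of the chapter.

```python
def gradient_step(weights, gradients, step_size=0.01):
    """Nudge every weight a small amount against its own gradient, all at once.
    A positive gradient means the error rises if that weight increases, so we
    move the weight left; a negative gradient means we move it right."""
    return [w - step_size * g for w, g in zip(weights, gradients)]

# Example: the first weight's gradient is positive (move left),
# the second's is negative (move right).
print(gradient_step([0.8, -0.3], [2.0, -1.5]))   # [0.78, -0.285]
```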
No Activation Functions for Now
===

For the next few sections in this chapter, we're going to simplify the discussion by pretending that our neurons don't have activation functions. As we saw in Chapter TK, activation functions are essential to keep our whole network from becoming nothing more than the equivalent of a single neuron. So we need to use them. But if we include them in our initial discussion of backprop, things will get complicated, fast. If we leave activation functions out for just a moment, the logic is much easier to follow. We'll put them back in again at the end.

Since we're temporarily pretending that there are no activation functions in our neurons, neurons in the following discussions just sum up their weighted inputs, and present that sum as their output, as in Figure [fig-neuron-no-AF]. As before, each weight is named with a two-letter composite of the neuron it's coming from and the neuron it's going into.

![Figure [fig-neuron-no-AF]: Neuron D simply sums up its incoming values, and presents that sum as its output. Here we've explicitly named the weights on each connection into neuron D. ](../resources/Backpropagation/Images/neuron-no-AF.jpg height="150px")

Until we explicitly put activation functions back in, our neurons will emit nothing more than the sum of their weighted inputs.

Neuron Output Changes and Network Error
====

Our goal is to reduce the overall error for a sample, by adjusting the network's weights. We'll do this in two steps. In the first step, we calculate and store a number called the "delta" for every neuron. This number is related to the network's error, as we'll see below. This step is performed by the **backpropagation** algorithm. The second step uses those delta values at the neurons to update the weights. This step is called the **update** step. It's not typically considered part of backpropagation, but sometimes people casually roll the two steps together and call the whole thing "backpropagation."

The overall plan now is to run a sample through the network, get the prediction, and compare that prediction to the label to get an error. If the error is greater than 0, we use it to compute and store a number we'll call "delta" at every neuron. We use these delta values and the neuron outputs to calculate an update value for each weight. The final step is to apply every weight's individual update so that it takes on a new value. Then we move on to the next sample, and repeat the process, over and over again until the predictions are all perfect or we decide to stop.

Let's now look at this mysterious "delta" value that we store at each neuron.

Errors Change Proportionally
---

There are two key observations that will make sense of everything to follow. These are both based on how the network behaves when we ignore the activation functions, which we're doing for the moment. As promised above, we'll put them back in later in this chapter.

The first observation is this: *When any neuron output in our network changes, the output error changes by a proportional amount.* Let's unpack that statement.

Since we're ignoring activation functions, there are really only two types of values we care about in the system: weights (which we can set and change as we please), and neuron outputs (which are computed automatically, and which are beyond our direct control). Except for the very first layer, a neuron's input values are each the output of a previous neuron times the weight of the connection that output travels on.
Each neuron's output is just the sum of all of these weighted inputs. Figure [fig-basic-network] recaps this idea graphically.

![Figure [fig-basic-network]: A small neural network with 11 neurons organized in 4 layers. Data flows from the inputs at the left to the outputs at the right. Each neuron's inputs come from the outputs of the neurons on the previous layer. This type of diagram, though common, easily becomes dense and confusing, even with color-coding. We will avoid it when possible. ](../resources/Backpropagation/Images/basic-network.jpg width="500px")

We know that we'll be changing weights to improve our network. But sometimes it's easier to think about looking at the change in a neuron's output. As long as we keep using the same input, the only reason a neuron's output can change is because one of its weights has changed. So in the rest of this chapter, any time we speak of the result of a change in a neuron's output, that change came about because we changed one of the weights that neuron depended on.

Let's take this point of view now, and imagine we're looking at a neuron whose output has just changed. What happens to the network's error as a result? Because the only operations that are being carried out in our network are multiplication and addition, if we work through the numbers we'll see that the result of this change is that the change in the error is **proportional** to the change in the neuron's output. In other words, to find the change in the error, we find the change in the neuron's output and multiply that by some particular value. If we double the amount of change in the neuron's output, we'll double the amount of change in the error. If we cut the neuron's output change by one-third, we'll cut the change in the error by one-third.

The connection between any *change* in the neuron's output and the resulting *change* in the final error is just the neuron's change times some number. This number goes by various names, but the most popular is probably the lower-case Greek letter **delta** ($\delta$), though sometimes the upper-case delta ($\Delta$) is used. Mathematicians often use the delta character to mean "change" of some sort, so this was a natural (if terse) choice of name.

So every neuron has a "delta", or $\delta$, associated with it. This is a real number that can be big or small, positive or negative. If the neuron's output changes by a particular amount (that is, it goes up or down), we multiply that change by that neuron's delta, and that tells us how the entire network's error will change.

Let's draw a couple of pictures to show the "before" and "after" conditions of a neuron whose output changes. We'll change the output of the neuron using brute force: we'll add some arbitrary number to the summed inputs just before that value emerges as the neuron's output. We'll call that extra value $m$ (for "modification"). Figure [fig-how-delta-works-1] shows the idea graphically.

![Figure [fig-how-delta-works-1]: Computing the change in the error due to a change in a neuron's output. Here we're forcing a change in the neuron's output by adding an arbitrary amount $m$ to the sum of the inputs. Because the output will change by $m$, we know the change in the error is this difference $m$ times the value of $\delta$ belonging to this neuron. ](../resources/Backpropagation/Images/how-delta-works-1.jpg height="200px")

In Figure [fig-how-delta-works-1] we placed the value $m$ inside the neuron. But we can also change the output by changing one of the inputs.
Let's change the value that's coming in from neuron B. We know that the output of B will get multiplied by the weight BD before it's used by neuron D. So let's add our value $m$ right after that weight has been applied. This will have the same result as before, since we're just adding $m$ to the overall sum that emerges from D. Figure [fig-how-delta-works-2] shows the idea. We can find the change in the error like before, by multiplying this change $m$ in the output by $\delta$.

![Figure [fig-how-delta-works-2]: A variation of Figure [fig-how-delta-works-1], where we add $m$ to the output of B (after it has been multiplied by the weight BD). The output of D is again changed by $m$, and the change in the error is again $m$ times this neuron's value of $\delta$. ](../resources/Backpropagation/Images/how-delta-works-2.jpg height="250px")

To recap, if we know the change in a neuron's output, and we know the value of delta for that neuron, then we can predict the change in the error by multiplying that change in the output by that neuron's delta. This is a remarkable observation, because it shows us explicitly how the error changes based on the change in output of each neuron. The value of delta acts like an amplifier, making any change in the neuron's output have a bigger or smaller effect on the network's error.

An interesting result of multiplying the neuron's change in output with its delta is that if the change in the output and the value of delta both have the same **sign** (that is, both are positive or negative), then the change in the error will be positive, meaning that the error will increase. If the change in the output and delta have opposite signs (that is, one is negative and one is positive), then the change in the error will be negative, meaning that the error will decrease. That's the case we want, since our goal is always to make the error as small as possible.

For instance, suppose that neuron A has a delta of 2, and for some reason its output changes by -2 (say, changing from 5 to 3). Since the delta is positive and the change in output is negative, the change in the error will also be negative. In numbers, $2 \times -2 = -4$, so the error will drop by 4. On the other hand, suppose the delta of A is -2, and its output changes by +2 (say from 3 to 5). Again, the signs are different, so the error will change by $-2 \times 2 = -4$, and again the error will reduce by 4. But if the change in A's output is -2, and the delta is also -2, then the signs are the same. Since $-2 \times -2 = 4$, the error will increase by 4.

At the start of this section we said there were two key observations we wanted to note. The first, as we've been discussing, is that if a neuron's output changes, the error changes by a proportional amount. The second key observation is: *this whole discussion applies just as well to the weights*. After all, the weights and the outputs are multiplied together. When we multiply two arbitrary numbers, such as $a$ and $b$, we can make the result bigger by adding something to *either* the value of $a$ or $b$. In terms of our network, we can say that *when any weight in our network changes, the error changes by a proportional amount.* If we wanted, we could work out a delta for every weight. And that would be perfect. We would know just how to tweak each weight to make the error go down. We just add in a small number whose sign is opposite that of the weight's delta. Finding those deltas is what backprop is for.
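This sign bookkeeping is easy to check with a couple of lines of Python, using the same numbers as the example above; the function name is just for illustration.

```python
def error_change(output_change, delta):
    """The proportional rule described above: the change in the network's error
    is the change in a neuron's output times that neuron's delta."""
    return output_change * delta

print(error_change(-2, 2))    # -4: opposite signs, the error drops by 4
print(error_change(2, -2))    # -4: opposite signs, the error drops by 4
print(error_change(-2, -2))   #  4: matching signs, the error rises by 4
```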
We find them by first finding the delta for every neuron's output. We'll see below that with a neuron's delta, and its output, we can find the weight deltas. We already know every neuron's outputs, so let's turn our attention to finding those neuron deltas. The beauty of backpropagation is that finding those values is incredibly efficient.

A Tiny Neural Network
===

To get a handle on backprop, we'll use a tiny network that classifies 2D points into two categories, which we'll call class 1 and class 2. If the points can be separated by a straight line then we could do this job with just one perceptron, but we'll use a little network because it lets us see the general principles. In this section we'll look at the network and give a label to everything we care about. That will make later discussions simpler and easier to follow.

Figure [fig-simple-net] shows our network. The inputs are the X and Y coordinates of each point, there are four neurons, and the outputs of the last two serve as the outputs of the network. We call their outputs the predictions P1 and P2. The value of P1 is the network's prediction of the likelihood that our sample (that is, the X and Y at the input) belongs to class 1, and P2 is its prediction of the likelihood that the sample belongs to class 2. These aren't actually probabilities because they won't necessarily add up to 1, but whichever is larger is the network's preferred choice of category for this input. We could make them into probabilities (by adding a **softmax** layer, as discussed in Chapter TK), but that would just make the discussion more complicated without adding anything useful. Remember that our neurons don't have activation functions for now.

![Figure [fig-simple-net]: A simple network. The input has two features, which we call X and Y. There are four neurons, ending with two predictions, P1 and P2. These predict the likelihoods (not the probabilities) that our sample belongs to class 1 or class 2, respectively. ](../resources/Backpropagation/Images/simple-net.jpg width="400px")

Let's label the weights. As usual, we'll imagine that the weights are sitting on the wires that connect neurons, rather than stored inside the neurons. The name of each weight will be the name of the neuron providing that value at its output followed by the neuron using that value as input. Figure [fig-simple-net-weights] shows the names of all 8 weights in our network.

![Figure [fig-simple-net-weights]: Giving names to each of the 8 weights in our tiny network. Each weight is just the name of the two neurons it connects, with the starting neuron (on the left) coming first, and then the destination neuron (on the right) second. For the sake of consistency, we pretend that X and Y are "neurons" when it comes to naming the weights, so $XA$ is the name of the weight that scales the value of X going into neuron A. ](../resources/Backpropagation/Images/simple-net-weights.jpg width="300px")

This is actually a tiny **deep-learning network** with two **layers**. The first layer contains neurons A and B, and the second contains neurons C and D, as shown in Figure [fig-simple-net-layers].

![Figure [fig-simple-net-layers]: Our tiny neural network has 2 layers. The "input layer" is an abstraction that we use to collect together the input values X and Y. The "hidden layer" is made of neurons A and B. It's called "hidden" because it can't be "seen" by an observer outside the network, who can only "see" the inputs and outputs. The "output layer" contains neurons C and D, which present their outputs in the form of the values of P1 and P2. The naming system is somewhat inconsistent, because the "input layer" is just conceptual, while the "output layer" is a real thing composed of the output neurons. ](../resources/Backpropagation/Images/simple-net-layers.jpg width="400px")

Let's use our deep-learning terminology from Chapter TK to refer to various pieces of the network, in Figure [fig-simple-net-layers]. The **input layer** is just a conceptual grouping of the inputs X and Y. These don't correspond to neurons, because these are just pieces of memory for storing the features in the sample that's been given to the network. When we count up the layers in a network, we don't usually count the input layer. The **hidden layer** is called that because neurons A and B are "inside" the network, and thus "hidden" from a viewer on the outside, who can see only the inputs and outputs. The **output layer** is the set of neurons that provide our **outputs**, here P1 and P2. These layer names are a bit asymmetrical, because the input layer is imaginary while the output layer is real, but that's how the convention has developed.

Finally, we'll want to refer to the output and delta for every neuron. For this, we'll make little two-letter names by combining the neuron's name with the value we want to refer to. So $Ao$ and $Bo$ will be the names of the outputs of neurons A and B, and $A\delta$ and $B\delta$ will be the delta values for those two neurons. Figure [fig-simple-net-od] shows these values stored with their neurons.

![Figure [fig-simple-net-od]: Our simple network with the output and delta values for each neuron. ](../resources/Backpropagation/Images/simple-net-od.jpg width="400px")

We'll be watching what happens when neuron outputs change, causing changes to the error. We'll label the change in the output of neuron A as $Am$. We'll label the error simply $E$, and a change to the error as $Em$. As we saw above, if we have a change $Am$ in the output of neuron A, then multiplying that change by $A\delta$ gives us the change in the error. That is, the change $Em$ is given by $Am \times A\delta$. We'll think of the action of $A\delta$ as multiplying, or scaling, the change in the output of neuron A, giving us the corresponding change in the error. Figure [fig-error-delta-visual] shows the schematic setup we'll use for visualizing the way changes in a neuron's output are scaled by its delta to produce changes to the error.

![Figure [fig-error-delta-visual]: Our schematic for visualizing how changes in a neuron's output can change the network's error. Read the diagram roughly left to right. Here we start with a neuron A. It starts with value $Ao$, but we change one of the weights on its inputs so that the output goes up by $Am$. The arrow inside the box for $Am$ shows that this change is positive. This change is multiplied by $A\delta$ to give us $Em$, the change in the error. We show $A\delta$ as a wedge, illustrating the amplification of $Em$. Adding this change to the previous value of the error, $E$, gives us the new error $E+Em$. In this case, both $Am$ and $A\delta$ are positive, so the change in the error $Am \times A\delta$ is also positive, increasing the error. ](../resources/Backpropagation/Images/error-delta-visual.jpg width="600px")

Keep in mind that the delta value $A\delta$ relates a *change* in a neuron's output to a *change* in the error.
These are not relative or percentage changes, but the actual amounts. So if the output of A goes from 3 to 5, that's a change of 2, so the change in the error would be $A\delta \times 2$. If the output of A goes from 3000 to 3002, that's still a change of 2, and the error would change by the same amount, $A\delta \times 2$.

Now that we've labeled everything, we're finally ready to look at the backpropagation algorithm.

Step 1: Deltas for the Output Neurons
====

Backpropagation is all about finding the delta value for each neuron. To do that, we'll start at the end: the output layer.

The outputs of neurons C and D in our tiny network give us the likelihoods that the input is in class 1 or class 2, respectively. In a perfect world, a sample that belongs to group 1 would produce a value of 1.0 for P1 and 0.0 for P2, meaning that the system is certain that it belongs to class 1 and simultaneously certain that it does *not* belong to class 2. If the system's a little less certain, we might get P1=0.8 and P2=0.1, telling us that it's much more likely that the sample is in class 1 (remember that these aren't probabilities, so they probably won't sum to 1).

We'd like to come up with a single number to represent the network's error. To do that, we'll compare the values of P1 and P2 with the label for this sample. The easiest way to make that comparison is if the label is **one-hot encoded**, as we saw in Chapter TK. Recall that one-hot encoding makes a list of numbers as long as the number of classes, and puts a 0 in every entry. Then it puts a 1 in the entry corresponding to the correct class. In our case, we have only two classes, so the encoder would always create a list of two zeros, which we can write as (0, 0). For a sample that belongs to class 1, it would put a 1 in the first slot, giving us (1, 0). A sample from class 2 would get the label (0, 1). Sometimes such a label is also called a **target**.

Let's put the predictions P1 and P2 into a list as well: (P1, P2). Now we can just compare the lists. There are lots of ways to do this. For example, a simple way would be to find the difference between corresponding elements from each list and then add up those differences. Figure [fig-simple-net-error] shows the idea.

![Figure [fig-simple-net-error]: To find the error from a specific sample, we start by supplying the sample's features X and Y to the network. The outputs are the predictions P1 and P2, telling us the likelihoods that the sample is in class 1 and class 2, respectively. We compare those predictions with the one-hot encoded label, and from that come up with a number representing the error. If the predictions match the label perfectly, the error is 0. The bigger the mismatch, the bigger the error. ](../resources/Backpropagation/Images/simple-net-error.jpg width="500px")

If the prediction list is identical to the label list, then the error is 0. If the two lists are close (say, (0.9, 0.1) and (1, 0)), then we'd want to come up with an error number that's bigger than 0, but maybe not enormous. As the lists become more and more different, the error should increase. The maximum error would come if the network is absolutely wrong, for example predicting (1, 0) when the label says (0, 1).

There are many formulas for calculating the network error, and most libraries let us choose among them. We'll see in Chapter TK that this error formula is one of the critical choices that defines what our network is for.
For instance, we'll choose one type of error formula if we're building a network to classify inputs into categories, another type of formula if our network is for predicting the next value in a sequence, and yet another type of formula if we're trying to match the output of some other network. These formulas can be mathematically complex, so we won't go into those details here. As Figure [fig-simple-net-error] shows, the formula for our simple network takes in 4 numbers (2 from the prediction and 2 from the label), and produces a single number as a result. But all the error formulas share the property that when they compare a classifier's output to the label, a perfect match will give a value of 0, and increasingly incorrect matches will give increasingly large errors. For each type of error formula, our library function will also provide us with its **gradient**. The gradient tells us how the error will change if we increase any one of the four inputs. This may seem redundant, since we know that we want the outputs to match the label, so we can tell how the outputs should change just by looking at them. But recall that the error can include other terms, like the regularization term we discussed above, so things can get more complicated. In our simple case, we can use the gradient to tell us whether we'd like the output of C to go up or down, and the same for D. We'll pick the direction for each neuron that causes the error to decrease. Let's think about drawing our error. We could also draw the gradient, but usually that's harder to interpret. As we saw in Chapter TK, when we draw the error itself, we can usually see the gradient just by looking at the shape of the error. Unfortunately, we can't draw a nice picture of the error for our little network because it would require five dimensions (four for the inputs and one for the output). But things aren't so bad. We don't care about how the error changes when the label changes, because the label can't change. For a given input, the label is fixed. So we can ignore the two dimensions for the labels. That leaves us with just 3 dimensions. And we can draw a 3D shape! So let's plot the error. Remember that we can find the gradient at any point just by imagining which way a drop of water would flow if we placed it on the surface of the error above that location. Let's draw a 3D diagram that shows the error for any set of values P1 and P2, *for a given label*. That is, we'll set the value of the label, and explore the error for different values of P1 and P2. Every error formula will give us a somewhat different surface, but most will look roughly like Figure [fig-simple-net-errors-range-3-3]. ![Figure [fig-simple-net-errors-range-3-3]: Visualizing the error of our network, given a label (or target) and our two predictions, P1 and P2. Top row: The label is (0, 1), so the error is a bowl with its bottom at P1 = 0 and P2 = 1. As P1 and P2 diverge from those values, the error goes up. Left: The error for each value of P1 and P2. Right: A top-down view of the surface at the left, showing the height using colored contours. Bottom row: The label is (1, 0), so now the error is a bowl with the bottom at P1 = 1 and P2 = 0. ](../resources/Backpropagation/Images/simple-net-errors-range-3-3.jpg width="600px") For both labels the shape of the error surface is the same: a bowl with a rounded bottom. The only difference is the location of the bottom of the bowl, which is directly over the label. 
This makes sense, because our whole intention is to get P1 and P2 to match the label. When they do, we have zero error. So the bottom of the bowl has a value of 0, and it sits right on top of the label. The more different P1 and P2 are from the label, the more the error grows. These plots let us make a connection between the gradient of this error surface and the delta values for the output layer neurons C and D. It will help to remember that P1, the likelihood of the sample belonging to class 1, is just another name for the output of C, which we also call $Co$. Similarly, P2 is another name for $Do$. So if we say that we want to see a specific change in the value of P1, we're saying that we want the output of neuron C to change in that way, and the same is true for P2 and D. Let's look at one of these error surfaces a little more carefully so we can really get a feeling for it. Suppose we have a label of (1,0), like in the bottom row of Figure [fig-simple-net-errors-range-3-3]. Let's suppose that for a particular sample, output P1 has the value -1, and output P2 has the value 0. In this example, P2 matches the label, but we want P1 to change from -1 to 1. Since we want to change P1 while leaving P2 alone, let's look at the part of the graph that tells us how the error will change by doing just that. We'll set P2=0 and look at the cross-section of the bowl for different values of P1. We can see it follows the overall bowl shape, as in Figure [fig-error-for-P2-0]. ![Figure [fig-error-for-P2-0]: Slicing away the error surface from the bottom left of Figure [fig-simple-net-errors-range-3-3], where the label is (1,0). The revealed cross-section of the surface shows us the values of the error for different values of P1 when P2 = 0. ](../resources/Backpropagation/Images/error-for-P2-0.jpg width="300px") Let's look just at this slice of the error surface, shown in Figure [fig-error-P1-only-4-4]. ![Figure [fig-error-P1-only-4-4]: Looking at the cross-section of the error function shown in Figure [fig-error-for-P2-0], we can see how the error depends on different values of P1, when P2 is fixed at 0. Here we've marked the value of the curve when P1 = -1 with an orange dot. The green line shows the derivative of the curve at that point. This tells us that if we make P1 more positive (that is, we move right from -1), the error in the network will decrease. But if we go too far and increase P1 beyond the value of 1, the error will start to increase again. ](../resources/Backpropagation/Images/error-P1-only-4-4.jpg width="400px") Here we've marked the value P1= -1 with an orange dot, and we've drawn the derivative at the location on the curve directly above this value of P1. The derivative is just the piece of the gradient that applies to only P1. The derivative tells us how the error changes as the value of P1 changes, *for these values of P2 and the label*. As we can see from the figure, if we get too far away from -1 the derivative no longer matches the curve, but close to -1 it does a good job. We'll come back to this idea again later: the derivative of a curve tells us what happens to the error if we move P1 *by a very small amount* from a given location. The smaller the move, the more accurate the derivative will be at predicting our new error. This is true for any derivative, or any gradient. We can see this characteristic in Figure [fig-error-P1-only-4-4]. 
If we move P1 by 1 unit to the right from -1, the derivative would land us at an error of 0, though it looks like the error for P1=0 (the value of the black curve) is really about 1. We *can* use the derivative to predict the results for large changes in P1, but our accuracy will go down the farther we move, as we just saw. In the interests of clear figures that are easy to read, we'll sometimes make large moves when the value the derivative predicts and the value the real error curve gives us are close enough.

Let's use the derivative to predict the change in the error due to a change in P1. What's the slope of the green line in Figure [fig-error-P1-only-4-4]? The left end is at about (-2, 8), and the right end is at about (0,0). Thus the line descends about 4 units for every 1 unit we move to the right, for a slope of -4/1 or -4. So if P1 changed by 0.5 (that is, it changed from -1 to -0.5), we'd predict that the error would change by $0.5 \times -4 = -2$.

That's it! That's the key observation that tells us the value of $C\delta$. Remember that P1 is just another name for $Co$, the output of C. We've found that a change of 1 in $Co$ results in a change of -4 in the error. As we discussed, we shouldn't have too much confidence in this prediction after such a big change in P1. But for small moves, the proportion is right. For instance, if we were to increase P1 by 0.01, then we'd expect the error to change by $-4 \times 0.01 = -0.04$, and for such a small change in P1 the predicted change in the error should be pretty accurate. If we increased P1 by 0.02, then we'd expect the error to change by $-4 \times 0.02 = -0.08$. If we move P1 to the left, so it changes from -1 to, say, -1.1, we'd expect the error to change by $-0.1 \times -4 = 0.4$, so the error would increase by 0.4.

We've found that for any amount of change in $Co$, we can predict the change in the error by multiplying that change by -4. That's exactly what we've been looking for! The value of $C\delta$ is -4. Note that this only holds for this label, and these values of $Co$ and $Do$ (or P1 and P2).

We've just found our first delta value, telling us how much the error will change if there's a change to the output of C. It's just the derivative of the error function measured at P1 (or $Co$). Figure [fig-C-error-delta] shows what we've just described using our error diagram.

![Figure [fig-C-error-delta]: Our error diagram illustrating the change in the error from a change in the output of neuron C. The original output is the green bar at the far left. We imagine that due to a change in the inputs, the output of C increases by an amount $Cm$. This is amplified by multiplying it with $C\delta$, giving us the change in the error, $Em$. That is, $Em = Cm \times C\delta$. Here the value of $Cm$ is about 1/4 (the upward arrow in the box for $Cm$ tells us that the change is positive), and the value of $C\delta$ is -4 (the arrow in that box tells us the value is negative). So $Em = -4 \times 1/4 = -1$. The new error, at the far right, is the previous error plus $Em$. ](../resources/Backpropagation/Images/C-error-delta.jpg width="500px")

Remember that at this point we're not going to do anything with this delta value. Our goal right now is just to find the deltas for our neurons. We'll use them later. We assumed above that P2 already had the right value, and we only needed to adjust P1. But what if they were both different from their corresponding label values?
Then we'd repeat this whole process for P2, to get the value of $D\delta$, or delta for neuron D. Let's do just that. In Figure [fig-clipped-error-pair] we can see the slices of the error for an input with a corresponding label, or target, of (1,0) when P1 = -0.5, and P2 = 1.5.

![Figure [fig-clipped-error-pair]: Slicing our error function when both P1 and P2 are different from the label. Left: The error due to different values of P1 when P2=1.5. Right: The error due to different values of P2 when P1 = -0.5. ](../resources/Backpropagation/Images/clipped-error-pair.jpg width="700px")

The cross-section curves are shown in Figure [fig-error-P1-P2]. Notice that the curve for P1 has changed from Figure [fig-error-P1-only-4-4]. That's because the change in P2 means we're taking the P1 cross-section at P2=1.5 instead of P2=0.

![Figure [fig-error-P1-P2]: When neither P1 nor P2 matches the label, we can try to reduce the error by adjusting each one individually. In this example, the label is (1, 0). Left: The error for different values of P1 when P2 is 1.5. The smallest value we can get to is a little more than 2. The derivative tells us that we can get to that lower value by making P1 larger than its current value of -0.5 (that is, moving to the right). Right: The error for different values of P2 when P1 = -0.5. The current value of P2 is 1.5. The derivative tells us that we can make the error smaller by making P2 smaller (that is, moving to the left). ](../resources/Backpropagation/Images/error-P1-P2.jpg width="600px")

For the specific values in this figure, it looks like a change of about 0.5 in P1 would result in a change of about -1.5 in the error, so $C\delta$ is about -1.5/0.5 = -3. Instead of changing P1, what if we changed P2? Looking at the graph on the right, a change of about -0.5 (moving left this time, towards the minimum of the bowl) would result in a change of about -1.25 in the error, so $D\delta$ is about -1.25/-0.5 = 2.5. The positive sign here tells us that moving P2 to the right will cause the error to go up, so we want to move P2 to the left.

There are some interesting things to observe here. First, although both curves are bowl shaped, the bottoms of the bowls are at their respective label values. Second, because the current values of P1 and P2 are on opposite sides of the bottom of their respective bowls, their derivatives have opposite signs (one is positive, the other is negative).

The most important observation is that the minimum available error is not 0. In particular, the curves never get lower than a bit more than 2. That's because each curve looks at changing just one of the two values, while the other is left fixed. So even if P1 got to its ideal value of 1, there would still be error in the result because P2 is not at its ideal value of 0, and vice-versa. This means that if we change just one of these two values, we'll never get down to the minimum error of 0. To get the error down to 0, both P1 and P2 have to work their way down to the bottom of their respective curves.

Here's the wrinkle: each time either P1 or P2 changes, we pick out a different cross-section of the error surface. That means we get different curves for the error, just as Figure [fig-error-P1-only-4-4], when P2=0, is different from Figure [fig-error-P1-P2] when P2 = 1.5. Because the error curve has changed, the deltas change as well. So if we change either value, we have to restart this whole process again from scratch before we know how to change the other value. Well, not exactly.
We'll see later that we actually update both values at the same time, as long as we take very small steps. But that's about as far as we can cheat before we risk driving the error up again. Once we've taken our small steps, we have to evaluate the error surface again to find new curves and then new derivatives before we can adjust P1 and P2 again.

We've just described the general way to compute the delta values for the output neurons (we'll look at hidden neurons in a moment). In practice, we often use a particular measure of error that makes things easier, because we can write down a super simple formula for the derivative, which in turn gives us an easy way to find the deltas. This error measure is called the **quadratic cost function**, or the **mean squared error** (or **MSE**) [#Neilsen15a]. As usual, we won't get into the mathematics of this equation. What matters for us now is that this choice of function for the error means that the derivative at any neuron's value (that is, the neuron's delta value) is particularly easy to calculate. The delta for an output neuron is just the corresponding entry in the label minus the neuron's output [#Seung05]. Figure [fig-C-D-delta] shows the idea graphically.

![Figure [fig-C-D-delta]: When we use the quadratic cost function, the delta for any output neuron is just the value in the label minus the output of that neuron. As shown in red, we save that delta value with its neuron. ](../resources/Backpropagation/Images/C-D-delta.jpg width="200px")

This little calculation matches up nicely with our bowl-shaped pictures above. Remember that $Co$ and P1 are two names for the same value, as are $Do$ and P2. Let's consider $Co$ (or P1) when the first label is 1. If $Co=1$, then $1 - Co = 0$, telling us that a tiny change to $Co$ will have no effect on the output error. That makes sense, because we're at the bottom of the bowl where it's flat. Suppose that $Co=2$. Then the difference $1-Co = -1$, telling us that a change to $Co$ will change the error by the same amount, but with opposite sign (for example, a change of 0.2 in $Co$ would result in a change of -0.2 to the error). If $Co$ is much larger, say $Co=5$, then $1-Co= -4$, telling us that any change to $Co$ will be amplified by a factor of -4 in the change to the error. That also makes sense, because we're now at a very steep part of the bowl where a small change in the input ($Co$) will cause a big change in the output (the change in the error). We've been using large numbers for convenience, but remember that the derivative only tells us what happens if we take a very small step. The same thought process holds for neuron D, and its output $Do$ (or $P2$).

We've now completed the first step in backpropagation: we've found the delta values for all the neurons in the output layer. We've seen that the delta for an output neuron depends on the value in the label and the neuron's output. If the neuron's output changes, the delta will change as well. So "the delta" is a temporary value that changes with every new label, and every new output of the neuron. Following this observation, any time we update the weights in our network, we'll need to calculate new deltas.

Remember that our goal is to find the deltas for the weights. When we know the deltas for all the neurons in a layer, we can update all the weights feeding into that layer. Let's see how that's done.

Step 2: Using Deltas to Change Weights
====

We've seen how to find a delta value for every neuron in the output layer.
If the output of any of those neurons changes by some amount, we multiply that change by the neuron's delta to find the resulting change to the error. We know that a change to the neuron's output can come from a change in an input, which in turn can come from a change in a previous neuron's output or the weight connecting that output to this neuron. Let's look at these inputs. We'll focus on output neuron C, and the value it receives from neuron A on the previous layer. Let's use the temporary name $V$ to refer to the value that arrives into C from A. It has a value given by the output of A, or $Ao$, and the weight between A and C, or $AC$, so the value V is $Ao \times AC$. This setup is shown in Figure [fig-AC-1]. ![Figure [fig-AC-1]: The value coming into neuron C, which we're calling V for the moment, is $Ao$ (the output of A) times the weight $AC$, or $Ao \times AC$.](../resources/Backpropagation/Images/AC-1.jpg height="100px") If $V$ changes for any reason, we know that the output of C will change by the same amount (since C is just adding up its inputs and passing on that sum). And since we built $C\delta$ to tell us the effect of a change in C's output, the network's error will change by the change in $V$ times $C\delta$. There are only two ways the value of $V$ can change in this setup: either the output of A changes or the value of the weight changes. Since the output of A is computed by the neuron automatically, there's not much we can do to adjust that value directly. But we can change the weight, so let's look at that. Let's modify the weight $AC$ by adding some new value to it. We'll call that new value $ACm$, so the new value of the weight is $(AC + ACm)$. Then the value arriving at C is $Ao \times (AC + ACm)$. This is shown in Figure [fig-AC-3]. ![Figure [fig-AC-3]: If we change the weight $AC$, then this will change the value of $V$. Here, the value of $V$ coming into neuron C is $Ao \times (AC + ACm)$. ](../resources/Backpropagation/Images/AC-3.jpg height="100px") If we subtract the old value $Ao \times AC$ from the new value $Ao \times (AC + ACm)$, we find that the change in the value coming into C due to our modifying the weight is $Ao \times ACm$. Since this change goes into C, the change will then get multiplied by $C\delta$, telling us the change in the network's error as a result of modifying the weight. Hold on a second. That's what we've wanted this whole time, to find out what a change in the weight would do to the error. Have we just found that? Yup, we have. We just struck gold! We discovered how a change to a weight would affect the network's error. Let's look at that again. If we change the weight $AC$ by adding $ACm$ to it, then we know the change in the network's error is given by the change in C, $(Ao \times ACm)$, times the delta for C, or $C\delta$. That is, the change in error is $(Ao \times ACm) \times C\delta$. Let's suppose we increase the weight by 1. Then $ACm=1$, so the error will change by $Ao \times C\delta$. So every change of +1 to the weight $AC$ will cause a change of $Ao \times C\delta$ to the network error. If we add 2 to the weight $AC$, we'll get double this change, or $2 \times (Ao \times C\delta)$. If we add 0.5 to the weight, we'll get half as much, or $1/2 \times (Ao \times C\delta)$. We can turn this around: if we add the value $Ao \times C\delta$ itself to the weight, the change in the error is $(Ao \times C\delta) \times (Ao \times C\delta)$, a squared quantity that can never be negative, so the error goes up (or at best stays put). But we want to make the error go down. The same logic says that we can reduce the error if we instead *subtract* the value $Ao \times C\delta$ from the weight. The little numerical sketch below plays this out with some made-up numbers.
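To make this concrete, here's a tiny numerical check. This isn't the book's code, and the values for $Ao$ and $C\delta$ are made up; it just treats $C\delta$ as the derivative of the error with respect to C's output and asks, to first order, what a change to the weight does to the error.

```python
# A toy check of the weight-update idea, with made-up numbers.
Ao      = 0.8     # output of neuron A (hypothetical)
C_delta = -3.0    # derivative of the error with respect to C's output (hypothetical)

def error_change(ACm):
    """First-order estimate: changing the weight AC by ACm changes C's input
    by Ao * ACm, and that change is scaled by C_delta to change the error."""
    return Ao * ACm * C_delta

print(error_change(1.0))    # add +1 to the weight: error changes by Ao * C_delta = -2.4
print(error_change(2.0))    # twice the step gives twice the change: -4.8
step = -(Ao * C_delta)      # the suggested update: subtract Ao * C_delta from the weight
print(error_change(step))   # -(Ao * C_delta)**2 = -5.76, never positive, so the error drops
```

Whatever the signs of $Ao$ and $C\delta$ happen to be, the last line is minus a squared quantity, so this update can only push the error down.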
That's the top of the mountain. We've found what to do to the weight $AC$ to decrease the error in this little network: to push the error down, we subtract $Ao \times C\delta$ from $AC$. Figure [fig-AC-4] shows how each change in the weight $AC$ leads to a corresponding change in the network error, and subtracting $Ao \times C\delta$ from the weight moves the error down. ![Figure [fig-AC-4]: For each change in the value of the weight $AC$, we can look up the corresponding change in our network's error. When $AC$ changes by $Ao \times C\delta$, the network error changes by 1. ](../resources/Backpropagation/Images/AC-4.jpg width="400px") We can summarize this process visually with a new convention for our diagrams. We've been drawing the outputs of neurons as arrows coming out of a circle to the right. Let's draw deltas using arrows coming out of the circles to the left, as in Figure [fig-C-o-delta]. ![Figure [fig-C-o-delta]: Neuron C has an output $Co$, drawn with an arrow pointing right, and a delta $C\delta$, drawn with an arrow pointing left. ](../resources/Backpropagation/Images/C-o-delta.jpg width="300px") With this convention, the whole process for finding the updated value for a weight is summarized in Figure [fig-AC-update-basic]. Showing subtraction in a diagram like this is hard, because if we have a "minus" node with two incoming arrows, it's not clear which value is being subtracted from the other (that is, if the inputs are $x$ and $y$, are we computing $x-y$ or $y-x$?). Our approach to computing $AC - (Ao \times C\delta)$ is to find $Ao \times C\delta$, multiply that by $-1$, and then add that result to $AC$. ![Figure [fig-AC-update-basic]: Updating the value of weight $AC$. We start with the output $Ao$ from neuron A and the delta $C\delta$ from neuron C, and multiply them together. We'd like to subtract this from the current value of $AC$. To show this clearly in the diagram, we instead multiply the product by $-1$, and then add it to $AC$. The green arrow is the update step, where this result becomes the new value of $AC$.](../resources/Backpropagation/Images/AC-update-basic.jpg width="400px") Figure [fig-AC-update-basic] is the goal of this chapter, and the reason we computed $C\delta$. This is how our network learns. Figure [fig-AC-update-basic] tells us how to change the weight $AC$ to bring down the error. The diagram says that to reduce the error, we should add $-(Ao \times C\delta)$ to the value of $AC$. Success! If we change the weights feeding both output neurons C and D this way, each change reduces the error, and we'd expect the two reductions to add together. We can predict this because neurons sharing the same layer don't rely on each other's outputs. Since C and D are both in the output layer, C doesn't depend on $Do$ and D doesn't depend on $Co$. They do depend on the outputs of neurons on previous layers, but right now we're just focusing on the effect of changing the weights into C and D. It's wonderful that we know how to adjust the weights feeding the output neurons, but how about all the other weights? To use this technique, we need to figure out the deltas for all the neurons in all the remaining layers. Then we can use Figure [fig-AC-update-basic] to improve all the weights in the network. And this brings us to the remarkable trick of backpropagation: we can use the neuron deltas at one layer to find all the neuron deltas for its preceding layer.
And as we've just seen, knowing the neuron deltas and the neuron outputs tells us how to update all the weights coming into those neurons. Let's see how to do that. Step 3: Other Neuron Deltas ====== Now that we have the delta values for the output neurons, we will use them to compute the deltas for neurons on the layer just before the output layer. In our simple model, that's just neurons A and B. Let's focus for the moment just on neuron A, and its connection to neuron C. What happens if $Ao$, the output of A, changes for some reason? Let's say it goes up by $Am$. Figure [fig-AC-error-1] follows the chain of actions from this change in $Ao$ to the change in $Co$ to the change in the error. ![Figure [fig-AC-error-1]: Following the results if we change the output of neuron A. Read the diagram left to right. The change to A, shown as $Am$, is multiplied by the weight AC and is added to the values accumulated by neuron C. This raises the output of C by $Cm$. As we know, this change in C can be multiplied by $C\delta$ to find the change in the network error. In this example, $Am=5/4$, and $AC=1/5$, so $Cm=5/4 \times 1/5 = 1/4$. The value of $C\delta$ is -4, so the change in the error is $1/4 \times -4 = -1$. ](../resources/Backpropagation/Images/AC-error-1.jpg width="500px") We know that the neuron C adds together its weighted inputs and then passes on the sum (since we're ignoring the activation function for now). So if nothing else changes in C except for the value coming from A, that's the only source of any change in $Co$, the output of C. We represent that change to $Co$ as the value $Cm$. As we saw before, we can predict the change in the network's error by multiplying $Cm$ by $C\delta$. So now we have a chain of operations from neuron A to neuron C and then to the error. The first step of the chain says that if we multiply the change in $Ao$ (that is, $Am$) by the weight $AC$, we'll get $Cm$, the change in the output of C. And we know from above that if we multiply $Cm$ by $C\delta$, we get the change in the error. So mushing this all together, we find that the change in the error due to a change $Am$ in the output of A is $Am \times AC \times C\delta$. In other words, if we multiply the change in A (that is, $Am$) by $AC \times C\delta$, we get the change in the error due to the change in neuron A. That is, $A\delta = AC \times C\delta$. We just identified the delta of A! Since the delta is the value that we multiply a neuron's change by to find the change in the error, and we just found that value is $AC \times C\delta$, then we've found $A\delta$. Figure [fig-AC-error-2] shows this visually. ![Figure [fig-AC-error-2]: We can mush together the operations in Figure [fig-AC-error-1] into a more succinct diagram. In that figure, we saw that $Am$, the change in A, gets multiplied by the weight $AC$ and then the value $C\delta$. So we can represent that as a single step, where we multiply $Am$ by the two values $AC$ and $C\delta$ multiplied together. As before, the value of $Am = 5/4$, $AC = 1/5$, and $C\delta$ is -4. So $AC \times C\delta = -4/5$, and multiplying that by $Am=5/4$ gives us a change in the error of -1.0. ](../resources/Backpropagation/Images/AC-error-2.jpg width="450px") This is kind of amazing. Neuron C has disappeared. It's literally out of the picture in Figure [fig-AC-error-2]. All we needed was its delta, $C\delta$, and from that we could find $A\delta$, the delta for A. And now that we know $A\delta$, we can update all of the weights that feed into neuron A, and then...
no, wait a second. We don't really have $A\delta$ yet. We just have one piece of it. At the start of this discussion we said we'd focus on neurons A and C, and that was fine. But if we now remember the rest of the network, we can see that neuron D also uses the output of A. If $Ao$ changes due to $Am$, then the output of D will change as well, and that will also have an effect on the error. To find the change in the error due to neuron D, caused by a change to neuron A, we can repeat the process above, just replacing neuron C with neuron D. So if $Ao$ changes by $Am$, and nothing else changes, the change in the error along this path is $Am \times AD \times D\delta$. Figure [fig-ADC-error] shows these two paths at the same time. This figure is set up slightly differently from the ones above. Here, the effect of a change in A on the error by way of neuron C is shown by the path from the center of the diagram moving to the right. The effect of a change in A on the error by way of neuron D is shown by the path from the center of the diagram moving to the left. ![Figure [fig-ADC-error]: The output of neuron A is used by both neuron C and neuron D. We've shown the effects as two separate paths, each starting at neuron A in the center. The path through neuron C is shown going to the right, where $Am$, the change in the output of A, is scaled by $AC \times C\delta$ to get a change in the error. Moving from the center to the left, $Am$ is scaled by $AD \times D\delta$ to get another change to the error. The result is two separate changes to the error. ](../resources/Backpropagation/Images/ADC-error.jpg width="600px") Figure [fig-ADC-error] shows two separate changes to the error. Since neurons C and D don't influence each other, their effects on the error are independent. To find the total change to the error, we just add up the two changes. Figure [fig-ADC-error-added] shows the result of adding the change in error via neuron C and the change via neuron D. ![Figure [fig-ADC-error-added]: When the output of neuron A is used by both neuron C and neuron D, the resulting changes to the error add together. ](../resources/Backpropagation/Images/ADC-error-added.jpg height="300px") Now that we've handled all the paths from A to the outputs, we can finally write the value for $A\delta$. Since the errors add together, as in Figure [fig-ADC-error-added], we can just add up the factors that scale $Am$. If we write it out, this is $A\delta = (AC \times C\delta) + (AD \times D\delta)$. Now that we've found the value of delta for neuron A, we can repeat the process for neuron B to find its delta. We've actually done something far better than find the deltas for just neurons A and B. We've found out how to get the value of delta for *every* neuron in *any* network, no matter how many layers it has or how many neurons there are! That's because everything we've done involves nothing more than a neuron, the deltas of all the neurons in the next layer that use its value as an input, and the weights that join them. With nothing more than that, we can find the effect of a neuron's change on the network's error, even if the output layer is dozens of layers away. To summarize this visually, let's expand on our convention for drawing outputs and deltas as right-pointing and left-pointing arrows to include the weights, as in Figure [fig-o-delta-arrows]. We'll say that the weight on a connection multiplies either the output on the left or the delta on the right, depending on which way the arrow points.
![Figure [fig-o-delta-arrows]: Drawing the values associated with neuron A. (a) Our convention is to draw the output $Ao$ as an arrow coming out of the right of the neuron, and the delta $A\delta$ as an arrow coming out of the left. (b) The output of A is multiplied by $AC$ on its way to being used by C when we're evaluating a sample. (c) The delta of C is multiplied by $AC$ on its way to being used by A when we're computing delta values. ](../resources/Backpropagation/Images/o-delta-arrows.jpg height="300px") In other words, there is *one connection* with *one weight* joining neurons A and C. If the arrow points to the right, then the weight multiplies $Ao$, the output of A as it heads into neuron C. If the arrow points to the left, the weight multiplies $C\delta$, the delta of C, as it heads into neuron A. When we evaluate a sample, we use the feed-forward, left-to-right style of drawing, where the output value from neuron A to neuron C travels over a connection with weight $AC$. The result is that the value $Ao \times AC$ arrives at neuron C where it's added to other incoming values, as in Figure [fig-o-delta-arrows](b). When we later want to compute $A\delta$, we draw the flow from right to left. Then the delta leaving neuron C travels over a connection with weight $AC$. The result is that the value $C\delta \times AC$ arrives at neuron A where it's added to other incoming values, as in Figure [fig-o-delta-arrows](c). Now we can summarize both the processing of a sample input, and the computation of the deltas, in Figure [fig-no-af-h-o]. ![Figure [fig-no-af-h-o]: Calculating the output and delta for neuron H. Left: To calculate $Ho$, we scale the output of each preceding neuron by the weight of its connection and add the results together. Right: To calculate $H\delta$, we scale the delta of each following neuron by the connection's weight and add the results together. ](../resources/Backpropagation/Images/no-af-h-o.jpg width="500px") This is pleasingly symmetrical. It also reveals a very important practical result: when a neuron is connected to the same number of preceding and following neurons, calculating the delta for a neuron takes the same amount of work (and therefore the same amount of time) as calculating its output. So calculating deltas is as efficient as calculating output values. Even when the input and output counts are different, the amount of work involved is still close in both directions. Note that Figure [fig-no-af-h-o] doesn't require anything of neuron H except that it has inputs from a preceding layer that travel on connections with weights, and deltas from a following layer that travel on connections with weights. So we can apply the left half of Figure [fig-no-af-h-o] and calculate the output of neuron H as soon as the outputs from the previous layer are available. And we can apply the right half of Figure [fig-no-af-h-o] and calculate the delta of neuron H as soon as the deltas from the following layer are available. This also tells us why we had to treat the output layer neurons as a special case: there are no "next layer" deltas to be used. This process of finding the delta for every neuron in the network *is* the backpropagation algorithm. Backprop in Action ==== In the last section we saw the backpropagation algorithm, which lets us compute the delta for every neuron in a network. Because that calculation depended on the deltas in the following neurons, and the output neurons don't have any of those, we had to treat the output neurons as a special case.
Once all the neuron deltas for any layer (including the output layer) have been found, we can then step back one layer (towards the inputs) and find the deltas of all the neurons on that layer, then step back again, compute all the deltas, step back again, and so on until we reach the input. Let's walk through the process of using backprop to find the deltas for all the neurons in a slightly larger network. In Figure [fig-BP-net-only] we show a new network with four layers. There are still two inputs and outputs, but now we have 3 hidden layers of 2, 4, and 3 neurons. ![Figure [fig-BP-net-only]: A new classifier network with 2 inputs, 2 outputs, and 3 hidden layers.](../resources/Backpropagation/Images/BP-net-only.jpg width="400px") We start things off by evaluating a sample. We provide the values of its X and Y features to the inputs, and eventually the network produces the output predictions P1 and P2. Now we'll start backpropagation by finding the error in the output neurons, as shown in Figure [fig-BP-delta-P1-only]. ![Figure [fig-BP-delta-P1-only]: Computing the delta of the first output neuron in a new network. Using the general approach, we take the outputs of the output layer (here called P1 and P2) and compare them to the label to derive an error. From the outputs, the label, and the error value we find how much a change in P1 would change the error. That value is the delta stored at the neuron that produces P1. ](../resources/Backpropagation/Images/BP-delta-P1-only.jpg width="500px") We've begun arbitrarily with the upper neuron, which gives us the prediction we've labeled P1 (the likelihood that the sample is in class 1). From the values of P1 and P2 and the label, we can compute the error in the network's output. Let's suppose the network didn't predict this sample perfectly, so the error is greater than zero. Using the error, the label, and the values of P1 and P2, we can compute the value of delta for this neuron. If we're using the quadratic cost function, this delta is just the value of the label minus the value of the neuron, as we saw in Figure [fig-C-D-delta]. But if we're using some other function, it might be more complicated, so we've illustrated the general case. Once we've computed the value of delta for this neuron, we store it with the neuron, and we're done with that neuron for now. We'll repeat this process for all the other neurons in the output layer (here we have only one more). That finishes up the output layer, since we now have a delta for every neuron in the layer. Figure [fig-BP-delta-1] summarizes these two neurons getting their deltas. ![Figure [fig-BP-delta-1]: Summarizing the steps for finding the delta for both output neurons.](../resources/Backpropagation/Images/BP-delta-1.jpg height="400px") At this point we could start adjusting the weights coming into the output layer, but we usually break things up by first finding all the neuron deltas, and then adjusting all the weights. Let's follow that typical sequence here. So we'll move backwards one step to the third hidden layer (the one with 3 neurons). Let's consider finding the value of delta for the topmost of these three, as in the left image of Figure [fig-BP-delta-2]. ![Figure [fig-BP-delta-2]: Using backpropagation to find the deltas for the next-to-last layer of neurons. To find the delta for each neuron, we find the deltas of the neurons that use its output, multiply those deltas by the corresponding weights, and add the results together.
](../resources/Backpropagation/Images/BP-delta-2.jpg width="700px") To find the delta for this neuron, we follow the recipe of Figure [fig-ADC-error] to get the individual contributions, and then the recipe of Figure [fig-ADC-error-added] to add them together to get the delta for this neuron. Now we just work our way through the layer, applying the same process to each neuron. When we've completed all the neurons in this 3-neuron layer, we take a step backwards and start on the hidden layer with 4 neurons. And this is where things really become beautiful. To find the deltas for each neuron in this layer, we need only the weights to each neuron that uses this neuron's output, and the deltas for those neurons, which we just computed. The other layers are irrelevant. We don't care about the output layer any more. All we need are the deltas in the next layer's neurons, and the weights that get us to those neurons. Figure [fig-BP-delta-3] shows how we compute the deltas for the four neurons in the second hidden layer. ![Figure [fig-BP-delta-3]: Using backprop to find the delta values for the second hidden layer. ](../resources/Backpropagation/Images/BP-delta-3.jpg width="500px") When all 4 neurons have had deltas assigned to them, that layer is finished, and we take another step backwards. Now we're at the first hidden layer with two neurons. Each of these connects to the 4 neurons on the next layer. Once again, all we care about now are the deltas in that next layer and the weights that connect the two layers. For each neuron we find the deltas for all the neurons that consume that neuron's output, multiply those by the weights, and add up the results, as shown in Figure [fig-BP-delta-4]. ![Figure [fig-BP-delta-4]: Using backprop to find the deltas for the first hidden layer.](../resources/Backpropagation/Images/BP-delta-4.jpg width="500px") When Figure [fig-BP-delta-4] is complete, we've found the delta for every neuron in the network. Now we'll adjust the weights. We'll run through the connections between neurons and use the technique we saw in Figure [fig-AC-update-basic] to update every weight to a new and improved value. Figure [fig-BP-delta-1] through Figure [fig-BP-delta-4] show why the algorithm is called *backwards propagation*. We're taking the deltas from any layer and *propagating*, or moving, their information *backwards* one layer at a time, modifying it as we go. As we've seen, computing each of these delta values is fast. It's just one multiplication per outgoing connection, and then adding those pieces together. That takes almost no time at all. Backprop becomes highly efficient when we use parallel hardware like a GPU. Because the neurons on a layer don't interact, and the weights and deltas that get multiplied are already computed, we can multiply *all* the deltas and weights for *an entire layer* at once. Computing an entire layer's worth of deltas in parallel saves us a lot of time. If each of our layers had 100 neurons, and we had enough hardware, computing all 400 deltas would take only the same time required to find 4 deltas. The tremendous efficiency that comes from this parallelism is a key reason why backprop is so important. Now we have all of the deltas, and we know how to update the weights. Once we actually do update the weights, we should re-compute all the deltas, because they're based on the weights. We're just about done, but we need to make good on our earlier promise and put activation functions back into our neurons. Before we do, here's a short sketch that pulls the delta and weight-update steps together in code.
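This is a minimal sketch, not the book's code. It uses linear neurons (no activation functions or biases yet), processes one sample at a time, and takes each output delta to be the neuron's output minus its label, which is the derivative of a half-squared-difference error. The layer sizes and numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2 inputs, hidden layers of 2, 4, and 3 neurons, and 2 outputs, like the walk-through network.
sizes = [2, 2, 4, 3, 2]
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]

def forward(x):
    """Forward pass: each layer's outputs are the weighted sums of the previous layer's outputs."""
    outs = [x]
    for W in weights:
        outs.append(outs[-1] @ W)
    return outs                           # one vector of outputs per layer, inputs included

def backprop(outs, label):
    """Output deltas come from the label; every earlier delta is a weighted
    sum of the deltas in the following layer."""
    deltas = [None] * len(weights)
    deltas[-1] = outs[-1] - label         # derivative of the half-squared error
    for i in range(len(weights) - 2, -1, -1):
        deltas[i] = weights[i + 1] @ deltas[i + 1]
    return deltas

def update_weights(outs, deltas, eta=0.01):
    """Each weight moves by -eta * (left neuron's output) * (right neuron's delta)."""
    for i, W in enumerate(weights):
        W -= eta * np.outer(outs[i], deltas[i])

x, label = np.array([0.4, -1.2]), np.array([1.0, 0.0])
outs = forward(x)
update_weights(outs, backprop(outs, label))
```

Notice that the backward loop needs only the following layer's deltas and the weights between the layers, so it is just as cheap as the forward loop.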
Including Activation Functions ==== Including the activation function in backpropagation is a small step, but understanding that step and why it's the right thing to do takes a bit of thinking. We left it out of our earlier discussion so we wouldn't get distracted, but let's now get the activation function back in there. We'll start by thinking about a neuron during the feed-forward step, when we're evaluating a sample and data is flowing from left to right, producing neuron outputs as it goes. When we calculate the output of a neuron, the sum of the weighted inputs goes through an activation function before it leaves the neuron, as shown in Figure [fig-af-1]. ![Figure [fig-af-1]: Neuron A, with its activation function AF.](../resources/Backpropagation/Images/af-1.jpg width="300px") To make things a bit more specific, let's choose an activation function. We'll pick the sigmoid, discussed in Chapter TK, because it's smooth and makes for clear demonstrations. We won't use any particular qualities of the sigmoid, so our discussion will be applicable to any activation function. Figure [fig-sigmoid-curve] shows a plot of the sigmoid curve. For increasingly large positive inputs the output approaches 1 ever more closely without quite getting there, and for increasingly large negative inputs it approaches 0, but never quite gets there, either. Rather than constantly refer to values with phrases like "very, very nearly 1" or "extremely close to 0", let's say for simplicity that input values that are greater than about 7 can be considered to get an output value of 1, while those less than -7 can be considered to get an output of 0. Values in the range (-7, 7) will be smoothly blended in the S-shaped function that gives the sigmoid its name. ![Figure [fig-sigmoid-curve]: The sigmoid curve, plotted from -15 to 15. Values greater than about 7 or less than about -7 are very near to 1 and 0, respectively.](../resources/Backpropagation/Images/sigmoid-curve.jpg width="400px") Let's look at neuron C in our original tiny four-neuron network of Figure [fig-simple-net]. Neuron C gets one input from neuron A, and one from neuron B. For now, let's look just at the input from A, as in Figure [fig-af-3]. The value $Ao$, or the output of A, gets multiplied by the weight $AC$ before it gets summed with all the other inputs in C. Since we're focusing just on the pair of neurons A and C, we'll ignore any other inputs to C. The input value $Ao \times AC$ is then used as the input to the activation function to find the output value $Co$. ![Figure [fig-af-3]: Ignoring the other inputs to C for the moment, the input to the activation function is given by multiplying the output of A and the weight $AC$. We can find the value of the activation function at that point, which gives us the output $Co$. Left: When the output of A is $Ao$ and the weight is $AC$, the value into the activation function is $Ao \times AC$. Right: When the output of A is $Ao$ and the weight is $AC + ACm$, the value into the activation function is $Ao \times (AC + ACm)$. Here we show the values from the left plot as dots with white in the center, and the new values as filled-in dots. Note that the change in the output is substantial, because we're on a steep part of the curve. ](../resources/Backpropagation/Images/af-3.jpg width="700px") We know that to reduce the error, we'll be adding some positive or negative number $ACm$ to the value of the weight $AC$. The right diagram in Figure [fig-af-3] shows the result of adding a positive value of $ACm$.
In this case, the output $Co$ changes by a lot, because we're on a steep part of the curve. So this says that by adding $ACm$ to the weight, we're going to have a *bigger* effect on the output error than we would have had without the activation function, because that function has turned our change into something larger. Suppose instead our starting value of $Ao \times AC$ put us near a shallow part of the curve, and we add the same $ACm$ to $AC$, as in Figure [fig-af-3-small-delta]. ![Figure [fig-af-3-small-delta]: In a shallow part of the curve, before (left) and after (right) adding the same amount $ACm$ as in Figure [fig-af-3] will cause a smaller change in the output of C. ](../resources/Backpropagation/Images/af-3-small-delta.jpg width="700px") Now adding the same value $ACm$ to the weight causes a smaller change in $Co$ than before. In this case, it's even less than $ACm$ itself. The smaller change to the output means there will be a smaller change in the network error. In other words, adding $ACm$ to the weight in this situation will result in a *smaller* change to the output error than we'd get if there was no activation function present. What we'd love to have is something that can tell us, for any point on the activation function curve, how steep the curve is at that point. When we're at a point where the curve is going up steeply to the right, positive changes in the input will be amplified a lot, as in Figure [fig-af-3]. When we're at a point where the curve is going up shallowly to the right, as in Figure [fig-af-3-small-delta], such changes will be amplified by only a little. If we change the input with a negative value of $ACm$, and move our starting point to the left, then the slope tells us how much the output, and thus the error, will change in the other direction. We get the same situations as in Figure [fig-af-3](b) and Figure [fig-af-3-small-delta](b), but with the starting and ending positions reversed. Happily, we already know how to find the slope of a curve: that's just the derivative. Figure [fig-sigmoid-and-derivative] shows the sigmoid, and its derivative. ![Figure [fig-sigmoid-and-derivative]: The sigmoid curve and its derivative. Note that the vertical scales are different for the two plots.](../resources/Backpropagation/Images/sigmoid-and-derivative.jpg width="700px") The sigmoid is flat at the left and right ends. So if we're in the flat regions and we move left or right a little bit, that will cause little to no change in the output of the function. In other words, the curve is flat, or has no slope, or has a derivative of 0. The derivative increases as the input moves from about -7 to 0 because the curve is getting steeper. Then the derivative decreases back to 0 again even as the input keeps going up, because the curve becomes more shallow off to the right, approaching 1 but never quite getting there. If we were to work through the math, we'd find that this derivative is exactly what we need to fix our prediction of the change to the error based on a change to a weight. When we're on a steep part of the curve, we want to crank up the value of delta for this neuron, because changes to its inputs will cause big changes in the activation function's output, and thus big changes to the network error. When we're on a shallow part of the curve, changes to the inputs will have little effect on the output, so we want to make this neuron's delta smaller. The little sketch below plays out this amplification numerically.
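Here's a small sketch of that amplification, not from the book, with made-up values for $Ao$, $AC$, and $ACm$. The same small weight change produces a big change in $Co$ on a steep part of the sigmoid and a tiny one on a shallow part, and the derivative predicts both.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # the slope of the sigmoid at z

Ao, ACm = 1.0, 0.1                  # hypothetical output of A and weight change

for AC in (0.2, 5.0):               # a steep spot (z near 0) and a shallow spot (z near 5)
    z_old, z_new = Ao * AC, Ao * (AC + ACm)
    actual = sigmoid(z_new) - sigmoid(z_old)
    predicted = sigmoid_deriv(z_old) * (z_new - z_old)   # slope times the change in the input
    print(f"z={z_old:4.1f}  change in Co: {actual:.5f}  slope estimate: {predicted:.5f}")
```

At z near 0 the change in $Co$ is roughly a quarter of the input change; at z near 5 it's almost nothing, which is exactly what the derivative told us to expect.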
In other words, to account for the activation function, we just take the delta we normally compute, and multiply it by the derivative, evaluated at the same point that we used during the forward pass. Now the delta accounts for how the activation function will exaggerate or diminish the amount of change in the neuron's output, and hence its effect on the network error. To keep things nicely bundled, we can perform this step immediately after computing the delta as we did before. The whole business for one neuron is summarized in Figure [fig-neuron-h-o-delta]. Here we imagine we have a neuron H. The top part is the forward pass, when we're evaluating a sample and computing this neuron's output. Following a common convention, we've given the name $z$ to the result of the summation step. The value of the activation function is what we get when we look vertically upwards from the point $z$ on the X axis. The bottom part of the figure is the backward pass, where we compute this neuron's delta. Again, we use $z$ to find the derivative of the activation function, and we multiply our incoming sum by that before passing it on. ![Figure [fig-neuron-h-o-delta]: Network evaluation and backpropagation of deltas in a nutshell, for neuron H. Top: In the forward pass, the weighted inputs are added together, giving us a value we call $z$. The value of the activation function at $z$ is our output $Ho$. Bottom: In the backward pass, the weighted deltas are added together, and we use the $z$ from before to look up the derivative of the activation function. We multiply the sum of the weighted deltas by this value, giving us $H\delta$. ](../resources/Backpropagation/Images/neuron-h-o-delta.jpg height="400px") In this figure, neuron H has three inputs, coming from neurons A, B, and C, with output values $Ao$, $Bo$, and $Co$. During the forward pass, when we're finding the neuron's output, these are multiplied respectively by the weights $AH$, $BH$, and $CH$, and then added together. We've labeled this sum with the letter $z$. Now we look up $z$ in the activation function, and its value is $Ho$, the output of this neuron. Now when we run the backward pass to find the neuron's delta, we find the deltas of the neurons that use $Ho$ as an input. Let's say they're neurons J, K and L. So we multiply their deltas $J\delta$, $K\delta$, and $L\delta$ by their respective weights $HJ$, $HK$, and $HL$, and add up those results. Now we get the value of $z$ from the forward pass, and use it to find the value of the derivative of the activation function. We multiply the sum we just found with this number, and the result is $H\delta$, the delta for this neuron. This all goes for the output neurons too, if they have activation functions, only in the backward pass we use the error information rather than deltas from the next layer. Notice how compact and local everything is. The forward pass depends only on the output values of the previous layer, and the weights that connect to them. The backward pass depends only on the deltas from the following layer, the weights that connect to them, and the activation function. Now we can see why we were able to get away with ignoring the activation function throughout most of this chapter. We can pretend that we really did have an activation function all along: the **identity activation function**, shown in Figure [fig-identity-and-derivative]. This has an output that's the same as its input. That is, it has no effect. 
![Figure [fig-identity-and-derivative]: The identity function as activation function. Left: The identity produces as output the same value it received as input. Right: Its derivative is 1 everywhere, because the function has a constant slope of 1.](../resources/Backpropagation/Images/identity-and-derivative.jpg width="700px") As Figure [fig-identity-and-derivative] shows, the derivative of the identity activation function is always 1 (that's because the function is a straight line with slope of 1 everywhere, and the derivative at any point is just the slope of the curve at that point). Let's think back on our discussion of backpropagation and include this identity activation within every neuron. During the forward pass, the outputs would be unchanged by this function, so it has no effect. During the backward pass, we'd always multiply the summed deltas by 1, again having no effect. We said earlier that our results weren't limited to using the sigmoid. That's because we didn't use any special properties of the sigmoid in our discussion, other than to assume it has a derivative everywhere. This is why activation functions are designed so that they have a derivative for every value (recall from Chapter TK that library routines automatically apply mathematical techniques to take care of any spots that don't have a derivative). Let's look at another popular activation function, the ReLU. Figure [fig-ReLU-and-derivative] shows the ReLU function and its derivative. ![Figure [fig-ReLU-and-derivative]: The ReLU function as an activation function. Left: ReLU returns its input when that value is 0 or larger, and otherwise returns 0. Right: Its derivative is a step function, 0 for inputs less than 0 and 1 for inputs greater than 0. The sudden jump at 0 is automatically managed for us by libraries so that we have a smooth derivative everywhere. ](../resources/Backpropagation/Images/ReLU-and-derivative.jpg width="700px") Everything we did with the sigmoid can be applied to the ReLU, without change. And the same goes for any other activation function. That wraps things up. We've now put activation functions back into our neurons. That brings us to the end of basic backpropagation. Before we leave the discussion, though, let's look at a critical control that keeps things working well: the learning rate. The Learning Rate ==== In our description of updating a weight, we multiplied the left neuron's output value and the right neuron's delta, and subtracted that from the weight (recall Figure [fig-AC-update-basic] from much earlier). But as we've mentioned a few times, changing a weight by a lot in a single step is often a recipe for trouble. The derivative is only accurate for very tiny changes in a value. If we change a weight by too much, we can jump right over the smallest value of the error, and even find ourselves increasing the error. On the other hand, if we change a weight by too little, we might see only the tiniest bit of learning, slowing everything down. Still, that inefficiency is usually better than a system that's constantly over-reacting to errors. In practice, we control the amount of change to the weights during every update with a hyperparameter called the **learning rate**, often symbolized by the lower-case Greek letter eta ($\eta$). This is a number between 0 and 1, and it tells the weights how much of their newly-computed change to apply when they update. When we set the learning rate to 0, the weights don't change at all. Our system will never change and never learn.
If we set the learning rate to 1, the system will apply the full newly-computed changes to the weights, and might overshoot the mark. If this happens a lot, the network can spend its time constantly overshooting and then compensating, with the weights bouncing around and never settling into their best values. So we usually set the learning rate somewhere between these extremes. Figure [fig-AC-update-eta] shows how the learning rate is applied. We just scale the value of $-(Ao \times C\delta)$ by $\eta$ before adding it back in to $AC$. ![Figure [fig-AC-update-eta]: The learning rate helps us control how fast the network learns by controlling the amount by which weights change on each update. Here we see Figure [fig-AC-update-basic] with an extra step that multiplies the value $-(Ao \times C\delta)$ by the learning rate $\eta$ before adding it to $AC$. When $\eta$ is a small positive number (say 0.01), then each change will be small, which often helps the network learn. ](../resources/Backpropagation/Images/AC-update-eta.jpg width="400px") The best choice of the learning rate depends on the specific network we've built and the data we're training on. Usually we have to hunt for the best value of eta using trial and error. Happily, there are algorithms that automate the search for a good starting value for the learning rate, and other algorithms that fine-tune the learning rate as learning progresses. We'll see such algorithms in Chapter TK. As a general rule of thumb, and if none of our other choices direct us to a particular learning rate, we often start with a learning rate of around 0.01 and then train the network for a while, watching how well it learns. Then we raise or lower it from there and train again, over and over, hunting for the value that learns most efficiently. Exploring the Learning Rate --- Let's see how backprop performs with different learning rates. We'll build a classifier to find the boundary between the two half-moons that we used in Chapter TK. Figure [fig-plot-weights-training-data] shows our training data of about 1500 points. We pre-processed this data to give it zero mean and a standard deviation of 1 for each feature. ![Figure [fig-plot-weights-training-data]: About 1500 points generated synthetically by the `make_moons()` routine in scikit-learn. ](../resources/Backpropagation/Images/plot-weights-training-data.jpg width="400px") Because we have only two categories, we'll build a **binary classifier**. This lets us skip the whole business of one-hot encoding the labels and dealing with multiple outputs, and instead use just one output neuron. If the output is near 0, the input is in one class. If the output is near 1, the input is in the other class. Our classifier will have two hidden layers, each with 4 neurons. Both layers will be fully-connected, so every neuron in the first hidden layer will send its output to every neuron in the second hidden layer. Figure [fig-moons-network] shows the idea. ![Figure [fig-moons-network]: Our binary classifier takes in two values as input (the X and Y of each point). Each input goes into the 4 neurons on the first layer. Each of those 4 neurons connects to each of the 4 neurons on the next hidden layer. Then a single neuron takes the outputs of the second hidden layer and presents a single value for output. In this network, we've used ReLU activation functions for the neurons in the hidden layers, and a sigmoid activation function on the output neuron.
](../resources/Backpropagation/Images/moons-network.jpg height="200px") How many weights are in our network? There are 4 coming out of each of the 2 inputs, then $4 \times 4$ between the layers, and then 4 going into the output neuron. That gives us $(2 \times 4)+(4 \times 4)+ 4 = 28$. Each of the 9 neurons also has a bias term, so our network has $28 + 9 = 37$ weights. They start with small random numbers. Our goal is to use backprop to adjust those 37 weights so that the number that comes out of the final neuron always matches the label for that sample. As we discussed above, we'll evaluate one sample, calculate the error, compute the deltas with backprop, and then update the weights using the learning rate. Then we'll move on to the next sample. Note that if the error is 0, then the weights won't change at all. Each time we process all the samples in the training set, we say we've completed one **epoch** of training. Our discussion of backprop mentioned how much we rely on making "small changes" to the weights. There are two reasons for this. The first is that the direction of change for every weight is given by the derivative (or gradient) of the error at that weight. But as we saw, the gradient is only accurate very near the point we're evaluating. If we move too far, we may find ourselves increasing the error rather than decreasing it. The second reason for taking small steps is that changes in weights near the start of the network will cause changes in the outputs of neurons later on, and we've seen that we use those neuron outputs to help compute the changes to the later weights. To prevent everything from turning into a terrible snarl of conflicting moves, we change the weights only by small amounts. But what is "small"? For every network and data set, we have to experiment to find out. As we saw above, the size of our step is controlled by the **learning rate**, or **eta** ($\eta$). The bigger this value, the more each weight will move towards its new value. Let's start with a really large learning rate of 0.5. Figure [fig-plot-weights-decision-boundaries-eta-0.5000] shows the boundaries computed by our network for our test data. ![Figure [fig-plot-weights-decision-boundaries-eta-0.5000]: The boundaries computed by our network using a learning rate of 0.5. ](../resources/Backpropagation/Images/plot-weights-decision-boundaries-eta-0.5000.jpg width="400px") This is terrible. Everything is being assigned to a single class, shown by the light orange background. If we look at the accuracy and error (or loss) after each epoch, we get the graphs of Figure [fig-plot-weights-graphs-eta-0.5000]. ![Figure [fig-plot-weights-graphs-eta-0.5000]: Accuracy and loss for our moons data with a learning rate of 0.5. ](../resources/Backpropagation/Images/plot-weights-graphs-eta-0.5000.jpg width="700px") Things are looking bad. As we'd expect, the accuracy is just about 0.5, meaning that half the points are being misclassified. This makes sense, since the red and blue points are roughly evenly divided. If we assign them all to one category, as we're doing here, half of those assignments will be wrong. The loss, or error, starts high and doesn't fall. If we let the network run for hundreds of epochs it continues on in this way, never improving. What are the weights doing? Figure [fig-plot-weights-sorted-history-eta-0.5000] shows the values of all 37 weights during training. ![Figure [fig-plot-weights-sorted-history-eta-0.5000]: The weights of our network when using a learning rate of 0.5. 
](../resources/Backpropagation/Images/plot-weights-sorted-history-eta-0.5000.jpg width="500px") Most of the weights don't seem to be moving at all, though they could be meandering a little bit. The graph is dominated by one weight that's jumping all over. That weight is one of those going into the output neuron, trying to move its output around to match the label. That weight goes up, then down, then up, jumping too far every time, then over-correcting by too much, then over-correcting for that, and so on. These results are disappointing, but they're not shocking, because a learning rate of 0.5 is *big*. Let's reduce the learning rate by a factor of 10 to a more common value of 0.05. We'll change absolutely nothing else about the network and the data, and we'll even re-use the same sequence of pseudo-random numbers to initialize the weights. The new boundaries are shown in Figure [fig-plot-weights-decision-boundaries-eta-0.0500]. ![Figure [fig-plot-weights-decision-boundaries-eta-0.0500]: The decision boundaries when we use a learning rate of 0.05. ](../resources/Backpropagation/Images/plot-weights-decision-boundaries-eta-0.0500.jpg width="400px") This is *much* better! Looking at the graphs in Figure [fig-plot-weights-graphs-eta-0.0500] reveals that we've reached 100% accuracy on both the training and test sets after about 16 epochs. ![Figure [fig-plot-weights-graphs-eta-0.0500]: Accuracy and loss for our network when using a learning rate of 0.05. ](../resources/Backpropagation/Images/plot-weights-graphs-eta-0.0500.jpg width="700px") What are the weights doing? Figure [fig-plot-weights-sorted-history-eta-0.0500] shows us their history. Overall, this is way better, because lots of weights are changing. Some weights are getting pretty large. In Chapter TK we'll cover regularization techniques to keep the weights small in deep networks, and later in this chapter we'll see why it's nice to keep the weights small (say, between -1 and 1). For the moment, let's just note that the weights are taking on large values and move on. ![Figure [fig-plot-weights-sorted-history-eta-0.0500]: The weights in our network over time, using a learning rate of 0.05. ](../resources/Backpropagation/Images/plot-weights-sorted-history-eta-0.0500.jpg width="500px") So that's success. Our network has learned to perfectly sort the data, and it did it in only 16 epochs, which is nice and fast. On a late 2014 iMac without GPU support, the whole training process took less than 10 seconds. Just for fun, let's lower the learning rate down to 0.01. Now the weights will change even more slowly. Does this produce better results? Figure [fig-plot-weights-decision-boundaries-eta-0.0100] shows the decision boundary resulting from these tiny steps. The boundary seems to use more straight lines than the boundary in Figure [fig-plot-weights-decision-boundaries-eta-0.0500], but we can't really say either one is better than the other, since both boundaries are separating the sets perfectly. ![Figure [fig-plot-weights-decision-boundaries-eta-0.0100]: The decision boundaries for a learning rate of 0.01. ](../resources/Backpropagation/Images/plot-weights-decision-boundaries-eta-0.0100.jpg width="400px") Figure [fig-plot-weights-graphs-eta-0.0100] shows our accuracy and loss graphs. Because our learning rate is so much slower, our network takes around 170 epochs to get to 100% accuracy, rather than the 16 epochs we saw in Figure [fig-plot-weights-graphs-eta-0.0500].
![Figure [fig-plot-weights-graphs-eta-0.0100]: The accuracy and loss for our network using a learning rate of 0.01. ](../resources/Backpropagation/Images/plot-weights-graphs-eta-0.0100.jpg width="700px") These graphs show an interesting learning behavior. After an initial jump, both the training and test accuracies reach about 90% and plateau there. At the same time, the losses hit about 0.2 (for the test data) and 0.25 (for the training data), and they plateau as well. Then around epoch 170, things improve rapidly again, with the accuracy climbing to 100% and the errors dropping to 0. This pattern of alternating improvement and plateaus is not unusual, and we can even see a hint of it in Figure [fig-plot-weights-graphs-eta-0.0500], where there's an imperfect plateau between epochs 3 and 8. These plateaus come from the weights finding themselves on nearly flat regions of the error surface, resulting in near-zero gradients and thus very small updates. Though our weights might be getting stuck in local minima, it's more common for them to get caught in a flat region of a saddle, like those we saw in Chapter TK [#Dauphin14]. Sometimes it takes a long time for one of the weights to move into a region where the gradient (or derivative) is large enough to give it a good push. When one weight gets moving, it's common to see the others kick in as well, thanks to the cascading effect of that first weight's changes on the rest of the network. We'll see this in more detail in Chapter TK. The values of the weights follow almost the same pattern over time, as shown in Figure [fig-plot-weights-sorted-history-eta-0.0100]. The interesting thing is that at least some of the weights are not flat, or on a plateau. They're changing, but very slowly. The system is getting better, but in tiny steps that don't show up in the performance graphs until the changes become bigger around epoch 170. ![Figure [fig-plot-weights-sorted-history-eta-0.0100]: The history of our weights using a learning rate of 0.01. ](../resources/Backpropagation/Images/plot-weights-sorted-history-eta-0.0100.jpg width="500px") The weights seem to be growing a little bit even at epoch 200. If we let the system continue, these weights will continue to slowly grow over time with no apparent effect on the accuracies or errors. So was there any benefit to lowering the learning rate down to 0.01? Not really. Even at 0.05, the categorization was already perfect on both the training and test data. In this case, the smaller learning rate just meant the network took longer to learn. This investigation has shown us how sensitive the network is to our choice of learning rate. When we look for the best value for the learning rate, it can feel like we're the character of Goldilocks in recent versions of the fable *Goldilocks and the Three Bears* [#Wikipedia17]. We're searching for something that's not too big, and not too small, but "just right." When our learning rate was too big, the weights took steps that were too large, and the network never settled down or improved its performance. When the learning rate was too small, our progress was very slow, as the weights sometimes crept from one value to another at a glacial pace. When the learning rate was just right, training was fast and efficient, and produced great results. In this case, they were perfect. This kind of experimenting with the learning rate is part of developing nearly every deep learning network, and the sketch below shows the shape such an experiment often takes.
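As a rough illustration (this is not the author's code), here's how a learning-rate sweep like the one above might look with scikit-learn's `MLPClassifier`. The details differ from our hand-built network in ways that matter for exact numbers but not for the spirit of the experiment: `MLPClassifier` trains with log-loss and mini-batches, and the noise level and random seeds here are made up.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Two half-moons, standardized to zero mean and unit standard deviation per feature.
X, y = make_moons(n_samples=1500, noise=0.2, random_state=42)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for eta in (0.5, 0.05, 0.01):
    net = MLPClassifier(hidden_layer_sizes=(4, 4), activation='relu',
                        solver='sgd', learning_rate_init=eta,
                        max_iter=500, random_state=42)
    net.fit(X_train, y_train)
    print(f"eta={eta:<5}  train={net.score(X_train, y_train):.3f}  "
          f"test={net.score(X_test, y_test):.3f}")
```

Each run prints its training and test accuracy, which is all we need to compare the three settings side by side.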
The speed with which backprop changes the weights needs to be tuned to match the type of network and the type of data. Happily, we'll see in Chapter TK that there are automatic tools that can handle the learning rate for us in sophisticated ways. Discussion ==== Let's recap what we've seen, and then consider some of the implications of the backpropagation algorithm. Quick Recap --- To recap quickly, we start by running a sample through the network and calculate the output for every neuron, as in Figure [fig-backprop-quad-summary](a). ![Figure [fig-backprop-quad-summary]: The backprop algorithm in a nutshell. This figure collects Figure [fig-C-D-delta], Figure [fig-neuron-h-o-delta], and Figure [fig-AC-update-eta] in one place. (a) The forward pass. Neuron H's inputs are weighted and added together, giving a value $z$. We look that up in the activation function to find the output $Ho$. (b) The start of backpropagation. The deltas for the output layer neurons are found from the label and the neuron values. (c) Backpropagation of deltas. The deltas from the neurons that use neuron H's output are added together and then the result is multiplied by the value of the derivative of the activation function, measured at the same location $z$ as in the forward pass. (d) Updating the weights. We multiply the left neuron's output and the right neuron's delta, scale that by the learning rate, and subtract it from the weight. That becomes the new value of the weight. ](../resources/Backpropagation/Images/backprop-quad-summary.jpg width="700px") Then we kick off the backprop algorithm in Figure [fig-backprop-quad-summary](b). We find the delta value for each of the output neurons, telling us that if the neuron's output changes by a certain amount, the error will change by that amount times the neuron's delta. Now we take a step backwards in Figure [fig-backprop-quad-summary](c) to the previous layer of the network, and find the deltas for all the neurons in that layer. We need only the deltas from the output layer, and the weights between the two layers. Once all the deltas are assigned, we update the weights. Using the deltas and neuron outputs, we compute an adjustment for every weight: the left neuron's output times the right neuron's delta. We scale that adjustment by the learning rate, and then subtract it from the current value of the weight to get the new, updated value of the weight, as in Figure [fig-backprop-quad-summary](d). What Backprop Doesn't Do ---- There's a shorthand in some discussions of backprop that can be confusing. Authors sometimes say something like, "backpropagation moves the error backwards from the output layer to the input layer", or "backpropagation uses the error at each layer to find the error at the layer before." This can be misleading because backprop is *not* "moving the error" at all. In fact, the only time we use "the error" is at the very start of the process, when we find the delta values for the output neurons. What's really going on involves the *change* in the error, which is represented by the delta values. These act as amplifiers of change in the neuron outputs, telling us that if an output value or weight changes by a given amount, we can predict the corresponding change in the error. So backprop isn't moving "the error" backwards. But *something* is moving backwards. Let's see what that is. What Backprop Does Do ---- The first step in the backprop process is to find the deltas for the output neurons. These are found from the **gradient** of the error.
We sliced the gradient to get a look at the 2D curve for each prediction, and then we took the derivative of that curve, but that was just for ease of visualization and discussion. The derivatives are just pieces of what really matters: the gradient. As we work our way backwards, the deltas continue to represent gradients. Every delta value represents a different gradient. For instance, the delta attached to a neuron C describes the gradient of the error with respect to the output of C, and the delta for A describes the gradient of the error with respect to changes in the output of A. So when we change the weights, we're changing them in order to follow the gradient of the error. As we saw in Chapter TK, this is an example of **gradient descent**, which mimics the path that water takes as it runs downhill on a landscape. We can say that backpropagation is the algorithm that lets us efficiently update our weights using gradient descent, since the deltas it computes describe that gradient. So a nice way to summarize backprop is to say that it moves the gradient of the error backwards, modifying it to account for each neuron's contribution. Keeping Neurons Happy ---- When we put activation functions back into the backprop algorithm, we concentrated on the region where the inputs are near 0. That wasn't an accident. Let's return to the sigmoid function, and look at what happens when the value into the activation function (that is, the sum of the weighted inputs) becomes a very large number, say 10 or more. Figure [fig-big-sigmoid-and-derivative] shows the sigmoid between values of -20 and 20, along with its derivative in the same range. ![Figure [fig-big-sigmoid-and-derivative]: The sigmoid function becomes flat for very large positive and negative values. Left: The sigmoid for the range -20 to 20. Right: The derivative of the sigmoid in the same range. Note that the vertical scales of the two plots are different. ](../resources/Backpropagation/Images/big-sigmoid-and-derivative.jpg width="700px") The sigmoid never quite reaches exactly 1 or 0 at either end, but it gets extremely close. Similarly, the value of the derivative never quite reaches 0, but as we can see from the graph it gets very close. To avoid constantly referring to "nearly flat" regions and values that are "nearly 1" or "nearly 0", let's say for now that any input to the sigmoid that's greater than 7 will land in a flat region and get a value of 1.0, and any value smaller than -7 will also be in a flat region and get a value of 0.0. Figure [fig-sigmoid-neuron-a] shows a neuron with a sigmoid activation function. The value going into the function, which we've labeled $z$, has the value 10, putting it in one of the curve's flat regions. ![Figure [fig-sigmoid-neuron-a]: When we apply a large value (say 10) to the sigmoid, we find ourselves in a flat region and get back the value of 1.](../resources/Backpropagation/Images/sigmoid-neuron-a.jpg width="400px") From Figure [fig-big-sigmoid-and-derivative], we can see that the output is basically 1. Now suppose that we change one of the weights, as shown in Figure [fig-sigmoid-neuron-b]. The value $z$ increases, so we move right on the activation function curve to find our output. Let's say that the new value of $z$ is 15. The output is still basically 1. ![Figure [fig-sigmoid-neuron-b]: A big increase to the weight AD coming into this neuron has no effect on its output, because it just pushes us further to the right along the flat region of the sigmoid.
Figure [fig-sigmoid-neuron-a] shows a neuron with a sigmoid activation function. The value going into the function, which we've labeled $z$, has the value 10, putting it in one of the curve's flat regions.

![Figure [fig-sigmoid-neuron-a]: When we apply a large value (say 10) to the sigmoid, we find ourselves in a flat region and get back the value of 1.](../resources/Backpropagation/Images/sigmoid-neuron-a.jpg width="400px")

From Figure [fig-big-sigmoid-and-derivative], we can see that the output is basically 1. Now suppose that we change one of the weights, as shown in Figure [fig-sigmoid-neuron-b]. The value $z$ increases, so we move right on the activation function curve to find our output. Let's say that the new value of $z$ is 15. The output is still basically 1.

![Figure [fig-sigmoid-neuron-b]: A big increase to the weight AD coming into this neuron has no effect on its output, because it just pushes us further to the right along the flat region of the sigmoid. The neuron's output was 1 before we added ADm to the weight AD, and it's still 1 afterwards. ](../resources/Backpropagation/Images/sigmoid-neuron-b.jpg width="400px")

If we increase the value of the incoming weight again, even by a lot, we'll still get back an output of 1. In other words, *changing the incoming weight has no effect on the output*. And because the output doesn't change, the error doesn't change.

We could have predicted this from the derivative in Figure [fig-big-sigmoid-and-derivative]. When the input is 15, the derivative is 0 (actually about 0.0000003, but our convention above says that we can call that 0). So changing the input will result in no change in the output.

This is a terrible situation for any kind of learning, because we've lost the ability to improve the network by adjusting this weight. In fact, *none* of the weights coming into this neuron matter anymore (if we keep the changes small), because any changes to the weighted sum of the inputs, whether they make the sum smaller or larger, still land us on a flat part of the function, and thus there's no change to the output, and no change to the error. The same problem holds if the input value is very negative, say less than -10. The sigmoid curve is flat in that region also, and the derivative is also essentially zero.

In both of these cases we say that this neuron has **saturated**. Like a sponge that cannot hold any more water, this neuron cannot hold any more input. The output is 1, and unless the weights, the incoming values, or both, move a lot closer to 0, it's going to stay at 1. The result is that this neuron no longer participates in learning, which is a blow to our system. If this happens to enough neurons, the system could become crippled, learning more slowly than it should, or perhaps even not at all.

A popular way to prevent this problem is to use **regularization**. Recall from Chapter TK that the goal of regularization is to keep the sizes of the weights small, or close to 0. Among other benefits, this has the value of keeping the sum of the weighted inputs for each neuron also small and close to zero, which puts us in the nice S-shaped part of the activation function. This is where learning happens. In Chapter TK we'll see techniques for regularization in deep learning networks.

Saturation can happen with any activation function where the output curve becomes flat for a while (or forever). Other activation functions can have their own problems. Consider the popular ReLU curve, plotted from -20 to 20 in Figure [fig-big-ReLU-and-derivative].

![Figure [fig-big-ReLU-and-derivative]: The ReLU activation function in the range -20 to 20. Positive values won't saturate the function, but negative values can cause it to die. Left: The ReLU function. Right: The derivative of ReLU.](../resources/Backpropagation/Images/big-ReLU-and-derivative.jpg width="700px")

As long as the input is positive, this function won't saturate, because the output is the same as the input. The derivative for positive inputs is 1, so any change to the sum of the weighted inputs is passed directly through to the output. But when the input is negative, the function's output is 0, and the derivative is 0 as well. Not only do changes make no difference to the output, but the output itself has ceased to make any contribution to the error. The neuron's output is 0, and unless the weights, inputs, or both change by a lot, it's going to stay at 0. To characterize this dramatic effect, we say that this neuron has **died**, or is now **dead**.
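Here's a small sketch of that situation, with made-up weights and inputs that push a ReLU neuron's weighted sum below zero. The output and the derivative are both exactly 0, so whatever delta arrives from the neurons that use this output, the neuron's own delta (and therefore every update to its incoming weights) is 0 as well:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return np.where(z > 0.0, 1.0, 0.0)   # 1 for positive inputs, 0 otherwise

# Made-up numbers: a ReLU neuron whose weighted input sum has gone negative.
inputs  = np.array([0.8, 0.3])
weights = np.array([-4.0, -2.0])          # large negative incoming weights
z = weights @ inputs                      # z = -3.8, in ReLU's flat region

output = relu(z)                          # 0.0, so it contributes nothing downstream

# Whatever combined delta arrives from the neurons that use this output,
# it gets multiplied by the derivative at z, which is 0.
delta_from_downstream = 0.9               # a made-up incoming value
delta_here = delta_from_downstream * relu_prime(z)

print("output:", output)                      # 0.0
print("delta for this neuron:", delta_here)   # 0.0, so its weight updates are 0 too
```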
Depending on the initial weights and the first input sample, one or more neurons could die the very first time a sample is run through the network. Then as training goes on, more neurons can die. If a lot of neurons die during training, then our network is suddenly working with just a fraction of the neurons we thought it had. That cripples our network. Sometimes even 40% of our neurons can die off during training [#Karpathy16].

When we build a neural network we choose the activation functions for each layer based on experience and expectations. In many situations, sigmoids or ReLUs feel like the right function to use, and in many circumstances they work great. But when a network learns slowly, or fails to learn, it pays to look at the neurons and see if some or many are saturated, dying, or dead. If so, we can experiment with our initial starting weights and our learning rate to see if we can avoid the problem. If that doesn't work, we might need to re-structure our network, choose other activation functions, or both.

Mini-Batches
----

In our discussion above, we followed three steps for every sample: run the sample through the network, calculate all the deltas, and then adjust all the weights. It turns out that we can save some time, and sometimes even improve our learning, by adjusting the weights less frequently.

Recall from Chapter TK that the full training set of samples is sometimes called a **batch** of samples. We can break up that batch into smaller **mini-batches**. Usually the size of our mini-batch is picked to match whatever parallel hardware we have available. For instance, if our hardware (say, a GPU) can evaluate 16 samples simultaneously, then our mini-batch size will be 16. Common mini-batch sizes are 16, 32, and 64, though they can go higher.

The idea is that we run a mini-batch of samples through the network in parallel, and then we compute all the deltas in parallel. We average together all the deltas, and use those averages to perform a single update to the weights. So instead of updating the weights after every sample, they're updated after a mini-batch of 16 samples (or 32, 64, etc.). This gives us a big increase in speed.

It can also improve learning, because the changes to the weights are smoothed out a little by the averaging over the whole mini-batch. This means that if there's one weird sample in the set, it can't pull all the weights in an unwanted direction. The deltas for that weird sample get averaged with the other 15 (or 31, or 63) samples in the mini-batch, reducing its impact.
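Here's a sketch of that bookkeeping for one mini-batch update. To keep it self-contained, a single linear neuron with a squared-error measure stands in for the whole run-forward-then-backprop computation; the function names, sizes, and numbers are all made up for illustration:

```python
import numpy as np

def per_sample_gradient(w, x, label):
    # Stand-in for "run the sample through the network, then backprop":
    # for one linear neuron with error (prediction - label)^2 / 2, the
    # gradient of the error with respect to the weights is (prediction - label) * x.
    prediction = w @ x
    return (prediction - label) * x

def minibatch_update(w, batch_x, batch_labels, eta):
    # One weight adjustment per sample (in practice these are computed in
    # parallel), then a single update using their average.
    grads = [per_sample_gradient(w, x, lab) for x, lab in zip(batch_x, batch_labels)]
    return w - eta * np.mean(grads, axis=0)

# Made-up numbers: 3 weights and a mini-batch of 16 samples.
rng = np.random.default_rng(0)
w = np.array([0.1, -0.2, 0.3])
batch_x = rng.normal(size=(16, 3))
batch_labels = rng.normal(size=16)

w = minibatch_update(w, batch_x, batch_labels, eta=0.05)
print("weights after one mini-batch update:", w)
```

Because the 16 per-sample adjustments are averaged before they touch the weights, one strange sample contributes only a sixteenth of the update, rather than an entire update of its own.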
Parallel Updates
----

Since each weight depends only on values from the neurons at its two ends, every weight's update step is completely independent from every other weight's update step. When we carry out the same steps for independent pieces of data, that's usually our cue to use parallel processing. And indeed, most modern implementations will, if parallel hardware is available, update all the weights in the network simultaneously. As we just discussed, this update will usually happen after each mini-batch of samples.

This is an enormous time-saver, but it comes at a cost. As we've discussed, changing any weight in the network will change the output value for every neuron that's downstream from that weight. So changes to the weights near the very start of the network can have enormous ripple effects on later neurons, causing them to change their outputs by a lot. Since the gradients represented by our deltas depend on the values in the network, changing a weight near the input means that we should really re-compute the deltas for every neuron downstream of the value that weight modifies. That could mean almost every neuron in the network. This would destroy our ability to update in parallel. It would also make backprop agonizingly slow, since we'd be spending all of our time re-evaluating gradients and computing deltas.

As we've seen, the way to prevent chaos is to use a "small enough" learning rate. If the learning rate is too large, things go haywire and don't settle. If it's too small, we waste a lot of time taking overly tiny steps. Picking the "just right" value of the learning rate preserves the efficiency of backprop, and our ability to carry out its calculations in parallel.

Why Backprop Is Attractive
---

A big part of backprop's appeal is that it's so efficient. It's the fastest way that anyone has thought of to figure out how to most beneficially update the weights in a neural network. As we saw before, and summarized in Figure [fig-neuron-h-o-delta], running one step of backprop in a modern library usually takes about as long as evaluating a sample. In other words, consider the time it takes to start with new values at the inputs and flow that data through the whole network, ultimately reaching the output layer. Running one step of backprop to compute all the resulting deltas takes about the same amount of time. That remarkable fact is at the heart of why backprop has become a key workhorse of machine learning, even though we usually have to deal with a fiddly learning rate, saturating neurons, dying neurons, and so on.

Backprop Is Not Guaranteed
----

It's important to note that there's no guarantee that this scheme is going to learn anything! It's not like the single perceptron of Chapter TK, where we have ironclad proofs that after enough steps, the perceptron will find the dividing line it's looking for. When we have many thousands of neurons, and potentially many millions of weights, the problem is too complicated to give a rigorous proof that things will always behave as we want.

In fact, things often do go wrong when we first try to train a new network. The network might learn glacially slowly, or even not at all. It might improve for a bit and then seem to suddenly take a wrong turn and forget everything. All kinds of stuff can happen, which is why many modern libraries offer visualization tools for watching the performance of a network as it learns.

When things go wrong, the first thing many people try is to drop the learning rate to a very small value. If everything settles down, that's a good sign. If the system now appears to be learning, even if the improvement is barely perceptible, that's another good sign. Then we can slowly increase the learning rate until the network is learning as quickly as possible without succumbing to chaos.

If that doesn't work, then there might be a problem with the design of the network. This is a very tricky problem to deal with. Designing a successful network means making a lot of good choices. For instance, we need to choose the number of layers, the number of neurons on each layer, how the neurons should be connected, what activation functions to use, what learning rate to use, and so on. Getting everything right can be challenging. We usually need a combination of experience, knowledge of our data, and experimentation to design a neural network that will not only learn, but do it efficiently.
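To close out this discussion, here's a tiny numerical illustration of the learning-rate behavior described above. It's only a sketch: a single made-up parameter and a bowl-shaped error stand in for a real network, but the too-small, just-right, and too-large regimes all show up clearly:

```python
def run_gradient_descent(eta, steps=20):
    # Minimize the bowl-shaped error E(w) = w^2, whose gradient is 2w,
    # starting from w = 5 and taking a fixed number of steps.
    w = 5.0
    for _ in range(steps):
        w = w - eta * 2.0 * w
    return w

for eta in [0.001, 0.1, 0.9, 1.1]:
    print(f"learning rate {eta:5.3f}: w after 20 steps = {run_gradient_descent(eta):12.4f}")

# learning rate 0.001: w barely moves from 5 (too small, wastefully tiny steps)
# learning rate 0.100: w settles smoothly toward the minimum at 0
# learning rate 0.900: w overshoots back and forth, but the swings still shrink
# learning rate 1.100: every step overshoots by more than the last, and w blows up
```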
In the following chapters we'll see some architectures that have proven to be good starting points for a wide variety of tasks. But each new combination of network and data is its own new thing, and requires thought and patience.

A Little History
---

When backpropagation was first described in the neural network literature in 1986, it completely changed how people thought about neural networks [#Rumelhart86]. The explosion of research and practical benefits that followed was made possible by this surprisingly efficient technique for finding gradients.

But this wasn't the first time that backprop had been discovered or used. This algorithm, which has been called one of the 30 "great numerical algorithms" [#Trefethen15], has been discovered and re-discovered by different people in different fields since at least the 1960s. There are many disciplines that use connected networks of mathematical operations, and finding the derivatives and gradients of those operations at every step is a common and important problem. Clever people who tackled this problem have re-discovered backprop time and again, often giving it a new name each time. Excellent capsule histories are available online and in print [#Griewank12] [#Schmidhuber15] [#Kurenkov15] [#Werbos96]. We'll summarize some of the common threads here. But history can only cover the published literature. There's no knowing how many people have discovered and re-discovered backprop, but didn't publish it.

Perhaps the earliest use of backprop in the form we know it today was published in 1970, when it was used for analyzing the accuracy of numerical calculations [#Linnainmaa70], though there was no reference made to neural networks. The process of finding a derivative is sometimes called *differentiation*, so the technique was known as **reverse-mode automatic differentiation**. It was independently discovered at about the same time by another researcher who was working in chemical engineering [#Griewank12]. Perhaps its first explicit investigation for use in neural networks was made in 1974 [#Werbos74], but because such ideas were out of fashion, that work wasn't published until 1982 [#Schmidhuber15].

Reverse-mode automatic differentiation was used in various sciences for years. But when the classic 1986 paper re-discovered the idea and demonstrated its value to neural networks, the idea immediately became a staple of the field under the name **backpropagation** [#Rumelhart86]. Backpropagation is central to deep learning, and it forms the foundation for the techniques that we'll be considering in the remainder of this book.

Digging into the Math
---

This section offers some suggestions for dealing with the math of backpropagation. If you're not interested in that, you can safely skip this section.

Backpropagation is all about manipulations to numbers, hence its description as a "numerical algorithm." That makes it a natural fit for presentation in a mathematical context. Even when the equations are stripped down, they can appear formidable [#Neilsen15b]. Here are a few hints for getting through the notation and into the heart of the matter.

First, it's essential to master each author's notation. There are a lot of things running around in backpropagation: errors, weights, activation functions, gradients, and so on. Everything will have a name, usually just a single letter. A good first step is to scan through the whole discussion quickly, and notice what names are given to what objects.
It often helps to write these down so you don't have to search for their meanings later.

The next step is to work out how these names are used to refer to the different objects. For example, each weight might be written as something like $w^l_{jk}$, referring to the weight that links neuron number $k$ on layer $l$ to neuron $j$ on layer $l+1$. This is a lot to pack into one symbol, and when there are several of these things in one equation it can get hard to sort out what's going on. One way to clear the thickets is to choose values for all the subscripts, and then simplify the equations so each of these highly-indexed terms refers to just one specific thing (such as a single weight). If you think visually, consider drawing pictures showing just what objects are involved, and how their values are being used.

The heart of the backprop algorithm can be thought of, and written as, an application of the **chain rule** from calculus [#Karpathy15]. This is an elegant way to describe the way different changes relate to one another, but it requires familiarity with multidimensional calculus. Luckily, there's a wealth of online tutorials and resources designed to help people come up to speed on just this topic [#MathCentre09] [#Khan13].

We've seen that in practice the computations for outputs, deltas, and weight updates can be performed in parallel. They can also be *written* in a parallel form using the linear algebra language of vectors and matrices. For example, it's common to write the heart of the forward pass (without each neuron's activation function) with a matrix representing the weights between two layers. Then we use that matrix to multiply a vector of the neuron outputs in the previous layer. In the same way, we can write the heart of the backward pass as the transpose of that weight matrix times a vector of the following layer's deltas. This is a natural formalism, since these computations consist of lots of multiplies followed by additions, which is just what matrix multiplication does for us. And this structure fits nicely onto a GPU, so it's a nice place to start when writing code.

But this linear algebra formalism can obscure the relatively simple steps, because one now has to deal with not just the underlying computation, but its parallel structure in the matrix format, and the proliferation of indices that often comes along with it. It's fair to say that compacting the equations in this form is a type of optimization, where we're aiming for simplicity in both the equations and the algorithms they describe. When learning backprop, people who aren't already very familiar with linear algebra can reasonably feel that this is a form of *premature optimization*, because (until it is mastered) it obscures, rather than elucidates, the underlying mechanics [#Hyde09]. Arguably, only once the backprop algorithm is fully understood should it be rolled up into the more compact matrix form. Thus it may be helpful to either find a presentation that doesn't start with the matrix algebra approach, or try to pull those equations apart into individual operations, rather than big parallel multiplications of matrices and vectors.
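As a concrete example of pulling those equations apart, here's a sketch (with made-up sizes and values) that writes the heart of the forward and backward passes both ways: once in the compact matrix form, and once as explicit loops over individual weights. The two give identical numbers:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up sizes: 4 neurons in the previous layer, 3 in the following layer.
W          = rng.normal(size=(3, 4))          # W[j, k] connects neuron k to neuron j
prev_out   = rng.normal(size=4)               # outputs of the previous layer
next_delta = rng.normal(size=3)               # deltas of the following layer
deriv      = rng.uniform(0.0, 0.25, size=4)   # activation-function derivatives
                                              # for the previous layer's neurons

# The matrix form (activation function omitted, as in the text):
z_matrix     = W @ prev_out                   # forward: weighted input sums
delta_matrix = (W.T @ next_delta) * deriv     # backward: previous layer's deltas

# The same arithmetic written as explicit loops over individual weights:
z_loops = np.zeros(3)
for j in range(3):
    for k in range(4):
        z_loops[j] += W[j, k] * prev_out[k]

delta_loops = np.zeros(4)
for k in range(4):
    for j in range(3):
        delta_loops[k] += W[j, k] * next_delta[j]
    delta_loops[k] *= deriv[k]

print(np.allclose(z_matrix, z_loops), np.allclose(delta_matrix, delta_loops))  # True True
```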
Another potential hurdle is that the activation functions (and their derivatives) tend to get presented in different *ad hoc* ways.

To summarize, many authors start their discussions with either the chain rule or matrix forms of the basic equations, so that the equations appear tidy and compact. Then they explain why those equations are useful and correct. Such notation and equations can look daunting, but if we pull them apart to their basics we'll recognize the steps we saw in this chapter. Once we've unpacked these equations and then put them back together, we can see them as natural summaries of an elegant algorithm.

References
====

[#Dauphin14]: Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio, "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization", 2014. http://arxiv.org/abs/1406.2572

[#Fullér10]: Robert Fullér, "The Delta Learning Rule Tutorial", Institute for Advanced Management Systems Research, Department of Information Technologies, Åbo Akademi University, 2010. http://uni-obuda.hu/users/fuller.robert/delta.pdf

[#Griewank12]: Andreas Griewank, "Who Invented the Reverse Mode of Differentiation?", Documenta Mathematica, Extra Volume ISMP, pp. 389-400, 2012. http://www.math.uiuc.edu/documenta/vol-ismp/52_griewank-andreas-b.pdf

[#Hyde09]: Randall Hyde, "The Fallacy of Premature Optimization", ACM Ubiquity, 2009. http://ubiquity.acm.org/article.cfm?id=1513451

[#Karpathy15]: Andrej Karpathy, "Convolutional Neural Networks for Visual Recognition", Stanford CS231n course notes, 2015. http://cs231n.github.io/optimization-2/

[#Karpathy16]: Andrej Karpathy, "Yes, You Should Understand Backprop", Medium, 2016. https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b

[#Khan13]: Khan Academy, "Chain rule introduction", 2013. https://www.khanacademy.org/math/ap-calculus-ab/product-quotient-chain-rules-ab/chain-rule-ab/v/chain-rule-introduction

[#Kurenkov15]: Andrey Kurenkov, "A 'Brief' History of Neural Nets and Deep Learning, Part 1", 2015. http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning/

[#Linnainmaa70]: S. Linnainmaa, "The Representation of the Cumulative Rounding Error of an Algorithm as a Taylor Expansion of the Local Rounding Errors", Master's thesis, University of Helsinki, 1970.

[#MathCentre09]: Math Centre, "The Chain Rule", Math Centre report mc-TY-chain-2009-1, 2009. http://www.mathcentre.ac.uk/resources/uploaded/mc-ty-chain-2009-1.pdf

[#NASA12]: NASA, "Astronomers Predict Titanic Collision: Milky Way vs. Andromeda", NASA Science Blog, production editor Dr. Tony Phillips, 2012. https://science.nasa.gov/science-news/science-at-nasa/2012/31may_andromeda

[#Neilsen15a]: Michael A. Nielsen, "Using Neural Networks to Recognize Handwritten Digits", Determination Press, 2015. http://neuralnetworksanddeeplearning.com/chap1.html

[#Neilsen15b]: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015. http://neuralnetworksanddeeplearning.com/chap2.html

[#Rumelhart86]: D.E. Rumelhart, G.E. Hinton, R.J. Williams, "Learning Internal Representations by Error Propagation", in "Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1", pp. 318-362, 1986. http://www.cs.toronto.edu/~fritz/absps/pdp8.pdf

[#Schmidhuber15]: Jürgen Schmidhuber, "Who Invented Backpropagation?", blog post, 2015. http://people.idsia.ch/~juergen/who-invented-backpropagation.html

[#Seung05]: Sebastian Seung, "Introduction to Neural Networks", MIT 9.641J course notes, 2005. https://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-641j-introduction-to-neural-networks-spring-2005/lecture-notes/lec19_delta.pdf
[#Trefethen15]: Nick Trefethen, "Who Invented the Great Numerical Algorithms?", Oxford Mathematical Institute, 2015. https://people.maths.ox.ac.uk/trefethen/inventorstalk.pdf

[#Werbos74]: P. Werbos, "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences", PhD thesis, Harvard University, Cambridge, MA, 1974.

[#Werbos96]: Paul John Werbos, "The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting", Wiley-Interscience, 1994.

[#Wikipedia17]: Wikipedia, "Goldilocks and the Three Bears", 2017. https://en.wikipedia.org/wiki/Goldilocks_and_the_Three_Bears
All contents, text, and images copyright (c) 2017 by Andrew Glassner