
**Chapter 18: Backpropagation**
A preview chapter from **Deep Learning: From Basics to Systems**
by Andrew Glassner
**This is an early look at Chapter 18.**
Contents may change in the final book.
Items marked TK will be resolved in the final.
This page is located at

Why This Chapter Is Here
====

This chapter is about training a neural network.

The very basic idea is appealingly simple. Suppose we're training a categorizer, which will tell us which of several given labels should be assigned to a given input. It might tell us what animal is featured in a photo, or whether a bone in an image is broken or not, or what song a particular bit of audio belongs to.

Training this neural network involves handing it a sample, and asking it to **predict** that sample's label. If the prediction matches the label that we previously determined for it, we move on to the next sample. If the prediction is wrong, we change the network to help it do better next time.

Easily said, but not so easily done. This chapter is about how we "change the network" so that it **learns**, or improves its ability to make correct predictions. This approach works beautifully not just for classifiers, but for almost any kind of neural network.

Contrast a feed-forward network of neurons to the dedicated classifiers we saw in Chapter TK. Each of those had a customized, built-in learning algorithm that measured the incoming data to provide the information that classifier needed to know. But a neural network is just a giant collection of neurons. Even when we organize them into layers, there's no inherent learning algorithm. That network is just a bunch of neurons, each doing its own little calculation and then passing on its results to other neurons. How can we train such things to produce the results we want? And how can we do it efficiently?

The answer is called **back-propagation**, or more commonly **backpropagation**, or simply **backprop**. Without backprop, we wouldn't have today's widespread use of deep learning, because we wouldn't be able to train our models in reasonable amounts of time. With backprop, deep learning algorithms are practical and plentiful.

Backprop is a low-level algorithm.
When we use libraries to build and train deep learning systems, we'll use their finely-tuned routines to get the best speed and accuracy. Except as an educational exercise, we're likely to never write our own code to perform backprop.

So why is this chapter here? Why should we bother knowing about this low-level algorithm at all? There are at least four good reasons to have a general knowledge of backpropagation.

First, it's important to understand backprop because knowledge of one's tools is part of becoming a master in any field. Sailors at sea understand how ropes work and why specific knots are used in specific situations, photographers understand the basics of how lenses work, and airplane pilots understand why their plane turns when they tilt the wings in a certain way. A basic knowledge of the core techniques of any field is part of the process of gaining proficiency and developing mastery. In this case, knowing something about backprop lets us read the literature, talk to other people about deep learning ideas, and better understand the algorithms and libraries we use.

Second, and more practically, knowing about backprop can help us design networks that learn. When a network learns slowly, or not at all, it can be because something is preventing backprop from running properly. Backprop is a versatile and robust algorithm, but it's not bulletproof. We can unknowingly build networks where backprop won't work properly, resulting in a network that stubbornly refuses to learn. For those times when something's going wrong with backprop, understanding the algorithm helps us fix things [#Karpathy16].

Third, many important advances in neural networks rely on backprop intimately. To learn these new ideas, and understand why they work the way they do, it's important to know the algorithms they're building on.

Finally, backprop is an elegant algorithm. It efficiently solves a problem that would otherwise require a prohibitive amount of time and computer resources.
It's one of the conceptual treasures of the field. As curious, thoughtful people, it's well worth our time to understand this beautiful algorithm.

For these reasons and others, this chapter provides an introduction to backprop. Generally speaking, introductions to backprop are presented mathematically, as a collection of equations with associated discussion [#Fullér10]. As usual, we'll skip the mathematics and focus instead on the concepts. The mechanics are common-sense at their core, and don't require any tools beyond basic arithmetic and the ideas of a derivative and gradient, which we discussed in Chapter TK.

A Word On Subtlety
----

The backpropagation algorithm is not complicated. In fact, it's remarkably simple, which is why it works so well and can be implemented so efficiently.

But simple does not always mean easy. The backprop algorithm is subtle. In the discussion below, the algorithm will take shape through a process of observations and reasoning, and these steps may take some thought. We'll try to be clear about every step, but making the leap from reading to understanding may require some work. It's worth the effort.

A Very Slow Way to Learn
===

Let's begin with a very slow way to train a neural network. This will give us a good starting point, which we'll then improve.

Suppose we've been given a brand-new neural network consisting of hundreds, or even tens of thousands, of interconnected neurons. The network was designed to classify each input into one of 5 categories. So it has 5 outputs, numbered 1 to 5, and whichever one has the largest output is the network's prediction for an input's category. Figure [fig-block-diagram-classifier] shows the idea.

![Figure [fig-block-diagram-classifier]: A neural network predicting the class of an input sample. Starting at the bottom, we have a sample with four features and a label. The label tells us that the sample belongs to category 3. 
The features go into a neural network which has been designed to provide 5 outputs, one for each class. In this example, the network has incorrectly decided that the input belongs to class 1, because the largest output, 0.9, is from output number 1. ](Images/block-diagram-classifier.jpg width="350px")

Consider the state of our brand-new network, before it has seen any inputs. As we know from Chapter TK, each input to each neuron has an associated weight. There could easily be hundreds of thousands, or many millions, of weights in our network. Typically, all of these weights will have been initialized with small random numbers.

Let's now run one piece of labeled training data through the net, as in Figure [fig-block-diagram-classifier]. The sample's features go into the first layer of neurons, and the outputs of those neurons go into more neurons, and so on, until they finally arrive at the output neurons, which then become the output of the network. The index of the output neuron with the largest value is the predicted class for this sample.

Since we're starting with random numbers for our weights, we're likely to get essentially random outputs. So there's a 1 in 5 chance the network will happen to predict the right label for this sample. But there's a 4 in 5 chance it'll get it wrong, so let's assume that the network predicts the wrong category.

When the prediction doesn't match the label, we can measure the discrepancy numerically, coming up with a single number to tell us just how wrong this answer is. We call this number the **error score**, or **error**, or sometimes the **loss** (if the word "loss" seems like a strange synonym for "error," it may help to think of it as describing how much information is "lost" if we categorize a sample using the output of the classifier, rather than the label).
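
To make the steps above concrete, here is a minimal Python sketch of running one labeled sample through a freshly initialized network. It is only an illustration under stated assumptions, not the way any particular library does it. The layer sizes, the `tanh` activation, the weight range, the sample values, and the squared-difference error are all assumptions chosen for the example, and the helper names (`make_layer`, `forward`, `predict`, `error_score`) are hypothetical.

```python
import math
import random

def make_layer(n_inputs, n_neurons, rng):
    """One weight per input of each neuron, set to a small random number."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
            for _ in range(n_neurons)]

def forward(layers, features):
    """Feed the features through each layer in turn and return the outputs."""
    values = list(features)
    for layer in layers:
        # Each neuron: weighted sum of its inputs, then tanh (an assumption).
        values = [math.tanh(sum(w * v for w, v in zip(neuron, values)))
                  for neuron in layer]
    return values

def predict(outputs):
    """Class number (1-based) of the output neuron with the largest value."""
    return outputs.index(max(outputs)) + 1

def error_score(outputs, label, n_classes=5):
    """Assumed error: sum of squared differences from a one-hot target."""
    target = [1.0 if c == label else 0.0 for c in range(1, n_classes + 1)]
    return sum((o - t) ** 2 for o, t in zip(outputs, target))

rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumed shape: 4 features -> 8 hidden neurons -> 5 outputs.
layers = [make_layer(4, 8, rng), make_layer(8, 5, rng)]
features, label = [0.5, -1.2, 3.0, 0.7], 3  # made-up sample in category 3

outputs = forward(layers, features)
predicted = predict(outputs)
error = error_score(outputs, label)

print("outputs:", [round(o, 3) for o in outputs])
print("predicted class:", predicted, "label:", label)
print("error:", round(error, 4))
```

Because the weights are random, the predicted class is essentially arbitrary, and the error is a single number summarizing how far the five outputs are from the ones we wanted. With this particular error, a perfect prediction scores 0 and anything else scores more.
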
The error (or loss) is a floating-point number that can take on any value, though often we set things up so that it's always positive. The larger the error, the more "wrong" our network's
All contents, text, and images copyright (c) 2017 by Andrew Glassner