**Chapter 18: Backpropagation**
A preview chapter from **Deep Learning: From Basics to Systems**
by Andrew Glassner
**This is an early look at Chapter 18.**
Contents may change in the final book.
Items marked TK will be resolved in the final.
Why This Chapter Is Here
====
This chapter is about
training a neural network.
The very basic idea is
appealingly simple.
Suppose we're training
a categorizer,
which will tell us which
of several given labels
should be assigned to a
given input.
It might tell us what
animal is featured in a photo,
or whether a bone in an
image is broken or not,
or what song a particular
bit of audio belongs to.
Training this neural network
involves handing it a sample,
and asking it to **predict**
that sample's label.
If the prediction matches
the label that we previously
determined for it,
we move on to the next sample.
If the prediction is wrong,
we change the network
to help it do better
next time.
Easily said,
but not so easily done.
This chapter is about
how we "change the network"
so that it **learns**,
or improves its ability
to make correct predictions.
This approach works beautifully
not just for classifiers,
but for almost any kind of
neural network.
Contrast a feed-forward network
of neurons to the dedicated
classifiers we saw in
Chapter TK.
Each of those had a
customized, built-in learning
algorithm that measured
the incoming data to provide
the information that classifier
needed to know.
But a neural network is just
a giant collection of neurons.
Even when we organize them into
layers,
there's no inherent learning algorithm.
That network is just a bunch of neurons,
each doing its own little calculation
and then passing on its results to
other neurons.
How can we train such things
to produce the results we want?
And how can we do it efficiently?
The answer is called
**back-propagation**,
or more commonly
**backpropagation**,
or simply **backprop**.
Without backprop,
we wouldn't have today's
widespread use of deep learning,
because we wouldn't be able to
train our models in reasonable
amounts of time.
With backprop,
deep learning algorithms are
practical and plentiful.
Backprop is a low-level algorithm.
When we use libraries to build
and train deep learning systems,
we'll use their finely-tuned
routines to get the best
speed and accuracy.
Except as an educational exercise,
we're likely to never write
our own code to perform backprop.
So why is this chapter here?
Why should we bother knowing about
this low-level algorithm at all?
There are at least four
good reasons to have
a general knowledge of
backpropagation.
First,
it's important to understand
backprop because
knowledge of one's tools is
part of becoming a master
in any field.
Sailors at sea understand how ropes
work and why
specific knots are used
in specific situations,
photographers understand the basics
of how lenses work,
and airplane pilots understand why
their plane turns when they tilt
the wings in a certain way.
A basic knowledge of the core techniques
of any field is part of the process
of gaining proficiency and
developing mastery.
In this case,
knowing something about backprop
lets us read the literature,
talk to other people about
deep learning ideas,
and better understand the algorithms
and libraries we use.
Second,
and more practically,
knowing about backprop
can help us design networks that learn.
When a network learns slowly,
or not at all,
it can be because something
is preventing backprop from
running properly.
Backprop is a versatile and
robust algorithm,
but it's not bulletproof.
We can unknowingly build
networks where backprop won't
work properly,
resulting in a network that
stubbornly refuses to learn.
For those times when something's
going wrong with backprop,
understanding the algorithm helps
us fix things [#Karpathy16].
Third,
many important advances in
neural networks rely on
backprop intimately.
To learn these new ideas,
and understand why they work
the way they do,
it's important to know the
algorithms they're building on.
Finally,
backprop is an
elegant algorithm.
It efficiently
solves a problem that would
otherwise require a prohibitive
amount of time and computer resources.
It's one of the conceptual
treasures of the field.
As curious, thoughtful people,
we'll find it well worth our time to
understand this beautiful algorithm.
For these reasons and others,
this chapter provides an
introduction to backprop.
Generally speaking,
introductions to backprop are presented
mathematically,
as a collection of equations with
associated discussion [#Fullér10].
As usual,
we'll skip the mathematics and
focus instead on the concepts.
The mechanics are common-sense
at their core,
and don't require
any tools beyond basic arithmetic
and the ideas of a derivative and gradient,
which we discussed in Chapter TK.
A Word On Subtlety
----
The backpropagation algorithm
is not complicated.
In fact, it's remarkably simple,
which is why it works so well and
can be implemented so efficiently.
But simple does not always mean easy.
The backprop algorithm is subtle.
In the discussion below,
the algorithm will take shape through
a process of observations and reasoning,
and these steps may take some thought.
We'll try to be clear about every step,
but making the leap from reading
to understanding may require some work.
It's worth the effort.
A Very Slow Way to Learn
====
Let's begin with a very slow
way to train a neural network.
This will give us a good starting point,
which we'll then improve.
Suppose we've been given a
brand-new neural network consisting of
hundreds or even tens of
thousands of interconnected
neurons.
The network was designed to
classify each
input into one of 5 categories.
So it has 5 outputs,
numbered 1 to 5,
and whichever one has the largest
output is the network's prediction
for an input's category.
Figure [fig-block-diagram-classifier] shows the idea.
![Figure [fig-block-diagram-classifier]: A neural
network predicting the class
of an input sample.
Starting at the bottom,
we have a sample with four
features and a label.
The label tells us that
the sample belongs to category 3.
The features go into a neural network
which has been designed to
provide 5 outputs,
one for each class.
In this example,
the network has incorrectly
decided that the input belongs
to class 1,
because the largest output,
0.9, is from output number 1.
](Images/block-diagram-classifier.jpg width="350px")
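The prediction rule in the figure is easy to state in code. Here's a minimal sketch (the output values are the hypothetical ones from the figure, not real network outputs):

```python
# Hypothetical outputs from a 5-class network; outputs are numbered 1 to 5.
outputs = [0.9, 0.1, 0.4, 0.2, 0.3]

# The predicted class is the (1-based) index of the largest output.
predicted_class = outputs.index(max(outputs)) + 1
print(predicted_class)  # prints 1: output number 1, at 0.9, is the largest
```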
Consider the state of
our brand-new network,
before it has seen any inputs.
As we know from Chapter TK,
each input to each neuron has
an associated weight.
There could easily be hundreds of
thousands, or many millions,
of weights in our network.
Typically,
all of these
weights will have been initialized with
small random numbers.
Let's now run one piece of labeled
training data through the net,
as in Figure [fig-block-diagram-classifier].
The sample's features go into the
first layer of neurons,
and the outputs of those neurons
go into more neurons,
and so on,
until they finally arrive
at the output neurons,
which then become the
output of the network.
The index of the output
neuron with the largest value
is the predicted class for this sample.
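To make this forward pass concrete, here's a sketch of a tiny network with randomly initialized weights. This is an illustration, not the book's code: the helper names are our own, and for brevity we assume a simple ReLU activation and omit bias terms.

```python
import random

def forward(features, layers):
    """Push a sample's features through each layer of weights in turn.
    Each layer is a list of neurons; each neuron is a list of weights,
    one per input. Here each neuron applies a ReLU to its weighted sum."""
    values = features
    for layer in layers:
        values = [max(0.0, sum(w * v for w, v in zip(neuron, values)))
                  for neuron in layer]
    return values

def make_layer(n_neurons, n_inputs):
    """One layer of neurons, each starting with small random weights."""
    return [[random.uniform(-0.1, 0.1) for _ in range(n_inputs)]
            for _ in range(n_neurons)]

random.seed(0)

# A tiny network: 4 input features, a hidden layer of 6 neurons,
# and 5 output neurons, one per class.
layers = [make_layer(6, 4), make_layer(5, 6)]
outputs = forward([0.5, 0.1, 0.8, 0.3], layers)

# The index of the largest output (numbered 1 to 5) is the prediction.
predicted_class = outputs.index(max(outputs)) + 1
```

With weights this small and random, the outputs carry no real information yet, which is exactly the situation described above: the "prediction" is essentially a guess.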
Since we're starting with random
numbers for our weights,
we're likely to get essentially
random outputs.
So there's a 1 in 5 chance the network
will happen to predict
the right label for this sample.
But there's a 4 in 5 chance it'll
get it wrong,
so let's assume that the network
predicts the wrong category.
When the
prediction doesn't match the label,
we can measure the discrepancy
numerically,
coming up with a single number
to tell us just how wrong this
answer is.
We call this number
the **error score**,
or **error**, or sometimes the **loss**
(if the word "loss" seems like a
strange synonym for "error,"
it may help to think
of it as describing how much information
is "lost" if
we categorize a sample
using the output of the classifier,
rather than the label).
The error (or loss) is a floating-point number
that can take on any value,
though often we set things up so that
it's always positive.
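One common way to boil the discrepancy down to a single positive number is the cross-entropy loss. This is just one choice, offered as an illustration; the code below uses the figure's hypothetical output values and its label of class 3:

```python
import math

def cross_entropy(outputs, label_index):
    """Turn the raw outputs into probabilities with a softmax,
    then score how much probability went to the correct class.
    The result is a single positive number that grows
    as the prediction gets more wrong."""
    exps = [math.exp(v) for v in outputs]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[label_index])

# The situation from the figure: the label says class 3 (index 2),
# but the network's largest output is for class 1.
wrong = cross_entropy([0.9, 0.1, 0.4, 0.2, 0.3], 2)

# If the largest output had instead been for class 3,
# the error would be smaller.
better = cross_entropy([0.1, 0.1, 0.9, 0.2, 0.3], 2)
```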
The larger the error,
the more "wrong" our network's
All contents, text, and images copyright (c) 2017 by Andrew Glassner