Introduction
Before we learn how neural networks work, it's important to first understand why we use them at all. Let's look at computer vision, more specifically digit classification. Our goal is to take a greyscale image of a digit and figure out which digit is being depicted. Remember, an image is just a matrix (or grid) of pixels. How can we define a set of rules such that we can feed this matrix in as input and receive the correct number as output? This task may at first seem trivial, and in some sense it is, as our brains do it unconsciously thousands of times a day. But think about how you might go about programming it. It very quickly turns into an incredibly daunting task. The best solution, in the end, is to try to mimic the brain. A neural network is exactly that: a very crude model of the human brain. Let's see how they work.
Overview
In general, a neural net consists of two things: nodes and weights. You can think of a node as a function, meaning it has an input and an output. A weight is simply a number. We organize our nodes into layers, which are just vectors (or arrays). These layers are connected to each other via weights: each node in a given layer has a weight linking it to every node in the subsequent layer.
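For concreteness, here is a minimal sketch (in NumPy, with made-up layer sizes and values) of how a layer and the weights leaving it might be stored:

```python
import numpy as np

# A layer is just a vector of node values (4 nodes here, values made up).
layer = np.array([0.5, 0.1, 0.9, 0.3])

# The weights connecting this 4-node layer to a 3-node next layer form a
# 4x3 matrix: entry [i, j] links node i here to node j in the next layer.
weights = np.random.randn(4, 3)

print(layer.shape, weights.shape)  # (4,) (4, 3)
```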
Training a neural network refers to the process of slowly tweaking these weights so that the network reaches a high accuracy, but we will save that for Part 2.
Neuron/Node
In practice, a node is not a single function, but a sequence of smaller functions, where the output of one is the input of the next. In our case, a node is made up of two functions, but in general it can have any number of sub-functions. Our node's functions are quite simple: a dot product and an activation function. Those may sound complicated, but I assure you they are not. Let's look at each one:
A dot product is just a way of multiplying two vectors to get a scalar (a single number). To perform a dot product, you multiply the two vectors element-wise and sum the results.
In our network, a dot product multiplies a layer of nodes by the weights that connect it to a node in the next layer. For a single node, the dot product looks like this: w1*x1 + w2*x2 + ... + wn*xn, where the x's are the values of the previous layer and the w's are the weights connecting them to our node.
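As a quick illustration (a NumPy sketch with made-up numbers, not part of any particular library), here is that single-node dot product:

```python
import numpy as np

prev_layer = np.array([0.2, 0.7, 0.1])        # values of the previous layer (made up)
weights_to_node = np.array([0.4, -0.3, 0.8])  # one weight per node in the previous layer

# Multiply element-wise, then sum...
value = np.sum(prev_layer * weights_to_node)

# ...which is exactly what np.dot computes.
assert np.isclose(value, np.dot(prev_layer, weights_to_node))
print(value)  # 0.2*0.4 + 0.7*(-0.3) + 0.1*0.8 = -0.05
```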
Activation functions can be a little more complicated, since there is no single function used in all cases. The goal of an activation function is to provide some nonlinearity to our network, which allows it to learn more complex patterns; without one, any stack of layers would collapse into a single linear transformation.
Traditionally, the sigmoid is the go-to, although it has fallen out of favor in recent years. The sigmoid is just a squashing function which maps any input to a value between 0 and 1. The sigmoid looks like this: sigmoid(x) = 1 / (1 + e^(-x)).
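In code, a bare-bones sigmoid is a one-liner (sketched here in NumPy):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # roughly [0.0067, 0.5, 0.9933]
```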
Unfortunately, the sigmoid has some fairly major issues, which is why it has become far less popular in the last 10 years. These problems include vanishing gradients, outputs that are not zero-centered, and expensive computation, all of which are a little out of the scope of this article. If you'd like, you can read more about them here. Instead, it's much more common nowadays to opt for the Rectified Linear Unit (ReLU). It has a bit of a scary name, but it's actually much simpler than the sigmoid.
The ReLU function is simply the max of 0 and the input. This means that all negative inputs become 0, while positive inputs pass through unchanged. The ReLU looks like this: ReLU(x) = max(0, x).
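ReLU is just as short in code (again a NumPy sketch):

```python
import numpy as np

def relu(x):
    # Negative inputs become 0; positive inputs pass through unchanged.
    return np.maximum(0, x)

print(relu(np.array([-2.0, 0.0, 3.5])))  # [0.  0.  3.5]
```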
ReLU fixes practically all of the sigmoid's issues, and was popularized by its use in AlexNet in 2012.
While the sigmoid and the ReLU can often be used interchangeably in hidden layers, some situations call for different kinds of activation functions. In the last layer, we typically use something called a Softmax.
Instead of a single input, the Softmax takes in an entire vector and returns a vector of probabilities, meaning the output vector sums to 1. For our purposes, each number in the output of the Softmax represents one class, so our output will have a length of 10 (one for each digit). From there, the class with the largest probability is our network's guess for the digit. The Softmax looks like this: softmax(z)_i = e^(z_i) / (e^(z_1) + ... + e^(z_n)).
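Here is a small NumPy sketch of the Softmax (the max is subtracted first, a common trick that doesn't change the result but avoids overflow in the exponentials):

```python
import numpy as np

def softmax(z):
    # Subtracting the max leaves the output unchanged but keeps exp() stable.
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])  # made-up raw scores for 3 classes
probs = softmax(scores)
print(probs, probs.sum())  # roughly [0.659 0.242 0.099], which sums to 1.0
```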
Complete Architecture
Now that we have all of the functions we need for a node, we can begin to construct a more concrete picture of our network. One thing to keep in mind is that the first layer of the network doesn't consist of actual nodes; it's just our input data. In our case, each first-layer “node” represents one pixel of our image, so the first layer is 784 nodes long. This is because our input image will be 28 by 28 pixels, which we flatten into a one-dimensional vector. The lengths of the hidden layers are arbitrary, and are some of the many hyperparameters that we set before training begins. The last layer will have a length of 10, because there are 10 digits that we must classify.
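Putting those shapes together (the hidden-layer sizes of 128 and 64 below are just one example choice of hyperparameters, not anything prescribed):

```python
# Input layer, two hidden layers of arbitrary size, and the 10-class output layer.
layer_sizes = [784, 128, 64, 10]

# One weight matrix sits between each pair of consecutive layers.
weight_shapes = [(m, n) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
print(weight_shapes)  # [(784, 128), (128, 64), (64, 10)]
```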
Forward Propagation
We can now begin with the first half of the learning process: forward propagation (or the forward pass). During this step, we propagate our input data through the network and get our network's guess as output. We begin with an image of a digit, our input, which we flatten into one dimension. This vector becomes our first layer. Then, we move to the second layer (the first hidden layer), where each node computes a dot product of our input data and the weights connecting it to that node. Keep in mind that our weights are just random numbers at this point. As a side note, weight initialization is actually quite a complicated subject, which you can read more about here.
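A naive sketch of that random starting point might look like the following (real initialization schemes choose the scale of the randomness much more carefully):

```python
import numpy as np

layer_sizes = [784, 128, 64, 10]  # same example sizes as above
rng = np.random.default_rng(0)

# One small random weight matrix per pair of consecutive layers.
weights = [rng.normal(0, 0.01, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
print([w.shape for w in weights])  # [(784, 128), (128, 64), (64, 10)]
```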
Each node in our second layer now contains a scalar, to which we apply an activation function, in our case a ReLU. The output of this layer is the input of the next hidden layer, so we again compute a dot product, followed by another ReLU. We now have the output of our second hidden layer (the third layer in total). Finally, we compute one more dot product for the last layer. Once we apply a Softmax, the output layer's activation function, we have our final vector of probabilities. To get our network's guess, we just find the class with the highest probability.
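Here is a minimal, self-contained sketch of the whole forward pass described above (the layer sizes and random weights are illustrative choices, not anything special):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

def forward(image, weights):
    # Flatten the 28x28 image into a 784-long vector: this is our "first layer".
    activation = image.reshape(784)
    # Hidden layers: dot product with the weights, then ReLU.
    for w in weights[:-1]:
        activation = relu(np.dot(activation, w))
    # Output layer: one last dot product, then Softmax for 10 probabilities.
    return softmax(np.dot(activation, weights[-1]))

# Random weights for an example 784 -> 128 -> 64 -> 10 network.
rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]
weights = [rng.normal(0, 0.01, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

image = rng.random((28, 28))          # a stand-in for a real greyscale digit
probs = forward(image, weights)
print(probs.sum(), np.argmax(probs))  # probabilities sum to 1; argmax is the guess
```

Since the weights here are random, the printed guess is essentially arbitrary; training is what eventually makes it useful.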
Closing remarks + further information
We are now finished with forward propagation! Because our weights are initialized randomly, we can only expect our network to guess correctly about 10% of the time. In Part 2, I will get into the real meat of neural nets: backpropagation. This is the process by which our network learns, and it is much more math-heavy. It is vital that you understand forward propagation before diving into backpropagation, so I highly recommend you read this article again or look elsewhere on the web. You can find an excellent series on YouTube by 3blue1brown here. Look out for Part 2 and have a nice day!