Neural Network Overview
These notes are from a while ago, but I’m posting them here in hopes that they might be helpful to others. The video these notes are based on can be found here. The corresponding notes (which are probably better than mine) can be found here.
Motivating Example
It is very easy for our brains to look at images of handwritten numbers and classify them, even if the digits are poorly written.
But how, in all that is holy, would you write a program that could do this?
The problem is a lot harder now, isn’t it? Let’s say we have a 4 and a 5 and want to classify them separately. One thought: since we’re dealing with images, and images are just pixels, maybe the pixel intensity distributions of each number will be different enough that we can classify them that way?
But there’s a problem: A 4 and a 5 can look pretty similar if written poorly, so if we just look at pixel intensity distributions from the image, they might be pretty similar.
So what do we do?
We need some way to construct a program that is flexible and adapts to the data we give it. This is different from how we normally think about writing software. If we were to use the pixel intensity approach, we would find the pixel intensity distributions for the digits 0-9, then chain together 10 if statements. That process is fixed, and our software can’t adapt to new data without us going in there and finding all of the pixel intensity distributions by hand, again and again.
This is the problem that machine learning (ML) solves.
Neural networks are a popular form of ML, with deep learning referring to ML built on neural networks with many layers (deep neural networks).
Neural Network
The name leads to two natural questions:
- What are the neurons (the neural part)?
- How are they connected (the network part)?
The Neuron
Right now let’s think of a neuron as a thing that holds a number between 0 and 1.
For our number-recognition problem, each image is 28x28=784 pixels, with each pixel being a neuron. Each of these neurons holds a value ranging from 0 (black) to 1 (white). This number is called the neuron’s activation. All 784 neurons make up the first layer of our network. Here, we do something called flattening to go from a 28x28 grid of pixels to a 784x1 vector of pixel values. As a small example, let’s say we’re working with a 2x2 grid of pixels:
\[ \begin{bmatrix} 0.15 & 0.24 \\ 0.41 & 0.38 \end{bmatrix} \xrightarrow{\text{Flatten}} \begin{bmatrix} 0.15 \\ 0.24 \\ 0.41 \\ 0.38 \end{bmatrix} \]
In our example we’re working with images that are 28x28 instead of 2x2, but the approach is the same. This is just an image pre-processing step and is extremely common. Some networks (like convolutional neural networks) don’t require flattening, but multi-layer perceptrons (MLPs, the technical name for this kind of network) do.
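Here’s a minimal sketch of this flattening step in Python with NumPy (the array values are just the ones from the 2x2 example above):

```python
import numpy as np

# The 2x2 example grid of pixel intensities from above (values in [0, 1])
image = np.array([[0.15, 0.24],
                  [0.41, 0.38]])

# Flatten the grid into a single vector, row by row
flattened = image.reshape(-1)
print(flattened)  # [0.15 0.24 0.41 0.38]

# A real 28x28 image would flatten to a length-784 vector the same way
```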
If we want our network to predict the numbers 0-9, how many neurons should the final/output layer have? 10. One for each number we want to predict.
- The activation of each neuron in this last layer corresponds to how much the network thinks the input image is that neuron’s digit.
Example: Output layer neuron 1 has an activation of 0.45, meaning that the network is assigning a 45% chance the image passed to it is a 1.
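As a small (made-up) illustration of reading the output layer in code:

```python
import numpy as np

# Made-up activations for the 10 output neurons (one per digit 0-9)
output = np.array([0.02, 0.45, 0.03, 0.05, 0.10,
                   0.08, 0.04, 0.11, 0.07, 0.05])

# The network's guess is the digit whose neuron has the highest activation
print(np.argmax(output))  # 1 (activation 0.45)
```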
Between the input and output layers we have the hidden or deep layers.
The architecture of the hidden layers is arbitrary right now, but it is important.
Why Layers/Networks?
What are we expecting from this architecture? What about layers intuitively means they will behave “intelligently”?
We can think about layers as taking an input and breaking it up, with later layers being combinations of the smaller pieces given by the previous layers.
Example: The first layer (after the input layer) might split digits up into horizontal and vertical edges. Later layers might combine these and then become activated when given some input that looks like, say, a 4.
To give our network knobs to turn, we assign a weight (or strength, \(w\)) to each connection between the neurons of one layer and the neurons of the following layer. We then take the activation (\(a\)) coming in along each connection, multiply it by its corresponding weight, and add everything up (a weighted sum).
From this weighted sum we can get any real value (i.e. \(-\infty\) to \(\infty\)), but here we want a number between 0 and 1. What we typically do is pass our raw output into a function that squishes it so that it always lies between 0 and 1. The common function is the sigmoid, \(\sigma\), which is defined as
\[ \sigma (x) = \frac{1}{ 1 + e^{-x} } \] which takes any real number \(x\) and compresses it between 0 and 1. This gives us something like this (for a single neuron):
\[ \sigma \left( w_1 a_1 + w_2 a_2 + \ldots + w_n a_n \right) \] where \(n\) is the number of connections. So for our network, focusing on the first non-input layer, \(n=784\), since we have neurons numbered 1-784 from the input layer.
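A minimal sketch of the sigmoid in Python with NumPy (the function name and the printed values are just for illustration):

```python
import numpy as np

def sigmoid(x):
    """Squish any real number (or array of numbers) into the range between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(-10.0))  # ~0.00005 (very negative inputs go toward 0)
print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # ~0.99995 (very positive inputs go toward 1)
```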
But maybe we want some additional flexibility here. Maybe we don’t want this neuron to turn on whenever the weighted sum is greater than 0. Maybe we want it to start activating only when the weighted sum is greater than, say, 10.
Solution: Tack on a new value, called the bias (\(b\)), at the end:
\[ \sigma \left( w_1 a_1 + w_2 a_2 + \ldots + w_n a_n + b \right) \] This bias tells the neuron how high the weighted sum needs to be before the neuron starts to activate; in other words, it biases the neuron toward inactivity (or activity).
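Putting the weighted sum, bias, and sigmoid together for a single neuron. The activations, weights, and bias below are made up for illustration; a real first-hidden-layer neuron would have 784 weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up incoming activations, weights, and bias for one neuron
a = np.array([0.15, 0.24, 0.41, 0.38])  # activations from the previous layer
w = np.array([0.5, -1.2, 0.8, 0.3])     # one weight per incoming connection
b = -0.4                                 # bias: shifts when the neuron activates

activation = sigmoid(np.dot(w, a) + b)   # sigma(w_1*a_1 + ... + w_n*a_n + b)
print(activation)                        # a single number between 0 and 1
```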
Because we’re working with a multi-layer perceptron (a kind of fully-connected network, meaning that each neuron in one layer connects to every neuron in the next layer), each neuron in the first hidden layer has 784 incoming weights (one per input activation) and one bias.
Linear Algebra Notation
Put all of the activations of layer 0 in a column vector, all of the weights in a matrix, and all of the biases in another column vector, then wrap the sigmoid function around the whole thing:
\[ \sigma \left( \begin{bmatrix} w_{0,0} & w_{0,1} & \ldots & w_{0, n} \\ w_{1,0} & w_{1,1} & \ldots & w_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{k,0} & w_{k,1} & \ldots & w_{k,n} \end{bmatrix} \begin{bmatrix} a_0^{(0)} \\ a_1^{(0)} \\ \vdots \\ a_n^{(0)} \end{bmatrix} + \begin{bmatrix} b_0^{(0)} \\ b_1^{(0)} \\ \vdots \\ b_k^{(0)} \end{bmatrix} \right) = \sigma \left( \textbf{W} a^{(0)} + b^{(0)} \right) \] where the sigmoid function is applied element-wise, e.g.
\[ \sigma \left( \begin{bmatrix} x \\ y \\ z \end{bmatrix} \right) = \begin{bmatrix} \sigma(x) \\ \sigma(y) \\ \sigma(z) \end{bmatrix} \] This would all be the activation for the entire first layer, \(a^{(1)}\):
\[ a^{(1)} = \sigma \left( \textbf{W} a^{(0)} + b^{(0)} \right) \]
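As a minimal NumPy sketch of this layer computation (the hidden-layer size of 16 and the random weights and biases are placeholder assumptions, not values from the video):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

n_in, n_out = 784, 16               # 784 input neurons, 16 neurons in layer 1

a0 = rng.random(n_in)               # activations of layer 0 (the input layer)
W = rng.normal(size=(n_out, n_in))  # one row of weights per neuron in layer 1
b = rng.normal(size=n_out)          # one bias per neuron in layer 1

a1 = sigmoid(W @ a0 + b)            # all of layer 1's activations at once
print(a1.shape)                     # (16,)
```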
NVIDIA graphics processing units (GPUs) can do this kind of matrix multiplication EXTREMELY fast.
Let’s update our definition of a neuron a bit. Earlier, a neuron was defined as a thing that just holds a number. That’s still true, but a more complete picture is that a neuron is a function: one that takes in all the outputs of the neurons in the previous layer and spits out a number between 0 and 1.
Our entire network is actually a function. A huge one, but one that fundamentally takes in 784 numbers and spits out 10 values:
\[ f(a_0, \ldots , a_{783}) = \begin{bmatrix} y_0 \\ \vdots \\ y_9 \end{bmatrix} \]
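Here’s a rough sketch of that idea in NumPy. The 784 → 16 → 16 → 10 architecture and the `forward` helper are assumptions for illustration, and the weights are random, so the output is meaningless until the network is trained:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, layers):
    """Pass an input vector through a list of (weights, biases) pairs, one per layer."""
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)

# Assumed architecture: 784 inputs -> 16 -> 16 -> 10 outputs, with random weights
sizes = [784, 16, 16, 10]
layers = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.random(784)      # a fake flattened image
y = forward(x, layers)
print(y.shape)           # (10,) -- one value per digit
```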
This is why neural networks are sometimes called “universal function approximators”. The building blocks we’ve discussed are pretty much the same from network to network, and this approach of building a function out of them is (so far) universal: it can approximate the solutions to a huge range of problems.
Citation
@online{gregory2024,
author = {Gregory, Josh},
title = {Neural {Network} {Overview}},
date = {2024-12-31},
url = {https://joshgregory42.github.io/posts/2024-12-31-nn-overview/},
langid = {en}
}