
Demystifying Artificial Neural Networks

Abstract: This document gives a technical description of the anatomy and training of a fully-connected neural network with one hidden layer, applied to the problem of classifying images of handwritten digits. It begins with an introduction to the general class of pattern recognition algorithms and their purpose, along with a snapshot of the history and relevance of the field today. It concludes with some limits of the model described herein, a better alternative for further reading, and why this specific algorithm composition matters.


Artificial neural networks, or ANNs for short, belong to a class of pattern recognition algorithms. These algorithms find patterns in high-dimensional datasets and are optimized to perform well in elementary settings such as regression, where the goal is to forecast a response variable given some future observation, or classification, where the goal is to predict the class of a future observation. Here, an observation could be an image, or a collection of numbers whose entries quantify things that are hopefully related to the response variable. To put things into perspective, an ANN is capable of distinguishing between dogs and cats given a picture containing one of them, all without hand-crafted rules for what makes a dog a dog, and when presented with tabular data, say features of homes along with their prices, it can predict the price of a new house in the neighborhood from its features without assuming the shape of the underlying function.

One of the first people to demonstrate the robust capability of a neural network was Frank Rosenblatt, a research psychologist at Cornell University, who built a room-sized machine that could distinguish punch cards using a single-neuron neural network (Lefkowitz). With the growth of computing power over the years and the boom of data available to analyze, neural networks now thrive on hard problems and are widely used across many different applications, including detecting skin cancer better than human doctors, language translation, fraud detection in check deposits, and item recommendation based on one’s previous purchasing history. The following description elucidates the anatomy of a fully-connected neural network (a type of network frequently used as a building block of more complicated architectures) and the computations involved in the “forward” and “backward” stages of its learning, the term the pattern recognition community uses for the process of optimizing the algorithm to fit the demands of the task at hand, accompanied by an example application to image classification.

The fully-connected neural network is best described as a composition of its layers, its neurons, and its weights. A layer is a collection of neurons at a certain level, and it can be the input layer, a hidden layer, or the output layer. To keep the description simple, only one hidden layer is assumed. Taking image classification as a central example, the neurons in the input layer are just the pixel values of the image. Specifically, computers interpret an image by representing the color of each pixel with intensity values in three color channels: red, green, and blue. Together, these pixel representations can be packaged into a three-dimensional array of numbers, called a tensor, whose dimensions are 3 by M by N, where M is the width and N the height of the image in pixels. This is how an image is stored, but not how it gets passed into a fully-connected neural network. To do that, the image is first “flattened” so that the pixel values no longer make up a three-dimensional box but instead a one-dimensional list of numbers. Each of these pixel values is an input neuron.
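
As a minimal sketch in Python with NumPy, the storage and flattening steps might look as follows; the 28-by-28 image size and the random pixel values are illustrative assumptions, not values fixed by the text.

```python
import numpy as np

# Hypothetical 3 x M x N tensor of red, green, and blue intensities;
# a 28-by-28 image is assumed purely for illustration.
image = np.random.rand(3, 28, 28)

# "Flatten" the three-dimensional box of pixel values into a
# one-dimensional list of numbers: one input neuron per entry.
input_neurons = image.reshape(-1)

print(input_neurons.shape)  # (2352,) since 3 * 28 * 28 = 2352
```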

Next, we take the product of each of these input neurons with variables called weights, and add the products up. We don’t assign numbers to these weights just yet, because the weights are what is actually learned! After this addition, the value is passed into a nonlinear, bounded function called the neuron’s activation. In practice the sigmoid function is typically used; it resembles a stretched “S” squeezed between the horizontal lines y = 0 and y = 1. The procedure just described produces only a single value, one hidden neuron. We repeat the same computation, element-wise multiplication with a different set of weights, a sum, and the activation, once for each hidden neuron desired. If there are L hidden neurons, there are a total of 3*M*N*L different weights between the input layer and the (only) hidden layer. One sometimes also assigns bias terms: a weight variable that is not multiplied with any neuron from the previous layer but is instead added to the sum just described, before evaluating the activation. Each hidden neuron gets its own bias term. With bias terms, the number of trainable weights amounts to 3*M*N*L + L. Together, these hidden neurons make up the hidden layer.
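
A hedged sketch of one hidden neuron’s computation, continuing the illustrative sizes above (the variable names and the choice of L = 30 hidden neurons are mine, not fixed by the text):

```python
import numpy as np

def sigmoid(z):
    # Nonlinear, bounded activation: a stretched "S" between y = 0 and y = 1.
    return 1.0 / (1.0 + np.exp(-z))

M, N, L = 28, 28, 30                # illustrative image size and neuron count
x = np.random.rand(3 * M * N)       # flattened pixel values (input neurons)

w = np.random.randn(3 * M * N)      # one weight per input neuron
b = np.random.randn()               # this neuron's bias term

# Multiply each input by its weight, sum, add the bias, then activate.
hidden_neuron = sigmoid(w @ x + b)

# Across all L hidden neurons, the trainable-parameter count matches the text:
n_params = 3 * M * N * L + L        # 3*M*N*L weights plus L biases
```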

The computations between the hidden layer and the output layer are identical to those between the input layer and the hidden layer. But since the output layer is the final stage of the network, its dimensions deserve a remark. In the image classification example, if we wanted to classify images of handwritten digits from zero to nine, it would make sense to have ten neurons in the output layer. The activations of these neurons, each a number between zero and one, can be thought of as class probabilities. In a learned neural network, where all the weights and bias terms have been adjusted appropriately, the weights are fixed; passing in an image triggers the computation just described, layer by layer, finalizing at the output layer. There, the output neuron with the highest activation among the ten tells us which digit the trained network believes the image represents.
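
Putting both layers together, a full forward pass might be sketched as below; the random starting weights stand in for a trained network, and the sizes are again illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    hidden = sigmoid(W1 @ x + b1)     # input layer -> hidden layer
    return sigmoid(W2 @ hidden + b2)  # hidden layer -> ten output activations

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((30, 2352)), rng.standard_normal(30)
W2, b2 = rng.standard_normal((10, 30)), rng.standard_normal(10)

output = forward(rng.random(2352), W1, b1, W2, b2)
predicted_digit = int(np.argmax(output))  # neuron with the highest activation
```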

Figure 1: Anatomy of a fully-connected neural network composed of a single hidden layer for handwritten image classification. (Nielsen, Michael. Neural Networks and Deep Learning. N.p., 2019. http://neuralnetworksanddeeplearning.com. Web. 22 Sept. 2020.)


So how exactly are these weights learned? To introduce learning is to introduce a means by which the weights are deemed good or bad. Neural networks learn by example, often hundreds of thousands of examples, to perform optimally. In image classification, images are paired with the labels they correspond to, and the network learns by minimizing a penalty or cost function, starting from the output layer, in an efficient, recursive procedure known as backpropagation. One example of a loss function in the image classification context is the mean squared error loss. For each image, we take the sum of squared differences between the activations in the output layer, which are produced by the network, and what is desired, namely a value of one at the neuron corresponding to the actual class of that image and zero everywhere else; this quantity is then averaged over all images in the dataset. Note that this loss is not only a function of the output activations: an output neuron’s activation is expressed in terms of all the (hidden layer) weights (and bias term) leading to that output neuron, and all the weights leading to those hidden neurons, and so on down to the weights at the input layer (fig. 1). Thus, the loss is effectively a function of all the weights and bias terms that constitute the entire network! Minimizing this loss encourages the network to tune the weights so that, given an image, the output activation for the correct class is as close as possible to one while all the other output activations are as close as possible to zero.
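
A small sketch of this loss, under the assumption that the network’s outputs for a batch of images are stacked in a two-dimensional array (the function name and shapes are mine):

```python
import numpy as np

def mse_loss(outputs, labels):
    # outputs: (n_images, 10) activations produced by the network.
    # labels:  (n_images,) integer digit classes, 0 through 9.
    n_images = outputs.shape[0]

    # Desired activations: one at the true class, zero everywhere else.
    desired = np.zeros_like(outputs)
    desired[np.arange(n_images), labels] = 1.0

    # Sum of squared differences per image, averaged over all images.
    return np.mean(np.sum((outputs - desired) ** 2, axis=1))

# Tiny usage example with made-up outputs for three images.
outputs = np.random.rand(3, 10)
labels = np.array([2, 7, 0])
print(mse_loss(outputs, labels))
```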

Figure 2: What a loss function looks like, shown here over only two axes. In reality, the loss function lies over a high-dimensional weight space. (Goldstein, Tom. “Visualizing the Loss Landscape of Neural Nets.” Computer Science Department. U of Maryland, N.p., 2018. cs.umd.edu. Web. 22 Sept. 2020.)


Backpropagation is an algorithm that enables us to obtain the partial derivatives of the cost function with respect to each of the weights through repeated use of the chain rule, resulting in a set of fundamental equations that govern how they are found. A partial derivative is a mathematical construct that tells you in which direction to tweak a weight, and by how much. It essentially acts as a “rate of change,” so when it is multiplied by a small value, the result becomes the amount of change to make to the weight, and we subtract these changes from the current values of the weights during training. The weights are first assigned values at the very beginning of learning, when they are initialized with random numbers. The reason we subtract instead of add is that the goal is to minimize, not maximize, the loss function! The collection of these partial derivatives defines the gradient, which corresponds to the “direction of change” along the high-dimensional loss function (fig. 2). The negative of the gradient points in the direction of steepest descent along this loss function, which is followed until the values settle in some “valley,” or local minimum, of the loss.
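
A minimal sketch of one gradient-descent step, assuming backpropagation has already produced the partial derivatives (faked here with random numbers purely so the snippet runs end to end):

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights begin as random numbers, as described above.
W = rng.standard_normal((30, 2352))

# Stand-in for the partial derivatives backpropagation would produce.
grad_W = rng.standard_normal(W.shape)

# Multiply the "rate of change" by a small value to get the change to make,
# and subtract it: we descend, because the goal is to minimize the loss.
learning_rate = 0.1
W -= learning_rate * grad_W
```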

Unfortunately, deriving these fundamental equations requires comfort with multivariate calculus and some linear algebra, which is outside the scope of this technical description. As a small remark, it is in these equations that we see a correspondence with Hebbian-like learning in the brain, hence the “neural” in the name. The partial derivatives described earlier are usually collected over a small batch of examples, and the change to the weights is averaged over this subset. Batch after batch is passed in until all training examples are exhausted, at which point we start over; each full pass through the data is called an epoch. We continue learning until we are satisfied with the weights acquired, typically by inspecting whether the accuracy of the network has stopped increasing and begun decreasing, a phenomenon called overfitting.
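
The batching schedule might be sketched like this; `update_weights` is a hypothetical stand-in for one backpropagation step averaged over a batch, and all sizes are illustrative:

```python
import numpy as np

def update_weights(batch_images, batch_labels):
    pass  # backpropagation and the gradient step would go here

n_examples, batch_size, n_epochs = 1_000, 32, 5
images = np.random.rand(n_examples, 2352)          # placeholder data
labels = np.random.randint(0, 10, size=n_examples)

rng = np.random.default_rng(0)
for epoch in range(n_epochs):                      # one epoch = one full pass
    order = rng.permutation(n_examples)            # reshuffle every epoch
    for start in range(0, n_examples, batch_size):
        batch = order[start:start + batch_size]    # one small batch
        update_weights(images[batch], labels[batch])
```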

From the formula for the number of weights shown earlier, it is clear that the number of weights grows dramatically with the size of the image and with the number of hidden layers and neurons. This makes the model more complex, meaning it is harder to train and prone to overfitting. Luckily, in the handwritten digits example, it should be intuitive that color is not needed to identify a handwritten number, so we can instead supply greyscale values, which cuts the number of weights at the input layer roughly by a factor of three. With the right number of neurons in the hidden layer, and the addition of artificially generated examples through image manipulation (inducing properties like rotational invariance), we could potentially reach an accuracy of around 90-95% with this basic framework. The reason this specific composition of stacked nonlinear functions can be highly performant is a mathematical result called the Universal Approximation Theorem, which states that a neural network with even a single hidden layer, given enough neurons, can approximate any continuous function on a compact region of a high-dimensional real space! Luckily, with a more sophisticated architecture like a convolutional neural network (which contains a fully-connected network as part of its architecture), we can do even better and build deeper networks (shown in practice to be instrumental in achieving better performance) without the dramatic increase in weights. This basic architecture yields surprising results given the difficulty of image recognition, and it is key to understanding more sophisticated architectures.
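
To make the greyscale savings concrete, a quick back-of-the-envelope count with the same hypothetical sizes as before (a 28-by-28 image and L = 30 hidden neurons):

```python
M, N, L = 28, 28, 30               # illustrative image size and hidden width

rgb_params = 3 * M * N * L + L     # 70,590 trainable parameters with color
grey_params = 1 * M * N * L + L    # 23,550 with greyscale: roughly a third

print(rgb_params, grey_params)
```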


_____________________________________________________________________________________________________

Works Cited

Lefkowitz, Melanie. “Professor’s Perceptron Paved the Way for AI – 60 Years Too Soon.” Cornell Chronicle. Cornell University, 25 Sept. 2019. Web. 22 Sept. 2020.