
Artificial Intelligence — A Brief and Simplified Introduction to Convolutional Neural Networks and their Practical Application — Part 1.

by Dr. Iain J. Attwater

Mar 3, 2021


Background

This is part 1 of a 3-part series that takes a generic approach to understanding the imaging side of Artificial Intelligence (AI).

The field of Artificial Intelligence has enjoyed an increasing role in our everyday lives, but what is AI? Many experts have developed succinct or “pre-canned” phrases that attempt to capture the essence of AI. For the purposes of this article, we define AI as the ability of a computer to learn from data that is separated from the logic that controls how the data is used. In other words, decisions are not programmed into the system ahead of time; they are learned from the data consumed. Some have described this as learning the rules to derive answers rather than learning the answers.

There are many fields in the study of AI. However, they are usually divided into two key groups: shallow and deep learning. Shallow learning finds its role in areas such as regression analysis: tools that combine known facts, or features, of a problem to find a pattern that connects them. Deep learning learns both the features and the relationships that exist between them. This article focuses on the application of the latter.

The basic entry point for Deep Learning is a construct known as Artificial Neural Networks (ANNs). The basic building blocks and their functionality are explored later.

Convolutional Neural Networks (CNNs), a subset of ANNs, have been around for decades, but with the advent of increasing on-device performance, their power is now being utilized in more everyday situations. Their particular specialty resides in the field of computer vision, a computationally intensive field.

Raw processing power and the ability to provide compute services locally on a device is only part of the story. The ability to process image data and perform predictive, real-time analysis locally or “on-device” brings with it significant benefits when considering privacy and avoiding the reliance on network connectivity or cloud-based solutions.

This article explores an area that has received very focused attention recently, the field of medical imaging, and more specifically the prediction of either viral or bacterial pneumonia from chest x-rays. However, rather than taking the “black box” approach to the creation of a CNN model, the author chose to return to basics and develop the model using TensorFlow 2 and the conversion tools available to create a Core ML (Apple-specific) model. The use of this model in a mobile application will form the subject matter of part 3 of this series. The author also chose, as best he could, not to repeat the many articles already written on the subject, but rather to present a practical approach to CNN deployment for the medical, business, and computing science communities, and for the mathematical novice or the simply curious.

The mathematics of Artificial Neural Networks is rigorous and draws upon areas such as linear algebra, partial differential calculus, and statistics. An in-depth study of each of these is beyond the scope of this article. However, the author presents some basic mathematical intuition, with references to some of the excellent content available online for a deeper understanding. The reader, though, should not ignore the reliance on mathematics in achieving such results; the author is a strong believer in the old adage “Consider mathematics as a language to communicate abstract, logical ideas with precision and unambiguity, and it will set you free!”.

AI Mathematics Intuition

The following paragraphs provide a brief and intuitive introduction to some of the mathematics at the core of how ANNs work.

Let’s start with calculus. Calculus can be thought of, in the context of this article, as the study of how quantities vary over time; this is often described as the rate of change of a quantity. Real-world quantities typically vary over a period of time during which they reach maximum and minimum values. For example, consider the simple act of throwing a football. It’s easy to rationalize that the football will reach a maximum height and a maximum speed, and also attain their minimum counterparts. Calculus allows us to derive equations that govern the motion of the football and apply an algorithm to find these minimum and maximum values. Through rigorous practice, quarterbacks do this intuitively every time they throw a football.
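To make this concrete, here is a tiny sketch of the idea in Python. The launch speed is an illustrative value; the point is that the peak occurs exactly where the rate of change of height is zero.

```python
# Height of the thrown ball follows h(t) = v0*t - 0.5*g*t^2.
# Calculus gives the rate of change dh/dt = v0 - g*t, and the peak occurs
# where that rate of change is zero, i.e. at t = v0 / g.
v0 = 20.0                # illustrative vertical launch speed (m/s)
g = 9.81                 # gravitational acceleration (m/s^2)

t_peak = v0 / g                                 # time of maximum height
h_peak = v0 * t_peak - 0.5 * g * t_peak ** 2    # the maximum height itself
print(f"peak at t = {t_peak:.2f} s, height = {h_peak:.2f} m")
```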

The basic principle of an ANN is that the network is tuned, or more accurately, trained, so that a set of known outcomes is correctly predicted: for example, feeding in an image of a cat and having the network predict that it is indeed a cat. (A more detailed explanation of this process is given later in the article.)

A key part of this process is understanding how to adjust the tunable parameters so that the network becomes more accurate. The mechanism for this is known as backpropagation: it uses calculus to determine how much each tunable parameter contributed to the error, which in turn informs how the parameters should be adjusted to increase the overall accuracy of the network.
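The full chain-rule machinery is beyond our scope, but a toy, one-parameter sketch captures the essence: use the derivative of the error to decide which way, and how far, to nudge a weight. Everything here (the model y = w * x, the single training example, the step size) is invented for illustration.

```python
# A toy illustration of gradient-based tuning with a single parameter w.
# Backpropagation in a real network computes this kind of derivative for
# millions of weights at once; here we do it for one, by hand.
def error(w):
    # Squared error of a one-parameter model y = w * x against a known answer.
    x, y_true = 2.0, 10.0          # one training example: input 2 should map to 10
    return (w * x - y_true) ** 2

def d_error(w):
    # Derivative of the error with respect to w (the chain rule, worked by hand).
    x, y_true = 2.0, 10.0
    return 2 * (w * x - y_true) * x

w = 0.0                            # start with an untrained weight
for step in range(50):
    w -= 0.01 * d_error(w)         # nudge w opposite to the slope of the error
print(f"learned w = {w:.3f} (ideal is 5.0), error = {error(w):.6f}")
```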

Turning now to linear algebra, the study of vectors and linear functions. This possibly scary statement can be distilled down to two simple concepts. Firstly, the world is full of quantities that vary in magnitude (size), direction, or over time. The domain term for such a quantity is a vector. Vectors conveniently map these quantities into a common format that allows relationships to be formed between similar types (other vectors). This first process is known as vectorization and is the cornerstone of modelling data in a manner that is amenable to consumption by a computer. The second concept is a packaging that defines vectors in a form that can be operated on extremely efficiently. ANNs make use of linear algebra to compile millions of vectors that can be manipulated by a computer during the backpropagation step. Think of this as being able to express vast amounts of quantitative data in a form that is readily consumable by computers.
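As a small illustration with made-up numbers, NumPy (the standard Python package for this kind of work) lets an entire layer's worth of weighted sums collapse into one matrix-vector product:

```python
import numpy as np

# Vectorization: the weighted sums for a layer of 4 neurons fed by 3 inputs
# collapse into one matrix-vector product. All values are illustrative.
x = np.array([0.5, -1.2, 3.0])                     # an input expressed as a vector
W = np.random.default_rng(0).normal(size=(4, 3))   # a 4-neuron weight matrix
b = np.zeros(4)                                    # one bias per neuron

z = W @ x + b   # all four weighted sums in a single, highly optimized operation
print(z)
```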

That wasn’t too bad, was it?

Real World Application and Motivation

According to the US Centers for Disease Control (CDC), the following are some key points about pneumonia:

  • Pneumonia is an infection of the lungs that can cause mild to severe illness in people of all ages.
  • It is the leading cause of death due to infection in children younger than 5 years of age worldwide.
  • Pneumonia and influenza together are ranked as the eighth leading cause of death in the U.S.
  • Those at high risk for pneumonia include older adults, the very young, and people with underlying health problems.

More recent studies have indicated that pneumonia induced by COVID-19 has a significantly higher mortality rate.

General Introduction to AI

Artificial Intelligence (AI) is a term in popular culture that has been used for decades, in fact almost as long as computers have been in existence. Recently though, AI has received some negative publicity, mostly centered around privacy and social media.

With the rise of computing power that can be held in the palm of your hand, the term Machine Learning (ML, strictly a subset of AI) has become more commonplace. Because it is more closely associated with devices or machines, it is devoid of some of the more negative connotations attached to AI.

The ML ecosystem is wide and varied. For example, the modeling of continuous systems has received significant attention in the field of regression and the ability to predict the value of a dependent variable, y, based upon the relative value of independent variables, or features, X. An example would be the prediction of house price (y) based upon the value of features (X) such as number of bedrooms, commute time, distance to schools etc. More on features later.
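A minimal sketch of that idea with scikit-learn, using entirely made-up houses and prices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative (made-up) features X: [bedrooms, commute minutes, miles to school]
X = np.array([[3, 30, 1.0],
              [4, 45, 2.5],
              [2, 20, 0.5],
              [5, 60, 3.0]])
y = np.array([300_000, 350_000, 250_000, 420_000])   # sale prices (the label y)

model = LinearRegression().fit(X, y)                 # learn the feature weights
print(model.predict([[3, 25, 1.5]]))                 # predicted price for a new house
```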

The financial industry has placed significant stock in more binary methods, such as logistic regression, to predict whether a given transaction is fraudulent. A more relatable example for most of us is the identification of spam emails and, probably more controversially for the young amongst us, the use of automated parental controls over internet content. An important distinction of logistic regression is its ability to represent probability based upon the non-linear relationships that may exist between features; this powerful aspect is a key element in Artificial Neural Networks’ success.
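Again as a sketch with fabricated transactions, note how logistic regression returns a probability rather than just a yes/no label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative (made-up) transaction features: [amount in dollars, hour of day]
X = np.array([[12.50, 14], [980.00, 3], [45.00, 11], [1500.00, 2]])
y = np.array([0, 1, 0, 1])                 # 0 = legitimate, 1 = fraudulent

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba([[700.00, 4]]))    # a probability for each class
```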

This more traditional approach to “training” a mathematical model relies on the practitioner understanding the various aspects, or features, that influence the response of the dependent variable (or label). ANNs are at their most powerful in “learning” these features without recourse to human influence. This, arguably, is truly fundamental to intelligence.

Artificial Neural Networks

Turning now to Artificial Neural Networks, or ANNs, which have their roots in the study of the human brain. Many articles are available online that describe the human brain’s neuron construct and its ability to send or receive electrical or chemical signals.

ANNs model this human arrangement and can be better understood by reference to Figure 1 below. Here the blue circle is termed a neuron, analogous to one of the billions of neurons found in the human brain. The neuron is excited by the inputs X through a weighted system, identified by w, and is augmented by a bias term b; the key aspect is whether the neuron’s “energy” reaches the threshold value needed to “fire”, i.e. transmit a non-zero result forward to the next neuron. The neuron’s output is “calculated” by the activation function, denoted f(x).

Figure 1

The function of the bias term, together with the system of weights, is to provide a set of tuning interfaces (think knobs and dials) that allow an activation function to either fire or stay dormant.
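A single neuron of Figure 1 can be written out in a few lines; the inputs, weights, and bias below are illustrative values only:

```python
import numpy as np

# One neuron from Figure 1: weighted inputs plus a bias, passed through an
# activation function. All values are illustrative.
def relu(z):
    # A common activation: "fire" (pass the value forward) only when z exceeds 0.
    return max(0.0, z)

x = np.array([0.8, 0.1, 0.4])    # inputs X
w = np.array([0.5, -1.0, 0.9])   # tunable weights w
b = -0.2                         # tunable bias b

z = np.dot(w, x) + b             # the neuron's "energy"
print(relu(z))                   # non-zero only if the threshold is reached
```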

By extending this simple construct, creating many neurons in the vertical direction to form a “layer” of neurons, and then replicating this structure to create more layers, a powerful network representation can be created. The final layer of neurons is called the output layer, with the intermediate ones being hidden layers. They are deemed hidden because they create relationships that are abstracted away from human interpretation. Recall from the previous section that features are learned rather than defined. This arrangement is also commonly called a feed-forward network in many ANN publications.

This is known as a neural network and is shown in Figure 2 below. The problem, then, is to “tune” the network, a process known as training: the network is fed a predetermined set of inputs until the desired outputs are correctly predicted.

Figure 2

Returning to the concept of features as described above, it’s the job of the Data Scientist, when faced with a more traditional regression problem, to identify the features that are predicted to influence the outcome. With neural networks, however, the network learns these features through the tuning of the weights and various biases. The hidden layer approach lends itself well to “finding” or learning these relationships.
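In TensorFlow 2, the package used later in this article, such a layered feed-forward network can be declared in a few lines. The layer sizes here are arbitrary illustrative choices, not the article's model:

```python
import tensorflow as tf

# A minimal feed-forward network in the spirit of Figure 2.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),               # four input features
    tf.keras.layers.Dense(8, activation='relu'),     # first hidden layer
    tf.keras.layers.Dense(8, activation='relu'),     # second hidden layer
    tf.keras.layers.Dense(1, activation='sigmoid'),  # output layer
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```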

Many beginning practitioners in this field have grappled with trying to find an algorithm to help determine the optimum number of hidden layers and the number of neurons each hidden layer should contain. Until recently this has not been scientific and has relied on a methodology not that far removed from trial and error.

Convolutional Neural Networks

The author has written various papers regarding how the advent of the digital computer has aided a wide swath of engineering disciplines. For example, structural analyses of complex geometries subjected to non-linear loading are now commonplace. They rely on computers solving millions of equations, taking advantage of a process known as vectorization.

This process also lends itself well to tuning (training) an ANN, which relies on backpropagation to iteratively learn the sets of biases and weights that derive the correct output from the input. Backpropagation relies heavily on partial differential calculus, and a study of its mechanism falls outside the scope of this article.

While this approach has been shown to be useful in computer vision, the vanilla ANN is far less efficient for more complex imaging problems. This is because the input to an ANN is essentially a very long, single-dimension vector. The size of this vector grows proportionally with the image’s size, shape, and color depth (consider a grayscale image, which has a single channel, versus its color sibling, which has three, one for each primary color). This dramatically affects computing time and is what motivated significant research in the 90s to find a more efficient solution.

Researchers found that a filter-based approach provided significant improvements: a small filter (a 3 x 3 matrix, for example) is scrolled across the top of an image, then moved down a row and the process repeated. At each step, an element-wise multiplication is performed between the filter and the pixels beneath it, and the results are summed to produce a single output value. A companion operation, pooling, then replaces each small neighborhood of these outputs with its maximum or average value, down-sampling the image by a factor that depends on the pooling parameters (usually 2).

If the user selects the filter carefully, and for argument’s sake chooses a 3x3 filter whose top row is filled with 1s, whose middle row is 0s, and whose bottom row is -1s, then particular features of the image can be “recognized”. In this example we have built an “edge detector” that can locate horizontal edges in an image.
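Here is that filter in action as a short sketch, scrolling across a tiny made-up image and performing the multiply-and-sum at each step:

```python
import numpy as np

# Slide a 3x3 horizontal-edge filter across a tiny made-up image; at each
# step, multiply element-wise and sum (a "convolution").
kernel = np.array([[ 1,  1,  1],
                   [ 0,  0,  0],
                   [-1, -1, -1]])   # responds strongly to bright-above/dark-below

image = np.array([[9, 9, 9, 9, 9],
                  [9, 9, 9, 9, 9],
                  [0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0]], dtype=float)

out = np.zeros((image.shape[0] - 2, image.shape[1] - 2))
for i in range(out.shape[0]):          # scroll down a row at a time...
    for j in range(out.shape[1]):      # ...scanning across each row
        window = image[i:i + 3, j:j + 3]
        out[i, j] = np.sum(window * kernel)
print(out)   # large values mark the horizontal edge between the 9s and the 0s
```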

This process is not new and is known as convolving an image; applications in image processing have ranged from edge detection to blurring. A hidden advantage of this process, when coupled with another operation known as pooling, is that it provides a mechanism that promotes translational invariance: the ability to recognize, say, a red circle no matter where that circle appears in an image.

It turns out that this property of translational invariance is extremely useful in the field of computer vision. For example, to humans, a cat in the corner of a photo is still a cat, just as a cat in the center is. Translational invariance solves the problem of having to relearn a shape, in this example a cat, just because it appears elsewhere in an image. (Invariance to rotation, recognizing a rotated cat, is a related but separate property, typically encouraged through data augmentation.)
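For completeness, here is what max pooling looks like on a small, made-up feature map; note that nudging a bright pixel around within its 2x2 neighborhood leaves the pooled output unchanged, a small taste of translational invariance at work:

```python
import numpy as np

# Max pooling: keep the largest value in each 2x2 neighborhood, halving the
# width and height of the feature map. The values are illustrative.
fmap = np.array([[1, 3, 0, 2],
                 [5, 2, 1, 0],
                 [0, 1, 7, 4],
                 [2, 0, 3, 1]])

# Reshape into 2x2 blocks and take the max of each block.
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[5 2]
                #  [2 7]]
# Moving the 7 to any other cell of its 2x2 block gives the same result.
```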

Chaining these convolutions and poolings together forms a Convolutional Neural Network, which then acts as the input to a traditional ANN. The ANN then has the job, through its hidden layers, of learning the features in an image and predicting the dominant object. A typical CNN is shown below in Figure 3. The ANN portion is sometimes referred to as the fully connected layers of the CNN.

Figure 3
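Sketched in TensorFlow 2, the chain in Figure 3 might look like the following; the filter counts and layer sizes are illustrative choices rather than the article's final model:

```python
import tensorflow as tf

# The conv -> pool -> conv -> pool -> flatten -> dense chain of Figure 3.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100, 100, 3)),            # 100x100 RGB input
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),                             # hand off to the ANN
    tf.keras.layers.Dense(64, activation='relu'),          # fully connected layer
    tf.keras.layers.Dense(1, activation='sigmoid'),        # pneumonia / normal
])
model.summary()
```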

Convolutional Neural Network Considerations

CNNs must be voracious consumers of images in order to be accurate and to generalize well. The term generalization, in this context, refers to the ability of the CNN to accurately identify the dominant object in an image it has not encountered during training. A phenomenon that occurs frequently with CNNs is known as overfitting, which manifests itself as the network becoming an “expert” at memorizing the training set. Then, when presented with a previously unseen image, it is unable to recognize it correctly.

Essentially, the most effective way to counteract overfitting is to supply more images (or data). This can be challenging in that many thousands of images are required to avoid overfitting and generalize well. A recognized way of short-circuiting this problem is to utilize data augmentation techniques. Here a batch of images is “altered” by various image operations, such as blurring, skewing, reflecting, or rotating the base image. Many leading CNN frameworks include this feature as standard, highly automated and lending itself to a typical image-processing workflow. While this can be a powerful tool, it should be used with common sense; in this article’s case, “augmenting” an X-ray of a human requires thought as to the type of augmentation adopted.
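As a sketch of what such augmentation might look like using TensorFlow 2's ImageDataGenerator, with the X-ray caveat above deliberately baked into the parameter choices (all of which, along with the directory path, are illustrative):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation chosen with the X-ray caveat in mind: small shifts and tilts
# are plausible, but flips are left off; a vertically flipped chest is
# anatomically nonsensical, and even a left-right flip moves the heart's
# apparent side.
augmenter = ImageDataGenerator(
    rotation_range=5,          # tilt by at most 5 degrees
    width_shift_range=0.05,    # shift by at most 5% of the width
    height_shift_range=0.05,
    zoom_range=0.05,
    horizontal_flip=False,
)
# Example usage (the path is a placeholder for the dataset's train folder):
# train_gen = augmenter.flow_from_directory('chest_xray/train',
#                                           target_size=(100, 100), batch_size=32)
```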

The author chose, initially, not to include data augmentation for this reason, and, as the results in part 2 will show, overfitting did rear its unwelcome head, prompting consideration of alternative solutions. The author has written this article as a beginner’s introduction to the deep and complex subject of image recognition; future articles will build upon this one and present additional work, including avoiding some pitfalls such as the one described here.

In addition, and as alluded to above, the selection of the number of hidden layers for the fully connected layers is now compounded by the added choices available for the filter sizes and/or pooling layers. The outputs of the second and subsequent convolutions are often referred to as feature maps. This creates the challenge of choosing the number of filters so that the feature maps are generated efficiently: initially recognizing diverse primitive shapes, and later learning more complex objects.

The author chose, for this first part, a brute-force, trial-and-error approach, albeit from a naïve traditional, or hardcoded, starting point. Later parts of this article series will explore further approaches such as hyperparameter optimization, image augmentation, and transfer learning.

Pneumonia CNN Structure

The author selected the well-known software package TensorFlow 2 from Google as the platform for building and training a CNN for the prediction of Pneumonia from a 100 x 100 pixel grayscale image. This grayscale image was converted to a color (RGB) image as part of the data preparation process.
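One possible sketch of that preparation step in TensorFlow 2 follows; the function name and file path are placeholders, not the author's actual pipeline:

```python
import tensorflow as tf

# Load an X-ray, resize it to 100x100, and expand grayscale to 3 channels,
# matching the data preparation described above.
def prepare(path):
    raw = tf.io.read_file(path)
    img = tf.image.decode_jpeg(raw, channels=1)   # read as grayscale
    img = tf.image.resize(img, (100, 100))        # 100 x 100 pixels
    img = tf.image.grayscale_to_rgb(img)          # single channel -> RGB
    return img / 255.0                            # scale pixels to [0, 1]

# img = prepare('chest_xray/train/NORMAL/example.jpeg')  # hypothetical path
```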

The author has not paid much attention, in this article, to the sheer computing horsepower required to process images through convolutions and pooling operations in a reasonable length of time. High-end machine learning platforms utilize Graphics Processing Units (GPUs) and can cost many thousands of dollars. The author has run vanilla ANNs quite successfully on a 2018 MacBook Pro; however, moving to CNN analysis requires a much higher-specification system.

Google, and other organizations, offer a programming environment based upon the Jupyter Notebook (an interactive environment that combines executable code, its output, and narrative text) with free access to GPU runtimes. This environment is fully online and is known as Google Colab; the free version provides access to very capable hardware, and for a relatively low fee it provides more dedicated access. An added benefit of this arrangement is that Jupyter Notebooks natively support Python, the de facto standard language for machine learning.

The author, in this initial brute-force approach, took advantage of TensorFlow’s tight-knit operability with Python to write a function that automatically generated models varying the number of convolutional layers, the number of fully connected layers (referred to in TensorFlow as Dense), and the number of neurons within each hidden dense layer. This approach yielded 9 different models that were run in serial fashion. The results were written to a log file, and the TensorFlow utility TensorBoard was used to graphically visualize the results.
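The author's actual function is not reproduced here, but a sketch of that kind of generator loop might look like the following; the swept ranges are illustrative stand-ins chosen to yield nine (3 x 3) variants, and `train_images`/`train_labels` are placeholders for the prepared data:

```python
import tensorflow as tf

# Sweep over candidate architectures, logging each run for TensorBoard.
for conv_layers in (1, 2, 3):
    for dense_layers in (0, 1, 2):
        neurons = 64                    # illustrative hidden-layer width
        name = f"{conv_layers}conv-{dense_layers}dense-{neurons}n"

        model = tf.keras.Sequential([tf.keras.layers.Input(shape=(100, 100, 3))])
        for _ in range(conv_layers):
            model.add(tf.keras.layers.Conv2D(32, (3, 3), activation='relu'))
            model.add(tf.keras.layers.MaxPooling2D((2, 2)))
        model.add(tf.keras.layers.Flatten())
        for _ in range(dense_layers):
            model.add(tf.keras.layers.Dense(neurons, activation='relu'))
        model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

        model.compile(optimizer='adam', loss='binary_crossentropy',
                      metrics=['accuracy'])
        tb = tf.keras.callbacks.TensorBoard(log_dir=f"logs/{name}")
        # model.fit(train_images, train_labels, epochs=20, callbacks=[tb])
```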

The Data

A well-known website, at least in AI circles, called Kaggle provides a rich source of data sets for the hobbyist or even seasoned professionals. The particular dataset used in this article was obtained through the Kaggle site. This link is the formal citation for the dataset.

Figure 4

“The dataset is organized into 3 folders (train, test, val) and contains subfolders for each image category (Pneumonia/Normal). There are 5,863 X-Ray images (JPEG) and 2 categories (Pneumonia/Normal).

Chest X-ray images (anterior-posterior) were selected from retrospective cohorts of pediatric patients of one to five years old from Guangzhou Women and Children’s Medical Center, Guangzhou. All chest X-ray imaging was performed as part of patients’ routine clinical care.

For the analysis of chest x-ray images, all chest radiographs were initially screened for quality control by removing all low quality or unreadable scans. The diagnoses for the images were then graded by two expert physicians before being cleared for training the AI system. In order to account for any grading errors, the evaluation set was also checked by a third expert.”

Typical x-ray images for a Normal case (no pneumonia present), Bacterial Pneumonia, and Viral Pneumonia are presented in Figure 5.

Figure 5

Results — Brute Force Approach

Typically, as the learning process continues within a CNN, a number of runs or cycles are made in which the data (in this case the images) is passed through the network in batches (each batch pass is known as an iteration) until all images are exhausted. At this point the analysis has completed what is known as an epoch. As the batches pass through, the associated error is calculated, and the weights and biases are tuned (“intelligently”, using differential calculus to identify likely better values for the tunable parameters) via the backpropagation process discussed earlier. Images are then passed through the updated network again to complete another epoch. Example output from TensorFlow after an epoch is shown in Figure 6.

Figure 6

The process is complete when a target accuracy is met, or when the model has either converged to a value or is clearly diverging. The run length is a controllable hyperparameter and is usually constrained by assigning a maximum number of epochs to complete.
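As a runnable sketch of capping the run length, the snippet below fits a deliberately trivial model on random stand-in data for a maximum of 20 epochs. The early-stopping callback, which halts training once the validation loss stops improving, is this sketch's addition for the convergence/divergence case, not necessarily the article's setup:

```python
import numpy as np
import tensorflow as tf

# Random stand-in "images" so the fit call below actually runs; in practice
# these would be the prepared X-ray tensors and their labels.
x = np.random.rand(64, 100, 100, 3).astype('float32')
y = np.random.randint(0, 2, size=(64, 1))

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100, 100, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Cap the run at a maximum of 20 epochs; stop sooner if the validation loss
# fails to improve for 3 consecutive epochs.
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)
model.fit(x, y, validation_split=0.25, epochs=20, callbacks=[stop_early])
```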

The following charts show the results after 20 epochs and show convergence, with asymptotic behavior, for both the accuracy and the loss. In part 2 of this series we will explore what these results tell us, how we can improve them, and how we evaluate and finally perform predictions against a test dataset the network has not seen. We will then explore the practical application of the network to predict the presence or absence of pneumonia.

Figure 7

Before moving on to Part 2, take a few minutes to watch the video below, which provides an excellent insight into what a CNN is actually “seeing”.

Jason Yosinski. (2015, July 7). Deep Visualization Toolbox [Video]. YouTube. https://www.youtube.com/watch?v=AgkfIQ4IGaM
