Convolutional Neural Networks in Biomedical Image Processing: A Review
Abstract
Purpose of Review
The goal of this review is to investigate the current developments of Convolutional Neural Networks (CNNs) within biomedical imaging applications. The review will begin with an overview of what CNNs are and how they are different from other machine learning (ML) techniques.
Recent Findings
The most cutting-edge research involves implementing the attention mechanism in combination with other networks, such as U-Net (creating so-called “ensemble models”). These aim to combine the best aspects of two or more neural networks, leveraging the strengths of each to create one better model. The two leading ensemble networks in this space are UNETR and Swin UNETR. These techniques apply CNNs in combination with attention mechanisms to yield results that are currently state-of-the-art.
Summary
From this review, we can see that CNNs have demonstrated effectiveness in aiding radiologists and clinicians in a wide range of biomedical imaging tasks. Furthermore, we can see that even with the advent of new ML architectures, most feature a CNN backbone, highlighting the effectiveness of CNNs, both currently and likely for the future.
Introduction
Imaging forms a crucial part of our medical system. Both computed tomography (CT) and magnetic resonance (MR) imaging modalities allow clinicians to obtain a vast amount of data, such as imaging of tumors throughout the body and within various organs. While the sheer number of images is helpful, it also poses a challenge for the industry. On the one hand, the vast amounts of data mean that radiologists quickly become fatigued and overwhelmed. Up to 30% of radiological errors are attributed to radiologist fatigue1.
On the other hand, such large quantities of image data have provided researchers with large numbers of testing datasets. Many imaging datasets have been assembled as a result, such as The Cancer Imaging Archive (TCIA)2 and the Medical Segmentation Decathlon (MSD)3. The development of these large datasets has also fostered large competitions, leading to increased attention from researchers and the broader medical imaging community as a whole.
Convolutional Neural Networks
Motivation
Traditional machine learning (ML) techniques include support vector machines (SVMs), decision tree learning, and random forests4–6. The main disadvantage of these techniques is that they require the necessary features to be extracted and placed into the network before training. A challenge arises when ample data is available, but the extraction of such features is either unknown or difficult. An example of this is the usage of libraries such as PyRadiomics7. This approach involves passing images through many different feature extractors, such as first-order features, shape features, and the Gray Level Co-occurrence Matrix. These features are then passed into a network, such as a multi-layered perceptron8.
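As an illustration of this classical workflow, the following is a minimal sketch of extracting handcrafted features with PyRadiomics7. The feature classes mirror those named above, while the file paths are placeholders for a scan and its region-of-interest mask.

```python
from radiomics import featureextractor

# Configure a PyRadiomics extractor with a few classical feature classes.
extractor = featureextractor.RadiomicsFeatureExtractor()
extractor.disableAllFeatures()
extractor.enableFeatureClassByName("firstorder")  # first-order intensity statistics
extractor.enableFeatureClassByName("shape")       # geometric descriptors
extractor.enableFeatureClassByName("glcm")        # Gray Level Co-occurrence Matrix

# Placeholder paths: the image and its region-of-interest mask.
features = extractor.execute("image.nrrd", "mask.nrrd")
for name, value in features.items():
    print(name, value)
```

Every one of these features must be chosen before training, which is precisely the weakness discussed next.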
However, this process has a significant weakness: it assumes that the filters the images are passed through capture learnable features (i.e. portions of the image that help the model make predictions). Additionally, each image requires significant pre-processing, which adds computational time that is not spent on learning. The chosen filters also run the risk of not being relevant to the final network at all, leading to wasted computational time.
This is where convolutional neural networks (CNNs) come in. They aim to solve these weaknesses by taking in the raw image only and learning filters dynamically. This leads to less image pre-processing time, as only the image is passed to the network instead of image features. Features are also extracted during the learning process, as opposed to extracting many features a priori and hoping that some of them capture relevant information.
CNN Architecture
In CNNs, the only input to the network is an image. Typically, these images are split up into tensors of dimension \(n \times m \times d\), with each picture being \(n \times m\) pixels and having a color channel of \(d\), which is one for a grayscale image and three for a typical red, green, blue (RGB) image, as can be seen in the popular VGG-x variety of CNNs10.
The Convolutional Layer
In the convolutional layer, a “kernel” or “filter” (a learnable parameter) is passed over each image during the forward pass step of model training. During this convolution step, each filter is slid across the height and width of each image. Since both the images and kernels are 2D arrays, a dot product is computed between the kernel and each image patch it covers. Once a single filter has been passed over a single image, this output is fed into an activation layer.
While the description above considers only a single filter, it is important to note that there are many convolutional layers, and therefore many different filters, that are passed over each image during the forward pass. This in turn directly solves one of the key issues brought up during the previous discussion on feature engineering: namely, whether the filters used during pre-processing are representative of the dataset. Since the filters are learned throughout the training process, not only does no pre-processing need to be computed, the specific filters do not need to be known a priori. Instead, one can specify the number of filters in each convolutional layer, which determines how many distinct features the network can learn10.
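To make the sliding-window dot product concrete, here is a minimal NumPy sketch of a single 2D convolution (strictly speaking, the cross-correlation that most frameworks implement as “convolution”). The hand-coded edge-detection kernel stands in for the values a CNN would learn on its own.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid (no-padding) 2D convolution: slide the kernel over the image
    and take a dot product with each patch it covers."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # elementwise product, then sum
    return out

# A classic hand-crafted example: a vertical edge detector. In a CNN the
# kernel values would instead be learned during training.
image = np.random.rand(8, 8)
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])
print(conv2d(image, kernel).shape)  # (6, 6)
```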
Activation Functions
To capture nonlinearities in data, various activation functions are used. Several are given here for completeness. One of the steps during model training is hyperparameter optimization, whereby several variants of a model are trained simultaneously, each with slight changes. An example would be otherwise identical models, each using a different activation function, such as those listed below.
Sigmoid
The sigmoid function is given as Equation 1:
\[ f(x) = \frac{1}{ 1 + e^{-x} } \tag{1}\] which takes a real number \(x\) and compresses it between 0 and 1.
Hyperbolic Tangent
The hyperbolic tangent (tanh) function is given as Equation 2:
\[ f(x) = \frac{1 - e^{-2x}}{ 1 + e^{-2x} } \tag{2}\] which takes any real number \(x\) and compresses it to the range between -1 and 1.
Rectified Linear Unit (ReLU)
The ReLU function is given as Equation 3:
\[ f(x) = \max \left( 0, x \right) \tag{3}\]
which takes any real number \(x\) and outputs \(x\) if \(x\) is positive, and 0 otherwise. The ReLU is the most common activation function in CNNs due to its fast computation time11.
Softmax
This activation function is primarily used for multi-class classification. It takes the raw network outputs and assigns the output \(y\) a probability of belonging to each class \(i\):
\[ P (y = i) = \frac{ e^{z_i} }{ \sum_{j=1}^{K} e^{z_j} } \tag{4}\] where \(z_i\) is the raw neural network output (i.e. output from the previous layer), and \(K\) is the total number of classes.
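For reference, Equations 1–4 translate directly into a few lines of NumPy. The max-subtraction in the softmax is a standard numerical-stability trick, not part of Equation 4 itself.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # Equation 1: squashes x into (0, 1)

def tanh(x):
    return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))  # Equation 2: (-1, 1)

def relu(x):
    return np.maximum(0.0, x)  # Equation 3: zero for negative inputs

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()         # Equation 4: outputs sum to 1

logits = np.array([2.0, -1.0, 0.5])
print(softmax(logits))  # a valid probability distribution over three classes
```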
Fully Connected Layer
The fully-connected layer is similar to that in the multilayer perceptron (MLP)8. That is, each input node is connected to every output node, with each output computed as a weighted sum of the inputs plus a bias. This fully-connected layer is then connected to either 1.) a softmax activation function if performing multi-class classification, or 2.) a linear activation function if performing regression12,13.
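Putting the pieces together, the sketch below assembles the layers discussed so far (convolutions, ReLU activations, and a fully connected head) into a small, hypothetical PyTorch classifier. All layer sizes are illustrative choices rather than any published architecture.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """An illustrative grayscale-image classifier: two convolutional
    blocks followed by a fully connected head."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # 32x32 -> 16x16
        )
        # The fully connected layer maps flattened feature maps to class
        # logits; softmax is applied implicitly by nn.CrossEntropyLoss.
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

model = SmallCNN()
logits = model(torch.randn(4, 1, 64, 64))  # batch of four 64x64 grayscale images
print(logits.shape)                        # torch.Size([4, 2])
```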
Data Preprocessing
While CNNs do not require the kind of feature engineering commonly found in MLPs, some amount of image preprocessing is typically done. Such pre-processing steps include random flips, rotations, scaling, and cropping14. When these so-called data augmentation pipelines are applied probabilistically, one can effectively multiply the size of a single dataset with very little additional effort.
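As a concrete example, a probabilistic augmentation pipeline of the kind described above might be assembled with torchvision; the specific probabilities and sizes here are illustrative.

```python
from torchvision import transforms

# Each transform is applied randomly, so every epoch sees a slightly
# different version of each image, effectively multiplying the dataset.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # scaling + cropping
    transforms.ToTensor(),
])
# Usage: tensor = augment(pil_image), typically inside a Dataset's __getitem__.
```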
Award-winning CNN Architectures
Typically, when a CNN (or any neural network) is applied to a new space, existing models are used and then trained on data for the specific use case. In this section, various award-winning CNN architectures are briefly examined.
Architecture | Year | Architecture | Year |
---|---|---|---|
LeNet-515 | 1998 | ResNet16 | 2016 |
AlexNet14 | 2012 | Xception17 | 2017 |
VGGNet18 | 2014 | DenseNet19 | 2017 |
GoogLeNet20 | 2014 | U-Net21 | 2015 |
It is important to note that this list is by no means exhaustive, and more network architectures emerge constantly. What follows are both improvements of the architectures listed in the table above as well as entirely unique architectures.
CNN Applications in Medical Imaging–Classification
COVID-19
Most of the studies in this section aim to solve a major issue with testing that was experienced during the COVID-19 pandemic. Namely, the reverse transcriptase-polymerase chain reaction (RT-PCR) test was the most ordered test22. However, this test is expensive, slow, and was in extremely high demand during the pandemic. As such, many of the following CNN papers attempted to circumvent this issue by using imaging, typically chest X-rays or CT scans of the chest, to diagnose patients much faster than RT-PCR allows.
Twice Transfer Learning
The first study23 under examination proposed a fine-tuned CNN pretrained on the ImageNet dataset using the DenseNet architecture19. After pretraining, the model was fine-tuned on the NIH ChestX-ray14 dataset24, with COVID-19-specific data coming from the authors’ own database, which is not publicly available. As is typical in machine learning, the authors explored many variations of the network, specifically different fine-tuning schemes (e.g. training with only ImageNet, ImageNet and the NIH ChestX-ray14 database, etc.). From this empirical experimentation, the authors showed that transfer learning on the NIH ChestX-ray14 database yielded the best results. While the authors’ claim of nearly 100% accuracy applies only to their training dataset, and therefore says little about generalization, it does show good convergence. The authors do note that more data should be used to better analyze the resulting model’s generalizability.
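A minimal sketch of this twice-transfer-learning setup, assuming torchvision’s pretrained DenseNet-121 (the paper’s exact model variant and class labels may differ):

```python
import torch.nn as nn
from torchvision import models

# Stage 1 "transfer": start from a DenseNet-121 pretrained on ImageNet.
model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)

# Optionally freeze the convolutional backbone so only the new head trains.
for param in model.features.parameters():
    param.requires_grad = False

# Replace the ImageNet head with a task-specific one, e.g. three classes
# (hypothetically: normal / other pneumonia / COVID-19).
model.classifier = nn.Linear(model.classifier.in_features, 3)
# Stage 2 would then fine-tune this model on chest X-ray data with a
# standard training loop.
```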
Transfer Learning with AlexNet
Here we examine one of the first studies on using X-ray and CT imaging to help diagnose COVID-1925. This paper used the AlexNet architecture14 with transfer learning to allow the network to learn how to differentiate between patient X-rays and CT scans that do and do not show clinically diagnosed COVID-19. Unlike the previous study, the authors here also release their training dataset as part of their work. In addition, the authors propose a custom CNN that is simpler than AlexNet, ideally to allow for increased throughput. The accuracy of both networks was good (>90%); however, the dataset featured a small number of X-ray images (100), which could lead to the model overfitting and memorizing training data.
Heart Disease
CNN and LSTM
Two or more networks that are combined to make one larger model are called “ensemble networks”26. Here, the authors used a 24-layer CNN in combination with a Bidirectional Long Short-Term Memory (BiLSTM) network to extract features from electrocardiogram (ECG) data. The network was designed to detect atrial fibrillation, which has been shown to increase the risk of heart failure27. The authors used the 2017 PhysioNet/CINC dataset28 and were able to achieve an accuracy of 89.3%. It is important to note that the PhysioNet/CINC dataset is synthetic and the authors employed a custom loss function. While synthetic datasets can be realistic, they can also lead to model results that are not representative of reality. Additionally, custom loss functions are generally not advised due to the resulting difficulty in comparing performance between models (e.g. one with a custom loss function and one with a standard loss function).
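The general shape of such a CNN + BiLSTM ensemble is sketched below in PyTorch. The layer counts, channel sizes, and four-class output are illustrative stand-ins, not the authors’ 24-layer configuration.

```python
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    """Hypothetical CNN + BiLSTM sketch for single-lead ECG classification."""
    def __init__(self, num_classes: int = 4):
        super().__init__()
        # 1D convolutions extract local waveform morphology from the ECG.
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # The BiLSTM models longer-range rhythm (temporal) dependencies.
        self.lstm = nn.LSTM(input_size=64, hidden_size=64,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 64, num_classes)

    def forward(self, x):              # x: (batch, 1, samples)
        feats = self.cnn(x)            # (batch, 64, samples // 4)
        feats = feats.transpose(1, 2)  # (batch, time, channels) for the LSTM
        out, _ = self.lstm(feats)
        return self.fc(out[:, -1])     # classify from the final time step

model = CNNBiLSTM()
print(model(torch.randn(8, 1, 3000)).shape)  # torch.Size([8, 4])
```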
ResNets and S12L-ECG
While the previous section only dealt with single-lead ECGs, the authors of this study29 utilized the short-duration, standard, 12-lead ECG (S12L-ECG), which is the typical ECG found in clinical settings. The authors also utilized a dataset with over 2 million labeled S12L-ECG exams. The fundamental network is a variation of the ResNet described in16, with the model capable of classifying six types of ECG abnormalities. The authors were able to achieve an F1 score (a measure of accuracy for multi-class classification tasks) of above 80%. While this result is significant, and the authors did utilize a much larger (and more realistic) dataset than the authors of28, they used their validation dataset exclusively for tuning the network. Typically, it is best practice to utilize a different portion of the dataset for training, tuning, and testing to ensure that the model is not over-fitting and the performance metrics are accurate30.
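That best practice amounts to carving the data into three disjoint subsets, as in this scikit-learn sketch (the array shapes are placeholders standing in for labeled ECG exams):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for labeled ECG exams.
X = np.random.rand(1000, 4096)           # 1000 exams, 4096 samples each
y = np.random.randint(0, 6, size=1000)   # six abnormality classes

# First hold out a test set that is never touched during development...
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)
# ...then split the remainder into training and validation (tuning) sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.15, stratify=y_dev, random_state=42)
```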
CNN Applications in Medical Imaging–Segmentation
Broadly, segmentation in medical imaging is defined as being able to isolate a region of interest (ROI) from a patient scan or image. Unlike classification, segmentation focuses on where something is in addition to whether it is present31. Consider, for example, a tumor classification network like those examined in the previous sections: it can only tell you whether or not an image contains a tumor, not where that tumor may be. Segmentation, on the other hand, goes further. Not only does it tell you whether there is a tumor, it outlines (or segments) the tumor from the rest of the image. This is extremely helpful, as combinations of different segmentation planes from the same patient (e.g. axial, sagittal, and coronal scans from CT or MRI) allow for the region of interest (like a tumor) to be visualized as a 3D model, which can then be used in other analysis techniques, such as finite element analysis or computational fluid dynamics.
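The distinction is easiest to see in the shapes of the outputs: classification yields one label per image, while segmentation yields a label per pixel. A toy illustration:

```python
import numpy as np

image = np.random.rand(256, 256)        # a placeholder grayscale scan

# Classification: a single label for the whole image ("tumor present?").
classification_output = 1

# Segmentation: a label for every pixel, outlining where the tumor is.
segmentation_mask = np.zeros((256, 256), dtype=np.uint8)
segmentation_mask[100:140, 90:150] = 1  # hypothetical tumor region

print(f"Tumor covers {int(segmentation_mask.sum())} of "
      f"{segmentation_mask.size} pixels")
```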
Note: Thus far, most of the studies considered have utilized architectures that were not specifically designed for medical image segmentation. An example of this can be found in the previous section, where the authors utilized a ResNet architecture29. While this approach is valid, many of the following architectures utilize U-Net as a backbone, which is a CNN that was designed from the ground up for biomedical image segmentation21. Many of these networks also utilize transformers, which were first described in a landmark paper32 and underpin much of the current public interest in ML. While a description of this network is outside the scope of this review, the interested reader is directed to the corresponding breakthrough paper, as well as the first vision transformer (ViT) paper33.
UNETR
The UNETR architecture34 is one of the first to combine a CNN-based network (U-Net) with a ViT. One of the weaknesses that the authors found in architectures such as U-Net is the inability to learn long-range spatial structures for 3D medical image segmentation. The structure of the ViT33 leads to it naturally being context aware. Therefore, the authors elected to use image patch embeddings (similar to tokens in current large language models) and merge them with a U-Net-based decoder to produce the segmentation output. This allows the U-Net portion to capture local image dependencies, while the transformer portion of the network captures global trends. The authors showed significant progress on multiple datasets for various organ segmentations. While the jump in accuracy is important, transformers do require more compute than traditional CNNs, as highlighted by the significant number of computations in the transformer network32.
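For readers who want to experiment, an off-the-shelf UNETR implementation is available in MONAI. The sketch below is a minimal instantiation with illustrative settings (a single-channel CT volume and 14 output classes), not the paper’s exact configuration.

```python
import torch
from monai.networks.nets import UNETR

model = UNETR(
    in_channels=1,           # e.g. a single-channel CT volume
    out_channels=14,         # e.g. 13 organs + background
    img_size=(96, 96, 96),   # 3D patch fed to the ViT encoder
    feature_size=16,
)

patch = torch.randn(1, 1, 96, 96, 96)  # one 96^3 sub-volume
print(model(patch).shape)              # torch.Size([1, 14, 96, 96, 96])
```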
Swin UNETR
Shifted window (Swin) transformers were proposed as a hierarchical ViT that allows for more efficient computation than traditional ViTs35,36. Here, the authors proposed utilizing a Swin ViT to replace the conventional ViT37. The authors highlighted the various morphological and textural inhomogeneities of brain tumors, which motivated the use of Swin ViTs. The “shifted window” portion of the Swin ViT allows the model to capture local details with small windows (sections) of an image and global dependencies via larger windows. These features are then passed to the U-Net portion of the network, which helps preserve small details and construct the segmentation mask. Additionally, Swin UNETR is designed to handle 3D MRI images, meaning that 3D transformers and 3D convolutions are utilized throughout the network. The authors demonstrated that the Swin UNETR model was able to outperform many other models on the BraTS 2021 dataset38. While Swin UNETR’s performance is exceptional, it is a specialized network designed to perform 3D brain tumor segmentation from 3D MRI images. A potential limitation of this network is its inability to generalize to other organs or imaging modalities; for such tasks, a network such as UNETR34 would likely be better suited.
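MONAI also ships a Swin UNETR implementation. The sketch below pairs it with sliding-window inference, the usual way a full MRI volume (too large for GPU memory in one shot) is segmented patch by patch. Sizes follow a BraTS-style setup (four MRI modalities in, three tumor sub-regions out), and argument names may vary across MONAI versions.

```python
import torch
from monai.inferers import sliding_window_inference
from monai.networks.nets import SwinUNETR

model = SwinUNETR(img_size=(128, 128, 128), in_channels=4, out_channels=3)

volume = torch.randn(1, 4, 240, 240, 155)  # one multi-modal brain MRI
model.eval()
with torch.no_grad():
    # Slide a 128^3 window across the scan and stitch predictions together.
    logits = sliding_window_inference(volume, roi_size=(128, 128, 128),
                                      sw_batch_size=1, predictor=model)
print(logits.shape)  # torch.Size([1, 3, 240, 240, 155])
```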
MedSAM
The so-called Medical Segment Anything Model (MedSAM)39 takes inspiration from Meta AI’s “Segment Anything Model” (SAM)40, but with applications specific to medical imaging. Inspiration from SAM also includes prompt-based segmentation, where the user can provide an input to the model besides an image, allowing for bounding boxes, points, or text throughout the segmentation process. This user-in-the-loop interaction is in stark contrast to nearly all of the studies thus far, highlighting the flexibility of the model in handling various medical tasks. MedSAM was trained on a large (and crucially, image-modality-diverse) dataset of over 1 million image-mask pairs, including modalities such as CT, MRI, and X-rays. Dataset annotations also include anatomical and pathological features in addition to the raw imaging. The authors showed that MedSAM’s performance not only exceeds that of many existing state-of-the-art segmentation foundation models, but also outperforms many specialist models across over 50 internal and external validation tasks. While this is impressive, the authors also point out MedSAM’s imaging modality imbalance, as a majority of the dataset consists of MRI, CT, and endoscopy images. All things considered, the overall performance of MedSAM showcases its capability across a wide breadth and depth of segmentation tasks, and gives confidence that further fine-tuning may lead to even better results.
The Data and Reproducibility Problems
While all of these studies highlighted the application of CNNs (and other networks, such as ViTs, in later sections), it is also crucial to highlight the importance of data in this process, as well as some of the potential issues in this current wave of, bluntly, hype in the ML space. There are two main issues worth noting in this section: datasets and reproducibility.
Dataset Issues
One of the primary issues faced in the ML space is that of the datasets themselves: where they come from, how they are curated, and the potential biases that they contain41. There have been multiple examples of accidental bias in large-scale ML applications. A famous example is Google Photos classifying people with dark skin as “gorillas”42. While this is a relatively harmless example, it is not difficult to extend this to harmful areas, such as mortgage companies turning to ML algorithms to approve or deny borrowers based on their credit score, predictive policing systems inadvertently perpetuating racial bias, or annotation disparities in medical imaging datasets. These are not toy examples; all of these situations have occurred already43–46. As a result, knowing details of datasets that are widely used is extremely important.
Fortunately, there are some techniques in place to help combat such biases. Multiple studies have examined potential standards for categorizing and managing biases in ML41,47. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology was used to identify types of biases (specifically biases in data, algorithms, and user interaction) in the literature from 2017 to 202241. Furthermore, the authors pointed out other tools, such as Aequitas and AI Fairness 360, that allow for bias mitigation.
On the government side, authors from the National Institute of Standards and Technology (NIST)47 proposed a socio-technical framework to mitigate bias in ML systems. This framework combines both societal and technical aspects, which is crucial. Bias was grouped into three categories: systemic bias, statistical bias, and human bias. Guidance on reducing these biases and identifying their origins is also given, providing tools for both technical and non-technical individuals. Furthermore, whereas many technical fields focus only on technical solutions, systems that touch both social and technological aspects (which ML systems do) should have a corresponding socio-technical solution, which these authors provide.
Reproducibility
One other major issue with the current ML landscape is that of reproducibility. An example of this is the response by Haibe-Kains and colleagues49 to a study from Google Health48. The response’s lead author, Benjamin Haibe-Kains, noted that the Google Health team gave so little information about their code and its implementation that it seemed more like an advertisement than a paper. Furthermore, the Google Health study was not an isolated incident. Rather, it is part of an observed trend in the industry that many experts agree is worrying50. This, in combination with the fact that many large models are considered black boxes, leads to difficulty in reproducibility and trust, especially when many datasets are proprietary50.
There are some potential solutions on the horizon. Yoshua Bengio, widely considered a leader in the field, organized a reproducibility challenge, where participants try to replicate studies from the top conferences in the ML field (e.g. NeurIPS, ACL, ICML, etc.)50,51. This is also starting to make an impact. As an example, the Swin UNETR paper37 has a page on Papers With Code, an online repository that many researchers use to host competitions and datasets52.
Future Outlooks
The cutting edge of ML in the medical imaging space is currently ViTs, typically in combination with CNNs37. The advent of the transformer architecture has drastically transformed the field as a whole, and as such, it is being applied wherever possible. And for good reason: the Swin UNETR network is the top-ranked multi-organ CT segmentation algorithm as of this writing53.
The field is also at a crossroads with regard to large foundation models versus smaller, more specialized ones. There are popular large language models, such as OpenAI’s ChatGPT and Anthropic’s Claude (among others), that are able to perform a stunning number of tasks. We also saw the emergence of the segmentation equivalent of such a model with MedSAM39. On the other hand, we have models such as Swin UNETR37 that are more specialized. There is still no consensus as to which direction is better, and this remains an active area of research54.
Conclusions
Overall, CNNs form a crucial backbone in biomedical image processing. From their inception to today’s state-of-the-art models, they continue to be used, from basic classification to more advanced 3D image segmentation, and given their demonstrated effectiveness, their use shows no sign of declining. Even with the breakthrough of the transformer architecture, we see CNNs still being used.
The ability of CNNs to learn hierarchical features from raw pixel data removes the need for feature engineering, as the network itself learns important features from its training data. This has led to their widespread adoption in various medical areas, such as imaging, pathology, radiology, and (more recently) radiomics. The versatility of CNNs has also led to many specialized architectures tailored to specific tasks, as was seen in previous sections. Examples include networks tuned to process CT, MRI, or X-ray images. These allow for more personalized medicine, to say nothing of drastically faster disease diagnosis.
In summary, the use of CNNs in biomedical image processing has only expanded over time, and it most likely will continue to expand. As we have seen, the best uses of CNNs have been in ensemble models combined with other network architectures such as vision transformers. While CNNs are only going to be used more frequently in this space, their usage in ensemble models specifically will be a large driver of their adoption, ultimately to help improve the lives of all.
References
Citation
@online{gregory2024,
author = {Gregory, Josh},
title = {Convolutional {Neural} {Networks} in {Biomedical} {Image}
{Processing:} {A} {Review}},
date = {2024-12-24},
url = {https://joshgregory42.github.io/posts/2024-12-24-ml-review/},
langid = {en}
}