
Perspectives on Artificial Intelligence

Artificial intelligence
Thesis
Thoughts
Writings
Reflections on how we can apply AI responsibly
Author

Josh Gregory

Published

September 17, 2025

My Master’s thesis focused heavily on applying state-of-the-art artificial intelligence models to problems in the biomedical space, specifically segmentation of vasculature (that is, creating 3D models of blood vessels from CT or MRI scans) and predicting how blood flows through a clot (similar to what would be encountered in a stroke patient). Instead of creating new AI models, I took existing ones and applied them as tools, an approach I have seen many others take. My thesis spanned the entire AI workflow: everything from finding data and choosing which models to use, to figuring out how to use the same NVIDIA hardware OpenAI uses to train ChatGPT, to examining the explainability of my models. What follows are some reflections on what I learned in the trenches.

My thesis was the first applied-AI project in my lab. Because of that, there was no hand-holding: no assignment sheet, no data, no models to pick from. This is written from the perspective of someone who had to learn all of that from scratch, most of the time the hard way.

Data Is the Most Important Part

I spent a vast amount of time during my thesis finding data. For my segmentation project in particular, I needed patient MRI data that was also labeled with the segmented areas, that is, the regions of each scan annotated around the blood vessels of interest. Most of these data were either (understandably) unavailable due to patient privacy concerns, restricted to internal hospital use, or of extremely poor quality. This necessitated a technique called data augmentation, where I took the clean data I did have, made copies of it, and applied random crops, rotations, noise injection, and so on. While techniques like this work to a degree, deep neural networks currently need a large amount of data to generalize well. I think this is something many people overlook. They assume a few hundred data points will suffice; they will not. A good starting point is several thousand examples, and ideally far more (on the order of tens of thousands or hundreds of thousands). While there are techniques that do not rely on deep learning and require less data (e.g., Random Forest, multivariate linear regression, \(k\)-nearest neighbors, among others), the flexibility that deep neural networks provide comes at the cost of data scale.
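As a concrete example of what augmentation looks like in practice, here is a minimal sketch using torchvision. The specific transforms and parameter values are illustrative stand-ins, not the exact pipeline from my thesis:

```python
import torch
from torchvision import transforms

# Illustrative augmentation pipeline for 2D image slices: random rotations,
# random crops, and Gaussian noise injection applied to copies of clean data.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.05 * torch.randn_like(x)),  # noise injection
])

# augmented = augment(pil_image)  # apply to each copy of an image
```

Each pass through the pipeline produces a slightly different version of the same image, which stretches a small dataset further but does not replace genuinely new examples.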

The Compute and Energy Required

I would be remiss if I didn’t mention the popular techniques of fine-tuning and transfer learning. These techniques leverage existing models, with the advantage that they need less data to converge. Said another way, I would need significant amounts of both data and compute to train a large network from scratch, and I might not have either. Instead, I can take a pre-trained model from the likes of OpenAI or Google DeepMind, one that has already seen vast amounts of data, and use it as a starting point for my domain-specific task, typically needing less data and compute than training from scratch.
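A minimal transfer-learning sketch with torchvision is below; the choice of ResNet-18 and the placeholder head size are assumptions for illustration, not the specific setup from my thesis:

```python
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

# Start from a network pre-trained on ImageNet instead of training from scratch.
model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Freeze the pre-trained backbone so only the new head is trained at first.
for param in model.parameters():
    param.requires_grad = False

# Swap the classification head for one sized to the domain-specific task
# (num_outputs is a placeholder).
num_outputs = 2
model.fc = nn.Linear(model.fc.in_features, num_outputs)
```

Fine-tuning then trains only the new head (or, later, unfreezes some of the deeper layers) on the smaller domain-specific dataset.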

There is also the aspect of model architecture design itself. Many of the deep neural networks I used were designed by researchers at Microsoft and Google. I could have tried creating my own networks, but odds are that the teams at Google and Microsoft who have spent years designing networks have created something I can’t compete with. My time was better spent finding data and using their models than trying to create my own architecture.

The chart below also highlights the incredible amounts of compute required to train some of these models.

This compute (and therefore energy) also translates into incredible amounts of capital needed to train these models:

This further highlights how beneficial it can be to use a pre-trained model. A company like Google can shoulder the extreme capital, compute, and energy investments to train these models initially, and then we can take those models and adapt them for our own domain-specific applications.

Do Not Pick One Model

This was a mistake I made during the first pass of my thesis. I read a bunch of academic papers, compared benchmarks, and eventually settled on the model that performed best on those benchmarks. How could it not work great for me? The benchmarks were similar to my work, and the network used all of the fanciest new tools (it was a combination of a vision transformer and a convolutional neural network). Yet I couldn’t get it to work well on our dataset, no matter how hard I tried.

During the second pass of my thesis, I took a different approach. I started with several different networks and put each of them through the same training, tuning, and evaluation process, only making a recommendation at the very end. When I did this, something interesting happened: the best model for my use case performed worse on some benchmarks than one of the other models I was training. In more technical language, the EfficientNet family of models was created to outperform the ResNet families while being much smaller [1], and this was shown to hold on many different benchmarks. However, for my specific use case, the ResNet architectures outperformed EfficientNet. Had I not explored many models, this small but unintuitive discovery would not have occurred. And it’s an important one: had I trained exclusively EfficientNets, I would have left significant but unrealized performance gains on the table.
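In practice, this just means writing the pipeline so the architecture is swappable. Here is a sketch of that idea; the single-output regression head and the specific candidate models are illustrative assumptions, not my exact configuration:

```python
import torch.nn as nn
from torchvision import models

# Candidate architectures that all go through the same training/evaluation pipeline.
candidates = {
    "resnet50": models.resnet50(weights=None),
    "efficientnet_b0": models.efficientnet_b0(weights=None),
}

for name, model in candidates.items():
    # Give every candidate the same single-output head so comparisons are fair.
    if hasattr(model, "fc"):                       # ResNet-style models
        model.fc = nn.Linear(model.fc.in_features, 1)
    else:                                          # EfficientNet-style models
        in_features = model.classifier[-1].in_features
        model.classifier[-1] = nn.Linear(in_features, 1)

    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
    # ...train, tune, and evaluate each candidate on the same data splits,
    # and only pick a winner at the very end.
```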

Design for Scale in Both Directions

As we saw in the charts above, current models use incredible amounts of compute and cost millions of dollars to train, with the training process run in parallel across multiple Graphics Processing Units (GPUs) and other hardware accelerators such as Google’s Tensor Processing Units (TPUs). By itself, PyTorch (and, to a lesser extent, TensorFlow) does not easily parallelize training across multiple GPUs; custom code has to be written for the training process to happen in parallel. The same can be said for data storage formats. Initially, I stored many tabular datasets in comma-separated values (CSV) format, which can become a bottleneck because of the large number of small, random file operations the data loader has to perform during training (open a file, read it into memory, close it, repeat). Other datasets, such as images, are usually stored as a giant folder of individual files, which can lead to even more significant delays in training.

The point here is to think about scale before starting the training process, typically during the data exploration phase. This is when decisions like switching from CSV to something like Apache Parquet, or packaging image data with something like the WebDataset format, are easy to make. On the code side, I ended up using PyTorch Lightning to keep my code hardware agnostic and able to train on whatever accelerators were available, parallelizing automatically (e.g., my code detects the number of GPUs available and dynamically allocates resources during training); a minimal sketch of this setup is below. It’s also important to design for small-scale applications when it comes to running inference on the models (that is, using a trained model to make predictions). For my thesis, I wanted every one of the models to be able to run locally on a laptop, something every hospital has access to. This also preserves patient privacy, something that is both close to my heart and required by law under HIPAA.
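The sketch below illustrates both ideas under stated assumptions: the file names are placeholders, and `model` and `data` stand in for a LightningModule and LightningDataModule:

```python
import pandas as pd
import lightning as L

# Storage: convert a tabular CSV dataset to Parquet once, up front
# (file names are placeholders; requires pyarrow or fastparquet).
pd.read_csv("dataset.csv").to_parquet("dataset.parquet")

# Compute: a hardware-agnostic Trainer. Lightning detects whatever
# accelerators are present (CPU, one GPU, several GPUs) and parallelizes.
trainer = L.Trainer(
    accelerator="auto",   # CPU, CUDA GPU, Apple MPS, ...
    devices="auto",       # use however many devices are available
    max_epochs=50,
)
# trainer.fit(model, datamodule=data)
```

The same script then runs unchanged on a laptop CPU or a multi-GPU node, which is exactly the scale-in-both-directions property I wanted.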

Designing for Ease

During my segmentation work, it became necessary for my models to read imaging formats common in the medical space. CT and MRI machines typically store images in formats like DICOM or NIfTI, which are different from the typical image formats we encounter in everyday life, like JPEG or PNG. This led me to the Medical Open Network for Artificial Intelligence (MONAI) library, which is built from the ground up to be compatible with both DICOM and NIfTI. This had an added benefit: my networks could read patient imaging directly from the machine, with no conversion to JPEG necessary. That dramatically lowers the barrier to entry for a medical professional (not an “AI person”) to use my networks, which matters. These networks are fundamentally designed to help people, and because of that, I felt a responsibility to make them as easy as possible for people outside the AI space to use. The onus is on me to meet users where they are, not on the user to become a software developer to use my tool.
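As a rough illustration of what this looks like with MONAI, here is a minimal loading and preprocessing sketch; the file path, target size, and choice of transforms are assumptions for illustration:

```python
from monai.transforms import Compose, LoadImage, EnsureChannelFirst, ScaleIntensity, Resize

# Read a scan in its native medical format (NIfTI here; MONAI's readers also
# handle DICOM series) and prepare it for a network, with no JPEG conversion.
preprocess = Compose([
    LoadImage(image_only=True),            # reads .nii/.nii.gz (and DICOM) directly
    EnsureChannelFirst(),                  # put the channel dimension first
    ScaleIntensity(),                      # normalize intensities to [0, 1]
    Resize(spatial_size=(128, 128, 128)),  # placeholder volume size
])

volume = preprocess("patient_scan.nii.gz")  # tensor ready for the model
```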

Additionally, I found it important to add explainability to my models. This is becoming more recognized in the space, but is (in my opinion) not yet treated as seriously as it should be. Dario Amodei, the CEO of Anthropic, wrote about the need for interpretability in large language models, and I believe there is a similar need for interpretability work in all AI models that touch healthcare. My models predicted a single value (permeability, which has a specific mathematical definition in this context). But a model that takes in an image and just produces a value is not enough. I wanted a radiologist to be able not only to take that number, but also to question it and have some kind of dialogue with the model. The last thing I want is for a radiologist to disregard their intuition, built up over years, and simply trust whatever comes out of a model. The healthcare professional and the AI model have to work together and supplement each other.

To that end, I used the Captum library, and the Grad-CAM algorithm specifically, to create heatmaps of the regions my model focuses on to make its decisions. This allows for a model that is more “auditable”: one that an end user can actually examine, rather than being left to wonder why it spit out a specific number.
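A minimal sketch of that workflow with Captum is below. The backbone, target layer, and dummy input are stand-ins for whichever trained model is being audited, not my exact setup:

```python
import torch
from captum.attr import LayerGradCam, LayerAttribution
from torchvision.models import resnet18

# Stand-in model; in practice this would be the trained network being audited.
model = resnet18(weights=None).eval()

# Grad-CAM attributions with respect to the last convolutional block.
gradcam = LayerGradCam(model, model.layer4)

image = torch.randn(1, 3, 224, 224)            # placeholder input tensor
attributions = gradcam.attribute(image, target=0)

# Upsample the coarse attribution map back to the input resolution so it can
# be overlaid on the original image as a heatmap.
heatmap = LayerAttribution.interpolate(attributions, (224, 224))
```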

Keeping an Open Dialogue

Training AI models has raised concerns from environmental groups about energy use and greenhouse gas emissions, and for good reason: training these models produces a lot of greenhouse gases [2]. As someone who is creating these technologies, I think it is vital to keep an open dialogue with people outside this space about how such models are developed and used. I believe that training a model like Google DeepMind’s AlphaFold is well worth the greenhouse gas emissions because of its transformative potential for human health. But I’m also biased and shouldn’t be the only voice in the room. I think it is incredibly important that those of us who create models, both large and small, have an open dialogue with people of different backgrounds and priorities. We should be talking with artists, patients, English teachers, parents, and everyone else who has a stake in this technology as much as we do, because an emotional understanding of the people who use these technologies is just as important as the technology itself.

References

1. Tan, M. & Le, Q. V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. (2020).
2. Luccioni, A. S., Viguier, S. & Ligozat, A.-L. Estimating the carbon footprint of BLOOM, a 176B parameter language model. J. Mach. Learn. Res. 24 (2023).

Citation

BibTeX citation:
@online{gregory2025,
  author = {Gregory, Josh},
  title = {Perspectives on {Artificial} {Intelligence}},
  date = {2025-09-17},
  url = {https://joshgregory42.github.io/posts/2025-09-17-emotions/},
  langid = {en}
}
For attribution, please cite this work as:
Gregory, J. Perspectives on Artificial Intelligence. https://joshgregory42.github.io/posts/2025-09-17-emotions/ (2025).
© 2025 Josh Gregory