
In this week’s blog, my focus is on bias exhibited by artificial intelligence (AI) systems deployed in health care. Bias in AI solutions has been one of the main targets of AI regulation. We will explore the legal frameworks envisioned to address bias, as well as research efforts to characterize bias in AI systems.

Regulations addressing bias

Over the past few years, there has been increasing scrutiny surrounding AI-based solutions. The incredible rate at which AI has progressed has spurred great optimism and a deep-seated anxiety about negative impacts. The European Union took the lead in passing some landmark legislation, which is now going into effect. The U.S. has followed suit with a range of efforts, though actual legislation has yet to pass. Let’s examine what these efforts say about dealing with bias in AI solutions and in health care.

EU AI Act

The EU AI Act has much to say about handling bias. First and foremost, models should be transparent about how they are trained, validated, deployed and monitored. Decisions reached must be explainable. AI governance should ensure that the data used to train models are representative of the population the model is intended to serve. Bias can creep in from the data (data bias) and from the algorithm used (algorithmic bias). Biased outcomes may lead to discriminatory behavior, which is outlawed. These strictures particularly apply to “high risk AI systems” that can impact real people. Health tech applications certainly are that!

Blueprint for an AI Bill of Rights

In the fall of 2022, the Biden administration promulgated the Blueprint for an AI Bill of Rights. One of the main pillars of this declaration was “Algorithmic Discrimination Protections.” The potential for AI solutions to enable discrimination against disadvantaged populations was recognized as a real threat. The other rights in the document addressed themes similar to those of the EU AI Act: promoting transparency and data governance, building safe and effective systems, and building systems that support humans.

Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence

Last fall, following up on the blueprint, the White House released an executive order (EO) on the development of safe, secure and trustworthy AI systems. Again, safety and trust are framed around preventing uses of the technology that lead to discrimination. The EO established aggressive timelines for various actions by government agencies. This fact sheet from the U.S. Department of Health and Human Services (HHS) summarizes the impact of the EO on agencies such as the National Institutes of Health (NIH), Centers for Medicare & Medicaid Services (CMS) and U.S. Food and Drug Administration (FDA). HHS has been tasked with identifying how to avoid discrimination in AI systems and with establishing a framework for identifying and capturing clinical errors in deployed AI systems.

2020-2024 Progress Report: Advancing Trustworthy Artificial Intelligence Research and Development

This report chronicles the progress to date on the EO mentioned above, and indeed a lot has been accomplished. HHS published guidelines for identifying and mitigating bias in health care AI. The article, published in the Journal of the American Medical Association (JAMA), lays out the guidelines:

  • Promote health care equity in all phases of development
  • Ensure transparency and explainability
  • Engage all stakeholders (patients, providers and community); explicitly identify fairness issues and trade-offs
  • Establish accountability for equity in outcomes

Guidelines are great. Adherence?

NIH has invested, and continues to invest, in programs that investigate bias in algorithms. One of the more recent challenges it has conducted is called “Bias detection tools in healthcare challenge.”

The National Institute of Standards and Technology (NIST) is developing guidelines to manage risk, most recently in its publication “AI Risk Management Framework (RMF): Generative AI Profile.” The generative AI profile builds on the RMF published in 2023. The word “bias” occurs 140 times in this draft document! Broadly, it distinguishes systemic bias, statistical (data-based) bias and human-cognitive bias.

All these efforts can be labelled as works in progress. A lot of agencies (and people) are clearly invested in addressing concerns such as bias and equity in health care applications.

Research studies investigating bias

An incredible amount of research has gone into studying bias in AI models. A recent survey, “The pursuit of fairness in AI models: A survey,” notes that about 2,000 papers were published on this topic in 2023! In this blog, I am focusing on just one very recent publication.

The paper published this summer in Nature, “The limits of fair medical imaging AI in real-world generalization,” is entirely focused on bias and fairness in AI systems. MIT computer science researchers, in cooperation with the Emory University School of Medicine, conducted a comprehensive study exploring the nature and extent of bias in AI models. Their study involved interpreting radiology images using AI models trained on publicly available datasets. Not just one or two models, but a grand total of 3,456 models! Let’s dive in to see what they did and what they concluded.

Their primary focus was chest X-ray (CXR) analysis. Secondarily, they also investigated models that interpret dermatology and ophthalmology images. In the CXR domain, they used six different public datasets. Sex and race attributes were available for a few of the datasets, and all of the datasets had age information. They trained the models on four binary classification tasks: no finding (normal), presence of “effusion,” presence of “pneumothorax” and presence of “cardiomegaly.” The goal of the study was to determine the impact of demographic factors, such as race, sex, age and the intersection of sex and race, on the model’s interpretations.

If a CXR is interpreted as normal but is actually abnormal, that is a false positive for the “no finding” task. Such false positives can delay treatment of the abnormal condition. Similarly, for the other three tasks, if the model fails to recognize the condition (e.g., pneumothorax), that is a false negative, and the patient’s condition may go untreated.

The popular approach to training neural models that interpret X-ray images is to use deep convolutional neural networks (CNNs). The researchers tried six different variations of the training process (for instance, omitting demographic information in some variations) and a dozen hyperparameter settings:

4 diagnostic classification tasks × 4 demographic parameters × 6 algorithmic variations × 12 hyperparameter settings = 3,456 models

That is a lot of models to train!
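To get a feel for what a sweep like that involves, here is a minimal, hypothetical sketch in Python of enumerating a training grid. The task names come from the study as described above; the variation labels, hyperparameter values and the stub train_model function are placeholders of my own, not the authors’ code, and this illustrative grid is smaller than the study’s full sweep of 3,456 models.

```python
# Hypothetical sketch of enumerating a model-training grid like the one described above.
# Variation labels, hyperparameter values and the stub trainer are illustrative placeholders.
from itertools import product

tasks = ["No Finding", "Effusion", "Pneumothorax", "Cardiomegaly"]
variations = [f"training_variant_{i}" for i in range(1, 7)]   # e.g., with or without demographic inputs
hyperparams = [{"lr": lr, "weight_decay": wd}                 # 12 illustrative settings
               for lr in (1e-3, 3e-4, 1e-4, 3e-5)
               for wd in (0.0, 1e-4, 1e-3)]

def train_model(task: str, variant: str, hp: dict) -> str:
    """Stand-in for training one deep CNN chest X-ray classifier."""
    return f"{task} | {variant} | lr={hp['lr']} | wd={hp['weight_decay']}"

models = [train_model(t, v, hp) for t, v, hp in product(tasks, variations, hyperparams)]
print(f"Trained {len(models)} stand-in models in this illustrative grid")   # 288 in this smaller sketch
```

Each configuration means training a separate CNN to convergence and then evaluating it across the demographic groupings, which is what makes a sweep of this size such a heavy undertaking.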

The researchers were exploring how bias creeps into neural models. Do the models learn to use demographic markers as a shortcut to prediction? How do you assess the fairness of a model’s performance? It turns out the models they trained can predict demographic factors from the image alone. To answer whether that, in turn, leads to biased predictions, they computed a “fairness gap.” The fairness gap is based on the false positive rate for the “no finding” task and the false negative rate for the other three abnormality-prediction tasks. If performance differs across specific demographic groups, the model has become biased. They observed that models that are better at predicting demographics such as age, sex and race also have a larger fairness gap. Essentially, this indicates the model is potentially using demographic data as a shortcut to predicting the patient’s state rather than relying on the image content.
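To make the fairness gap concrete, here is a simplified sketch, assuming we have binary predictions, binary ground-truth labels and a demographic group label per patient. It computes the per-group false positive rate for the “no finding” task and takes the spread across groups as the gap. This is my own simplified illustration of the idea, not the paper’s exact metric or code.

```python
import numpy as np

def false_positive_rate(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """FPR = FP / (FP + TN). For the 'no finding' task, a false positive means
    calling an image normal when it actually shows an abnormality (y_true == 0)."""
    negatives = (y_true == 0)
    return float(np.mean(y_pred[negatives] == 1)) if negatives.any() else float("nan")

def fairness_gap(y_true: np.ndarray, y_pred: np.ndarray, groups: np.ndarray) -> float:
    """Largest minus smallest per-group FPR: a simplified fairness gap."""
    rates = [false_positive_rate(y_true[groups == g], y_pred[groups == g])
             for g in np.unique(groups)]
    return float(np.nanmax(rates) - np.nanmin(rates))

# Toy data: 'no finding' calls for patients from two hypothetical groups A and B.
y_true = np.array([0, 0, 1, 1, 0, 0, 1, 1])          # 1 = truly normal, 0 = abnormality present
y_pred = np.array([1, 0, 1, 1, 1, 1, 1, 1])          # the model's 'no finding' predictions
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
print(f"fairness gap = {fairness_gap(y_true, y_pred, groups):.2f}")   # 0.50 in this toy example
```

For the three abnormality tasks, the same calculation would use the per-group false negative rate instead.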

To reduce the fairness gap, they tried a variety of de-biasing strategies and managed to train models that performed well, with a reduced fairness gap, when tested on samples similar to the training data (in-domain (ID) samples). However, when these models were tested on real-world images (out-of-domain (OOD) samples), the reduced fairness gap did not carry over. The distribution shift from ID to OOD affects fairness in unpredictable ways: models selected to minimize the fairness gap on ID data do not necessarily remain fair on OOD data.

So, what should be done? The researchers found that, instead of selecting models with the minimum fairness gap, picking the models that are worst at predicting demographic attributes does better on the fairness metric. Essentially, models that do not encode demographic information perform more reliably on the fairness scale with OOD data (the realistic deployment scenario).
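Here is a small, hypothetical sketch of that selection idea. Assume that for each trained model we have already measured its fairness gap on ID data, its fairness gap on OOD data and how well its internal features predict demographic attributes (summarized here as an AUC). The model names and all numbers below are made up purely to illustrate the two selection rules; they are not results from the paper.

```python
# Hypothetical model-selection sketch; all model names and numbers are invented for illustration.
# Each record: (model id, ID fairness gap, OOD fairness gap, demographic-prediction AUC).
candidates = [
    ("model_a", 0.02, 0.15, 0.92),   # smallest ID gap, but strongly encodes demographics
    ("model_b", 0.05, 0.06, 0.58),   # barely encodes demographics
    ("model_c", 0.04, 0.11, 0.80),
]

# Rule 1: pick the model with the smallest in-domain fairness gap.
by_id_gap = min(candidates, key=lambda m: m[1])

# Rule 2 (the approach favored in the study, as discussed above): pick the model
# whose features are worst at predicting demographic attributes.
by_demo_auc = min(candidates, key=lambda m: m[3])

print("selected by ID fairness gap:           ", by_id_gap[0], "-> OOD gap", by_id_gap[2])
print("selected by weakest demographic coding:", by_demo_auc[0], "-> OOD gap", by_demo_auc[2])
```

In this made-up example, the model chosen for its small ID fairness gap fares worse once the data shifts, while the model that encodes the least demographic information holds up better, which mirrors the pattern the researchers reported.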

What is the overall upshot of this comprehensive research on bias? Reducing and eliminating bias in models is a complex endeavor. Those training models will have to carefully vet the data and ensure there is enough of it to cover the population in the deployment setting. Models should be continually monitored and evaluated in real-world settings. It is not enough to accept model developers’ assurances (which is what is done now). Good advice, but it needs attention from the FDA to make it a reality.

I will end this blog with a pointer to Stanford University Human-Centered AI Institute’s “Artificial Intelligence Index Report 2024” released earlier this year. It has a whole chapter devoted to responsible AI, which dives into the challenges of addressing “fairness” in AI systems.


Acknowledgement: The paper on medical imaging was sent to me by Detlef Koll, which got me investigating what is happening on this front.

V. “Juggy” Jagannathan, PhD, is an AI evangelist with four decades of experience in AI and computer science research.