Study finds bias in AI diagnoses
Suggests doctors not rush to use GPT-4
Diagnosis is an especially tantalizing application for generative AI: Even when given tough cases that might stump doctors, the large language model GPT-4 has solved them surprisingly well.
But a new study points out that accuracy isn’t everything — and shows exactly why health care leaders already rushing to deploy GPT-4 should slow down and proceed with caution. When the tool was asked to drum up likely diagnoses, or come up with a patient case study, it in some cases produced problematic, biased results.
“GPT-4, being trained off of our own textual communication, shows the same — or maybe even more exaggerated — racial and sex biases as humans,” said Adam Rodman, a clinical reasoning researcher who co-directs the iMED Initiative at Beth Israel Deaconess Medical Center and was not involved in the research.
If those leanings were left unchecked by a physician using GPT-4, “it’s hard to know whether there might be systemic biases in the response that you give to one patient or another,” said Emily Alsentzer, a postdoctoral fellow at Brigham and Women’s Hospital and Harvard Medical School — and whether the AI might amplify existing health disparities.
In the study, which has not yet been peer-reviewed, researchers led by Alsentzer threw case studies from the New England Journal of Medicine Healer tool at GPT-4, asking the model to present a list of possible diagnoses and treatment recommendations for each situation. They pitched a range of patient complaints at the bot, including chest pain, difficulty breathing, sore throat, and a variety of emergency department complaints. But each time, they changed two things in the write-up: the patient’s gender and race.
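The core of that design is a counterfactual swap: hold the clinical details of each case fixed and vary only the stated demographics. A minimal sketch of that setup, using a hypothetical case template and illustrative demographic categories (the study’s actual prompts and category lists differ):

```python
from itertools import product

# Illustrative demographic attributes to swap into each case write-up;
# the study's exact categories may differ.
RACES = ["white", "Black", "Hispanic", "Asian"]
GENDERS = ["man", "woman"]

# Hypothetical case template, not a real NEJM Healer case.
CASE_TEMPLATE = (
    "A 25-year-old {race} {gender} presents to the clinic with a sore throat, "
    "fever, and fatigue. List the most likely diagnoses in order."
)

def make_variants(template: str) -> dict:
    """Generate one prompt per (race, gender) pair, holding the clinical
    details fixed so any change in the model's output is attributable
    to demographics alone."""
    return {
        (race, gender): template.format(race=race, gender=gender)
        for race, gender in product(RACES, GENDERS)
    }

variants = make_variants(CASE_TEMPLATE)
# 4 races x 2 genders = 8 prompt variants per case
```

Each variant would then be sent to the model separately, and the returned differential lists compared across the demographic pairs.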
On the whole, the model’s guesses weren’t significantly different between those groups. But it did show a more subtle form of bias — one that could easily be missed by a clinician trying to use GPT-4 as a tool — when it came to the ranking of those possible diagnoses.
When GPT-4 was told the patient with shortness of breath was a woman, it ranked panic and anxiety disorder higher on its list of differential diagnoses — reflecting known biases in the clinical literature that likely fed into the AI’s training data. When the sore throat patient was presented to GPT-4, it put the correct diagnosis — mono — in the top slot for white patients 100 percent of the time. But it prioritized mono only 86 percent, 73 percent, and 74 percent of the time for Black, Hispanic, and Asian men, respectively — placing gonorrhea at the top of the likely causes instead.
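Those percentages are the kind of per-group "top slot" rates that fall out of repeated runs. A small sketch of the metric, with hypothetical toy data rather than the study’s actual outputs:

```python
def top1_rate(ranked_lists: dict, correct: str) -> dict:
    """For each demographic group, compute the fraction of model runs in
    which the correct diagnosis was ranked first. `ranked_lists` maps a
    group label to one ranked differential list per run."""
    rates = {}
    for group, runs in ranked_lists.items():
        hits = sum(1 for ranking in runs if ranking and ranking[0] == correct)
        rates[group] = hits / len(runs)
    return rates

# Toy example (not the study's data): mono tops the list in both runs
# for one group but only one of two runs for the other.
runs = {
    "white": [["mono", "strep"], ["mono", "gonorrhea"]],
    "Black": [["gonorrhea", "mono"], ["mono", "strep"]],
}
rates = top1_rate(runs, "mono")  # {"white": 1.0, "Black": 0.5}
```

Comparing these rates across groups, rather than just checking whether the correct answer appears anywhere on the list, is what surfaces the subtler ranking bias the study describes.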
Experts would expect and even want some variation in these lists of differentials. “There are known differences in disease prevalence and clinical presentations across demographic groups,” explained Alsentzer, thanks to a combination of genetic and socioeconomic factors.
But the researchers found that the model’s diagnostic outputs often exaggerated those real-world disease prevalence trends. While it’s important that large language models capture biologically meaningful relationships between demographics and disease, these results show GPT-4 tends to “overfit” those correlations in a way that could amplify them when applied to clinical practice or training.
In another part of the study, treatment suggestions varied by race, too: For all 10 emergency department cases presented to GPT-4, the model was significantly less likely to suggest a CT scan if the patient was Black, and less likely to rate two cardiovascular tests — stress tests and angiography — of high importance for women compared with men.
While the bias uncovered is relatively subtle, especially compared with blatantly racist outputs from previous generations of large language models, it’s still meaningful in the world of medicine.
“Despite years of training these things to be less terrible, they still reflect many of these more subtle biases,” Rodman said. “It still reflects the biases of its training data, which is concerning given what people are using GPT for right now.”
Electronic health record giant Epic is currently integrating GPT-4 into its products — including a tool that drafts responses to patient messages. Nuance has rolled GPT-4 into its clinical note-generating tool, Dragon Ambient eXperience, allowing AI-generated notes to bypass the human review step it had previously used. And more informally, many doctors have started using ChatGPT as an adjunct to their practice — not by asking it to make diagnoses, necessarily, but by asking discrete questions about which antibiotics fight certain bacteria, or what chest X-ray findings are common for certain conditions.
Those uses may well help clinicians make better decisions. But without further research, it’s impossible to know whether the AI’s output may express subtle biases that ultimately manifest in different treatment decisions for different patient groups. “While their ability to help patients and physicians is exciting, ‘hardwiring’ bias into decision making is very concerning,” Ateev Mehrotra, a health policy researcher at Harvard Medical School, said in an email.
When the researchers prompted the model to generate examples of patient stories, like those commonly used to train medical students, they found that it exaggerates known differences in disease prevalence by demographic group. For example, when asked to generate clinical vignettes of a sarcoidosis patient, it described a Black woman 98 percent of the time.
“Sarcoidosis is more prevalent both in African Americans and in women,” explained Alsentzer, “but it’s certainly not 98 percent of all patients.”
These forms of bias are hardly a surprise to artificial intelligence researchers, including Rodman. “But it’s really, really concerning to me,” he said. “Medical students are using GPT-4 to learn right now.” If students use a large language model to help them study — asking it to come up with a sample case that illustrates a disease, for example — its outputs may easily reflect or exaggerate the biases humans already have. “How are they going to second-guess an LLM if they use an LLM to train their own brains?”
What does this all mean? First, and most obviously, there’s a long way to go before GPT-4 or other large language models should be applied to patient care management. “No one should be relying on it to make a medical decision at this point,” Rodman said. “I hope it hammers home the point that doctors should not be relying on GPT-4 to make management decisions.”
The research also highlights how large language models can have unintended consequences, even when a clinician is standing by to catch their mistakes. “I don’t think anyone is suggesting that we deploy large language models in isolation without any human oversight,” Alsentzer said. But research suggests that clinicians can be influenced by biased algorithms’ suggestions.
“Doctors tend to be very thoughtless when it comes to how we use machines,” Rodman said — which makes it all the more important to build safeguards into technology before it’s used in clinical decisions. “Things are moving quickly, and doctors need to get on top of this.”
“I don’t think we can necessarily rely on clinicians in the loop to be able to correct potential biases that might occur in models,” Alsentzer said. She suggested the next step would be to repeat a similar experiment with clinical notes from patient health records.
Stamping out biased clinical outputs from generalized large language models will be no easy task. Simply removing a patient’s race from clinical prompts wouldn’t suffice, because it can often be inferred from clinical text even when explicit mentions of race and ethnicity are deleted. And training a language model to be agnostic to demographics would shield it from some truly meaningful clinical information.
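To see why stripping explicit mentions falls short, consider what such a redaction pass actually does. A minimal sketch, assuming a simple keyword list (real clinical text carries many indirect demographic signals — names, neighborhoods, insurance status — that no term list can remove):

```python
import re

# Illustrative list of explicit race/ethnicity terms; far from exhaustive.
RACE_TERMS = r"\b(white|black|hispanic|latino|latina|asian|african[- ]american)\b"

def redact_race(note: str) -> str:
    """Remove explicit race/ethnicity terms from a clinical note.
    As the article notes, race can often still be inferred from the
    remaining text, so this is not a sufficient de-biasing step on
    its own."""
    return re.sub(RACE_TERMS, "[REDACTED]", note, flags=re.IGNORECASE)

redacted = redact_race("A 25-year-old Black woman presents with sore throat.")
```

The redacted note still contains everything else the model could use to infer demographics, which is exactly the limitation the researchers point to.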
The models still also have to be tested in real-world workflows, where researchers can vet their fairness when applied to each case — and ideally, pit them against humans making the same choices.
“When you’re comparing against flawed — very flawed — humans, that’s one of the first questions that comes to my mind,” Rodman said. “Are these reflecting exaggerated human biases and is it better or worse than us?”