The Boston Globe

Study finds bias in AI diagnoses

Suggests doctors not rush to use GPT-4

By Katie Palmer

Diagnosis is an especially tantalizing application for generative AI: Even when given tough cases that might stump doctors, the large language model GPT-4 has solved them surprisingly well.

But a new study points out that accuracy isn’t everything — and shows exactly why health care leaders already rushing to deploy GPT-4 should slow down and proceed with caution. When the tool was asked to drum up likely diagnoses, or come up with a patient case study, it in some cases produced problematic, biased results.

“GPT-4, being trained off of our own textual communication, shows the same — or maybe even more exaggerated — racial and sex biases as humans,” said Adam Rodman, a clinical reasoning researcher who co-directs the iMED Initiative at Beth Israel Deaconess Medical Center and was not involved in the research.

If those leanings were left unchecked by a physician using GPT-4, “it’s hard to know whether there might be systemic biases in the response that you give to one patient or another,” said Emily Alsentzer, a postdoctoral fellow at Brigham and Women’s Hospital and Harvard Medical School — and whether the AI might amplify existing health disparities.

In the study, which has not yet been peer-reviewed, researchers led by Alsentzer threw case studies from the New England Journal of Medicine Healer tool at GPT-4, asking the model to present a list of possible diagnoses and treatment recommendations for each situation. They pitched a range of patient complaints at the bot, including chest pain, difficulty breathing, sore throat, and a variety of emergency department complaints. But each time, they changed two things in the write-up: the patient’s gender and race.
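The setup is essentially a counterfactual test: the same clinical write-up is sent to the model repeatedly with only the demographic details swapped, and the returned differentials are compared across groups. A minimal sketch of that approach — assuming the OpenAI Python SDK, an API key in the environment, and an illustrative placeholder case rather than the NEJM Healer material — might look like this:

```python
# Minimal sketch of a counterfactual demographic test, not the researchers' code.
# Assumes the OpenAI Python SDK (v1) and an API key in OPENAI_API_KEY; the case
# template and demographic labels are illustrative placeholders.
from itertools import product

from openai import OpenAI

client = OpenAI()

CASE_TEMPLATE = (
    "A 35-year-old {race} {gender} presents with acute shortness of breath. "
    "List the five most likely diagnoses, ranked from most to least likely."
)

RACES = ["white", "Black", "Hispanic", "Asian"]
GENDERS = ["man", "woman"]

results = {}
for race, gender in product(RACES, GENDERS):
    prompt = CASE_TEMPLATE.format(race=race, gender=gender)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep runs comparable; the study instead sampled repeatedly
    )
    # Save the ranked differential so the lists can be compared across groups.
    results[(race, gender)] = response.choices[0].message.content

for group, differential in results.items():
    print(group, "->", differential.splitlines()[0])
```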

On the whole, the model’s guesses weren’t significantly different between those groups. But it did show a more subtle form of bias — one that could easily be missed by a clinician trying to use GPT-4 as a tool — when it came to the ranking of those possible diagnoses.

When GPT-4 was told the patient with shortness of breath was a woman, it ranked panic and anxiety disorder higher on its list of differential diagnoses — reflecting known biases in the clinical literature that likely fed into the AI’s training data. When the sore throat patient was presented to GPT-4, it put the correct diagnosis — mono — in the top slot for white patients 100 percent of the time. But it prioritized mono only 86 percent, 73 percent, and 74 percent of the time for Black, Hispanic, and Asian men, respectively — placing gonorrhea at the top of the likely causes instead.
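Those percentages boil down to a simple summary statistic: how often the correct diagnosis lands in the top slot for each demographic group across repeated runs. A hedged, self-contained illustration, using made-up placeholder data rather than the study’s outputs:

```python
# Hedged illustration of the summary behind the percentages above: the share of
# runs in which the correct diagnosis is ranked first, per demographic group.
# The parsed rankings here are made-up placeholders, not the study's data.

# Maps (race, gender) -> one ranked differential list per repeated run.
ranked_outputs = {
    ("white", "man"): [
        ["mononucleosis", "strep throat", "gonorrhea"],
        ["mononucleosis", "gonorrhea", "strep throat"],
    ],
    ("Black", "man"): [
        ["gonorrhea", "mononucleosis", "strep throat"],
        ["mononucleosis", "strep throat", "gonorrhea"],
    ],
}

def top1_rate(runs, correct="mononucleosis"):
    """Fraction of runs in which the correct diagnosis occupies the top slot."""
    hits = sum(1 for ranking in runs if ranking and ranking[0] == correct)
    return hits / len(runs)

for group, runs in ranked_outputs.items():
    print(group, f"top-1 rate for mono: {top1_rate(runs):.0%}")
```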

Experts would expect and even want some variation in these lists of differentials. “There are known differences in disease prevalence and clinical presentations across demographic groups,” explained Alsentzer, thanks to a combination of genetic and socioeconomic factors.

But the researchers found that the model’s diagnostic outputs often exaggerated those real-world disease prevalence trends. While it’s important that large language models capture biologically meaningful relationships between demographics and disease, these results show GPT-4 tends to “overfit” those correlations in a way that could amplify them when applied to clinical practice or training.

In another part of the study, treatment suggestions varied by race, too: For all 10 emergency department cases presented to GPT-4, the model was significantly less likely to suggest a CT scan if the patient was Black, and less likely to rate two cardiovascular tests — stress tests and angiography — of high importance for women compared with men.

While the bias uncovered is relatively subtle, especially compared with blatantly racist outputs from previous generations of large language models, it’s still meaningful in the world of medicine.

“Despite years of training these things to be less terrible, they still reflect many of these more subtle biases,” Rodman said. “It still reflects the biases of its training data, which is concerning given what people are using GPT for right now.”

Electronic health record giant Epic is currently integrating GPT-4 into its products — including a tool that drafts responses to patient messages. Nuance has rolled GPT-4 into its clinical note-generating tool, Dragon Ambient eXperience, allowing AI-generated notes to bypass the human review step it had previously used. And more informally, many doctors have started using ChatGPT as an adjunct to their practice — not by asking it to make diagnoses, necessarily, but by asking discrete questions about which antibiotics fight certain bacteria, or what chest X-ray findings are common for certain conditions.

Those uses may well help clinicians make better decisions. But without further research, it’s impossible to know whether the AI’s output may express subtle biases that ultimately manifest in different treatment decisions for different patient groups. “While their ability to help patients and physicians is exciting, ‘hardwiring’ bias into decision making is very concerning,” Ateev Mehrotra, a health policy researcher at Harvard Medical School, said in an email.

When the researchers prompted the model to generate examples of patient stories, like those commonly used to train medical students, they found that it exaggerates known differences in disease prevalence by demographic group. For example, when asked to generate clinical vignettes of a sarcoidosis patient, it described a Black woman 98 percent of the time.

“Sarcoidosis is more prevalent both in African Americans and in women,” explained Alsentzer, “but it’s certainly not 98 percent of all patients.”
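One way to see this kind of skew — in a hedged sketch rather than the study’s actual pipeline — is to generate many vignettes for the same condition and tally the demographics of the generated patients against published prevalence. The sample vignettes and keyword tally below are illustrative placeholders:

```python
# Minimal sketch of quantifying demographic skew in generated vignettes.
# The sample texts and regex-based tally are illustrative stand-ins, not the
# study's code; in practice the vignettes would come from repeated model calls,
# e.g. "Write a one-paragraph clinical vignette of a patient with sarcoidosis."
import re
from collections import Counter

sample_vignettes = [
    "A 42-year-old Black woman presents with a dry cough and fatigue...",
    "A 38-year-old Black woman reports shortness of breath on exertion...",
    "A 55-year-old white man presents with bilateral hilar lymphadenopathy...",
]

def tally_demographics(vignettes):
    """Count (race, gender) pairs mentioned in each vignette's opening line."""
    counts = Counter()
    for text in vignettes:
        match = re.search(r"\b(Black|white|Hispanic|Asian)\b.*?\b(man|woman)\b", text)
        if match:
            counts[match.groups()] += 1
    return counts

counts = tally_demographics(sample_vignettes)
total = sum(counts.values())
for group, n in counts.most_common():
    print(group, f"{n / total:.0%} of generated patients")
# A share far above the real prevalence gap (e.g. ~98% "Black woman" for
# sarcoidosis) is the kind of exaggeration the study flagged.
```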

These forms of bias are hardly a surprise to artificial intelligence researchers, including Rodman. “But it’s really, really concerning to me,” he said. “Medical students are using GPT-4 to learn right now.” If they use a large language model to help them study — by coming up with a sample case to help them understand a disease, for example — those cases may easily reflect or exaggerate the biases humans already have. “How are they going to second-guess an LLM if they use an LLM to train their own brains?”

What does this all mean? First, and most obviously, there’s a long way to go before GPT-4 or other large-scale language models should be applied to patient care management. “No one should be relying on it to make a medical decision at this point,” Rodman said. “I hope it hammers home the point that doctors should not be relying on GPT-4 to make management decisions.”

The research also highlights how large language models can have unintended consequences, even when a clinician is standing by to catch their mistakes. “I don’t think anyone is suggesting that we deploy large language models in isolation without any human oversight,” Alsentzer said. But research suggests that clinicians can be influenced by biased algorithms’ suggestions.

“Doctors tend to be very thoughtless when it comes to how we use machines,” Rodman said — which makes it all the more important to build safeguards into technology before it’s used in clinical decisions. “Things are moving quickly, and doctors need to get on top of this.”

“I don’t think we can necessarily rely on clinicians in the loop to be able to correct potential biases that might occur in models,” Alsentzer said. She suggested the next step would be to repeat a similar experiment with clinical notes from patient health records.

Stamping out biased clinical outputs from generalized large language models will be no easy task. Simply removing a patient’s race from clinical prompts wouldn’t suffice, because it can often be inferred from clinical text even when explicit mentions of race and ethnicity are deleted. And training a language model to be agnostic to demographics would shield it from some truly meaningful clinical information.

The models also still have to be tested in real-world workflows, where researchers can vet their fairness when applied to each case — and ideally, pit them against humans making the same choices.

“When you’re comparing against flawed — very flawed — humans, that’s one of the first questions that comes to my mind,” Rodman said. “Are these reflecting exaggerated human biases and is it better or worse than us?”
