If AI were a student, it would flunk out of med school
The most difficult lesson to teach a physician-in-training is admitting the limits of their knowledge. Physicians are continually asked questions that exceed what they know. A physician may research the question for an answer or refer the patient to a specialist who can provide one. Saying "I don't know" reminds physicians of their humility in the face of mortality.
Medical students have worked all their lives to give the A+ answers to all questions. Admitting that they do not know an answer can be a foreign concept to a medical student. It is not unusual for a student to propose physiologic pathways to explain a medical condition, only to find themselves at the end of a dark alley with no place to go. The humbled student is brought back into the light by the professor, who explains the error of their ways. They learn not to venture further hypotheses without a confident understanding of physiology. This is a lesson that should only need to be taught once. If the lesson requires a repeat, the student may not be well suited to be a physician.
We participate in a consortium that is researching the long-term complications of the COVID-19 pandemic, including the possibility of an association between COVID-19 infection and bone disease. As we gathered the medical studies that associate the inflammatory reaction to COVID-19 and bone mineral loss, we decided to publish a review article.
The group decided to use ChatGPT 4.0 to write a similar article on the topic for comparison. These two articles would contrast the research compiled by medical researchers with that compiled by an AI program.
The two teams worked independently on the review article and on the AI-generated article. In completing our publication plans, we decided to have both teams work together to edit the AI-generated article to make an AI-assisted article.
The group working on the AI-generated article quickly hit a wall. In reviewing the medical literature cited by ChatGPT 4.0, we found 70% of the citations were either completely fabricated; incorrectly cited, with errors in author, title, journal, etc.; or irrelevant to the topic. The AI community calls these incorrect citations "hallucinations," but in medicine we call them "lies." Citing an article that does not address the topic shows a failure to understand it — in short, ignorance.
The AI-generated article also had a similarity index of 25%, meaning a quarter of the article was plagiarized.
Between the lies, ignorance and plagiarism, only 5% of the article could be considered a trustworthy original work.
Finally, ChatGPT 4.0 was not current on the medical literature published in the last several years. Up-to-date medical information is vital to the advancement of medical knowledge.
Because of these extensive errors, the group decided not to publish this article. If this work had been produced by a student, extensive remediation would be required before they would be allowed to continue their education.
Unfortunately, ChatGPT 4.0 is not a student, and there is no opportunity to remediate it. Instead, AI is on the internet, dispensing medical advice using the same algorithm that produced the medical review we rejected. In principle, feedback to an AI program could alter the algorithm to avoid these errors.
For AI to improve, programmers need to perform remediation to remove the tendency to lie and plagiarize. Medical experts need to assist AI programmers in improving the program’s ability to understand medical literature.
Angela Toepp, Ph.D., is an assistant professor of medicine at Eastern Virginia Medical School and director of system research at Sentara Health. Xian Qiao, M.D., is a pulmonologist and intensivist with Sentara Pulmonary, Critical Care and Sleep Specialists and an assistant professor of medicine at EVMS who started the Sentara Post-COVID Clinic. Thomas McCune, M.D., is a professor of medicine at EVMS and a nephrologist at Nephrology Associates of Tidewater, Ltd.