AI Bot ChatGPT Passes US Medical Licensing Exams Without Cramming – Unlike Students
Alicia Ault
January 26, 2023
ChatGPT can pass parts of the US medical licensing exam, researchers have found, raising questions about whether the AI chatbot could one day help write the exam or help students prepare for it.
Victor Tseng, MD, and his colleagues at Ansible Health, a company that manages mostly homebound patients with chronic lung disease, initially wanted to see whether ChatGPT could aggregate all the communications regarding these patients, which would allow Ansible to better coordinate care.
“Naturally, we wondered how ChatGPT might augment patient care,” Tseng, Ansible’s vice president and medical director, told Medscape. A group of volunteers at the company decided to test its capabilities by asking it multiple choice questions from the US Medical Licensing Examination (USMLE), given that so many of them had taken the medical licensing exam.
“The results were so shocking to us that we sprinted to turn it into a publication,” said Tseng. The results were published as a preprint on medRxiv. The researchers were so impressed that they allowed ChatGPT to collaborate as a contributing author.
ChatGPT wrote the abstract and results sections “with minimal prompting and largely cosmetic adjustments from the human co-authors,” said Tseng. The bot also contributed large sections to the introduction and methods sections. The authors “frequently asked it to synthesize, simplify, and offer counterpoints to drafts in progress,” Tseng said. He likened it to how co-authors might interact over email. They decided they would not credit ChatGPT as an author, however.
San Francisco–based OpenAI developed ChatGPT, a large language model. The tech giant Microsoft considers ChatGPT and OpenAI’s other applications so promising that it has already invested $3 billion and is reportedly poised to put another $10 billion into the company.
ChatGPT’s algorithms are “trained to predict the likelihood of a given sequence of words based on the context of the words that come before it.” Theoretically, it is “capable of generating novel sequences of words never observed previously by the model, but that represent plausible sequences based on natural human language,” according to Tseng and his co-authors.
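As a rough illustration only, and not a description of OpenAI's actual implementation, this kind of next-word prediction is commonly written as a product of conditional probabilities. Here the symbols $w_1, \dots, w_n$ are an assumed notation for the words (or tokens) in a sequence; they do not appear in the preprint as quoted here.

```latex
% Probability of a whole sequence, built up one word at a time:
% each word is predicted from the words that come before it.
P(w_1, w_2, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})
```

Under this view, a "novel" sequence is simply one the model assigns reasonable probability to even though it never saw that exact sequence in training.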
Released to the public in November 2022, ChatGPT has been used to write everything from love poems to high school history papers to website editorial content. The bot draws on a vast store of text gathered from the internet through 2021.
Tseng and colleagues tested ChatGPT on hundreds of multiple-choice questions covering the three steps of the USMLE.
For each step, the researchers prompted the chatbot in three ways. First, it was given a theoretical patient’s signs and symptoms and asked to pontificate on what might be the underlying cause or diagnosis.
Next, after ChatGPT was refreshed to eliminate potential bias from any retained information from the previous exercise, it was given the questions from the exam and asked to pick an answer. After again refreshing ChatGPT, the researchers asked it to “please explain why the correct answers are correct and why the incorrect answers are incorrect.”
The answers were reviewed and scored by three board-certified, licensed physicians.
For the open-ended format, ChatGPT’s accuracy for Step 1 ranged from 43% when “indeterminate” answers were included in the analysis to 68% when those responses were excluded. An indeterminate answer is one in which the chatbot either gave a response that was not available among the multiple choices that were presented or said it could not commit to an answer. For Step 2, accuracy was 51% with indeterminate answers included and 58% with them excluded; for Step 3, it was 56% and 62%, respectively.
When asked the questions verbatim, ChatGPT’s accuracy (with indeterminate answers included/excluded) was 36%/55% for Step 1, 57%/59% for Step 2CK, and 55%/61% for Step 3. When asked to justify its responses, its accuracy was 40%/62% for Step 1, 49%/51% for Step 2, and 60%/65% for Step 3.
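The paired figures reflect two ways of computing accuracy from the same set of graded answers. The short Python sketch below shows the arithmetic under stated assumptions: the `Tally` class, the `accuracy_pair` function, and the counts are all hypothetical and chosen for illustration. They are not taken from the study, which is reported here only as percentages.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Hypothetical counts of graded answers for one exam step."""
    correct: int
    incorrect: int
    indeterminate: int


def accuracy_pair(t: Tally) -> tuple[float, float]:
    """Return (accuracy with indeterminate answers included,
    accuracy with indeterminate answers excluded)."""
    total = t.correct + t.incorrect + t.indeterminate
    determinate = t.correct + t.incorrect
    if determinate == 0:
        raise ValueError("need at least one determinate answer")
    # Included: indeterminate answers stay in the denominator (counted as not correct).
    # Excluded: indeterminate answers are dropped from the denominator.
    return t.correct / total, t.correct / determinate


# Illustrative numbers only, not data from the preprint.
example = Tally(correct=50, incorrect=30, indeterminate=20)
included, excluded = accuracy_pair(example)
print(f"{included:.1%} / {excluded:.1%}")  # prints "50.0% / 62.5%"
```

Because excluding indeterminate answers shrinks the denominator without changing the number of correct answers, the second figure in each pair can never be lower than the first.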
The pass rate for students varies according to whether it’s a first exam or a repeat exam and whether the test-taker is from the United States or a different country. In 2021, for Step 1, the pass rate ranged from a low of 45% for repeaters to a high of 96%. For Step 2, the range was 62% to 99%, and for Step 3, the range was 62% to 98%.
“What’s fascinating is that in the Step 2 and 3, which are more clinically advanced, only around 10% of [ChatGPT’s] responses were indeterminate,” said Tseng.
Bot Not Tested on Crucial Parts of Exam
USMLE’s Mechaber noted that ChatGPT was only given a sampling of questions, not an actual practice test. And it did not attempt questions that use images or sounds or the case-based computer simulation studies administered in Step 3, he said.
Tseng suggests in his article that ChatGPT could potentially be used as a study aid for students preparing for the USMLE or to write questions for the exam.
“We’re thinking about that,” Mechaber said about its use as a study tool. But since ChatGPT still produces so many wrong answers, the technology is not likely “ready for prime time,” he said. As to whether ChatGPT could write test questions, the National Board of Medical Examiners (NBME) has shown interest in “automated item generation,” he said.
“We’re investigating [ChatGPT] with excitement and curiosity” for its potential use in medicine, Mechaber said.
Chatbot Says USMLE Is Here to Stay
An NBME staff member decided to query ChatGPT about whether it was a threat to the USMLE. The bot said that while it is a “powerful tool for natural language processing,” it “is not a threat to the United States Medical Licensing Examination (USMLE).”
In a lengthy response, the bot added, “ChatGPT, while impressive in its ability to generate human-like text, is not specifically designed to test medical knowledge and is not a substitute for the rigorous training and education required to become a licensed physician.”
In addition, ChatGPT “does not have the ability to think critically or solve problems in the way that a human physician would,” it said.
The bot also brought up ethical considerations, noting that AI models “are based on machine learning which can be biased, hence the results generated by the model may not be accurate and unbiased.”
“ChatGPT is an impressive tool for natural language processing, but it is not a replacement for the specialized knowledge, critical thinking and ethical considerations that are essential for the practice of medicine,” it said. “The USMLE remains an important and valid way to evaluate the knowledge and abilities of aspiring physicians,” said the bot.
The study was conducted by volunteers and was not funded by any source. Tseng is a full-time employee of and writes test questions for U World, a USMLE test prep company.
Alicia Ault is a Saint Petersburg, Florida–based freelance journalist whose work has appeared in publications including JAMA and Smithsonian.com. You can find her on Twitter @aliciaault.