Extraction of sleep information from clinical notes of Alzheimer’s disease patients using natural language processing

Sleep-related keywords used to retrieve relevant clinical note documents.

“snore,” “snoring,” “wheeze,” “wheezing,” “sleep,” “sleepiness,” “sleeping,” “sleepless,” “sleeplessness,” “apnea,” “hypopnea,” “osa,” “insomnia,” “nap,” “napping,” “narcolepsy,” “nocturnal,” “somnolence,” “somnolent,” “dizziness,” “hypersomnia,” “rem,” “nrem,” “wake,” “wakefulness,” “waking,” “polysomnography”

Gold standard data annotation

We randomly sampled 570 clinical note documents from the adSLEEP corpus for manual annotation to create the gold standard dataset. The sample size was determined by the annotators’ availability and the aim of obtaining a representative sample of clinical notes with sleep-related concepts; the 570 documents were drawn from a unique subset of patients to give a broad view of the clinical corpus. The sampled notes contain about 1197 tokens on average and vary in content and length.

Two health informatics students annotated this sampled dataset. The annotators were directed to annotate mentions of 7 sleep-related categories in each clinical note: snoring, napping, sleep problem, bad sleep quality, daytime sleepiness, night wakings, and sleep duration. These sleep-related concept categories are defined based on previous sleep studies for patients with AD.⁷,¹³,³⁹ Table 2 shows the detailed definitions for each of the classes. A judge then aggregated the concept annotations in each clinical note into document-level labels. Because a note can mention several concepts, a document often carried multiple labels; for example, a document might be labeled with both “snoring” and “bad sleep quality,” each aggregated separately. The detailed annotation guideline is provided in Supplementary Table A1.

Table 2.

Definitions and examples of sleep-related concepts.

Concept category (values)	Definition	Examples
Snoring (Yes or No)	Snoring or snoring synonyms	“snoring”; “snored”
Napping (Yes or No)	Napping during daytime	“napping”; “doze”
Sleep problem (Yes or No)	Sleep problem; specific sleep disorder/condition/disease mentioned in the note	“sleep disorder”; “insomnia”; “hypersomnia”
Bad sleep quality (Yes or No)	Any mention related to bad sleep quality in the note	“sleeplessness”; “couldn’t sleep during night”; “staying up all night”
Daytime sleepiness (Yes or No)	Sleepiness during daytime	“sleep a lot throughout the day”; “excessive daytime sleepiness”
Night wakings (Yes or No)	Night time wakings	“frequent night waking”; “waking up in the middle”; “waking up 3-5 times”
Sleep duration (Short ≤6 h, Medium 6-8 h, or Long ≥8 h)	Duration of night time sleep	“sleeps 4-5 hours” → Short; “sleep more than 12 hours” → Long

Initially, a batch of 20 documents was given to the annotators to refine the annotation guidelines and reach consensus on the concept definitions. When the 2 annotators disagreed, they discussed the discrepancies to reach a consensus. If consensus was not easily reached or the inter-annotator agreement (IAA) fell below the target threshold, a third-party judge mediated and made the final decision. This process helped maintain consistency and accuracy in the annotations. We repeated this process on another batch of 20 documents, refining the annotation guidelines until the IAA exceeded 0.60, reaching a Cohen’s kappa of 0.68. These 40 documents were used to measure the final IAA. We then annotated the remaining 530 clinical notes using the updated annotation guidelines. Table 3 presents excerpts from de-identified clinical notes of a patient, highlighting key sleep-related issues.

Table 3.

Example of sleep-related characteristics derived from de-identified clinical notes.

Note Excerpt (De-identified)	Snoring (1 = true, 0 = false)	Napping (1 = true, 0 = false)	Sleep problem (1 = true, 0 = false)	Bad sleep quality (1 = true, 0 = false)	Daytime sleepiness (1 = true, 0 = false)	Night wakings (1 = true, 0 = false)	Sleep duration (0 = short, 1 = medium, 2 = long)
A 69-year-old female patient comes to office for checkup. History of osteoporosis, Alzheimer's, COPD, and anxiety. Reports trouble getting to sleep and staying asleep; lunesta not working. Follows with specialists and manages medications for chronic conditions. No complaints of pain or other systemic symptoms affecting sleep noted during visit. Primary insomnia diagnosed.	0	0	1	1	0	0	0

Annotations in the table indicate the presence or absence of the sleep characteristics.
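
The inter-annotator agreement reported above is a Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. The following is a minimal sketch of that computation; the two label lists are hypothetical and are not taken from the study data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical document-level "snoring" labels from two annotators.
annotator_1 = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 0, 0, 1, 0, 1, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

In this toy example the annotators agree on 8 of 10 documents, but because both label most documents as negative, the chance-corrected agreement is noticeably lower than 0.8.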

Rule-based NLP algorithm

We developed a rule-based NLP algorithm named nlp4sleep for sleep information extraction using MedTagger,⁴⁰ a clinical NLP tool based on the Unstructured Information Management Architecture (UIMA) framework.⁴¹ The MedTagger software is publicly available at GitHub (https://github.com/OHNLP/MedTagger).

We first used top-down and bottom-up approaches to identify the keywords in the rules for each sleep concept extraction. For the top-down approach, we searched for synonyms of each concept in medical terminologies and ontologies, including the Unified Medical Language System (UMLS) Metathesaurus. For the bottom-up approach, we used word embeddings, specifically Word2Vec,⁴² on the clinical corpus to find the top 3 most similar terms. Then we used 70% (399 documents) of the gold standard dataset as training data to develop regular expression rules for the NLP algorithm. MedTagger facilitated the execution of these regular expression-based rules, allowing the algorithm to annotate and extract information from unstructured text data. Since MedTagger includes negation detection and hypothetical mention detection, we did not specify negation rules unless we saw undetected negations in the training data. Table 4 lists the regular expression rules used in the nlp4sleep algorithm to extract sleep concepts. The NLP system extracted sleep concepts from each clinical document and assigned a document-level classification for each concept. If a document contained multiple mentions of a concept, we applied a majority voting strategy to obtain the final document label. The NLP algorithm is publicly available through the Open Health Natural Language Processing (OHNLP) consortium at GitHub (https://github.com/OHNLP/nlp4sleep).
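
The majority voting step described above can be sketched as follows. This is an illustration only, not the nlp4sleep implementation: the function name `document_label`, the default label for documents without mentions, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter

def document_label(mention_labels, default="No"):
    """Collapse mention-level labels for one concept into a document label."""
    if not mention_labels:
        return default  # assumption: no mention means the concept is absent
    counts = Counter(mention_labels)
    best = max(counts.values())
    # Assumed tie-break: among equally frequent labels, keep the one seen first.
    for label in mention_labels:
        if counts[label] == best:
            return label

print(document_label(["Yes", "No", "Yes"]))
print(document_label(["Short", "Long", "Long", "Short", "Long"]))
```

Here a negated mention would contribute a "No" vote, so a single negated phrase does not override two affirmed mentions of the same concept.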

Table 4.

Regular expression rules used in the NLP algorithm for the extraction of sleep concepts from clinical notes.

Concept category	Keywords and regular expressions
Snoring	snor(es|ing|e)?; snorings; sleep apnea; osa; obstructive sleep apnea
Napping	nap(s|ping)?
Sleep problem	insomnia; sleeplessness; sleep (disorders?|problems?); hypersomnia; parasomnia; osa; obstructive sleep apnea; sleep apnea; hypersomnolence
Bad sleep quality	staying up; (trouble|irritable|tense) (\S+\s+){0,5}(sleep(ing)?|asleep); sleep(s|ing)? poorly; sleep is poor; restless sleep; ca(n’t|nnot) sleep; sleep issues?; sleep(ing)? (\S+\s+){0,5}(problems?|problematic); sleeps? a lot; difficulty (\S+\s+){0,5}(asleep|sleep(ing)?); sleep disturbance; disturbance in sleep; sleep quality: (fair|bad); not sleeping; no sleep; sleep difficulty; nocturnal agitation; up (during|at) night; nocturnal; often awake
Daytime sleepiness	(excessive) ? daytime sleep(iness|inesses)?; (excessive) ? daytime somnolence; sleep(s|ing|iness)? at times; sleep(s|iness)? in (\S+\s+){0,2}day(time)?; sleep(s|iness)? during (\S+\s+){0,2}day(time)?; sleep all day
Night wakings	night (wakings|awakenings); wak(e|es|ing up) (\S+\s+){0,5}night; awake(ning|n)? (from|during|at) night(mares)?
Sleep duration	Short: sleep(s|ing)? (less than|up to) (1|2|3|4|5|6) hours; Medium: sleep(s|ing)? (\S+\s+){0,5}(6|7|8)-(6|7|8) hours; Long: sleep(s|ing)? (\S+\s+){0,5}more than (8|9|10|11|12|13|14|15|16|17|18|19|20) hours; sleep(s|ing)? (\S+\s+){0,5}(8|9|10|11|12)-(8|10|11|12|13|14|15|16|17|18|19|20) hours
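
To show how rules of this kind behave on text, the sketch below applies a few of the Table 4 patterns with Python's `re` module, writing alternation as a plain `|`. It is not how MedTagger executes the rules: the `\b` word boundaries, case-insensitive matching, the `find_mentions` helper, and the example note are assumptions of this sketch, and it performs no negation or hypothetical-mention detection.

```python
import re

# A few Table 4 patterns transcribed as Python regexes. The \b word
# boundaries and case-insensitive matching are assumptions of this sketch.
RULES = {
    "Napping": [r"\bnap(s|ping)?\b"],
    "Night wakings": [
        r"\bnight (wakings|awakenings)\b",
        r"\bwak(e|es|ing up) (\S+\s+){0,5}night\b",
    ],
    "Sleep duration: Short": [
        r"\bsleep(s|ing)? (less than|up to) (1|2|3|4|5|6) hours\b",
    ],
}
COMPILED = {
    concept: [re.compile(p, re.IGNORECASE) for p in patterns]
    for concept, patterns in RULES.items()
}

def find_mentions(text):
    """Return (concept, matched text) pairs for every rule hit in the text."""
    hits = []
    for concept, patterns in COMPILED.items():
        for pattern in patterns:
            for match in pattern.finditer(text):
                hits.append((concept, match.group(0)))
    return hits

# Hypothetical note text, written for this example.
note = ("Patient wakes up several times at night and sleeps less than "
        "5 hours. Naps after lunch.")
mentions = find_mentions(note)
for concept, span in mentions:
    print(concept, "->", span)
```

The `(\S+\s+){0,5}` construct allows up to five intervening tokens, which is why "wakes up several times at night" is captured as a single night-waking mention.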

Machine learning models

In addition to the rule-based NLP algorithm, we also developed machine learning models to extract sleep-related concepts. We trained and tested 4 major machine learning-based clinical text classification models, namely, decision trees (DT), logistic regression (LR), K-nearest neighbors (KNN), and support vector machine (SVM), for each sleep concept. All models were trained using the scikit-learn Python library to ensure consistent and standard behavior across the various approaches. We used 5-fold cross-validation to optimize the models’ parameters. During each fold, we systematically evaluated a range of hyperparameter combinations using grid search to identify the best parameters. We varied the maximum depth (1-20) and minimum samples split (2-10) for DT; tested different regularization strengths (C values from 0.001 to 10) for LR; explored the number of neighbors (1-15) and different distance metrics (eg, Euclidean, Manhattan) for KNN; and varied the kernel type (linear, polynomial, radial basis function) and regularization parameter (C values from 0.001 to 10) for SVM.
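
To give a sense of the size of this search, the sketch below enumerates the hyperparameter grids and produces 5-fold splits using only the Python standard library. It does not train any model. The parameter names, the log-spaced C values, the restriction to two distance metrics, and the contiguous fold layout are assumptions; the text specifies only the ranges.

```python
from itertools import product

# Search spaces described in the text. The specific C values (log-spaced)
# and the parameter names are assumptions; only the ranges are given.
C_VALUES = [0.001, 0.01, 0.1, 1, 10]
GRIDS = {
    "DT": {"max_depth": range(1, 21), "min_samples_split": range(2, 11)},
    "LR": {"C": C_VALUES},
    "KNN": {"n_neighbors": range(1, 16), "metric": ["euclidean", "manhattan"]},
    "SVM": {"kernel": ["linear", "poly", "rbf"], "C": C_VALUES},
}

def expand_grid(grid):
    """Yield every hyperparameter combination in a grid as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

n_candidates = {m: sum(1 for _ in expand_grid(g)) for m, g in GRIDS.items()}
folds = list(k_fold_indices(399, k=5))  # 399 training documents
print(n_candidates)
```

Each candidate would be fitted once per fold, so under these assumptions the DT grid alone implies 180 × 5 model fits per sleep concept. In scikit-learn, `GridSearchCV` with `cv=5` packages the enumeration and the splitting into a single estimator.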

Since real-world clinical text data may be inadequate and inconsistent for machine learning-based NLP models to comprehend, we adopted multiple preprocessing steps before feeding the text data to the machine learning models. First, the text was converted to lowercase and tokenized to break the sentences into smaller units, including words, phrases, symbols, or other meaningful elements. Stop words and non-numeric tokens were removed from the token lists, and the tokens were lemmatized to reduce the complexity of the text. The entire document was then converted into a numeric vector using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization.
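
The lowercase, tokenize, filter, and vectorize sequence can be illustrated with a small standard-library sketch. It is not the study's pipeline: lemmatization is omitted because it needs an external resource, the stop-word list is a tiny placeholder, and the IDF variant used here, ln(N/df) + 1 without normalization, is one common choice that differs in detail from scikit-learn's `TfidfVectorizer` defaults.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "at", "of", "to", "is", "in"}

def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with weights tf * (ln(N / df) + 1)."""
    tokenized = [preprocess(d) for d in documents]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append([
            (counts[t] / total) * (math.log(n_docs / doc_freq[t]) + 1.0)
            for t in vocab
        ])
    return vocab, vectors

# Hypothetical two-document corpus.
docs = ["Patient snores at night.", "Patient naps in the afternoon."]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
```

Terms that occur in every document, such as "patient" here, receive a lower weight than terms specific to one note, which is the property that makes TF-IDF useful for separating concept-bearing notes from the rest.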

All experiments, with the exception of “Sleep Duration,” were performed as binary text classification tasks, assigning positive or negative predictions to each concept category. “Sleep Duration,” however, was treated as a multi-class classification task with 3 labels (short, medium, and long). The same training and test datasets were used as in the rule-based NLP algorithm, with performance reported on the test dataset. These preprocessing steps and model development techniques ensured a structured and reliable approach to extracting sleep-related concepts.

LLM-based algorithms

We further extended our computational methodologies for sleep information extraction by incorporating LLMs with a focus on LLAMA2,⁴³ leveraging both chain-of-thought (CoT) prompting and supervised fine-tuning (SFT) approaches. The decision to utilize LLAMA2 stems from its demonstrated effectiveness in clinical information extraction tasks, as evidenced in our prior research works,⁴⁴^,⁴⁵ and its availability as an open-source model aligns with our commitment to accessible and reproducible scientific work.

The CoT-based NLP algorithm was implemented through carefully crafted prompts designed for each sleep concept. This approach harnesses the model’s capability to generate intermediate reasoning steps, facilitating a more nuanced understanding of complex clinical narratives found in EHRs. By simulating a thought process that mirrors clinical reasoning, CoT prompting aims to improve the model’s accuracy in identifying and classifying the various sleep concepts. The customized prompts were developed based on insights from clinical experts and iteratively refined to capture the intricacies of sleep-related terminology and contexts within the clinical notes.

Beyond the CoT prompting, we also implemented SFT on LLAMA2, specifically adopting low-rank adaptation (LoRA)-based instruction tuning techniques known for their efficiency in parameter adjustment. This approach fine-tunes a select group of parameters within LLAMA2, resulting in a model variant precisely calibrated for identifying sleep-related concepts within EHR data. The methodology for instruction tuning involved providing the LLAMA2 model with a labeled dataset containing detailed instructions for each sleep concept. This dataset was constructed using a triplet format, where each sample included the instruction prompt, the associated sleep concept, and contextual examples, that is, the original annotations. The instruction prompt was designed to guide the model in identifying specific sleep-related concepts in clinical notes. For each instruction prompt, the associated sleep concept was highlighted, along with examples illustrating how it might be expressed in clinical notes.

The SFT process involved instruction tuning the LLAMA2 model using a LoRA-based approach, with parameters carefully selected to minimize computational costs by focusing on low-rank adaptations. The attention dimension was set to 64, indicating the number of dimensions in the low-rank adaptation. The alpha parameter was set to 16, serving as a scaling factor for the low-rank matrices in LoRA, determining their influence on the overall model. To prevent overfitting, the dropout probability for the LoRA layers was set to 0.1.
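
For reference, the roles of the rank and alpha parameters can be written out using the standard LoRA formulation. Here $W_0$ is a frozen pretrained weight matrix and $B$, $A$ are the trainable low-rank factors; the matrix dimensions $d$ and $k$ are generic symbols, and the value $d = k = 4096$ used in the parameter count is a hypothetical illustration, since the text does not state the model size.

```latex
\[
W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
\]
\[
r = 64,\ \alpha = 16
\;\Longrightarrow\;
\frac{\alpha}{r} = \frac{16}{64} = 0.25
\]
\[
\text{Trainable parameters per adapted matrix: } r\,(d + k) = 64 \times 8192 = 524{,}288
\quad \text{versus} \quad d\,k = 4096^2 = 16{,}777{,}216 \ (\approx 3.1\%)
\]
```

With these settings the low-rank update is scaled by 0.25 before being added to the frozen weights, and only the factors $B$ and $A$ receive gradient updates.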

The training process consisted of 10 epochs, providing ample time for model optimization. The maximum gradient norm was limited to 0.3, which controlled the size of gradients to maintain training stability. The initial learning rate for the AdamW optimizer was set to 2e-4, ensuring a gradual learning process. A weight decay of 0.001 was applied to the training process to regularize the model and reduce overfitting.

Evaluation

We used 30% (171 documents) of the gold standard dataset as the testing data to validate the rule-based NLP algorithm and machine learning models. All models were evaluated using sensitivity, specificity, positive predictive value (PPV), F1 score, and area under the receiver operating characteristic curve (AUROC). Since our dataset is imbalanced, we report the weighted-averaged F1 score. The definitions of the evaluation metrics are shown below:

\[
\begin{aligned}
\text{Sensitivity} &= \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \\
\text{Specificity} &= \frac{\text{True Negative}}{\text{True Negative} + \text{False Positive}} \\
\text{PPV} &= \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \\
\text{F1 score} &= \frac{2 \cdot \text{True Positive}}{2 \cdot \text{True Positive} + \text{False Positive} + \text{False Negative}}
\end{aligned}
\]
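
These definitions translate directly into code. The sketch below computes the four count-based metrics from confusion-matrix counts; the counts are hypothetical, returning 0 for an undefined ratio is an assumed convention, and the F1 shown is the positive-class F1 from the formula above rather than the weighted average reported in the results.

```python
def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, and F1 from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 when the ratio is undefined.
        return numerator / denominator if denominator else 0.0
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

# Hypothetical counts for one concept on a 171-document test set.
scores = binary_metrics(tp=30, fp=5, tn=130, fn=6)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

AUROC is not included because it depends on ranked prediction scores rather than on a single set of confusion-matrix counts.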

Results

Table 5 shows the demographics of the AD cohort of 482 patients. Patients were primarily white (89.2%), female (64.1%), and not Hispanic or Latino (94.6%), with a mean age of 85 years. The demographics of this cohort are similar to the demographics of the population with AD in western Pennsylvania.⁴⁶

Table 5.

Demographics of the annotated corpus.

Demographics	Total (n = 482)
Age (Mean)	84.7
Sex
Female	309 (64.1%)
Male	173 (35.9%)
Race
White	430 (89.2%)
Black	47 (9.8%)
Others	5 (1%)
Ethnicity
Hispanic or Latino	3 (0.6%)
Not Hispanic or Latino	456 (94.6%)

Table 6 lists the number of documents for each sleep concept in the annotated training and test datasets. As shown in the table, the frequency of these sleep concepts is low in the gold standard dataset. Though the clinical documents were identified by using a list of relevant keywords, most documents do not contain any sleep-related concepts. The reason might be that some keywords may not be only related to sleep; for example, wheezing might be related to respiratory diseases.

Table 6.

Number of clinical documents for each sleep concept in the annotated training and test datasets.

Concept category	No. of documents in training (Yes/No)	No. of documents in test (Yes/No)	No. of total documents (Yes/No)
Snoring	60/290	21/199	81/489
Napping	31/319	10/210	41/529
Sleep problem	71/279	23/197	94/476
Bad sleep quality	45/305	25/195	70/500
Daytime sleepiness	103/247	34/186	137/433
Night wakings	104/246	35/185	139/431
Sleep duration (Short/Medium/Long)	240/117/67	80/42/24	320/159/91

“Yes” indicates that the specified sleep concept is present in the clinical note, while “No” indicates that it is not.

The performance of the rule-based NLP algorithm and machine learning models is listed in Table 7. The rule-based NLP algorithm performed strongly across most sleep-related concepts, achieving perfect scores in sensitivity, specificity, F1, and PPV for the concepts of daytime sleepiness and sleep duration. The algorithm’s strength lies in its precision, as evidenced by its high PPV values, notably for snoring (0.94) and for sleep duration, where it also achieved perfect scores. The AUROC scores for the rule-based algorithm were consistently high, achieving the best score for 4 out of 6 concepts among the 7 models. Note that AUROC could not be calculated for sleep duration as it had more than 2 labels. However, its performance varied across other sleep concepts, with the lowest scores observed in napping (specificity and PPV both at 0.5), indicating potential challenges in distinguishing relevant napping instances from unrelated contexts. Despite these variances, the rule-based approach excelled in extracting specific sleep-related concepts, particularly in accurately identifying instances without false negatives, as shown by the perfect sensitivity scores across the board.

Table 7.

Performance of the rule-based NLP algorithm and machine learning models.

Model	Metric	Daytime sleepiness	Napping	Night wakings	Sleep problem	Bad sleep quality	Snoring	Sleep duration
Rule-based NLP	Sensitivity	1.00	0.50	1.00	0.85	0.62	0.94	1.00
Rule-based NLP	Specificity	1.00	0.99	0.99	0.93	0.51	0.97	1.00
Rule-based NLP	F1	1.00	0.98	0.99	0.91	0.91	0.97	1.00
Rule-based NLP	PPV	1.00	0.50	0.75	0.80	0.60	0.89	1.00
Rule-based NLP	AUROC	1.00	0.97	0.98	0.92	0.50	0.86	–
DT	Sensitivity	0.86	0.89	0.82	0.78	0.63	0.75	0.79
DT	Specificity	0.86	0.89	0.80	0.72	0.57	0.75	0.74
DT	F1	0.90	0.98	0.81	0.74	0.58	0.75	0.76
DT	PPV	0.86	0.89	0.84	0.84	0.81	0.78	0.77
DT	AUROC	0.85	0.89	0.79	0.71	0.56	0.75	–
LR	Sensitivity	0.90	0.47	0.92	0.91	0.42	0.42	0.77
LR	Specificity	0.77	0.50	0.83	0.58	0.50	0.50	0.67
LR	F1	0.81	0.48	0.86	0.59	0.45	0.46	0.71
LR	PPV	0.89	0.94	0.91	0.82	0.83	0.84	0.78
LR	AUROC	0.78	0.50	0.82	0.58	0.50	0.50	–
KNN	Sensitivity	0.93	0.79	0.87	0.79	0.76	0.70	0.84
KNN	Specificity	0.85	0.79	0.79	0.65	0.72	0.66	0.89
KNN	F1	0.88	0.79	0.82	0.68	0.74	0.71	0.86
KNN	PPV	0.93	0.95	0.89	0.83	0.86	0.81	0.87
KNN	AUROC	0.85	0.79	0.78	0.64	0.72	0.66	–
SVM	Sensitivity	0.91	0.81	0.90	0.91	0.93	0.95	0.85
SVM	Specificity	0.80	0.69	0.85	0.61	0.60	0.69	0.84
SVM	F1	0.84	0.74	0.87	0.63	0.63	0.75	0.84
SVM	PPV	0.90	0.95	0.91	0.83	0.86	0.90	0.78
SVM	AUROC	0.79	0.69	0.85	0.61	0.60	0.68	–
LLAMA2-CoT	Sensitivity	0.69	0.72	0.48	0.82	0.74	0.80	0.77
LLAMA2-CoT	Specificity	0.54	0.62	0.34	0.86	0.67	0.85	0.72
LLAMA2-CoT	F1	0.57	0.58	0.39	0.83	0.70	0.79	0.67
LLAMA2-CoT	PPV	0.54	0.62	0.34	0.86	0.67	0.84	0.72
LLAMA2-CoT	AUROC	0.54	0.60	0.37	0.84	0.68	0.78	–
LLAMA2-SFT	Sensitivity	0.93	0.82	0.90	0.90	0.87	0.88	1.00
LLAMA2-SFT	Specificity	0.92	0.94	0.94	0.90	0.87	0.88	1.00
LLAMA2-SFT	F1	0.91	0.88	0.96	0.89	0.84	0.78	1.00
LLAMA2-SFT	PPV	0.92	0.85	0.93	0.89	0.84	0.83	1.00
LLAMA2-SFT	AUROC	0.91	0.82	0.90	0.89	0.87	0.87	–

Highlighted are the best performances on each sleep concept.

Table 7.

Performance of the rule-based NLP algorithm and machine learning models.

Model	Metric	Daytime sleepiness	Napping	Night wakings	Sleep problem	Bad sleep quality	Snoring	Sleep duration
Rule-based NLP	Sensitivity	1.00	0.50	1.00	0.85	0.62	0.94	1.00
	Specificity	1.00	0.99	0.99	0.93	0.51	0.97	1.00
	F1	1.00	0.98	0.99	0.91	0.91	0.97	1.00
	PPV	1.00	0.50	0.75	0.80	0.60	0.89	1.00
	AUROC	1.00	0.97	0.98	0.92	0.50	0.86	–
DT	Sensitivity	0.86	0.89	0.82	0.78	0.63	0.75	0.79
	Specificity	0.86	0.89	0.80	0.72	0.57	0.75	0.74
	F1	0.90	0.98	0.81	0.74	0.58	0.75	0.76
	PPV	0.86	0.89	0.84	0.84	0.81	0.78	0.77
	AUROC	0.85	0.89	0.79	0.71	0.56	0.75	–
LR	Sensitivity	0.90	0.47	0.92	0.91	0.42	0.42	0.77
	Specificity	0.77	0.50	0.83	0.58	0.50	0.50	0.67
	F1	0.81	0.48	0.86	0.59	0.45	0.46	0.71
	PPV	0.89	0.94	0.91	0.82	0.83	0.84	0.78
	AUROC	0.78	0.50	0.82	0.58	0.50	0.50	–
KNN	Sensitivity	0.93	0.79	0.87	0.79	0.76	0.70	0.84
	Specificity	0.85	0.79	0.79	0.65	0.72	0.66	0.89
	F1	0.88	0.79	0.82	0.68	0.74	0.71	0.86
	PPV	0.93	0.95	0.89	0.83	0.86	0.81	0.87
	AUROC	0.85	0.79	0.78	0.64	0.72	0.66	–
SVM	Sensitivity	0.91	0.81	0.90	0.91	0.93	0.95	0.85
	Specificity	0.80	0.69	0.85	0.61	0.60	0.69	0.84
	F1	0.84	0.74	0.87	0.63	0.63	0.75	0.84
	PPV	0.90	0.95	0.91	0.83	0.86	0.90	0.78
	AUROC	0.79	0.69	0.85	0.61	0.60	0.68	–
LLAMA2-CoT	Sensitivity	0.69	0.72	0.48	0.82	0.74	0.80	0.77
	Specificity	0.54	0.62	0.34	0.86	0.67	0.85	0.72
	F1	0.57	0.58	0.39	0.83	0.70	0.79	0.67
	PPV	0.54	0.62	0.34	0.86	0.67	0.84	0.72
	AUROC	0.54	0.60	0.37	0.84	0.68	0.78	–
LLAMA2-SFT	Sensitivity	0.93	0.82	0.90	0.90	0.87	0.88	1.00
	Specificity	0.92	0.94	0.94	0.90	0.87	0.88	1.00
	F1	0.91	0.88	0.96	0.89	0.84	0.78	1.00
	PPV	0.92	0.85	0.93	0.89	0.84	0.83	1.00
	AUROC	0.91	0.82	0.90	0.89	0.87	0.87	–

Highlighted are the best performances on each sleep concept.

The machine learning models (DT, LR, KNN, and SVM) showed varied performance across the sleep concepts. The best hyperparameters identified through grid search were a maximum depth of 10 and a minimum samples split of 4 for DT, a regularization strength of 1 for LR, 7 neighbors with the Euclidean distance metric for KNN, and a radial basis function kernel with a regularization parameter of 1 for SVM. The AUROC scores for these models generally ranged from 0.70 to 0.95. The SVM model, in particular, demonstrated robustness, with high sensitivity and specificity scores, peaking at 0.95 for identifying night wakings, and maintained strong performance on sleep duration. However, these models generally exhibited lower PPV scores than the rule-based NLP algorithm, indicating a higher rate of false positives. The KNN model was notably consistent across metrics, suggesting that it can handle the unstructured nature of clinical text data, albeit with some limitations in precision as indicated by its PPV scores. The variability in performance across these models underscores the challenges of applying machine learning to clinical text classification, particularly for sparse and infrequent concepts. These results are consistent with previous studies⁴⁷,⁴⁸ showing that machine learning models might not be effective in clinical text classification when the annotated training dataset is small and the concepts of interest are sparse and infrequent in the documents.
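To make the link between PPV and false positives concrete, the following minimal sketch computes sensitivity, specificity, PPV, and F1 from document-level confusion counts. The `Counts` class, the `metrics` function, and the example counts are hypothetical illustrations; they are not taken from Table 7 or from the study's evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Document-level confusion counts for one sleep concept (hypothetical)."""
    tp: int  # concept present, predicted present
    fp: int  # concept absent, predicted present
    tn: int  # concept absent, predicted absent
    fn: int  # concept present, predicted absent


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a metric is undefined.
    return numerator / denominator if denominator else float("nan")


def metrics(c: Counts) -> dict:
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
    }


# Hypothetical counts: most true mentions are found (high sensitivity),
# but 6 false positives pull the PPV down to 0.75.
example = Counts(tp=18, fp=6, tn=70, fn=2)
result = metrics(example)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

In this illustration the sensitivity is 0.90 while the PPV is only 0.75, which is the pattern described above: a model can recover most true mentions and still produce enough false positives to lower its PPV.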

The LLM-based NLP algorithms, LLAMA2 with chain of thought (CoT) prompting and LLAMA2 with supervised finetuning (SFT), introduced advanced contextual understanding to the task. LLAMA2-SFT, which uses LoRA-based parameter-efficient finetuning, exhibited remarkable performance, closely rivaling the rule-based NLP algorithm, particularly on sleep duration, where it achieved perfect scores. The AUROC scores for LLAMA2-SFT were among the highest, including the highest values for bad sleep quality (0.87) and snoring (0.88). It achieved high sensitivity, specificity, and F1 scores, especially on complex sleep concepts such as night wakings and sleep problems, indicating strong contextual comprehension and adaptability. The LLAMA2-CoT approach showed more moderate performance, illustrating the potential limitations of relying solely on reasoning chains without finetuning for highly specialized tasks such as clinical concept extraction.

Across the 3 types of algorithms, the rule-based NLP and LLAMA2-SFT models stand out for their superior performance, particularly in capturing sleep duration, where both achieved perfect scores. The rule-based NLP algorithm excels in specificity and sensitivity, which we attribute to its tailored rules that effectively capture specific sleep-related concepts. LLAMA2-SFT, with its finetuning, demonstrates comparable performance, benefiting from the deep contextual understanding of LLMs and their adaptability to the nuances of clinical narratives. The AUROC scores support this observation: the rule-based NLP algorithm and LLAMA2-SFT show high scores across different concepts, indicating their robustness in distinguishing between classes. While machine learning models offer valuable insights, their performance, particularly in terms of PPV, indicates a susceptibility to false positives, a critical limitation in clinical applications where accuracy is paramount.

The exceptional results of the rule-based NLP and LLAMA2-SFT models underscore their effectiveness in clinical text classification, suggesting that a hybrid approach leveraging the precision of rule-based methods and the contextual adaptability of finetuned LLMs could provide a robust solution for extracting sleep information from the unstructured text of EHRs. This is particularly evident in their handling of sleep duration, where both models demonstrated their capability to accurately and reliably capture sleep patterns, highlighting the potential for comprehensive sleep information extraction in clinical settings.

Error analysis of the rule-based NLP algorithm

We conducted an error analysis of the documents misclassified by the rule-based NLP algorithm and analyzed the causes of the false positives and false negatives for each sleep concept. Some false positives arose because our annotators missed the information. For example, in the text “Histories Past Medical History Combined Chronic Systolic/Diastolic CHF COPD (industrial exposure) CAD s/p stents PMH Right Ocular Stroke—chronic visual defect BPH Type 2 DM OSA on BiPAP,” the rule-based NLP algorithm identified OSA as snoring and sleep problem, but the concept had not been annotated. Much of the semi-structured clinical text auto-populated in the EHR system is difficult for annotators to read and annotate. In another false positive, the sentence “He had episodes yesterday in which he became confused after waking up from a nap” mentions a nap that may not be related to the patient’s sleep pattern. In a false-positive case for sleep problem, “Depression screen done 7/2017, PHQ9 score 16 points for sleep problem which seems better now,” the NLP algorithm could not recognize that the sentence did not indicate a positive sleep problem. In a false-positive case for bad sleep quality, the sentence “Take Melatonin 5 mg at bedtime every night for 3- 4 weeks for difficulty falling asleep” was a suggestion for the patient.

Some false negatives were due to errors in negation detection. For example, in the sentence “The patient’s daughter states that she has not been complaining of her back pain or of her leg cramps we discussed the fact that she is doing less and does nap during the day,” the algorithm incorrectly identified the nap mention as negated because it failed to split the text into 2 sentences. In another example for sleep problem, “Change in social contacts/activities? No Patient Active Problem List Diagnosis Primary open angle glaucoma Urge incontinence Backache, unspecified Pneumonia, organism unspecified Insomnia,” the NLP algorithm incorrectly split the text into “Change in social contacts/activities?” and “No Patient Active Problem List Diagnosis Primary open angle glaucoma Urge incontinence Backache, unspecified Pneumonia, organism unspecified Insomnia” and wrongly treated insomnia as negated. Semi-structured clinical text thus also confused the NLP algorithm in detecting sentence boundaries and negations.
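The interaction between sentence splitting and negation described above can be reproduced with a deliberately naive sketch. The cue list, the punctuation-based splitter, and the "cue anywhere before the concept in the same sentence" rule are all assumptions made for illustration; this is not the negation module used in the study.

```python
import re

# Hypothetical, minimal cue list; real negation lexicons are much larger.
NEGATION_CUES = {"no", "not", "denies", "without"}


def split_sentences(text: str) -> list:
    # Naive splitter: break only after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence: str) -> list:
    return re.findall(r"[a-z]+", sentence.lower())


def concept_status(text: str, concept: str) -> str:
    """Return 'negated', 'positive', or 'absent' for the first mention."""
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        if concept in tokens:
            before = tokens[: tokens.index(concept)]
            # A cue anywhere earlier in the same sentence negates the mention.
            return "negated" if any(t in NEGATION_CUES for t in before) else "positive"
    return "absent"


# Run-on text from the error analysis: no punctuation separates the two clauses.
run_on = ("she has not been complaining of her back pain or of her leg cramps "
          "we discussed the fact that she is doing less and does nap during the day")
# Same text with a sentence boundary added by hand (illustration only).
fixed = ("she has not been complaining of her back pain or of her leg cramps. "
         "we discussed the fact that she is doing less and does nap during the day")

print(concept_status(run_on, "nap"))  # the cue "not" leaks into the nap clause
print(concept_status(fixed, "nap"))   # the cue stays in the first sentence
```

With the run-on text, the cue "not" falls in the same sentence as "nap", so the mention is marked as negated; once the boundary is restored, the same rule treats it as positive.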

Discussion

Detailed descriptions of SDOH are usually captured in unstructured clinical text; however, the SDOH information may be sparsely documented due to the lack of clinical practice guidelines for documenting such information. Our study shows that sleep information is infrequently recorded in clinical notes for patients with AD. For example, in the gold standard dataset, only 14% of clinical documents recorded snoring concepts (81 out of 570), 7.2% napping (41 out of 570), 16.5% sleep problem (94 out of 570), 12.3% bad sleep quality (70 out of 570), 24% daytime sleepiness (137 out of 570), 24.4% night wakings (139 out of 570), and approximately 56% sleep duration (570 with 320/159/91 distribution).
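The percentages above follow directly from the reported document counts. A short sketch of the arithmetic, using only the counts stated in this paragraph (sleep duration is omitted because its category labels are not given here; the variable names are illustrative):

```python
TOTAL_DOCS = 570  # size of the gold standard dataset

# Positive document counts per concept, as reported in the text.
positive_docs = {
    "snoring": 81,
    "napping": 41,
    "sleep problem": 94,
    "bad sleep quality": 70,
    "daytime sleepiness": 137,
    "night wakings": 139,
}

# Share of documents mentioning each concept, in percent (one decimal place).
prevalence = {
    concept: round(100 * count / TOTAL_DOCS, 1)
    for concept, count in positive_docs.items()
}

for concept, pct in prevalence.items():
    print(f"{concept}: {pct}%")
```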

Another challenge we encountered during the project was the definition of sleep-related concepts. We initially considered 8 concepts, including sleep disorder, sleep problem symptoms, snoring, napping, sleep quality, daytime sleepiness, night wakings, and sleep duration, with detailed granularity according to the relevant sleep research in the literature. For example, there were 4 categories associated with snoring or daytime sleepiness: negated, positive, sometimes, and all the time. There were 3 categories for night wakings: 0, 1-2, and >2. However, during the annotation process, we found that the granular categories for each concept were rarely used and that there were significant overlaps between sleep problem symptoms and other concepts. For example, phrases like “staying up all night” meet the description of insomnia, but the patient was never diagnosed with insomnia. Likewise, snoring and sleep apnea are conceptually similar but are not always annotated in the same way: a mention of snoring is annotated as snoring but not as sleep apnea, whereas a mention of sleep apnea is annotated as both sleep problem and snoring because it meets both definitions. Thus, we simplified the concept definitions and the categories for each concept.

Since SDOH comprise further conditions related to socioeconomic status, living environment, housing, education, food, and community, these SDOH concepts might be even more challenging to define. It is also questionable whether such information is adequately documented in EHRs and whether such information from EHRs would be useful for research. Thus, a feasibility study assessing the availability of SDOH in EHRs for a given cohort of patients might be necessary before algorithm development. In addition, manually annotating a gold standard dataset is a tedious and time-consuming process. A potential starting point for building automated systems to extract SDOH from EHRs might be a community effort to build an SDOH ontology and terminology.

Additionally, sleep information is infrequently documented in the clinical notes, and some keywords are shared with other concepts. Although we used IR to select the documents with keywords related to sleep, keywords including wheeze, wheezing, and apnea appear often but are unrelated to the patient’s sleep. For example, physicians commonly check a patient’s respiratory health and record the presence of wheezing in the clinical notes. Wheezing, a concept shared between respiratory health and sleep, proved problematic when retrieving sleep-specific documents. Using a comprehensive list of keywords that sufficiently covers the domain of interest to retrieve relevant clinical documents has been adopted as the general approach for sampling a set of documents to be annotated and extracted. However, this approach requires tedious work in collaboration with an engaged focus group. In addition, this sampling approach hampers the ability of NLP methods to identify rare cases and uncommon phenotypes, which is a major threat to NLP generalizability.
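The retrieval problem with shared keywords can be illustrated with a small whole-word keyword filter. The keyword subset is drawn from the list used for document retrieval, but the matching logic, the function name, and the three example notes are hypothetical and only sketch the idea.

```python
import re

# Subset of the sleep-related retrieval keywords (illustrative, not the full list).
SLEEP_KEYWORDS = ["snore", "snoring", "wheeze", "wheezing", "sleep",
                  "apnea", "insomnia", "nap", "napping"]

# Whole-word, case-insensitive match on any keyword.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SLEEP_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)


def matched_keywords(note: str) -> list:
    """Return the sorted, de-duplicated keywords found in a note."""
    return sorted({m.group(1).lower() for m in PATTERN.finditer(note)})


# Hypothetical notes written for this example only.
notes = {
    "resp_exam": "Lungs clear to auscultation, no wheezing or rales.",
    "sleep_note": "Patient reports snoring and takes a nap most afternoons.",
    "unrelated": "Follow up in 3 months for blood pressure check.",
}

# A note is retrieved if at least one keyword matches.
retrieved = {}
for note_id, text in notes.items():
    hits = matched_keywords(text)
    if hits:
        retrieved[note_id] = hits

print(retrieved)
```

The respiratory exam note is retrieved solely because of "wheezing", even though it says nothing about sleep, which mirrors the false hits described above.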

Limitations

There are several limitations in this study. First, the ICD codes used to define AD may not be optimal. However, a more comprehensive way to define AD is beyond the scope of this study. Second, the initial search keywords used to retrieve sleep-related clinical notes may not be complete and could miss some documents. However, this could be a common problem for SDOH information extraction due to the sparse and infrequent documentation in clinical notes. Third, we acknowledge that the annotated dataset used to train and test the proposed systems is relatively small, which may limit the usability of the system and weaken the conclusions. However, clinical note annotation is a time-consuming and expensive process: each document requires substantial time (∼2 hours) for each annotator to annotate the 7 sleep-related concepts. The proposed NLP systems in this study are still valuable to the literature. Fourth, we used ICD codes to identify 7266 patients and retrieved about 1.1 million clinical documents to study AD, but relying solely on ICD codes for patient identification may inadvertently include non-AD individuals, potentially introducing noise into our analysis. Last but not least, we did not consider sleep information in other EHR data types (eg, diagnosis codes, survey data, questionnaire data), sleep studies such as polysomnography, or sleep tests such as the multiple sleep latency test (MSLT), which should be considered in further cohort studies on the association between sleep and AD.

Future work

In future work, we plan to explore more sophisticated methods for retrieving, with high precision, documents relevant to the medical concepts of interest. This might be a key challenge in collecting a corpus for studying an SDOH concept. In addition, we will investigate novel machine learning methods that require little or no labeled data for training, such as semi-supervised learning and self-supervised learning.

Despite the notable imbalance in the dataset for certain sleep concepts, such as napping, where the ratio of positive to negative samples is heavily skewed, we opted not to employ sampling strategies. These techniques, while common in machine learning, are less prevalent in rule-based NLP, and omitting them during preprocessing kept the setup consistent across models. Future studies could explore sampling strategies to determine their impact on the performance of machine learning models.
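As an illustration of the kind of sampling strategy that future work could evaluate, the sketch below applies random oversampling of the minority class to a label vector with the napping class ratio of the gold standard dataset (41 positive documents out of 570). The function name and the fixed seed are assumptions for this example; no such resampling was used in this study, and in practice it would be applied to the training split only.

```python
import random


def random_oversample(labels: list, seed: int = 0) -> list:
    """Return example indices in which the minority class is resampled
    with replacement until both classes have the same size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        # Nothing to resample from; return the data unchanged.
        return list(range(len(labels)))
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


# Napping: 41 positive and 529 negative documents out of 570.
labels = [1] * 41 + [0] * 529
indices = random_oversample(labels)
balanced = [labels[i] for i in indices]

print(balanced.count(1), balanced.count(0))
```

Oversampling duplicates minority documents rather than adding new information, so its effect on PPV and sensitivity would need to be checked empirically on held-out data.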

This work could also support research on the connection between sleep and AD. Because sleep is one of the modifiable lifestyle-related factors, it provides evidence that research on AD and sleep interventions is necessary and critical. Research should continue to examine the associations between sleep variables (eg, sleep duration, sleep difficulties, and snoring) and cognitive function, as well as interventions that more effectively address sleep disturbances in older adults with AD.

Furthermore, we plan to test the generalizability of the algorithms by applying them to datasets from other hospitals. Specifically, we will leverage the Evolve to Next-gen ACT (ENACT) network, a National Institutes of Health (NIH)-funded federated network with data contributed from over 50 Clinical and Translational Science Awards (CTSA) hubs, to test the algorithms across varied clinical settings.

Conclusion

The study underscores the effectiveness of NLP in extracting sleep information from the clinical notes of AD patients, with the rule-based algorithm showing the highest accuracy across all sleep concepts. Our findings demonstrate that the rule-based NLP algorithm consistently outperformed machine learning and LLM-based algorithms across all evaluated sleep concepts, showcasing its superior accuracy and reliability. This study focused on the clinical notes of patients with AD, but could be extended to general sleep information extraction for other diseases.

Furthermore, the methodologies and findings of this study have broader implications for the application of NLP in healthcare. The open-source nature of the developed rule-based NLP algorithm and the insights gained from comparing different NLP approaches can be leveraged by other researchers and practitioners to advance the extraction of health-related information from EHRs.

Author contributions

Sonish Sivarajkumar: conceptualized the study, wrote the manuscript; Thomas Yu Chow Tam: conducted data analysis; edited the manuscript; Haneef Ahamed Mohammad: conducted data analysis; edited the manuscript; Samuel Viggiano: conducted data analysis; edited the manuscript; David Oniani: conducted data analysis; edited the manuscript; Shyam Visweswaran: edited the manuscript; Yanshan Wang: conceptualized the study, wrote the manuscript.

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

This project was partially supported by the University of Pittsburgh Momentum Funds and the National Institutes of Health through Grant Numbers UL1TR001857, U24TR004111, and R01LM014306. The funders had no role in the design of the study; the collection, analysis, and interpretation of data; or the preparation of the manuscript. The views presented in this report are not necessarily representative of the funders’ views and belong solely to the authors.

Conflicts of interest

None declared.

Data availability

The NLP algorithm is publicly available through the Open Health Natural Language Processing (OHNLP) consortium at GitHub (https://github.com/OHNLP/nlp4sleep).

References

1. Alzheimer’s Association. 2018 Alzheimer’s disease facts and figures. Alzheimers Dement. 2018;14(3):367-429.

2. Alzheimer’s Association. 2019 Alzheimer’s disease facts and figures. Alzheimers Dement. 2019;15(3):321-387.

3. Jia J, Wei C, Chen S, et al. The cost of Alzheimer’s disease in China and re-estimation of costs worldwide. Alzheimers Dement. 2018;14(4):483-491.

4. Lechien JR, Georgescu BM, Hans S, Chiesa-Estomba CM. ChatGPT performance in laryngology and head and neck surgery: a clinical case-series. Eur Arch Otorhinolaryngol. 2024;281(1):319-333. doi:10.1007/s00405-023-08282-5

5. Feigin VL, Abajobir AA, Abate KH, et al.; GBD 2015 Neurological Disorders Collaborator Group. Global, regional, and national burden of neurological disorders during 1990–2015: a systematic analysis for the Global Burden of Disease Study 2015. Lancet Neurol. 2017;16(11):877-897.

6. Rosenberg A, Ngandu T, Rusanen M, et al. Multidomain lifestyle intervention benefits a large elderly population at risk for cognitive decline and dementia regardless of baseline characteristics: the FINGER trial. Alzheimers Dement. 2018;14(3):263-270.

7. Keage HAD, Banks S, Yang KL, Morgan K, Brayne C, Matthews FE. What sleep characteristics predict cognitive decline in the elderly? Sleep Med. 2012;13(7):886-892.

8. Cricco M, Simonsick EM, Foley DJ. The impact of insomnia on cognitive functioning in older adults. J Am Geriatr Soc. 2001;49(9):1185-1189.

9. Foley D, Monjan A, Masaki K, et al. Daytime sleepiness is associated with 3-year incident dementia and cognitive decline in older Japanese-American men. J Am Geriatr Soc. 2001;49(12):1628-1632.

10. Elwood PC, Bayer AJ, Fish M, Pickering J, Mitchell C, Gallacher JE. Sleep disturbance and daytime sleepiness predict vascular dementia. J Epidemiol Community Health. 2011;65(9):820-824.

11. Quesnot A, Alperovitch A. Snoring and risk of cognitive decline: a 4-year follow-up study in 1389 older individuals. J Am Geriatr Soc. 1999;47(9):1159-1160.

12. Tworoger SS, Lee S, Schernhammer ES, Grodstein F. The association of self-reported sleep duration, difficulty sleeping, and snoring with cognitive function in older women. Alzheimer Dis Assoc Disord. 2006;20(1):41-48.

13. Potvin O, Lorrain D, Forget H, et al. Sleep quality and 1-year incident cognitive impairment in community-dwelling older adults. Sleep. 2012;35(4):491-499.

14. Burke SL, Cadet T, Alcide A, O’Driscoll J, Maramaldi P. Psychosocial risk factors and Alzheimer’s disease: the associative effect of depression, sleep disturbance, and anxiety. Aging Ment Health. 2018;22(12):1577-1584.

15. Jelicic M, Bosma H, Ponds RW, Van Boxtel MP, Houx PJ, Jolles J. Subjective sleep problems in later life as predictors of cognitive decline. Report from the Maastricht Ageing Study (MAAS). Int J Geriatr Psychiatry. 2002;17(1):73-77.

16. Falck RS, Best JR, Davis JC, et al. Sleep and cognitive function in chronic stroke: a comparative cross-sectional study. Sleep. 2019;42(5):zsz040.

17. Chen W-C, Wang X-Y. Longitudinal associations between sleep duration and cognitive impairment in Chinese elderly. Front Aging Neurosci. 2022;14:1037650.

18. Blumenthal D. Launching HITECH. N Engl J Med. 2010;362(5):382-385.

19. Knutson KL, Pershing ML, Abbott S, et al. Study protocol for a longitudinal observational study of disparities in sleep and cognition in older adults: the DISCO study. BMJ Open. 2023;13(11):e073734.

20. Blackman J, Stankeviciute L, Arenaza-Urquijo EM, et al.; European Prevention of Alzheimer’s Disease (EPAD) Consortium. Cross-sectional and longitudinal association of sleep and Alzheimer biomarkers in cognitively unimpaired adults. Brain Commun. 2022;4(6):fcac257.

21. Perera G, Pedersen L, Ansel D, et al. Dementia prevalence and incidence in a federation of European Electronic Health Record databases: the European Medical Informatics Framework resource. Alzheimers Dement. 2018;14(2):130-139.

22. Chen L, Reed C, Happich M, Nyhuis A, Lenox-Smith A. Health care resource utilisation in primary care prior to and after a diagnosis of Alzheimer’s disease: a retrospective, matched case–control study in the United Kingdom. BMC Geriatr. 2014;14(1):76-79.

23. Poblador-Plou B, Calderón-Larrañaga A, Marta-Moreno J, et al. Comorbidity of dementia: a cross-sectional study of primary care older patients. BMC Psychiatry. 2014;14(1):84-88.

24. Mayeda ER, Glymour MM, Quesenberry CP, Whitmer RA. Inequalities in dementia incidence between six racial and ethnic groups over 14 years. Alzheimers Dement. 2016;12(3):216-224.

25. Wang Y, Wang L, Rastegar-Mojarad M, et al. Clinical information extraction applications: a literature review. J Biomed Inform. 2018;77:34-49.

26. Felder JN, Baer RJ, Rand L, Jelliffe-Pawlowski LL, Prather AA. Sleep disorder diagnosis during pregnancy and risk of preterm birth. Obstet Gynecol. 2017;130(3):573-581.

27. Hsiao Y-H, Chen Y-T, Tseng C-M, et al. Sleep disorders and increased risk of autoimmune diseases in individuals without sleep apnea. Sleep. 2015;38(4):581-586.

28. Ramesh J, Keeran N, Sagahyroon A, Aloul F, eds. Towards Validating the Effectiveness of Obstructive Sleep Apnea Classification from Electronic Health Records Using Machine Learning Healthcare. Multidisciplinary Digital Publishing Institute; 2021.

29. Larsen AJ, Rindal DB, Hatch JP, et al. Evidence supports no relationship between obstructive sleep apnea and premolar extraction: an electronic health records review. J Clin Sleep Med. 2015;11(12):1443-1448.

30. Jolley RJ, Liang Z, Peng M, et al. Identifying cases of sleep disorders through international classification of diseases (ICD) codes in administrative data. Int J Popul Data Sci. 2018;3(1):448.

31. Singer EV, Niarchou M, Maxwell-Horn A, et al. Characterizing sleep disorders in an autism-specific collection of electronic health records. Sleep Med. 2022;92:88-95.

32. Divita G, Luo G, Tran L-TT, Workman TE, Gundlapalli AV, Samore MH. General symptom extraction from VA electronic medical notes. In: MEDINFO 2017: Precision Healthcare through Informatics. IOS Press; 2017:356-360.

33. Jackson RG, Patel R, Jayatilleke N, et al. Natural language processing to extract symptoms of severe mental illness from clinical text: the Clinical Record Interactive Search Comprehensive Data Extraction (CRIS-CODE) project. BMJ Open. 2017;7(1):e012012.

34. Zhou L, Baughman AW, Lei VJ, et al. Identifying patients with depression using free-text clinical documents. In: MEDINFO 2015: eHealth-Enabled Health. IOS Press; 2015:629-633.

35. Irving J, Patel R, Oliver D, et al. Using natural language processing on electronic health records to enhance detection and prediction of psychosis risk. Schizophr Bull. 2021;47(2):405-414.

36. Kartoun U, Aggarwal R, Beam AL, et al. Development of an algorithm to identify patients with physician-documented insomnia. Sci Rep. 2018;8(1):7862-7869.

37. Tang H, Solti I, Kirkendall E, et al. Leveraging food and drug administration adverse event reports for the automated monitoring of electronic health records in a pediatric hospital. Biomed Inform Insights. 2017;9:1178222617713018.

38. Wang X, Hripcsak G, Markatou M, Friedman C. Active computerized pharmacovigilance using natural language processing, statistics, and electronic health records: a feasibility study. J Am Med Inform Assoc. 2009;16(3):328-337.

39. Devore EE, Grodstein F, Schernhammer ES. Sleep duration in relation to cognitive function among older adults: a systematic review of observational studies. Neuroepidemiology. 2016;46(1):57-78.

40. Liu H, Bielinski SJ, Sohn S, et al. An information extraction framework for cohort identification using electronic health records. AMIA Jt Summits on Transl Sci Proc. 2013;2013:149-153.

41. Ferrucci D, Lally A. UIMA: an architectural approach to unstructured information processing in the corporate research environment. Nat Lang Eng. 2004;10(3-4):327-348.

42. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv, arXiv:1301.3781, 2013, preprint: not peer reviewed.

43. Touvron H, Martin L, Stone K, et al. LLAMA 2: open foundation and fine-tuned chat models. arXiv, arXiv:2307.09288, 2023, preprint: not peer reviewed.

44. Sivarajkumar S, Kelley M, Samolyk-Mazzanti A, Visweswaran S, Wang Y. An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing: algorithm development and validation study. JMIR Med Inform. 2024;12:e55318.

45. Sivarajkumar S, Wang Y. HealthPrompt: a zero-shot learning paradigm for clinical natural language processing. AMIA Annu Symp Proc. 2022;2022:972-981.

46. Pennsylvania Department of Health. The State of Health Equity in Pennsylvania. Office of Health Equity; 2019.

47. Wang Y, Sohn S, Liu S, et al. A clinical text classification paradigm using weak supervision and deep representation. BMC Med Inform Decis Mak. 2019;19(1):1-13.