Abstract

Previous protein function predictors have primarily made predictions from amino acid sequences rather than tertiary structures because of the limited number of experimentally determined structures and the unsatisfactory quality of predicted structures. AlphaFold recently achieved promising performance in predicting protein tertiary structures, and the AlphaFold protein structure database (AlphaFold DB) is fast-expanding. We therefore aimed to develop a deep-learning tool that is specifically trained with AlphaFold models and predicts GO terms from them. We developed an advanced learning architecture that combines geometric vector perceptron graph neural networks with variant transformer decoder layers for multi-label classification. PANDA-3D predicts gene ontology (GO) terms from AlphaFold-predicted structures and amino acid sequence embeddings generated by a large language model. Our method significantly outperformed a state-of-the-art deep-learning method that was trained with experimentally determined tertiary structures, and either outperformed or was comparable with several other language-model-based state-of-the-art methods that take amino acid sequences as input. PANDA-3D is tailored to AlphaFold models, and the AlphaFold DB currently contains over 200 million predicted protein structures (as of 1 May 2023), making PANDA-3D a useful tool for accurately annotating the functions of a large number of proteins. PANDA-3D can be freely accessed as a web server at http://dna.cs.miami.edu/PANDA-3D/ and as a repository at https://github.com/zwang-bioinformatics/PANDA-3D.

Introduction

Proteins, the essential functional units of life, play crucial roles in catalyzing biochemical reactions (1), providing structural support for the anaphase spindle (2), and regulating gene expression (3,4). Accurately annotating protein functions is important for understanding biological processes and discovering novel drug targets (5,6). However, experimentally determining the functions of proteins is both laborious and expensive (7), whereas machine learning approaches can decrease the time and cost required for this task, making accurate and comprehensive annotation possible and offering a promising avenue for protein function prediction.

Amino acid sequences determine protein structures (8,9), and protein structures determine protein functions (10). Therefore, proteins that share similar sequences may have similar functions. For a considerable length of time, protein function predictors focused on using machine learning methods to leverage sequence alignments, as revealed by the critical assessment of functional annotation (CAFA) challenge. The keyword analyses of the top ten methods in CAFA2 (11) and of all participating methods in CAFA3 (7) show that sequence alignment and machine learning are the two most frequently used approaches. Our previous tool PANDA (12) uses profile-profile alignments and PSI-BLAST (13) to find similar proteins, detects conserved protein domains, executes a Bayesian model to infer GO terms from domain architectures, and then combines the candidate GO terms from these approaches to make final predictions. DeepGOPlus (14) uses a one-dimensional (1D) convolutional neural network (CNN) to predict protein functions from amino acid sequences. GODoc (15) applies a k-nearest-neighbor algorithm to sequence information, such as amino acid-coupling pattern representations, to predict protein functions. DEEPred (16) makes predictions by feeding sequence features, such as subsequence profile maps and pseudo amino acid composition, into stacked feed-forward deep neural networks (DNNs) followed by a hierarchical post-processing method. ProLanGO (17) uses a recurrent neural network (RNN)-based machine translation model to predict protein functions from protein sequences.

In addition to extracting knowledge from protein sequence alignments, some newer methods have leveraged protein language models in the last few years. Our PANDA2 (18) utilizes a graph neural network (GNN) to model the GO directed acyclic graph (GO-DAG) topology and incorporates features generated by a protein large language model (LLM) (19). UDSMProt (20) uses a self-supervised RNN to learn task-agnostic representations of sequences, which are then fine-tuned on the downstream task of protein function prediction. Littmann et al. (21) found that predicting GO terms based on the proximity of embeddings from the language models SeqVec (22) or ProtBert (23) outperformed naïve sequence-based transfer. Rives et al. (19) trained a deep transformer (24) on about 250 million sequences, and the embeddings generated by this evolutionary-scale language modeling (ESM) contain information on protein structure, function, and binding, which outperformed others in a variety of downstream tasks (19). This pre-trained language model is used by many state-of-the-art methods, such as ATGO (25), DeepGO-SE (26), SPROF-GO (27) and NetGO 3.0 (28).

Following the recent successful prediction of protein three-dimensional (3D) structures, a few methods have shifted the focus towards integrating sequence alignment, protein structure, and machine learning. DeepFRI (29) applies a graph convolutional network (GCN) to the features generated by a protein language model and experimentally determined structures. Because the DeepFRI model was trained on experimentally determined structures, its performance on predicted models was worse than on experimentally determined structures (29). Also, DeepFRI does not fully capture the 3D information because the input tertiary structure is first converted to a 2D contact map before being fed into the GCN architecture. Similarly, GAT-GO (30) also converts the input tertiary structure to inter-residue contacts first and then feeds the 2D contact map into its architecture. COFACTOR (31) predicts GO terms by combining independent predictions from structure-based, sequence-based, and protein-protein-interaction-based pipelines (32).

Advanced learning architectures have been developed to extract knowledge from protein tertiary structures for research topics including protein sequence design (33,34), generative models of proteins (35) and inverse folding (36). When designing PANDA-3D, we believed and later verified that these learning architectures can comprehensively and efficiently capture knowledge from protein 3D structures. Usually, two formats of features can be extracted from protein 3D structures: vector features and scalar features. Vector features can be the orientations of residues in the protein structure, whereas scalar features can be distances and angles. Popular GNN frameworks, such as that of Battaglia et al. (37), usually cannot operate on scalar and vector features simultaneously. The geometric vector perceptron (GVP) was specially designed for learning 3D macromolecular structures with scalar and vector channels (38,39). GVPs have proven advantageous over convolutional neural networks and graph neural networks for model quality assessment and computational protein design (39). Hsu et al. (36) revealed that the geometric reasoning capability of GVP-GNN layers is complementary to transformer layers for inverse folding. Therefore, we incorporated a GVP-GNN in the learning network of PANDA-3D, and its output is fed into a transformer decoder.

The transformer model (40) has shown great success in a wide range of tasks, such as natural language processing (24) and vision tasks (41). However, some protein function predictors utilize only the encoder blocks in their architectures (42,43), since the decoder block was not originally designed for multi-label classification. We feed all GO terms used for prediction into the transformer decoder so that the decoder can learn the relationships or co-occurrence patterns among all possible GO terms, which are then combined with the output from the encoder by cross-attention.

Materials and methods

PANDA-3D architecture

The architecture of PANDA-3D is depicted in Figure 1. We developed the encoder of PANDA-3D by modifying the GVP-transformer encoder blocks proposed for inverse folding (36). This modified encoder utilizes the geometric reasoning capability of GVP-GNN layers (39). In addition, we implemented decoder layers specialized for multi-label classification, inspired by Query2Label (44). We feed all PANDA-3D candidate classes, which are GO terms, as the query into the decoder, allowing PANDA-3D to compute self-attention over the GO terms. The updated query obtained from the self-attention layer is subsequently used to calculate cross-attention with the output from the encoder layers. This enables PANDA-3D to learn the relationships between GO terms and the structure and sequence features through the cross-attention layer.

Figure 1. The overall architecture of PANDA-3D. Panel A shows that GVP-GNN layers are used to extract information from predicted structures and protein sequence embeddings, followed by a variant transformer decoder that produces confidence scores over the query GO terms. Panel B shows the scalar and vector channels of a GVP layer. Panel C shows the architecture of a transformer decoder layer for multi-label classification.

Features

Structure-based features fed into GVP layers

To generate embeddings from AlphaFold-predicted structures, we extracted several features: the Euclidean distances of the top k neighbors for each residue, the vectors from the alpha carbon (hereafter referred to as Cα) of a source residue to the Cα of a destination residue (edge vectors), the vectors between the same type of atoms in the forward and backward adjacent residues (orientation vectors), the unit vectors from nitrogen (hereafter referred to as N) to Cα and from Cα to carbon (hereafter referred to as C) (side-chain vectors), and the predicted local-distance difference test (pLDDT) score of each Cα predicted by AlphaFold. These features are referred to as structure-based features and are input into the GVP layers.
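To make these definitions concrete, the following is a minimal NumPy sketch (not the authors' released code) of how such features could be computed from per-residue backbone coordinates; the array names and the choice of k are illustrative.

```python
import numpy as np

def structure_features(ca, n, c, plddt, k=30):
    """ca, n, c: (L, 3) coordinates of the Ca, N and C atoms; plddt: (L,) scores."""
    L = len(ca)
    offsets = ca[None, :, :] - ca[:, None, :]            # (L, L, 3): offset[i, j] = ca[j] - ca[i]
    dist = np.linalg.norm(offsets, axis=-1)              # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)                       # exclude self-neighbors
    knn = np.argsort(dist, axis=1)[:, :k]                # indices of the top-k neighbors
    knn_dist = np.take_along_axis(dist, knn, axis=1)     # scalar feature: (L, k)
    edge_vec = offsets[np.arange(L)[:, None], knn]       # vector feature: (L, k, 3)

    # Orientation vectors: offsets to the forward and backward adjacent Ca.
    fwd = np.zeros_like(ca)
    bwd = np.zeros_like(ca)
    fwd[:-1] = ca[1:] - ca[:-1]
    bwd[1:] = ca[:-1] - ca[1:]

    # Unit vectors N -> Ca and Ca -> C (the side-chain vectors above).
    n_ca = (ca - n) / np.linalg.norm(ca - n, axis=-1, keepdims=True)
    ca_c = (c - ca) / np.linalg.norm(c - ca, axis=-1, keepdims=True)
    return knn_dist, edge_vec, fwd, bwd, n_ca, ca_c, plddt
```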

To prevent inaccurately predicted residues from degrading our predictions, we not only masked the 3D coordinates of residues with pLDDT scores below 0.9 but also combined the embeddings of the pLDDT scores with the structure-based features in the GVP layers. We tested a range of pLDDT score thresholds from 0.5 to 1 with a 0.1 interval. The pLDDT threshold of 0.9 removed approximately 53.52% of residues from the training and validation data. The deep learning model trained with the 0.9 threshold achieved the best validation loss (Supplementary Table S1).
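A sketch of the masking step is below; the 0.9 cutoff comes from the validation sweep above, while the NaN convention for flagging masked coordinates is an assumption of this sketch.

```python
import torch

PLDDT_CUTOFF = 0.9  # threshold selected by validation loss (Supplementary Table S1)

def mask_low_confidence(coords: torch.Tensor, plddt: torch.Tensor):
    """coords: (L, 3) Ca coordinates; plddt: (L,) scores scaled to [0, 1]."""
    keep = plddt >= PLDDT_CUTOFF
    masked = coords.clone()
    masked[~keep] = float("nan")   # downstream layers treat NaN coordinates as masked
    return masked, keep
```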

Features such as orientation vectors and side-chain vectors have directions and are therefore considered vector features, whereas others, like Euclidean distances and pLDDTs, are scalar features. To unify the dimensions of the channels, we embedded these features into a space of 128 dimensions before inputting them into the GVP layers.

Structural-rotation and dihedral features

We first discuss the rotation-equivariance and rotation-invariance properties, as these two concepts are important for understanding why we apply a rotation to the query tertiary structure. Rotation-equivariant features satisfy f(T(structure)) = T(f(structure)), where T is a rotation function and f is a feature-generating function: rotating the structure rotates the features accordingly, so in general f(T(structure)) ≠ f(structure). Vector features, which contain directions, such as edge vectors and orientation vectors, are rotation-equivariant.

Scalar features, on the other hand, are rotation-invariant, such as dihedrals and Euclidean distances; they satisfy f(T(structure)) = f(structure), meaning that the output remains the same even when the input is rotated.
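These two properties can be checked numerically; the sketch below (an illustration, not from the paper) rotates a toy structure and verifies that a vector feature rotates with it while a scalar feature does not change.

```python
import numpy as np
from scipy.spatial.transform import Rotation

coords = np.random.rand(10, 3)                       # toy "structure"
R = Rotation.random().as_matrix()                    # random rotation T

edge = lambda x: x[1:] - x[:-1]                      # vector feature f
dist = lambda x: np.linalg.norm(edge(x), axis=-1)    # scalar feature f

# Rotation-equivariant: f(T(structure)) = T(f(structure))
assert np.allclose(edge(coords @ R.T), edge(coords) @ R.T)
# Rotation-invariant: f(T(structure)) = f(structure)
assert np.allclose(dist(coords @ R.T), dist(coords))
```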

A GVP layer is rotation-equivariant, but the biological functions of proteins are not affected by rotations of protein structures. Therefore, for each input tertiary structure, we unify its orientation in 3D space by computing its local reference frames, which are defined based on the positions of the N, Cα and C atoms of each amino acid, using the implementation (36) of the algorithm proposed in (45).

Dihedral angles are also used as input; each is defined as the angle between the two hyperplanes formed by the N, Cα and C atoms of adjacent residues.
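A minimal sketch of the standard dihedral computation for four sequential backbone atoms is below (illustrative; the released code follows the inverse-folding implementation (36)).

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in radians between the planes (p0, p1, p2) and (p1, p2, p3)."""
    b0, b1, b2 = p1 - p0, p2 - p1, p3 - p2
    n1 = np.cross(b0, b1)                  # normal of the first plane
    n2 = np.cross(b1, b2)                  # normal of the second plane
    m1 = np.cross(n1, b1 / np.linalg.norm(b1))
    return np.arctan2(np.dot(m1, n2), np.dot(n1, n2))
```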

Sequential features

The evolutionary-scale language modeling (ESM) (19) embedding of the amino acid sequence of the query protein is also used as an input to the GVP layers. Furthermore, the token, which is a number encoding each amino acid, is input into a summation function that adds the embeddings of the tokens, rotation-invariant features, dihedral angles, ESM features, and pLDDTs. We set the embedding dimension of all of these features to 128, which makes the summation of these different features possible. We then pass the summed features to the transformer layers, which are discussed in a later section.
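A sketch of this summation step follows; only the shared 128-dimensional embedding size is from the text, while the token vocabulary size, dihedral layout and ESM dimension are assumptions of the sketch.

```python
import torch
import torch.nn as nn

D = 128                                   # shared embedding dimension (from the text)
token_emb    = nn.Embedding(33, D)        # amino-acid tokens (vocabulary size assumed)
dihedral_emb = nn.Linear(6, D)            # e.g. sin/cos of three dihedrals (assumed)
esm_proj     = nn.Linear(1280, D)         # per-residue ESM embedding size (assumed)
plddt_emb    = nn.Linear(1, D)

def summed_features(tokens, dihedrals, esm, plddt):
    """All inputs are per-residue; embedding to a common size allows simple addition."""
    return (token_emb(tokens) + dihedral_emb(dihedrals)
            + esm_proj(esm) + plddt_emb(plddt.unsqueeze(-1)))
```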

GVP layers

We feed the embeddings of Euclidean distances, edge vectors, orientation vectors, side-chain vectors, pLDDTs, and ESM into the GVP layers (39). PANDA-3D has two GVP layers, which encode the sequential and structural features into intermediate features. As depicted in Figure 1B, a GVP layer has scalar and vector channels, into which rotation-invariant scalar features $S_n \in \mathbb{R}^n$ and rotation-equivariant vector features $V_v \in \mathbb{R}^{v \times 3}$ are fed, respectively.

$$V_h = W_h V_v \qquad (1)$$

The vector features $V_v$ are updated by a linear layer $W_h$ to $V_h \in \mathbb{R}^{h \times 3}$, where $h = \max(v, n)$.

$$S_h = \lVert V_h \rVert_2 \qquad (2)$$

We apply a row-wise L2 norm to transform the vector features $V_h$ into scalar features $S_h$.

$$S_m = W_m \, [S_h; S_n] + b \qquad (3)$$

The scalar features $S_h$ and $S_n$ are concatenated and then updated to $S_m$ by a linear layer $W_m$.

$$V_\mu = W_\mu V_h \qquad (4)$$

Another linear layer, $W_\mu$, is then applied to update $V_h$ to $V_\mu$.

$$v_\mu = \lVert V_\mu \rVert_2 \qquad (5)$$
$$V' = \sigma^{+}(v_\mu) \odot V_\mu \qquad (6)$$

One of the final outputs of the GVP layers, $V'$, is calculated by applying a nonlinear scale function $\sigma^{+}$ to the row-wise norms $v_\mu$ of $V_\mu$, followed by a row-wise multiplication with $V_\mu$.

$$S' = \sigma(S_m) \qquad (7)$$

The other output of the GVP layers, $S'$, is calculated by applying a nonlinear scale function $\sigma$ to $S_m$.
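Putting equations (1)-(7) together, the following is a minimal PyTorch sketch of a single GVP in the style of (39); the choices of a sigmoid for $\sigma^{+}$ and a ReLU for $\sigma$ mirror common GVP implementations and are assumptions here.

```python
import torch
import torch.nn as nn

class GVP(nn.Module):
    def __init__(self, n, v, m, mu):
        super().__init__()
        h = max(v, n)
        self.W_h  = nn.Linear(v, h, bias=False)    # eq (1): update the vector channel
        self.W_m  = nn.Linear(h + n, m)            # eq (3): update the scalar channel
        self.W_mu = nn.Linear(h, mu, bias=False)   # eq (4)

    def forward(self, S_n, V_v):
        """S_n: (..., n) scalar features; V_v: (..., v, 3) vector features."""
        V_h = self.W_h(V_v.transpose(-1, -2)).transpose(-1, -2)    # eq (1)
        S_h = V_h.norm(dim=-1)                                     # eq (2): row-wise L2
        S_m = self.W_m(torch.cat([S_h, S_n], dim=-1))              # eq (3)
        V_mu = self.W_mu(V_h.transpose(-1, -2)).transpose(-1, -2)  # eq (4)
        v_mu = V_mu.norm(dim=-1, keepdim=True)                     # eq (5)
        V_out = torch.sigmoid(v_mu) * V_mu                         # eq (6): sigma+ gate
        S_out = torch.relu(S_m)                                    # eq (7)
        return S_out, V_out
```

Because every operation on the vector channel is either a linear map or a gating by a rotation-invariant norm, $V'$ stays rotation-equivariant while $S'$ stays rotation-invariant.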

Transformer layers

As depicted in Figure 1A and C, we applied decoder layers (44) that are implemented differently from the original transformer architecture (40). The original architecture (40) generates sentences or sequences of words in machine translation tasks with masking during training, whereas our task is multi-label classification without attention masks. Each decoder layer has a multi-head self-attention layer, a multi-head cross-attention layer, and a linear feed-forward layer. In the decoder self-attention layer, all of the keys, values and queries are the embeddings of the candidate GO terms.

A decoder multi-head cross-attention layer calculates attention with the keys and values from the encoder output and the queries from the previous self-attention layer. After that, a linear layer is applied, followed by a sigmoid function that generates the final outputs, which can be considered the confidence scores for the candidate GO terms.
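The sketch below illustrates this decoder design in PyTorch, in the spirit of Query2Label (44); the layer count, dimensions and omitted normalization layers are simplifications rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GODecoderLayer(nn.Module):
    def __init__(self, n_go_terms, d=128, heads=8):
        super().__init__()
        self.queries = nn.Embedding(n_go_terms, d)       # one learnable query per GO term
        self.self_attn  = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ff   = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.head = nn.Linear(d, 1)

    def forward(self, encoder_out):
        """encoder_out: (B, L, d) per-residue features from the GVP-GNN encoder."""
        B = encoder_out.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        q = q + self.self_attn(q, q, q)[0]                        # GO-term co-occurrence
        q = q + self.cross_attn(q, encoder_out, encoder_out)[0]   # attend to residues
        q = q + self.ff(q)
        return torch.sigmoid(self.head(q)).squeeze(-1)            # (B, n_go_terms) scores
```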

Implementation and training details

We implemented PANDA-3D based on code from the PyTorch library (46), ESM inverse folding (36), GVP (39) and the Transformer (40). The model has eight million trainable parameters. We trained the model in parallel on two NVIDIA A100 GPUs with 40 GB of memory each; the model converged after about 18 hours of training. We utilized a binary cross-entropy loss function for training. To account for the significant imbalance between the numbers of negatively labeled and positively labeled GO terms in our training dataset, we assigned a weight of 3.0 to all of the positive classes when computing the loss. We used a batch size of eight, eight attention heads, and a learning rate of 0.0001 with the Adam optimizer. We conducted experiments with different numbers of GVP layers, numbers of decoder layers, learning rates, and batch sizes; the performance results are available in Supplementary Tables S2 and S3.
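A minimal sketch of the weighted binary cross-entropy described above follows: positive GO labels get a weight of 3.0 to offset their rarity. The tensor shapes are illustrative (3948 = 438 MFO + 3105 BPO + 405 CCO candidate terms); when training on pre-sigmoid logits, `torch.nn.BCEWithLogitsLoss(pos_weight=...)` implements the same idea.

```python
import torch

def weighted_bce(scores, targets, pos_weight=3.0, eps=1e-7):
    """scores: sigmoid confidences in (0, 1); targets: 0/1 labels."""
    scores = scores.clamp(eps, 1 - eps)
    loss = -(pos_weight * targets * torch.log(scores)
             + (1 - targets) * torch.log(1 - scores))
    return loss.mean()

scores  = torch.rand(8, 3948)                      # a batch of eight proteins
targets = (torch.rand(8, 3948) < 0.01).float()     # sparse positive labels
print(weighted_bce(scores, targets))
```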

Computational time and scalability

Our benchmark showed that the runtime of PANDA-3D for a protein of 1093 residues on a Tesla V100 was 36 seconds (not including the time for generating the AlphaFold model), with GPU memory usage of about 1 GB. A limitation of PANDA-3D is that it needs an AlphaFold model as input, but AlphaFold DB includes models for more than 200 million proteins (47). For cases in which a model is not included in AlphaFold DB, a user can generate an AlphaFold model using DeepMind's Colab notebook or open-source code linked from https://alphafold.ebi.ac.uk/ (45,47).

Datasets

We downloaded the manually reviewed protein sequences and experimentally determined (with the evidence codes EXP, IDA, IMP, IGI, IEP, TAS or IC) protein functions in the format of GO terms from Swiss-Prot (48), released on 25 May 2022. All three ontologies of GO terms were used: molecular function ontology (MFO), biological process ontology (BPO), and cellular component ontology (CCO). The downloaded GO terms are nodes in the GO directed acyclic graphs (DAGs). For training, we propagated the annotations up to the root node, so that each annotated node and all of its ancestors are considered positively labeled GO terms for a protein. We used the GO definition released on 1 July 2022 (49). We downloaded all of the predicted protein tertiary structures as of May 2022 from AlphaFold DB (45,47).
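A sketch of this propagation step is below (illustrative; the `parents` mapping would be built from the GO release named above).

```python
def propagate(annotations, parents):
    """annotations: iterable of GO ids; parents: dict GO id -> iterable of parent ids."""
    labels = set()
    stack = list(annotations)
    while stack:
        term = stack.pop()
        if term not in labels:
            labels.add(term)                      # the term itself is positively labeled
            stack.extend(parents.get(term, ()))   # ...and so is every ancestor
    return labels
```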

A total of 68,523 proteins have both experimentally determined GO terms and AlphaFold models. We randomly split the proteins into training (80%), validation (10%), and testing (10%) sets. Protein sequences from the testing set were searched against the training datasets of PANDA-3D and the other tools that we compared with, and those with a maximum PSI-BLAST (13) identity score greater than a cut-off value were removed. This ensures that the blind test dataset has no overlap with the training datasets, allowing a fair comparison of performance. We tested the performance with different identity score cut-off values, shown in a later section.

We only used GO terms annotated to at least 50 proteins in the training dataset as the candidate GO terms, or machine-learning target classes. This ensures that the GO terms PANDA-3D can predict have enough training data and good accuracy, at the cost of reducing the number of GO terms PANDA-3D can predict. We benchmarked two minimum numbers of proteins (45 and 50) for a candidate GO term to be included, and 50 resulted in slightly better performance (data not shown). The numbers of proteins in the training, validation, and testing datasets and the numbers of GO terms or machine-learning classes are shown in Table 1.

Evaluation metrics

We evaluated the methods using both protein- and term-centric evaluation measures, which were the official assessment measures used in CAFA2, CAFA3 and CAFA-π (7,11). Both measures were performed on propagated predictions and propagated ground truths. In propagated predictions, a GO term's confidence score is updated to the highest predicted score among its descendant GO terms. All of the ancestor GO terms of the downloaded leaf GO terms are considered positively annotated. We used the maximum F-measure (Fmax), minimum semantic distance (Smin), and area under the precision-recall curve (AUPR) for the protein-centric evaluations, and the area under the receiver operating characteristic (ROC) curve (AUROC) for the term-centric evaluation, as performed in CAFA (7,11,50).
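A sketch of the prediction-propagation rule follows (illustrative data structures), processing terms so that every child's score is final before its parents are updated.

```python
def propagate_scores(scores, children, topo_order):
    """scores: dict GO id -> confidence; topo_order lists parents before children."""
    for term in reversed(topo_order):            # visit leaves first
        for child in children.get(term, ()):
            score = scores.get(child, 0.0)
            if score > scores.get(term, 0.0):
                scores[term] = score             # parent takes the max of its descendants
    return scores
```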

The Fmax is calculated over a set of confidence-score thresholds as:

$$F_{\max} = \max_{t} \left\{ \frac{2 \cdot pr(t) \cdot rc(t)}{pr(t) + rc(t)} \right\} \qquad (8)$$

where pr(t) and rc(t) are precision and recall for all testing proteins over a threshold t, respectively. They are calculated as:

$$pr(t) = \frac{1}{m(t)} \sum_{i=1}^{m(t)} pr_i(t) \qquad (9)$$
$$rc(t) = \frac{1}{n} \sum_{i=1}^{n} rc_i(t) \qquad (10)$$

where m(t) represents the number of proteins with at least one GO term having a confidence score greater than or equal to t, and n denotes the total number of testing proteins. $pr_i(t)$ and $rc_i(t)$ are the precision and recall of protein i, calculated as:

$$pr_i(t) = \frac{\sum_{f} I(f \in P_i \wedge f \in T_i)}{\sum_{f} I(f \in P_i)} \qquad (11)$$
$$rc_i(t) = \frac{\sum_{f} I(f \in P_i \wedge f \in T_i)}{\sum_{f} I(f \in T_i)} \qquad (12)$$

where $P_i$ represents the set of predicted GO terms with confidence scores greater than or equal to t, $T_i$ represents the set of true GO terms, f denotes a GO term, and I represents an indicator function. When a maximum identity cut-off value was applied, we calculated the precision for a protein only when $P_i$ contained at least one GO term.
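As a sketch, equations (8)-(12) can be computed as follows (illustrative data structures; `preds` maps each protein to its per-term confidence scores and `truth` maps each protein to its propagated true terms).

```python
import numpy as np

def fmax(preds, truth, thresholds=np.arange(0.01, 1.0, 0.01)):
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for prot, true_terms in truth.items():
            P = {f for f, s in preds.get(prot, {}).items() if s >= t}
            tp = len(P & true_terms)
            if P:                                    # protein counts toward m(t)
                precisions.append(tp / len(P))       # eq (11)
            recalls.append(tp / len(true_terms))     # eq (12), averaged over all n
        if precisions:
            pr, rc = np.mean(precisions), np.mean(recalls)   # eqs (9) and (10)
            if pr + rc > 0:
                best = max(best, 2 * pr * rc / (pr + rc))    # eq (8)
    return best
```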

The calculation of Smin takes into account the unbalanced information content (IC) of GO terms. The IC of a GO term is calculated over all of the 68,523 proteins as follows:

$$IC(f) = -\log\!\left( \frac{Occur_f}{Occur_{all\_terms}} \right) \qquad (13)$$

where f is a GO term, $Occur_f$ indicates the occurrences of f and its descendants, and $Occur_{all\_terms}$ is the total occurrence of all GO terms.

The Smin is calculated as:

$$S_{\min} = \min_{t} \sqrt{ru(t)^2 + mi(t)^2} \qquad (14)$$

where ru(t) and mi(t) are the averages of the remaining uncertainty $ru_i(t)$ and the misinformation $mi_i(t)$, respectively. These measures are calculated using the following formulas:

$$ru(t) = \frac{1}{n} \sum_{i=1}^{n} ru_i(t) \qquad (15)$$
$$mi(t) = \frac{1}{n} \sum_{i=1}^{n} mi_i(t) \qquad (16)$$
$$ru_i(t) = \sum_{f} IC(f) \cdot I(f \notin P_i \wedge f \in T_i) \qquad (17)$$
$$mi_i(t) = \sum_{f} IC(f) \cdot I(f \in P_i \wedge f \notin T_i) \qquad (18)$$

where i indexes a protein, f represents a GO term, and I represents an indicator function.
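And a matching sketch of equations (14)-(18), given per-term information content `ic` computed as in equation (13):

```python
import numpy as np

def smin(preds, truth, ic, thresholds=np.arange(0.01, 1.0, 0.01)):
    """ic: dict GO term -> information content; see eq (13)."""
    best = float("inf")
    n = len(truth)
    for t in thresholds:
        ru = mi = 0.0
        for prot, true_terms in truth.items():
            P = {f for f, s in preds.get(prot, {}).items() if s >= t}
            ru += sum(ic.get(f, 0.0) for f in true_terms - P)   # eq (17): missed terms
            mi += sum(ic.get(f, 0.0) for f in P - true_terms)   # eq (18): spurious terms
        best = min(best, np.sqrt((ru / n) ** 2 + (mi / n) ** 2))  # eqs (14)-(16)
    return best
```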

Results

Overview

We performed both protein- and term-centric evaluations on the testing dataset. We compared PANDA-3D to a similar method, DeepFRI, which also utilizes tertiary structures and sequences for protein function prediction; in our testing, we used AlphaFold-predicted structures instead of experimentally determined structures. We were unable to compare our approach with another similar method, GAT-GO, as neither its training data nor its trained model is available.

We used two baseline methods: Naïve and BLAST. The Naïve baseline predicts GO terms based on the relative frequency of each GO term in the UniProt Swiss-Prot database. The BLAST baseline predicts GO terms by transferring the experimental GO terms of similar sequences found in the training dataset using PSI-BLAST (51), where the predicted scores are the maximum identity scores.

On the same dataset as PANDA-3D, we also trained and evaluated DeepGOCNN, the neural-network component of DeepGOPlus (14), which is a CNN-based network designed to directly predict protein functions from amino acid sequences.

On the second and third testing datasets, we performed protein-centric evaluations to compare PANDA-3D with DeepGO-SE and with SPROF-GO, respectively. Details about the methodologies and/or training of the baseline methods, DeepGOCNN, DeepFRI, DeepGO-SE, and SPROF-GO can be found in the supplementary materials.

The performance of PANDA-3D on Fmax, Smin and AUPR

Figure 2 shows the precision-recall curves and Fmax scores of PANDA-3D, DeepFRI, DeepGOCNN, Naïve and BLAST, indicating that PANDA-3D outperforms DeepFRI and all baseline methods. Table 2 shows the performance of these methods in terms of Fmax, Smin and AUPR on the first testing dataset, labeled as 'DeepFRI' in Table 1, with the maximum sequence identity cutoff of 0.95. PANDA-3D achieved the best Fmax scores for all three GO categories: 0.642 for MFO, 0.471 for BPO, and 0.705 for CCO. PANDA-3D also outperforms all other methods in terms of Smin and AUPR.

Table 1. The number of proteins in the training, validation and testing datasets, and the number of GO terms or machine-learning classes. The sequences in the three separate testing datasets were searched against the training datasets of PANDA-3D and DeepFRI, PANDA-3D and DeepGO-SE, and PANDA-3D and SPROF-GO, respectively. Testing sequences with a maximum sequence identity score greater than the cut-off value were removed.

Dataset                                     # of proteins
Training                                    51,245
Validation                                  6,419
Testing (DeepFRI), identity cutoff 0.95     4,719
Testing (DeepFRI), identity cutoff 0.8      3,533
Testing (DeepFRI), identity cutoff 0.7      3,024
Testing (DeepFRI), identity cutoff 0.6      2,506
Testing (DeepFRI), identity cutoff 0.5      1,917
Testing (DeepFRI), identity cutoff 0.4      1,167
Testing (DeepGO-SE), identity cutoff 0.95   260
Testing (DeepGO-SE), identity cutoff 0.8    198
Testing (DeepGO-SE), identity cutoff 0.7    176
Testing (DeepGO-SE), identity cutoff 0.6    154
Testing (DeepGO-SE), identity cutoff 0.5    127
Testing (DeepGO-SE), identity cutoff 0.4    91
Testing (SPROF-GO), identity cutoff 0.95    461
Testing (SPROF-GO), identity cutoff 0.8     351
Testing (SPROF-GO), identity cutoff 0.7     282
Testing (SPROF-GO), identity cutoff 0.6     234
Testing (SPROF-GO), identity cutoff 0.5     183
Testing (SPROF-GO), identity cutoff 0.4     117

Ontology       # of GO terms
MFO classes    438
BPO classes    3,105
CCO classes    405
Table 2. The performance of PANDA-3D, DeepFRI, DeepGOCNN, Naïve and BLAST in terms of Fmax, Smin and AUPR. The highest Fmax, the smallest Smin, and the highest AUPR in each column are marked with an asterisk (*). The benchmark was performed on the testing dataset labeled as 'DeepFRI' in Table 1.

             Fmax                       Smin                          AUPR
Method       MFO     BPO     CCO        MFO      BPO      CCO         MFO     BPO     CCO
Naïve        0.322   0.311   0.605      11.381   49.404   12.728      0.197   0.227   0.525
BLAST        0.567   0.407   0.534      9.269    49.211   12.566      0.498   0.305   0.465
DeepGOCNN    0.469   0.387   0.658      9.834    46.861   11.635      0.444   0.336   0.690
DeepFRI      0.435   0.352   0.477      9.894    48.079   12.560      0.305   0.257   0.377
PANDA-3D     0.642*  0.471*  0.705*     7.290*   43.601*  10.027*     0.654*  0.445*  0.766*
Figure 2. The Fmax scores and precision-recall curves of PANDA-3D, DeepFRI, DeepGOCNN, Naïve and BLAST. The benchmark was performed on the testing dataset labeled as 'DeepFRI' in Table 1 with the maximum sequence identity cutoff of 0.95.

Figure 3 presents the precision-recall curves and Fmax scores of PANDA-3D and DeepGO-SE. PANDA-3D outperforms DeepGO-SE for BPO and CCO over nearly the entire precision-recall range while exhibiting comparable performance for MFO. In Figure 4, the precision-recall curves and Fmax scores of PANDA-3D and SPROF-GO show that PANDA-3D outperforms or is comparable to SPROF-GO for BPO and CCO but is slightly worse for MFO.

Figure 3. The Fmax scores and precision-recall curves of PANDA-3D and DeepGO-SE. The benchmark was performed on the testing dataset labeled as 'DeepGO-SE' in Table 1 with the maximum sequence identity cutoff of 0.95.

Figure 4. The Fmax scores and precision-recall curves of PANDA-3D and SPROF-GO. The benchmark was performed on the testing dataset labeled as 'SPROF-GO' in Table 1 with the maximum sequence identity cutoff of 0.95.

Table 3 shows the performance of PANDA-3D and DeepGO-SE in terms of Fmax, Smin and AUPR on the second testing dataset, labeled as 'DeepGO-SE' in Table 1, with the maximum sequence identity cutoff of 0.95. PANDA-3D outperformed DeepGO-SE in eight out of nine scores. Table 4 shows that PANDA-3D outperformed SPROF-GO in BPO and CCO, winning six out of nine scores on the third testing dataset, labeled as 'SPROF-GO' in Table 1, with the maximum sequence identity cutoff of 0.95.

Table 3. The performance of PANDA-3D and DeepGO-SE in terms of Fmax, Smin and AUPR. The highest Fmax, the smallest Smin, and the highest AUPR in each column are marked with an asterisk (*). The benchmark was performed on the testing dataset labeled as 'DeepGO-SE' in Table 1 with the maximum sequence identity cutoff of 0.95.

             Fmax                       Smin                          AUPR
Method       MFO     BPO     CCO        MFO      BPO      CCO         MFO     BPO     CCO
DeepGO-SE    0.491*  0.413   0.641      7.665    30.897   9.925       0.447   0.368   0.554
PANDA-3D     0.486   0.450*  0.692*     7.526*   29.306*  8.837*      0.470*  0.409*  0.740*
Table 4. The performance of PANDA-3D and SPROF-GO in terms of Fmax, Smin and AUPR. The highest Fmax, the smallest Smin, and the highest AUPR in each column are marked with an asterisk (*). The benchmark was performed on the testing dataset labeled as 'SPROF-GO' in Table 1 with the maximum sequence identity cutoff of 0.95.

             Fmax                       Smin                          AUPR
Method       MFO     BPO     CCO        MFO      BPO      CCO         MFO     BPO     CCO
SPROF-GO     0.669*  0.446   0.659      5.470*   31.840   8.432       0.669*  0.409   0.642
PANDA-3D     0.614   0.462*  0.668*     5.690    31.232*  7.821*      0.641   0.422*  0.720*

Figure 5 displays the Fmax scores of PANDA-3D and the other methods at different maximum identity cutoffs. The performance of BLAST improves as the cutoff value increases, while the performance of the other methods is not significantly affected by the cutoffs. PANDA-3D outperforms all other methods on Fmax at every cutoff. Figure 6 plots the Fmax scores of PANDA-3D and DeepGO-SE at different maximum sequence identity cutoffs; PANDA-3D almost always outperformed DeepGO-SE. Figure 7 plots the Fmax scores of PANDA-3D and SPROF-GO at different maximum sequence identity cutoffs. PANDA-3D outperformed SPROF-GO at the 80% and 95% cutoffs and showed similar performance at the other cutoffs for BPO and CCO, while SPROF-GO outperforms PANDA-3D for MFO.

Figure 5. The Fmax scores of PANDA-3D, DeepFRI, DeepGOCNN, Naïve and BLAST at different maximum sequence identity cutoffs (40%, 50%, 60%, 70%, 80% and 95%). Sequences from the testing set having a maximum identity score greater than the cutoff were removed. The benchmark was performed on the testing dataset labeled as 'DeepFRI' in Table 1.

Figure 6. The Fmax scores of PANDA-3D and DeepGO-SE at different maximum sequence identity cutoffs (40%, 50%, 60%, 70%, 80% and 95%). Sequences from the testing set having a maximum identity score greater than the cutoff were removed. The benchmark was performed on the testing dataset labeled as 'DeepGO-SE' in Table 1.

Figure 7. The Fmax scores of PANDA-3D and SPROF-GO at different maximum sequence identity cutoffs (40%, 50%, 60%, 70%, 80% and 95%). Sequences from the testing set having a maximum identity score greater than the cutoff were removed. The benchmark was performed on the testing dataset labeled as 'SPROF-GO' in Table 1.

Feature importance

To interpret the significance of the features used in PANDA-3D, we permuted one feature at a time to test its contribution towards accuracy. Specifically, we randomly shuffled the values of each feature at the residue level and examined the resulting changes in performance. The features we tested were: ESM embeddings, sequence tokens, dihedrals, Euclidean distances of the top k neighbors, edge vectors, orientation vectors, and pLDDTs. The results shown in Table 5 suggest that the ESM embedding, the token, and the dihedral angle are the top three most important features, as shuffling these features results in the worst performance of PANDA-3D in terms of Fmax, Smin and AUPR.

Table 5. The performance of PANDA-3D with permuted features. The three features whose permutation caused the largest performance drops (lowest Fmax, highest Smin and lowest AUPR) are marked with an asterisk (*). The benchmark was performed on the testing dataset labeled as 'DeepFRI' in Table 1 with the maximum sequence identity cutoff of 0.95.

                      Fmax                     Smin                         AUPR
Permuted feature      MFO     BPO     CCO      MFO      BPO      CCO        MFO     BPO     CCO
None                  0.642   0.471   0.705    7.290    43.601   10.027     0.654   0.445   0.766
ESM*                  0.630   0.458   0.700    7.539    44.249   10.218     0.640   0.431   0.761
Token*                0.634   0.465   0.700    7.421    43.934   10.151     0.647   0.437   0.763
Dihedral*             0.631   0.463   0.701    7.457    44.024   10.177     0.643   0.435   0.762
Euclidean distance    0.642   0.471   0.705    7.290    43.605   10.028     0.654   0.445   0.766
Edge vector           0.642   0.471   0.705    7.291    43.601   10.026     0.654   0.445   0.766
Orientation vector    0.641   0.470   0.704    7.295    43.627   10.023     0.651   0.443   0.764
pLDDT                 0.643   0.471   0.704    7.286    43.580   10.034     0.654   0.445   0.765
Side-chain vector     0.642   0.470   0.705    7.291    43.611   10.034     0.654   0.444   0.765

The performance of PANDA-3D on term-centric evaluation

We report the term-centric evaluation results in Table 6, which shows the AUROCs of the methods on all of the candidate GO terms in PANDA-3D and on two individual GO terms: biofilm formation (GO:0042710) and motility (GO:0001539). The AUROCs of 0.5 for Naïve are caused by the design of the Naïve predictor (see supplementary materials), which assigns each candidate GO term the same confidence score for every protein in the testing dataset.

Table 6. Term-centric evaluations of PANDA-3D, DeepFRI, DeepGOCNN, Naïve and BLAST, reporting the AUROCs on all candidate GO terms, biofilm formation (GO:0042710) and motility (GO:0001539). The highest AUROC in each column is marked with an asterisk (*). The benchmark was performed on the testing dataset labeled as 'DeepFRI' in Table 1 with the maximum sequence identity cutoff of 0.95.

Method       All GO-term classes   GO:0042710   GO:0001539
PANDA-3D     0.897924*             0.942165*    0.872102*
DeepFRI      0.530414              0.418858     0.342771
BLAST        0.665574              0.525296     0.581225
Naïve        0.500000              0.500000     0.500000
DeepGOCNN    0.809302              0.766282     0.721743

This type of evaluation was also used in CAFA2, CAFA3 and CAFA-π to rank predictors. In the CAFA3 and CAFA-π evaluations, three GO terms were used, but among these only two, biofilm formation (GO:0042710) and motility (GO:0001539), are candidate GO terms of PANDA-3D. Therefore, we only evaluated the performance of PANDA-3D on these two GO terms. Predicting these two GO terms was challenging in CAFA3 and CAFA-π, since the best AUROC of the top five teams when predicting GO:0042710 only slightly exceeded the BLAST baseline (7).

PANDA-3D significantly outperforms BLAST and the other methods when predicting all candidate GO terms, biofilm formation (GO:0042710), and motility (GO:0001539).

Discussion and conclusions

The successful performance of AlphaFold in predicting protein tertiary structures gives the community access to a large number of models of good quality. We developed PANDA-3D, a novel method that predicts protein functions from AlphaFold models and protein sequences. PANDA-3D combines GVP-GNN layers and transformer decoder layers to capture the 3D features of the AlphaFold models and then predict GO terms. PANDA-3D was evaluated for protein function prediction using both protein- and term-centric evaluations and achieved noticeably better performance than an existing method that takes experimentally determined structures as input and other methods that take protein sequences as input. Permutation feature importance analysis revealed that ESM embeddings, sequence tokens, and dihedrals are the top three most important features. In the term-centric evaluation, PANDA-3D significantly outperformed other methods in predicting biofilm formation (GO:0042710), motility (GO:0001539), and all GO-term classes, which suggests that PANDA-3D is highly accurate compared to the top methods in CAFA3 and CAFA-π.

By accurately predicting protein functions, PANDA-3D can speed up the search for molecular compounds with the potential to cure a disease by reducing the number of clinical candidate molecules, a process that can take years. The alteration of biological activity between proteins and drugs is primarily determined by their structures. Unlike traditional protein function predictors that predict protein functions from sequences, PANDA-3D predicts protein functions based on protein 3D structures. As predicted structures become more accurate, the advantages of structure-based protein function prediction will become even more apparent.

A limitation of PANDA-3D is its reliance on AlphaFold models as input. If a model is not in the AlphaFold DB, users must generate the AlphaFold model first.

Data availability

The web server of PANDA-3D can be freely accessed from http://dna.cs.miami.edu/PANDA-3D. The source code, training, validation, and testing datasets and trained models of PANDA-3D can be found at http://dna.cs.miami.edu/PANDA-3D and https://github.com/zwang-bioinformatics/PANDA-3D.

Supplementary data

Supplementary Data are available at NARGAB Online.

Acknowledgements

GPT-3.5 was used as a tool to assist in spell-checking and grammar verification.

Author contributions: C.Z., T.L. and Z.W. conceived the experiments. C.Z. implemented the code, conducted the experiments, and analyzed and visualized the results. C.Z. and Z.W. also wrote and reviewed the manuscript.

Funding

This research was supported by the National Institute of General Medical Sciences grant [1R35GM137974 to Z.W.].

Conflict of interest statement. None declared.

References

1. Weisiger R.A. Cytosolic fatty acid binding proteins catalyze two distinct steps in intracellular transport of their ligands. Mol. Cell. Biochem. 2002; 239:35-43.
2. Gardner M.K., Haase J., Mythreye K., Molk J.N., Anderson M., Joglekar A.P., O'Toole E.T., Winey M., Salmon E., Odde D.J. The microtubule-based motor Kar3 and plus end-binding protein Bim1 provide structural support for the anaphase spindle. J. Cell Biol. 2008; 180:91-100.
3. Josling G.A., Petter M., Oehring S.C., Gupta A.P., Dietz O., Wilson D.W., Schubert T., Längst G., Gilson P.R., Crabb B.S. A Plasmodium falciparum bromodomain protein regulates invasion gene expression. Cell Host Microbe. 2015; 17:741-751.
4. Niu X., Zhang T., Liao L., Zhou L., Lindner D.J., Zhou M., Rini B., Yan Q., Yang H. The von Hippel-Lindau tumor suppressor protein regulates gene expression and tumor growth through histone demethylase JARID1C. Oncogene. 2012; 31:776-786.
5. Kramer R., Cohen D. Functional genomics to new drug targets. Nat. Rev. Drug Disc. 2004; 3:965-972.
6. Savino R., Paduano S., Preianò M., Terracciano R. The proteomics big challenge for biomarkers and new drug-targets discovery. Int. J. Mol. Sci. 2012; 13:13926-13948.
7. Zhou N., Jiang Y., Bergquist T.R., Lee A.J., Kacsoh B.Z., Crocker A.W., Lewis K.A., Georghiou G., Nguyen H.N., Hamid M.N. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 2019; 20:244.
8. Anfinsen C.B. Principles that govern the folding of protein chains. Science. 1973; 181:223-230.
9. Anfinsen C.B., Haber E., Sela M., White F. Jr. The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proc. Natl. Acad. Sci. U.S.A. 1961; 47:1309-1314.
10. Hugli T.E. Biochemistry and biology of anaphylatoxins. Complement. 1986; 3:111-127.
11. Jiang Y., Oron T.R., Clark W.T., Bankapur A.R., D'Andrea D., Lepore R., Funk C.S., Kahanda I., Verspoor K.M., Ben-Hur A. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol. 2016; 17:184.
12. Wang Z., Zhao C., Wang Y., Sun Z., Wang N. PANDA: protein function prediction using domain architecture and affinity propagation. Sci. Rep. 2018; 8:3484.
13. Bhagwat M., Aravind L. PSI-BLAST tutorial. 2007; Springer: 177-186.
14. Kulmanov M., Hoehndorf R. DeepGOPlus: improved protein function prediction from sequence. Bioinformatics. 2020; 36:422-429.
15. Liu Y.-W., Hsu T.-W., Chang C.-Y., Liao W.-H., Chang J.-M. GODoc: high-throughput protein function prediction using novel k-nearest-neighbor and voting algorithms. BMC Bioinformatics. 2020; 21:276.
16. Sureyya Rifaioglu A., Doan T., Jesus Martin M., Cetin-Atalay R., Atalay V. DEEPred: automated protein function prediction with multi-task feed-forward deep neural networks. Sci. Rep. 2019; 9:7344.
17. Cao R., Freitas C., Chan L., Sun M., Jiang H., Chen Z. ProLanGO: protein function prediction using neural machine translation based on a recurrent neural network. Molecules. 2017; 22:1732.
18. Zhao C., Liu T., Wang Z. PANDA2: protein function prediction using graph neural networks. NAR Genom. Bioinform. 2022; 4:lqac004.
19. Rives A., Meier J., Sercu T., Goyal S., Lin Z., Liu J., Guo D., Ott M., Zitnick C.L., Ma J. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U.S.A. 2021; 118:e2016239118.
20. Strodthoff N., Wagner P., Wenzel M., Samek W. UDSMProt: universal deep sequence models for protein classification. Bioinformatics. 2020; 36:2401-2409.
21. Littmann M., Heinzinger M., Dallago C., Olenyi T., Rost B. Embeddings from deep learning transfer GO annotations beyond homology. Sci. Rep. 2021; 11:1160.
22. Heinzinger M., Elnaggar A., Wang Y., Dallago C., Nechaev D., Matthes F., Rost B. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics. 2019; 20:723.
23. Elnaggar A., Heinzinger M., Dallago C., Rehawi G., Wang Y., Jones L., Gibbs T., Feher T., Angerer C., Steinegger M. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021; 44:7112-7127.
24. Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. 2018; arXiv, 24 May 2019, preprint: not peer reviewed. https://arxiv.org/abs/1810.04805.
25. Zhu Y.-H., Zhang C., Yu D.-J., Zhang Y. Integrating unsupervised language model with triplet neural networks for protein gene ontology prediction. PLoS Comput. Biol. 2022; 18:e1010793.
26. Kulmanov M., Guzmán-Vega F.J., Duek Roggli P., Lane L., Arold S.T., Hoehndorf R. Protein function prediction as approximate semantic entailment. Nat. Mach. Intell. 2024; 6:220-228.
27. Yuan Q., Xie J., Xie J., Zhao H., Yang Y. Fast and accurate protein function prediction from sequence through pretrained language model and homology-based label diffusion. Brief. Bioinform. 2023; 24:bbad117.
28. Wang S., You R., Liu Y., Xiong Y., Zhu S. NetGO 3.0: protein language model improves large-scale functional annotations. Genom. Proteom. Bioinform. 2023; 21:349-358.
29. Gligorijević V., Renfrew P.D., Kosciolek T., Leman J.K., Berenberg D., Vatanen T., Chandler C., Taylor B.C., Fisk I.M., Vlamakis H. et al. Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 2021; 12:3168.
30. Lai B., Xu J. Accurate protein function prediction via graph attention networks with predicted structure information. Brief. Bioinform. 2022; 23:bbab502.
31. Zhang C., Freddolino P.L., Zhang Y. COFACTOR: improved protein function prediction by combining structure, sequence and protein-protein interaction information. Nucleic Acids Res. 2017; 45:W291-W299.
32. Zhou X., Zheng W., Li Y., Pearce R., Zhang C., Bell E.W., Zhang G., Zhang Y. I-TASSER-MTD: a deep-learning-based platform for multi-domain protein structure and function prediction. Nat. Protoc. 2022; 17:2326-2353.
33. Anand N., Eguchi R., Mathews I.I., Perez C.P., Derry A., Altman R.B., Huang P.-S. Protein sequence design with a learned potential. Nat. Commun. 2022; 13:746.
34. Qi Y., Zhang J.Z. DenseCPD: improving the accuracy of neural-network-based computational protein sequence design with DenseNet. J. Chem. Inf. Model. 2020; 60:1245-1252.
35. Ingraham J., Garg V., Barzilay R., Jaakkola T. Generative models for graph-based protein design. Adv. Neural Inf. Process. Syst. 2019; 32:15820-15831.
36. Hsu C., Verkuil R., Liu J., Lin Z., Hie B., Sercu T., Lerer A., Rives A. Learning inverse folding from millions of predicted structures. International Conference on Machine Learning. 2022; PMLR: 8946-8970.
37. Battaglia P.W., Hamrick J.B., Bapst V., Sanchez-Gonzalez A., Zambaldi V., Malinowski M., Tacchetti A., Raposo D., Santoro A., Faulkner R. Relational inductive biases, deep learning, and graph networks. 2018; arXiv, 17 October 2018, preprint: not peer reviewed. https://arxiv.org/abs/1806.01261.
38. Jing B., Eismann S., Soni P.N., Dror R.O. Equivariant graph neural networks for 3D macromolecular structure. 2021; arXiv, 13 July 2021, preprint: not peer reviewed. https://arxiv.org/abs/2106.03843.
39. Jing B., Eismann S., Suriana P., Townshend R.J., Dror R. Learning from protein structure with geometric vector perceptrons. International Conference on Learning Representations. 2020.
40. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser U., Polosukhin I. Attention is all you need. NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017; 6000-6010.
41. Liu Z., Lin Y., Cao Y., Hu H., Wei Y., Zhang Z., Lin S., Guo B. Swin Transformer: hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021; 10012-10022.
42. Cao Y., Shen Y. TALE: transformer-based protein function annotation with joint sequence-label embedding. Bioinformatics. 2021; 37:2825-2833.
43. Kabir A., Shehu A. GOProFormer: a multi-modal transformer method for gene ontology protein function prediction. Biomolecules. 2022; 12:1709.
44. Liu S., Zhang L., Yang X., Su H., Zhu J. Query2Label: a simple transformer way to multi-label classification. 2021; arXiv, 22 July 2021, preprint: not peer reviewed. https://arxiv.org/abs/2107.10834.
45. Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Žídek A., Potapenko A. Highly accurate protein structure prediction with AlphaFold. Nature. 2021; 596:583-589.
46. Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Killeen T., Lin Z., Gimelshein N., Antiga L. PyTorch: an imperative style, high-performance deep learning library. Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019; 8026-8037.
47. Varadi M., Anyango S., Deshpande M., Nair S., Natassia C., Yordanova G., Yuan D., Stroe O., Wood G., Laydon A. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2021; 50:D439-D444.
48. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2018; 46:2699.
49. Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T. Gene Ontology: tool for the unification of biology. Nat. Genet. 2000; 25:25.
50. Radivojac P., Clark W.T., Oron T.R., Schnoes A.M., Wittkop T., Sokolov A., Graim K., Funk C., Verspoor K., Ben-Hur A. A large-scale evaluation of computational protein function prediction. Nat. Methods. 2013; 10:221-227.
51. Altschul S., Madden T., Schaffer A., Zhang J., Zhang Z., Miller W., Lipman D. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25:3389-3402.

