Abstract

Objectives

The vast and complex nature of human genomic sequencing data presents challenges for effective analysis. This review aims to investigate the application of natural language processing (NLP) techniques, particularly large language models (LLMs) and transformer architectures, in deciphering genomic codes, focusing on tokenization, transformer models, and regulatory annotation prediction. It also aims to assess data and model accessibility in the most recent literature, to better understand the existing capabilities and constraints of these tools in processing genomic sequencing data.

Materials and Methods

Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, our scoping review was conducted across PubMed, Medline, Scopus, Web of Science, Embase, and ACM Digital Library. Studies were included if they focused on NLP methodologies applied to genomic sequencing data analysis, without restrictions on publication date or article type.

Results

A total of 26 studies published between 2021 and April 2024 were selected for review. The review highlights that tokenization and transformer models enhance the processing and understanding of genomic data, with applications in predicting regulatory annotations like transcription-factor binding sites and chromatin accessibility.

Discussion

The application of NLP and LLMs to genomic sequencing data interpretation is a promising field that can help streamline the processing of large-scale genomic data while also providing a better understanding of its complex structures. It has the potential to drive advancements in personalized medicine by offering more efficient and scalable solutions for genomic analysis. Further research is also needed to discuss and overcome current limitations, enhancing model transparency and applicability.

Conclusion

This review highlights the growing role of NLP, particularly LLMs, in genomic sequencing data analysis. While these models improve data processing and regulatory annotation prediction, challenges remain in accessibility and interpretability. Further research is needed to refine their application in genomics.

Introduction

The vast and complex nature of human genomic sequencing data necessitates advanced computational methods for effective analysis and interpretation. In recent years, the intersection of natural language processing (NLP) and genomic data interpretation has garnered significant interest. Large language models (LLMs) and transformer architectures, initially designed for natural language understanding, have shown promise in deciphering the genomic code.1 By converting genetic sequences into computationally interpretable formats and leveraging the sophisticated attention mechanisms of transformers, researchers aim to enhance the accuracy and depth of genomic sequencing analysis.2

The human genome, composed of over three billion base pairs, contains information critical for understanding biological processes and disease mechanisms.3 Traditional methods like Sanger sequencing, next-generation sequencing, and alignment-based approaches focus on generating and aligning sequence data but often fall short in interpreting large, complex genomic datasets, particularly for identifying regulatory regions and intricate patterns.4 NLP and LLMs provide a scalable approach beyond raw sequencing, enabling efficient analysis, discovery of regulatory regions, and deeper insights into genetic variation.5

This literature review explores the application of NLP and LLMs in genomic data processing, focusing on three key areas: tokenization of genomic sequences, utilization of transformer models, and prediction of regulatory annotations. Tokenization involves converting raw genomic sequences into a format suitable for analysis, making the data more accessible for computational models.5 Transformer architectures, with their advanced attention mechanisms, capture complex contextual relationships within the data, providing deeper insights into genomic structures.6 Finally, predictive modeling uses the preprocessed data to identify critical regulatory elements such as transcription-factor binding sites, enhancer-promoter interactions, chromatin accessibility, and gene expression patterns.7,8

By examining these areas, this review aims to highlight the transformative potential of integrating NLP into genomic sequencing research. This integration not only enables scientists to leverage the power of LLMs to gain a more convenient and deeper understanding of genomic data but also paves the way for advancements in personalized medicine, where treatments can be tailored based on an individual’s genetic makeup. Despite the challenges, including data complexity, model interpretability, and validation, the progress in this field holds significant promise for future breakthroughs in genomics and beyond.

Methods

Eligibility criteria

Our review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (https://www.prisma-statement.org/). The eligibility criteria for the included studies focused on two main areas: NLP and genetic association studies. Studies were included if they specifically addressed applications or methodologies related to NLP in the context of genomic data analysis. No restrictions were placed on publication date or article type, and the primary focus was on studies published in English. A summary of the PRISMA checklist is provided in the Supplementary Materials S1.

Information sources

A comprehensive search was conducted across multiple databases to identify relevant studies. The systematic searches were conducted on Ovid MEDLINE (In‐Process and Other Non‐Indexed Citations and Ovid MEDLINE 1946 to Present), Ovid EMBASE (1974 to present), Scopus, Web of Science, and the ACM Digital Library. Additional sources included PubMed (https://pubmed.ncbi.nlm.nih.gov), the Institute of Electrical and Electronics Engineers (IEEE) Xplore Digital Library (https://ieeexplore.ieee.org), Google Scholar (https://scholar.google.com), and Semantic Scholar (https://www.semanticscholar.org). The searches were executed in April 2024, with the most recent search conducted in June 2024 to ensure the inclusion of the latest available studies. Important references identified during the searches were also tracked for further examination.

Search strategy

Our search strategy was designed to maximize coverage using broad combinations of keywords and related terms, ensuring a comprehensive capture of relevant literature. It included both controlled vocabulary and free-text terms related to “natural language processing” and “genetic association studies.” The strategy incorporated keywords and phrases such as “natural language processing,” “large language model,” “NLP,” “LLM,” “data mining,” “genomic association studies,” “polymorphism,” “SNP,” “token,” “transformer,” “BERT,” and “regulatory annotations.” Full details of the search strategies for each database are provided in the Supplementary Materials S2.

Study selection

The study selection process involved two stages of screening (Figure 1). Initially, two independent researchers screened the abstracts of all identified articles for relevance to the topics of NLP and genetic association studies. Abstracts deemed appropriate were then subjected to a full-text review, with each full-text evaluated by at least two reviewers to confirm its eligibility for inclusion. The screening was facilitated using Covidence, a web-based tool designed to streamline systematic reviews.9 Heterogeneity across studies was qualitatively explored by comparing differences in study objectives, methodologies, and evaluation metrics.

The initial search phase identified 787 papers for consideration, of which 702 were subsequently excluded based on key criteria. Of the 85 papers assessed in the full-text screening phase, 59 were excluded due to misaligned objectives or incomplete texts. Finally, a total of 26 studies published between 2021 and April 2024 met the inclusion criteria and were selected for final discussion.
Figure 1.

Flowchart of the literature review process according to PRISMA guidelines.

The initial search phase identified 787 papers for consideration, of which 702 were subsequently excluded based on key criteria: (1) 73 studies did not use human omics data; (2) 39 studies were excluded because they did not use NLP methods. For example, some papers focused on building large-scale genomic variant databases using genome sequencing but did not incorporate NLP techniques in their objectives or methodologies; (3) 526 studies were excluded for their irrelevance, particularly those not involving specific NLP downstream tasks or human-omics data; (4) 60 studies were excluded for providing insufficient content, such as conference proceedings or submissions that did not provide full-text articles; (5) 4 studies consisted of secondary literature, such as survey and review papers. Of the 85 papers that entered the full-text screening phase, 59 were excluded due to misaligned objectives or incomplete texts. “Misaligned objectives” refers to papers that initially appeared to meet the inclusion criteria during title and abstract screening but ultimately lacked sufficient detail for comprehensive analysis upon full-text review. For example, a paper might use advanced models like Transformers for genomic analysis, but if it focused on optimization protocols rather than tokenization, contextual understanding, or regulatory predictions, it did not sufficiently address this review’s primary questions.

Results

A total of 26 studies published between 2021 and April 2024 met the inclusion criteria and were selected for final discussion. Table 1 summarizes the main findings.

Table 1.

An overview of NLP application using big genomic data.

Ref. | Model | Goal and aim | Data | Sample size | Parameter size | Tokenization | Transformer architecture | Model derived | Data/model availability
Clauwaert and Waegeman10 | Transformer-XL plus enhancement | Sequence labeling tasks | DNA sequences | 928 3304 samples | 185 346 to 462 402 | Sequential segments of 512 nucleotides | Transformer-XL + convolutional layer | Pretrain | Yes/Yes
Hossain et al11 | DistilBERT + CRF + Attention Mask | Detect CpG islands in DNA sequences; Promoter prediction; epigenetic causes identification | DNA sequences | 233 004 sequences (a) | 66M | BPE | DistilBERT | Pretrain & Fine tune (CpG island detection) | Yes/No
Ji et al12 | DNABERT | Capture understanding of DNA sequences; Predict regulatory elements | DNA sequences | 690 TF ChIP-seq profiles | 110M | K-mer (3, 4, 5, 6) | BERT | Pretrain & Fine tune | Yes/Yes
Le et al13 | BERT-CNN | Identify DNA enhancers | DNA sequences | 1484 each (training); 200 each (test) | 1 317 442 | Nucleotides as words, DNA sequences as sentences | BERT + 2D CNN | Pretrain & Fine tune | Yes/Yes
Le et al14 | BERT-Promoter | Improve the prediction of DNA promoters and their strength | DNA sequences | 3382 each (b) | 110M | DNA sequences split into 81-bp fragments | BERT + SHAP | Pretrain | Yes/Yes
Luo et al15 | TFBERT | Improve the prediction of DNA-protein binding sites | DNA sequences | 690 ChIP-seq datasets (c) | 110M | K-mer | BERT | Pretrain & Fine tune | Yes/Yes
Rajkumar et al16 | DeepViFi | Detect Oncoviral Infections in Cancer Genomes using Transformers | DNA sequences | 1 145 800 reads | 8 encoder | Each base-pair as a token | Self-attention heads | Pretrain | Yes/Yes
Roy et al17 | GENEMASK-based DNABERT | Improve MLM training efficiency for gene sequences | DNA sequences | Prom-core & Prom-300: 53 276 training, 5920 test; Splice-40: 24 300 training, 3000 test; Cohn-enh: 20 843 training, 6948 test | 110M | k-mer (k = 6) | BERT | Genomic specific pretrain paradigm | Yes/Yes
Wang et al18 | BERT-5mC | Predict 5mC sites of DNA | DNA sequences | Training: 55 800 positive, 658 861 negative; Testing: 13 950 positive, 164 715 negative | NR (d) | K-mer (k = 3) | BERT | Pretrain & Fine tune | Yes/Yes
Wang et al19 | SMFM | Identify and characterize DNA enhancers | DNA sequences | 2968 samples (1484 enhancers and 1484 non-enhancers) | NR | K-mer (k = 3) | BERT | Fine tune | Yes/Yes
Zhang et al20 | DNABERT | Predict Protein-DNA binding sites | DNA sequences | 45 public transcription factor ChIP-seq datasets with DNA sequence samples of 101 bp | 110M | K-mer (k = 6) | BERT; multi-headed self-attention | Fine tune | Yes/No
Zhang et al21 | SemanticCAP | Chromatin accessibility prediction | DNA sequence | MT 244 692, PC 418 624; PH 503 816, NO 264 264; MU 266 868, NP 283 148 | 5.61M | Character based inputs | BERT | Pretrain | Yes/Yes
An et al22 | moDNA | Genome embedding; Promoter prediction; Transcription factor binding sites prediction | Non-coding DNA sequences | NR | NR | 6-mers | BERT | Pretrain on human genome data & Fine tune | Yes/No
Hossain et al23 | BERT + CRF | Detect CpG islands in DNA sequences; Promoter prediction; epigenetic causes identification | DNA sequences with annotated CpG islands | 233 004 sequences (a) | 110M | BPE | BERT with CRF layer | Pretrain & Fine tune (CpG island detection) | Yes/No
Pipoli et al24 | Transformer DeepLncLoc | Predict the abundance of mRNA (gene expression levels) | DNA sequences; mRNA half-life features; Transcription factors | 18 000 gene sequences with their expression values | 123 881 | K-mer (k = 3) | Vanilla encoder block + DeepLncLoc embedding | Evaluate and use the output embedding | Yes/Yes
Zeng et al25 | MuLan-Methyl | Predict DNA methylation sites N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine | DNA sequences; Taxonomy lineages | 250 599 samples across 12 genomes | 110M (e) | Custom tokenizer (f) | BERT; DistilBERT; ALBERT | Pretrain & Fine tune | Yes/Yes
Wang et al26 | MSCAN | Identify RNA methylation sites | RNA sequences | m1A_train0: 593 positive, 5930 negative; m6A_train: 41 307 positive, 41 307 negative | NR | Word2vec embedding k-mer (k = 3) | Multi-scale self- and cross-attention mechanisms with multi-head attention | Pretrain & Fine tune | Yes/No
Wang et al27 | MTTLm6A | Predict base-resolution mRNA m6A sites | RNA sequences | m1A sites: 1987 positive samples, 2249 negative samples (Homo sapiens); m6A sites: 24 669 m6A sites (S. cerevisiae) | NR | One-hot encoding | CNN; Multi-head attention | Fine tune | No/No
Wei et al28 | BCMCMI | Predict potential circRNA-miRNA interactions | circRNA and miRNA sequences | circBank: 9589 (2115 circRNAs, 821 miRNAs); CMI-9905: 9905 (2346 circRNAs, 926 miRNAs) | NR | BERT-based tokenization with WordPiece embeddings | BERT | Directly use BERT to get embedding | Yes/Yes
Wang et al29 | BioDGW-CMI | Predict circRNA-miRNA interactions | RNA sequences; Network structure | CMI-9905: 9905 (2346 circRNAs, 962 miRNAs); CMI-9589: 9589; CMI-753: 753 | NR | K-mer (k = 2 for miRNA and k = 5 for circRNA) | BERT | Use existing pretrained model | Yes/Yes
Zhang et al30 | miTDS | Predict miRNA-mRNA interactions | miRNA and mRNA sequences | 10 test datasets, each with 548 positive and 548 negative miRNA-mRNA pairs | 110M | BERT-based tokenization | BERT | Fine tune | Yes/Yes
Zhang et al31 | TMSC-m7G | Predict RNA N7-methylguanosine (m7G) sites | RNA sequences with N7-methylguanosine modification sites | Benchmark: 741 positives, 741 negatives (balanced); Independent: 334 positives, 3340 negatives (imbalanced) | NR | K-mer, then multi-sense-scaled word embedding | Transformer with CNN layer | Fine tune | Yes/No
Wang et al32 | Transformer-based DNA methylation detection model | Detect DNA methylation on ionic signals | Nanopore sequencing data | NR | NR | One-hot encoding | BERT | Fine tune | No/Yes
Jhee et al33 | CGCD | Predict optimized potential anti-breast cancer therapeutic target genes | Multi-omics data | 105 breast cancer patients | 65M | Gene expression values as tokens | Transformer encoder | NR | Yes/No
Jurenaite et al34 | SETQUENCE, SETOMIC | Enhance tumor type classification; Provide ML model which can hand over omics data | Transcriptome expression data; Somatic mutation data | 544 healthy & 7518 tumor samples across 32 cancer types | NR | 6-mers | DNABERT, DNN | Pretrain & Fine tune | Yes/Yes
Wang et al35 | IGnet | Automated classification of Alzheimer’s disease | 3D MRI; SNP; CNV markers | ADNI-1 subset with 379 participants (174 AD patients and 205 normal controls) | NR | SNPs {0, 1, 2} (g) | 3D CNN for CV; two-layer transformer for genetic sequence | Train end to end | Yes/No
a. 61 051 sequences containing 142 325 CpG islands.
b. 3382 promoters (1591 strong and 1791 weak promoter samples) and 3382 non-promoters.
c. 4 153 122 training samples, 461 458 validation samples, 800 000 testing samples.
d. NR: Not Reported.
e. BERT: 110M; DistilBERT: 40% of BERT; ALBERT: reduced size with cross-layer sharing.
f. A custom tokenizer that can capture any sample represented by 6-mer DNA words and a textual description of taxonomic lineage.
g. SNPs encoded as 0,1,2; selected with Fisher’s test, concatenated with APOE.


Preprocessing and modeling

Preprocessing genomic sequencing data is crucial before predictive modeling can be applied. This involves converting the raw genomic sequences into a format that computational models can understand, making the complex genetic data more accessible. The preprocessing steps include tokenization, which breaks sequences into manageable sub-word units. Subsequently, advanced architectures like transformers are utilized during the modeling phase to capture intricate dependencies and patterns in the data.

Tokenization of genomic data for LLMs

Tokenization is the first step in preprocessing genomic sequencing data for LLMs, enabling the models to capture biologically significant patterns such as promoter elements like TATA or CAAT boxes. It involves breaking the sequences down into smaller, manageable, and interpretable units that can be fed into computational models. K-mer tokenization is the most widely used method among the studies reviewed, consistent with common practice in genomic research. Biological functions are often determined by short patterns in DNA sequences, such as motifs and binding sites, and k-mers are well suited to capturing them. Furthermore, many studies are built on top of existing pre-trained models and follow the tokenizer used in the original model. For example, several studies build on DNABERT15,34 and employ k-mers accordingly. Beyond k-mers, other tokenization methods are also applied. For instance, Hossain et al11,23 frame the problem of CpG island detection as named entity recognition (NER) and use a BPE tokenizer.

K-merization: K-merization is a method in bioinformatics that breaks down DNA sequences into smaller, overlapping segments of fixed length, known as k-mers, where “k” represents the number of nucleotides in each segment.36 For instance, in DNABERT, applying k-mers with a value of 3 to a sequence like “ACTGACTGAC” results in tokens such as [“ACT”, “CTG”, “TGA”, “GAC”]. Several studies have effectively utilized k-mer tokenization for sequencing data processing. Wang et al18 applied k=1,3,5 to split DNA sequences into k-mers, treating them as words in a natural language. Additionally, Ji et al12 employed various k-mer lengths (3, 4, 5, 6) in training and fine-tuning DNABERT models to understand DNA sequences and predict regulatory elements. DNABERT was trained using the Bidirectional Encoder Representations from Transformers (BERT) model,37 which was developed to learn deep bidirectional representations from unlabeled text by considering context from both the left and the right at every layer. Similarly, Zeng et al25 introduced a custom corpus and tokenizer using 6-mers with taxonomic lineage descriptions, aiming to predict DNA methylation sites (F1=0.95). Moreover, An et al22 used 6-mers to pre-train on human genome data and fine-tune on downstream data, yielding the moDNA model, which is designed for promoter prediction and transcription start site (TSS) detection (F1=0.862).
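To make the k-merization step concrete, the short sketch below reproduces overlapping k-mer tokenization in Python; the function name, defaults, and example sequence are illustrative and are not taken from any of the reviewed tools.

```python
def kmer_tokenize(sequence: str, k: int = 3, stride: int = 1) -> list:
    """Split a DNA sequence into overlapping k-mers (stride=1 mimics DNABERT-style tokens)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

print(kmer_tokenize("ACTGACTGAC", k=3))
# ['ACT', 'CTG', 'TGA', 'GAC', 'ACT', 'CTG', 'TGA', 'GAC']
```

With k = 6, the same routine yields 6-mer tokens of the kind used by DNABERT-style models.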

Byte-pair encoding (BPE): BPE is a tokenization method that iteratively merges the most frequent pairs of characters in a text to create subword tokens, allowing for efficient handling of rare and unseen words.38 When BPE is applied to the sequence “ACTGACTGAC,” it may yield tokens like [“ACTG,” “ACTG,” “AC”]. Hossain et al11 utilized BPE38 to tokenize the DNA sequences. The tokenized data were then applied to three models, each with a parameter size of 66 million: DistilBERT was set as the benchmark, and a Conditional Random Field (CRF)39 layer and an attention mask were then added to detect CpG islands in DNA sequences. This methodology aims to predict promoter regions and identify epigenetic causes of diseases. The F1 scores for these three models were 0.718, 0.726, and 0.735, respectively.
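The pair-merging idea behind BPE can be sketched in a few lines; this toy implementation operates on nucleotide characters and is for illustration only, as the reviewed studies rely on standard BPE tokenizer implementations rather than code like this.

```python
from collections import Counter

def train_bpe_merges(sequences, num_merges=10):
    """Learn BPE merge rules by repeatedly fusing the most frequent adjacent token pair."""
    corpus = [list(seq) for seq in sequences]      # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]    # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_corpus = []
        for tokens in corpus:                      # apply the new merge everywhere
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, tokenized = train_bpe_merges(["ACTGACTGAC", "ACTGAC"], num_merges=3)
print(merges)       # learned merge rules, e.g. ('A', 'C') first, since "AC" is the most frequent pair
print(tokenized)    # the corpus re-tokenized with the learned subwords
```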

In 2023, Hossain et al23 refined this approach by pre-training the BERT model and CRF layer (parameter size = 110M) on a large sequencing dataset with 142 325 CpG islands and fine-tuning on a smaller dataset, ultimately achieving an F1 score of 0.834.

Fixed nucleotide tokenization: Fixed nucleotide tokenization (FNT) is a method that segments DNA sequences into fixed-length nucleotide fragments. This technique differs from k-merization primarily because it does not allow overlapping segments between the fragments, often treating these fragments as distinct “tokens” or “words” in the context of deep learning and NLP models. Using FNT with 3-nucleotide groupings on “ACTGACTGAC,” for example, produces tokens like [“ACT,” “GAC,” “TGA,” “C”]. Le et al14, for instance, divided each DNA sequence into 81-base-pair fragments and treated each fragment as an independent token without any overlap. This fragment-based tokenization helps maintain sequence context over a fixed window size. Their aim was to identify promoters and non-promoters and their activities; the researchers applied a pre-trained BERT model and achieved an accuracy of over 0.8 for both promoter identification and strength identification. Another study13 presented a novel method that combines BERT and 2D convolutional neural networks (CNNs) to predict DNA enhancers. During the training process, each nucleotide was transformed into a contextualized word embedding vector of size 768. These vectors, which represent fixed-length sequences, were then passed to a CNN for further analysis. The model demonstrated superior performance when trained on the iEnhancer-2L dataset, achieving an accuracy of 0.756 and a sensitivity of 0.8. In addition, Rajkumar et al16 presented a transformer-based model named DeepViFi that tokenized each nucleotide base (A, C, G, T, N) individually, rather than using k-mers or sub-sequences. This approach leveraged a random forest classifier to identify viral reads in cancer genomes, particularly focusing on human papillomavirus (HPV), and a LightGBM model to classify these viral reads into specific subfamilies. The results showed that DeepViFi achieved a high precision-recall AUC of 0.94 in detecting and classifying HPV reads, demonstrating its effectiveness in this domain.
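A corresponding sketch for fixed, non-overlapping fragments is shown below; it is illustrative only, and fragment widths such as 81 bp are study-specific choices.

```python
def fixed_window_tokenize(sequence: str, width: int) -> list:
    """Split a DNA sequence into non-overlapping fragments of a fixed width;
    the final fragment may be shorter than `width`."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]

print(fixed_window_tokenize("ACTGACTGAC", width=3))     # ['ACT', 'GAC', 'TGA', 'C']
print(len(fixed_window_tokenize("A" * 300, width=81)))  # 4 fragments (81 + 81 + 81 + 57 bases)
```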

Transformer architecture for genomic sequencing data

Once the data is tokenized, the next step involves using transformer architectures to capture the complex, contextual relationships within them. Transformers are highly effective due to their attention mechanisms, which allow models to focus on different parts of the sequence simultaneously.

Since most studies are framed as prediction or classification problems, BERT and its variants are often used as feature extractors, with an additional classifier added on top. In this section, we mostly focus on the BERT and transformer components. For sequence labeling tasks, the transformer is used directly.10,40 Additionally, some models incorporate CNNs within transformer-based architectures to enhance local feature extraction. CNNs are particularly effective for capturing motifs and short sequence patterns, which are essential for refining sequence-level predictions within broader transformer architectures.
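As a rough sketch of this feature-extractor pattern, the snippet below mean-pools the hidden states of a pretrained BERT-style genomic model (loaded via the Hugging Face transformers library) and fits a simple downstream classifier. The checkpoint name, k-mer size, and toy labels are placeholders; the reviewed studies use their own pretrained checkpoints and downstream classifiers (eg, XGBoost or small neural networks).

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "some-org/dna-bert-like-model"   # placeholder, not a real model name

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)
model.eval()

def to_kmer_text(sequence, k=6):
    """Represent a DNA sequence as space-separated overlapping k-mers ('words')."""
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def embed(sequences):
    """Mean-pool the last hidden states to obtain one fixed-length vector per sequence."""
    inputs = tokenizer([to_kmer_text(s) for s in sequences],
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (batch, tokens, hidden_size)
    return hidden.mean(dim=1).numpy()

# Toy binary task (eg, enhancer vs non-enhancer); real studies use benchmark datasets.
X = embed(["ACGTACGTACGT" * 8, "TTAATTAATTAA" * 8])
y = [1, 0]
classifier = LogisticRegression().fit(X, y)
```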

BERT and variants: After tokenizing the DNA sequences, Wang et al18 pre-trained and fine-tuned a BERT model to predict 5-methylcytosine (5mC) sites and identify DNA enhancers. The resulting BERT-5mC model achieved an accuracy of 0.933. The DNABERT models trained by Ji et al12 on different k-mers each have 110M parameters and achieve F1 values over 0.9 for all values of “k.” Luo et al15 introduced TFBert, a model based on the BERT architecture specifically designed to predict DNA-protein binding sites. The model was derived by initializing with the DNABERT pretraining model and then performing task-specific pretraining on a large collection of 690 ChIP-seq datasets, which consist of various DNA-protein binding data. The model tokenized DNA sequences into k-mers, treating them as words in the context of a language model, allowing it to capture the context of DNA sequences effectively. The primary goal of TFBert is to improve the accuracy and robustness of DNA-protein binding predictions, especially in cases where the datasets are small or medium-sized. The results demonstrated that TFBert15 achieved state-of-the-art performance, outperforming other existing models with an average AUC of 0.947, making it a valuable tool for various biological sequence prediction tasks.

Transformer encoder blocks: Besides BERT, several studies only utilized vanilla transformer encoder blocks or modified versions, which refer to the original transformer architecture with basic attention and feed-forward layers, without additional layers or task-specific pretraining. Unlike models like BERT, which are optimized with masked language modeling and specialized layers for specific downstream tasks, vanilla transformers typically lack these enhancements and may require additional tuning to process complex genomic data effectively. Pipoli et al24 proposed a transformer-based model called Transformer DeepLncLoc to process the DNA sequences into a more compact and informative representation using a k-mer approach combined with word2vec embedding. It was specifically designed to process gene promoter sequences and predict the abundance of mRNA, managing the task as a regression problem. The transformer model was then used to analyze these embedded sequences, and its performance was compared against other models like LSTM DeepLncLoc and a convolutional model called DivideEtImpera, which utilized CNN layers to capture local sequence features. Including CNNs in this context helped capture motifs and short patterns within the sequence, providing an advantage in prediction accuracy. Roy et al17 introduced a novel approach for masked language modeling (MLM) in gene sequences, specifically focusing on improving the efficiency and performance of transformer-based models like DNABERT and LOGO. The paper presented the GENEMASK model (parameter size = 110M), derived by applying a Pointwise Mutual Information (PMI)-based masking strategy to gene sequences. This strategy identifies and masks the most correlated spans of nucleotides, as opposed to the random masking strategy used in traditional models. The GENEMASK model aims to enhance the learning process by making it more challenging and reducing the pretraining time while improving accuracy, particularly in few-shot settings where training data is limited. The results demonstrated that GENEMASK significantly outperforms the baseline models in several gene sequence classification tasks, showing better accuracy (0.898 ± 0.005) and ROCAUC (0.962 ± 0.002), especially in low-resource scenarios.
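A minimal PyTorch sketch of the vanilla encoder-plus-prediction-head pattern described at the start of this paragraph is shown below; the dimensions are arbitrary and do not reproduce the Transformer DeepLncLoc configuration.

```python
import torch
import torch.nn as nn

# Stacked self-attention + feed-forward blocks with no pretraining or task-specific layers.
encoder_layer = nn.TransformerEncoderLayer(d_model=128, nhead=8, dim_feedforward=512,
                                           batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

# Embedded k-mer tokens: batch of 2 promoter sequences, 100 tokens each, embedding size 128.
tokens = torch.randn(2, 100, 128)
contextual = encoder(tokens)                 # (2, 100, 128) contextualized representations

# Pool over tokens and regress a single value, eg an mRNA abundance estimate.
head = nn.Linear(128, 1)
prediction = head(contextual.mean(dim=1))    # shape (2, 1)
```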

Advanced attention mechanisms: Advanced attention mechanisms are specialized features within transformer architectures that enhance the model’s capacity to identify and capture complex relationships within genomic data. Wang et al presented a novel deep learning model called MTTLm(6)A, designed to predict N6-methyladenosine (m6A) sites on mRNA at base resolution.27 The model employed a multi-task transfer learning approach, leveraging information from related tasks to improve m6A site prediction. The primary model is an improved transformer architecture, fine-tuned using datasets from various species to enhance its generalization capabilities. The results demonstrated that MTTLm(6)A outperformed other state-of-the-art models in terms of prediction accuracy and efficiency. In another study, Wang et al26 proposed MSCAN, a deep learning framework designed for RNA methylation site prediction, mainly focused on identifying various types of RNA modifications. The model incorporated multi-scale self- and cross-attention mechanisms to capture both local and long-range dependencies in RNA sequences. Using different input sequence scales, the model effectively captured the complex contextual relationships crucial for accurate methylation site prediction. MSCAN outperformed existing models in predicting 12 different RNA modifications.
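The multi-scale self- and cross-attention idea can be sketched with PyTorch's built-in multi-head attention; the two "views" and all dimensions below are assumptions for illustration, not the exact MSCAN or MTTLm(6)A architectures.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 64, 4
self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

fine_view = torch.randn(2, 101, embed_dim)    # eg, per-nucleotide embeddings of an RNA window
coarse_view = torch.randn(2, 99, embed_dim)   # eg, 3-mer embeddings of the same window

# Self-attention builds context within one scale; cross-attention lets one scale query the other.
fine_ctx, _ = self_attn(fine_view, fine_view, fine_view)
fused, weights = cross_attn(fine_ctx, coarse_view, coarse_view)
print(fused.shape)     # torch.Size([2, 101, 64])
```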

Predicting regulatory annotations

After preprocessing and deriving embeddings through the transformer, the tokenized and transformed genomic data can be used to predict various regulatory annotations. These predictions include identifying transcription-factor binding sites, enhancer-promoter interactions, chromatin accessibility, and gene expression patterns. Many studies have demonstrated significant success in these predictive tasks, and the performance of selected models is summarized in Table 2. Our selection is based on studies that employed predictive models and reported at least one performance metric, ensuring high standards of empirical validation.

Table 2.

Detailed results of selected NLP applications in genomic sequencing data analysis.

ModelAccuracyF1MCCROCAUCSpecificityPrecisionRecallValidation method
BERT-5mC180.9330.6560.9660.9380.8725-fold cross-validationa
DNABERT12,b0.9650.9650.93010% data for hold-out validation
SETOMIC340.9500.9210.9970.94520% data for validation
SETQUENCE340.4750.3590.9100.37520% data for validation
BERT-CNN130.7560.5140.7120.8005-fold cross-validation
TFBERT150.8800.8800.7620.9470.8820.8803-fold cross-validation
IGnet350.8380.8240.9240.8750.77810-fold cross validation
MuLan-Methyl250.9480.9500.9680.97920% data for validation
moDNA220.8620.8620.7250.9350.8630.862NRc
DistilBERT+CRF+Attention Mask110.9650.7350.9590.6910.85210% data for hold-out validation
BERT+CRF (with/without)230.9730.8340.9620.7800.89710% data for hold-out validation
BERT-Promoter140.8550.8660.84310-fold cross validation
DeepViFi (pipeline)160.9600.940.9961.00030% data for validation
GENEMASK-based170.8980.96230% data for hold-out validation
MSCAN260.9570.7130.7100.9370.9940.905few-fold cross-validation
MTTLm6270.6990.7130.3990.7710.6490.6815-fold cross-validation
BioDGW-CMI29,b0.8850.8850.9480.8850.8855-fold cross validation
BCMCMI280.8320.8360.6670.9040.8080.8685-fold cross-validation
MiTDS300.7700.8100.96020% data for validation
a. Also includes independent dataset testing based on the benchmark datasets.
b. The results represent the best performance of the model under the tested configurations.
c. Not Reported.
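For orientation, the metrics reported in Table 2 can be computed from held-out predictions as in the following sketch; the labels and probabilities are toy values, not results from any reviewed study, and thresholds and data splits vary by study.

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                      # toy ground-truth labels
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]      # toy predicted probabilities
y_pred = [int(p >= 0.5) for p in y_prob]               # threshold at 0.5

print("Accuracy :", accuracy_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
```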


Methylation and epigenetic sites detection: Recent studies in NLP applied to genomic data have focused on predicting various DNA methylation sites, such as 5-methylcytosine,18 N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine,25 which are crucial for understanding epigenetic modifications and their roles in gene regulation. These efforts are integral to advancing our understanding of gene expression regulation, epigenetic modifications, and their broader implications in health and disease. Future research may further explore other underrepresented epigenetic modifications to expand the applications of NLP in this critical area.

Transcription-related predictions: Other research efforts are dedicated to genome embedding, promoter prediction, and transcription factor binding site prediction. These goals are essential for mapping the regulatory landscapes of genomes, which is a critical step in understanding how genes are controlled and expressed in different conditions. In parallel, detecting CpG islands in DNA sequences has been a focal point,11,23 with models designed to predict promoter regions and identify epigenetic causes of diseases. This line of research aims to unravel the genetic factors that contribute to various diseases, providing insights that could lead to new therapeutic strategies.

Post-transcriptional interaction predictions: Another important application of NLP in genomic research is predicting interactions between various non-coding RNAs and their targets, which play crucial roles in post-transcriptional regulation. Specifically, studies have focused on predicting circRNA-miRNA interactions,28,29 and miRNA-mRNA interactions.30 These interactions are critical for understanding the complex regulatory networks that control gene expression after transcription and influence processes such as mRNA stability, translation, and degradation. By accurately predicting these interactions, NLP models can provide valuable insights into the mechanisms of gene regulation at the post-transcriptional level, which has significant implications for understanding diseases, developing biomarkers, and designing targeted therapies. This section also includes identifying RNA methylation sites, such as m6A27 and m7G,31 which are vital for understanding RNA biology and its impact on gene expression.

Cancer research and oncology: In the context of cancer research, NLP models have been applied to several specialized tasks aimed at improving cancer diagnosis and treatment. These tasks include predicting optimized potential anti-breast cancer therapeutic target genes,33 enhancing tumor type classification,34 and providing machine learning models capable of handling omics data. Moreover, models have been developed to investigate the reusability and generalizability of cell-type annotation in single-cell RNA sequencing data, which is crucial for understanding tumor heterogeneity and the regulatory mechanisms that control gene activity within cancerous tissues, and for deepening our knowledge of cellular functions and the complex interactions that drive cancer progression. Another significant goal is the detection of oncoviral infections in cancer genomes using transformers,16 a critical step in understanding the role of viral integration in cancer development and progression. Collectively, these applications of NLP in oncology highlight its potential to revolutionize cancer research by providing more precise diagnostic tools, therapeutic strategies, and insights into the multi-layered biological data that underpin cancer biology.

Emerging directions: Some goals have been explored less frequently in the literature but hold significant potential for future research. For instance, using NLP models to analyze gene-disease associations across vast biomedical literature can help uncover novel genetic risk factors and pathways associated with complex diseases. Similarly, chromatin accessibility prediction, which involves identifying regions of the genome that are open and accessible to transcription factors, is crucial for understanding gene regulation.21 These emerging research directions could guide future studies, encouraging researchers to explore these underrepresented yet critical areas, ultimately expanding the applications of NLP in genomics and enhancing our understanding of complex biological processes.

Data diversity and accessibility

Upon reviewing recent research on the application of NLP in genomic sequencing data interpretation, a distinct trend in the types of data used becomes evident. DNA sequences dominate the landscape, highlighting their frequent use in NLP-driven genomic analyses. There is also a growing incorporation of RNA sequences and multi-omic data, reflecting a shift toward more diverse and comprehensive datasets.

Additionally, specific genomic features extracted from sequencing data, such as circRNAs, SNPs, and somatic mutations, have been directly integrated into prediction models. These secondary features are typically represented by the short sequence windows surrounding them, which removes the need to incorporate the full genome or raw sequencing data. Because they capture succinct yet clinically relevant genomic information, they are comparatively informative and cost-efficient to combine with radiological or pathology imaging, particularly in cancer and hereditary disease applications aimed at building clinically oriented decision-support systems.
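To make this windowed representation concrete, the following minimal sketch (not drawn from any of the reviewed studies) extracts a fixed-length window around a hypothetical SNP position from a toy reference string; the sequence, coordinates, and window size are illustrative placeholders.

```python
# Minimal sketch: representing a genomic feature (e.g., an SNP) by the short
# sequence window surrounding it, instead of the full genome. The reference
# string, SNP position, and flank size below are illustrative placeholders.

def feature_window(reference: str, position: int, flank: int = 10) -> str:
    """Return the bases within `flank` positions of a 0-based feature position."""
    start = max(0, position - flank)
    end = min(len(reference), position + flank + 1)
    return reference[start:end]

# Toy reference and a hypothetical SNP at 0-based position 25.
reference = "ACGTACGTTTGACCTAGGCATCGATGCCATTAGGCTAACGGTTACG"
snp_position = 25

window = feature_window(reference, snp_position, flank=10)
print(window)  # 21-base context centered on the variant, ready for tokenization
```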

Among the reviewed studies, most datasets are publicly accessible, with a few offering only limited or request-based access to specific subsets. This openness fosters inclusivity and sustainable development in integrating genomic data with NLP, enhancing collaboration and progress in this rapidly evolving field.

Computational resource requirements: Advanced NLP techniques are known for their high hardware requirements. However, the actual computational demands vary substantially with the extent of model training involved. Pre-training models such as BERT or LLMs from scratch requires significant computational resources: for example, Ji et al12 trained DNABERT for 25 days on 8 GPUs, while Zhang et al30 trained on 6000 GPUs. In contrast, using an existing pretrained model as a feature extractor and building a classifier such as XGBoost14,28 or a small neural network29,34 on top has minimal requirements. Fine-tuning or continuing pretraining from a publicly available model lies between these extremes.15,25 In addition, some studies intentionally consider resource constraints in model design and training. For example, Roy et al17 stopped training at 10 000 steps due to resource limitations and diminishing marginal returns, and Wang et al35 designed a small architecture (a two-layer transformer) to fit low-resource environments.
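The low-resource pattern described above can be illustrated with a short, hypothetical sketch: a publicly available pretrained DNA language model is loaded as a frozen feature extractor, its pooled embeddings are computed for a handful of toy sequences, and an XGBoost classifier is trained on top. The checkpoint identifier, sequences, and labels are placeholders, and the exact preprocessing (for example, k-mer splitting for DNABERT-style models) depends on the model chosen.

```python
# Hedged sketch: frozen pretrained DNA language model + lightweight XGBoost head.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from xgboost import XGBClassifier

checkpoint = "path-or-hub-id-of-a-pretrained-dna-language-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()  # no fine-tuning: the language model stays frozen

def embed(sequences):
    """Mean-pool the last hidden state of each sequence into a fixed-size vector."""
    with torch.no_grad():
        batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**batch).last_hidden_state   # (batch, tokens, dim)
        return hidden.mean(dim=1).cpu().numpy()     # (batch, dim)

# Toy labelled windows (e.g., promoter vs non-promoter); real datasets are far larger.
train_seqs = ["ACGTACGTAC" * 5, "TTGACCTAGG" * 5, "CCGGAATTCC" * 5, "GGGTTTAAAC" * 5]
train_labels = np.array([1, 0, 1, 0])

clf = XGBClassifier(n_estimators=100, max_depth=4)
clf.fit(embed(train_seqs), train_labels)
print(clf.predict(embed(["ACGTACGTAC" * 5])))
```

Because only the small classifier is trained, this setup runs on a single modest GPU or even a CPU, which is what makes it attractive for groups without large compute budgets.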

Discussion

The application of LLMs within NLP for genomic data interpretation marks a significant advancement in processing and analyzing complex biological data. This review highlights key areas where these technologies have been effectively utilized, including tokenization techniques, transformer architectures, and the prediction of regulatory annotations. While the progress in these areas is promising, several challenges and opportunities for future research remain.

One of the major challenges in applying NLP and LLMs to genomic data is the inherent complexity of the data itself.41 Genomic sequences contain vast amounts of information, making it difficult for a model to capture the full context. This complexity also affects interpretability: the black-box nature of LLMs makes it challenging for researchers to understand how a model arrives at its predictions. A “black-box” model refers to a system whose internal workings are not transparent or easily understood and whose training data is obscured or undocumented, making it difficult to trace how specific inputs are transformed into outputs.42 For instance, while models like DNABERT12 have shown success in predicting regulatory elements and annotating single-cell RNA data, the pathways and features leading to these predictions are often opaque, which can limit their utility in clinical settings.

To address this issue, future research should focus on developing methods that enhance model interpretability. Techniques such as attention visualization, feature attribution, and post-hoc analysis can provide insights into which parts of the genomic sequence are most influential in the model’s predictions. By making these models more transparent, researchers and clinicians can gain greater confidence in their use for decision-making in personalized medicine.
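As a hedged illustration of one such technique, the sketch below extracts the attention weights of a transformer encoder (any HuggingFace-compatible DNA language model would do; the checkpoint name and input sequence are placeholders) and averages them per token. Attention scores are only a rough proxy for feature importance, but they offer a first view of which parts of a sequence the model focuses on.

```python
# Hedged sketch of attention inspection for a transformer-based DNA model.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "path-or-hub-id-of-a-pretrained-dna-language-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint, output_attentions=True)
model.eval()

sequence = "ACGTACGTTTGACCTAGGCATCGATG"  # toy input
batch = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    attentions = model(**batch).attentions  # tuple: one tensor per layer

# Average attention received by each token over heads and query positions
# in the final layer (tensor shape: batch x heads x query x key).
last_layer = attentions[-1][0]               # (heads, query, key)
token_scores = last_layer.mean(dim=(0, 1))   # (key,)

tokens = tokenizer.convert_ids_to_tokens(batch["input_ids"][0])
for token, score in zip(tokens, token_scores.tolist()):
    print(f"{token}\t{score:.4f}")
```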

Another significant challenge genomic researchers face is the absence of well-established pipelines or guidelines for integrating LLMs into genomic data analysis, for instance, for determining which models are best suited to different data types such as DNA/RNA sequences, proteomics, or epigenomics data. In addition, k-mer tokenization remains the most popular scheme, which may be sub-optimal,43 and selecting the best tokenizer for a given task needs further investigation. Although LLMs have shown great promise in various applications, their use in genomics is still in its early stages, often requiring ad hoc and highly specialized approaches. Developing systematic pipelines that outline best practices for tokenization, model selection, training, and validation in the context of genomic data is crucial. Such guidelines would standardize the use of LLMs across research groups, ensuring reproducibility and reliability of results. Moreover, these frameworks could make LLMs more accessible to researchers with limited computational backgrounds, helping streamline their adoption in genomics.
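For illustration, the sketch below contrasts the two tokenization strategies mentioned here on a toy corpus: overlapping k-mers, the most common choice in the reviewed studies, versus a byte-pair-encoding (BPE) vocabulary learned with the HuggingFace tokenizers library. The corpus and vocabulary size are placeholders; which scheme is preferable for a given genomic task remains an open question.

```python
# Hedged sketch: overlapping k-mer tokenization vs a learned BPE vocabulary.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

corpus = ["ACGTACGTTTGACCTAGG", "CATCGATGCCATTAGGCT", "AACGGTTACGTTGACCTA"]

print(kmer_tokenize(corpus[0]))  # fixed-length, overlapping tokens

# Alternative: learn a BPE vocabulary over the same corpus, so frequent
# subsequences become single tokens instead of fixed-length k-mers.
bpe = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=64, special_tokens=["[UNK]"])
bpe.train_from_iterator(corpus, trainer)
print(bpe.encode(corpus[0]).tokens)  # variable-length, data-driven tokens
```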

One limitation of this study is the constrained scope of the literature review sources, which primarily includes human genome data and only limited exploration of bacterial, viral, and other non-human DNA. Moreover, the study predominantly focuses on cancer when it comes to disease analysis, giving relatively less attention to other disease domains that involve complex DNA interactions, such as neurodegenerative diseases, autoimmune diseases, and genetic disorders. These areas also offer rich opportunities for genomic research and could benefit from applying NLP techniques.

Looking forward, future models should also integrate multimodal data more extensively, such as combining genomic sequences with transcriptomic and proteomic data, as well as clinical data such as laboratory values and diagnoses. Integrating multiple types of genomic data can better reveal complex biological interactions, providing insights into how different layers of biological information interact to drive cellular functions.44 Including clinical data can enhance model predictions by grounding them in real-world patient information, improving clinical relevance and enabling personalized insights. Such an integrated approach can provide a more comprehensive understanding of the regulatory mechanisms governing gene expression and the interplay between different molecular layers. It can also enable models to generate and validate more accurate and biologically meaningful predictions, thereby increasing physicians’ confidence in NLP-generated results and promoting the broader adoption of NLP-based models.

Ultimately, integrating NLP and LLMs into genomics aims to translate these advancements into practical applications. This includes developing models that can predict individual responses to treatments based on genomic data, identify potential therapeutic targets, and provide clinicians with actionable and interpretable insights. As the field progresses, collaboration between computational scientists, geneticists, and clinicians will be essential to ensure that these models are both scientifically valid and clinically useful.

Author contributions

Shuyan Cheng (Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Visualization), Yishu Wei (Conceptualization, Data curation, Investigation, Methodology), Yiliang Zhou (Data curation, Methodology, Visualization), Zihan Xu (Data curation, Methodology, Validation), Drew Wright (Resources), and Yifan Peng (Funding acquisition, Investigation, Resources, Supervision).

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

This work was supported by the National Library of Medicine under Award No. R01LM014306.

Conflicts of interest

None declared.

Data availability

The data underlying this article are available in the article and in its online supplementary material.

References

1. Iuchi H, Matsutani T, Yamada K, et al. Representation learning applications in biological sequence analysis. Comput Struct Biotechnol J. 2021;19:3198-3208.
2. Tang L. Large models for genomics. Nat Methods. 2023;20(12):1868.
3. ScienceDirect Topics. Human Genome. Accessed October 8, 2024.
4. Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet. 2016;17:333-351.
5. Dotan E, Jaschek G, Pupko T, Belinkov Y. Effect of tokenization on transformers for biological sequences. Bioinformatics. 2024;40:btae196.
6. Jiang J, Ke L, Chen L, et al. Transformer technology in molecular science. Wiley Interdiscip Rev Comput Mol Sci. 2024;14:e1725.
7. Consens ME, Dufault C, Wainberg M, et al. To transformers and beyond: large language models for the genome. arXiv, arXiv:2311.07621, 2023.
8. Choi SR, Lee M. Transformer architecture and attention mechanisms in genome data analysis: a comprehensive review. Biology. 2023;12(7):1033.
9. Covidence systematic review software. Veritas Health Innovation, Melbourne, Australia. www.covidence.org
10. Clauwaert J, Waegeman W. Novel transformer networks for improved sequence labeling in genomics. IEEE/ACM Trans Comput Biol Bioinform. 2022;19:97-106.
11. Hossain MJ, Bhuiyan MIH, Abdullah ZR. CpG island detection using transformer model with conditional random field. In: IEEE Bombay Section Signature Conference (IBSSC). IEEE, 2022.
12. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021;37:2112-2120.
13. Le NQK, Ho QT, Nguyen TTD, Ou YY. A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief Bioinform. 2021;22:bbab005.
14. Le NQK, Ho QT, Nguyen VN, Chang JS. BERT-Promoter: an improved sequence-based predictor of DNA promoter using BERT pre-trained model and SHAP feature selection. Comput Biol Chem. 2022;99:107732.
15. Luo H, Shan W, Chen C, Ding P, Luo L. Improving language model of human genome for DNA-protein binding prediction based on task-specific pre-training. Interdiscip Sci. 2023;15:32-43.
16. Rajkumar U, Javadzadeh S, Bafna M, et al. DeepViFi: detecting oncoviral infections in cancer genomes using transformers. In: Proceedings of the 2022 Conference on Bioinformatics, Computational Biology and Health Informatics (BCB ’22). ACM, 2022. p. 8.
17. Roy S, Wallat J, Sundaram SS, Nejdl W, Ganguly N. GeneMask: fast pretraining of gene sequences to enable few-shot learning. In: Proceedings of the European Conference on Artificial Intelligence (ECAI). IOS Press, 2023.
18. Wang S, Liu Y, Liu Y, Zhang Y, Zhu X. BERT-5mC: an interpretable model for predicting 5-methylcytosine sites of DNA based on BERT. PeerJ. 2023;11:e16600.
19. Wang Y, Hou Z, Yang Y, Wong K, Li X. Genome-wide identification and characterization of DNA enhancers with a stacked multivariate fusion framework. PLoS Comput Biol. 2022;18:e1010779.
20. Zhang Y, Chen Y, Chen B, Cao Y, Chen J, Cong H. Predicting protein-DNA binding sites by fine-tuning BERT. In: Huang DS, Jo KH, Jung JH, Premaratne P, Bevilacqua V, Hussain A, eds. Intelligent Computing Theories and Application: 18th International Conference, ICIC 2022, Virtual Event, August 7-11, 2022, Proceedings, Part II. Vol. 13394 of Lecture Notes in Computer Science. Springer Nature Switzerland AG, 2022. p. 663-669.
21. Zhang Y, Chu X, Jiang Y, Wu H, Quan L. SemanticCAP: chromatin accessibility prediction enhanced by features learning from a language model. Genes (Basel). 2022;13:568.
22. An W, Guo Y, Bian Y, et al. MoDNA: motif-oriented pre-training for DNA language model. In: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB ’22). Association for Computing Machinery, 2022.
23. Hossain MJ, Bhuiyan MIH, Abdullah ZR. CpG island detection using modified transformer model with pre-trained embedding. In: 2023 26th International Conference on Computer and Information Technology (ICCIT). IEEE, 2023. p. 1-4.
24. Pipoli V, Cappelli M, Palladini A, Peluso C, Lovino M, Ficarra E. Predicting gene expression levels from DNA sequences and post-transcriptional information with transformers. Comput Methods Programs Biomed. 2022;225:107035.
25. Zeng W, Gautam A, Huson DH. MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction. Gigascience. 2022;12:1-11.
26. Wang H, Huang T, Wang D, Zeng W, Sun Y, Zhang L. MSCAN: multi-scale self- and cross-attention network for RNA methylation site prediction. BMC Bioinformatics. 2024;25:32.
27. Wang H, Zeng W, Huang X, Liu Z, Sun Y, Zhang L. MTTLm6A: a multi-task transfer learning approach for base-resolution mRNA m6A site prediction based on an improved transformer. Math Biosci Eng. 2024;21:272-299.
28. Wei MM, Yu CQ, Li LP, You ZH, Wang L. BCMCMI: a fusion model for predicting circRNA-miRNA interactions combining semantic and meta-path. J Chem Inf Model. 2023;63:5384-5394.
29. Wang XF, Yu CQ, You ZH, Qiao Y, Li ZW, Huang WZ. An efficient circRNA-miRNA interaction prediction model by combining biological text mining and wavelet diffusion-based sparse network structure embedding. Comput Biol Med. 2023;165:107421.
30. Zhang J, Zhu H, Liu Y, Li X. miTDS: uncovering miRNA-mRNA interactions with deep learning for functional target prediction. Methods. 2024;223:65-74.
31. Zhang S, Xu Y, Liang Y. TMSC-m7G: a transformer architecture based on multi-sense-scaled embedding features and convolutional neural network to identify RNA N7-methylguanosine sites. Comput Struct Biotechnol J. 2024;23:129-139.
32. Wang X, Ahsan MU, Zhou Y, Wang K. Transformer-based DNA methylation detection on ionic signals from Oxford Nanopore sequencing data. Quant Biol. 2023;11:287-296.
33. Jhee JH, Song MY, Kim BG, Shin H, Lee SY. Transformer-based gene scoring model for extracting representative characteristic of central dogma process to prioritize pathogenic genes applying breast cancer multi-omics data. In: 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2023. p. 1535-1541.
34. Jurenaite N, León-Periñán D, Donath V, Torge S, Jäkel R. SetQuence & SetOmic: deep set transformer-based representations of cancer multi-omics. In: 2022 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB). IEEE, 2022. p. 1-8.
35. Wang JX, Li Y, Li X, Lu ZH. Alzheimer’s disease classification through imaging genetic data with IGnet. Front Neurosci. 2022;16:846638.
36. Huang GD, Liu XM, Huang TL, Xia LC. The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer. Synth Syst Biotechnol. 2019;4:150-156.
37. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 2019. p. 4171-4186.
38. Hugging Face. NLP Course – Chapter 6, Section 5: Byte-Pair Encoding. Accessed October 8, 2024.
39. Lafferty J, McCallum A, Pereira F. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the International Conference on Machine Learning (ICML). 2001. p. 282-289.
40. Cui H, Wang C, Maan H, et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods. 2024;21:1470-1480.
41. Benegas G, Ye C, Albors C, Li JC, Song YS. Genomic language models: opportunities and challenges. Trends Genet. 2025.
42. Schwartz IS, Link KE, Daneshjou R, Cortés-Penfield N. Black box warning: large language models and the future of infectious diseases consultation. Clin Infect Dis. 2024;78:860-866.
43. Dotan E, Jaschek G, Pupko T, Belinkov Y. Effect of tokenization on transformers for biological sequences. Bioinformatics. 2024;40:btae196.
44. Yang Y, Han L, Yuan Y, Li J, Hei N, Liang H. Gene co-expression network analysis reveals common system-level properties of prognostic genes across cancer types. Nat Commun. 2014;5:3231.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/pages/standard-publication-reuse-rights).
