Feature-based classifiers for somatic mutation detection in tumour–normal paired sequencing data

Motivation: The study of cancer genomes now routinely involves using next-generation sequencing technology (NGS) to profile tumours for single nucleotide variant (SNV) somatic mutations. However, surprisingly few published bioinformatics methods exist for the specific purpose of identifying somatic...

Bibliographic Details
Main authors: Ding, Jiarui, Bashashati, Ali, Roth, Andrew, Oloumi, Arusha, Tse, Kane, Zeng, Thomas, Haffari, Gholamreza, Hirst, Martin, Marra, Marco A., Condon, Anne, Aparicio, Samuel, Shah, Sohrab P.
Format: Online Article Text
Language: English
Published: Oxford University Press 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3259434/
https://www.ncbi.nlm.nih.gov/pubmed/22084253
http://dx.doi.org/10.1093/bioinformatics/btr629
author Ding, Jiarui
Bashashati, Ali
Roth, Andrew
Oloumi, Arusha
Tse, Kane
Zeng, Thomas
Haffari, Gholamreza
Hirst, Martin
Marra, Marco A.
Condon, Anne
Aparicio, Samuel
Shah, Sohrab P.
author_sort Ding, Jiarui
collection PubMed
description Motivation: The study of cancer genomes now routinely involves using next-generation sequencing technology (NGS) to profile tumours for single nucleotide variant (SNV) somatic mutations. However, surprisingly few published bioinformatics methods exist for the specific purpose of identifying somatic mutations from NGS data and existing tools are often inaccurate, yielding intolerably high false prediction rates. As such, the computational problem of accurately inferring somatic mutations from paired tumour/normal NGS data remains an unsolved challenge. Results: We present the comparison of four standard supervised machine learning algorithms for the purpose of somatic SNV prediction in tumour/normal NGS experiments. To evaluate these approaches (random forest, Bayesian additive regression tree, support vector machine and logistic regression), we constructed 106 features representing 3369 candidate somatic SNVs from 48 breast cancer genomes, originally predicted with naive methods and subsequently revalidated to establish ground truth labels. We trained the classifiers on this data (consisting of 1015 true somatic mutations and 2354 non-somatic mutation positions) and conducted a rigorous evaluation of these methods using a cross-validation framework and hold-out test NGS data from both exome capture and whole genome shotgun platforms. All learning algorithms employing predictive discriminative approaches with feature selection improved the predictive accuracy over standard approaches by statistically significant margins. In addition, using unsupervised clustering of the ground truth ‘false positive’ predictions, we noted several distinct classes and present evidence suggesting non-overlapping sources of technical artefacts illuminating important directions for future study. Availability: Software called MutationSeq and datasets are available from http://compbio.bccrc.ca. 
Contact: saparicio@bccrc.ca Supplementary information: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-3259434
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-3259434 2012-01-17 Bioinformatics Original Papers Oxford University Press 2012-01-15 2011-11-13 /pmc/articles/PMC3259434/ /pubmed/22084253 http://dx.doi.org/10.1093/bioinformatics/btr629 Text en © The Author(s) 2011. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Feature-based classifiers for somatic mutation detection in tumour–normal paired sequencing data
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3259434/
https://www.ncbi.nlm.nih.gov/pubmed/22084253
http://dx.doi.org/10.1093/bioinformatics/btr629
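
The Results part of the abstract describes training supervised classifiers on per-candidate features and evaluating them in a cross-validation framework. The following is only a minimal sketch of that kind of evaluation, not the MutationSeq implementation: it uses one of the four named algorithms (logistic regression), written from scratch with the Python standard library, and runs stratified 5-fold cross-validation. The two features (`tumour_vaf`, `normal_vaf`), the fold count, the learning-rate settings and all of the data are assumptions made up for illustration; the study itself used 106 features and validated labels.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def standardise(train_rows, rows):
    """Z-score `rows` using per-feature mean and std estimated on `train_rows`."""
    n = len(train_rows)
    cols = list(zip(*train_rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def score(w, x):
    """Linear score: bias w[0] plus the weighted features."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))


def train_logistic(X, y, epochs=300, lr=0.5):
    """Fit logistic regression by batch gradient descent on the mean log-loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(score(w, xi)) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def stratified_folds(y, k, rng):
    """Split indices into k folds, dealing each class out round-robin."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def cross_validate(X, y, k=5, seed=0):
    """Return the held-out accuracy of each of the k folds."""
    accuracies = []
    for held_out in stratified_folds(y, k, random.Random(seed)):
        held = set(held_out)
        train_idx = [i for i in range(len(y)) if i not in held]
        X_train = [X[i] for i in train_idx]
        # Scaling parameters come from the training fold only.
        w = train_logistic(standardise(X_train, X_train),
                           [y[i] for i in train_idx])
        X_test = standardise(X_train, [X[i] for i in held_out])
        correct = sum((score(w, x) >= 0) == (y[i] == 1)
                      for x, i in zip(X_test, held_out))
        accuracies.append(correct / len(held_out))
    return accuracies


def make_toy_data(n=150, seed=1):
    """SYNTHETIC candidates, not real sequencing data.

    Hypothetical features: [tumour_vaf, normal_vaf]. Label 1 = somatic
    (variant reads almost absent from the normal), 0 = non-somatic.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        somatic = 1 if i % 3 == 0 else 0  # about one somatic per two non-somatic
        tumour_vaf = rng.uniform(0.2, 0.6)
        normal_vaf = rng.uniform(0.0, 0.03) if somatic else rng.uniform(0.15, 0.6)
        X.append([tumour_vaf, normal_vaf])
        y.append(somatic)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    fold_acc = cross_validate(X, y)
    print("per-fold accuracy:", fold_acc)
    print("mean accuracy:", sum(fold_acc) / len(fold_acc))
```

The folds are stratified because the classes are imbalanced (the abstract reports 1015 somatic against 2354 non-somatic positions), and the scaling parameters are estimated on the training fold alone so that no information from the held-out fold reaches the classifier.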