
SNooPer: a machine learning-based method for somatic variant identification from low-pass next-generation sequencing

Bibliographic Details
Main authors: Spinella, Jean-François; Mehanna, Pamela; Vidal, Ramon; Saillour, Virginie; Cassart, Pauline; Richer, Chantal; Ouimet, Manon; Healy, Jasmine; Sinnett, Daniel
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5109690/
https://www.ncbi.nlm.nih.gov/pubmed/27842494
http://dx.doi.org/10.1186/s12864-016-3281-2
author Spinella, Jean-François
Mehanna, Pamela
Vidal, Ramon
Saillour, Virginie
Cassart, Pauline
Richer, Chantal
Ouimet, Manon
Healy, Jasmine
Sinnett, Daniel
collection PubMed
description BACKGROUND: Next-generation sequencing (NGS) allows unbiased, in-depth interrogation of cancer genomes. Many somatic variant callers have been developed, yet accurate ascertainment of somatic variants remains a considerable challenge, as evidenced by the varying mutation call rates and low concordance among callers. Statistical model-based algorithms that are currently available perform well under ideal scenarios, such as high sequencing depth, homogeneous tumor samples, and high somatic variant allele frequency (VAF), but show limited performance with sub-optimal data such as low-pass whole-exome/genome sequencing data. While the goal of any cancer sequencing project is to identify a relevant, and limited, set of somatic variants for further sequence/functional validation, the inherently complex nature of cancer genomes combined with technical issues directly related to sequencing and alignment can affect the specificity and/or sensitivity of most callers.
RESULTS: For these reasons, we developed SNooPer, a versatile machine learning approach that uses Random Forest classification models to accurately call somatic variants in low-depth sequencing data. To train the data-specific model, SNooPer uses a subset of variant positions from the sequencing output for which the class (true variation or sequencing error) is known. Here, using a real dataset of 40 childhood acute lymphoblastic leukemia patients, we show how the SNooPer algorithm is not affected by low coverage or low VAFs, and can be used to reduce overall sequencing costs while maintaining high specificity and sensitivity in somatic variant calling. When compared to three benchmarked somatic callers, SNooPer demonstrated the best overall performance.
CONCLUSIONS: While the goal of any cancer sequencing project is to identify a relevant, and limited, set of somatic variants for further sequence/functional validation, the inherently complex nature of cancer genomes combined with technical issues directly related to sequencing and alignment can affect the specificity and/or sensitivity of most callers. The flexibility of SNooPer’s random forest protects against technical bias and systematic errors, and is appealing in that it does not rely on user-defined parameters. The code and user guide can be downloaded at https://sourceforge.net/projects/snooper/.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-016-3281-2) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5109690
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5109690 2016-11-21 SNooPer: a machine learning-based method for somatic variant identification from low-pass next-generation sequencing. BMC Genomics. Software. BioMed Central 2016-11-14 /pmc/articles/PMC5109690/ /pubmed/27842494 http://dx.doi.org/10.1186/s12864-016-3281-2 Text en © The Author(s). 2016 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title SNooPer: a machine learning-based method for somatic variant identification from low-pass next-generation sequencing
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5109690/
https://www.ncbi.nlm.nih.gov/pubmed/27842494
http://dx.doi.org/10.1186/s12864-016-3281-2