Using genotype array data to compare multi- and single-sample variant calls and improve variant call sets from deep coverage whole-genome sequencing data

MOTIVATION: Variant calling from next-generation sequencing (NGS) data is susceptible to false positive calls due to sequencing, mapping and other errors. To better distinguish true from false positive calls, we present a method that uses genotype array data from the sequenced samples, rather than p...

Bibliographic Details
Main authors: Shringarpure, Suyash S, Mathias, Rasika A, Hernandez, Ryan D, O’Connor, Timothy D, Szpiech, Zachary A, Torres, Raul, De La Vega, Francisco M, Bustamante, Carlos D, Barnes, Kathleen C, Taub, Margaret A
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408850/
https://www.ncbi.nlm.nih.gov/pubmed/28035032
http://dx.doi.org/10.1093/bioinformatics/btw786
author Shringarpure, Suyash S
Mathias, Rasika A
Hernandez, Ryan D
O’Connor, Timothy D
Szpiech, Zachary A
Torres, Raul
De La Vega, Francisco M
Bustamante, Carlos D
Barnes, Kathleen C
Taub, Margaret A
collection PubMed
description MOTIVATION: Variant calling from next-generation sequencing (NGS) data is susceptible to false positive calls due to sequencing, mapping and other errors. To better distinguish true from false positive calls, we present a method that uses genotype array data from the sequenced samples, rather than public data such as HapMap or dbSNP, to train an accurate classifier using Random Forests. We demonstrate our method on a set of variant calls obtained from 642 African-ancestry genomes from the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), sequenced to high depth (30X). RESULTS: We have applied our classifier to compare call sets generated with different calling methods, including both single-sample and multi-sample callers. At a False Positive Rate of 5%, our method determines true positive rates of 97.5%, 95% and 99% on variant calls obtained using Illumina's single-sample caller CASAVA, Real Time Genomics' multi-sample variant caller, and the GATK UnifiedGenotyper, respectively. Since NGS data may be accompanied by genotype data for the same samples, either collected concurrently with sequencing or from a previous study, our method can be trained on each dataset to provide a more accurate computational validation of site calls compared to generic methods. Moreover, our method allows for adjustment based on allele frequency (e.g. a different set of criteria to determine quality for rare versus common variants) and thereby provides insight into sequencing characteristics that indicate call quality for variants of different frequencies. AVAILABILITY AND IMPLEMENTATION: Code is available on GitHub at: https://github.com/suyashss/variant_validation SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
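The evaluation quoted in the abstract (true positive rate at a False Positive Rate of 5%) can be illustrated with a minimal sketch. This is not the authors' code from the repository above; the function names `label_calls` and `tpr_at_fpr`, the dictionary keys, and the rule "a call is true when it matches the array genotype" are assumptions made for illustration. The scores could come from any classifier (the paper uses Random Forests); here they are simply given as numbers.

```python
def label_calls(seq_calls, array_genotypes):
    """Label sequencing calls at sites that are also on the genotype array.

    Both arguments are dicts keyed by a hypothetical (sample, chrom, pos)
    tuple, mapping to a genotype string. Assumption: a call counts as
    true (1) when it equals the array genotype, otherwise false (0).
    Sites absent from the array are skipped because they have no label.
    """
    return {
        key: int(genotype == array_genotypes[key])
        for key, genotype in seq_calls.items()
        if key in array_genotypes
    }


def tpr_at_fpr(scores, labels, max_fpr=0.05):
    """Return (threshold, tpr, fpr) with the highest TPR whose FPR <= max_fpr.

    scores: classifier scores, higher meaning more likely a true variant.
    labels: 1 for true calls, 0 for false calls (e.g. from label_calls).
    A call is accepted when its score is >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one true and one false call")

    # Lower the threshold step by step; tied scores are accepted together.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = (float("inf"), 0.0, 0.0)  # threshold that accepts nothing
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr = fp / n_neg
        if fpr > max_fpr:
            break  # FPR only grows from here, so stop
        best = (threshold, tp / n_pos, fpr)
    return best
```

With array-derived labels and per-call scores in hand, `tpr_at_fpr(scores, labels, max_fpr=0.05)` gives the operating point at which call sets from different callers can be compared on equal footing.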
format Online
Article
Text
id pubmed-5408850
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5408850 2017-05-03 Using genotype array data to compare multi- and single-sample variant calls and improve variant call sets from deep coverage whole-genome sequencing data Shringarpure, Suyash S Mathias, Rasika A Hernandez, Ryan D O’Connor, Timothy D Szpiech, Zachary A Torres, Raul De La Vega, Francisco M Bustamante, Carlos D Barnes, Kathleen C Taub, Margaret A Bioinformatics Original Papers MOTIVATION: Variant calling from next-generation sequencing (NGS) data is susceptible to false positive calls due to sequencing, mapping and other errors. To better distinguish true from false positive calls, we present a method that uses genotype array data from the sequenced samples, rather than public data such as HapMap or dbSNP, to train an accurate classifier using Random Forests. We demonstrate our method on a set of variant calls obtained from 642 African-ancestry genomes from the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), sequenced to high depth (30X). RESULTS: We have applied our classifier to compare call sets generated with different calling methods, including both single-sample and multi-sample callers. At a False Positive Rate of 5%, our method determines true positive rates of 97.5%, 95% and 99% on variant calls obtained using Illumina's single-sample caller CASAVA, Real Time Genomics' multi-sample variant caller, and the GATK UnifiedGenotyper, respectively. Since NGS data may be accompanied by genotype data for the same samples, either collected concurrently with sequencing or from a previous study, our method can be trained on each dataset to provide a more accurate computational validation of site calls compared to generic methods. Moreover, our method allows for adjustment based on allele frequency (e.g. a different set of criteria to determine quality for rare versus common variants) and thereby provides insight into sequencing characteristics that indicate call quality for variants of different frequencies. AVAILABILITY AND IMPLEMENTATION: Code is available on GitHub at: https://github.com/suyashss/variant_validation SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2017-04-15 2016-12-29 /pmc/articles/PMC5408850/ /pubmed/28035032 http://dx.doi.org/10.1093/bioinformatics/btw786 Text en © The Author 2016. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Using genotype array data to compare multi- and single-sample variant calls and improve variant call sets from deep coverage whole-genome sequencing data
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408850/
https://www.ncbi.nlm.nih.gov/pubmed/28035032
http://dx.doi.org/10.1093/bioinformatics/btw786