
Leveraging long read sequencing from a single individual to provide a comprehensive resource for benchmarking variant calling methods

A high-confidence, comprehensive human variant set is critical in assessing accuracy of sequencing algorithms, which are crucial in precision medicine based on high-throughput sequencing. Although recent works have attempted to provide such a resource, they still do not encompass all major types of...


Bibliographic Details
Main Authors: Mu, John C., Tootoonchi Afshar, Pegah, Mohiyuddin, Marghoob, Chen, Xi, Li, Jian, Bani Asadi, Narges, Gerstein, Mark B., Wong, Wing H., Lam, Hugo Y. K.
Format: Online Article Text
Language: English
Published: Nature Publishing Group 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4585973/
https://www.ncbi.nlm.nih.gov/pubmed/26412485
http://dx.doi.org/10.1038/srep14493
author Mu, John C.
Tootoonchi Afshar, Pegah
Mohiyuddin, Marghoob
Chen, Xi
Li, Jian
Bani Asadi, Narges
Gerstein, Mark B.
Wong, Wing H.
Lam, Hugo Y. K.
collection PubMed
description A high-confidence, comprehensive human variant set is critical in assessing accuracy of sequencing algorithms, which are crucial in precision medicine based on high-throughput sequencing. Although recent works have attempted to provide such a resource, they still do not encompass all major types of variants including structural variants (SVs). Thus, we leveraged the massive high-quality Sanger sequences from the HuRef genome to construct by far the most comprehensive gold set of a single individual, which was cross validated with deep Illumina sequencing, population datasets, and well-established algorithms. It was a necessary effort to completely reanalyze the HuRef genome as its previously published variants were mostly reported five years ago, suffering from compatibility, organization, and accuracy issues that prevent their direct use in benchmarking. Our extensive analysis and validation resulted in a gold set with high specificity and sensitivity. In contrast to the current gold sets of the NA12878 or HS1011 genomes, our gold set is the first that includes small variants, deletion SVs and insertion SVs up to a hundred thousand base-pairs. We demonstrate the utility of our HuRef gold set to benchmark several published SV detection tools.
format Online
Article
Text
id pubmed-4585973
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Nature Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-45859732015-09-30 Leveraging long read sequencing from a single individual to provide a comprehensive resource for benchmarking variant calling methods Mu, John C. Tootoonchi Afshar, Pegah Mohiyuddin, Marghoob Chen, Xi Li, Jian Bani Asadi, Narges Gerstein, Mark B. Wong, Wing H. Lam, Hugo Y. K. Sci Rep Article A high-confidence, comprehensive human variant set is critical in assessing accuracy of sequencing algorithms, which are crucial in precision medicine based on high-throughput sequencing. Although recent works have attempted to provide such a resource, they still do not encompass all major types of variants including structural variants (SVs). Thus, we leveraged the massive high-quality Sanger sequences from the HuRef genome to construct by far the most comprehensive gold set of a single individual, which was cross validated with deep Illumina sequencing, population datasets, and well-established algorithms. It was a necessary effort to completely reanalyze the HuRef genome as its previously published variants were mostly reported five years ago, suffering from compatibility, organization, and accuracy issues that prevent their direct use in benchmarking. Our extensive analysis and validation resulted in a gold set with high specificity and sensitivity. In contrast to the current gold sets of the NA12878 or HS1011 genomes, our gold set is the first that includes small variants, deletion SVs and insertion SVs up to a hundred thousand base-pairs. We demonstrate the utility of our HuRef gold set to benchmark several published SV detection tools. Nature Publishing Group 2015-09-28 /pmc/articles/PMC4585973/ /pubmed/26412485 http://dx.doi.org/10.1038/srep14493 Text en Copyright © 2015, Macmillan Publishers Limited http://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License. 
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
title Leveraging long read sequencing from a single individual to provide a comprehensive resource for benchmarking variant calling methods
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4585973/
https://www.ncbi.nlm.nih.gov/pubmed/26412485
http://dx.doi.org/10.1038/srep14493