Estimating sequencing error rates using families

BACKGROUND: As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately qu...

Bibliographic Details
Main Authors:	Paskov, Kelley; Jung, Jae-Yoon; Chrisman, Brianna; Stockham, Nate T.; Washington, Peter; Varma, Maya; Sun, Min Woo; Wall, Dennis P.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2021
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8063364/ https://www.ncbi.nlm.nih.gov/pubmed/33892748 http://dx.doi.org/10.1186/s13040-021-00259-6

_version_	1783681939899678720
author	Paskov, Kelley Jung, Jae-Yoon Chrisman, Brianna Stockham, Nate T. Washington, Peter Varma, Maya Sun, Min Woo Wall, Dennis P.
author_facet	Paskov, Kelley Jung, Jae-Yoon Chrisman, Brianna Stockham, Nate T. Washington, Peter Varma, Maya Sun, Min Woo Wall, Dennis P.
author_sort	Paskov, Kelley
collection	PubMed
description	BACKGROUND: As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since it allows us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample. RESULTS: We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites. CONCLUSION: Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
format	Online Article Text
id	pubmed-8063364
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-8063364 2021-04-23 Estimating sequencing error rates using families Paskov, Kelley Jung, Jae-Yoon Chrisman, Brianna Stockham, Nate T. Washington, Peter Varma, Maya Sun, Min Woo Wall, Dennis P. BioData Min Research BACKGROUND: As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since it allows us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample. RESULTS: We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites. CONCLUSION: Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology. BioMed Central 2021-04-23 /pmc/articles/PMC8063364/ /pubmed/33892748 http://dx.doi.org/10.1186/s13040-021-00259-6 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle	Research Paskov, Kelley Jung, Jae-Yoon Chrisman, Brianna Stockham, Nate T. Washington, Peter Varma, Maya Sun, Min Woo Wall, Dennis P. Estimating sequencing error rates using families
title	Estimating sequencing error rates using families
title_full	Estimating sequencing error rates using families
title_fullStr	Estimating sequencing error rates using families
title_full_unstemmed	Estimating sequencing error rates using families
title_short	Estimating sequencing error rates using families
title_sort	estimating sequencing error rates using families
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8063364/ https://www.ncbi.nlm.nih.gov/pubmed/33892748 http://dx.doi.org/10.1186/s13040-021-00259-6
work_keys_str_mv	AT paskovkelley estimatingsequencingerrorratesusingfamilies AT jungjaeyoon estimatingsequencingerrorratesusingfamilies AT chrismanbrianna estimatingsequencingerrorratesusingfamilies AT stockhamnatet estimatingsequencingerrorratesusingfamilies AT washingtonpeter estimatingsequencingerrorratesusingfamilies AT varmamaya estimatingsequencingerrorratesusingfamilies AT sunminwoo estimatingsequencingerrorratesusingfamilies AT walldennisp estimatingsequencingerrorratesusingfamilies