
CHARR efficiently estimates contamination from DNA sequencing data

DNA sample contamination is a major issue in clinical and research applications of whole genome and exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating th...


Bibliographic Details
Main authors: Lu, Wenhan, Gauthier, Laura D., Poterba, Timothy, Giacopuzzi, Edoardo, Goodrich, Julia K., Stevens, Christine R., King, Daniel, Daly, Mark J., Neale, Benjamin M., Karczewski, Konrad J.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10327099/
https://www.ncbi.nlm.nih.gov/pubmed/37425834
http://dx.doi.org/10.1101/2023.06.28.545801
author Lu, Wenhan
Gauthier, Laura D.
Poterba, Timothy
Giacopuzzi, Edoardo
Goodrich, Julia K.
Stevens, Christine R.
King, Daniel
Daly, Mark J.
Neale, Benjamin M.
Karczewski, Konrad J.
collection PubMed
description DNA sample contamination is a major issue in clinical and research applications of whole genome and exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and often not retained or shared widely. We propose a new metric to estimate DNA sample contamination from variant-level whole genome and exome sequence data, CHARR, Contamination from Homozygous Alternate Reference Reads, which leverages the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as efficiently stored variant calls in Hail VDS format. Our results demonstrate that CHARR accurately recapitulates results from existing tools with substantially reduced costs, improving the accuracy and efficiency of downstream analyses of ultra-large whole genome and exome sequencing datasets.
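The idea in the description, reference reads appearing within homozygous alternate calls, can be illustrated with a minimal sketch. This is not the authors' estimator: the abstract gives no formula, so the plain averaging, the `min_depth` filter, and the names `HomAltCall` and `ref_read_fraction` are all assumptions made for illustration. The published method may apply further normalization and filtering that the abstract does not state.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class HomAltCall:
    """Read counts at one site genotyped homozygous alternate (hypothetical type)."""
    ref_reads: int  # reads supporting the reference allele
    alt_reads: int  # reads supporting the alternate allele


def ref_read_fraction(calls: Iterable[HomAltCall], min_depth: int = 10) -> float:
    """Mean fraction of reference reads across homozygous alternate calls.

    In an uncontaminated sample a homozygous alternate site should carry
    almost no reference reads, so a higher mean suggests contamination.
    The depth filter and unweighted mean are illustrative assumptions.
    """
    fractions = []
    for call in calls:
        depth = call.ref_reads + call.alt_reads
        if depth < min_depth:
            continue  # skip low-coverage sites, where the fraction is noisy
        fractions.append(call.ref_reads / depth)
    if not fractions:
        raise ValueError("no homozygous alternate calls passed the depth filter")
    return sum(fractions) / len(fractions)
```

Only per-variant read counts are needed, which matches the description's point that such a metric can be computed from variant-level files rather than BAM/CRAM data.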
format Online
Article
Text
id pubmed-10327099
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-10327099 2023-07-08 CHARR efficiently estimates contamination from DNA sequencing data. bioRxiv, Article. Cold Spring Harbor Laboratory, 2023-06-28. /pmc/articles/PMC10327099/ /pubmed/37425834 http://dx.doi.org/10.1101/2023.06.28.545801 Text, en. This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title CHARR efficiently estimates contamination from DNA sequencing data
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10327099/
https://www.ncbi.nlm.nih.gov/pubmed/37425834
http://dx.doi.org/10.1101/2023.06.28.545801