SUP: a probabilistic framework to propagate genome sequence uncertainty, with applications

Bibliographic Details
Main Authors: Becker, Devan; Champredon, David; Chato, Connor; Gugan, Gopi; Poon, Art
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10124968/
https://www.ncbi.nlm.nih.gov/pubmed/37101658
http://dx.doi.org/10.1093/nargab/lqad038
author Becker, Devan
Champredon, David
Chato, Connor
Gugan, Gopi
Poon, Art
collection PubMed
description Genetic sequencing is subject to many different types of errors, but most analyses treat the resultant sequences as if they are known without error. Next generation sequencing methods rely on significantly larger numbers of reads than previous sequencing methods in exchange for a loss of accuracy in each individual read. Still, the coverage of such machines is imperfect and leaves uncertainty in many of the base calls. In this work, we demonstrate that the uncertainty in sequencing techniques will affect downstream analysis and propose a straightforward method to propagate the uncertainty. Our method (which we have dubbed Sequence Uncertainty Propagation, or SUP) uses a probabilistic matrix representation of individual sequences which incorporates base quality scores as a measure of uncertainty that naturally lead to resampling and replication as a framework for uncertainty propagation. With the matrix representation, resampling possible base calls according to quality scores provides a bootstrap- or prior distribution-like first step towards genetic analysis. Analyses based on these re-sampled sequences will include a more complete evaluation of the error involved in such analyses. We demonstrate our resampling method on SARS-CoV-2 data. The resampling procedures add a linear computational cost to the analyses, but the large impact on the variance in downstream estimates makes it clear that ignoring this uncertainty may lead to overly confident conclusions. We show that SARS-CoV-2 lineage designations via Pangolin are much less certain than the bootstrap support reported by Pangolin would imply and the clock rate estimates for SARS-CoV-2 are much more variable than reported.
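The resampling idea in the description above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each base call carries a Phred quality score Q with error probability 10^(-Q/10), and that the error mass is split equally among the three other bases. The function names (phred_to_error, probability_matrix, resample) are hypothetical and chosen only for this illustration.

```python
import random

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def probability_matrix(calls, quals):
    """Build one row per position holding probabilities for A, C, G, T.

    Assumption (not from the source): the called base gets 1 - p_err and
    the remaining error mass is shared equally by the other three bases.
    """
    matrix = []
    for base, q in zip(calls, quals):
        p_err = phred_to_error(q)
        row = [1 - p_err if b == base else p_err / 3 for b in BASES]
        matrix.append(row)
    return matrix


def resample(matrix, n, seed=None):
    """Draw n sequences, sampling each position independently from its row."""
    rng = random.Random(seed)
    return [
        "".join(rng.choices(BASES, weights=row)[0] for row in matrix)
        for _ in range(n)
    ]


if __name__ == "__main__":
    # Hypothetical consensus calls and quality scores, for illustration only.
    matrix = probability_matrix("ACGT", [40, 30, 20, 3])
    for replicate in resample(matrix, n=5, seed=1):
        print(replicate)
```

Each resampled sequence would then be passed through the downstream analysis (for example lineage assignment or clock rate estimation), and the spread of the results across replicates gives a rough picture of how sequence uncertainty propagates. Running the analysis once per replicate is what makes the added cost linear in the number of replicates.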
format Online
Article
Text
id pubmed-10124968
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10124968 2023-04-25 SUP: a probabilistic framework to propagate genome sequence uncertainty, with applications Becker, Devan; Champredon, David; Chato, Connor; Gugan, Gopi; Poon, Art NAR Genom Bioinform Standard Article
Oxford University Press 2023-04-24 /pmc/articles/PMC10124968/ /pubmed/37101658 http://dx.doi.org/10.1093/nargab/lqad038 Text en © The Author(s) 2023. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title SUP: a probabilistic framework to propagate genome sequence uncertainty, with applications
topic Standard Article