
Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash

Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique to estimate set similarity that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when a...

Bibliographic Details
Main Authors:	Rahman Hera, Mahmudur; Pierce-Ward, N. Tessa; Koslicki, David
Format:	Online Article Text
Language:	English
Published:	Cold Spring Harbor Laboratory Press 2023
Subjects:	Methods
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10538494/ https://www.ncbi.nlm.nih.gov/pubmed/37344105 http://dx.doi.org/10.1101/gr.277651.123

author	Rahman Hera, Mahmudur; Pierce-Ward, N. Tessa; Koslicki, David
author_sort	Rahman Hera, Mahmudur
collection	PubMed
description	Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique to estimate set similarity that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when applied to sets of very dissimilar sizes. FracMinHash was recently introduced as a modification of MinHash to compensate for this lack of performance when set sizes differ. This approach has been successfully applied to metagenomic taxonomic profiling in the widely used tool sourmash gather. Although experimental evidence has been encouraging, FracMinHash has not yet been analyzed from a theoretical perspective. In this paper, we perform such an analysis to derive various statistics of FracMinHash, and prove that although FracMinHash is not unbiased (in the sense that its expected value is not equal to the quantity it attempts to estimate), this bias is easily corrected for both the containment and Jaccard index versions. Next, we show how FracMinHash can be used to compute point estimates as well as confidence intervals for evolutionary mutation distance between a pair of sequences by assuming a simple mutation model. We also investigate edge cases in which these analyses may fail, in order to effectively warn users of FracMinHash about the likelihood of such cases. Our analyses show that FracMinHash estimates the containment of a genome in a large metagenome more accurately and more precisely than traditional MinHash, and the point estimates and confidence intervals perform significantly better in estimating mutation distances.
format	Online Article Text
id	pubmed-10538494
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Cold Spring Harbor Laboratory Press
record_format	MEDLINE/PubMed
spelling	pubmed-10538494 2023-09-29 Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash. Rahman Hera, Mahmudur; Pierce-Ward, N. Tessa; Koslicki, David. Genome Res. Methods. Cold Spring Harbor Laboratory Press 2023-07 /pmc/articles/PMC10538494/ /pubmed/37344105 http://dx.doi.org/10.1101/gr.277651.123 Text en © 2023 Rahman Hera et al.; Published by Cold Spring Harbor Laboratory Press. This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
title	Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash
topic	Methods
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10538494/ https://www.ncbi.nlm.nih.gov/pubmed/37344105 http://dx.doi.org/10.1101/gr.277651.123
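
The estimators summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation or the sourmash API: the 64-bit SHA-1-based hash, the helper names (`kmers`, `hash_kmer`, `frac_sketch`, `containment_estimate`, `mutation_rate_estimate`), and the parameter `scale` are hypothetical choices made for illustration. The bias-correction factor 1 - (1 - s)^|A| and the relation C = (1 - p)^k under a simple independent point-mutation model are also assumptions here, because this record's abstract does not state them explicitly.

```python
import hashlib

H = 2 ** 64  # size of the hash space (assumption: 64-bit hash values)


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to an integer in [0, H). Illustrative hash: first 8 bytes of SHA-1."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def frac_sketch(kmer_set, scale):
    """Keep every hash value that falls at or below scale * H (0 < scale <= 1)."""
    threshold = scale * H
    return {h for h in map(hash_kmer, kmer_set) if h <= threshold}


def containment_estimate(kmers_a, kmers_b, scale):
    """Bias-corrected estimate of the containment of A in B from the two sketches.

    Assumed correction: divide the raw sketch containment by 1 - (1 - scale)^|A|.
    Returns 0.0 when the sketch of A is empty.
    """
    sk_a = frac_sketch(kmers_a, scale)
    sk_b = frac_sketch(kmers_b, scale)
    if not sk_a:
        return 0.0
    raw = len(sk_a & sk_b) / len(sk_a)
    bias = 1.0 - (1.0 - scale) ** len(kmers_a)
    return raw / bias


def mutation_rate_estimate(containment, k):
    """Point estimate of the mutation rate p, assuming containment = (1 - p)^k."""
    return 1.0 - containment ** (1.0 / k)


if __name__ == "__main__":
    a = kmers("ACGTAC", 3)
    b = kmers("ACGT", 3)
    c = containment_estimate(a, b, scale=1.0)
    print(c, mutation_rate_estimate(c, k=3))
```

With `scale = 1.0` every k-mer is retained, so the estimate reduces to the exact containment (barring hash collisions); smaller values of `scale` trade accuracy for a smaller sketch. Confidence intervals, which the paper also derives, are not covered by this sketch.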
