
Mash Screen: high-throughput sequence containment estimation for genome discovery

The MinHash algorithm has proven effective for rapidly estimating the resemblance of two genomes or metagenomes. However, this method cannot reliably estimate the containment of a genome within a metagenome. Here, we describe an online algorithm capable of measuring the containment of genomes and pr...


Bibliographic Details
Main authors: Ondov, Brian D., Starrett, Gabriel J., Sappington, Anna, Kostic, Aleksandra, Koren, Sergey, Buck, Christopher B., Phillippy, Adam M.
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6833257/
https://www.ncbi.nlm.nih.gov/pubmed/31690338
http://dx.doi.org/10.1186/s13059-019-1841-x
author Ondov, Brian D.
Starrett, Gabriel J.
Sappington, Anna
Kostic, Aleksandra
Koren, Sergey
Buck, Christopher B.
Phillippy, Adam M.
collection PubMed
description The MinHash algorithm has proven effective for rapidly estimating the resemblance of two genomes or metagenomes. However, this method cannot reliably estimate the containment of a genome within a metagenome. Here, we describe an online algorithm capable of measuring the containment of genomes and proteomes within either assembled or unassembled sequencing read sets. We describe several use cases, including contamination screening and retrospective analysis of metagenomes for novel genome discovery. Using this tool, we provide containment estimates for every NCBI RefSeq genome within every SRA metagenome and demonstrate the identification of a novel polyomavirus species from a public metagenome.
format Online
Article
Text
id pubmed-6833257
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6833257 2019-11-08 Mash Screen: high-throughput sequence containment estimation for genome discovery Ondov, Brian D. Starrett, Gabriel J. Sappington, Anna Kostic, Aleksandra Koren, Sergey Buck, Christopher B. Phillippy, Adam M. Genome Biol Method The MinHash algorithm has proven effective for rapidly estimating the resemblance of two genomes or metagenomes. However, this method cannot reliably estimate the containment of a genome within a metagenome. Here, we describe an online algorithm capable of measuring the containment of genomes and proteomes within either assembled or unassembled sequencing read sets. We describe several use cases, including contamination screening and retrospective analysis of metagenomes for novel genome discovery. Using this tool, we provide containment estimates for every NCBI RefSeq genome within every SRA metagenome and demonstrate the identification of a novel polyomavirus species from a public metagenome. BioMed Central 2019-11-05 /pmc/articles/PMC6833257/ /pubmed/31690338 http://dx.doi.org/10.1186/s13059-019-1841-x Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Mash Screen: high-throughput sequence containment estimation for genome discovery
topic Method
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6833257/
https://www.ncbi.nlm.nih.gov/pubmed/31690338
http://dx.doi.org/10.1186/s13059-019-1841-x