
Mash: fast genome and metagenome distance estimation using MinHash

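Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license (https://github.com/marbl/mash).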
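To make the sketch-and-compare idea in the abstract concrete, here is a minimal Python illustration, not the authors' implementation. It builds a bottom-s MinHash sketch from k-mers, estimates the Jaccard index j from two sketches, and converts it to a distance with D = -(1/k) ln(2j / (1 + j)), the formula given in the Mash paper. The function names, the BLAKE2 hash (a stand-in for the hash Mash actually uses), and the default k and s are assumptions made for illustration; reverse-complement handling and the P value test are omitted.

```python
import hashlib
import math


def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (illustrative stand-in hash)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, s=1000):
    """Bottom-s sketch: the s smallest distinct k-mer hashes, sorted."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:s]


def jaccard_estimate(sk_a, sk_b, s=1000):
    """Estimate the Jaccard index from the s smallest hashes of the union."""
    merged = sorted(set(sk_a) | set(sk_b))[:s]
    if not merged:
        return 0.0
    shared = set(sk_a) & set(sk_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k=21):
    """Convert a Jaccard estimate j into a mutation-distance estimate."""
    if j <= 0.0:
        return 1.0  # no shared hashes: report the maximum distance
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)


if __name__ == "__main__":
    # Hypothetical toy sequences; real inputs would be whole genomes or read sets.
    a = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCCGATTACA"
    b = "ACGTTGCAAGCTTAGGCTTACGTAGCTAGGATCCGATTACA"
    j = jaccard_estimate(sketch(a, k=5, s=20), sketch(b, k=5, s=20), s=20)
    print(j, mash_distance(j, k=5))
```

Because only the s smallest hashes are kept, the comparison cost depends on the sketch size rather than on the genome length, which is what makes all-pairs comparison of large collections tractable.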

Bibliographic Details
Main authors: Ondov, Brian D., Treangen, Todd J., Melsted, Páll, Mallonee, Adam B., Bergman, Nicholas H., Koren, Sergey, Phillippy, Adam M.
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4915045/
https://www.ncbi.nlm.nih.gov/pubmed/27323842
http://dx.doi.org/10.1186/s13059-016-0997-x
author Ondov, Brian D.
Treangen, Todd J.
Melsted, Páll
Mallonee, Adam B.
Bergman, Nicholas H.
Koren, Sergey
Phillippy, Adam M.
collection PubMed
description Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license (https://github.com/marbl/mash). ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13059-016-0997-x) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4915045
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4915045 2016-06-22 Mash: fast genome and metagenome distance estimation using MinHash Ondov, Brian D. Treangen, Todd J. Melsted, Páll Mallonee, Adam B. Bergman, Nicholas H. Koren, Sergey Phillippy, Adam M. Genome Biol Software Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license (https://github.com/marbl/mash). ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13059-016-0997-x) contains supplementary material, which is available to authorized users. BioMed Central 2016-06-20 /pmc/articles/PMC4915045/ /pubmed/27323842 http://dx.doi.org/10.1186/s13059-016-0997-x Text en © The Author(s). 2016 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Mash: fast genome and metagenome distance estimation using MinHash
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4915045/
https://www.ncbi.nlm.nih.gov/pubmed/27323842
http://dx.doi.org/10.1186/s13059-016-0997-x