streammd: fast low-memory duplicate marking using a Bloom filter

SUMMARY: Identification of duplicate templates is a common preprocessing step in bulk sequence analysis; for large libraries, this can be resource intensive. Here, we present streammd: a fast, memory-efficient, single-pass duplicate marker operating on the principle of a Bloom filter. streammd close...

Bibliographic Details
Main author: Leonard, Conrad
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects: Applications Note
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10112951/
https://www.ncbi.nlm.nih.gov/pubmed/37027230
http://dx.doi.org/10.1093/bioinformatics/btad181
author Leonard, Conrad
collection PubMed
description SUMMARY: Identification of duplicate templates is a common preprocessing step in bulk sequence analysis; for large libraries, this can be resource intensive. Here, we present streammd: a fast, memory-efficient, single-pass duplicate marker operating on the principle of a Bloom filter. streammd closely reproduces outputs from Picard MarkDuplicates while being substantially faster, and requires much less memory than SAMBLASTER. AVAILABILITY AND IMPLEMENTATION: streammd is a C++ program available from GitHub https://github.com/delocalizer/streammd under the MIT license.
format Online
Article
Text
id pubmed-10112951
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10112951 2023-04-19
streammd: fast low-memory duplicate marking using a Bloom filter
Leonard, Conrad
Bioinformatics
Applications Note
Oxford University Press 2023-04-07
/pmc/articles/PMC10112951/ /pubmed/37027230 http://dx.doi.org/10.1093/bioinformatics/btad181
Text en
© The Author(s) 2023. Published by Oxford University Press.
https://creativecommons.org/licenses/by/4.0/
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title streammd: fast low-memory duplicate marking using a Bloom filter
topic Applications Note