Cargando…

Algorithms for efficiently collapsing reads with Unique Molecular Identifiers

BACKGROUND: Unique Molecular Identifiers (UMI) are used in many experiments to find and remove PCR duplicates. There are many tools for solving the problem of deduplicating reads based on their finding reads with the same alignment coordinates and UMIs. However, many tools either cannot handle subst...

Descripción completa

Detalles Bibliográficos
Autor principal:	Liu, Daniel
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	PeerJ Inc. 2019
Materias:	Bioinformatics
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6921982/ https://www.ncbi.nlm.nih.gov/pubmed/31871845 http://dx.doi.org/10.7717/peerj.8275

_version_	1783481263722594304
author	Liu, Daniel
author_facet	Liu, Daniel
author_sort	Liu, Daniel
collection	PubMed
description	BACKGROUND: Unique Molecular Identifiers (UMI) are used in many experiments to find and remove PCR duplicates. There are many tools for solving the problem of deduplicating reads based on their finding reads with the same alignment coordinates and UMIs. However, many tools either cannot handle substitution errors, or require expensive pairwise UMI comparisons that do not efficiently scale to larger datasets. RESULTS: We reformulate the problem of deduplicating UMIs in a manner that enables optimizations to be made, and more efficient data structures to be used. We implement our data structures and optimizations in a tool called UMICollapse, which is able to deduplicate over one million unique UMIs of length 9 at a single alignment position in around 26 s, using only a single thread and much less than 10 GB of memory. CONCLUSIONS: We present a new formulation of the UMI deduplication problem, and show that it can be solved faster, with more sophisticated data structures.
format	Online Article Text
id	pubmed-6921982
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-69219822019-12-23 Algorithms for efficiently collapsing reads with Unique Molecular Identifiers Liu, Daniel PeerJ Bioinformatics BACKGROUND: Unique Molecular Identifiers (UMI) are used in many experiments to find and remove PCR duplicates. There are many tools for solving the problem of deduplicating reads based on their finding reads with the same alignment coordinates and UMIs. However, many tools either cannot handle substitution errors, or require expensive pairwise UMI comparisons that do not efficiently scale to larger datasets. RESULTS: We reformulate the problem of deduplicating UMIs in a manner that enables optimizations to be made, and more efficient data structures to be used. We implement our data structures and optimizations in a tool called UMICollapse, which is able to deduplicate over one million unique UMIs of length 9 at a single alignment position in around 26 s, using only a single thread and much less than 10 GB of memory. CONCLUSIONS: We present a new formulation of the UMI deduplication problem, and show that it can be solved faster, with more sophisticated data structures. PeerJ Inc. 2019-12-16 /pmc/articles/PMC6921982/ /pubmed/31871845 http://dx.doi.org/10.7717/peerj.8275 Text en ©2019 Liu https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
spellingShingle	Bioinformatics Liu, Daniel Algorithms for efficiently collapsing reads with Unique Molecular Identifiers
title	Algorithms for efficiently collapsing reads with Unique Molecular Identifiers
title_full	Algorithms for efficiently collapsing reads with Unique Molecular Identifiers
title_fullStr	Algorithms for efficiently collapsing reads with Unique Molecular Identifiers
title_full_unstemmed	Algorithms for efficiently collapsing reads with Unique Molecular Identifiers
title_short	Algorithms for efficiently collapsing reads with Unique Molecular Identifiers
title_sort	algorithms for efficiently collapsing reads with unique molecular identifiers
topic	Bioinformatics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6921982/ https://www.ncbi.nlm.nih.gov/pubmed/31871845 http://dx.doi.org/10.7717/peerj.8275
work_keys_str_mv	AT liudaniel algorithmsforefficientlycollapsingreadswithuniquemolecularidentifiers

Algorithms for efficiently collapsing reads with Unique Molecular Identifiers

Ejemplares similares