CONSULT: accurate contamination removal using locality-sensitive hashing

Bibliographic Details
Main authors: Rachtman, Eleonora, Bafna, Vineet, Mirarab, Siavash
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8340999/
https://www.ncbi.nlm.nih.gov/pubmed/34377979
http://dx.doi.org/10.1093/nargab/lqab071
author Rachtman, Eleonora
Bafna, Vineet
Mirarab, Siavash
collection PubMed
description A fundamental question appears in many bioinformatics applications: Does a sequencing read belong to a large dataset of genomes from some broad taxonomic group, even when the closest match in the set is evolutionarily divergent from the query? For example, low-coverage genome sequencing (skimming) projects either assemble the organelle genome or compute genomic distances directly from unassembled reads. Using unassembled reads requires contamination detection because samples often include reads from unintended groups of species. Similarly, assembling the organelle genome requires distinguishing organelle reads from nuclear reads. Although k-mer-based methods have shown promise in read matching, prior studies have shown that existing methods are insufficiently sensitive for contamination detection. Here, we introduce a new read-matching tool called CONSULT that uses locality-sensitive hashing to test whether k-mers from a query fall within a user-specified distance of the reference dataset. Taking advantage of the large-memory machines now available, CONSULT libraries accommodate tens of thousands of microbial species. Our results show that CONSULT has higher true-positive and lower false-positive rates of contamination detection than leading methods such as Kraken-II and improves distance calculation from genome skims. We also demonstrate that CONSULT can distinguish organelle reads from nuclear reads, leading to dramatic improvements in skim-based mitochondrial assemblies.
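The idea in the description above, testing whether a query k-mer lies within a given distance of any reference k-mer by means of locality-sensitive hashing, can be illustrated with a minimal sketch. This is not CONSULT's implementation: it assumes Hamming distance and a simple bit-sampling hash family, and every name in it (`KmerLSHIndex`, `has_neighbor`, `read_matches`, `min_hits`) is hypothetical and chosen only for illustration.

```python
import random


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class KmerLSHIndex:
    """Illustrative bit-sampling LSH index over reference k-mers.

    Each table hashes a k-mer by the characters found at a fixed random
    subset of positions. Two k-mers that differ at only a few positions
    are likely to collide in at least one table.
    """

    def __init__(self, k, num_tables=8, positions_per_table=5, seed=0):
        rng = random.Random(seed)
        self.k = k
        # One random set of sampled positions per table (assumed parameters).
        self.masks = [
            tuple(sorted(rng.sample(range(k), positions_per_table)))
            for _ in range(num_tables)
        ]
        self.tables = [dict() for _ in self.masks]

    @staticmethod
    def _key(kmer, mask):
        return "".join(kmer[i] for i in mask)

    def add(self, kmer):
        """Insert a reference k-mer into every table."""
        for mask, table in zip(self.masks, self.tables):
            table.setdefault(self._key(kmer, mask), set()).add(kmer)

    def has_neighbor(self, kmer, max_dist):
        """True if a colliding reference k-mer is within max_dist mismatches.

        Candidates are verified exactly, so a True answer is never a false
        positive; a close neighbor can still be missed if it collides with
        the query in no table.
        """
        for mask, table in zip(self.masks, self.tables):
            for candidate in table.get(self._key(kmer, mask), ()):
                if hamming(kmer, candidate) <= max_dist:
                    return True
        return False


def read_matches(read, index, max_dist, min_hits=1):
    """Call a read a match if at least min_hits of its k-mers have a neighbor."""
    hits = sum(index.has_neighbor(km, max_dist) for km in kmers(read, index.k))
    return hits >= min_hits


# Toy usage: index one short reference, then query two reads.
reference = "ACGTACGTACGTACGTACGT"
index = KmerLSHIndex(k=8)
for km in kmers(reference, 8):
    index.add(km)

print(read_matches(reference[2:14], index, max_dist=2))  # read drawn from the reference
print(read_matches("TTTTTTTTTTTT", index, max_dist=2))   # unrelated read
```

Raising `num_tables` or lowering `positions_per_table` makes it more likely that a divergent k-mer collides with its reference neighbor, at the cost of more memory and more candidates to verify; that is the usual sensitivity and cost trade-off in this family of schemes.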
format Online
Article
Text
id pubmed-8340999
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8340999 2021-08-09 CONSULT: accurate contamination removal using locality-sensitive hashing Rachtman, Eleonora Bafna, Vineet Mirarab, Siavash NAR Genom Bioinform Standard Article Oxford University Press 2021-08-05 /pmc/articles/PMC8340999/ /pubmed/34377979 http://dx.doi.org/10.1093/nargab/lqab071 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title CONSULT: accurate contamination removal using locality-sensitive hashing
topic Standard Article