CONSULT: accurate contamination removal using locality-sensitive hashing
A fundamental question appears in many bioinformatics applications: Does a sequencing read belong to a large dataset of genomes from some broad taxonomic group, even when the closest match in the set is evolutionarily divergent from the query? For example, low-coverage genome sequencing (skimming) p...
Main authors: | Rachtman, Eleonora; Bafna, Vineet; Mirarab, Siavash |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2021 |
Subjects: | Standard Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8340999/ https://www.ncbi.nlm.nih.gov/pubmed/34377979 http://dx.doi.org/10.1093/nargab/lqab071 |
_version_ | 1783733859241689088 |
---|---|
author | Rachtman, Eleonora Bafna, Vineet Mirarab, Siavash |
author_sort | Rachtman, Eleonora |
collection | PubMed |
description | A fundamental question appears in many bioinformatics applications: Does a sequencing read belong to a large dataset of genomes from some broad taxonomic group, even when the closest match in the set is evolutionarily divergent from the query? For example, low-coverage genome sequencing (skimming) projects either assemble the organelle genome or compute genomic distances directly from unassembled reads. Using unassembled reads needs contamination detection because samples often include reads from unintended groups of species. Similarly, assembling the organelle genome needs distinguishing organelle and nuclear reads. While k-mer-based methods have shown promise in read-matching, prior studies have shown that existing methods are insufficiently sensitive for contamination detection. Here, we introduce a new read-matching tool called CONSULT that tests whether k-mers from a query fall within a user-specified distance of the reference dataset using locality-sensitive hashing. Taking advantage of large memory machines available nowadays, CONSULT libraries accommodate tens of thousands of microbial species. Our results show that CONSULT has higher true-positive and lower false-positive rates of contamination detection than leading methods such as Kraken-II and improves distance calculation from genome skims. We also demonstrate that CONSULT can distinguish organelle reads from nuclear reads, leading to dramatic improvements in skim-based mitochondrial assemblies. |
format | Online Article Text |
id | pubmed-8340999 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-8340999 2021-08-09 CONSULT: accurate contamination removal using locality-sensitive hashing. Rachtman, Eleonora; Bafna, Vineet; Mirarab, Siavash. NAR Genom Bioinform. Standard Article. Oxford University Press 2021-08-05. /pmc/articles/PMC8340999/ /pubmed/34377979 http://dx.doi.org/10.1093/nargab/lqab071 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
title | CONSULT: accurate contamination removal using locality-sensitive hashing |
topic | Standard Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8340999/ https://www.ncbi.nlm.nih.gov/pubmed/34377979 http://dx.doi.org/10.1093/nargab/lqab071 |
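
The description in this record says that CONSULT tests whether k-mers from a query fall within a user-specified distance of the reference dataset using locality-sensitive hashing. The record does not give the distance measure, the hash family, or any parameter values. The following is therefore only a minimal sketch of the general idea, assuming Hamming distance between equal-length k-mers and a bit-sampling style hash (each table keys a k-mer by a random subset of its positions). The names `KmerLSH`, `within`, `read_matches`, and all parameters are hypothetical and are not taken from the CONSULT software.

```python
import random
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class KmerLSH:
    """Illustrative position-sampling LSH index over k-mers (assumption, not CONSULT's code)."""

    def __init__(self, k: int, positions_per_table: int, num_tables: int, seed: int = 0):
        rng = random.Random(seed)
        self.k = k
        # Each table hashes a k-mer by the characters at a fixed random subset of positions.
        self.masks = [sorted(rng.sample(range(k), positions_per_table))
                      for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    @staticmethod
    def _key(kmer: str, mask: list[int]) -> str:
        return "".join(kmer[i] for i in mask)

    def add(self, kmer: str) -> None:
        """Store a reference k-mer in every table under its sampled key."""
        for mask, table in zip(self.masks, self.tables):
            table[self._key(kmer, mask)].append(kmer)

    def within(self, kmer: str, max_dist: int) -> bool:
        """True if some stored k-mer sharing a bucket is within max_dist mismatches.

        Near neighbours can be missed (LSH is probabilistic), but a True
        answer is always backed by an explicit distance check.
        """
        for mask, table in zip(self.masks, self.tables):
            for candidate in table.get(self._key(kmer, mask), ()):
                if hamming(kmer, candidate) <= max_dist:
                    return True
        return False


def read_matches(index: KmerLSH, read: str, max_dist: int, min_hits: int = 1) -> bool:
    """Call a read a match if at least min_hits of its k-mers have a close reference k-mer."""
    hits = sum(index.within(km, max_dist) for km in kmers(read, index.k))
    return hits >= min_hits


if __name__ == "__main__":
    index = KmerLSH(k=8, positions_per_table=4, num_tables=3, seed=1)
    for km in kmers("ACGTACGTTGCA", 8):
        index.add(km)
    print(read_matches(index, "TTACGTACGTTG", max_dist=1))
```

In a sketch like this, more tables raise the chance that a k-mer with a few mismatches shares at least one bucket with its reference neighbour, while more sampled positions per table shrink the buckets and cut the number of explicit distance checks. How the actual tool balances these is not stated in this record.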