Cargando…

Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters

Alignment-free classification tools have enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines primarily due to their computational efficiency. Originally k-mer based, such tools often lack sensitivity when faced with sequencing errors and polymorphisms. In...

Descripción completa

Detalles Bibliográficos
Autores principales:	Chu, Justin, Mohamadi, Hamid, Erhan, Emre, Tse, Jeffery, Chiu, Readman, Yeo, Sarah, Birol, Inanc
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	National Academy of Sciences 2020
Materias:	PNAS Plus
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7382288/ https://www.ncbi.nlm.nih.gov/pubmed/32641514 http://dx.doi.org/10.1073/pnas.1903436117

_version_	1783563216398319616
author	Chu, Justin Mohamadi, Hamid Erhan, Emre Tse, Jeffery Chiu, Readman Yeo, Sarah Birol, Inanc
author_facet	Chu, Justin Mohamadi, Hamid Erhan, Emre Tse, Jeffery Chiu, Readman Yeo, Sarah Birol, Inanc
author_sort	Chu, Justin
collection	PubMed
description	Alignment-free classification tools have enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines primarily due to their computational efficiency. Originally k-mer based, such tools often lack sensitivity when faced with sequencing errors and polymorphisms. In response, some tools have been augmented with spaced seeds, which are capable of tolerating mismatches. However, spaced seeds have seen little practical use in classification because they bring increased computational and memory costs compared to methods that use k-mers. These limitations have also caused the design and length of practical spaced seeds to be constrained, since storing spaced seeds can be costly. To address these challenges, we have designed a probabilistic data structure called a multiindex Bloom Filter (miBF), which can store multiple spaced seed sequences with a low memory cost that remains static regardless of seed length or seed design. We formalize how to minimize the false-positive rate of miBFs when classifying sequences from multiple targets or references. Available within BioBloom Tools, we illustrate the utility of miBF in two use cases: read-binning for targeted assembly, and taxonomic read assignment. In our benchmarks, an analysis pipeline based on miBF shows higher sensitivity and specificity for read-binning than sequence alignment-based methods, also executing in less time. Similarly, for taxonomic classification, miBF enables higher sensitivity than a conventional spaced seed-based approach, while using half the memory and an order of magnitude less computational time.
format	Online Article Text
id	pubmed-7382288
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	National Academy of Sciences
record_format	MEDLINE/PubMed
spelling	pubmed-73822882020-07-30 Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters Chu, Justin Mohamadi, Hamid Erhan, Emre Tse, Jeffery Chiu, Readman Yeo, Sarah Birol, Inanc Proc Natl Acad Sci U S A PNAS Plus Alignment-free classification tools have enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines primarily due to their computational efficiency. Originally k-mer based, such tools often lack sensitivity when faced with sequencing errors and polymorphisms. In response, some tools have been augmented with spaced seeds, which are capable of tolerating mismatches. However, spaced seeds have seen little practical use in classification because they bring increased computational and memory costs compared to methods that use k-mers. These limitations have also caused the design and length of practical spaced seeds to be constrained, since storing spaced seeds can be costly. To address these challenges, we have designed a probabilistic data structure called a multiindex Bloom Filter (miBF), which can store multiple spaced seed sequences with a low memory cost that remains static regardless of seed length or seed design. We formalize how to minimize the false-positive rate of miBFs when classifying sequences from multiple targets or references. Available within BioBloom Tools, we illustrate the utility of miBF in two use cases: read-binning for targeted assembly, and taxonomic read assignment. In our benchmarks, an analysis pipeline based on miBF shows higher sensitivity and specificity for read-binning than sequence alignment-based methods, also executing in less time. Similarly, for taxonomic classification, miBF enables higher sensitivity than a conventional spaced seed-based approach, while using half the memory and an order of magnitude less computational time. National Academy of Sciences 2020-07-21 2020-07-08 /pmc/articles/PMC7382288/ /pubmed/32641514 http://dx.doi.org/10.1073/pnas.1903436117 Text en Copyright © 2020 the Author(s). Published by PNAS. https://creativecommons.org/licenses/by-nc-nd/4.0/ https://creativecommons.org/licenses/by-nc-nd/4.0/This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/) .
spellingShingle	PNAS Plus Chu, Justin Mohamadi, Hamid Erhan, Emre Tse, Jeffery Chiu, Readman Yeo, Sarah Birol, Inanc Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters
title	Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters
title_full	Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters
title_fullStr	Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters
title_full_unstemmed	Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters
title_short	Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters
title_sort	mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex bloom filters
topic	PNAS Plus
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7382288/ https://www.ncbi.nlm.nih.gov/pubmed/32641514 http://dx.doi.org/10.1073/pnas.1903436117
work_keys_str_mv	AT chujustin mismatchtolerantalignmentfreesequenceclassificationusingmultiplespacedseedsandmultiindexbloomfilters AT mohamadihamid mismatchtolerantalignmentfreesequenceclassificationusingmultiplespacedseedsandmultiindexbloomfilters AT erhanemre mismatchtolerantalignmentfreesequenceclassificationusingmultiplespacedseedsandmultiindexbloomfilters AT tsejeffery mismatchtolerantalignmentfreesequenceclassificationusingmultiplespacedseedsandmultiindexbloomfilters AT chiureadman mismatchtolerantalignmentfreesequenceclassificationusingmultiplespacedseedsandmultiindexbloomfilters AT yeosarah mismatchtolerantalignmentfreesequenceclassificationusingmultiplespacedseedsandmultiindexbloomfilters AT birolinanc mismatchtolerantalignmentfreesequenceclassificationusingmultiplespacedseedsandmultiindexbloomfilters

Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters

Ejemplares similares