Cargando…

Seeding with minimized subsequence

MOTIVATION: Modern methods for computation-intensive tasks in sequence analysis (e.g. read mapping, sequence alignment, genome assembly, etc.) often first transform each sequence into a list of short, regular-length seeds so that compact data structures and efficient algorithms can be employed to ha...

Descripción completa

Detalles Bibliográficos
Autores principales:	Li, Xiang, Shi, Qian, Chen, Ke, Shao, Mingfu
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Oxford University Press 2023
Materias:	Genome Sequence Analysis
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311335/ https://www.ncbi.nlm.nih.gov/pubmed/37387132 http://dx.doi.org/10.1093/bioinformatics/btad218

_version_	1785066721711751168
author	Li, Xiang Shi, Qian Chen, Ke Shao, Mingfu
author_facet	Li, Xiang Shi, Qian Chen, Ke Shao, Mingfu
author_sort	Li, Xiang
collection	PubMed
description	MOTIVATION: Modern methods for computation-intensive tasks in sequence analysis (e.g. read mapping, sequence alignment, genome assembly, etc.) often first transform each sequence into a list of short, regular-length seeds so that compact data structures and efficient algorithms can be employed to handle the ever-growing large-scale data. Seeding methods using kmers (substrings of length k) have gained tremendous success in processing sequencing data with low mutation/error rates. However, they are much less effective for sequencing data with high error rates as kmers cannot tolerate errors. RESULTS: We propose SubseqHash, a strategy that uses subsequences, rather than substrings, as seeds. Formally, SubseqHash maps a string of length n to its smallest subsequence of length k, k < n, according to a given order overall length-k strings. Finding the smallest subsequence of a string by enumeration is impractical as the number of subsequences grows exponentially. To overcome this barrier, we propose a novel algorithmic framework that consists of a specifically designed order (termed ABC order) and an algorithm that computes the minimized subsequence under an ABC order in polynomial time. We first show that the ABC order exhibits the desired property and the probability of hash collision using the ABC order is close to the Jaccard index. We then show that SubseqHash overwhelmingly outperforms the substring-based seeding methods in producing high-quality seed-matches for three critical applications: read mapping, sequence alignment, and overlap detection. SubseqHash presents a major algorithmic breakthrough for tackling the high error rates and we expect it to be widely adapted for long-reads analysis. AVAILABILITY AND IMPLEMENTATION: SubseqHash is freely available at https://github.com/Shao-Group/subseqhash.
format	Online Article Text
id	pubmed-10311335
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-103113352023-07-01 Seeding with minimized subsequence Li, Xiang Shi, Qian Chen, Ke Shao, Mingfu Bioinformatics Genome Sequence Analysis MOTIVATION: Modern methods for computation-intensive tasks in sequence analysis (e.g. read mapping, sequence alignment, genome assembly, etc.) often first transform each sequence into a list of short, regular-length seeds so that compact data structures and efficient algorithms can be employed to handle the ever-growing large-scale data. Seeding methods using kmers (substrings of length k) have gained tremendous success in processing sequencing data with low mutation/error rates. However, they are much less effective for sequencing data with high error rates as kmers cannot tolerate errors. RESULTS: We propose SubseqHash, a strategy that uses subsequences, rather than substrings, as seeds. Formally, SubseqHash maps a string of length n to its smallest subsequence of length k, k < n, according to a given order overall length-k strings. Finding the smallest subsequence of a string by enumeration is impractical as the number of subsequences grows exponentially. To overcome this barrier, we propose a novel algorithmic framework that consists of a specifically designed order (termed ABC order) and an algorithm that computes the minimized subsequence under an ABC order in polynomial time. We first show that the ABC order exhibits the desired property and the probability of hash collision using the ABC order is close to the Jaccard index. We then show that SubseqHash overwhelmingly outperforms the substring-based seeding methods in producing high-quality seed-matches for three critical applications: read mapping, sequence alignment, and overlap detection. SubseqHash presents a major algorithmic breakthrough for tackling the high error rates and we expect it to be widely adapted for long-reads analysis. AVAILABILITY AND IMPLEMENTATION: SubseqHash is freely available at https://github.com/Shao-Group/subseqhash. Oxford University Press 2023-06-30 /pmc/articles/PMC10311335/ /pubmed/37387132 http://dx.doi.org/10.1093/bioinformatics/btad218 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Genome Sequence Analysis Li, Xiang Shi, Qian Chen, Ke Shao, Mingfu Seeding with minimized subsequence
title	Seeding with minimized subsequence
title_full	Seeding with minimized subsequence
title_fullStr	Seeding with minimized subsequence
title_full_unstemmed	Seeding with minimized subsequence
title_short	Seeding with minimized subsequence
title_sort	seeding with minimized subsequence
topic	Genome Sequence Analysis
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311335/ https://www.ncbi.nlm.nih.gov/pubmed/37387132 http://dx.doi.org/10.1093/bioinformatics/btad218
work_keys_str_mv	AT lixiang seedingwithminimizedsubsequence AT shiqian seedingwithminimizedsubsequence AT chenke seedingwithminimizedsubsequence AT shaomingfu seedingwithminimizedsubsequence

Seeding with minimized subsequence

Ejemplares similares