Cargando…

Two Efficient Techniques to Find Approximate Overlaps between Sequences

The next-generation sequencing (NGS) technology outputs a huge number of sequences (reads) that require further processing. After applying prefiltering techniques in order to eliminate redundancy and to correct erroneous reads, an overlap-based assembler typically finds the longest exact suffix-pref...

Descripción completa

Detalles Bibliográficos
Autor principal:	Haj Rachid, Maan
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Hindawi Publishing Corporation 2017
Materias:	Research Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5331309/ https://www.ncbi.nlm.nih.gov/pubmed/28293632 http://dx.doi.org/10.1155/2017/2731385

_version_	1782511348433813504
author	Haj Rachid, Maan
author_facet	Haj Rachid, Maan
author_sort	Haj Rachid, Maan
collection	PubMed
description	The next-generation sequencing (NGS) technology outputs a huge number of sequences (reads) that require further processing. After applying prefiltering techniques in order to eliminate redundancy and to correct erroneous reads, an overlap-based assembler typically finds the longest exact suffix-prefix match between each ordered pair of the input reads. However, another trend has been evolving for the purpose of solving an approximate version of the overlap problem. The main benefit of this direction is the ability to skip time-consuming error-detecting techniques which are applied in the prefiltering stage. In this work, we present and compare two techniques to solve the approximate overlap problem. The first adapts a compact prefix tree to efficiently solve the approximate all-pairs suffix-prefix problem, while the other utilizes a well-known principle, namely, the pigeonhole principle, to identify a potential overlap match in order to ultimately solve the same problem. Our results show that our solution using the pigeonhole principle has better space and time consumption over an FM-based solution, while our solution based on prefix tree has the best space consumption between all three solutions. The number of mismatches (hamming distance) is used to define the approximate matching between strings in our work.
format	Online Article Text
id	pubmed-5331309
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	Hindawi Publishing Corporation
record_format	MEDLINE/PubMed
spelling	pubmed-53313092017-03-14 Two Efficient Techniques to Find Approximate Overlaps between Sequences Haj Rachid, Maan Biomed Res Int Research Article The next-generation sequencing (NGS) technology outputs a huge number of sequences (reads) that require further processing. After applying prefiltering techniques in order to eliminate redundancy and to correct erroneous reads, an overlap-based assembler typically finds the longest exact suffix-prefix match between each ordered pair of the input reads. However, another trend has been evolving for the purpose of solving an approximate version of the overlap problem. The main benefit of this direction is the ability to skip time-consuming error-detecting techniques which are applied in the prefiltering stage. In this work, we present and compare two techniques to solve the approximate overlap problem. The first adapts a compact prefix tree to efficiently solve the approximate all-pairs suffix-prefix problem, while the other utilizes a well-known principle, namely, the pigeonhole principle, to identify a potential overlap match in order to ultimately solve the same problem. Our results show that our solution using the pigeonhole principle has better space and time consumption over an FM-based solution, while our solution based on prefix tree has the best space consumption between all three solutions. The number of mismatches (hamming distance) is used to define the approximate matching between strings in our work. Hindawi Publishing Corporation 2017 2017-02-15 /pmc/articles/PMC5331309/ /pubmed/28293632 http://dx.doi.org/10.1155/2017/2731385 Text en Copyright © 2017 Maan Haj Rachid. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Haj Rachid, Maan Two Efficient Techniques to Find Approximate Overlaps between Sequences
title	Two Efficient Techniques to Find Approximate Overlaps between Sequences
title_full	Two Efficient Techniques to Find Approximate Overlaps between Sequences
title_fullStr	Two Efficient Techniques to Find Approximate Overlaps between Sequences
title_full_unstemmed	Two Efficient Techniques to Find Approximate Overlaps between Sequences
title_short	Two Efficient Techniques to Find Approximate Overlaps between Sequences
title_sort	two efficient techniques to find approximate overlaps between sequences
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5331309/ https://www.ncbi.nlm.nih.gov/pubmed/28293632 http://dx.doi.org/10.1155/2017/2731385
work_keys_str_mv	AT hajrachidmaan twoefficienttechniquestofindapproximateoverlapsbetweensequences

Two Efficient Techniques to Find Approximate Overlaps between Sequences

Ejemplares similares