
Simultaneous identification of long similar substrings in large sets of sequences

BACKGROUND: Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known. Gene annotation pipelines match these sequences in order to identify genes and their alternative splice forms. However, the software currently available cannot simultaneou...


Bibliographic Details
Main Authors:	Kleffe, Jürgen; Möller, Friedrich; Wittig, Burghardt
Format:	Text
Language:	English
Published:	BioMed Central 2007
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1892095/ https://www.ncbi.nlm.nih.gov/pubmed/17570866 http://dx.doi.org/10.1186/1471-2105-8-S5-S7

_version_	1782133826010480640
author	Kleffe, Jürgen; Möller, Friedrich; Wittig, Burghardt
author_sort	Kleffe, Jürgen
collection	PubMed
description	BACKGROUND: Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known. Gene annotation pipelines match these sequences in order to identify genes and their alternative splice forms. However, the software currently available cannot simultaneously compare sets of sequences as large as necessary, especially if errors must be considered. RESULTS: We therefore present a new algorithm for the identification of almost perfectly matching substrings in very large sets of sequences. Its implementation, called ClustDB, is considerably faster and can handle 16 times more data than VMATCH, the most memory-efficient exact program known today. ClustDB simultaneously generates large sets of exactly matching substrings of a given minimum length as seeds for a novel method of match extension with errors. It generates alignments of maximum length with a given maximum number of errors within each overlapping window of a given size. Such alignments are not optimal in the usual sense but are faster to calculate and often more appropriate than traditional alignments for genomic sequence comparisons, EST and full-length cDNA matching, and genomic sequence assembly. The method is used to check the overlaps and to reveal possible assembly errors for 1377 Medicago truncatula BAC-size sequences published at . CONCLUSION: The program ClustDB proves that window alignment is an efficient way to find long sequence sections of homogeneous alignment quality, as expected in the case of random errors, and to detect systematic errors resulting from sequence contaminations. Such inserts are systematically overlooked in long alignments controlled only by tuning penalties for mismatches and gaps. ClustDB is freely available for academic use.
format	Text
id	pubmed-1892095
institution	National Center for Biotechnology Information
language	English
publishDate	2007
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1892095 2007-06-15 Simultaneous identification of long similar substrings in large sets of sequences Kleffe, Jürgen; Möller, Friedrich; Wittig, Burghardt BMC Bioinformatics Research BioMed Central 2007-05-24 /pmc/articles/PMC1892095/ /pubmed/17570866 http://dx.doi.org/10.1186/1471-2105-8-S5-S7 Text en Copyright © 2007 Kleffe et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Simultaneous identification of long similar substrings in large sets of sequences
title_sort	simultaneous identification of long similar substrings in large sets of sequences
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1892095/ https://www.ncbi.nlm.nih.gov/pubmed/17570866 http://dx.doi.org/10.1186/1471-2105-8-S5-S7
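
The abstract describes a two-step idea: exact matches of a minimum length serve as seeds, and each seed is extended as far as possible while every window of a given size contains at most a given number of errors. The following is a minimal sketch of that idea, not the ClustDB implementation. It rests on several assumptions that are not in the record: seeds are found with a plain hash index of k-mers, only mismatches are counted as errors (no gaps), and the names `find_seeds`, `window_align`, `window`, and `max_err` are hypothetical.

```python
from collections import defaultdict, deque


def find_seeds(seqs, k):
    """Return exact shared k-mers as (i, pos_i, j, pos_j) with i < j.

    Assumption: a simple hash index stands in for whatever data
    structure ClustDB actually uses.
    """
    index = defaultdict(list)
    for sid, s in enumerate(seqs):
        for p in range(len(s) - k + 1):
            index[s[p:p + k]].append((sid, p))
    seeds = []
    for hits in index.values():
        for x in range(len(hits)):
            for y in range(x + 1, len(hits)):
                (i, p), (j, q) = hits[x], hits[y]
                if i != j:  # only matches between different sequences
                    seeds.append((i, p, j, q))
    return seeds


def _extend(a, b, i, j, step, window, max_err):
    """Gapless extension from a[i], b[j] in direction `step` (+1 or -1).

    Stops before the first column that would put more than `max_err`
    mismatches into the last `window` columns, then trims the result
    so that it ends on a matching column.
    """
    recent = deque()  # mismatch flags of the last `window` columns
    errs = 0          # number of mismatches currently in `recent`
    n = 0             # columns accepted so far
    keep = 0          # accepted length up to the last matching column
    while 0 <= i < len(a) and 0 <= j < len(b):
        mis = int(a[i] != b[j])
        if len(recent) == window:
            errs -= recent.popleft()
        if errs + mis > max_err:
            break
        recent.append(mis)
        errs += mis
        n += 1
        if not mis:
            keep = n
        i += step
        j += step
    return keep


def window_align(a, b, pa, pb, k, window, max_err):
    """Extend the exact seed a[pa:pa+k] == b[pb:pb+k] in both directions.

    Returns (start_a, start_b, length). Requires k >= window so that no
    window can collect errors from both sides of the seed.
    """
    if k < window:
        raise ValueError("this sketch requires seed length k >= window")
    left = _extend(a, b, pa - 1, pb - 1, -1, window, max_err)
    right = _extend(a, b, pa + k, pb + k, +1, window, max_err)
    return pa - left, pb - left, left + k + right


if __name__ == "__main__":
    a = "GGCAACGTACGTTTAGCC"
    b = "GGTAACGTACGTTTCGAA"
    print(window_align(a, b, 4, 4, 8, window=4, max_err=1))
```

With `max_err=0` the extension stops at the first mismatch on each side, so the result is the maximal exact match around the seed. Raising `max_err` lets the alignment run through isolated errors but still ends it where errors cluster, which is the behaviour the abstract attributes to window alignment.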
