An improved string composition method for sequence comparison

Bibliographic Details
Main authors:	Lu, Guoqing, Zhang, Shunpu, Fang, Xiang
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2423438/ https://www.ncbi.nlm.nih.gov/pubmed/18541050 http://dx.doi.org/10.1186/1471-2105-9-S6-S15

_version_	1782156099449782272
author	Lu, Guoqing Zhang, Shunpu Fang, Xiang
author_facet	Lu, Guoqing Zhang, Shunpu Fang, Xiang
author_sort	Lu, Guoqing
collection	PubMed
description	BACKGROUND: Historically, two categories of computational algorithms (alignment-based and alignment-free) have been applied to sequence comparison–one of the most fundamental issues in bioinformatics. Multiple sequence alignment, although dominantly used by biologists, possesses both fundamental as well as computational limitations. Consequently, alignment-free methods have been explored as important alternatives in estimating sequence similarity. Of the alignment-free methods, the string composition vector (CV) methods, which use the frequencies of nucleotide or amino acid strings to represent sequence information, show promising results in genome sequence comparison of prokaryotes. The existing CV-based methods, however, suffer certain statistical problems, thereby underestimating the amount of evolutionary information in genetic sequences. RESULTS: We show that the existing string composition based methods have two problems, one related to the Markov model assumption and the other associated with the denominator of the frequency normalization equation. We propose an improved complete composition vector method under the assumption of a uniform and independent model to estimate sequence information contributing to selection for sequence comparison. Phylogenetic analyses using both simulated and experimental data sets demonstrate that our new method is more robust compared with existing counterparts and comparable in robustness with alignment-based methods. CONCLUSION: We observed two problems existing in the currently used string composition methods and proposed a new robust method for the estimation of evolutionary information of genetic sequences. In addition, we discussed that it might not be necessary to use relatively long strings to build a complete composition vector (CCV), due to the overlapping nature of vector strings with a variable length. We suggested a practical approach for the choice of an optimal string length to construct the CCV.
format	Text
id	pubmed-2423438
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2423438 2008-06-11 An improved string composition method for sequence comparison Lu, Guoqing Zhang, Shunpu Fang, Xiang BMC Bioinformatics Research BioMed Central 2008-05-28 /pmc/articles/PMC2423438/ /pubmed/18541050 http://dx.doi.org/10.1186/1471-2105-9-S6-S15 Text en Copyright © 2008 Lu et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Lu, Guoqing Zhang, Shunpu Fang, Xiang An improved string composition method for sequence comparison
title	An improved string composition method for sequence comparison
title_full	An improved string composition method for sequence comparison
title_fullStr	An improved string composition method for sequence comparison
title_full_unstemmed	An improved string composition method for sequence comparison
title_short	An improved string composition method for sequence comparison
title_sort	improved string composition method for sequence comparison
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2423438/ https://www.ncbi.nlm.nih.gov/pubmed/18541050 http://dx.doi.org/10.1186/1471-2105-9-S6-S15
work_keys_str_mv	AT luguoqing animprovedstringcompositionmethodforsequencecomparison AT zhangshunpu animprovedstringcompositionmethodforsequencecomparison AT fangxiang animprovedstringcompositionmethodforsequencecomparison AT luguoqing improvedstringcompositionmethodforsequencecomparison AT zhangshunpu improvedstringcompositionmethodforsequencecomparison AT fangxiang improvedstringcompositionmethodforsequencecomparison
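
The description field characterizes composition vector (CV) methods as using the frequencies of nucleotide or amino acid strings to represent sequence information. A minimal Python sketch of that general idea follows; it is not the improved method proposed in the article, whose normalization is not given in this record. The function names, the default DNA alphabet, and the cosine-based dissimilarity are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
import math


def composition_vector(seq, k, alphabet="ACGT"):
    """Relative frequency of every length-k string over `alphabet` in `seq`.

    Hypothetical helper: plain observed frequencies only, with no
    background-model correction of any kind.
    """
    seq = seq.upper()
    # Count overlapping windows of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fix the vector layout: all len(alphabet)**k strings in lexicographic order.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def cosine_dissimilarity(u, v):
    """1 - cos(u, v); an assumed choice of distance between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


if __name__ == "__main__":
    a = composition_vector("ACGTACGTACGT", 2)
    b = composition_vector("ACGTTTGTACGA", 2)
    print(cosine_dissimilarity(a, b))
```

The vector length grows as `len(alphabet) ** k`, which is one practical reason the choice of string length discussed in the abstract matters.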
