Cargando…

Optimal neighborhood indexing for protein similarity search

BACKGROUND: Similarity inference, one of the main bioinformatics tasks, has to face an exponential growth of the biological data. A classical approach used to cope with this data flow involves heuristics with large seed indexes. In order to speed up this technique, the index can be enhanced by stori...

Descripción completa

Detalles Bibliográficos
Autores principales:	Peterlongo, Pierre, Noé, Laurent, Lavenier, Dominique, Nguyen, Van Hoa, Kucherov, Gregory, Giraud, Mathieu
Formato:	Texto
Lenguaje:	English
Publicado:	BioMed Central 2008
Materias:	Research Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2640386/ https://www.ncbi.nlm.nih.gov/pubmed/19087280 http://dx.doi.org/10.1186/1471-2105-9-534

_version_	1782164568851611648
author	Peterlongo, Pierre Noé, Laurent Lavenier, Dominique Nguyen, Van Hoa Kucherov, Gregory Giraud, Mathieu
author_facet	Peterlongo, Pierre Noé, Laurent Lavenier, Dominique Nguyen, Van Hoa Kucherov, Gregory Giraud, Mathieu
author_sort	Peterlongo, Pierre
collection	PubMed
description	BACKGROUND: Similarity inference, one of the main bioinformatics tasks, has to face an exponential growth of the biological data. A classical approach used to cope with this data flow involves heuristics with large seed indexes. In order to speed up this technique, the index can be enhanced by storing additional information to limit the number of random memory accesses. However, this improvement leads to a larger index that may become a bottleneck. In the case of protein similarity search, we propose to decrease the index size by reducing the amino acid alphabet. RESULTS: The paper presents two main contributions. First, we show that an optimal neighborhood indexing combining an alphabet reduction and a longer neighborhood leads to a reduction of 35% of memory involved into the process, without sacrificing the quality of results nor the computational time. Second, our approach led us to develop a new kind of substitution score matrices and their associated e-value parameters. In contrast to usual matrices, these matrices are rectangular since they compare amino acid groups from different alphabets. We describe the method used for computing those matrices and we provide some typical examples that can be used in such comparisons. Supplementary data can be found on the website . CONCLUSION: We propose a practical index size reduction of the neighborhood data, that does not negatively affect the performance of large-scale search in protein sequences. Such an index can be used in any study involving large protein data. Moreover, rectangular substitution score matrices and their associated statistical parameters can have applications in any study involving an alphabet reduction.
format	Text
id	pubmed-2640386
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-26403862009-02-12 Optimal neighborhood indexing for protein similarity search Peterlongo, Pierre Noé, Laurent Lavenier, Dominique Nguyen, Van Hoa Kucherov, Gregory Giraud, Mathieu BMC Bioinformatics Research Article BACKGROUND: Similarity inference, one of the main bioinformatics tasks, has to face an exponential growth of the biological data. A classical approach used to cope with this data flow involves heuristics with large seed indexes. In order to speed up this technique, the index can be enhanced by storing additional information to limit the number of random memory accesses. However, this improvement leads to a larger index that may become a bottleneck. In the case of protein similarity search, we propose to decrease the index size by reducing the amino acid alphabet. RESULTS: The paper presents two main contributions. First, we show that an optimal neighborhood indexing combining an alphabet reduction and a longer neighborhood leads to a reduction of 35% of memory involved into the process, without sacrificing the quality of results nor the computational time. Second, our approach led us to develop a new kind of substitution score matrices and their associated e-value parameters. In contrast to usual matrices, these matrices are rectangular since they compare amino acid groups from different alphabets. We describe the method used for computing those matrices and we provide some typical examples that can be used in such comparisons. Supplementary data can be found on the website . CONCLUSION: We propose a practical index size reduction of the neighborhood data, that does not negatively affect the performance of large-scale search in protein sequences. Such an index can be used in any study involving large protein data. Moreover, rectangular substitution score matrices and their associated statistical parameters can have applications in any study involving an alphabet reduction. BioMed Central 2008-12-16 /pmc/articles/PMC2640386/ /pubmed/19087280 http://dx.doi.org/10.1186/1471-2105-9-534 Text en Copyright © 2008 Peterlongo et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Peterlongo, Pierre Noé, Laurent Lavenier, Dominique Nguyen, Van Hoa Kucherov, Gregory Giraud, Mathieu Optimal neighborhood indexing for protein similarity search
title	Optimal neighborhood indexing for protein similarity search
title_full	Optimal neighborhood indexing for protein similarity search
title_fullStr	Optimal neighborhood indexing for protein similarity search
title_full_unstemmed	Optimal neighborhood indexing for protein similarity search
title_short	Optimal neighborhood indexing for protein similarity search
title_sort	optimal neighborhood indexing for protein similarity search
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2640386/ https://www.ncbi.nlm.nih.gov/pubmed/19087280 http://dx.doi.org/10.1186/1471-2105-9-534
work_keys_str_mv	AT peterlongopierre optimalneighborhoodindexingforproteinsimilaritysearch AT noelaurent optimalneighborhoodindexingforproteinsimilaritysearch AT lavenierdominique optimalneighborhoodindexingforproteinsimilaritysearch AT nguyenvanhoa optimalneighborhoodindexingforproteinsimilaritysearch AT kucherovgregory optimalneighborhoodindexingforproteinsimilaritysearch AT giraudmathieu optimalneighborhoodindexingforproteinsimilaritysearch

Optimal neighborhood indexing for protein similarity search

Ejemplares similares