
Discrete profile comparison using information bottleneck

Sequence homologs are an important source of information about proteins. Amino acid profiles, representing the position-specific mutation probabilities found in profiles, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are an order...


Bibliographic Details
Main Authors:	O'Rourke, Sean; Chechik, Gal; Friedman, Robin; Eskin, Eleazar
Format:	Text
Language:	English
Published:	BioMed Central 2006
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1810319/ https://www.ncbi.nlm.nih.gov/pubmed/16723011 http://dx.doi.org/10.1186/1471-2105-7-S1-S8

author	O'Rourke, Sean; Chechik, Gal; Friedman, Robin; Eskin, Eleazar
collection	PubMed
description	Sequence homologs are an important source of information about proteins. Amino acid profiles, representing the position-specific mutation probabilities found in profiles, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are an order of magnitude slower than sequence comparisons, making profiles impractical for large datasets. Also, because they are such a rich representation, profiles are difficult to visualize. To address these problems, we describe a method to map probabilistic profiles to a discrete alphabet while preserving most of the information in the profiles. We find an informationally optimal discretization using the Information Bottleneck approach (IB). We observe that an 80-character IB alphabet captures nearly 90% of the amino acid occurrence information found in profiles, compared to the consensus sequence's 78%. Distant homolog search with IB sequences is 88% as sensitive as with profiles compared to 61% with consensus sequences (AUC scores 0.73, 0.83, and 0.51, respectively), but like simple sequence comparison, is 30 times faster. Discrete IB encoding can therefore expand the range of sequence problems to which profile information can be applied to include batch queries over large databases like SwissProt, which were previously computationally infeasible.
format	Text
id	pubmed-1810319
institution	National Center for Biotechnology Information
language	English
publishDate	2006
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1810319 2007-03-14 Discrete profile comparison using information bottleneck O'Rourke, Sean; Chechik, Gal; Friedman, Robin; Eskin, Eleazar BMC Bioinformatics Proceedings BioMed Central 2006-03-20 /pmc/articles/PMC1810319/ /pubmed/16723011 http://dx.doi.org/10.1186/1471-2105-7-S1-S8 Text en
title	Discrete profile comparison using information bottleneck
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1810319/ https://www.ncbi.nlm.nih.gov/pubmed/16723011 http://dx.doi.org/10.1186/1471-2105-7-S1-S8
