
Automatic discovery of cross-family sequence features associated with protein function

BACKGROUND: Methods for predicting protein function directly from amino acid sequences are useful tools in the study of uncharacterised protein families and in comparative genomics. Until now, this problem has been approached using machine learning techniques that attempt to predict membership, or o...


Bibliographic Details
Main Authors:	Brameier, Markus; Haan, Josien; Krings, Andrea; MacCallum, Robert M
Format:	Text
Language:	English
Published:	BioMed Central 2006
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1395344/ https://www.ncbi.nlm.nih.gov/pubmed/16409628 http://dx.doi.org/10.1186/1471-2105-7-16

_version_	1782126951962509312
author	Brameier, Markus; Haan, Josien; Krings, Andrea; MacCallum, Robert M
collection	PubMed
description	BACKGROUND: Methods for predicting protein function directly from amino acid sequences are useful tools in the study of uncharacterised protein families and in comparative genomics. Until now, this problem has been approached using machine learning techniques that attempt to predict membership, or otherwise, to predefined functional categories or subcellular locations. A potential drawback of this approach is that the human-designated functional classes may not accurately reflect the underlying biology, and consequently important sequence-to-function relationships may be missed. RESULTS: We show that a self-supervised data mining approach is able to find relationships between sequence features and functional annotations. No preconceived ideas about functional categories are required, and the training data is simply a set of protein sequences and their UniProt/Swiss-Prot annotations. The main technical aspect of the approach is the co-evolution of amino acid-based regular expressions and keyword-based logical expressions with genetic programming. Our experiments on a strictly non-redundant set of eukaryotic proteins reveal that the strongest and most easily detected sequence-to-function relationships are concerned with targeting to various cellular compartments, which is an area already well studied both experimentally and computationally. Of more interest are a number of broad functional roles which can also be correlated with sequence features. These include inhibition, biosynthesis, transcription and defence against bacteria. Despite substantial overlaps between these functions and their corresponding cellular compartments, we find clear differences in the sequence motifs used to predict some of these functions. For example, the presence of polyglutamine repeats appears to be linked more strongly to the "transcription" function than to the general "nuclear" function/location. 
CONCLUSION: We have developed a novel and useful approach for knowledge discovery in annotated sequence data. The technique is able to identify functionally important sequence features and does not require expert knowledge. By viewing protein function from a sequence perspective, the approach is also suitable for discovering unexpected links between biological processes, such as the recently discovered role of ubiquitination in transcription.
format	Text
id	pubmed-1395344
institution	National Center for Biotechnology Information
language	English
publishDate	2006
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1395344 2006-04-21 Automatic discovery of cross-family sequence features associated with protein function Brameier, Markus; Haan, Josien; Krings, Andrea; MacCallum, Robert M BMC Bioinformatics Methodology Article BioMed Central 2006-01-12 /pmc/articles/PMC1395344/ /pubmed/16409628 http://dx.doi.org/10.1186/1471-2105-7-16 Text en Copyright © 2006 Brameier et al; licensee BioMed Central Ltd.
title	Automatic discovery of cross-family sequence features associated with protein function
topic	Methodology Article


Similar Items