
Multiple Profile Models Extract Features from Protein Sequence Data and Resolve Functional Diversity of Very Different Protein Families

Functional classification of proteins from sequences alone has become a critical bottleneck in understanding the myriad of protein sequences that accumulate in our databases. The great diversity of homologous sequences hides, in many cases, a variety of functional activities that cannot be anticipat...


Bibliographic Details
Main authors: Vicedomini, R., Bouly, J.P., Laine, E., Falciatore, A., Carbone, A.
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9016551/
https://www.ncbi.nlm.nih.gov/pubmed/35353898
http://dx.doi.org/10.1093/molbev/msac070
author Vicedomini, R.
Bouly, J.P.
Laine, E.
Falciatore, A.
Carbone, A.
collection PubMed
description Functional classification of proteins from sequences alone has become a critical bottleneck in understanding the myriad of protein sequences that accumulate in our databases. The great diversity of homologous sequences hides, in many cases, a variety of functional activities that cannot be anticipated. Their identification appears critical for a fundamental understanding of the evolution of living organisms and for biotechnological applications. ProfileView is a sequence-based computational method, designed to functionally classify sets of homologous sequences. It relies on two main ideas: the use of multiple profile models whose construction explores evolutionary information in available databases, and a novel definition of a representation space in which to analyze sequences with multiple profile models combined together. ProfileView classifies protein families by enriching known functional groups with new sequences and discovering new groups and subgroups. We validate ProfileView on seven classes of widespread proteins involved in the interaction with nucleic acids, amino acids and small molecules, and in a large variety of functions and enzymatic reactions. ProfileView agrees with the large set of functional data collected for these proteins from the literature regarding the organization into functional subgroups and residues that characterize the functions. In addition, ProfileView resolves undefined functional classifications and extracts the molecular determinants underlying protein functional diversity, showing its potential to select sequences towards accurate experimental design and discovery of novel biological functions. On protein families with complex domain architecture, ProfileView functional classification reconciles domain combinations, unlike phylogenetic reconstruction. 
ProfileView outperforms the functional classification approach PANTHER, the two k-mer-based methods CUPP and eCAMI, and a neural network approach based on Restricted Boltzmann Machines. It also overcomes the time-complexity limitations of the latter.
format Online
Article
Text
id pubmed-9016551
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9016551 2022-04-20 Mol Biol Evol Methods Oxford University Press 2022-03-30 /pmc/articles/PMC9016551/ /pubmed/35353898 http://dx.doi.org/10.1093/molbev/msac070 Text en © The Author(s) 2022. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Multiple Profile Models Extract Features from Protein Sequence Data and Resolve Functional Diversity of Very Different Protein Families
topic Methods