An objective criterion to evaluate sequence-similarity networks helps in dividing the protein family sequence space
The deluge of genomic data raises various challenges for computational protein annotation. The definition of superfamilies, based on conserved folds, or of families, showing more recent homology signatures, allows a first categorization of the sequence space. However, for precise functional annotatio...
Main authors: | Hornung, Bastian Volker Helmut, Terrapon, Nicolas |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10461819/ https://www.ncbi.nlm.nih.gov/pubmed/37585436 http://dx.doi.org/10.1371/journal.pcbi.1010881 |
_version_ | 1785097916319268864 |
---|---|
author | Hornung, Bastian Volker Helmut Terrapon, Nicolas |
author_facet | Hornung, Bastian Volker Helmut Terrapon, Nicolas |
author_sort | Hornung, Bastian Volker Helmut |
collection | PubMed |
description | The deluge of genomic data raises various challenges for computational protein annotation. The definition of superfamilies, based on conserved folds, or of families, showing more recent homology signatures, allows a first categorization of the sequence space. However, for precise functional annotation or the identification of the unexplored parts within a family, a division into subfamilies is essential. As curators of an expert database, the Carbohydrate Active Enzymes database (CAZy), we began, more than 15 years ago, to manually define subfamilies based on phylogeny reconstruction. However, facing the increasing amount of sequence and functional data, we required more scalable and reproducible methods. The recently popularized sequence similarity networks (SSNs) make it possible to cope with very large families and to compute many subfamily schemes. Still, the choice of the optimal SSN subfamily scheme has so far relied only on expert knowledge, without any data-driven guidance from within the network. In this study, we therefore decided to investigate several network properties to determine a criterion which can be used by curators to evaluate the quality of subfamily assignments. The performance of the closeness centrality criterion, a network property that indicates the connectedness within the network, shows high similarity to the decisions of expert curators from eight distinct protein families. Closeness centrality also suggests that in some cases multiple levels of subfamilies could be possible, depending on the granularity of the research question, while it also indicates when no subfamily emerged during the evolution of a family. We finally used closeness centrality to create subfamilies in four families of the CAZy database, providing a finer functional annotation and highlighting subfamilies without biochemically characterized members for potential future discoveries. |
format | Online Article Text |
id | pubmed-10461819 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-10461819 2023-08-29 An objective criterion to evaluate sequence-similarity networks helps in dividing the protein family sequence space Hornung, Bastian Volker Helmut Terrapon, Nicolas PLoS Comput Biol Research Article The deluge of genomic data raises various challenges for computational protein annotation. The definition of superfamilies, based on conserved folds, or of families, showing more recent homology signatures, allows a first categorization of the sequence space. However, for precise functional annotation or the identification of the unexplored parts within a family, a division into subfamilies is essential. As curators of an expert database, the Carbohydrate Active Enzymes database (CAZy), we began, more than 15 years ago, to manually define subfamilies based on phylogeny reconstruction. However, facing the increasing amount of sequence and functional data, we required more scalable and reproducible methods. The recently popularized sequence similarity networks (SSNs) make it possible to cope with very large families and to compute many subfamily schemes. Still, the choice of the optimal SSN subfamily scheme has so far relied only on expert knowledge, without any data-driven guidance from within the network. In this study, we therefore decided to investigate several network properties to determine a criterion which can be used by curators to evaluate the quality of subfamily assignments. The performance of the closeness centrality criterion, a network property that indicates the connectedness within the network, shows high similarity to the decisions of expert curators from eight distinct protein families. Closeness centrality also suggests that in some cases multiple levels of subfamilies could be possible, depending on the granularity of the research question, while it also indicates when no subfamily emerged during the evolution of a family. We finally used closeness centrality to create subfamilies in four families of the CAZy database, providing a finer functional annotation and highlighting subfamilies without biochemically characterized members for potential future discoveries. Public Library of Science 2023-08-16 /pmc/articles/PMC10461819/ /pubmed/37585436 http://dx.doi.org/10.1371/journal.pcbi.1010881 Text en © 2023 Hornung, Terrapon https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Hornung, Bastian Volker Helmut Terrapon, Nicolas An objective criterion to evaluate sequence-similarity networks helps in dividing the protein family sequence space |
title | An objective criterion to evaluate sequence-similarity networks helps in dividing the protein family sequence space |
title_sort | objective criterion to evaluate sequence-similarity networks helps in dividing the protein family sequence space |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10461819/ https://www.ncbi.nlm.nih.gov/pubmed/37585436 http://dx.doi.org/10.1371/journal.pcbi.1010881 |
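
The description above names closeness centrality as the network property used to judge subfamily schemes in a sequence similarity network. As a rough illustration only, the following Python sketch computes one common form of closeness centrality (reachable nodes divided by the sum of shortest-path lengths, evaluated within each node's connected component) on a toy network at two similarity thresholds. The protein names, scores, thresholds, and the helper functions `build_network`, `closeness_centrality`, and `mean_closeness` are all hypothetical, and the record does not state which normalization or aggregation the authors used, so this should not be read as their method.

```python
from collections import deque


def build_network(scores, threshold):
    """Build an undirected network, keeping an edge when score >= threshold.

    `scores` maps (protein_a, protein_b) pairs to a similarity score.
    Every protein is kept as a node even if it ends up with no edges.
    """
    adjacency = {}
    for (a, b), score in scores.items():
        adjacency.setdefault(a, set())
        adjacency.setdefault(b, set())
        if score >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def closeness_centrality(adjacency, node):
    """Closeness of `node` within its own connected component.

    Defined here as (number of reachable nodes) / (sum of shortest-path
    lengths to them); isolated nodes get 0.0. This is an assumed variant.
    """
    distances = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search gives unweighted shortest paths
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    total = sum(distances.values())
    reachable = len(distances) - 1
    return reachable / total if total > 0 else 0.0


def mean_closeness(adjacency):
    """Average closeness over all nodes (one possible summary statistic)."""
    values = [closeness_centrality(adjacency, n) for n in adjacency]
    return sum(values) / len(values)


# Hypothetical pairwise similarity scores for five proteins.
scores = {
    ("p1", "p2"): 90, ("p2", "p3"): 85, ("p1", "p3"): 80,
    ("p4", "p5"): 88, ("p3", "p4"): 40,
}

for threshold in (30, 60):
    network = build_network(scores, threshold)
    print(threshold, round(mean_closeness(network), 3))
```

In this toy case, raising the threshold removes the single weak edge between the two groups, so each remaining cluster becomes fully connected and the mean closeness rises. Comparing such a summary value across thresholds is one way a curator might look for a data-driven cut-off, under the assumptions stated above.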