EnsembleFam: towards more accurate protein family prediction in the twilight zone
BACKGROUND: Current protein family modeling methods like profile Hidden Markov Model (pHMM), k-mer based methods, and deep learning-based methods do not provide very accurate protein function prediction for proteins in the twilight zone, due to low sequence similarity to reference proteins with know...
Main Authors: | Kabir, Mohammad Neamul; Wong, Limsoon |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8919565/ https://www.ncbi.nlm.nih.gov/pubmed/35287576 http://dx.doi.org/10.1186/s12859-022-04626-w |
_version_ | 1784668959757303808 |
---|---|
author | Kabir, Mohammad Neamul; Wong, Limsoon |
collection | PubMed |
description | BACKGROUND: Current protein family modeling methods, such as profile Hidden Markov Models (pHMMs), k-mer based methods, and deep learning-based methods, do not predict protein function very accurately for proteins in the twilight zone, because these proteins have low sequence similarity to reference proteins with known functions. RESULTS: We present a novel method, EnsembleFam, aimed at better function prediction for proteins in the twilight zone. EnsembleFam extracts the core characteristics of a protein family using similarity and dissimilarity features calculated from sequence homology relations. EnsembleFam trains three separate Support Vector Machine (SVM) classifiers for each family using these features, and an ensemble prediction is made to classify novel proteins into these families. Extensive experiments are conducted using the Clusters of Orthologous Groups (COG) dataset and the G Protein-Coupled Receptor (GPCR) dataset. EnsembleFam not only outperforms state-of-the-art methods on the overall dataset but also provides much more accurate predictions for twilight zone proteins. CONCLUSIONS: EnsembleFam, a machine learning method to model protein families, can be used to better identify members with very low sequence homology. Using EnsembleFam, protein functions can be predicted from sequence information alone with better accuracy than state-of-the-art methods. |
format | Online Article Text |
id | pubmed-8919565 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-8919565 2022-03-16 EnsembleFam: towards more accurate protein family prediction in the twilight zone Kabir, Mohammad Neamul; Wong, Limsoon BMC Bioinformatics Research BioMed Central 2022-03-14 /pmc/articles/PMC8919565/ /pubmed/35287576 http://dx.doi.org/10.1186/s12859-022-04626-w Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
title | EnsembleFam: towards more accurate protein family prediction in the twilight zone |
topic | Research |