A correlation coefficient-based feature selection approach for virus-host protein-protein interaction prediction

Prediction of virus-host protein-protein interactions (PPI) is a broad research area where various machine-learning-based classifiers are developed. Transforming biological data into machine-usable features is a preliminary step in constructing these virus-host PPI prediction tools. In this study, w...

Bibliographic Details
Main authors: Ibrahim, Ahmed Hassan; Karabulut, Onur Can; Karpuzcu, Betül Asiye; Türk, Erdem; Süzek, Barış Ethem
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10153705/
https://www.ncbi.nlm.nih.gov/pubmed/37130110
http://dx.doi.org/10.1371/journal.pone.0285168
author Ibrahim, Ahmed Hassan
Karabulut, Onur Can
Karpuzcu, Betül Asiye
Türk, Erdem
Süzek, Barış Ethem
collection PubMed
description Prediction of virus-host protein-protein interactions (PPI) is a broad research area where various machine-learning-based classifiers are developed. Transforming biological data into machine-usable features is a preliminary step in constructing these virus-host PPI prediction tools. In this study, we have adopted a virus-host PPI dataset and a reduced amino acid alphabet to create tripeptide features and introduced a correlation coefficient-based feature selection. We applied feature selection across several correlation coefficient metrics and statistically tested their relevance in a structural context. We compared the performance of feature-selection models against that of the baseline virus-host PPI prediction models created using different classification algorithms without feature selection. We also tested the performance of these baseline models against previously available tools to ensure their predictive power is acceptable. Here, the Pearson coefficient provides the best performance relative to the baseline model as measured by AUPR: for random forest, AUPR drops by only 0.003 while the number of tripeptide features is reduced by 73.3% (from 686 to 183). The results suggest that our correlation coefficient-based feature selection approach, while decreasing computation time and space complexity, has a limited impact on the prediction performance of virus-host PPI prediction tools.
format Online
Article
Text
id pubmed-10153705
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10153705 2023-05-03 PLoS One Research Article Public Library of Science 2023-05-02 /pmc/articles/PMC10153705/ /pubmed/37130110 http://dx.doi.org/10.1371/journal.pone.0285168 Text en © 2023 Ibrahim et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title A correlation coefficient-based feature selection approach for virus-host protein-protein interaction prediction
topic Research Article
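
The description field outlines a pipeline: encode each protein as tripeptide features over a reduced amino acid alphabet, then filter the features with a correlation coefficient. The sketch below illustrates that idea under stated assumptions and is not the authors' implementation. The 7-class grouping in `GROUPS` is a hypothetical choice borrowed from conjoint-triad style encodings (it happens to give 2 × 7³ = 686 features for a virus-host pair, matching the count in the description), and `select_features` is a generic filter that ranks features by absolute Pearson correlation with the interaction label; the record does not say which alphabet or ranking rule the paper used. All function names are illustrative.

```python
from itertools import product
from math import sqrt

# ASSUMPTION: a 7-class reduced alphabet (conjoint-triad style grouping).
# The catalog record does not state which alphabet the authors used.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: str(i) for i, grp in enumerate(GROUPS) for aa in grp}
TRIPEPTIDES = ["".join(t) for t in product("0123456", repeat=3)]  # 7**3 = 343


def tripeptide_frequencies(seq):
    """Return 343 normalized tripeptide counts over the reduced alphabet."""
    reduced = [AA_TO_CLASS.get(aa) for aa in seq.upper()]
    counts = dict.fromkeys(TRIPEPTIDES, 0)
    total = 0
    for i in range(len(reduced) - 2):
        window = reduced[i:i + 3]
        if None in window:  # skip windows containing non-standard residues
            continue
        counts["".join(window)] += 1
        total += 1
    return [counts[t] / total if total else 0.0 for t in TRIPEPTIDES]


def pair_features(virus_seq, host_seq):
    """Concatenate virus and host vectors: 2 * 343 = 686 features."""
    return tripeptide_frequencies(virus_seq) + tripeptide_frequencies(host_seq)


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def select_features(X, y, k):
    """Indices of the k features with the largest |Pearson r| against y."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy usage with made-up sequences and labels (not data from the paper).
pairs = [("MKRRK", "AGVAG", 1), ("DEDEC", "ILFPI", 0),
         ("KRKRA", "GVAGV", 1), ("EDCDE", "FPILF", 0)]
X = [pair_features(v, h) for v, h, _ in pairs]
y = [label for _, _, label in pairs]
kept = select_features(X, y, k=183)
print(len(X[0]), len(kept))  # prints "686 183"
```

Here `k=183` simply mirrors the reduced feature count quoted in the description, where (686 − 183) / 686 ≈ 73.3%. A classifier such as a random forest would then be trained on the kept columns only and compared against the full 686-feature baseline by AUPR.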