
Performance of rotation forest ensemble classifier and feature extractor in predicting protein interactions using amino acid sequences

BACKGROUND: There are two significant problems associated with predicting protein-protein interactions using the sequences of amino acids. The first problem is representing each sequence as a feature vector, and the second is designing a model that can identify the protein interactions. Thus, effect...


Bibliographic Details
Main Authors: Bustamam, Alhadi; Musti, Mohamad I. S.; Hartomo, Susilo; Aprilia, Shirley; Tampubolon, Patuan P.; Lestari, Dian
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6929266/
https://www.ncbi.nlm.nih.gov/pubmed/31874636
http://dx.doi.org/10.1186/s12864-019-6304-y
_version_ 1783482664554070016
author Bustamam, Alhadi
Musti, Mohamad I. S.
Hartomo, Susilo
Aprilia, Shirley
Tampubolon, Patuan P.
Lestari, Dian
author_sort Bustamam, Alhadi
collection PubMed
description BACKGROUND: Predicting protein-protein interactions from amino acid sequences poses two significant problems. The first is representing each sequence as a feature vector, and the second is designing a model that can identify the protein interactions. Effective feature extraction methods can therefore lead to improved model performance. In this study, we used two feature extraction methods, global encoding and pseudo-substitution matrix representation (PseudoSMR), to represent the amino acid sequences of human proteins and Human Immunodeficiency Virus type 1 (HIV-1) proteins for the classification problem of predicting protein-protein interactions. We also compared principal component analysis (PCA) with independent principal component analysis (IPCA) as the transformation method within Rotation Forest.
RESULTS: The results show that global encoding and PseudoSMR, used as feature extraction methods, successfully represent the amino acid sequences for the Rotation Forest classifier with either PCA or IPCA. This can be seen from the evaluation metrics, which were >73% across the six different parameters. The accuracy of both methods was >74%. The other model performance criteria, such as sensitivity, specificity, precision, and F1-score, were all >73%. The data used in this study can be accessed using the following link: https://www.dsc.ui.ac.id/research/amino-acid-pred/.
CONCLUSIONS: Both global encoding and PseudoSMR can successfully represent amino acid sequences. Rotation Forest (PCA) performed better than Rotation Forest (IPCA) in predicting protein-protein interactions between HIV-1 and human proteins. Both the Rotation Forest (PCA) classifier and the Rotation Forest (IPCA) classifier performed better than other classifiers, such as Gradient Boosting, K-Nearest Neighbor, Logistic Regression, Random Forest, and Support Vector Machine (SVM). Rotation Forest (PCA) and Rotation Forest (IPCA) have accuracy, sensitivity, specificity, precision, and F1-score values >70%, while the other classifiers have values <70%.
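The evaluation criteria named in the description (accuracy, sensitivity, specificity, precision, and F1-score) can be illustrated with a minimal Python sketch. The function name `binary_metrics`, the `BinaryMetrics` container, and the example labels are hypothetical and are not taken from the study. The description refers to six evaluation parameters but names only these five, so the sketch covers only those.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    precision: float
    f1: float


def _ratio(numerator, denominator):
    # Return 0.0 instead of raising when a class is absent.
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true, y_pred):
    """Compute five metrics from 0/1 labels (1 = interacting pair)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    sensitivity = _ratio(tp, tp + fn)  # recall on the positive class
    precision = _ratio(tp, tp + fp)
    return BinaryMetrics(
        accuracy=_ratio(tp + tn, len(pairs)),
        sensitivity=sensitivity,
        specificity=_ratio(tn, tn + fp),
        precision=precision,
        f1=_ratio(2 * precision * sensitivity, precision + sensitivity),
    )


# Made-up labels for illustration only; these are not the study's data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With these made-up labels there are three true positives, three true negatives, one false positive, and one false negative, so all five values come out to 0.75.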
format Online
Article
Text
id pubmed-6929266
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6929266 2019-12-30 Performance of rotation forest ensemble classifier and feature extractor in predicting protein interactions using amino acid sequences Bustamam, Alhadi Musti, Mohamad I. S. Hartomo, Susilo Aprilia, Shirley Tampubolon, Patuan P. Lestari, Dian BMC Genomics Research BioMed Central 2019-12-24 /pmc/articles/PMC6929266/ /pubmed/31874636 http://dx.doi.org/10.1186/s12864-019-6304-y Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Performance of rotation forest ensemble classifier and feature extractor in predicting protein interactions using amino acid sequences
topic Research