Predicting the Subcellular Localization of Human Proteins Using Machine Learning and Exploratory Data Analysis

Identifying the subcellular localization of proteins is particularly helpful in the functional annotation of gene products. In this study, we use Machine Learning and Exploratory Data Analysis (EDA) techniques to examine and characterize amino acid sequences of human proteins localized in nine cellu...

Bibliographic Details
Main authors: Acquaah-Mensah, George K., Leach, Sonia M., Guda, Chittibabu
Format: Text
Language: English
Published: Elsevier 2006
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2709537/
https://www.ncbi.nlm.nih.gov/pubmed/16970551
http://dx.doi.org/10.1016/S1672-0229(06)60023-5
author Acquaah-Mensah, George K.
Leach, Sonia M.
Guda, Chittibabu
collection PubMed
description Identifying the subcellular localization of proteins is particularly helpful in the functional annotation of gene products. In this study, we use Machine Learning and Exploratory Data Analysis (EDA) techniques to examine and characterize amino acid sequences of human proteins localized in nine cellular compartments. A dataset of 3,749 protein sequences representing human proteins was extracted from the SWISS-PROT database. Feature vectors were created to capture specific amino acid sequence characteristics. Relative to a Support Vector Machine, a Multi-layer Perceptron, and a Naïve Bayes classifier, the C4.5 Decision Tree algorithm was the most consistent performer across all nine compartments in reliably predicting the subcellular localization of proteins based on their amino acid sequences (average Precision=0.88; average Sensitivity=0.86). Furthermore, EDA graphics characterized essential features of proteins in each compartment. As examples, proteins localized to the plasma membrane had higher proportions of hydrophobic amino acids; cytoplasmic proteins had higher proportions of neutral amino acids; and mitochondrial proteins had higher proportions of neutral amino acids and lower proportions of polar amino acids. These data showed that the C4.5 classifier and EDA tools can be effective for characterizing and predicting the subcellular localization of human proteins based on their amino acid sequences.
format Text
id pubmed-2709537
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-2709537 2009-07-13 Genomics Proteomics Bioinformatics Article Elsevier 2006 2006-08-22 /pmc/articles/PMC2709537/ /pubmed/16970551 http://dx.doi.org/10.1016/S1672-0229(06)60023-5 Text en © 2006 Beijing Institute of Genomics. This is an open access article under the CC BY-NC-SA license (http://creativecommons.org/licenses/by-nc-sa/3.0/).
title Predicting the Subcellular Localization of Human Proteins Using Machine Learning and Exploratory Data Analysis
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2709537/
https://www.ncbi.nlm.nih.gov/pubmed/16970551
http://dx.doi.org/10.1016/S1672-0229(06)60023-5