
Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning

BACKGROUND: Plastids are an important component of plant cells, being the site of manufacture and storage of chemical compounds used by the cell, and contain pigments such as those used in photosynthesis, starch synthesis/storage, cell color etc. They are essential organelles of the plant cell, also...


Bibliographic Details
Main authors: Kaundal, Rakesh, Sahu, Sitanshu S, Verma, Ruchi, Weirick, Tyler
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3851450/
https://www.ncbi.nlm.nih.gov/pubmed/24266945
http://dx.doi.org/10.1186/1471-2105-14-S14-S7
_version_ 1782294285818789888
author Kaundal, Rakesh
Sahu, Sitanshu S
Verma, Ruchi
Weirick, Tyler
author_facet Kaundal, Rakesh
Sahu, Sitanshu S
Verma, Ruchi
Weirick, Tyler
author_sort Kaundal, Rakesh
collection PubMed
description BACKGROUND: Plastids are an important component of plant cells, being the site of manufacture and storage of chemical compounds used by the cell, and contain pigments such as those used in photosynthesis, starch synthesis/storage, cell color etc. They are essential organelles of the plant cell, also present in algae. Recent advances in genomic technology and sequencing efforts are generating a huge amount of DNA sequence data every day. The predicted proteomes of these genomes need to be annotated at a faster pace. One such annotation need is an automated system that can accurately distinguish between plastid and non-plastid proteins and further classify plastid types based on their functionality. We compared the amino acid compositions of plastid proteins with those of non-plastid ones and found significant differences, which were used as a basis to develop various feature-based prediction models using similarity search and machine learning. RESULTS: In this study, we developed separate Support Vector Machine (SVM) trained classifiers for characterizing the plastids in two steps: first distinguishing plastid from non-plastid proteins, and then classifying the identified plastid proteins into their various types based on their function (chloroplast, chromoplast, etioplast, and amyloplast). Five diverse protein features (amino acid composition, dipeptide composition, pseudo amino acid composition, N(terminal)-Center-C(terminal) composition, and protein physicochemical properties) are used to develop the SVM models. Overall, the dipeptide composition-based module shows the best performance, with an accuracy of 86.80% and a Matthews Correlation Coefficient (MCC) of 0.74 in phase-I, and an accuracy of 78.60% with an MCC of 0.44 in phase-II. On independent test data, this model also performs better, with an overall accuracy of 76.58% and 74.97% in phase-I and phase-II, respectively. 
The similarity-based PSI-BLAST module shows very low performance, with about 50% prediction accuracy for distinguishing plastid vs. non-plastid proteins and only 20% for classifying the various plastid types, indicating the need for and importance of machine learning algorithms. CONCLUSION: The current work is a first attempt to develop a methodology for classifying various plastid-type proteins. The prediction modules have also been made available as a web tool, PLpred, at http://bioinfo.okstate.edu/PLpred/ for real-time identification/characterization. We believe this tool will be very useful in the functional annotation of various genomes.
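The description above names amino acid composition and dipeptide composition as SVM input features and reports results as accuracy and MCC. The following is a minimal sketch of how such features and the MCC are commonly computed; it is not the authors' code. The function names, the 20-letter alphabet ordering, the choice to express compositions as percentages, and the handling of non-standard residues (skipped) are all assumptions made for illustration.

```python
from itertools import product
from math import sqrt

# Assumption: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs: "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Return a 20-dimensional vector of residue percentages (hypothetical helper)."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * residues.count(a) / len(residues) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return a 400-dimensional vector of adjacent-pair percentages (hypothetical helper)."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        # Skip pairs containing a non-standard residue such as X or B.
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [100.0 * counts[d] / total for d in DIPEPTIDES]


def matthews_corrcoef(tp, tn, fp, fn):
    """Binary MCC from confusion-matrix counts; returns 0.0 when undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

Each protein sequence would be mapped to one such fixed-length vector before being passed to a classifier; how the original study scaled or combined the features is not stated in this record.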
format Online
Article
Text
id pubmed-3851450
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3851450 2013-12-20 Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning Kaundal, Rakesh Sahu, Sitanshu S Verma, Ruchi Weirick, Tyler BMC Bioinformatics Proceedings BACKGROUND: Plastids are an important component of plant cells, being the site of manufacture and storage of chemical compounds used by the cell, and contain pigments such as those used in photosynthesis, starch synthesis/storage, cell color etc. They are essential organelles of the plant cell, also present in algae. Recent advances in genomic technology and sequencing efforts are generating a huge amount of DNA sequence data every day. The predicted proteomes of these genomes need to be annotated at a faster pace. One such annotation need is an automated system that can accurately distinguish between plastid and non-plastid proteins and further classify plastid types based on their functionality. We compared the amino acid compositions of plastid proteins with those of non-plastid ones and found significant differences, which were used as a basis to develop various feature-based prediction models using similarity search and machine learning. RESULTS: In this study, we developed separate Support Vector Machine (SVM) trained classifiers for characterizing the plastids in two steps: first distinguishing plastid from non-plastid proteins, and then classifying the identified plastid proteins into their various types based on their function (chloroplast, chromoplast, etioplast, and amyloplast). Five diverse protein features (amino acid composition, dipeptide composition, pseudo amino acid composition, N(terminal)-Center-C(terminal) composition, and protein physicochemical properties) are used to develop the SVM models. 
Overall, the dipeptide composition-based module shows the best performance, with an accuracy of 86.80% and a Matthews Correlation Coefficient (MCC) of 0.74 in phase-I, and an accuracy of 78.60% with an MCC of 0.44 in phase-II. On independent test data, this model also performs better, with an overall accuracy of 76.58% and 74.97% in phase-I and phase-II, respectively. The similarity-based PSI-BLAST module shows very low performance, with about 50% prediction accuracy for distinguishing plastid vs. non-plastid proteins and only 20% for classifying the various plastid types, indicating the need for and importance of machine learning algorithms. CONCLUSION: The current work is a first attempt to develop a methodology for classifying various plastid-type proteins. The prediction modules have also been made available as a web tool, PLpred, at http://bioinfo.okstate.edu/PLpred/ for real-time identification/characterization. We believe this tool will be very useful in the functional annotation of various genomes. BioMed Central 2013-10-09 /pmc/articles/PMC3851450/ /pubmed/24266945 http://dx.doi.org/10.1186/1471-2105-14-S14-S7 Text en Copyright © 2013 Kaundal et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Proceedings
Kaundal, Rakesh
Sahu, Sitanshu S
Verma, Ruchi
Weirick, Tyler
Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning
title Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning
title_full Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning
title_fullStr Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning
title_full_unstemmed Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning
title_short Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning
title_sort identification and characterization of plastid-type proteins from sequence-attributed features using machine learning
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3851450/
https://www.ncbi.nlm.nih.gov/pubmed/24266945
http://dx.doi.org/10.1186/1471-2105-14-S14-S7
work_keys_str_mv AT kaundalrakesh identificationandcharacterizationofplastidtypeproteinsfromsequenceattributedfeaturesusingmachinelearning
AT sahusitanshus identificationandcharacterizationofplastidtypeproteinsfromsequenceattributedfeaturesusingmachinelearning
AT vermaruchi identificationandcharacterizationofplastidtypeproteinsfromsequenceattributedfeaturesusingmachinelearning
AT weiricktyler identificationandcharacterizationofplastidtypeproteinsfromsequenceattributedfeaturesusingmachinelearning