
A Support Vector Machine based method to distinguish proteobacterial proteins from eukaryotic plant proteins

BACKGROUND: Members of the phylum Proteobacteria are most prominent among bacteria causing plant diseases that result in a diminution of the quantity and quality of food produced by agriculture. To ameliorate these losses, there is a need to identify infections in early stages. Recent developments i...


Bibliographic Details
Main Authors: Verma, Ruchi, Melcher, Ulrich
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3439722/
https://www.ncbi.nlm.nih.gov/pubmed/23046503
http://dx.doi.org/10.1186/1471-2105-13-S15-S9
_version_ 1782243054226243584
author Verma, Ruchi
Melcher, Ulrich
author_facet Verma, Ruchi
Melcher, Ulrich
author_sort Verma, Ruchi
collection PubMed
description BACKGROUND: Members of the phylum Proteobacteria are most prominent among bacteria causing plant diseases that result in a diminution of the quantity and quality of food produced by agriculture. To ameliorate these losses, there is a need to identify infections in early stages. Recent developments in next generation nucleic acid sequencing and mass spectrometry open the door to screening plants by the sequences of their macromolecules. Such an approach requires the ability to recognize the organismal origin of unknown DNA or peptide fragments. There are many ways to approach this problem, but none has emerged as the best protocol. Here we attempt a systematic way to determine the organismal origins of peptides by using a machine learning algorithm. The algorithm that we implement is a Support Vector Machine (SVM). RESULT: The amino acid compositions of proteobacterial proteins were found to be different from those of plant proteins. We developed an SVM model based on amino acid and dipeptide compositions to distinguish between a proteobacterial protein and a plant protein. The amino acid composition (AAC) based SVM model had an accuracy of 92.44% with a Matthews correlation coefficient (MCC) of 0.85, while the dipeptide composition (DC) based SVM model had a maximum accuracy of 94.67% and an MCC of 0.89. We also developed SVM models based on a hybrid approach (AAC and DC), which gave a maximum accuracy of 94.86% and an MCC of 0.90. The models were tested on unseen datasets that were not used in training to assess their validity. CONCLUSION: The results indicate that the SVM based on the AAC and DC hybrid approach can be used to distinguish proteobacterial from plant protein sequences.
format Online
Article
Text
id pubmed-3439722
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3439722 2012-09-17 A Support Vector Machine based method to distinguish proteobacterial proteins from eukaryotic plant proteins Verma, Ruchi Melcher, Ulrich BMC Bioinformatics Proceedings
BioMed Central 2012-09-11 /pmc/articles/PMC3439722/ /pubmed/23046503 http://dx.doi.org/10.1186/1471-2105-13-S15-S9 Text en Copyright ©2012 Verma and Melcher; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Proceedings
Verma, Ruchi
Melcher, Ulrich
A Support Vector Machine based method to distinguish proteobacterial proteins from eukaryotic plant proteins
title A Support Vector Machine based method to distinguish proteobacterial proteins from eukaryotic plant proteins
title_full A Support Vector Machine based method to distinguish proteobacterial proteins from eukaryotic plant proteins
title_fullStr A Support Vector Machine based method to distinguish proteobacterial proteins from eukaryotic plant proteins
title_full_unstemmed A Support Vector Machine based method to distinguish proteobacterial proteins from eukaryotic plant proteins
title_short A Support Vector Machine based method to distinguish proteobacterial proteins from eukaryotic plant proteins
title_sort support vector machine based method to distinguish proteobacterial proteins from eukaryotic plant proteins
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3439722/
https://www.ncbi.nlm.nih.gov/pubmed/23046503
http://dx.doi.org/10.1186/1471-2105-13-S15-S9
work_keys_str_mv AT vermaruchi asupportvectormachinebasedmethodtodistinguishproteobacterialproteinsfromeukaryoticplantproteins
AT melcherulrich asupportvectormachinebasedmethodtodistinguishproteobacterialproteinsfromeukaryoticplantproteins
AT vermaruchi supportvectormachinebasedmethodtodistinguishproteobacterialproteinsfromeukaryoticplantproteins
AT melcherulrich supportvectormachinebasedmethodtodistinguishproteobacterialproteinsfromeukaryoticplantproteins
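
The abstract in this record names three feature sets (AAC, DC, and a hybrid of the two) and reports accuracy together with the Matthews correlation coefficient. As a rough illustration only, the Python sketch below shows one common way such features and the MCC can be computed. The function names, the percentage normalisation, and the handling of non-standard residues are assumptions made for this example; the record does not state how the authors computed their features, nor which SVM software or kernel they used.

```python
from collections import Counter
from itertools import product
from math import sqrt

# The 20 standard amino acids (one-letter codes) and all 400 ordered pairs.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Return a 20-element vector: percentage of each standard residue.

    Assumption: non-standard characters (e.g. X, B, Z) are ignored.
    """
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(residues)
    return [100.0 * counts[a] / len(residues) for a in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return a 400-element vector: percentage of each adjacent residue pair.

    Assumption: pairs containing a non-standard character are skipped.
    """
    seq = sequence.upper()
    valid = set(DIPEPTIDES)
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if p in valid]
    if not pairs:
        return [0.0] * len(DIPEPTIDES)
    counts = Counter(pairs)
    return [100.0 * counts[d] / len(pairs) for d in DIPEPTIDES]


def hybrid_features(sequence):
    """Concatenate AAC and DC into a single 420-element vector."""
    return amino_acid_composition(sequence) + dipeptide_composition(sequence)


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # conventional value when a row or column is empty
    return (tp * tn - fp * fn) / denominator
```

Vectors produced this way could be passed to any SVM implementation as input features, with one class label for proteobacterial proteins and another for plant proteins. The figures quoted in the abstract come from the authors' own models and datasets and are not reproduced by this sketch.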