Large Scale Identification and Categorization of Protein Sequences Using Structured Logistic Regression

BACKGROUND: Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. Current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences and SLR seems well-suited for th...

Bibliographic Details
Main Authors: Pedersen, Bjørn P.; Ifrim, Georgiana; Liboriussen, Poul; Axelsen, Kristian B.; Palmgren, Michael G.; Nissen, Poul; Wiuf, Carsten; Pedersen, Christian N. S.
Format: Online Article Text
Language: English
Published: Public Library of Science 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3896382/
https://www.ncbi.nlm.nih.gov/pubmed/24465495
http://dx.doi.org/10.1371/journal.pone.0085139
_version_ 1782300071877935104
author Pedersen, Bjørn P.
Ifrim, Georgiana
Liboriussen, Poul
Axelsen, Kristian B.
Palmgren, Michael G.
Nissen, Poul
Wiuf, Carsten
Pedersen, Christian N. S.
author_sort Pedersen, Bjørn P.
collection PubMed
description BACKGROUND: Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. Current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences and SLR seems well-suited for this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps transporting essential cations, was selected as a test-case that would generate important biological information as well as provide a proof-of-concept for the application of SLR to a large scale bioinformatics problem. RESULTS: Using SLR, we have built classifiers to identify and automatically categorize P-type ATPases into one of 11 pre-defined classes. The SLR-classifiers are compared to a Hidden Markov Model approach and shown to be highly accurate and scalable. Representing the bulk of currently known sequences, we analysed 9.3 million sequences in the UniProtKB and attempted to classify a large number of P-type ATPases. To examine the distribution of pumps on organisms, we also applied SLR to 1,123 complete genomes from the Entrez genome database. Finally, we analysed the predicted membrane topology of the identified P-type ATPases. CONCLUSIONS: Using the SLR-based classification tool we are able to run a large scale study of P-type ATPases. This study provides proof-of-concept for the application of SLR to a bioinformatics problem and the analysis of P-type ATPases pinpoints new and interesting targets for further biochemical characterization and structural analysis.
format Online
Article
Text
id pubmed-3896382
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3896382 2014-01-24 Large Scale Identification and Categorization of Protein Sequences Using Structured Logistic Regression Pedersen, Bjørn P. Ifrim, Georgiana Liboriussen, Poul Axelsen, Kristian B. Palmgren, Michael G. Nissen, Poul Wiuf, Carsten Pedersen, Christian N. S. PLoS One Research Article BACKGROUND: Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. Current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences and SLR seems well-suited for this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps transporting essential cations, was selected as a test-case that would generate important biological information as well as provide a proof-of-concept for the application of SLR to a large scale bioinformatics problem. RESULTS: Using SLR, we have built classifiers to identify and automatically categorize P-type ATPases into one of 11 pre-defined classes. The SLR-classifiers are compared to a Hidden Markov Model approach and shown to be highly accurate and scalable. Representing the bulk of currently known sequences, we analysed 9.3 million sequences in the UniProtKB and attempted to classify a large number of P-type ATPases. To examine the distribution of pumps on organisms, we also applied SLR to 1,123 complete genomes from the Entrez genome database. Finally, we analysed the predicted membrane topology of the identified P-type ATPases. CONCLUSIONS: Using the SLR-based classification tool we are able to run a large scale study of P-type ATPases. This study provides proof-of-concept for the application of SLR to a bioinformatics problem and the analysis of P-type ATPases pinpoints new and interesting targets for further biochemical characterization and structural analysis.
Public Library of Science 2014-01-20 /pmc/articles/PMC3896382/ /pubmed/24465495 http://dx.doi.org/10.1371/journal.pone.0085139 Text en © 2014 Pedersen et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title Large Scale Identification and Categorization of Protein Sequences Using Structured Logistic Regression
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3896382/
https://www.ncbi.nlm.nih.gov/pubmed/24465495
http://dx.doi.org/10.1371/journal.pone.0085139