
A Fast Alignment-Free Approach for De Novo Detection of Protein Conserved Regions

BACKGROUND: Identifying conserved regions in protein sequences is a fundamental operation, occurring in numerous sequence-driven analysis pipelines. It is used as a way to decode domain-rich regions within proteins, to compute protein clusters, to annotate sequence function, and to compute evolution...


Bibliographic Details
Main Authors: Abnousi, Armen, Broschat, Shira L., Kalyanaraman, Ananth
Format: Online Article Text
Language: English
Published: Public Library of Science 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4995020/
https://www.ncbi.nlm.nih.gov/pubmed/27552220
http://dx.doi.org/10.1371/journal.pone.0161338
author Abnousi, Armen
Broschat, Shira L.
Kalyanaraman, Ananth
collection PubMed
description BACKGROUND: Identifying conserved regions in protein sequences is a fundamental operation, occurring in numerous sequence-driven analysis pipelines. It is used as a way to decode domain-rich regions within proteins, to compute protein clusters, to annotate sequence function, and to compute evolutionary relationships among protein sequences. A number of approaches exist for identifying and characterizing protein families based on their domains, and because domains represent conserved portions of a protein sequence, the primary computation involved in protein family characterization is identification of such conserved regions. However, identifying conserved regions from large collections (millions) of protein sequences presents significant challenges. METHODS: In this paper we present a new, alignment-free method for detecting conserved regions in protein sequences called NADDA (No-Alignment Domain Detection Algorithm). Our method exploits the abundance of exact matching short subsequences (k-mers) to quickly detect conserved regions, and uses machine learning to improve the prediction accuracy of detection. We present a parallel implementation of NADDA using the MapReduce framework and show that our method is highly scalable. RESULTS: We have compared NADDA with the Pfam and InterPro databases. For known domains annotated by Pfam, accuracy is 83%, sensitivity 96%, and specificity 44%. For sequences with new domains not present in the training set, an average accuracy of 63% is achieved when compared to Pfam. A boost in results in comparison with InterPro demonstrates the ability of NADDA to capture conserved regions beyond those present in Pfam. We have also compared NADDA with ADDA and MKDOM2, assuming Pfam as ground truth. On average, NADDA shows comparable accuracy and more balanced sensitivity and specificity and, being alignment-free, is significantly faster. Excluding the one-time cost of training, runtimes on a single processor were 49 s, 10,566 s, and 456 s for NADDA, ADDA, and MKDOM2, respectively, for a data set of approximately 2,500 sequences.
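The k-mer idea in the METHODS summary can be illustrated with a minimal Python sketch. It is not the authors' NADDA implementation: the function names, the per-residue "profile" definition, and the thresholds `min_count` and `min_len` are all hypothetical choices made for illustration, and the machine-learning step and the MapReduce parallelization mentioned in the abstract are omitted. The sketch only shows how exact-match k-mer abundance across a collection can flag a shared stretch without any alignment.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count, for each k-mer, the number of sequences that contain it."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each k-mer is counted once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def position_profile(seq, counts, k):
    """Per residue: the highest count among the k-mers covering it."""
    profile = [0] * len(seq)
    for i in range(len(seq) - k + 1):
        c = counts[seq[i:i + k]]
        for j in range(i, i + k):
            profile[j] = max(profile[j], c)
    return profile


def conserved_regions(profile, min_count, min_len):
    """Half-open (start, end) runs with profile >= min_count, length >= min_len.

    Both thresholds are hypothetical illustration parameters.
    """
    regions = []
    start = None
    # The sentinel value is below the threshold, so it closes a trailing run.
    for i, value in enumerate(profile + [min_count - 1]):
        if value >= min_count and start is None:
            start = i
        elif value < min_count and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions


# Toy sequences (made up) sharing the stretch "GHKLM".
sequences = ["AAAGHKLMWWW", "CCCGHKLMYYY", "DDDGHKLMEEE"]
counts = kmer_counts(sequences, k=3)
profile = position_profile(sequences[0], counts, k=3)
regions = conserved_regions(profile, min_count=3, min_len=4)
```

On these toy inputs the only flagged region of the first sequence is the shared stretch `GHKLM`. In a real setting the thresholds would have to be tuned or, as the abstract describes, replaced by a learned predictor.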
format Online
Article
Text
id pubmed-4995020
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-4995020 2016-09-12 A Fast Alignment-Free Approach for De Novo Detection of Protein Conserved Regions Abnousi, Armen Broschat, Shira L. Kalyanaraman, Ananth PLoS One Research Article Public Library of Science 2016-08-23 /pmc/articles/PMC4995020/ /pubmed/27552220 http://dx.doi.org/10.1371/journal.pone.0161338 Text en © 2016 Abnousi et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title A Fast Alignment-Free Approach for De Novo Detection of Protein Conserved Regions
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4995020/
https://www.ncbi.nlm.nih.gov/pubmed/27552220
http://dx.doi.org/10.1371/journal.pone.0161338