
Whole Proteome Clustering of 2,307 Proteobacterial Genomes Reveals Conserved Proteins and Significant Annotation Issues

Bibliographic Details
Main authors:	Lockwood, Svetlana; Brayton, Kelly A.; Daily, Jeff A.; Broschat, Shira L.
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2019
Subjects:	Microbiology
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6403173/ https://www.ncbi.nlm.nih.gov/pubmed/30873148 http://dx.doi.org/10.3389/fmicb.2019.00383

author	Lockwood, Svetlana; Brayton, Kelly A.; Daily, Jeff A.; Broschat, Shira L.
collection	PubMed
description	We clustered 8.76 M protein sequences deduced from 2,307 completely sequenced Proteobacterial genomes, resulting in 707,311 clusters of one or more sequences, of which 224,442 ranged in size from 2 to 2,894 sequences. To our knowledge, this is the first study of this scale. We were surprised to find that no single cluster contained a representative sequence from all the organisms in the study. Given the minimal genome concept, we expected to find a shared set of proteins. To determine why the clusters did not have universal representation, we chose four essential proteins, the chaperonin GroEL, DNA-dependent RNA polymerase subunits beta and beta′ (RpoB/RpoB′), and DNA polymerase I (PolA), representing fundamental cellular functions, and examined their cluster distribution. We found these proteins to be remarkably conserved, with certain caveats. Although the groEL gene was universally conserved in all the organisms in the study, the protein was not represented in all the deduced proteomes. The genes for RpoB and RpoB′ were missing from two genomes and merged in 88, and the sequences were sufficiently divergent that they formed separate clusters for 18 RpoB proteins (seven clusters) and 14 RpoB′ proteins (three clusters). For PolA, 52 organisms lacked an identifiable sequence, and seven sequences were sufficiently divergent that they formed five separate clusters. Interestingly, organisms lacking an identifiable PolA and those with divergent RpoB/RpoB′ were predominantly endosymbionts. Furthermore, we present a range of examples of annotation issues that caused the deduced proteins to be incorrectly represented in the proteome. These annotation issues made our task of determining protein conservation more difficult than expected and also represent a significant obstacle for high-throughput analyses.
format	Online Article Text
id	pubmed-6403173
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-6403173 2019-03-14 Front Microbiol Microbiology Frontiers Media S.A. 2019-02-28 /pmc/articles/PMC6403173/ /pubmed/30873148 http://dx.doi.org/10.3389/fmicb.2019.00383 Text en Copyright © 2019 Lockwood, Brayton, Daily and Broschat. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Whole Proteome Clustering of 2,307 Proteobacterial Genomes Reveals Conserved Proteins and Significant Annotation Issues
topic	Microbiology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6403173/ https://www.ncbi.nlm.nih.gov/pubmed/30873148 http://dx.doi.org/10.3389/fmicb.2019.00383

Similar Items