An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics

Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open re...

Bibliographic Details
Main authors: Omasits, Ulrich, Varadarajan, Adithi R., Schmid, Michael, Goetze, Sandra, Melidis, Damianos, Bourqui, Marc, Nikolayeva, Olga, Québatte, Maxime, Patrignani, Andrea, Dehio, Christoph, Frey, Juerg E., Robinson, Mark D., Wollscheid, Bernd, Ahrens, Christian H.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5741054/
https://www.ncbi.nlm.nih.gov/pubmed/29141959
http://dx.doi.org/10.1101/gr.218255.116
_version_ 1783288136579678208
author Omasits, Ulrich
Varadarajan, Adithi R.
Schmid, Michael
Goetze, Sandra
Melidis, Damianos
Bourqui, Marc
Nikolayeva, Olga
Québatte, Maxime
Patrignani, Andrea
Dehio, Christoph
Frey, Juerg E.
Robinson, Mark D.
Wollscheid, Bernd
Ahrens, Christian H.
author_sort Omasits, Ulrich
collection PubMed
description Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open reading frames (sORFs), and overprediction of spurious ORFs represent serious limitations. Our strategy toward accurate and complete genome annotation consolidates CDSs from multiple reference annotation resources, ab initio gene prediction algorithms and in silico ORFs (a modified six-frame translation considering alternative start codons) in an integrated proteogenomics database (iPtgxDB) that covers the entire protein-coding potential of a prokaryotic genome. By extending the PeptideClassifier concept of unambiguous peptides for prokaryotes, close to 95% of the identifiable peptides imply one distinct protein, largely simplifying downstream analysis. Searching a comprehensive Bartonella henselae proteomics data set against such an iPtgxDB allowed us to unambiguously identify novel ORFs uniquely predicted by each resource, including lipoproteins, differentially expressed and membrane-localized proteins, novel start sites and wrongly annotated pseudogenes. Most novelties were confirmed by targeted, parallel reaction monitoring mass spectrometry, including unique ORFs and single amino acid variations (SAAVs) identified in a re-sequenced laboratory strain that are not present in its reference genome. We demonstrate the general applicability of our strategy for genomes with varying GC content and distinct taxonomic origin. We release iPtgxDBs for B. henselae, Bradyrhizobium diazoefficiens and Escherichia coli and the software to generate both proteogenomics search databases and integrated annotation files that can be viewed in a genome browser for any prokaryote.
format Online
Article
Text
id pubmed-5741054
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-5741054 2018-01-23 An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics Omasits, Ulrich Varadarajan, Adithi R. Schmid, Michael Goetze, Sandra Melidis, Damianos Bourqui, Marc Nikolayeva, Olga Québatte, Maxime Patrignani, Andrea Dehio, Christoph Frey, Juerg E. Robinson, Mark D. Wollscheid, Bernd Ahrens, Christian H. Genome Res Method Cold Spring Harbor Laboratory Press 2017-12 /pmc/articles/PMC5741054/ /pubmed/29141959 http://dx.doi.org/10.1101/gr.218255.116 Text en © 2017 Omasits et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by-nc/4.0/ This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
title An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics
title_sort integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics
topic Method
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5741054/
https://www.ncbi.nlm.nih.gov/pubmed/29141959
http://dx.doi.org/10.1101/gr.218255.116