An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics
Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open re...
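Main authors: | Omasits, Ulrich; Varadarajan, Adithi R.; Schmid, Michael; Goetze, Sandra; Melidis, Damianos; Bourqui, Marc; Nikolayeva, Olga; Québatte, Maxime; Patrignani, Andrea; Dehio, Christoph; Frey, Juerg E.; Robinson, Mark D.; Wollscheid, Bernd; Ahrens, Christian H. |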
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory Press, 2017 |
Subjects: | Method |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5741054/ https://www.ncbi.nlm.nih.gov/pubmed/29141959 http://dx.doi.org/10.1101/gr.218255.116 |
_version_ | 1783288136579678208 |
---|---|
author | Omasits, Ulrich Varadarajan, Adithi R. Schmid, Michael Goetze, Sandra Melidis, Damianos Bourqui, Marc Nikolayeva, Olga Québatte, Maxime Patrignani, Andrea Dehio, Christoph Frey, Juerg E. Robinson, Mark D. Wollscheid, Bernd Ahrens, Christian H. |
author_facet | Omasits, Ulrich Varadarajan, Adithi R. Schmid, Michael Goetze, Sandra Melidis, Damianos Bourqui, Marc Nikolayeva, Olga Québatte, Maxime Patrignani, Andrea Dehio, Christoph Frey, Juerg E. Robinson, Mark D. Wollscheid, Bernd Ahrens, Christian H. |
author_sort | Omasits, Ulrich |
collection | PubMed |
description | Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open reading frames (sORFs), and overprediction of spurious ORFs represent serious limitations. Our strategy toward accurate and complete genome annotation consolidates CDSs from multiple reference annotation resources, ab initio gene prediction algorithms and in silico ORFs (a modified six-frame translation considering alternative start codons) in an integrated proteogenomics database (iPtgxDB) that covers the entire protein-coding potential of a prokaryotic genome. By extending the PeptideClassifier concept of unambiguous peptides for prokaryotes, close to 95% of the identifiable peptides imply one distinct protein, largely simplifying downstream analysis. Searching a comprehensive Bartonella henselae proteomics data set against such an iPtgxDB allowed us to unambiguously identify novel ORFs uniquely predicted by each resource, including lipoproteins, differentially expressed and membrane-localized proteins, novel start sites and wrongly annotated pseudogenes. Most novelties were confirmed by targeted, parallel reaction monitoring mass spectrometry, including unique ORFs and single amino acid variations (SAAVs) identified in a re-sequenced laboratory strain that are not present in its reference genome. We demonstrate the general applicability of our strategy for genomes with varying GC content and distinct taxonomic origin. We release iPtgxDBs for B. henselae, Bradyrhizobium diazoefficiens and Escherichia coli and the software to generate both proteogenomics search databases and integrated annotation files that can be viewed in a genome browser for any prokaryote. |
format | Online Article Text |
id | pubmed-5741054 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Cold Spring Harbor Laboratory Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-57410542018-01-23 An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics Omasits, Ulrich Varadarajan, Adithi R. Schmid, Michael Goetze, Sandra Melidis, Damianos Bourqui, Marc Nikolayeva, Olga Québatte, Maxime Patrignani, Andrea Dehio, Christoph Frey, Juerg E. Robinson, Mark D. Wollscheid, Bernd Ahrens, Christian H. Genome Res Method Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open reading frames (sORFs), and overprediction of spurious ORFs represent serious limitations. Our strategy toward accurate and complete genome annotation consolidates CDSs from multiple reference annotation resources, ab initio gene prediction algorithms and in silico ORFs (a modified six-frame translation considering alternative start codons) in an integrated proteogenomics database (iPtgxDB) that covers the entire protein-coding potential of a prokaryotic genome. By extending the PeptideClassifier concept of unambiguous peptides for prokaryotes, close to 95% of the identifiable peptides imply one distinct protein, largely simplifying downstream analysis. Searching a comprehensive Bartonella henselae proteomics data set against such an iPtgxDB allowed us to unambiguously identify novel ORFs uniquely predicted by each resource, including lipoproteins, differentially expressed and membrane-localized proteins, novel start sites and wrongly annotated pseudogenes. Most novelties were confirmed by targeted, parallel reaction monitoring mass spectrometry, including unique ORFs and single amino acid variations (SAAVs) identified in a re-sequenced laboratory strain that are not present in its reference genome. 
We demonstrate the general applicability of our strategy for genomes with varying GC content and distinct taxonomic origin. We release iPtgxDBs for B. henselae, Bradyrhizobium diazoefficiens and Escherichia coli and the software to generate both proteogenomics search databases and integrated annotation files that can be viewed in a genome browser for any prokaryote. Cold Spring Harbor Laboratory Press 2017-12 /pmc/articles/PMC5741054/ /pubmed/29141959 http://dx.doi.org/10.1101/gr.218255.116 Text en © 2017 Omasits et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by-nc/4.0/ This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/. |
spellingShingle | Method Omasits, Ulrich Varadarajan, Adithi R. Schmid, Michael Goetze, Sandra Melidis, Damianos Bourqui, Marc Nikolayeva, Olga Québatte, Maxime Patrignani, Andrea Dehio, Christoph Frey, Juerg E. Robinson, Mark D. Wollscheid, Bernd Ahrens, Christian H. An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics |
title | An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics |
title_full | An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics |
title_fullStr | An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics |
title_full_unstemmed | An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics |
title_short | An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics |
title_sort | integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics |
topic | Method |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5741054/ https://www.ncbi.nlm.nih.gov/pubmed/29141959 http://dx.doi.org/10.1101/gr.218255.116 |
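
The description field mentions in silico ORFs obtained by "a modified six-frame translation considering alternative start codons". The following is only a minimal sketch of that general idea, not the authors' software: the function name `six_frame_orfs`, the start codon set (ATG, GTG, TTG), the `min_aa` cutoff, and the choice to report only the longest ORF per stop-delimited stretch are all assumptions made for illustration.

```python
from typing import Iterator, NamedTuple

# Standard codon assignments, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


class Orf(NamedTuple):
    strand: str   # "+" or "-"
    frame: int    # 0, 1 or 2 on the scanned strand
    start: int    # 0-based offset of the start codon on the scanned strand
    end: int      # exclusive offset, stop codon included
    protein: str  # translated sequence without the stop


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_orfs(genome: str,
                   starts=("ATG", "GTG", "TTG"),  # assumed start codon set
                   min_aa: int = 6) -> Iterator[Orf]:
    """Yield the longest ORF of each stop-delimited stretch in all six frames."""
    genome = genome.upper()
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            orf_start = None
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if CODON_TABLE.get(codon, "X") == "*":
                    if orf_start is not None:
                        # Any start codon is translated as methionine.
                        body = "".join(
                            CODON_TABLE.get(seq[i:i + 3], "X")
                            for i in range(orf_start + 3, pos, 3)
                        )
                        protein = "M" + body
                        if len(protein) >= min_aa:
                            yield Orf(strand, frame, orf_start, pos + 3, protein)
                    orf_start = None
                elif orf_start is None and codon in starts:
                    # Keep the most upstream start to get the longest ORF.
                    orf_start = pos


if __name__ == "__main__":
    for orf in six_frame_orfs("GTGAAATAA", min_aa=2):
        print(orf)
```

In this sketch, coordinates on the minus strand refer to the reverse-complemented sequence; mapping them back to genome coordinates, and merging such ORFs with reference annotations and ab initio predictions, is left out.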