A workflow to identify novel proteins based on the direct mapping of peptide-spectrum-matches to genomic locations

BACKGROUND: Small Proteins have received increasing attention in recent years. They have in particular been implicated as signals contributing to the coordination of bacterial communities. In genome annotations they are often missing or hidden among large numbers of hypothetical proteins because gen...

Bibliographic Details
Main authors:	Anders, John; Petruschke, Hannes; Jehmlich, Nico; Haange, Sven-Bastiaan; von Bergen, Martin; Stadler, Peter F
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2021
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8157683/ https://www.ncbi.nlm.nih.gov/pubmed/34039272 http://dx.doi.org/10.1186/s12859-021-04159-8

_version_	1783699736719523840
author	Anders, John Petruschke, Hannes Jehmlich, Nico Haange, Sven-Bastiaan von Bergen, Martin Stadler, Peter F
author_sort	Anders, John
collection	PubMed
description	BACKGROUND: Small Proteins have received increasing attention in recent years. They have in particular been implicated as signals contributing to the coordination of bacterial communities. In genome annotations they are often missing or hidden among large numbers of hypothetical proteins because genome annotation pipelines often exclude short open reading frames or over-predict hypothetical proteins based on simple models. The validation of novel proteins, and in particular of small proteins (sProteins), therefore requires additional evidence. Proteogenomics is considered the gold standard for this purpose. It extends beyond established annotations and includes all possible open reading frames (ORFs) as potential sources of peptides, thus allowing the discovery of novel, unannotated proteins. Typically this results in large numbers of putative novel small proteins fraught with large fractions of false-positive predictions. RESULTS: We observe that number and quality of the peptide-spectrum matches (PSMs) that map to a candidate ORF can be highly informative for the purpose of distinguishing proteins from spurious ORF annotations. We report here on a workflow that aggregates PSM quality information and local context into simple descriptors and reliably separates likely proteins from the large pool of false-positive, i.e., most likely untranslated ORFs. We investigated the artificial gut microbiome model SIHUMIx, comprising eight different species, for which we validate 5114 proteins that have previously been annotated only as hypothetical ORFs. In addition, we identified 37 non-annotated protein candidates for which we found evidence at the proteomic and transcriptomic level. Half (19) of these candidates have close functional homologs in other species. Another 12 candidates have homologs designated as hypothetical proteins in other species. The remaining six candidates are short (< 100 AA) and are most likely bona fide novel proteins. CONCLUSIONS: The aggregation of PSM quality information for predicted ORFs provides a robust and efficient method to identify novel proteins in proteomics data. The workflow is in particular capable of identifying small proteins and frameshift variants. Since PSMs are explicitly mapped to genomic locations, it furthermore facilitates the integration of transcriptomics data and other sources of genome-level information. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04159-8.
format	Online Article Text
id	pubmed-8157683
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-8157683 2021-05-28 A workflow to identify novel proteins based on the direct mapping of peptide-spectrum-matches to genomic locations Anders, John Petruschke, Hannes Jehmlich, Nico Haange, Sven-Bastiaan von Bergen, Martin Stadler, Peter F BMC Bioinformatics Research BioMed Central 2021-05-26 /pmc/articles/PMC8157683/ /pubmed/34039272 http://dx.doi.org/10.1186/s12859-021-04159-8 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle	Research Anders, John Petruschke, Hannes Jehmlich, Nico Haange, Sven-Bastiaan von Bergen, Martin Stadler, Peter F A workflow to identify novel proteins based on the direct mapping of peptide-spectrum-matches to genomic locations
title	A workflow to identify novel proteins based on the direct mapping of peptide-spectrum-matches to genomic locations
title_sort	workflow to identify novel proteins based on the direct mapping of peptide-spectrum-matches to genomic locations
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8157683/ https://www.ncbi.nlm.nih.gov/pubmed/34039272 http://dx.doi.org/10.1186/s12859-021-04159-8
