
Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides

SUMMARY: We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs and other no...


Bibliographic Details
Main authors: Umer, Husen M; Audain, Enrique; Zhu, Yafeng; Pfeuffer, Julianus; Sachsenberg, Timo; Lehtiö, Janne; Branca, Rui M; Perez-Riverol, Yasset
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects: Applications Notes
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8825679/
https://www.ncbi.nlm.nih.gov/pubmed/34904638
http://dx.doi.org/10.1093/bioinformatics/btab838
author Umer, Husen M
Audain, Enrique
Zhu, Yafeng
Pfeuffer, Julianus
Sachsenberg, Timo
Lehtiö, Janne
Branca, Rui M
Perez-Riverol, Yasset
author_sort Umer, Husen M
collection PubMed
description SUMMARY: We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs and other non-canonical transcripts, such as those produced by alternative splicing events. They also support exonic out-of-frame translation from otherwise canonical protein-coding mRNAs. Moreover, the tools enable the generation of variant protein sequences from multiple sources of genomic variants, including COSMIC, cBioPortal, gnomAD and mutations detected from sequencing of patient samples. pypgatk and pgdb provide multiple functionalities for database handling, including optimized target/decoy generation by the DecoyPyrat algorithm. Finally, we have reanalyzed six public datasets in PRIDE by generating cell-type specific databases for 65 cell lines using the pypgatk and pgdb workflow, revealing a wealth of non-canonical or cryptic peptides amounting to >5% of the total number of peptides identified. AVAILABILITY AND IMPLEMENTATION: The software is freely available. pypgatk: https://github.com/bigbio/py-pgatk/ and pgdb: https://nf-co.re/pgdb. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
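The three-frame translation mentioned in the summary can be illustrated with a minimal sketch. This is not the pypgatk implementation or its API: the function name `three_frame_translation` and the example sequence are hypothetical, and the sketch assumes a forward-strand DNA sequence and the standard genetic code, with stop codons written as `*` and unrecognized codons as `X`.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate a DNA string codon by codon; trailing 1-2 bases are ignored."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # 'X' for codons with ambiguous bases
        for i in range(0, len(seq) - 2, 3)
    )


def three_frame_translation(seq: str) -> list[str]:
    """Return the translations of the three forward reading frames (offsets 0, 1, 2)."""
    return [translate(seq[offset:]) for offset in range(3)]


if __name__ == "__main__":
    # Hypothetical toy transcript, used only to show the three frames.
    for frame, protein in enumerate(three_frame_translation("ATGGCCTAA"), start=1):
        print(f"frame {frame}: {protein}")
```

In a database-building setting, each frame would typically be split at stop codons and the resulting fragments filtered by length before being written to FASTA; those steps are left out here to keep the sketch short.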
format Online
Article
Text
id pubmed-8825679
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8825679 2022-02-09 Bioinformatics Applications Notes Oxford University Press 2021-12-14 /pmc/articles/PMC8825679/ /pubmed/34904638 http://dx.doi.org/10.1093/bioinformatics/btab838 Text en © The Author(s) 2021. Published by Oxford University Press.
https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides
topic Applications Notes
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8825679/
https://www.ncbi.nlm.nih.gov/pubmed/34904638
http://dx.doi.org/10.1093/bioinformatics/btab838