Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides
SUMMARY: We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs and other no...
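Main authors: | Umer, Husen M; Audain, Enrique; Zhu, Yafeng; Pfeuffer, Julianus; Sachsenberg, Timo; Lehtiö, Janne; Branca, Rui M; Perez-Riverol, Yasset |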
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2021 |
Subjects: | Applications Notes |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8825679/ https://www.ncbi.nlm.nih.gov/pubmed/34904638 http://dx.doi.org/10.1093/bioinformatics/btab838 |
_version_ | 1784647274649878528 |
---|---|
author | Umer, Husen M Audain, Enrique Zhu, Yafeng Pfeuffer, Julianus Sachsenberg, Timo Lehtiö, Janne Branca, Rui M Perez-Riverol, Yasset |
author_facet | Umer, Husen M Audain, Enrique Zhu, Yafeng Pfeuffer, Julianus Sachsenberg, Timo Lehtiö, Janne Branca, Rui M Perez-Riverol, Yasset |
author_sort | Umer, Husen M |
collection | PubMed |
description | SUMMARY: We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs and other non-canonical transcripts, such as those produced by alternative splicing events. It also includes exonic out-of-frame translation from otherwise canonical protein-coding mRNAs. Moreover, the tool enables the generation of variant protein sequences from multiple sources of genomic variants including COSMIC, cBioportal, gnomAD and mutations detected from sequencing of patient samples. pypgatk and pgdb provide multiple functionalities for database handling including optimized target/decoy generation by the algorithm DecoyPyrat. Finally, we have reanalyzed six public datasets in PRIDE by generating cell-type specific databases for 65 cell lines using the pypgatk and pgdb workflow, revealing a wealth of non-canonical or cryptic peptides amounting to >5% of the total number of peptides identified. AVAILABILITY AND IMPLEMENTATION: The software is freely available. pypgatk: https://github.com/bigbio/py-pgatk/ and pgdb: https://nf-co.re/pgdb. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-8825679 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-88256792022-02-09 Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides Umer, Husen M Audain, Enrique Zhu, Yafeng Pfeuffer, Julianus Sachsenberg, Timo Lehtiö, Janne Branca, Rui M Perez-Riverol, Yasset Bioinformatics Applications Notes SUMMARY: We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs and other non-canonical transcripts, such as those produced by alternative splicing events. It also includes exonic out-of-frame translation from otherwise canonical protein-coding mRNAs. Moreover, the tool enables the generation of variant protein sequences from multiple sources of genomic variants including COSMIC, cBioportal, gnomAD and mutations detected from sequencing of patient samples. pypgatk and pgdb provide multiple functionalities for database handling including optimized target/decoy generation by the algorithm DecoyPyrat. Finally, we have reanalyzed six public datasets in PRIDE by generating cell-type specific databases for 65 cell lines using the pypgatk and pgdb workflow, revealing a wealth of non-canonical or cryptic peptides amounting to >5% of the total number of peptides identified. AVAILABILITY AND IMPLEMENTATION: The software is freely available. pypgatk: https://github.com/bigbio/py-pgatk/ and pgdb: https://nf-co.re/pgdb. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2021-12-14 /pmc/articles/PMC8825679/ /pubmed/34904638 http://dx.doi.org/10.1093/bioinformatics/btab838 Text en © The Author(s) 2021. Published by Oxford University Press. 
https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Applications Notes Umer, Husen M Audain, Enrique Zhu, Yafeng Pfeuffer, Julianus Sachsenberg, Timo Lehtiö, Janne Branca, Rui M Perez-Riverol, Yasset Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides |
title | Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides |
title_full | Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides |
title_fullStr | Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides |
title_full_unstemmed | Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides |
title_short | Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides |
title_sort | generation of ensembl-based proteogenomics databases boosts the identification of non-canonical peptides |
topic | Applications Notes |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8825679/ https://www.ncbi.nlm.nih.gov/pubmed/34904638 http://dx.doi.org/10.1093/bioinformatics/btab838 |
work_keys_str_mv | AT umerhusenm generationofensemblbasedproteogenomicsdatabasesbooststheidentificationofnoncanonicalpeptides AT audainenrique generationofensemblbasedproteogenomicsdatabasesbooststheidentificationofnoncanonicalpeptides AT zhuyafeng generationofensemblbasedproteogenomicsdatabasesbooststheidentificationofnoncanonicalpeptides AT pfeufferjulianus generationofensemblbasedproteogenomicsdatabasesbooststheidentificationofnoncanonicalpeptides AT sachsenbergtimo generationofensemblbasedproteogenomicsdatabasesbooststheidentificationofnoncanonicalpeptides AT lehtiojanne generationofensemblbasedproteogenomicsdatabasesbooststheidentificationofnoncanonicalpeptides AT brancaruim generationofensemblbasedproteogenomicsdatabasesbooststheidentificationofnoncanonicalpeptides AT perezriverolyasset generationofensemblbasedproteogenomicsdatabasesbooststheidentificationofnoncanonicalpeptides |