
Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey


Bibliographic Details
Main Authors: TARIQ, MUHAMMAD USMAN; HASEEB, MUHAMMAD; ALEDHARI, MOHAMMED; RAZZAK, REHMA; PARIZI, REZA M.; SAEED, FAHAD
Format: Online Article Text
Language: English
Published: 2020
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7853650/
https://www.ncbi.nlm.nih.gov/pubmed/33537181
http://dx.doi.org/10.1109/ACCESS.2020.3047588
author TARIQ, MUHAMMAD USMAN
HASEEB, MUHAMMAD
ALEDHARI, MOHAMMED
RAZZAK, REHMA
PARIZI, REZA M.
SAEED, FAHAD
collection PubMed
description Big Data Proteogenomics lies at the intersection of high-throughput Mass Spectrometry (MS) based proteomics and Next Generation Sequencing based genomics. The combined and integrated analysis of these two high-throughput technologies can help discover novel proteins using genomic and transcriptomic data. Due to the biological significance of integrated analysis, the recent past has seen an influx of proteogenomic tools that perform various tasks, including mapping proteins to the genomic data, searching experimental MS spectra against a six-frame translation genome database, and automating the process of annotating genome sequences. To date, most such tools have not focused on the scalability issues inherent in proteogenomic data analysis, where the database is much larger than a typical protein database. These state-of-the-art tools can take more than half a month to process a small-scale dataset of one million spectra against a genome of 3 GB. In this article, we provide an up-to-date review of tools that can analyze proteogenomic datasets, with a critical analysis of the techniques’ relative merits and potential pitfalls. We also point out potential bottlenecks and offer recommendations that can be incorporated into the future design of these workflows to ensure scalability as proteogenomic data grow. Lastly, we make the case that high-performance computing (HPC) solutions may be the best bet for ensuring the scalability of future big data proteogenomic analysis.
format Online
Article
Text
id pubmed-7853650
institution National Center for Biotechnology Information
language English
publishDate 2020
record_format MEDLINE/PubMed
spelling pubmed-7853650 2021-02-02
IEEE Access Article
2020-12-25 2021
/pmc/articles/PMC7853650/ /pubmed/33537181 http://dx.doi.org/10.1109/ACCESS.2020.3047588
Text en
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
title Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7853650/
https://www.ncbi.nlm.nih.gov/pubmed/33537181
http://dx.doi.org/10.1109/ACCESS.2020.3047588