Non-synonymous variations in cancer and their effects on the human proteome: workflow for NGS data biocuration and proteome-wide analysis of TCGA data
BACKGROUND: Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard-disks which are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some...
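Main Authors: | Cole, Charles; Krampis, Konstantinos; Karagiannis, Konstantinos; Almeida, Jonas S; Faison, William J; Motwani, Mona; Wan, Quan; Golikov, Anton; Pan, Yang; Simonyan, Vahan; Mazumder, Raja |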
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central, 2014 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3916084/ https://www.ncbi.nlm.nih.gov/pubmed/24467687 http://dx.doi.org/10.1186/1471-2105-15-28 |
_version_ | 1782302663742849024 |
---|---|
author | Cole, Charles; Krampis, Konstantinos; Karagiannis, Konstantinos; Almeida, Jonas S; Faison, William J; Motwani, Mona; Wan, Quan; Golikov, Anton; Pan, Yang; Simonyan, Vahan; Mazumder, Raja |
author_facet | Cole, Charles; Krampis, Konstantinos; Karagiannis, Konstantinos; Almeida, Jonas S; Faison, William J; Motwani, Mona; Wan, Quan; Golikov, Anton; Pan, Yang; Simonyan, Vahan; Mazumder, Raja |
author_sort | Cole, Charles |
collection | PubMed |
description | BACKGROUND: Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard-disks which are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data, thereby allowing users to better navigate, search and compute on it. RESULTS: To address the above challenge, we have implemented an NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine which is used upstream of an integrated High-performance Integrated Virtual Environment (HIVE) that encapsulates the Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof-of-concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program - The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR available clinical information on patients, mapping of the reads to the reference followed by identification of non-synonymous Single Nucleotide Variations (nsSNVs), and integrating the data with tools that allow analysis of the effect of nsSNVs on the human proteome. Furthermore, we have also developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr). CONCLUSIONS: Availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual-level SNVs and their effect on the human proteome beyond what the dbSNP database provides. |
format | Online Article Text |
id | pubmed-3916084 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3916084 2014-02-07 Non-synonymous variations in cancer and their effects on the human proteome: workflow for NGS data biocuration and proteome-wide analysis of TCGA data Cole, Charles; Krampis, Konstantinos; Karagiannis, Konstantinos; Almeida, Jonas S; Faison, William J; Motwani, Mona; Wan, Quan; Golikov, Anton; Pan, Yang; Simonyan, Vahan; Mazumder, Raja BMC Bioinformatics Methodology Article BACKGROUND: Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard-disks which are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data, thereby allowing users to better navigate, search and compute on it. RESULTS: To address the above challenge, we have implemented an NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine which is used upstream of an integrated High-performance Integrated Virtual Environment (HIVE) that encapsulates the Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof-of-concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program - The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR available clinical information on patients, mapping of the reads to the reference followed by identification of non-synonymous Single Nucleotide Variations (nsSNVs), and integrating the data with tools that allow analysis of the effect of nsSNVs on the human proteome. Furthermore, we have also developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr). CONCLUSIONS: Availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual-level SNVs and their effect on the human proteome beyond what the dbSNP database provides. BioMed Central 2014-01-27 /pmc/articles/PMC3916084/ /pubmed/24467687 http://dx.doi.org/10.1186/1471-2105-15-28 Text en Copyright © 2014 Cole et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Methodology Article; Cole, Charles; Krampis, Konstantinos; Karagiannis, Konstantinos; Almeida, Jonas S; Faison, William J; Motwani, Mona; Wan, Quan; Golikov, Anton; Pan, Yang; Simonyan, Vahan; Mazumder, Raja; Non-synonymous variations in cancer and their effects on the human proteome: workflow for NGS data biocuration and proteome-wide analysis of TCGA data |
title | Non-synonymous variations in cancer and their effects on the human proteome: workflow for NGS data biocuration and proteome-wide analysis of TCGA data |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3916084/ https://www.ncbi.nlm.nih.gov/pubmed/24467687 http://dx.doi.org/10.1186/1471-2105-15-28 |
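
The abstract in this record mentions identifying non-synonymous single nucleotide variations (nsSNVs) after read mapping. As a rough illustration of what "non-synonymous" means, the sketch below classifies a single-base substitution in a coding sequence using the standard genetic code. It is not the authors' HIVE or SNVDis implementation; the function name `classify_snv`, the 0-based position convention, and the example sequence are assumptions made only for this example.

```python
from itertools import product
from typing import Tuple

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(cds: str, pos: int, alt: str) -> Tuple[str, str, str]:
    """Classify a substitution at 0-based position `pos` of an in-frame CDS.

    Hypothetical helper: returns (kind, reference amino acid, altered amino
    acid), where kind is "synonymous" or "non-synonymous".
    """
    cds = cds.upper()
    alt = alt.upper()
    if alt not in BASES:
        raise ValueError("alt must be one of T, C, A, G")
    # Only positions inside a complete codon can be classified.
    usable_length = len(cds) - len(cds) % 3
    if not 0 <= pos < usable_length:
        raise ValueError("pos is outside the complete codons of the CDS")

    offset = pos % 3
    start = pos - offset
    ref_codon = cds[start:start + 3]
    if ref_codon[offset] == alt:
        raise ValueError("alt is identical to the reference base")

    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


if __name__ == "__main__":
    cds = "ATGGAAGTT"  # made-up sequence encoding Met-Glu-Val
    print(classify_snv(cds, 4, "T"))  # GAA -> GTA changes Glu to Val
    print(classify_snv(cds, 5, "G"))  # GAA -> GAG keeps Glu
```

A real pipeline would work from aligned reads and transcript annotations rather than a bare coding string; the sketch only shows the codon-level comparison that separates synonymous from non-synonymous changes.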