A comprehensive genomic pan-cancer classification using The Cancer Genome Atlas gene expression data

BACKGROUND: The Cancer Genome Atlas (TCGA) has generated comprehensive molecular profiles. We aim to identify a set of genes whose expression patterns can distinguish diverse tumor types. Those features may serve as biomarkers for tumor diagnosis and drug development. METHODS: Using RNA-seq expressi...

Bibliographic Details
Main Authors: Li, Yuanyuan; Kang, Kai; Krahn, Juno M.; Croutwater, Nicole; Lee, Kevin; Umbach, David M.; Li, Leping
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5496318/
https://www.ncbi.nlm.nih.gov/pubmed/28673244
http://dx.doi.org/10.1186/s12864-017-3906-0
author Li, Yuanyuan
Kang, Kai
Krahn, Juno M.
Croutwater, Nicole
Lee, Kevin
Umbach, David M.
Li, Leping
collection PubMed
description BACKGROUND: The Cancer Genome Atlas (TCGA) has generated comprehensive molecular profiles. We aim to identify a set of genes whose expression patterns can distinguish diverse tumor types. Those features may serve as biomarkers for tumor diagnosis and drug development. METHODS: Using RNA-seq expression data, we undertook a pan-cancer classification of 9,096 TCGA tumor samples representing 31 tumor types. We randomly assigned 75% of samples to a training set and 25% to a test set, allocating samples from each tumor type proportionally. RESULTS: We could correctly classify more than 90% of the test set samples. Accuracies were high for all but three of the 31 tumor types; the most notable exception was READ (rectum adenocarcinoma), which was largely indistinguishable from COAD (colon adenocarcinoma). We also carried out pan-cancer classification, separately for males and females, on 23 sex non-specific tumor types (those unrelated to reproductive organs). Results from these gender-specific analyses largely recapitulated the results obtained when gender was ignored. Remarkably, more than 80% of the 100 most discriminative genes selected from each gender separately overlapped. Genes that were differentially expressed between genders included BNC1, FAT2, FOXA1, and HOXA11. FOXA1 has been shown to play a role in sexual dimorphism in liver cancer. The differentially discriminative genes we identified might be important for the gender differences in tumor incidence and survival. CONCLUSIONS: We were able to identify many sets of 20 genes that could correctly classify more than 90% of the samples from 31 different tumor types using TCGA RNA-seq data. This accuracy is remarkable given the number of tumor types and the total number of samples involved. We achieved similar results when we analyzed 23 non-sex-specific tumor types separately for males and females. We regard the frequency with which a gene appeared in those sets as measuring its importance for tumor classification.
One third of the 50 most frequently appearing genes were pseudogenes; the degree of enrichment may be indicative of their importance in tumor classification. Lastly, we identified a few genes that might play a role in sexual dimorphism in certain cancers. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-017-3906-0) contains supplementary material, which is available to authorized users.
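The METHODS sentence about assigning 75% of samples to training and 25% to testing, proportionally within each tumor type, describes what is commonly called a stratified split. The following is a minimal sketch of that idea, not the authors' code: the function name `stratified_split`, the fixed seed, the rounding rule, and the toy label list are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Split sample indices into train/test, one tumor type at a time.

    Hypothetical helper: the abstract states only the 75%/25% proportions
    and that allocation was proportional per tumor type.
    """
    rng = random.Random(seed)

    # Group sample indices by their tumor-type label.
    by_type = defaultdict(list)
    for idx, tumor_type in enumerate(labels):
        by_type[tumor_type].append(idx)

    train_idx, test_idx = [], []
    for tumor_type in sorted(by_type):
        indices = by_type[tumor_type]
        rng.shuffle(indices)
        # Rounding rule is an assumption; the abstract does not specify one.
        n_train = round(train_fraction * len(indices))
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])

    return sorted(train_idx), sorted(test_idx)


# Toy labels only; these are not TCGA sample counts.
labels = ["COAD"] * 8 + ["READ"] * 4 + ["BRCA"] * 12
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each type, rather than over the pooled samples, keeps rare tumor types represented in both sets, which matters when per-type accuracies are reported.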
format Online
Article
Text
id pubmed-5496318
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5496318 2017-07-05 A comprehensive genomic pan-cancer classification using The Cancer Genome Atlas gene expression data. Li, Yuanyuan; Kang, Kai; Krahn, Juno M.; Croutwater, Nicole; Lee, Kevin; Umbach, David M.; Li, Leping. BMC Genomics. Research Article. BioMed Central 2017-07-03 /pmc/articles/PMC5496318/ /pubmed/28673244 http://dx.doi.org/10.1186/s12864-017-3906-0 Text en © The Author(s). 2017 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title A comprehensive genomic pan-cancer classification using The Cancer Genome Atlas gene expression data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5496318/
https://www.ncbi.nlm.nih.gov/pubmed/28673244
http://dx.doi.org/10.1186/s12864-017-3906-0