TOOme: A Novel Computational Framework to Infer Cancer Tissue-of-Origin by Integrating Both Gene Mutation and Expression

Metastatic cancers require further diagnosis to determine their primary tumor sites. However, according to statistics from the United States, the tissue-of-origin of around 5% of tumors cannot be identified by routine medical diagnosis. With the development of machine learning techniques and the acc...

Bibliographic Details
Main authors: He, Binsheng, Lang, Jidong, Wang, Bo, Liu, Xiaojun, Lu, Qingqing, He, Jianjun, Gao, Wei, Bing, Pingping, Tian, Geng, Yang, Jialiang
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2020
Subjects: Bioengineering and Biotechnology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7248358/
https://www.ncbi.nlm.nih.gov/pubmed/32509741
http://dx.doi.org/10.3389/fbioe.2020.00394
author He, Binsheng
Lang, Jidong
Wang, Bo
Liu, Xiaojun
Lu, Qingqing
He, Jianjun
Gao, Wei
Bing, Pingping
Tian, Geng
Yang, Jialiang
author_sort He, Binsheng
collection PubMed
description Metastatic cancers require further diagnosis to determine their primary tumor sites. However, according to statistics from the United States, the tissue-of-origin of around 5% of tumors cannot be identified by routine medical diagnosis. With the development of machine learning techniques and the accumulation of big cancer data from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO), it is now feasible to predict cancer tissue-of-origin with computational tools. A metastatic tumor inherits characteristics from its tissue-of-origin, and both gene expression profiles and somatic mutations have tissue specificity. Thus, we developed a computational framework to infer tumor tissue-of-origin by integrating both gene mutation and expression (TOOme). Specifically, we first perform feature selection on both gene expressions and mutations with a random forest method. The selected features are then used to build a multi-label classification model to infer cancer tissue-of-origin. We adopt a few popular multi-label classification methods, which are compared by 10-fold cross-validation. We applied TOOme to TCGA data containing 7,008 non-metastatic samples across 20 solid tumors. Seventy-four genes by gene expression profile and six genes by gene mutation are selected by the random forest process; they can be divided into two categories: (1) cancer-type-specific genes and (2) genes expressed or mutated in several cancers with different expression levels or mutation rates. Function analysis indicates that the selected genes are significantly enriched in gland development, urogenital system development, hormone metabolic process, thyroid hormone generation, prostate hormone generation, and so on. Among the multi-label classification methods, random forest performs the best, with a 10-fold cross-validation prediction accuracy of 96%. We also use the 19 metastatic samples from TCGA and 256 cancer samples downloaded from GEO as independent testing data, for which TOOme achieves a prediction accuracy of 89%. The cross-validation accuracy is better than that obtained using gene expression (95%) or gene mutation (53%) alone. In conclusion, TOOme provides a quick yet accurate alternative to traditional medical methods for inferring cancer tissue-of-origin. In addition, methods combining somatic mutation and gene expression outperform those using gene expression or mutation alone.
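The workflow summarized in the description (random forest feature selection on combined expression and mutation features, then a classifier scored by 10-fold cross-validation) can be sketched roughly as follows. This is not the authors' code. The data are synthetic stand-ins rather than TCGA samples, and every name, feature count, tree count, and selection threshold is an assumption made for illustration. So is the use of scikit-learn, the treatment of the task as single-label multi-class classification, and the choice to run feature selection inside each fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT TCGA): 200 samples, 4 hypothetical tissue classes.
n_samples, n_classes = 200, 4
per_class = n_samples // n_classes
labels = np.repeat(np.arange(n_classes), per_class)

# 50 continuous "expression" features; feature c is shifted upward in class c.
expression = rng.normal(size=(n_samples, 50))
for c in range(n_classes):
    expression[labels == c, c] += 3.0

# 10 binary "mutation" features; feature 0 is mutated mostly in class 0.
mutation = (rng.random((n_samples, 10)) < 0.05).astype(float)
mutation[labels == 0, 0] = (rng.random(per_class) < 0.8).astype(float)

# Integrate the two data types by column-wise concatenation (an assumption).
features = np.hstack([expression, mutation])

# Step 1: keep features whose random forest importance exceeds the mean importance.
# Step 2: train a second random forest on the selected features only.
# Wrapping both steps in one pipeline refits the selection inside every fold.
model = make_pipeline(
    SelectFromModel(
        RandomForestClassifier(n_estimators=200, random_state=0),
        threshold="mean",
    ),
    RandomForestClassifier(n_estimators=200, random_state=0),
)

# 10-fold stratified cross-validation, scored by plain accuracy.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, features, labels, cv=cv, scoring="accuracy")
mean_accuracy = float(scores.mean())
print(f"10-fold CV accuracy: {mean_accuracy:.3f}")
```

Dropping either block of columns from `features` before cross-validation gives a simple way to compare expression-only, mutation-only, and combined inputs on the same folds.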
format Online
Article
Text
id pubmed-7248358
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-7248358 2020-06-05 Front Bioeng Biotechnol Bioengineering and Biotechnology Frontiers Media S.A. 2020-05-19 /pmc/articles/PMC7248358/ /pubmed/32509741 http://dx.doi.org/10.3389/fbioe.2020.00394 Text en Copyright © 2020 He, Lang, Wang, Liu, Lu, He, Gao, Bing, Tian and Yang. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title TOOme: A Novel Computational Framework to Infer Cancer Tissue-of-Origin by Integrating Both Gene Mutation and Expression
topic Bioengineering and Biotechnology