
A Machine Learning Method to Trace Cancer Primary Lesion Using Microarray-Based Gene Expression Data

Cancer of unknown primary site (CUP) is a heterogeneous group of cancers whose tissue of origin remains unknown after detailed investigation by conventional clinical methods. CUP accounts for roughly 3%–5% of all human malignancies. CUP patients are usually treated with broad-spectrum...


Bibliographic Details
Main authors: Lu, Qingfeng; Chen, Fengxia; Li, Qianyue; Chen, Lihong; Tong, Ling; Tian, Geng; Zhou, Xiaohong
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9071249/
https://www.ncbi.nlm.nih.gov/pubmed/35530331
http://dx.doi.org/10.3389/fonc.2022.832567
author Lu, Qingfeng
Chen, Fengxia
Li, Qianyue
Chen, Lihong
Tong, Ling
Tian, Geng
Zhou, Xiaohong
collection PubMed
description Cancer of unknown primary site (CUP) is a heterogeneous group of cancers whose tissue of origin remains unknown after detailed investigation by conventional clinical methods. CUP accounts for roughly 3%–5% of all human malignancies. CUP patients are usually treated with broad-spectrum chemotherapy, which often leads to a poor prognosis. Recent studies suggest that treatment targeting the primary lesion of CUP significantly improves patient prognosis. An efficient method that accurately detects the tissue of origin of CUP is therefore urgently needed in clinical cancer research. In this work, we developed a novel framework that uses Extreme Gradient Boosting (XGBoost) to trace the primary site of CUP from microarray-based gene expression data. First, we downloaded the microarray-based gene expression profiles of 59,385 genes for 57,08 samples from The Cancer Genome Atlas (TCGA) and 6,364 genes for 3,101 samples from the Gene Expression Omnibus (GEO). Both datasets were divided into training and independent testing data at a ratio of 4:1. Then, we obtained 200 and 290 genes from the TCGA and GEO training data, respectively, to train XGBoost models for identifying the primary site of CUP. The overall 5-fold cross-validation accuracies of our method were 96.9% and 95.3% on the TCGA and GEO training datasets, respectively. Meanwhile, the macro-precision on the independent testing data reached 96.75% and 98.8% on TCGA and GEO, respectively. These experimental results indicate that the XGBoost framework can reduce the cost of tracing the cancer primary site and is highly efficient, which might make it useful in clinical practice.
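The abstract's evaluation protocol (a 4:1 train/test split followed by macro-precision on the held-out data) can be illustrated with a minimal sketch. It uses only the Python standard library and toy tissue labels. The per-class (stratified) split, the function names, and the zero-precision convention for never-predicted classes are assumptions made for illustration; the record does not state how the authors split or scored their data, and no XGBoost model is trained here.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices roughly 4:1 within each class (assumed stratification)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def macro_precision(y_true, y_pred):
    """Unweighted mean over classes of TP / (TP + FP)."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        # True labels of every sample that was predicted as class c.
        truths = [t for t, p in zip(y_true, y_pred) if p == c]
        if truths:
            per_class.append(sum(t == c for t in truths) / len(truths))
        else:
            # Assumed convention: a class that is never predicted scores 0.
            per_class.append(0.0)
    return sum(per_class) / len(per_class)


# Toy tissue-of-origin labels (hypothetical, not from TCGA or GEO).
labels = ["lung"] * 10 + ["breast"] * 5
train_idx, test_idx = stratified_split(labels)

y_true = ["lung", "lung", "breast", "breast", "colon"]
y_pred = ["lung", "breast", "breast", "breast", "colon"]
score = macro_precision(y_true, y_pred)
```

Macro-precision weights every tissue class equally, so a rare primary site counts as much as a common one. Overall accuracy, the other metric quoted in the abstract, is dominated by the most frequent classes.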
format Online
Article
Text
id pubmed-9071249
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9071249 2022-05-06 Front Oncol Oncology Frontiers Media S.A. 2022-04-21 /pmc/articles/PMC9071249/ /pubmed/35530331 http://dx.doi.org/10.3389/fonc.2022.832567 Text en
Copyright © 2022 Lu, Chen, Li, Chen, Tong, Tian and Zhou https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title A Machine Learning Method to Trace Cancer Primary Lesion Using Microarray-Based Gene Expression Data
topic Oncology