A Machine Learning Method to Trace Cancer Primary Lesion Using Microarray-Based Gene Expression Data
Cancer of unknown primary site (CUP) is a heterogeneous group of cancers whose tissue of origin remains unknown after detailed investigation by conventional clinical methods. CUP accounts for roughly 3%–5% of all human malignancies. CUP patients are usually treated with broad-spectrum...
Main authors: | Lu, Qingfeng; Chen, Fengxia; Li, Qianyue; Chen, Lihong; Tong, Ling; Tian, Geng; Zhou, Xiaohong |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2022 |
Subjects: | Oncology |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9071249/ https://www.ncbi.nlm.nih.gov/pubmed/35530331 http://dx.doi.org/10.3389/fonc.2022.832567 |
_version_ | 1784700811938365440 |
---|---|
author | Lu, Qingfeng Chen, Fengxia Li, Qianyue Chen, Lihong Tong, Ling Tian, Geng Zhou, Xiaohong |
author_facet | Lu, Qingfeng Chen, Fengxia Li, Qianyue Chen, Lihong Tong, Ling Tian, Geng Zhou, Xiaohong |
author_sort | Lu, Qingfeng |
collection | PubMed |
description | Cancer of unknown primary site (CUP) is a heterogeneous group of cancers whose tissue of origin remains unknown after detailed investigation by conventional clinical methods. CUP accounts for roughly 3%–5% of all human malignancies. CUP patients are usually treated with broad-spectrum chemotherapy, which often leads to a poor prognosis. Recent studies suggest that treatment targeting the primary lesion of CUP will significantly improve patient prognosis. Therefore, an efficient method to accurately detect the tissue of origin of CUP is urgently needed in clinical cancer research. In this work, we developed a novel framework that uses Extreme Gradient Boosting (XGBoost) to trace the primary site of CUP from microarray-based gene expression data. First, we downloaded the microarray-based gene expression profiles of 59,385 genes for 57,08 samples from The Cancer Genome Atlas (TCGA) and 6,364 genes for 3,101 samples from the Gene Expression Omnibus (GEO). Both datasets were divided into training and independent testing data at a ratio of 4:1. Then, from the training data, we selected 200 and 290 genes from the TCGA and GEO datasets, respectively, to train XGBoost models for identifying the primary site of CUP. The overall 5-fold cross-validation accuracies of our method were 96.9% and 95.3% on the TCGA and GEO training datasets, respectively. Meanwhile, the macro-precision on the independent test data reached 96.75% and 98.8% on TCGA and GEO, respectively. Experimental results demonstrated that the XGBoost framework can not only reduce the cost of tracing the primary site in the clinic but is also highly efficient, which might make it useful in clinical practice. |
format | Online Article Text |
id | pubmed-9071249 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-9071249 2022-05-06 A Machine Learning Method to Trace Cancer Primary Lesion Using Microarray-Based Gene Expression Data Lu, Qingfeng Chen, Fengxia Li, Qianyue Chen, Lihong Tong, Ling Tian, Geng Zhou, Xiaohong Front Oncol Oncology Frontiers Media S.A. 2022-04-21 /pmc/articles/PMC9071249/ /pubmed/35530331 http://dx.doi.org/10.3389/fonc.2022.832567 Text en Copyright © 2022 Lu, Chen, Li, Chen, Tong, Tian and Zhou https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Oncology Lu, Qingfeng Chen, Fengxia Li, Qianyue Chen, Lihong Tong, Ling Tian, Geng Zhou, Xiaohong A Machine Learning Method to Trace Cancer Primary Lesion Using Microarray-Based Gene Expression Data |
title | A Machine Learning Method to Trace Cancer Primary Lesion Using Microarray-Based Gene Expression Data |
title_sort | machine learning method to trace cancer primary lesion using microarray-based gene expression data |
topic | Oncology |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9071249/ https://www.ncbi.nlm.nih.gov/pubmed/35530331 http://dx.doi.org/10.3389/fonc.2022.832567 |
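
The description above mentions a 4:1 split into training and independent test data and reports macro-precision on the held-out set. The sketch below illustrates those two ideas using only the Python standard library. It is not the authors' code: the function names `stratified_split` and `macro_precision`, the per-class (stratified) splitting, the convention of scoring a never-predicted class as 0, and the toy tissue labels are all assumptions made for illustration. No XGBoost model is trained here.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), holding out ~test_fraction of each class.

    test_fraction=0.2 corresponds to a 4:1 train/test ratio.
    Stratifying by class is an assumption, not stated in the record.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def macro_precision(y_true, y_pred):
    """Unweighted mean over classes of precision = TP / (TP + FP).

    A class that is never predicted contributes 0.0 (an assumed convention).
    """
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        predicted = sum(1 for p in y_pred if p == c)
        per_class.append(tp / predicted if predicted else 0.0)
    return sum(per_class) / len(per_class)


# Toy labels (hypothetical tissue names, not the paper's data).
labels = ["lung"] * 10 + ["breast"] * 10 + ["colon"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 20 5

# Toy predictions: one lung sample is misclassified as breast.
y_true = ["lung", "lung", "breast", "breast", "colon"]
y_pred = ["lung", "breast", "breast", "breast", "colon"]
print(round(macro_precision(y_true, y_pred), 4))  # 0.8889
```

Because macro-precision averages the per-class values without weighting by class size, a rare tissue type counts as much as a common one, which is presumably why it is reported alongside overall accuracy.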