A Novel XGBoost Method to Identify Cancer Tissue-of-Origin Based on Copy Number Variations

The discovery of cancer of unknown primary (CUP) is of great significance in designing more effective treatments and improving the diagnostic efficiency in cancer patients. In this study, we develop an appropriate machine learning model for tracing the tissue of origin of CUP with high accuracy after...

Bibliographic Details
Main authors: Zhang, Yulin; Feng, Tong; Wang, Shudong; Dong, Ruyi; Yang, Jialiang; Su, Jionglong; Wang, Bo
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7716814/
https://www.ncbi.nlm.nih.gov/pubmed/33329723
http://dx.doi.org/10.3389/fgene.2020.585029
author Zhang, Yulin
Feng, Tong
Wang, Shudong
Dong, Ruyi
Yang, Jialiang
Su, Jionglong
Wang, Bo
author_sort Zhang, Yulin
collection PubMed
description The discovery of cancer of unknown primary (CUP) is of great significance in designing more effective treatments and improving the diagnostic efficiency in cancer patients. In this study, we develop an appropriate machine learning model for tracing the tissue of origin of CUP with high accuracy after feature engineering and model evaluation. Based on a copy number variation dataset consisting of 4,566 training cases and 1,262 independent validation cases, an XGBoost classifier is applied to 10 types of cancer. Extremely randomized trees (Extra-Trees) are used for dimension reduction so that fewer variables replace the original high-dimensional variables. The 300 features with the highest weights are selected, and principal component analysis is applied to eliminate noise. We find that the XGBoost classifier achieves the highest overall accuracy, 0.8913 in 10-fold cross-validation on the training samples and 0.7421 on the independent validation dataset, for predicting tumor tissue of origin. Furthermore, a comparison of performance indices such as precision and recall shows that the XGBoost classifier significantly improves classification performance across tumor types, with less prediction error than other classifiers such as K-nearest neighbors (KNN), Bayes, support vector machine (SVM), and AdaBoost. Our method can infer tissue of origin for the 10 cancer types with acceptable accuracy in both cross-validation and independent validation data. It may be used as an auxiliary diagnostic method to determine the actual clinicopathological status of a specific cancer.
format Online
Article
Text
id pubmed-7716814
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-7716814 2020-12-15 Front Genet Genetics Frontiers Media S.A. 2020-11-20 /pmc/articles/PMC7716814/ /pubmed/33329723 http://dx.doi.org/10.3389/fgene.2020.585029 Text en
Copyright © 2020 Zhang, Feng, Wang, Dong, Yang, Su and Wang. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title A Novel XGBoost Method to Identify Cancer Tissue-of-Origin Based on Copy Number Variations
topic Genetics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7716814/
https://www.ncbi.nlm.nih.gov/pubmed/33329723
http://dx.doi.org/10.3389/fgene.2020.585029
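
The description field above outlines a pipeline: rank features with extremely randomized trees, keep the top 300, apply principal component analysis, then classify with XGBoost under 10-fold cross-validation. The following is a minimal sketch of that sequence, not the authors' code. It assumes scikit-learn is installed and uses synthetic data in place of the copy number variation dataset. The function names, the number of PCA components (50), and all tree hyperparameters are illustrative assumptions that the record does not state. If the xgboost package is unavailable, the sketch falls back to scikit-learn's GradientBoostingClassifier, which is a different gradient-boosting implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

try:
    from xgboost import XGBClassifier

    def make_booster(seed):
        # Hyperparameters are placeholders, not values from the paper.
        return XGBClassifier(n_estimators=30, max_depth=3, random_state=seed)
except ImportError:
    def make_booster(seed):
        # Fallback: scikit-learn gradient boosting, NOT XGBoost itself.
        return GradientBoostingClassifier(
            n_estimators=30, max_depth=3, random_state=seed
        )


def make_synthetic_cnv(seed=0):
    """Hypothetical stand-in for a CNV matrix: 300 samples, 400 features, 10 classes."""
    return make_classification(
        n_samples=300, n_features=400, n_informative=40,
        n_classes=10, n_clusters_per_class=1, random_state=seed,
    )


def build_pipeline(n_top_features=300, n_components=50, seed=0):
    """Extra-Trees ranking -> top-k features -> PCA -> boosted classifier."""
    selector = SelectFromModel(
        ExtraTreesClassifier(n_estimators=100, random_state=seed),
        max_features=n_top_features,
        threshold=-np.inf,  # keep exactly the top-k features by importance
    )
    return Pipeline([
        ("select", selector),
        ("pca", PCA(n_components=n_components, random_state=seed)),
        ("clf", make_booster(seed)),
    ])


def cross_validate_pipeline(X, y, n_splits=10, seed=0):
    """Return per-fold accuracy from stratified k-fold cross-validation."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(build_pipeline(seed=seed), X, y, cv=cv, scoring="accuracy")


if __name__ == "__main__":
    X, y = make_synthetic_cnv()
    scores = cross_validate_pipeline(X, y)
    print("mean 10-fold accuracy on synthetic data:", scores.mean())
```

Placing the selection and PCA steps inside the pipeline means they are refit on each training fold, so the held-out fold does not influence which features are kept. Accuracy on the synthetic data says nothing about the figures reported in the record.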