
Classification of tumor types using XGBoost machine learning model: a vector space transformation of genomic alterations

BACKGROUND: Machine learning (ML) represents a powerful tool to capture relationships between molecular alterations and cancer types and to extract biological information. Here, we developed a plain ML model aimed at distinguishing cancer types based on genetic lesions, providing an additional tool...

Bibliographic Details
Main authors: Zelli, Veronica, Manno, Andrea, Compagnoni, Chiara, Ibraheem, Rasheed Oyewole, Zazzeroni, Francesca, Alesse, Edoardo, Rossi, Fabrizio, Arbib, Claudio, Tessitore, Alessandra
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10664515/
https://www.ncbi.nlm.nih.gov/pubmed/37990214
http://dx.doi.org/10.1186/s12967-023-04720-4
author Zelli, Veronica
Manno, Andrea
Compagnoni, Chiara
Ibraheem, Rasheed Oyewole
Zazzeroni, Francesca
Alesse, Edoardo
Rossi, Fabrizio
Arbib, Claudio
Tessitore, Alessandra
author_sort Zelli, Veronica
collection PubMed
description BACKGROUND: Machine learning (ML) represents a powerful tool to capture relationships between molecular alterations and cancer types and to extract biological information. Here, we developed a plain ML model aimed at distinguishing cancer types based on genetic lesions, providing an additional tool to improve cancer diagnosis, particularly for tumors of unknown origin. METHODS: TCGA data from 9,927 samples spanning 32 different cancer types were downloaded from cBioPortal. A data transformation technique based on the vector space model was designed to build consistently homogeneous new datasets containing, as predictive features, calls for somatic point mutations and copy number variations at chromosome arm-level, thus allowing the use of XGBoost classifier models. Considering the imbalance in the dataset, due to the large differences in the number of cases for each tumor, two preprocessing strategies were considered: i) setting a percentage cut-off threshold to remove less represented cancer types, ii) dividing cancer types into different groups based on biological criteria and training a specific XGBoost model for each of them. The performance of all trained models was mainly assessed by the out-of-sample balanced accuracy (BACC) and the AUC scores. RESULTS: The XGBoost classifier achieved the best performance (BACC 77%; AUC 97%) on a dataset containing the 10 most represented tumor types. Moreover, when the 18 most represented cancers were divided into three different groups (endocrine-related carcinomas, other carcinomas and other cancers), the corresponding models achieved 78%, 71% and 86% BACC, respectively, with AUC scores greater than 96%. In addition, the model capable of linking each group to a specific cancer type reached 81% BACC and 94% AUC. Overall, the diagnostic potential of our model was comparable to or higher than that of others already described in the literature and based on similar molecular data and ML approaches.
CONCLUSIONS: A boosted ML approach able to accurately discriminate different cancer types was developed. The methodology builds datasets that are simpler and more interpretable than the original data, while keeping enough information to accurately train standard ML models without resorting to sophisticated deep learning architectures. In combination with histopathological examinations, this approach could improve cancer diagnosis by using specific DNA alterations, processed by a replicable and easy-to-use automated technology. The study encourages new investigations which could further increase the classifier’s performance, for example by considering more features and dividing tumors into their main molecular subtypes. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12967-023-04720-4.
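The preprocessing and scoring steps named in the description can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the 0/1 encoding of alteration calls, the function names, the 25% cut-off and the toy samples (alteration names and cancer-type labels) are all assumptions chosen for illustration. Balanced accuracy is computed as the mean of per-class recall, which is its standard definition.

```python
from collections import Counter


def filter_rare_classes(samples, min_fraction):
    """Drop samples whose cancer type is below min_fraction of the dataset.

    Hypothetical helper mirroring the 'percentage cut-off threshold' strategy.
    """
    counts = Counter(label for _, label in samples)
    total = len(samples)
    keep = {label for label, n in counts.items() if n / total >= min_fraction}
    return [(alts, label) for alts, label in samples if label in keep]


def to_vector_space(samples):
    """Encode each sample as a fixed-length 0/1 vector over all alterations seen.

    Assumption: a binary presence/absence encoding; the paper may encode calls
    differently.
    """
    vocabulary = sorted({a for alts, _ in samples for a in alts})
    index = {a: i for i, a in enumerate(vocabulary)}
    X = []
    for alts, _ in samples:
        row = [0] * len(vocabulary)
        for a in alts:
            row[index[a]] = 1
        X.append(row)
    y = [label for _, label in samples]
    return X, y, vocabulary


def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy: the unweighted mean of per-class recall."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy data (illustrative only): a set of alteration calls and a label per sample.
samples = [
    ({"TP53_mut", "8q_gain"}, "BRCA"),
    ({"PIK3CA_mut", "1q_gain"}, "BRCA"),
    ({"TP53_mut", "PIK3CA_mut"}, "BRCA"),
    ({"KRAS_mut", "TP53_mut"}, "LUAD"),
    ({"KRAS_mut", "8q_gain"}, "LUAD"),
    ({"VHL_mut", "3p_loss"}, "KIRC"),
]

kept = filter_rare_classes(samples, min_fraction=0.25)  # removes the single KIRC case
X, y, vocabulary = to_vector_space(kept)

# Score a made-up set of predictions against the true labels.
example_pred = ["BRCA", "BRCA", "LUAD", "LUAD", "LUAD"]
bacc = balanced_accuracy(y, example_pred)
```

In a fuller pipeline, `X` and integer-encoded labels could be passed to a gradient-boosted classifier such as `xgboost.XGBClassifier`; that step is left out here so the sketch runs without third-party packages.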
format Online
Article
Text
id pubmed-10664515
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-10664515 2023-11-21 J Transl Med Research BioMed Central 2023-11-21 /pmc/articles/PMC10664515/ /pubmed/37990214 http://dx.doi.org/10.1186/s12967-023-04720-4 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Classification of tumor types using XGBoost machine learning model: a vector space transformation of genomic alterations
topic Research