Fast, accurate, and racially unbiased pan-cancer tumor-only variant calling with tabular machine learning

Accurately identifying somatic mutations is essential for precision oncology and crucial for calculating tumor-mutational burden (TMB), an important predictor of response to immunotherapy. For tumor-only variant calling (i.e., when the cancer biopsy but not the patient’s normal tissue sample is sequ...

Bibliographic Details
Main Authors:	McLaughlin, R. Tyler, Asthana, Maansi, Di Meo, Marc, Ceccarelli, Michele, Jacob, Howard J., Masica, David L.
Format:	Online Article Text
Language:	English
Published:	Nature Publishing Group UK 2023
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9825621/ https://www.ncbi.nlm.nih.gov/pubmed/36611079 http://dx.doi.org/10.1038/s41698-022-00340-1

collection	PubMed
description	Accurately identifying somatic mutations is essential for precision oncology and crucial for calculating tumor-mutational burden (TMB), an important predictor of response to immunotherapy. For tumor-only variant calling (i.e., when the cancer biopsy but not the patient’s normal tissue sample is sequenced), accurately distinguishing somatic mutations from germline variants is a challenging problem that, when unaddressed, results in unreliable, biased, and inflated TMB estimates. Here, we apply machine learning to the task of somatic vs germline classification in tumor-only solid tumor samples using TabNet, XGBoost, and LightGBM, three machine-learning models for tabular data. We constructed a training set for supervised classification using features derived exclusively from tumor-only variant calling and drawing somatic and germline truth labels from an independent pipeline using the patient-matched normal samples. All three trained models achieved state-of-the-art performance on two holdout test datasets: a TCGA dataset including sarcoma, breast adenocarcinoma, and endometrial carcinoma samples (AUC > 94%), and a metastatic melanoma dataset (AUC > 85%). Concordance between matched-normal and tumor-only TMB improves from R² = 0.006 to 0.71–0.76 with the addition of a machine-learning classifier, with LightGBM performing best. Notably, these machine-learning models generalize across cancer subtypes and capture kits with a call rate of 100%. We reproduce the recent finding that tumor-only TMB estimates for Black patients are extremely inflated relative to that of white patients due to the racial biases of germline databases. We show that our approach with XGBoost and LightGBM eliminates this significant racial bias in tumor-only variant calling.
id	pubmed-9825621
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
journal	NPJ Precis Oncol
published	2023-01-07
license	© The Author(s) 2023. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/.