Current cancer driver variant predictors learn to recognize driver genes instead of functional variants
BACKGROUND: Identifying variants that drive tumor progression (driver variants) and distinguishing these from variants that are a byproduct of the uncontrolled cell growth in cancer (passenger variants) is a crucial step for understanding tumorigenesis and precision oncology. Various bioinformatics...
Main Authors: | Raimondi, Daniele; Passemiers, Antoine; Fariselli, Piero; Moreau, Yves |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2021 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7807764/ https://www.ncbi.nlm.nih.gov/pubmed/33441128 http://dx.doi.org/10.1186/s12915-020-00930-0 |
_version_ | 1783636812686688256 |
---|---|
author | Raimondi, Daniele Passemiers, Antoine Fariselli, Piero Moreau, Yves |
author_facet | Raimondi, Daniele Passemiers, Antoine Fariselli, Piero Moreau, Yves |
author_sort | Raimondi, Daniele |
collection | PubMed |
description | BACKGROUND: Identifying variants that drive tumor progression (driver variants) and distinguishing these from variants that are a byproduct of the uncontrolled cell growth in cancer (passenger variants) is a crucial step for understanding tumorigenesis and precision oncology. Various bioinformatics methods have attempted to solve this complex task. RESULTS: In this study, we investigate the assumptions on which these methods are based, showing that the different definitions of driver and passenger variants influence the difficulty of the prediction task. More importantly, we prove that the data sets have a construction bias which prevents the machine learning (ML) methods to actually learn variant-level functional effects, despite their excellent performance. This effect results from the fact that in these data sets, the driver variants map to a few driver genes, while the passenger variants spread across thousands of genes, and thus just learning to recognize driver genes provides almost perfect predictions. CONCLUSIONS: To mitigate this issue, we propose a novel data set that minimizes this bias by ensuring that all genes covered by the data contain both driver and passenger variants. As a result, we show that the tested predictors experience a significant drop in performance, which should not be considered as poorer modeling, but rather as correcting unwarranted optimism. Finally, we propose a weighting procedure to completely eliminate the gene effects on such predictions, thus precisely evaluating the ability of predictors to model the functional effects of single variants, and we show that indeed this task is still open. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at (10.1186/s12915-020-00930-0). |
format | Online Article Text |
id | pubmed-7807764 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-7807764 2021-01-15 Current cancer driver variant predictors learn to recognize driver genes instead of functional variants Raimondi, Daniele Passemiers, Antoine Fariselli, Piero Moreau, Yves BMC Biol Research Article BACKGROUND: Identifying variants that drive tumor progression (driver variants) and distinguishing these from variants that are a byproduct of the uncontrolled cell growth in cancer (passenger variants) is a crucial step for understanding tumorigenesis and precision oncology. Various bioinformatics methods have attempted to solve this complex task. RESULTS: In this study, we investigate the assumptions on which these methods are based, showing that the different definitions of driver and passenger variants influence the difficulty of the prediction task. More importantly, we prove that the data sets have a construction bias which prevents the machine learning (ML) methods to actually learn variant-level functional effects, despite their excellent performance. This effect results from the fact that in these data sets, the driver variants map to a few driver genes, while the passenger variants spread across thousands of genes, and thus just learning to recognize driver genes provides almost perfect predictions. CONCLUSIONS: To mitigate this issue, we propose a novel data set that minimizes this bias by ensuring that all genes covered by the data contain both driver and passenger variants. As a result, we show that the tested predictors experience a significant drop in performance, which should not be considered as poorer modeling, but rather as correcting unwarranted optimism. Finally, we propose a weighting procedure to completely eliminate the gene effects on such predictions, thus precisely evaluating the ability of predictors to model the functional effects of single variants, and we show that indeed this task is still open. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at (10.1186/s12915-020-00930-0). BioMed Central 2021-01-13 /pmc/articles/PMC7807764/ /pubmed/33441128 http://dx.doi.org/10.1186/s12915-020-00930-0 Text en © The Author(s) 2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Article Raimondi, Daniele Passemiers, Antoine Fariselli, Piero Moreau, Yves Current cancer driver variant predictors learn to recognize driver genes instead of functional variants |
title | Current cancer driver variant predictors learn to recognize driver genes instead of functional variants |
title_sort | current cancer driver variant predictors learn to recognize driver genes instead of functional variants |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7807764/ https://www.ncbi.nlm.nih.gov/pubmed/33441128 http://dx.doi.org/10.1186/s12915-020-00930-0 |
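
The construction bias described in the abstract can be illustrated with a small toy sketch. It is not the authors' code or data: the gene names (`GENE_A`, `P0`, `Q0`, and so on), the variant identifiers, and the `gene_only_predictor` helper are all hypothetical, and the data are synthetic. The sketch shows a predictor that ignores the variant entirely and only memorizes which genes carried mostly driver variants in training. On a set where drivers cluster in a few genes it looks perfect; on a set where every gene holds both classes it drops to chance.

```python
from collections import defaultdict

def gene_only_predictor(train):
    """Build a predictor that looks only at the gene, never at the variant.

    `train` is a list of (gene, variant_id, label) with label 1 = driver,
    0 = passenger. A gene is flagged as a "driver gene" if it had strictly
    more driver than passenger variants in training.
    """
    counts = defaultdict(lambda: [0, 0])  # gene -> [n_passenger, n_driver]
    for gene, _variant, label in train:
        counts[gene][label] += 1
    driver_genes = {g for g, (n_pass, n_drv) in counts.items() if n_drv > n_pass}
    # Unseen genes fall through to "passenger".
    return lambda gene: 1 if gene in driver_genes else 0

def accuracy(predict, data):
    """Fraction of variants whose label matches the gene-level prediction."""
    return sum(predict(gene) == label for gene, _variant, label in data) / len(data)

# Synthetic "biased" set: drivers sit in two genes, passengers are spread out.
biased_train = (
    [("GENE_A", f"a{i}", 1) for i in range(5)]
    + [("GENE_B", f"b{i}", 1) for i in range(5)]
    + [(f"P{i}", "p", 0) for i in range(10)]
)
biased_test = [("GENE_A", "a9", 1), ("GENE_B", "b9", 1), ("Q0", "q", 0), ("Q1", "q", 0)]

# Synthetic "balanced" set: every gene contains both drivers and passengers.
balanced_train = [
    (gene, f"{gene}_{i}", label)
    for gene in ("GENE_A", "GENE_B")
    for label in (0, 1)
    for i in range(3)
]
balanced_test = [
    (gene, f"{gene}_test_{label}", label)
    for gene in ("GENE_A", "GENE_B")
    for label in (0, 1)
]

biased_acc = accuracy(gene_only_predictor(biased_train), biased_test)
balanced_acc = accuracy(gene_only_predictor(balanced_train), balanced_test)
print(f"biased set accuracy:   {biased_acc:.2f}")
print(f"balanced set accuracy: {balanced_acc:.2f}")
```

In this toy setting the gene-only baseline is a useful sanity check: if a variant-level predictor does not beat it on a gene-balanced set, its apparent skill on the biased set may come from gene identity alone.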