Current cancer driver variant predictors learn to recognize driver genes instead of functional variants

BACKGROUND: Identifying variants that drive tumor progression (driver variants) and distinguishing these from variants that are a byproduct of the uncontrolled cell growth in cancer (passenger variants) is a crucial step for understanding tumorigenesis and precision oncology. Various bioinformatics...

Bibliographic Details
Main Authors: Raimondi, Daniele; Passemiers, Antoine; Fariselli, Piero; Moreau, Yves
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7807764/
https://www.ncbi.nlm.nih.gov/pubmed/33441128
http://dx.doi.org/10.1186/s12915-020-00930-0
collection PubMed
description BACKGROUND: Identifying variants that drive tumor progression (driver variants) and distinguishing them from variants that are a byproduct of the uncontrolled cell growth in cancer (passenger variants) is a crucial step for understanding tumorigenesis and for precision oncology. Various bioinformatics methods have attempted to solve this complex task.
RESULTS: In this study, we investigate the assumptions on which these methods are based, showing that the different definitions of driver and passenger variants influence the difficulty of the prediction task. More importantly, we prove that the data sets have a construction bias which prevents the machine learning (ML) methods from actually learning variant-level functional effects, despite their excellent performance. This effect arises because, in these data sets, the driver variants map to a few driver genes, while the passenger variants are spread across thousands of genes; thus, just learning to recognize driver genes provides almost perfect predictions.
CONCLUSIONS: To mitigate this issue, we propose a novel data set that minimizes this bias by ensuring that all genes covered by the data contain both driver and passenger variants. As a result, we show that the tested predictors experience a significant drop in performance, which should not be considered as poorer modeling, but rather as a correction of unwarranted optimism. Finally, we propose a weighting procedure to completely eliminate the gene effects on such predictions, thus precisely evaluating the ability of predictors to model the functional effects of single variants, and we show that this task is indeed still open.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12915-020-00930-0.
format Online
Article
Text
id pubmed-7807764
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7807764 2021-01-15 BMC Biol Research Article BioMed Central 2021-01-13 /pmc/articles/PMC7807764/ /pubmed/33441128 http://dx.doi.org/10.1186/s12915-020-00930-0 Text en
© The Author(s) 2021. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
topic Research Article