
Comparison of Variable Selection Methods for Time-to-Event Data in High-Dimensional Settings

Over the last decades, molecular signatures have become increasingly important in oncology and are opening up a new area of personalized medicine. Nevertheless, biological relevance and statistical tools necessary for the development of these signatures have been called into question in the literatu...


Bibliographic Details
Main authors: Gilhodes, Julia; Dalenc, Florence; Gal, Jocelyn; Zemmour, Christophe; Leconte, Eve; Boher, Jean-Marie; Filleron, Thomas
Format: Online Article Text
Language: English
Published: Hindawi 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7350178/
https://www.ncbi.nlm.nih.gov/pubmed/32670394
http://dx.doi.org/10.1155/2020/6795392
author Gilhodes, Julia
Dalenc, Florence
Gal, Jocelyn
Zemmour, Christophe
Leconte, Eve
Boher, Jean-Marie
Filleron, Thomas
collection PubMed
description Over the last decades, molecular signatures have become increasingly important in oncology and are opening up a new area of personalized medicine. Nevertheless, biological relevance and statistical tools necessary for the development of these signatures have been called into question in the literature. Here, we investigate six typical selection methods for high-dimensional settings and survival endpoints, including LASSO and some of its extensions, component-wise boosting, and random survival forests (RSF). A resampling algorithm based on data splitting was used on nine high-dimensional simulated datasets to assess selection stability on training sets and the intersection between selection methods. Prognostic performances were evaluated on respective validation sets. Finally, one application on a real breast cancer dataset has been proposed. The false discovery rate (FDR) was high for each selection method, and the intersection between lists of predictors was very poor. RSF selects many more variables than the other methods and thus becomes less efficient on validation sets. Due to the complex correlation structure in genomic data, stability in the selection procedure is generally poor for selected predictors, but can be improved with a higher training sample size. In a very high-dimensional setting, we recommend the LASSO-pcvl method since it outperforms other methods by reducing the number of selected genes and minimizing FDR in most scenarios. Nevertheless, this method still gives a high rate of false positives. Further work is thus necessary to propose new methods to overcome this issue where numerous predictors are present. Pluridisciplinary discussion between clinicians and statisticians is necessary to ensure both statistical and biological relevance of the predictors included in molecular signatures.
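The description above refers to three quantities: the false discovery rate of a selected list, selection stability across resampled training sets, and the intersection between the lists returned by different methods. The following is a minimal sketch of how such quantities could be computed once each run has produced a list of selected variables. All function names, the gene labels, and the use of selection frequency and the Jaccard index as the stability and overlap measures are assumptions made for illustration; the record does not state which exact measures the authors used.

```python
import random
from collections import Counter


def false_discovery_rate(selected, true_predictors):
    """Share of selected variables that are not true predictors.

    Returns 0.0 when nothing is selected (a convention assumed here).
    """
    selected = set(selected)
    if not selected:
        return 0.0
    return len(selected - set(true_predictors)) / len(selected)


def selection_frequency(selections):
    """Fraction of resampling runs in which each variable was selected."""
    counts = Counter(v for run in selections for v in set(run))
    return {v: c / len(selections) for v, c in counts.items()}


def overlap(list_a, list_b):
    """Jaccard index between two lists of selected variables."""
    a, b = set(list_a), set(list_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def train_validation_split(n_samples, train_fraction, rng):
    """Randomly split sample indices into a training and a validation part."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = int(round(n_samples * train_fraction))
    return indices[:cut], indices[cut:]


# Toy illustration with hypothetical gene labels (not data from the article).
truth = {"g1", "g2", "g3"}
runs = [["g1", "g2", "g7"], ["g1", "g9"], ["g1", "g2", "g3", "g8"]]

fdr_per_run = [false_discovery_rate(run, truth) for run in runs]
frequency = selection_frequency(runs)
train_idx, valid_idx = train_validation_split(100, 0.5, random.Random(0))
```

In a simulation the true predictors are known, so the FDR can be computed directly; on real data only the stability and overlap measures are available.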
format Online
Article
Text
id pubmed-7350178
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Hindawi
record_format MEDLINE/PubMed
spelling pubmed-7350178 2020-07-14 Comparison of Variable Selection Methods for Time-to-Event Data in High-Dimensional Settings. Gilhodes, Julia; Dalenc, Florence; Gal, Jocelyn; Zemmour, Christophe; Leconte, Eve; Boher, Jean-Marie; Filleron, Thomas. Comput Math Methods Med. Research Article. Hindawi 2020-07-01 /pmc/articles/PMC7350178/ /pubmed/32670394 http://dx.doi.org/10.1155/2020/6795392 Text en Copyright © 2020 Julia Gilhodes et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Comparison of Variable Selection Methods for Time-to-Event Data in High-Dimensional Settings
topic Research Article