
A comparative study of forest methods for time-to-event data: variable selection and predictive performance

BACKGROUND: As a popular method in the machine learning field, the forest approach is an attractive alternative to the Cox model. Random survival forests (RSF) is the most popular survival forest method, but it has drawbacks, such as a selection bias towards covariates with many pos...


Bibliographic Details
Main authors:	Liu, Yingxin, Zhou, Shiyu, Wei, Hongxia, An, Shengli
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2021
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8465777/ https://www.ncbi.nlm.nih.gov/pubmed/34563138 http://dx.doi.org/10.1186/s12874-021-01386-8

author	Liu, Yingxin; Zhou, Shiyu; Wei, Hongxia; An, Shengli
author_facet	Liu, Yingxin; Zhou, Shiyu; Wei, Hongxia; An, Shengli
author_sort	Liu, Yingxin
collection	PubMed
description	BACKGROUND: As a popular method in the machine learning field, the forest approach is an attractive alternative to the Cox model. Random survival forests (RSF) is the most popular survival forest method, but it has drawbacks, such as a selection bias towards covariates with many possible split points. Conditional inference forests (CIF) is known to reduce this selection bias via a two-step split procedure that implements hypothesis tests, separating variable selection from splitting, but it is computationally expensive. The recently proposed random forests with maximally selected rank statistics (MSR-RF) seems to be a great improvement on RSF and CIF. METHODS: In this paper we used a simulation study and a real data application to compare the prediction performance and variable selection performance of three survival forest methods: RSF, CIF and MSR-RF. To evaluate variable selection performance, we combined all simulations to calculate how frequently the correct variables ranked at the top of the variable importance measures, where a higher frequency means better selection ability. We used the Integrated Brier Score (IBS) and the c-index to measure the prediction accuracy of all three methods. The smaller the IBS value, the better the prediction. RESULTS: Simulations show that the three forest methods differ slightly in prediction performance. MSR-RF and RSF might perform better than CIF when there are only continuous or binary variables in the datasets. For variable selection performance, when there are multiple categorical variables in the datasets, the selection frequency of RSF seems to be the lowest in most cases. MSR-RF and CIF have higher selection rates, and CIF performs especially well with the interaction term. The fact that the degree of correlation among the variables has little effect on the selection frequency indicates that the three forest methods can handle correlated data. When there are only continuous variables in the datasets, MSR-RF performs better. When there are only binary variables in the datasets, RSF and MSR-RF have more advantages than CIF. When the variable dimension increases, MSR-RF and RSF seem to be more robust than CIF. CONCLUSIONS: All three methods show advantages in prediction performance and variable selection performance under different situations. The recently proposed MSR-RF methodology possesses practical value and is well worth popularizing. It is important to identify the appropriate method in real use according to the research aim and the nature of the covariates. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12874-021-01386-8.
format	Online Article Text
id	pubmed-8465777
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-8465777 2021-09-27 A comparative study of forest methods for time-to-event data: variable selection and predictive performance Liu, Yingxin; Zhou, Shiyu; Wei, Hongxia; An, Shengli BMC Med Res Methodol Research BioMed Central 2021-09-25 /pmc/articles/PMC8465777/ /pubmed/34563138 http://dx.doi.org/10.1186/s12874-021-01386-8 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	A comparative study of forest methods for time-to-event data: variable selection and predictive performance
title_sort	comparative study of forest methods for time-to-event data: variable selection and predictive performance
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8465777/ https://www.ncbi.nlm.nih.gov/pubmed/34563138 http://dx.doi.org/10.1186/s12874-021-01386-8
