Combined performance of screening and variable selection methods in ultra-high dimensional data in predicting time-to-event outcomes

BACKGROUND: Building prognostic models of clinical outcomes is an increasingly important research task and will remain a vital area in genomic medicine. Prognostic models of clinical outcomes are usually built and validated utilizing variable selection methods and machine learning tools. The challen...

Bibliographic Details
Main authors:	Pi, Lira; Halabi, Susan
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Methodology
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6214199/ https://www.ncbi.nlm.nih.gov/pubmed/30393771 http://dx.doi.org/10.1186/s41512-018-0043-4

collection	PubMed
description	BACKGROUND: Building prognostic models of clinical outcomes is an increasingly important research task and will remain a vital area in genomic medicine. Prognostic models of clinical outcomes are usually built and validated utilizing variable selection methods and machine learning tools. The challenges, however, in ultra-high dimensional space are not only to reduce the dimensionality of the data, but also to retain the important variables which predict the outcome. Screening approaches, such as the sure independence screening (SIS), iterative SIS (ISIS), and principled SIS (PSIS), have been developed to overcome the challenge of high dimensionality. We are interested in identifying important single-nucleotide polymorphisms (SNPs) and integrating them into a validated prognostic model of overall survival in patients with metastatic prostate cancer. While the abovementioned variable selection approaches have theoretical justification in selecting SNPs, the comparison and the performance of these combined methods in predicting time-to-event outcomes have not been previously studied in ultra-high dimensional space with hundreds of thousands of variables.
METHODS: We conducted a series of simulations to compare the performance of different combinations of variable selection approaches and classification trees, such as the least absolute shrinkage and selection operator (LASSO), adaptive least absolute shrinkage and selection operator (ALASSO), and random survival forest (RSF), in ultra-high dimensional setting data for the purpose of developing prognostic models for a time-to-event outcome that is subject to censoring. The variable selection methods were evaluated for discrimination (Harrell’s concordance statistic), calibration, and overall performance. In addition, we applied these approaches to 498,081 SNPs from 623 Caucasian patients with prostate cancer.
RESULTS: When n = 300, ISIS-LASSO and ISIS-ALASSO chose all the informative variables which resulted in the highest Harrell’s c-index (> 0.80). On the other hand, with a small sample size (n = 150), ALASSO performed better than any other combinations as demonstrated by the highest c-index and/or overall performance, although there was evidence of overfitting. In analyzing the prostate cancer data, ISIS-ALASSO, SIS-LASSO, and SIS-ALASSO combinations achieved the highest discrimination with c-index of 0.67.
CONCLUSIONS: Choosing the appropriate variable selection method for training a model is a critical step in developing a robust prognostic model. Based on the simulation studies, the effective use of ALASSO or a combination of methods, such as ISIS-LASSO and ISIS-ALASSO, allows both for the development of prognostic models with high predictive accuracy and a low risk of overfitting assuming moderate sample sizes.
id	pubmed-6214199
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-6214199 2018-11-02 Combined performance of screening and variable selection methods in ultra-high dimensional data in predicting time-to-event outcomes Pi, Lira Halabi, Susan Diagn Progn Res Methodology BioMed Central 2018-09-26 /pmc/articles/PMC6214199/ /pubmed/30393771 http://dx.doi.org/10.1186/s41512-018-0043-4 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.