Simulation of complex data structures for planning of studies with focus on biomarker comparison

Bibliographic Details
Main authors: Schulz, Andreas, Zöller, Daniela, Nickels, Stefan, Beutel, Manfred E., Blettner, Maria, Wild, Philipp S., Binder, Harald
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5470184/
https://www.ncbi.nlm.nih.gov/pubmed/28610631
http://dx.doi.org/10.1186/s12874-017-0364-y
author Schulz, Andreas
Zöller, Daniela
Nickels, Stefan
Beutel, Manfred E.
Blettner, Maria
Wild, Philipp S.
Binder, Harald
author_sort Schulz, Andreas
collection PubMed
description BACKGROUND: A growing number of observational studies do not focus only on single biomarkers for predicting an outcome event, but address questions in a multivariable setting. For example, when quantifying the added value of new biomarkers in addition to established risk factors, the aim might be to rank several new markers with respect to their prediction performance. This makes it important to consider the marker correlation structure when planning such a study. Because of this complexity, a simulation approach may be required to adequately assess sample size or other aspects, such as the choice of a performance measure.
METHODS: In a simulation study based on real data, we investigated how to generate covariates with realistic distributions and which generating model should be used for the outcome, aiming to determine the least amount of information and complexity needed to obtain realistic results. A large epidemiological cohort study, the Gutenberg Health Study, was used as the basis for the simulation. The added value of markers was quantified and ranked in data sets subsampled from this population, and simulation approaches were judged by the quality of the ranking. One of the evaluated approaches, the random forest, requires original data at the individual level. Therefore, the effect of the size of a pilot study on random forest based simulation was also investigated.
RESULTS: We found that simple logistic regression models failed to adequately generate realistic data, even with extensions such as interaction terms or non-linear effects. The random forest approach was seen to be more appropriate for simulation of complex data structures. Pilot studies starting at about 250 observations were seen to provide a reasonable level of information for this approach.
CONCLUSIONS: We advise avoiding oversimplified regression models for simulation, in particular when focusing on multivariable research questions. More generally, a simulation should be based on real data to adequately reflect complex observational data structures, such as those found in epidemiological cohort studies.
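The planning idea in the abstract (simulate correlated markers, generate an outcome, and check how reliably the markers are ranked at a given sample size) can be illustrated with a minimal sketch. It is not the authors' method: it uses the simple logistic generating model that the abstract reports as inadequate for realistic data, two hypothetical markers, and univariate AUC as a stand-in performance measure. All function names, effect sizes, the correlation `rho`, and the intercept are assumptions chosen for illustration only.

```python
import math
import random


def simulate_cohort(n, rho, betas, intercept, rng):
    """Draw n subjects with two correlated standard-normal markers and a binary outcome.

    Assumption: the outcome follows a plain logistic model, which is the kind of
    oversimplified generator the abstract advises against for real planning.
    """
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = z1
        x2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2  # gives corr(x1, x2) = rho
        lin = intercept + betas[0] * x1 + betas[1] * x2
        p = 1.0 / (1.0 + math.exp(-lin))
        y = 1 if rng.random() < p else 0
        rows.append((x1, x2, y))
    return rows


def auc(scores, labels):
    """AUC computed as the Mann-Whitney statistic; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")  # undefined when one outcome class is missing
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def ranking_agreement(n, reps, rho=0.5, betas=(0.8, 0.3), intercept=-1.0, seed=1):
    """Share of simulated studies of size n in which marker 1 outranks marker 2 by AUC."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        data = simulate_cohort(n, rho, betas, intercept, rng)
        labels = [row[2] for row in data]
        auc1 = auc([row[0] for row in data], labels)
        auc2 = auc([row[1] for row in data], labels)
        hits += auc1 > auc2  # a NaN AUC counts as a miss
    return hits / reps


if __name__ == "__main__":
    for n in (50, 250, 1000):
        print(n, ranking_agreement(n, reps=100))
```

Running `ranking_agreement` over a grid of sample sizes shows how often the assumed "true" ranking is recovered, which is the kind of quantity a planning simulation would report. Following the abstract's conclusion, a realistic application would replace `simulate_cohort` with a generator fitted to individual-level pilot data, for example a random forest.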
format Online
Article
Text
id pubmed-5470184
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5470184 2017-06-19 BMC Med Res Methodol Research Article BioMed Central 2017-06-13 /pmc/articles/PMC5470184/ /pubmed/28610631 http://dx.doi.org/10.1186/s12874-017-0364-y Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Simulation of complex data structures for planning of studies with focus on biomarker comparison
topic Research Article