
Statistical tests and identifiability conditions for pooling and analyzing multisite datasets

When sample sizes are small, the ability to identify weak (but scientifically interesting) associations between a set of predictors and a response may be enhanced by pooling existing datasets. However, variations in acquisition methods and the distribution of participants or observations between dat...


Bibliographic Details
Main Authors: Zhou, Hao Henry, Singh, Vikas, Johnson, Sterling C., Wahba, Grace
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5816202/
https://www.ncbi.nlm.nih.gov/pubmed/29386387
http://dx.doi.org/10.1073/pnas.1719747115
author Zhou, Hao Henry
Singh, Vikas
Johnson, Sterling C.
Wahba, Grace
collection PubMed
description When sample sizes are small, the ability to identify weak (but scientifically interesting) associations between a set of predictors and a response may be enhanced by pooling existing datasets. However, variations in acquisition methods and the distribution of participants or observations between datasets, especially due to the distributional shifts in some predictors, may obfuscate real effects when datasets are combined. We present a rigorous statistical treatment of this problem and identify conditions where we can correct the distributional shift. We also provide an algorithm for the situation where the correction is identifiable. We analyze various properties of the framework for testing model fit, constructing confidence intervals, and evaluating consistency characteristics. Our technical development is motivated by Alzheimer’s disease (AD) studies, and we present empirical results showing that our framework enables harmonizing of protein biomarkers, even when the assays across sites differ. Our contribution may, in part, mitigate a bottleneck that researchers face in clinical research when pooling smaller sized datasets and may offer benefits when the subjects of interest are difficult to recruit or when resources prohibit large single-site studies.
format Online
Article
Text
id pubmed-5816202
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-5816202 2018-02-21 Statistical tests and identifiability conditions for pooling and analyzing multisite datasets Zhou, Hao Henry Singh, Vikas Johnson, Sterling C. Wahba, Grace Proc Natl Acad Sci U S A Physical Sciences National Academy of Sciences 2018-02-13 2018-01-31 /pmc/articles/PMC5816202/ /pubmed/29386387 http://dx.doi.org/10.1073/pnas.1719747115 Text en Copyright © 2018 the Author(s). Published by PNAS. https://creativecommons.org/licenses/by-nc-nd/4.0/ This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/).
title Statistical tests and identifiability conditions for pooling and analyzing multisite datasets
topic Physical Sciences