
Forming Big Datasets through Latent Class Concatenation of Imperfectly Matched Databases Features


Bibliographic Details
Main Authors:	Bartlett, Christopher W., Klamer, Brett G., Buyske, Steven, Petrill, Stephen A., Ray, William C.
Format:	Online Article Text
Language:	English
Published:	MDPI 2019
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6771148/ https://www.ncbi.nlm.nih.gov/pubmed/31546899 http://dx.doi.org/10.3390/genes10090727

_version_	1783455636008206336
author	Bartlett, Christopher W.; Klamer, Brett G.; Buyske, Steven; Petrill, Stephen A.; Ray, William C.
author_sort	Bartlett, Christopher W.
collection	PubMed
description	Informatics researchers often need to combine data from many different sources to increase statistical power and study subtle or complicated effects. Perfect overlap of measurements across academic studies is rare, since virtually every dataset is collected for a unique purpose and without coordination with parties not at hand (i.e., informatics researchers in the future). Thus, incomplete concordance of measurements across datasets poses a major challenge for researchers seeking to combine public databases. In any given field, some measurements are fairly standard, but every organization collecting data makes unique decisions on instruments, protocols, and methods of processing the data. This typically precludes literal concatenation of the raw data, since constituent cohorts do not have the same measurements (i.e., columns of data). When measurements across datasets are similar prima facie, there is a desire to combine the data to increase power, but mixing non-identical measurements could greatly reduce the sensitivity of the downstream analysis. Here, we discuss a statistical method that is applicable when certain patterns of missing data are found; namely, it is possible to combine datasets that measure the same underlying constructs (or latent traits) when there is only partial overlap of measurements across the constituent datasets. Our method, ROSETTA, empirically derives a set of common latent trait metrics for each related measurement domain using a novel variation of factor analysis to ensure equivalence across the constituent datasets. The advantage of combining datasets this way is the simplicity, statistical power, and modeling flexibility of a single joint analysis of all the data. Three simulation studies show the performance of ROSETTA on datasets with only partially overlapping measurements (i.e., systematically missing information), benchmarked against a condition of perfectly overlapping data (i.e., full information). The first study examined a range of correlations, while the second study was modeled after the observed correlations in a well-characterized clinical, behavioral cohort. Both studies consistently show significant correlations >0.94, often >0.96, indicating the robustness of the method and validating the general approach. The third study varied within-domain and between-domain correlations and compared ROSETTA to multiple imputation and meta-analysis, two commonly used methods that ostensibly solve the same data integration problem. We provide one alternative to meta-analysis and multiple imputation by developing a method that statistically equates similar but distinct manifest metrics into a set of empirically derived metrics that can be used for analysis across all datasets.
format	Online Article Text
id	pubmed-6771148
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-6771148 2019-10-30 Forming Big Datasets through Latent Class Concatenation of Imperfectly Matched Databases Features Bartlett, Christopher W.; Klamer, Brett G.; Buyske, Steven; Petrill, Stephen A.; Ray, William C. Genes (Basel) Article MDPI 2019-09-19 /pmc/articles/PMC6771148/ /pubmed/31546899 http://dx.doi.org/10.3390/genes10090727 Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title	Forming Big Datasets through Latent Class Concatenation of Imperfectly Matched Databases Features
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6771148/ https://www.ncbi.nlm.nih.gov/pubmed/31546899 http://dx.doi.org/10.3390/genes10090727
