Estimating summary statistics for electronic health record laboratory data for use in high-throughput phenotyping algorithms

We study the question of how to represent or summarize raw laboratory data taken from an electronic health record (EHR) using parametric model selection to reduce or cope with biases induced through clinical care. It has been previously demonstrated that the health care process (Hripcsak and Albers,...

Bibliographic Details
Main authors:	Albers, D.J., Elhadad, N., Claassen, J., Perotte, R., Goldstein, A., Hripcsak, G.
Format:	Online Article Text
Language:	English
Published:	2018
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5856130/ https://www.ncbi.nlm.nih.gov/pubmed/29369797 http://dx.doi.org/10.1016/j.jbi.2018.01.004

_version_	1783307253538881536
author	Albers, D.J. Elhadad, N. Claassen, J. Perotte, R. Goldstein, A. Hripcsak, G.
author_sort	Albers, D.J.
collection	PubMed
description	We study the question of how to represent or summarize raw laboratory data taken from an electronic health record (EHR) using parametric model selection to reduce or cope with biases induced through clinical care. It has been previously demonstrated that the health care process (Hripcsak and Albers, 2012, 2013), as defined by measurement context (Hripcsak and Albers, 2013; Albers et al., 2012) and measurement patterns (Albers and Hripcsak, 2010, 2012), can influence how EHR data are distributed statistically (Kohane and Weber, 2013; Pivovarov et al., 2014). We construct an algorithm, PopKLD, which is based on information criterion model selection (Burnham and Anderson, 2002; Claeskens and Hjort, 2008) and is intended to reduce and cope with health care process biases and to produce an intuitively understandable continuous summary. The PopKLD algorithm can be automated and is designed to be applicable in high-throughput settings; for example, the output of the PopKLD algorithm can be used as input for phenotyping algorithms. Moreover, we develop the PopKLD-CAT algorithm, which transforms the continuous PopKLD summary into a categorical summary useful for applications that require categorical data, such as topic modeling. We evaluate our methodology in two ways. First, we apply the method to laboratory data collected in two different health care contexts, primary versus intensive care. We show that the PopKLD preserves known physiologic features in the data that are lost when summarizing the data using more common laboratory data summaries such as mean and standard deviation. Second, for three disease-laboratory measurement pairs, we perform a phenotyping task: we use the PopKLD and PopKLD-CAT algorithms to define high and low values of the laboratory variable that are used for defining a disease state. We then compare the relationship between the PopKLD-CAT summary disease predictions and the same predictions using empirically estimated mean and standard deviation to a gold standard generated by clinical review of patient records. We find that the PopKLD laboratory data summary is substantially better at predicting disease state. The PopKLD and PopKLD-CAT algorithms are not meant to be used as phenotyping algorithms, but we use the phenotyping task to show what information can be gained when using a more informative laboratory data summary. In the process of evaluating our method, we show that different clinical contexts and laboratory measurements necessitate different statistical summaries. Similarly, leveraging the principle of maximum entropy, we argue that while some laboratory data only have sufficient information to estimate a mean and standard deviation, other laboratory data captured in an EHR contain substantially more information that can be captured in higher-parameter models.
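The idea of information criterion model selection mentioned in the description can be illustrated with a minimal sketch. This is not the PopKLD algorithm itself, whose details are not given in this record. It assumes two candidate families (normal and log-normal), closed-form maximum-likelihood fits, and the Akaike information criterion (AIC) as the selection criterion. All function names (`fit_normal`, `fit_lognormal`, `aic`, `select_model`) and the synthetic data are hypothetical.

```python
import math
import random


def fit_normal(xs):
    """Maximum-likelihood fit of a normal model; returns parameters and log-likelihood."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # MLE variance (divides by n); assumed > 0
    loglik = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return {"model": "normal", "params": (mu, math.sqrt(var)), "loglik": loglik}


def fit_lognormal(xs):
    """Maximum-likelihood fit of a log-normal model; assumes all values are positive."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n  # assumed > 0
    # Normal log-likelihood of log(x) plus the Jacobian term -sum(log x).
    loglik = -0.5 * n * (math.log(2 * math.pi * var) + 1) - sum(logs)
    return {"model": "lognormal", "params": (mu, math.sqrt(var)), "loglik": loglik}


def aic(fit, k=2):
    """Akaike information criterion; both candidate models here have k = 2 parameters."""
    return 2 * k - 2 * fit["loglik"]


def select_model(xs):
    """Return the candidate fit with the lowest AIC."""
    return min((fit_normal(xs), fit_lognormal(xs)), key=aic)


# Synthetic, right-skewed "laboratory values" (illustrative only, not real data).
random.seed(0)
values = [random.lognormvariate(0.0, 1.0) for _ in range(2000)]
best = select_model(values)
print(best["model"], best["params"])
```

The parameters of the selected model, rather than a fixed mean and standard deviation, would then serve as the summary of the laboratory variable. A fuller treatment would compare more candidate families, including ones with more than two parameters.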
format	Online Article Text
id	pubmed-5856130
institution	National Center for Biotechnology Information
language	English
publishDate	2018
record_format	MEDLINE/PubMed
spelling	pubmed-5856130 2018-03-16 Estimating summary statistics for electronic health record laboratory data for use in high-throughput phenotyping algorithms Albers, D.J. Elhadad, N. Claassen, J. Perotte, R. Goldstein, A. Hripcsak, G. J Biomed Inform Article 2018-01-31 2018-02 /pmc/articles/PMC5856130/ /pubmed/29369797 http://dx.doi.org/10.1016/j.jbi.2018.01.004 Text en This is an open access article under the CC BY license (https://creativecommons.org/licenses/by/4.0/).
title	Estimating summary statistics for electronic health record laboratory data for use in high-throughput phenotyping algorithms
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5856130/ https://www.ncbi.nlm.nih.gov/pubmed/29369797 http://dx.doi.org/10.1016/j.jbi.2018.01.004