Treatment of missing data in Bayesian network structure learning: an application to linked biomedical and social survey data

BACKGROUND: Availability of linked biomedical and social science data has risen dramatically in past decades, facilitating holistic and systems-based analyses. Among these, Bayesian networks have great potential to tackle complex interdisciplinary problems, because they can easily model inter-relati...

Bibliographic Details
Main Authors: Ke, Xuejia, Keenan, Katherine, Smith, V. Anne
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9761946/
https://www.ncbi.nlm.nih.gov/pubmed/36536286
http://dx.doi.org/10.1186/s12874-022-01781-9
_version_ 1784852771427581952
author Ke, Xuejia
Keenan, Katherine
Smith, V. Anne
collection PubMed
description BACKGROUND: Availability of linked biomedical and social science data has risen dramatically in past decades, facilitating holistic and systems-based analyses. Among these, Bayesian networks have great potential to tackle complex interdisciplinary problems, because they can easily model inter-relations between variables. They work by encoding conditional independence relationships discovered via advanced inference algorithms. One challenge is dealing with missing data, ubiquitous in survey or biomedical datasets. Missing data is rarely addressed in an advanced way in Bayesian networks; the most common approach is to discard all samples containing missing measurements. This can lead to biased estimates. Here, we examine how Bayesian network structure learning can incorporate missing data. METHODS: We use a simulation approach to compare a commonly used method in frequentist statistics, multiple imputation by chained equations (MICE), with one specific for Bayesian network learning, structural expectation-maximization (SEM). We simulate multiple incomplete categorical (discrete) data sets with different missingness mechanisms, variable numbers, data amount, and missingness proportions. We evaluate performance of MICE and SEM in capturing network structure. We then apply SEM combined with community analysis to a real-world dataset of linked biomedical and social data to investigate associations between socio-demographic factors and multiple chronic conditions in the US elderly population. RESULTS: We find that applying either method (MICE or SEM) provides better structure recovery than doing nothing, and SEM in general outperforms MICE. This finding is robust across missingness mechanisms, variable numbers, data amount and missingness proportions. We also find that imputed data from SEM is more accurate than from MICE. Our real-world application recovers known inter-relationships among socio-demographic factors and common multimorbidities. 
This network analysis also highlights potential areas of investigation, such as links between cancer and cognitive impairment and a disconnect between self-assessed memory decline and standard cognitive impairment measurement. CONCLUSION: Our simulation results suggest that taking advantage of the additional information provided by network structure during SEM improves the performance of Bayesian networks; this might be especially useful for social science and other interdisciplinary analyses. Our case study shows that comorbidities of different diseases interact with each other and are closely associated with socio-demographic factors. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12874-022-01781-9.
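The abstract notes that the most common way of handling missing data is to discard every sample containing a missing measurement. The following is a minimal sketch of what that costs, not the authors' simulation code. It assumes values are missing completely at random (MCAR), and all names and settings (`make_mcar`, `complete_cases`, 2000 rows, 10 variables, 10% missingness) are hypothetical choices made for illustration. Under this assumption the expected share of fully observed rows is (1 - p)^k for k variables with per-cell missingness p.

```python
import random


def make_mcar(rows, missing_prop, rng):
    """Return a copy of rows in which each cell is independently replaced
    by None with probability missing_prop (MCAR assumption)."""
    return [
        [None if rng.random() < missing_prop else value for value in row]
        for row in rows
    ]


def complete_cases(rows):
    """Listwise deletion: keep only rows without any missing cell."""
    return [row for row in rows if all(value is not None for value in row)]


# Hypothetical settings, chosen only for illustration.
rng = random.Random(0)
n_rows, n_vars, missing_prop = 2000, 10, 0.10
levels = ["a", "b", "c"]

# Categorical (discrete) data with independent columns; a real study would
# sample from a known Bayesian network instead.
data = [[rng.choice(levels) for _ in range(n_vars)] for _ in range(n_rows)]

incomplete = make_mcar(data, missing_prop, rng)
kept = complete_cases(incomplete)

expected_fraction = (1 - missing_prop) ** n_vars
observed_fraction = len(kept) / n_rows

print(f"expected share of complete rows: {expected_fraction:.3f}")
print(f"observed share of complete rows: {observed_fraction:.3f}")
```

With these settings only about a third of the rows are expected to survive deletion, even though just 10% of cells are missing. The sketch shows the loss of sample size only; under MCAR deletion does not by itself bias estimates, and the bias the abstract mentions arises when missingness depends on the data.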
format Online
Article
Text
id pubmed-9761946
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-9761946 2022-12-20 Treatment of missing data in Bayesian network structure learning: an application to linked biomedical and social survey data Ke, Xuejia Keenan, Katherine Smith, V. Anne BMC Med Res Methodol Research BioMed Central 2022-12-19 /pmc/articles/PMC9761946/ /pubmed/36536286 http://dx.doi.org/10.1186/s12874-022-01781-9 Text en © The Author(s) 2022. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Treatment of missing data in Bayesian network structure learning: an application to linked biomedical and social survey data
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9761946/
https://www.ncbi.nlm.nih.gov/pubmed/36536286
http://dx.doi.org/10.1186/s12874-022-01781-9