ARCH: Large-scale Knowledge Graph via Aggregated Narrative Codified Health Records Analysis

OBJECTIVE: Electronic health record (EHR) systems contain a wealth of clinical data stored as both codified data and free-text narrative notes, covering hundreds of thousands of clinical concepts available for research and clinical care. The complex, massive, heterogeneous, and noisy nature of EHR d...

Bibliographic Details
Main Authors: Gan, Ziming, Zhou, Doudou, Rush, Everett, Panickan, Vidul A., Ho, Yuk-Lam, Ostrouchov, George, Xu, Zhiwei, Shen, Shuting, Xiong, Xin, Greco, Kimberly F., Hong, Chuan, Bonzel, Clara-Lea, Wen, Jun, Costa, Lauren, Cai, Tianrun, Begoli, Edmon, Xia, Zongqi, Gaziano, J. Michael, Liao, Katherine P., Cho, Kelly, Cai, Tianxi, Lu, Junwei
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10246054/
https://www.ncbi.nlm.nih.gov/pubmed/37293026
http://dx.doi.org/10.1101/2023.05.14.23289955
author Gan, Ziming
Zhou, Doudou
Rush, Everett
Panickan, Vidul A.
Ho, Yuk-Lam
Ostrouchov, George
Xu, Zhiwei
Shen, Shuting
Xiong, Xin
Greco, Kimberly F.
Hong, Chuan
Bonzel, Clara-Lea
Wen, Jun
Costa, Lauren
Cai, Tianrun
Begoli, Edmon
Xia, Zongqi
Gaziano, J. Michael
Liao, Katherine P.
Cho, Kelly
Cai, Tianxi
Lu, Junwei
author_sort Gan, Ziming
collection PubMed
description OBJECTIVE: Electronic health record (EHR) systems contain a wealth of clinical data stored as both codified data and free-text narrative notes, covering hundreds of thousands of clinical concepts available for research and clinical care. The complex, massive, heterogeneous, and noisy nature of EHR data imposes significant challenges for feature representation, information extraction, and uncertainty quantification. To address these challenges, we proposed an efficient Aggregated naRrative Codified Health (ARCH) records analysis to generate a large-scale knowledge graph (KG) for a comprehensive set of EHR codified and narrative features.
METHODS: The ARCH algorithm first derives embedding vectors from a co-occurrence matrix of all EHR concepts and then generates cosine similarities along with associated p-values to measure the strength of relatedness between clinical features with statistical certainty quantification. In the final step, ARCH performs a sparse embedding regression to remove indirect linkage between entity pairs. We validated the clinical utility of the ARCH knowledge graph, generated from 12.5 million patients in the Veterans Affairs (VA) healthcare system, through downstream tasks including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, and sub-typing Alzheimer’s disease (AD) patients.
RESULTS: ARCH produces high-quality clinical embeddings and a KG for over 60,000 EHR concepts, as visualized in the R Shiny-powered web API (https://celehs.hms.harvard.edu/ARCH/). For detecting pairs of similar EHR concepts, the ARCH embeddings attained an average area under the ROC curve (AUC) of 0.926 when the concepts are mapped to codified data and 0.861 when they are mapped to NLP data; for detecting related pairs, the AUCs were 0.810 (codified) and 0.843 (NLP). Based on the p-values computed by ARCH, the sensitivities for detecting similar and related entity pairs are 0.906 and 0.888, respectively, under false discovery rate (FDR) control of 5%. For detecting drug side effects, the cosine similarity based on the ARCH semantic representations achieved an AUC of 0.723, and the AUC improved to 0.826 after few-shot training via minimizing the loss function on the training data set. Incorporating NLP data substantially improved the ability to detect side effects in the EHR. For example, based on unsupervised ARCH embeddings, the power of detecting drug-side effect pairs was 0.15 when using codified data only, much lower than the power of 0.51 when using both codified and NLP concepts. Compared to existing large-scale representation learning methods including PubMedBERT, BioBERT and SapBERT, ARCH attains the most robust performance and substantially higher accuracy in detecting these relationships. Incorporating ARCH-selected features in weakly supervised phenotyping algorithms can improve the robustness of algorithm performance, especially for diseases that benefit from NLP features as supporting evidence. For example, the phenotyping algorithm for depression attained an AUC of 0.927 when using ARCH-selected features but only 0.857 when using codified features selected via the KESER network[1]. In addition, embeddings and knowledge graphs generated from the ARCH network were able to cluster AD patients into two subgroups, where the fast progression subgroup had a much higher mortality rate.
CONCLUSIONS: The proposed ARCH algorithm generates large-scale, high-quality semantic representations and a knowledge graph for both codified and NLP EHR features, useful for a wide range of predictive modeling tasks.
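The first two steps described under METHODS (embeddings from a co-occurrence matrix, then cosine similarities between concepts) can be illustrated with a minimal sketch. The abstract does not say how the co-occurrence matrix is factorized, so the positive pointwise mutual information (PPMI) transform and the eigendecomposition used here are assumptions chosen as one common approach, not the authors' implementation. The function names `ppmi`, `embed`, and `cosine_similarity` and the toy counts are hypothetical, and the p-value and sparse regression steps are not covered.

```python
import numpy as np


def ppmi(cooc: np.ndarray) -> np.ndarray:
    """Positive pointwise mutual information of a co-occurrence count matrix.

    Assumption: PPMI is used as the transform; the abstract does not specify it.
    """
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    # log(0) and 0/0 are expected for unseen pairs; silence the warnings
    # and map those entries to 0 below.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)


def embed(cooc: np.ndarray, dim: int) -> np.ndarray:
    """Return one `dim`-dimensional embedding row per concept.

    Uses the top eigenpairs of the symmetric PPMI matrix (an assumed choice).
    """
    m = ppmi(cooc)
    vals, vecs = np.linalg.eigh(m)           # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:dim]       # indices of the largest eigenvalues
    scale = np.sqrt(np.clip(vals[top], 0.0, None))
    return vecs[:, top] * scale


def cosine_similarity(emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between embedding rows."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.where(norms == 0.0, 1.0, norms)
    return unit @ unit.T


# Toy symmetric co-occurrence counts for four hypothetical concepts:
# concepts 0 and 1 co-occur often, as do concepts 2 and 3.
toy_cooc = np.array(
    [
        [10, 8, 1, 0],
        [8, 10, 0, 1],
        [1, 0, 10, 8],
        [0, 1, 8, 10],
    ],
    dtype=float,
)

toy_sim = cosine_similarity(embed(toy_cooc, dim=2))
print(np.round(toy_sim, 3))
```

In this toy setting, concepts that co-occur frequently end up with nearly parallel embedding vectors, while concepts from different groups are close to orthogonal. A real pipeline at the scale the abstract describes would presumably need sparse matrices and a truncated solver rather than a dense eigendecomposition.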
format Online
Article
Text
id pubmed-10246054
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-10246054 2023-06-08 ARCH: Large-scale Knowledge Graph via Aggregated Narrative Codified Health Records Analysis medRxiv Article Cold Spring Harbor Laboratory 2023-05-21 /pmc/articles/PMC10246054/ /pubmed/37293026 http://dx.doi.org/10.1101/2023.05.14.23289955 Text en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (https://creativecommons.org/licenses/by-nc/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator.
title ARCH: Large-scale Knowledge Graph via Aggregated Narrative Codified Health Records Analysis
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10246054/
https://www.ncbi.nlm.nih.gov/pubmed/37293026
http://dx.doi.org/10.1101/2023.05.14.23289955