Patient Representation From Structured Electronic Medical Records Based on Embedding Technique: Development and Validation Study

BACKGROUND: The secondary use of structured electronic medical record (sEMR) data has become a challenge due to the diversity, sparsity, and high dimensionality of the data representation. Constructing an effective representation for sEMR data is becoming more and more crucial for subsequent data ap...

Bibliographic Details
Main Authors: Huang, Yanqun, Wang, Ni, Zhang, Zhiqiang, Liu, Honglei, Fei, Xiaolu, Wei, Lan, Chen, Hui
Format: Online Article Text
Language: English
Published: JMIR Publications 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8367145/
https://www.ncbi.nlm.nih.gov/pubmed/34297000
http://dx.doi.org/10.2196/19905
author Huang, Yanqun
Wang, Ni
Zhang, Zhiqiang
Liu, Honglei
Fei, Xiaolu
Wei, Lan
Chen, Hui
author_sort Huang, Yanqun
collection PubMed
description BACKGROUND: The secondary use of structured electronic medical record (sEMR) data has become a challenge due to the diversity, sparsity, and high dimensionality of the data representation. Constructing an effective representation for sEMR data is becoming more and more crucial for subsequent data applications. OBJECTIVE: We aimed to apply the embedding technique used in the natural language processing domain for the sEMR data representation and to explore the feasibility and superiority of the embedding-based feature and patient representations in clinical application. METHODS: The entire training corpus consisted of records of 104,752 hospitalized patients with 13,757 medical concepts of disease diagnoses, physical examinations and procedures, laboratory tests, medications, etc. Each medical concept was embedded into a 200-dimensional real number vector using the Skip-gram algorithm with some adaptive changes from shuffling the medical concepts in a record 20 times. The average of vectors for all medical concepts in a patient record represented the patient. For embedding-based feature representation evaluation, we used the cosine similarities among the medical concept vectors to capture the latent clinical associations among the medical concepts. We further conducted a clustering analysis on stroke patients to evaluate and compare the embedding-based patient representations. The Hopkins statistic, Silhouette index (SI), and Davies-Bouldin index were used for the unsupervised evaluation, and the precision, recall, and F1 score were used for the supervised evaluation. RESULTS: The dimension of patient representation was reduced from 13,757 to 200 using the embedding-based representation. The average cosine similarity of the selected disease (subarachnoid hemorrhage) and its 15 clinically relevant medical concepts was 0.973. Stroke patients were clustered into two clusters with the highest SI (0.852). 
Clustering analyses conducted on patients with the embedding representations showed higher applicability (Hopkins statistic 0.931), higher aggregation (SI 0.862), and lower dispersion (Davies-Bouldin index 0.551) than those conducted on patients with reference representation methods. The clustering solutions for patients with the embedding-based representation achieved the highest F1 scores of 0.944 and 0.717 for two clusters. CONCLUSIONS: The feature-level embedding-based representations can reflect the potential clinical associations among medical concepts effectively. The patient-level embedding-based representation is easy to use as continuous input to standard machine learning algorithms and can bring performance improvements. It is expected that the embedding-based representation will be helpful in a wide range of secondary uses of sEMR data.
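The two core operations named in the description, averaging concept vectors into a patient vector and comparing concept vectors by cosine similarity, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the concept names and the 3-dimensional toy vectors are hypothetical (the study used 200-dimensional Skip-gram vectors), and skipping concepts that have no vector is an assumption made here.

```python
import math


def mean_vector(vectors):
    """Return the element-wise average of a list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def patient_vector(record, concept_vectors):
    """Average the vectors of all concepts in one patient record.

    Concepts without a vector are skipped (an assumption of this sketch).
    """
    known = [concept_vectors[c] for c in record if c in concept_vectors]
    return mean_vector(known)


# Hypothetical toy embeddings; real ones would come from a trained model.
concept_vectors = {
    "subarachnoid_hemorrhage": [1.0, 0.0, 0.0],
    "head_ct": [0.8, 0.6, 0.0],
    "nimodipine": [0.6, 0.0, 0.8],
}

record = ["subarachnoid_hemorrhage", "head_ct", "nimodipine"]
patient = patient_vector(record, concept_vectors)
similarity = cosine_similarity(
    concept_vectors["subarachnoid_hemorrhage"], concept_vectors["head_ct"]
)
print(patient)
print(similarity)
```

Because every patient is reduced to one fixed-length vector regardless of how many concepts the record contains, the result can be passed directly to standard clustering or classification routines.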
format Online
Article
Text
id pubmed-8367145
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-83671452021-08-24 Patient Representation From Structured Electronic Medical Records Based on Embedding Technique: Development and Validation Study Huang, Yanqun Wang, Ni Zhang, Zhiqiang Liu, Honglei Fei, Xiaolu Wei, Lan Chen, Hui JMIR Med Inform Original Paper BACKGROUND: The secondary use of structured electronic medical record (sEMR) data has become a challenge due to the diversity, sparsity, and high dimensionality of the data representation. Constructing an effective representation for sEMR data is becoming more and more crucial for subsequent data applications. OBJECTIVE: We aimed to apply the embedding technique used in the natural language processing domain for the sEMR data representation and to explore the feasibility and superiority of the embedding-based feature and patient representations in clinical application. METHODS: The entire training corpus consisted of records of 104,752 hospitalized patients with 13,757 medical concepts of disease diagnoses, physical examinations and procedures, laboratory tests, medications, etc. Each medical concept was embedded into a 200-dimensional real number vector using the Skip-gram algorithm with some adaptive changes from shuffling the medical concepts in a record 20 times. The average of vectors for all medical concepts in a patient record represented the patient. For embedding-based feature representation evaluation, we used the cosine similarities among the medical concept vectors to capture the latent clinical associations among the medical concepts. We further conducted a clustering analysis on stroke patients to evaluate and compare the embedding-based patient representations. The Hopkins statistic, Silhouette index (SI), and Davies-Bouldin index were used for the unsupervised evaluation, and the precision, recall, and F1 score were used for the supervised evaluation. RESULTS: The dimension of patient representation was reduced from 13,757 to 200 using the embedding-based representation. 
The average cosine similarity of the selected disease (subarachnoid hemorrhage) and its 15 clinically relevant medical concepts was 0.973. Stroke patients were clustered into two clusters with the highest SI (0.852). Clustering analyses conducted on patients with the embedding representations showed higher applicability (Hopkins statistic 0.931), higher aggregation (SI 0.862), and lower dispersion (Davies-Bouldin index 0.551) than those conducted on patients with reference representation methods. The clustering solutions for patients with the embedding-based representation achieved the highest F1 scores of 0.944 and 0.717 for two clusters. CONCLUSIONS: The feature-level embedding-based representations can reflect the potential clinical associations among medical concepts effectively. The patient-level embedding-based representation is easy to use as continuous input to standard machine learning algorithms and can bring performance improvements. It is expected that the embedding-based representation will be helpful in a wide range of secondary uses of sEMR data. JMIR Publications 2021-07-23 /pmc/articles/PMC8367145/ /pubmed/34297000 http://dx.doi.org/10.2196/19905 Text en © Yanqun Huang, Ni Wang, Zhiqiang Zhang, Honglei Liu, Xiaolu Fei, Lan Wei, Hui Chen. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 23.07.2021. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
title Patient Representation From Structured Electronic Medical Records Based on Embedding Technique: Development and Validation Study
topic Original Paper