Patient Representation From Structured Electronic Medical Records Based on Embedding Technique: Development and Validation Study
BACKGROUND: The secondary use of structured electronic medical record (sEMR) data has become a challenge due to the diversity, sparsity, and high dimensionality of the data representation. Constructing an effective representation for sEMR data is becoming more and more crucial for subsequent data ap...
Main authors: | Huang, Yanqun; Wang, Ni; Zhang, Zhiqiang; Liu, Honglei; Fei, Xiaolu; Wei, Lan; Chen, Hui |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | JMIR Publications 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8367145/ https://www.ncbi.nlm.nih.gov/pubmed/34297000 http://dx.doi.org/10.2196/19905 |
_version_ | 1783739019957370880 |
---|---|
author | Huang, Yanqun Wang, Ni Zhang, Zhiqiang Liu, Honglei Fei, Xiaolu Wei, Lan Chen, Hui |
author_sort | Huang, Yanqun |
collection | PubMed |
description | BACKGROUND: The secondary use of structured electronic medical record (sEMR) data has become a challenge due to the diversity, sparsity, and high dimensionality of the data representation. Constructing an effective representation for sEMR data is becoming more and more crucial for subsequent data applications. OBJECTIVE: We aimed to apply the embedding technique used in the natural language processing domain for the sEMR data representation and to explore the feasibility and superiority of the embedding-based feature and patient representations in clinical application. METHODS: The entire training corpus consisted of records of 104,752 hospitalized patients with 13,757 medical concepts of disease diagnoses, physical examinations and procedures, laboratory tests, medications, etc. Each medical concept was embedded into a 200-dimensional real number vector using the Skip-gram algorithm with some adaptive changes from shuffling the medical concepts in a record 20 times. The average of vectors for all medical concepts in a patient record represented the patient. For embedding-based feature representation evaluation, we used the cosine similarities among the medical concept vectors to capture the latent clinical associations among the medical concepts. We further conducted a clustering analysis on stroke patients to evaluate and compare the embedding-based patient representations. The Hopkins statistic, Silhouette index (SI), and Davies-Bouldin index were used for the unsupervised evaluation, and the precision, recall, and F1 score were used for the supervised evaluation. RESULTS: The dimension of patient representation was reduced from 13,757 to 200 using the embedding-based representation. The average cosine similarity of the selected disease (subarachnoid hemorrhage) and its 15 clinically relevant medical concepts was 0.973. Stroke patients were clustered into two clusters with the highest SI (0.852). Clustering analyses conducted on patients with the embedding representations showed higher applicability (Hopkins statistic 0.931), higher aggregation (SI 0.862), and lower dispersion (Davies-Bouldin index 0.551) than those conducted on patients with reference representation methods. The clustering solutions for patients with the embedding-based representation achieved the highest F1 scores of 0.944 and 0.717 for two clusters. CONCLUSIONS: The feature-level embedding-based representations can reflect the potential clinical associations among medical concepts effectively. The patient-level embedding-based representation is easy to use as continuous input to standard machine learning algorithms and can bring performance improvements. It is expected that the embedding-based representation will be helpful in a wide range of secondary uses of sEMR data. |
format | Online Article Text |
id | pubmed-8367145 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-8367145 2021-08-24 Patient Representation From Structured Electronic Medical Records Based on Embedding Technique: Development and Validation Study Huang, Yanqun Wang, Ni Zhang, Zhiqiang Liu, Honglei Fei, Xiaolu Wei, Lan Chen, Hui JMIR Med Inform Original Paper JMIR Publications 2021-07-23 /pmc/articles/PMC8367145/ /pubmed/34297000 http://dx.doi.org/10.2196/19905 Text en ©Yanqun Huang, Ni Wang, Zhiqiang Zhang, Honglei Liu, Xiaolu Fei, Lan Wei, Hui Chen. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 23.07.2021. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included. |
title | Patient Representation From Structured Electronic Medical Records Based on Embedding Technique: Development and Validation Study |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8367145/ https://www.ncbi.nlm.nih.gov/pubmed/34297000 http://dx.doi.org/10.2196/19905 |
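
The description field states that a patient is represented by the average of the vectors of all medical concepts in the record, that concept vectors are compared by cosine similarity, and that concepts in a record were shuffled 20 times for Skip-gram training. The following is a minimal sketch of those three steps only, not the authors' implementation. The concept names, the 3-dimensional toy vectors (the study used 200 dimensions), and the helper names `shuffled_copies`, `average_vector`, and `cosine_similarity` are all hypothetical, and how the shuffled copies were fed to Skip-gram is an assumption that the record does not spell out.

```python
import math
import random


def shuffled_copies(record, times, seed=0):
    """Return `times` independently shuffled copies of one record's concepts.

    Assumption: shuffling removes the arbitrary ordering of concepts within a
    record before the copies are used as training sequences.
    """
    rng = random.Random(seed)
    copies = []
    for _ in range(times):
        copy = list(record)
        rng.shuffle(copy)
        copies.append(copy)
    return copies


def average_vector(concepts, embeddings):
    """Patient vector: element-wise mean of the record's concept vectors."""
    vectors = [embeddings[c] for c in concepts]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; real ones would come from a trained model.
toy_embeddings = {
    "dx_stroke": [1.0, 0.0, 1.0],
    "lab_glucose": [0.0, 2.0, 1.0],
    "rx_aspirin": [2.0, 1.0, 1.0],
}
record = ["dx_stroke", "lab_glucose", "rx_aspirin"]

training_sequences = shuffled_copies(record, times=20)
patient = average_vector(record, toy_embeddings)
similarity = cosine_similarity(patient, toy_embeddings["dx_stroke"])
```

Averaging makes the patient vector independent of the order and number of concepts in the record, which is why the result can be passed as fixed-length continuous input to standard clustering or classification algorithms.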