Clinical Data Extraction and Normalization of Cyrillic Electronic Health Records Via Deep-Learning Natural Language Processing
PURPOSE: A substantial portion of medical data is unstructured. Extracting data from unstructured text presents a barrier to advancing clinical research and improving patient care. In addition, ongoing studies have been focused predominately on the English language, whereas inflected languages with...
Main author: | Zhao, Boyang |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | American Society of Clinical Oncology, 2019 |
Subjects: | Original Reports |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6874014/ https://www.ncbi.nlm.nih.gov/pubmed/31577448 http://dx.doi.org/10.1200/CCI.19.00057 |
_version_ | 1783472763186446336 |
---|---|
author | Zhao, Boyang |
author_facet | Zhao, Boyang |
author_sort | Zhao, Boyang |
collection | PubMed |
description | PURPOSE: A substantial portion of medical data is unstructured. Extracting data from unstructured text presents a barrier to advancing clinical research and improving patient care. In addition, ongoing studies have been focused predominately on the English language, whereas inflected languages with non-Latin alphabets (such as Slavic languages with a Cyrillic alphabet) present numerous linguistic challenges. We developed deep-learning–based natural language processing algorithms for automatically extracting biomarker status of patients with breast cancer from three oncology centers in Bulgaria. METHODS: We used dual embeddings for English and Bulgarian languages, encoding both syntactic and polarity information for the words. The embeddings were subsequently aligned so that they were in the same vector space. The embeddings were used as input to convolutional or recurrent neural networks to derive the biomarker status of estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2. RESULTS: We showed that we can resolve ambiguity in highly variable medical text containing both Latin and Cyrillic text. Final models incorporating both English and Bulgarian syntax and polarity embeddings achieved F1 scores of 0.90 or higher for all estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 biomarkers. The models were robust against human errors originally found in the training set. In addition, such models can be extended for analyzing text containing words not seen during training. CONCLUSION: By using several techniques that incorporate dual-word embeddings encoding syntactic and polarity information in two languages followed by deep neural network architectures, we show that researchers can extract and normalize parameters within medical data. The principles described here can be used to analyze Cyrillic or Latin mixed medical text and extract other parameters. |
format | Online Article Text |
id | pubmed-6874014 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | American Society of Clinical Oncology |
record_format | MEDLINE/PubMed |
spelling | pubmed-68740142020-10-02 Clinical Data Extraction and Normalization of Cyrillic Electronic Health Records Via Deep-Learning Natural Language Processing Zhao, Boyang JCO Clin Cancer Inform Original Reports PURPOSE: A substantial portion of medical data is unstructured. Extracting data from unstructured text presents a barrier to advancing clinical research and improving patient care. In addition, ongoing studies have been focused predominately on the English language, whereas inflected languages with non-Latin alphabets (such as Slavic languages with a Cyrillic alphabet) present numerous linguistic challenges. We developed deep-learning–based natural language processing algorithms for automatically extracting biomarker status of patients with breast cancer from three oncology centers in Bulgaria. METHODS: We used dual embeddings for English and Bulgarian languages, encoding both syntactic and polarity information for the words. The embeddings were subsequently aligned so that they were in the same vector space. The embeddings were used as input to convolutional or recurrent neural networks to derive the biomarker status of estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2. RESULTS: We showed that we can resolve ambiguity in highly variable medical text containing both Latin and Cyrillic text. Final models incorporating both English and Bulgarian syntax and polarity embeddings achieved F1 scores of 0.90 or higher for all estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 biomarkers. The models were robust against human errors originally found in the training set. In addition, such models can be extended for analyzing text containing words not seen during training. CONCLUSION: By using several techniques that incorporate dual-word embeddings encoding syntactic and polarity information in two languages followed by deep neural network architectures, we show that researchers can extract and normalize parameters within medical data. The principles described here can be used to analyze Cyrillic or Latin mixed medical text and extract other parameters. American Society of Clinical Oncology 2019-10-02 /pmc/articles/PMC6874014/ /pubmed/31577448 http://dx.doi.org/10.1200/CCI.19.00057 Text en © 2019 by American Society of Clinical Oncology https://creativecommons.org/licenses/by-nc-nd/4.0/ Creative Commons Attribution Non-Commercial No Derivatives 4.0 License: https://creativecommons.org/licenses/by-nc-nd/4.0/ |
spellingShingle | Original Reports Zhao, Boyang Clinical Data Extraction and Normalization of Cyrillic Electronic Health Records Via Deep-Learning Natural Language Processing |
title | Clinical Data Extraction and Normalization of Cyrillic Electronic Health Records Via Deep-Learning Natural Language Processing |
title_full | Clinical Data Extraction and Normalization of Cyrillic Electronic Health Records Via Deep-Learning Natural Language Processing |
title_fullStr | Clinical Data Extraction and Normalization of Cyrillic Electronic Health Records Via Deep-Learning Natural Language Processing |
title_full_unstemmed | Clinical Data Extraction and Normalization of Cyrillic Electronic Health Records Via Deep-Learning Natural Language Processing |
title_short | Clinical Data Extraction and Normalization of Cyrillic Electronic Health Records Via Deep-Learning Natural Language Processing |
title_sort | clinical data extraction and normalization of cyrillic electronic health records via deep-learning natural language processing |
topic | Original Reports |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6874014/ https://www.ncbi.nlm.nih.gov/pubmed/31577448 http://dx.doi.org/10.1200/CCI.19.00057 |
work_keys_str_mv | AT zhaoboyang clinicaldataextractionandnormalizationofcyrillicelectronichealthrecordsviadeeplearningnaturallanguageprocessing |