Clinical Data Extraction and Normalization of Cyrillic Electronic Health Records Via Deep-Learning Natural Language Processing

PURPOSE: A substantial portion of medical data is unstructured. Extracting data from unstructured text presents a barrier to advancing clinical research and improving patient care. In addition, ongoing studies have been focused predominately on the English language, whereas inflected languages with...

Bibliographic Details
Main author: Zhao, Boyang
Format: Online Article Text
Language: English
Published: American Society of Clinical Oncology 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6874014/
https://www.ncbi.nlm.nih.gov/pubmed/31577448
http://dx.doi.org/10.1200/CCI.19.00057
author Zhao, Boyang
collection PubMed
description PURPOSE: A substantial portion of medical data is unstructured. Extracting data from unstructured text presents a barrier to advancing clinical research and improving patient care. In addition, ongoing studies have been focused predominately on the English language, whereas inflected languages with non-Latin alphabets (such as Slavic languages with a Cyrillic alphabet) present numerous linguistic challenges. We developed deep-learning–based natural language processing algorithms for automatically extracting biomarker status of patients with breast cancer from three oncology centers in Bulgaria.
METHODS: We used dual embeddings for English and Bulgarian languages, encoding both syntactic and polarity information for the words. The embeddings were subsequently aligned so that they were in the same vector space. The embeddings were used as input to convolutional or recurrent neural networks to derive the biomarker status of estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2.
RESULTS: We showed that we can resolve ambiguity in highly variable medical text containing both Latin and Cyrillic text. Final models incorporating both English and Bulgarian syntax and polarity embeddings achieved F1 scores of 0.90 or higher for all estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 biomarkers. The models were robust against human errors originally found in the training set. In addition, such models can be extended for analyzing text containing words not seen during training.
CONCLUSION: By using several techniques that incorporate dual-word embeddings encoding syntactic and polarity information in two languages followed by deep neural network architectures, we show that researchers can extract and normalize parameters within medical data. The principles described here can be used to analyze Cyrillic or Latin mixed medical text and extract other parameters.
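The METHODS statement says the English and Bulgarian embeddings were aligned into the same vector space, but the abstract does not say how. As an illustration only, the sketch below uses orthogonal Procrustes alignment, a common choice for this step and an assumption here, not necessarily the method used in the paper. The function name `align_embeddings` and the random toy vectors are hypothetical; real input would be embeddings for word pairs from a bilingual dictionary.

```python
import numpy as np


def align_embeddings(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix W that minimises ||src @ W - tgt||_F.

    Rows of src and tgt are embeddings of translation pairs
    (row i of src is the source-language word, row i of tgt its translation).
    """
    # Closed-form Procrustes solution: SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


rng = np.random.default_rng(0)

# Toy stand-in for Bulgarian vectors of 50 dictionary words (8 dimensions).
bg = rng.normal(size=(50, 8))

# Toy "English" vectors: the same points under a hidden rotation,
# so a perfect alignment exists by construction.
hidden_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
en = bg @ hidden_rotation

w = align_embeddings(bg, en)
aligned = bg @ w  # Bulgarian vectors mapped into the English space

print("max alignment error:", float(np.abs(aligned - en).max()))
```

Because the map is constrained to be orthogonal, distances and angles within the source embedding are preserved, which is one reason this approach is often preferred over an unconstrained linear map. On real embeddings the fit is only approximate, unlike in this constructed example.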
format Online
Article
Text
id pubmed-6874014
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher American Society of Clinical Oncology
record_format MEDLINE/PubMed
spelling pubmed-6874014 2020-10-02 Clinical Data Extraction and Normalization of Cyrillic Electronic Health Records Via Deep-Learning Natural Language Processing Zhao, Boyang JCO Clin Cancer Inform Original Reports American Society of Clinical Oncology 2019-10-02 /pmc/articles/PMC6874014/ /pubmed/31577448 http://dx.doi.org/10.1200/CCI.19.00057 Text en © 2019 by American Society of Clinical Oncology Creative Commons Attribution Non-Commercial No Derivatives 4.0 License: https://creativecommons.org/licenses/by-nc-nd/4.0/
title Clinical Data Extraction and Normalization of Cyrillic Electronic Health Records Via Deep-Learning Natural Language Processing
topic Original Reports