
Clinical Data Extraction and Normalization of Cyrillic Electronic Health Records Via Deep-Learning Natural Language Processing

PURPOSE: A substantial portion of medical data is unstructured. Extracting data from unstructured text presents a barrier to advancing clinical research and improving patient care. In addition, ongoing studies have been focused predominately on the English language, whereas inflected languages with...


Bibliographic Details
Main author:	Zhao, Boyang
Format:	Online Article Text
Language:	English
Published:	American Society of Clinical Oncology 2019
Subjects:	Original Reports
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6874014/ https://www.ncbi.nlm.nih.gov/pubmed/31577448 http://dx.doi.org/10.1200/CCI.19.00057

author	Zhao, Boyang
collection	PubMed
description	PURPOSE: A substantial portion of medical data is unstructured. Extracting data from unstructured text presents a barrier to advancing clinical research and improving patient care. In addition, ongoing studies have been focused predominately on the English language, whereas inflected languages with non-Latin alphabets (such as Slavic languages with a Cyrillic alphabet) present numerous linguistic challenges. We developed deep-learning–based natural language processing algorithms for automatically extracting biomarker status of patients with breast cancer from three oncology centers in Bulgaria.
METHODS: We used dual embeddings for English and Bulgarian languages, encoding both syntactic and polarity information for the words. The embeddings were subsequently aligned so that they were in the same vector space. The embeddings were used as input to convolutional or recurrent neural networks to derive the biomarker status of estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2.
RESULTS: We showed that we can resolve ambiguity in highly variable medical text containing both Latin and Cyrillic text. Final models incorporating both English and Bulgarian syntax and polarity embeddings achieved F1 scores of 0.90 or higher for all estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 biomarkers. The models were robust against human errors originally found in the training set. In addition, such models can be extended for analyzing text containing words not seen during training.
CONCLUSION: By using several techniques that incorporate dual-word embeddings encoding syntactic and polarity information in two languages followed by deep neural network architectures, we show that researchers can extract and normalize parameters within medical data. The principles described here can be used to analyze Cyrillic or Latin mixed medical text and extract other parameters.
format	Online Article Text
id	pubmed-6874014
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	American Society of Clinical Oncology
record_format	MEDLINE/PubMed
spelling	pubmed-6874014 2020-10-02 JCO Clin Cancer Inform Original Reports American Society of Clinical Oncology 2019-10-02 /pmc/articles/PMC6874014/ /pubmed/31577448 http://dx.doi.org/10.1200/CCI.19.00057 Text en © 2019 by American Society of Clinical Oncology Creative Commons Attribution Non-Commercial No Derivatives 4.0 License: https://creativecommons.org/licenses/by-nc-nd/4.0/
title	Clinical Data Extraction and Normalization of Cyrillic Electronic Health Records Via Deep-Learning Natural Language Processing
topic	Original Reports
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6874014/ https://www.ncbi.nlm.nih.gov/pubmed/31577448 http://dx.doi.org/10.1200/CCI.19.00057
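
The abstract reports F1 scores of 0.90 or higher for each biomarker. As a reminder of what that metric measures, the following is a minimal sketch of a per-label F1 computation. The function name `f1_score`, the label strings, and the example data are all hypothetical and are not taken from the paper; the record does not say how the authors averaged scores across classes, so this shows only the single-label case.

```python
def f1_score(gold, predicted, positive_label):
    """Return the F1 score for one label, given parallel lists of labels."""
    pairs = list(zip(gold, predicted))
    # True positives: the label was expected and predicted.
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    # False positives: the label was predicted but not expected.
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    # False negatives: the label was expected but not predicted.
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing was matched
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical estrogen receptor status labels, for illustration only.
gold = ["positive", "positive", "negative", "positive", "negative", "positive"]
predicted = ["positive", "negative", "negative", "positive", "positive", "positive"]
er_f1 = f1_score(gold, predicted, "positive")
```

In this made-up example there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75 and so is their harmonic mean.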


Similar Items