
Designing an openEHR-Based Pipeline for Extracting and Standardizing Unstructured Clinical Data Using Natural Language Processing

Background Merging disparate and heterogeneous datasets from clinical routine in a standardized and semantically enriched format to enable a multiple use of data also means incorporating unstructured data such as medical free texts. Although the extraction of structured data from texts, known as na...

Bibliographic Details
Main authors:	Wulff, Antje, Mast, Marcel, Hassler, Marcus, Montag, Sara, Marschollek, Michael, Jack, Thomas
Format:	Online Article Text
Language:	English
Published:	Georg Thieme Verlag KG 2020
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7725544/ https://www.ncbi.nlm.nih.gov/pubmed/33058101 http://dx.doi.org/10.1055/s-0040-1716403

author	Wulff, Antje; Mast, Marcel; Hassler, Marcus; Montag, Sara; Marschollek, Michael; Jack, Thomas
collection	PubMed
description	Background: Merging disparate and heterogeneous datasets from clinical routine into a standardized and semantically enriched format, so that the data can be used for multiple purposes, also means incorporating unstructured data such as medical free texts. Although the extraction of structured data from texts, known as natural language processing (NLP), has been researched extensively, at least for the English language, it is not enough to obtain structured output in an arbitrary format. NLP techniques need to be combined with clinical information standards such as openEHR so that data that are still unstructured can be reused and exchanged sensibly.
Objectives: The aim of the study is to automatically extract crucial information from medical free texts and to transform this unstructured clinical data into a standardized and structured representation by designing and implementing an exemplary pipeline for the processing of pediatric medical histories.
Methods: We constructed a pipeline that allows reusing medical free texts such as pediatric medical histories in a structured and standardized way by (1) selecting and modeling appropriate openEHR archetypes as standard clinical information models, (2) defining a German dictionary with crucial text markers serving as the expert knowledge base for an NLP pipeline, and (3) creating mapping rules between the NLP output and the archetypes. The approach was evaluated in a first pilot study using 50 manually annotated medical histories from the pediatric intensive care unit of the Hannover Medical School.
Results: We successfully reused 24 existing international archetypes to represent the most crucial elements of unstructured pediatric medical histories in a standardized form. The self-developed NLP pipeline was constructed by defining 3,055 text marker entries, 132 text events, 66 regular expressions, and a text corpus consisting of 776 entries for automatic correction of spelling mistakes. A total of 123 mapping rules were implemented to transform the extracted snippets into an openEHR-based representation so that they can be stored together with other structured data in an existing openEHR-based data repository. In the first evaluation, the NLP pipeline yielded 97% precision and 94% recall.
Conclusion: The use of NLP and openEHR archetypes was demonstrated as a viable approach for extracting and representing important information from pediatric medical histories in a structured and semantically enriched format. We designed a promising approach with potential to be generalized, and implemented a prototype that is extensible and reusable for other use cases concerning German medical free texts. In the long term, this will harness unstructured clinical data for further research purposes such as the design of clinical decision support systems. Together with structured data already integrated in openEHR-based representations, we aim to develop an interoperable openEHR-based application that is capable of automatically assessing a patient's risk status based on the patient's medical history at the time of admission.
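The three steps named in the Methods (a dictionary of text markers, regular expressions, and mapping rules to archetypes) and the reported precision and recall can be illustrated with a minimal Python sketch. Everything in it is a hypothetical assumption made for illustration: the two German markers, the temperature pattern, the target paths, and the function names are not the authors' dictionary, rules, or code.

```python
import re

# Hypothetical dictionary: German text marker -> concept label (illustrative only).
TEXT_MARKERS = {
    "fieber": "fever",
    "husten": "cough",
}

# Hypothetical regular expression for a temperature value such as "39,5 °C".
TEMPERATURE_RE = re.compile(r"(\d{2}(?:[.,]\d)?)\s*°\s*C", re.IGNORECASE)

# Hypothetical mapping rules: concept label -> target path in an archetype.
MAPPING_RULES = {
    "fever": "EVALUATION.problem_diagnosis/name",
    "cough": "OBSERVATION.symptom_sign/name",
    "temperature": "OBSERVATION.body_temperature/magnitude",
}


def extract(text):
    """Return (concept, value) pairs found by marker lookup and the regex."""
    lowered = text.lower()
    found = [(concept, marker) for marker, concept in TEXT_MARKERS.items()
             if marker in lowered]
    for match in TEMPERATURE_RE.finditer(text):
        # Normalize the German decimal comma before converting to float.
        found.append(("temperature", float(match.group(1).replace(",", "."))))
    return found


def map_to_archetypes(pairs):
    """Apply the mapping rules, yielding {target path: value}."""
    return {MAPPING_RULES[concept]: value for concept, value in pairs}


def precision_recall(predicted, gold):
    """Precision = TP / |predicted|, recall = TP / |gold|, over sets of items."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```

Calling `map_to_archetypes(extract(...))` on a sentence from a medical history gives a flat dictionary keyed by target path. The plain substring lookup used here would also match negated or compound forms such as "fieberfrei", which is one reason a real pipeline would need a much richer knowledge base than this sketch.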
format	Online Article Text
id	pubmed-7725544
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Georg Thieme Verlag KG
record_format	MEDLINE/PubMed
spelling	pubmed-7725544 2020-12-10 Methods Inf Med Georg Thieme Verlag KG 2020-12 2020-10-14 /pmc/articles/PMC7725544/ /pubmed/33058101 http://dx.doi.org/10.1055/s-0040-1716403 Text en The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial-License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. ( https://creativecommons.org/licenses/by-nc-nd/4.0/ ). This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives License, which permits unrestricted reproduction and distribution, for non-commercial purposes only; and use and reproduction, but not distribution, of adapted material for non-commercial purposes only, provided the original work is properly cited.
title	Designing an openEHR-Based Pipeline for Extracting and Standardizing Unstructured Clinical Data Using Natural Language Processing
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7725544/ https://www.ncbi.nlm.nih.gov/pubmed/33058101 http://dx.doi.org/10.1055/s-0040-1716403