German Medical Named Entity Recognition Model and Data Set Creation Using Machine Translation and Word Alignment: Algorithm Development and Validation

BACKGROUND: Data mining in the field of medical data analysis often needs to rely solely on the processing of unstructured data to retrieve relevant data. For German natural language processing, few open medical neural named entity recognition (NER) models have been published before this work. A maj...

Bibliographic Details
Main authors:	Frei, Johann, Kramer, Frank
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2023
Subjects:	Original Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10015355/ https://www.ncbi.nlm.nih.gov/pubmed/36853741 http://dx.doi.org/10.2196/39077

_version_	1784907194079117312
author	Frei, Johann Kramer, Frank
author_facet	Frei, Johann Kramer, Frank
author_sort	Frei, Johann
collection	PubMed
description	BACKGROUND: Data mining in the field of medical data analysis often needs to rely solely on the processing of unstructured data to retrieve relevant data. For German natural language processing, few open medical neural named entity recognition (NER) models have been published before this work. A major issue can be attributed to the lack of German training data. OBJECTIVE: We developed a synthetic data set and a novel German medical NER model for public access to demonstrate the feasibility of our approach. In order to bypass legal restrictions due to potential data leaks through model analysis, we did not make use of internal, proprietary data sets, which is a frequent veto factor for data set publication. METHODS: The underlying German data set was retrieved by translation and word alignment of a public English data set. The data set served as a foundation for model training and evaluation. For demonstration purposes, our NER model follows a simple network architecture that is designed for low computational requirements. RESULTS: The obtained data set consisted of 8599 sentences including 30,233 annotations. The model achieved a class frequency–averaged F1 score of 0.82 on the test set after training across 7 different NER types. Artifacts in the synthesized data set with regard to translation and alignment induced by the proposed method were exposed. The annotation performance was evaluated on an external data set and measured in comparison with an existing baseline model that has been trained on a dedicated German data set in a traditional fashion. We discussed the drop in annotation performance on an external data set for our simple NER model. Our model is publicly available. CONCLUSIONS: We demonstrated the feasibility of obtaining a data set and training a German medical NER model by the exclusive use of public training data through our suggested method. The discussion on the limitations of our approach includes ways to further mitigate remaining problems in future work.
format	Online Article Text
id	pubmed-10015355
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-10015355 2023-03-16 German Medical Named Entity Recognition Model and Data Set Creation Using Machine Translation and Word Alignment: Algorithm Development and Validation Frei, Johann Kramer, Frank JMIR Form Res Original Paper JMIR Publications 2023-02-28 /pmc/articles/PMC10015355/ /pubmed/36853741 http://dx.doi.org/10.2196/39077 Text en ©Johann Frei, Frank Kramer. Originally published in JMIR Formative Research (https://formative.jmir.org), 28.02.2023. This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.
title	German Medical Named Entity Recognition Model and Data Set Creation Using Machine Translation and Word Alignment: Algorithm Development and Validation
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10015355/ https://www.ncbi.nlm.nih.gov/pubmed/36853741 http://dx.doi.org/10.2196/39077
