Cargando…

Provision and Characterization of a Corpus for Pharmaceutical, Biomedical Named Entity Recognition for Pharmacovigilance: Evaluation of Language Registers and Training Data Sufficiency

INTRODUCTION AND OBJECTIVE: Machine learning (ML) systems are widely used for automatic entity recognition in pharmacovigilance. Publicly available datasets do not allow the use of annotated entities independently, focusing on small entity subsets or on single language registers (informal or scienti...

Descripción completa

Detalles Bibliográficos
Autores principales:	Dietrich, Jürgen, Kazzer, Philipp
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Springer International Publishing 2023
Materias:	Original Research Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10345043/ https://www.ncbi.nlm.nih.gov/pubmed/37338799 http://dx.doi.org/10.1007/s40264-023-01322-3

_version_	1785072996668407808
author	Dietrich, Jürgen Kazzer, Philipp
author_facet	Dietrich, Jürgen Kazzer, Philipp
author_sort	Dietrich, Jürgen
collection	PubMed
description	INTRODUCTION AND OBJECTIVE: Machine learning (ML) systems are widely used for automatic entity recognition in pharmacovigilance. Publicly available datasets do not allow the use of annotated entities independently, focusing on small entity subsets or on single language registers (informal or scientific language). The objective of the current study was to create a dataset that enables independent usage of entities, explores the performance of predictive ML models on different registers, and introduces a method to investigate entity cut-off performance. METHODS: A dataset has been created combining different registers with 18 different entities. We applied this dataset to compare the performance of integrated models with models created with single language registers only. We introduced fractional stratified k-fold cross-validation to determine model performance on entity level by using training dataset fractions. We investigated the course of entity performance with fractions of training datasets and evaluated entity peak and cut-off performance. RESULTS: The dataset combines 1400 records (scientific language: 790; informal language: 610) with 2622 sentences and 9989 entity occurrences and combines data from external (801 records) and internal sources (599 records). We demonstrated that single language register models underperform compared to integrated models trained with multiple language registers. CONCLUSIONS: A manually annotated dataset with a variety of different pharmaceutical and biomedical entities was created and is made available to the research community. Our results show that models that combine different registers provide better maintainability, have higher robustness, and have similar or higher performance. Fractional stratified k-fold cross-validation allows the evaluation of training data sufficiency on the entity level. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s40264-023-01322-3.
format	Online Article Text
id	pubmed-10345043
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-103450432023-07-15 Provision and Characterization of a Corpus for Pharmaceutical, Biomedical Named Entity Recognition for Pharmacovigilance: Evaluation of Language Registers and Training Data Sufficiency Dietrich, Jürgen Kazzer, Philipp Drug Saf Original Research Article INTRODUCTION AND OBJECTIVE: Machine learning (ML) systems are widely used for automatic entity recognition in pharmacovigilance. Publicly available datasets do not allow the use of annotated entities independently, focusing on small entity subsets or on single language registers (informal or scientific language). The objective of the current study was to create a dataset that enables independent usage of entities, explores the performance of predictive ML models on different registers, and introduces a method to investigate entity cut-off performance. METHODS: A dataset has been created combining different registers with 18 different entities. We applied this dataset to compare the performance of integrated models with models created with single language registers only. We introduced fractional stratified k-fold cross-validation to determine model performance on entity level by using training dataset fractions. We investigated the course of entity performance with fractions of training datasets and evaluated entity peak and cut-off performance. RESULTS: The dataset combines 1400 records (scientific language: 790; informal language: 610) with 2622 sentences and 9989 entity occurrences and combines data from external (801 records) and internal sources (599 records). We demonstrated that single language register models underperform compared to integrated models trained with multiple language registers. CONCLUSIONS: A manually annotated dataset with a variety of different pharmaceutical and biomedical entities was created and is made available to the research community. Our results show that models that combine different registers provide better maintainability, have higher robustness, and have similar or higher performance. Fractional stratified k-fold cross-validation allows the evaluation of training data sufficiency on the entity level. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s40264-023-01322-3. Springer International Publishing 2023-06-20 2023 /pmc/articles/PMC10345043/ /pubmed/37338799 http://dx.doi.org/10.1007/s40264-023-01322-3 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by-nc/4.0/Open Access This article is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, which permits any non-commercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc/4.0/ (https://creativecommons.org/licenses/by-nc/4.0/) .
spellingShingle	Original Research Article Dietrich, Jürgen Kazzer, Philipp Provision and Characterization of a Corpus for Pharmaceutical, Biomedical Named Entity Recognition for Pharmacovigilance: Evaluation of Language Registers and Training Data Sufficiency
title	Provision and Characterization of a Corpus for Pharmaceutical, Biomedical Named Entity Recognition for Pharmacovigilance: Evaluation of Language Registers and Training Data Sufficiency
title_full	Provision and Characterization of a Corpus for Pharmaceutical, Biomedical Named Entity Recognition for Pharmacovigilance: Evaluation of Language Registers and Training Data Sufficiency
title_fullStr	Provision and Characterization of a Corpus for Pharmaceutical, Biomedical Named Entity Recognition for Pharmacovigilance: Evaluation of Language Registers and Training Data Sufficiency
title_full_unstemmed	Provision and Characterization of a Corpus for Pharmaceutical, Biomedical Named Entity Recognition for Pharmacovigilance: Evaluation of Language Registers and Training Data Sufficiency
title_short	Provision and Characterization of a Corpus for Pharmaceutical, Biomedical Named Entity Recognition for Pharmacovigilance: Evaluation of Language Registers and Training Data Sufficiency
title_sort	provision and characterization of a corpus for pharmaceutical, biomedical named entity recognition for pharmacovigilance: evaluation of language registers and training data sufficiency
topic	Original Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10345043/ https://www.ncbi.nlm.nih.gov/pubmed/37338799 http://dx.doi.org/10.1007/s40264-023-01322-3
work_keys_str_mv	AT dietrichjurgen provisionandcharacterizationofacorpusforpharmaceuticalbiomedicalnamedentityrecognitionforpharmacovigilanceevaluationoflanguageregistersandtrainingdatasufficiency AT kazzerphilipp provisionandcharacterizationofacorpusforpharmaceuticalbiomedicalnamedentityrecognitionforpharmacovigilanceevaluationoflanguageregistersandtrainingdatasufficiency

Provision and Characterization of a Corpus for Pharmaceutical, Biomedical Named Entity Recognition for Pharmacovigilance: Evaluation of Language Registers and Training Data Sufficiency

Ejemplares similares