
Exploiting and assessing multi-source data for supervised biomedical named entity recognition

MOTIVATION: Recognition of biomedical entities from scientific text is a critical component of natural language processing and automated information extraction platforms. Modern named entity recognition approaches rely heavily on supervised machine learning techniques, which are critically dependent...

Bibliographic Details
Main Authors:	Galea, Dieter; Laponogov, Ivan; Veselkov, Kirill
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2018
Subjects:	Original Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6041968/ https://www.ncbi.nlm.nih.gov/pubmed/29538614 http://dx.doi.org/10.1093/bioinformatics/bty152

_version_	1783339079379714048
author	Galea, Dieter; Laponogov, Ivan; Veselkov, Kirill
author_facet	Galea, Dieter; Laponogov, Ivan; Veselkov, Kirill
author_sort	Galea, Dieter
collection	PubMed
description	MOTIVATION: Recognition of biomedical entities from scientific text is a critical component of natural language processing and automated information extraction platforms. Modern named entity recognition approaches rely heavily on supervised machine learning techniques, which are critically dependent on annotated training corpora. These approaches have been shown to perform well when trained and tested on the same source. However, in such scenario, the performance and evaluation of these models may be optimistic, as such models may not necessarily generalize to independent corpora, resulting in potential non-optimal entity recognition for large-scale tagging of widely diverse articles in databases such as PubMed. RESULTS: Here we aggregated published corpora for the recognition of biomolecular entities (such as genes, RNA, proteins, variants, drugs and metabolites), identified entity class overlap and performed leave-corpus-out cross validation strategy to test the efficiency of existing models. We demonstrate that accuracies of models trained on individual corpora decrease substantially for recognition of the same biomolecular entity classes in independent corpora. This behavior is possibly due to limited generalizability of entity-class-related features captured by individual corpora (model ‘overtraining’) which we investigated further at the orthographic level, as well as potential annotation standard differences. We show that the combined use of multi-source training corpora results in overall more generalizable models for named entity recognition, while achieving comparable individual performance. By performing learning-curve-based power analysis we further identified that performance is often not limited by the quantity of the annotated data. AVAILABILITY AND IMPLEMENTATION: Compiled primary and secondary sources of the aggregated corpora are available on: https://github.com/dterg/biomedical_corpora/wiki and https://bitbucket.org/iAnalytica/bioner. 
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-6041968
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-60419682018-07-17 Exploiting and assessing multi-source data for supervised biomedical named entity recognition Galea, Dieter Laponogov, Ivan Veselkov, Kirill Bioinformatics Original Papers MOTIVATION: Recognition of biomedical entities from scientific text is a critical component of natural language processing and automated information extraction platforms. Modern named entity recognition approaches rely heavily on supervised machine learning techniques, which are critically dependent on annotated training corpora. These approaches have been shown to perform well when trained and tested on the same source. However, in such scenario, the performance and evaluation of these models may be optimistic, as such models may not necessarily generalize to independent corpora, resulting in potential non-optimal entity recognition for large-scale tagging of widely diverse articles in databases such as PubMed. RESULTS: Here we aggregated published corpora for the recognition of biomolecular entities (such as genes, RNA, proteins, variants, drugs and metabolites), identified entity class overlap and performed leave-corpus-out cross validation strategy to test the efficiency of existing models. We demonstrate that accuracies of models trained on individual corpora decrease substantially for recognition of the same biomolecular entity classes in independent corpora. This behavior is possibly due to limited generalizability of entity-class-related features captured by individual corpora (model ‘overtraining’) which we investigated further at the orthographic level, as well as potential annotation standard differences. We show that the combined use of multi-source training corpora results in overall more generalizable models for named entity recognition, while achieving comparable individual performance. By performing learning-curve-based power analysis we further identified that performance is often not limited by the quantity of the annotated data. 
AVAILABILITY AND IMPLEMENTATION: Compiled primary and secondary sources of the aggregated corpora are available on: https://github.com/dterg/biomedical_corpora/wiki and https://bitbucket.org/iAnalytica/bioner. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2018-07-15 2018-03-10 /pmc/articles/PMC6041968/ /pubmed/29538614 http://dx.doi.org/10.1093/bioinformatics/bty152 Text en © The Author(s) 2018. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Papers Galea, Dieter Laponogov, Ivan Veselkov, Kirill Exploiting and assessing multi-source data for supervised biomedical named entity recognition
title	Exploiting and assessing multi-source data for supervised biomedical named entity recognition
title_full	Exploiting and assessing multi-source data for supervised biomedical named entity recognition
title_fullStr	Exploiting and assessing multi-source data for supervised biomedical named entity recognition
title_full_unstemmed	Exploiting and assessing multi-source data for supervised biomedical named entity recognition
title_short	Exploiting and assessing multi-source data for supervised biomedical named entity recognition
title_sort	exploiting and assessing multi-source data for supervised biomedical named entity recognition
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6041968/ https://www.ncbi.nlm.nih.gov/pubmed/29538614 http://dx.doi.org/10.1093/bioinformatics/bty152
work_keys_str_mv	AT galeadieter exploitingandassessingmultisourcedataforsupervisedbiomedicalnamedentityrecognition AT laponogovivan exploitingandassessingmultisourcedataforsupervisedbiomedicalnamedentityrecognition AT veselkovkirill exploitingandassessingmultisourcedataforsupervisedbiomedicalnamedentityrecognition

Similar Items