Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora

Bibliographic Details
Main authors:	Naderi, Nona; Knafou, Julien; Copara, Jenny; Ruch, Patrick; Teodoro, Douglas
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2021
Subjects:	Research Metrics and Analytics
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8640190/ https://www.ncbi.nlm.nih.gov/pubmed/34870074 http://dx.doi.org/10.3389/frma.2021.689803

author	Naderi, Nona; Knafou, Julien; Copara, Jenny; Ruch, Patrick; Teodoro, Douglas
collection	PubMed
description	The health and life science domains are well known for their wealth of named entities found in large free text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual deep masked language models and ensembles of them perform across corpora from different health and life science domains (biology, chemistry, and medicine) and in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority voting strategies. Experiments show a statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.
format	Online Article Text
id	pubmed-8640190
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-8640190 2021-12-04 Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora Naderi, Nona; Knafou, Julien; Copara, Jenny; Ruch, Patrick; Teodoro, Douglas Front Res Metr Anal Research Metrics and Analytics Frontiers Media S.A. 2021-11-19 /pmc/articles/PMC8640190/ /pubmed/34870074 http://dx.doi.org/10.3389/frma.2021.689803 Text en Copyright © 2021 Naderi, Knafou, Copara, Ruch and Teodoro. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora
topic	Research Metrics and Analytics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8640190/ https://www.ncbi.nlm.nih.gov/pubmed/34870074 http://dx.doi.org/10.3389/frma.2021.689803