Cargando…

Improving the recall of biomedical named entity recognition with label re-correction and knowledge distillation

BACKGROUND: Biomedical named entity recognition is one of the most essential tasks in biomedical information extraction. Previous studies suffer from inadequate annotated datasets, especially the limited knowledge contained in them. METHODS: To remedy the above issue, we propose a novel Biomedical N...

Descripción completa

Detalles Bibliográficos
Autores principales: Zhou, Huiwei, Liu, Zhe, Lang, Chengkun, Xu, Yibin, Lin, Yingyu, Hou, Junjie
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2021
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8170952/
https://www.ncbi.nlm.nih.gov/pubmed/34078270
http://dx.doi.org/10.1186/s12859-021-04200-w
Descripción
Sumario:BACKGROUND: Biomedical named entity recognition is one of the most essential tasks in biomedical information extraction. Previous studies suffer from inadequate annotated datasets, especially the limited knowledge contained in them. METHODS: To remedy the above issue, we propose a novel Biomedical Named Entity Recognition (BioNER) framework with label re-correction and knowledge distillation strategies, which could not only create large and high-quality datasets but also obtain a high-performance recognition model. Our framework is inspired by two points: (1) named entity recognition should be considered from the perspective of both coverage and accuracy; (2) trustable annotations should be yielded by iterative correction. Firstly, for coverage, we annotate chemical and disease entities in a large-scale unlabeled dataset by PubTator to generate a weakly labeled dataset. For accuracy, we then filter it by utilizing multiple knowledge bases to generate another weakly labeled dataset. Next, the two datasets are revised by a label re-correction strategy to construct two high-quality datasets, which are used to train two recognition models, respectively. Finally, we compress the knowledge in the two models into a single recognition model with knowledge distillation. RESULTS: Experiments on the BioCreative V chemical-disease relation corpus and NCBI Disease corpus show that knowledge from large-scale datasets significantly improves the performance of BioNER, especially the recall of it, leading to new state-of-the-art results. CONCLUSIONS: We propose a framework with label re-correction and knowledge distillation strategies. Comparison results show that the two perspectives of knowledge in the two re-corrected datasets respectively are complementary and both effective for BioNER.