Full-text chemical identification with improved generalizability and tagging consistency


Bibliographic Details
Main authors:	Kim, Hyunjae; Sung, Mujeen; Yoon, Wonjin; Park, Sungjoon; Kang, Jaewoo
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Original Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9518746/ https://www.ncbi.nlm.nih.gov/pubmed/36170114 http://dx.doi.org/10.1093/database/baac074

Collection:	PubMed
Description:	Chemical identification involves finding chemical entities in text (i.e. named entity recognition) and assigning unique identifiers to those entities (i.e. named entity normalization). While current models are developed and evaluated on article titles and abstracts, their effectiveness has not been thoroughly verified on full text. In this paper, we identify two limitations of models in tagging full-text articles: (1) low generalizability to unseen mentions and (2) tagging inconsistency. We address these limitations with simple training and post-processing methods, such as transfer learning and mention-wise majority voting. We also present a hybrid model for the normalization task that utilizes the high recall of a neural model while maintaining the high precision of a dictionary model. In the BioCreative VII NLM-Chem track challenge, our best model achieves F1 scores of 86.72 and 78.31 in named entity recognition and normalization, respectively, significantly outperforming the median (83.73 and 77.49) and taking first place in named entity recognition. In a post-challenge evaluation, we re-implement our model and obtain an F1 score of 84.70 in the normalization task, outperforming the best score in the challenge by 3.34 points.
Database URL:	https://github.com/dmis-lab/bc7-chem-id
Record ID:	pubmed-9518746
Institution:	National Center for Biotechnology Information
Record format:	MEDLINE/PubMed
Journal:	Database (Oxford)
Publication date:	2022-09-28
License:	© The Author(s) 2022. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
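
The description above names mention-wise majority voting as a post-processing step for tagging consistency. The following is only a minimal sketch of that general idea, not the authors' implementation: the function name `majority_vote`, the `(surface_text, label)` input format, the labels `"Chemical"` and `"O"`, and the case-insensitive grouping are all assumptions made for illustration. Each mention in a document is relabeled with the label most often predicted for the same surface text.

```python
from collections import Counter, defaultdict


def majority_vote(mentions):
    """Relabel each mention with the most frequent label for its surface text.

    `mentions` is a list of (surface_text, label) pairs predicted over one
    document. Grouping is case-insensitive (an assumption of this sketch).
    """
    votes = defaultdict(Counter)
    for text, label in mentions:
        votes[text.lower()][label] += 1

    # Keep the original order and surface form; only the label changes.
    return [
        (text, votes[text.lower()].most_common(1)[0][0])
        for text, _ in mentions
    ]


# Hypothetical predictions for a single full-text article.
predictions = [
    ("aspirin", "Chemical"),
    ("Aspirin", "Chemical"),
    ("aspirin", "O"),  # inconsistent tag for the same mention
    ("water", "O"),
]
consistent = majority_vote(predictions)
```

In this sketch the third `aspirin` mention is relabeled `"Chemical"` because two of its three occurrences carry that label. The repository listed under Database URL is the place to check how the authors actually group mentions and resolve ties.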
