Full-text chemical identification with improved generalizability and tagging consistency

Chemical identification involves finding chemical entities in text (i.e. named entity recognition) and assigning unique identifiers to the entities (i.e. named entity normalization). While current models are developed and evaluated based on article titles and abstracts, their effectiveness has not been thoroughly verified in full text.

Bibliographic Details
Main Authors: Kim, Hyunjae; Sung, Mujeen; Yoon, Wonjin; Park, Sungjoon; Kang, Jaewoo
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9518746/
https://www.ncbi.nlm.nih.gov/pubmed/36170114
http://dx.doi.org/10.1093/database/baac074
author Kim, Hyunjae
Sung, Mujeen
Yoon, Wonjin
Park, Sungjoon
Kang, Jaewoo
collection PubMed
description Chemical identification involves finding chemical entities in text (i.e. named entity recognition) and assigning unique identifiers to the entities (i.e. named entity normalization). While current models are developed and evaluated based on article titles and abstracts, their effectiveness has not been thoroughly verified in full text. In this paper, we identify two limitations of models in tagging full-text articles: (1) low generalizability to unseen mentions and (2) tagging inconsistency. We address these limitations with simple training and post-processing methods, such as transfer learning and mention-wise majority voting. We also present a hybrid model for the normalization task that utilizes the high recall of a neural model while maintaining the high precision of a dictionary model. In the BioCreative VII NLM-Chem track challenge, our best model achieves F1 scores of 86.72 and 78.31 in named entity recognition and normalization, respectively, significantly outperforming the median (F1 scores of 83.73 and 77.49) and taking first place in named entity recognition. In a post-challenge evaluation, we re-implement our model and obtain an F1 score of 84.70 in the normalization task, outperforming the best score in the challenge by 3.34 F1 points. Database URL: https://github.com/dmis-lab/bc7-chem-id
format Online
Article
Text
id pubmed-9518746
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9518746 2022-09-29 Database (Oxford) Original Article Oxford University Press 2022-09-28 /pmc/articles/PMC9518746/ /pubmed/36170114 http://dx.doi.org/10.1093/database/baac074 Text en © The Author(s) 2022. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
topic Original Article