Full-text chemical identification with improved generalizability and tagging consistency
Chemical identification involves finding chemical entities in text (i.e. named entity recognition) and assigning unique identifiers to the entities (i.e. named entity normalization). While current models are developed and evaluated based on article titles and abstracts, their effectiveness has not b...
Main authors: | Kim, Hyunjae; Sung, Mujeen; Yoon, Wonjin; Park, Sungjoon; Kang, Jaewoo |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9518746/ https://www.ncbi.nlm.nih.gov/pubmed/36170114 http://dx.doi.org/10.1093/database/baac074 |
_version_ | 1784799255123197952 |
---|---|
author | Kim, Hyunjae Sung, Mujeen Yoon, Wonjin Park, Sungjoon Kang, Jaewoo |
author_facet | Kim, Hyunjae Sung, Mujeen Yoon, Wonjin Park, Sungjoon Kang, Jaewoo |
author_sort | Kim, Hyunjae |
collection | PubMed |
description | Chemical identification involves finding chemical entities in text (i.e. named entity recognition) and assigning unique identifiers to the entities (i.e. named entity normalization). While current models are developed and evaluated based on article titles and abstracts, their effectiveness has not been thoroughly verified in full text. In this paper, we identify two limitations of models in tagging full-text articles: (1) low generalizability to unseen mentions and (2) tagging inconsistency. We use simple training and post-processing methods to address the limitations such as transfer learning and mention-wise majority voting. We also present a hybrid model for the normalization task that utilizes the high recall of a neural model while maintaining the high precision of a dictionary model. In the BioCreative VII NLM-Chem track challenge, our best model achieves 86.72 and 78.31 F1 scores in named entity recognition and normalization, significantly outperforming the median (83.73 and 77.49 F1 scores) and taking first place in named entity recognition. In a post-challenge evaluation, we re-implement our model and obtain 84.70 F1 score in the normalization task, outperforming the best score in the challenge by 3.34 F1 score. Database URL: https://github.com/dmis-lab/bc7-chem-id |
format | Online Article Text |
id | pubmed-9518746 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-9518746 2022-09-29 Full-text chemical identification with improved generalizability and tagging consistency Kim, Hyunjae Sung, Mujeen Yoon, Wonjin Park, Sungjoon Kang, Jaewoo Database (Oxford) Original Article Chemical identification involves finding chemical entities in text (i.e. named entity recognition) and assigning unique identifiers to the entities (i.e. named entity normalization). While current models are developed and evaluated based on article titles and abstracts, their effectiveness has not been thoroughly verified in full text. In this paper, we identify two limitations of models in tagging full-text articles: (1) low generalizability to unseen mentions and (2) tagging inconsistency. We use simple training and post-processing methods to address the limitations such as transfer learning and mention-wise majority voting. We also present a hybrid model for the normalization task that utilizes the high recall of a neural model while maintaining the high precision of a dictionary model. In the BioCreative VII NLM-Chem track challenge, our best model achieves 86.72 and 78.31 F1 scores in named entity recognition and normalization, significantly outperforming the median (83.73 and 77.49 F1 scores) and taking first place in named entity recognition. In a post-challenge evaluation, we re-implement our model and obtain 84.70 F1 score in the normalization task, outperforming the best score in the challenge by 3.34 F1 score. Database URL: https://github.com/dmis-lab/bc7-chem-id Oxford University Press 2022-09-28 /pmc/articles/PMC9518746/ /pubmed/36170114 http://dx.doi.org/10.1093/database/baac074 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Original Article Kim, Hyunjae Sung, Mujeen Yoon, Wonjin Park, Sungjoon Kang, Jaewoo Full-text chemical identification with improved generalizability and tagging consistency |
title | Full-text chemical identification with improved generalizability and tagging consistency |
title_full | Full-text chemical identification with improved generalizability and tagging consistency |
title_fullStr | Full-text chemical identification with improved generalizability and tagging consistency |
title_full_unstemmed | Full-text chemical identification with improved generalizability and tagging consistency |
title_short | Full-text chemical identification with improved generalizability and tagging consistency |
title_sort | full-text chemical identification with improved generalizability and tagging consistency |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9518746/ https://www.ncbi.nlm.nih.gov/pubmed/36170114 http://dx.doi.org/10.1093/database/baac074 |
work_keys_str_mv | AT kimhyunjae fulltextchemicalidentificationwithimprovedgeneralizabilityandtaggingconsistency AT sungmujeen fulltextchemicalidentificationwithimprovedgeneralizabilityandtaggingconsistency AT yoonwonjin fulltextchemicalidentificationwithimprovedgeneralizabilityandtaggingconsistency AT parksungjoon fulltextchemicalidentificationwithimprovedgeneralizabilityandtaggingconsistency AT kangjaewoo fulltextchemicalidentificationwithimprovedgeneralizabilityandtaggingconsistency |
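
The description above names two techniques without detailing them: mention-wise majority voting and a hybrid dictionary/neural normalizer. The Python sketch below illustrates one plausible reading of each and is not the authors' implementation. The function names, the case-insensitive matching, the dictionary-first fallback order, and the placeholder identifiers are all assumptions made for illustration; the record does not specify any of them.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple


def majority_vote(mentions: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Give every occurrence of a mention the label most often predicted for it.

    `mentions` holds (surface text, predicted label) pairs from one document.
    Grouping by lowercased text is an assumption of this sketch.
    """
    votes: Dict[str, Counter] = defaultdict(Counter)
    for text, label in mentions:
        votes[text.lower()][label] += 1
    # Relabel each occurrence with the top-voted label for its surface form.
    return [(text, votes[text.lower()].most_common(1)[0][0]) for text, _ in mentions]


def hybrid_normalize(
    mention: str,
    dictionary: Dict[str, str],
    neural_fallback: Callable[[str], str],
) -> str:
    """Prefer an exact dictionary hit (precision); otherwise ask the neural model (recall).

    `neural_fallback` is a hypothetical stand-in for a trained normalization model.
    """
    key = mention.lower()
    if key in dictionary:
        return dictionary[key]
    return neural_fallback(mention)


if __name__ == "__main__":
    # Placeholder labels and identifiers, not real annotations.
    doc = [("aspirin", "Chemical"), ("Aspirin", "O"), ("aspirin", "Chemical")]
    print(majority_vote(doc))
    lookup = {"aspirin": "ID:0001"}
    print(hybrid_normalize("Aspirin", lookup, lambda m: "ID:neural-guess"))
```

In this reading, voting removes inconsistent tags for the same string within one article, and the fallback order lets the dictionary decide whenever it has an entry. The repository linked in the record is the place to check how the authors actually implement both steps.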