
Chemical identification and indexing in PubMed full-text articles using deep learning and heuristics

The identification of chemicals in articles has attracted considerable interest in the biomedical scientific community, given its importance in drug development research. Most previous research has focused on PubMed abstracts, and further investigation using full-text documents is required because th...


Bibliographic Details
Main authors:	Almeida, Tiago; Antunes, Rui; F. Silva, João; Almeida, João R; Matos, Sérgio
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Original Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9248917/ https://www.ncbi.nlm.nih.gov/pubmed/35776534 http://dx.doi.org/10.1093/database/baac047

collection	PubMed
description	The identification of chemicals in articles has attracted considerable interest in the biomedical scientific community, given its importance in drug development research. Most previous research has focused on PubMed abstracts, and further investigation using full-text documents is required because these contain additional valuable information that must be explored. Expert manual indexing of these articles with Medical Subject Headings (MeSH) terms later helps researchers find the most relevant publications for their ongoing work. The BioCreative VII NLM-Chem track fostered the development of systems for chemical identification and indexing in PubMed full-text articles. Chemical identification consisted of identifying the chemical mentions and linking them to unique MeSH identifiers. This manuscript describes the system we submitted and the post-challenge improvements we made. We propose a three-stage pipeline that performs chemical mention detection, entity normalization and indexing as separate steps. For chemical identification, we adopted a deep-learning solution that feeds PubMedBERT contextualized embeddings into a multilayer perceptron followed by a conditional random field tagging layer. For normalization, we use sieve-based dictionary filtering followed by a deep-learning similarity search strategy. Finally, for indexing we developed rules for identifying the most relevant MeSH codes for each article. During the challenge, our system obtained the best official results in the normalization and indexing tasks despite lower performance in the chemical mention recognition task. In a post-contest phase we boosted our results by improving our named entity recognition model with additional techniques. The final system achieved 0.8731, 0.8275 and 0.4849 in the chemical identification, normalization and indexing tasks, respectively. The code to reproduce our experiments and run the pipeline is publicly available.
Database URL: https://github.com/bioinformatics-ua/biocreativeVII_track2
id	pubmed-9248917
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-9248917 2022-07-05. Database (Oxford). Original Article. Oxford University Press, 2022-07-01. Text en. © The Author(s) 2022. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
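
The description field above mentions sieve-based dictionary filtering for normalization. As a rough illustration of that general idea only, and not the authors' implementation, the following Python sketch tries a sequence of progressively looser dictionary lookups and stops at the first hit. The function names, the order and choice of sieves, and the three-entry dictionary are assumptions made for this example. The deep-learning similarity search that the record describes as the follow-up step is not included; mentions that no sieve resolves are simply returned as unresolved.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Toy dictionary for illustration only: surface form -> MeSH descriptor ID.
MESH_DICT: Dict[str, str] = {
    "Aspirin": "D001241",
    "Acetaminophen": "D000082",
    "Ethanol": "D000431",
}

Sieve = Tuple[str, Callable[[str], Optional[str]]]


def _alnum_key(text: str) -> str:
    """Lowercase the text and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def build_sieves(dictionary: Dict[str, str]) -> List[Sieve]:
    """Return lookups ordered from strictest to loosest (hypothetical order)."""
    exact = dict(dictionary)
    lowered = {name.lower(): mesh for name, mesh in dictionary.items()}
    stripped = {_alnum_key(name): mesh for name, mesh in dictionary.items()}
    return [
        ("exact", lambda m: exact.get(m)),
        ("lowercase", lambda m: lowered.get(m.lower())),
        ("alphanumeric", lambda m: stripped.get(_alnum_key(m))),
    ]


def normalize(mention: str, sieves: List[Sieve]) -> Tuple[Optional[str], Optional[str]]:
    """Return (MeSH ID, name of the sieve that matched), or (None, None)."""
    for name, lookup in sieves:
        mesh_id = lookup(mention)
        if mesh_id is not None:
            return mesh_id, name
    # Unresolved: a fuller pipeline would pass this mention to a fallback step.
    return None, None


if __name__ == "__main__":
    sieves = build_sieves(MESH_DICT)
    for mention in ["Aspirin", "ethanol", "Acet-aminophen", "caffeine"]:
        print(mention, normalize(mention, sieves))
```

Ordering the sieves from strict to loose means a mention is accepted by the most precise rule that can resolve it, and recording which sieve fired makes it easier to audit the looser matches.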
