
A document processing pipeline for annotating chemical entities in scientific documents


Bibliographic Details
Main Authors: Campos, David, Matos, Sérgio, Oliveira, José L
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331697/
https://www.ncbi.nlm.nih.gov/pubmed/25810778
http://dx.doi.org/10.1186/1758-2946-7-S1-S7
author Campos, David
Matos, Sérgio
Oliveira, José L
collection PubMed
description BACKGROUND: The recognition of drugs and chemical entities in text is a very important task within the field of biomedical information extraction, given the rapid growth in the amount of published texts (scientific papers, patents, patient records) and the relevance of these and other related concepts. If done effectively, this could allow exploiting such textual resources to automatically extract or infer relevant information, such as drug profiles, relations and similarities between drugs, or associations between drugs and potential drug targets. The objective of this work was to develop and validate a document processing and information extraction pipeline for the identification of chemical entity mentions in text. RESULTS: We used the BioCreative IV CHEMDNER task data to train and evaluate a machine-learning based entity recognition system. Using a combination of two conditional random field models, a selected set of features, and a post-processing stage, we achieved F-measure results of 87.48% in the chemical entity mention recognition task and 87.75% in the chemical document indexing task. CONCLUSIONS: We present a machine learning-based solution for automatic recognition of chemical and drug names in scientific documents. The proposed approach applies a rich feature set, including linguistic, orthographic, morphological, dictionary matching and local context features. Post-processing modules are also integrated, performing parentheses correction, abbreviation resolution and filtering erroneous mentions using an exclusion list derived from the training data. The developed methods were implemented as a document annotation tool and web service, freely available at http://bioinformatics.ua.pt/becas-chemicals/.
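The post-processing stage mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the sample mentions, and the exclusion list are hypothetical. The sketch also simplifies one step, since it discards mentions with unbalanced brackets rather than correcting them as the abstract describes, and it omits abbreviation resolution entirely.

```python
def parentheses_balanced(mention: str) -> bool:
    """Return True if every bracket in the mention is properly paired."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in mention:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            # A closing bracket must match the most recent opening bracket.
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack


def post_process(mentions, exclusion_list):
    """Filter candidate mentions (hypothetical helper, not the paper's code).

    Drops mentions found in the exclusion list (case-insensitive) and,
    as a simplification, mentions whose brackets are unbalanced.
    """
    excluded = {term.lower() for term in exclusion_list}
    kept = []
    for mention in mentions:
        if mention.lower() in excluded:
            continue
        if not parentheses_balanced(mention):
            continue
        kept.append(mention)
    return kept


if __name__ == "__main__":
    # Illustrative inputs only; they are not taken from the CHEMDNER data.
    candidates = ["ethanol", "Water", "2-(4-chlorophenyl", "poly(ethylene glycol)"]
    print(post_process(candidates, exclusion_list=["water"]))
```

In a setup like the one described, the exclusion list would be derived from false positives observed on the training data, and the filter would run on the output of the entity recognition models.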
format Online
Article
Text
id pubmed-4331697
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4331697 2015-03-25 A document processing pipeline for annotating chemical entities in scientific documents Campos, David Matos, Sérgio Oliveira, José L J Cheminform Research
BioMed Central 2015-01-19 /pmc/articles/PMC4331697/ /pubmed/25810778 http://dx.doi.org/10.1186/1758-2946-7-S1-S7 Text en Copyright © 2015 Campos et al.; licensee Springer. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title A document processing pipeline for annotating chemical entities in scientific documents
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331697/
https://www.ncbi.nlm.nih.gov/pubmed/25810778
http://dx.doi.org/10.1186/1758-2946-7-S1-S7