
A document processing pipeline for annotating chemical entities in scientific documents

BACKGROUND: The recognition of drugs and chemical entities in text is a very important task within the field of biomedical information extraction, given the rapid growth in the amount of published texts (scientific papers, patents, patient records) and the relevance of these and other related concep...


Bibliographic Details
Main Authors:	Campos, David; Matos, Sérgio; Oliveira, José L
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331697/ https://www.ncbi.nlm.nih.gov/pubmed/25810778 http://dx.doi.org/10.1186/1758-2946-7-S1-S7

_version_	1782357761090125824
author	Campos, David; Matos, Sérgio; Oliveira, José L
author_facet	Campos, David; Matos, Sérgio; Oliveira, José L
author_sort	Campos, David
collection	PubMed
description	BACKGROUND: The recognition of drugs and chemical entities in text is a very important task within the field of biomedical information extraction, given the rapid growth in the amount of published texts (scientific papers, patents, patient records) and the relevance of these and other related concepts. If done effectively, this could allow exploiting such textual resources to automatically extract or infer relevant information, such as drug profiles, relations and similarities between drugs, or associations between drugs and potential drug targets. The objective of this work was to develop and validate a document processing and information extraction pipeline for the identification of chemical entity mentions in text. RESULTS: We used the BioCreative IV CHEMDNER task data to train and evaluate a machine-learning based entity recognition system. Using a combination of two conditional random field models, a selected set of features, and a post-processing stage, we achieved F-measure results of 87.48% in the chemical entity mention recognition task and 87.75% in the chemical document indexing task. CONCLUSIONS: We present a machine learning-based solution for automatic recognition of chemical and drug names in scientific documents. The proposed approach applies a rich feature set, including linguistic, orthographic, morphological, dictionary matching and local context features. Post-processing modules are also integrated, performing parentheses correction, abbreviation resolution and filtering erroneous mentions using an exclusion list derived from the training data. The developed methods were implemented as a document annotation tool and web service, freely available at http://bioinformatics.ua.pt/becas-chemicals/.
format	Online Article Text
id	pubmed-4331697
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4331697 2015-03-25 J Cheminform Research BioMed Central 2015-01-19 /pmc/articles/PMC4331697/ /pubmed/25810778 http://dx.doi.org/10.1186/1758-2946-7-S1-S7 Text en Copyright © 2015 Campos et al.; licensee Springer. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	A document processing pipeline for annotating chemical entities in scientific documents
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331697/ https://www.ncbi.nlm.nih.gov/pubmed/25810778 http://dx.doi.org/10.1186/1758-2946-7-S1-S7
