A document processing pipeline for annotating chemical entities in scientific documents
BACKGROUND: The recognition of drugs and chemical entities in text is a very important task within the field of biomedical information extraction, given the rapid growth in the amount of published texts (scientific papers, patents, patient records) and the relevance of these and other related concep...
Main authors: | Campos, David; Matos, Sérgio; Oliveira, José L |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2015 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331697/ https://www.ncbi.nlm.nih.gov/pubmed/25810778 http://dx.doi.org/10.1186/1758-2946-7-S1-S7 |
_version_ | 1782357761090125824 |
---|---|
author | Campos, David Matos, Sérgio Oliveira, José L |
author_facet | Campos, David Matos, Sérgio Oliveira, José L |
author_sort | Campos, David |
collection | PubMed |
description | BACKGROUND: The recognition of drugs and chemical entities in text is a very important task within the field of biomedical information extraction, given the rapid growth in the amount of published texts (scientific papers, patents, patient records) and the relevance of these and other related concepts. If done effectively, this could allow exploiting such textual resources to automatically extract or infer relevant information, such as drug profiles, relations and similarities between drugs, or associations between drugs and potential drug targets. The objective of this work was to develop and validate a document processing and information extraction pipeline for the identification of chemical entity mentions in text. RESULTS: We used the BioCreative IV CHEMDNER task data to train and evaluate a machine-learning based entity recognition system. Using a combination of two conditional random field models, a selected set of features, and a post-processing stage, we achieved F-measure results of 87.48% in the chemical entity mention recognition task and 87.75% in the chemical document indexing task. CONCLUSIONS: We present a machine learning-based solution for automatic recognition of chemical and drug names in scientific documents. The proposed approach applies a rich feature set, including linguistic, orthographic, morphological, dictionary matching and local context features. Post-processing modules are also integrated, performing parentheses correction, abbreviation resolution and filtering erroneous mentions using an exclusion list derived from the training data. The developed methods were implemented as a document annotation tool and web service, freely available at http://bioinformatics.ua.pt/becas-chemicals/. |
format | Online Article Text |
id | pubmed-4331697 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4331697 2015-03-25 A document processing pipeline for annotating chemical entities in scientific documents Campos, David Matos, Sérgio Oliveira, José L J Cheminform Research BACKGROUND: The recognition of drugs and chemical entities in text is a very important task within the field of biomedical information extraction, given the rapid growth in the amount of published texts (scientific papers, patents, patient records) and the relevance of these and other related concepts. If done effectively, this could allow exploiting such textual resources to automatically extract or infer relevant information, such as drug profiles, relations and similarities between drugs, or associations between drugs and potential drug targets. The objective of this work was to develop and validate a document processing and information extraction pipeline for the identification of chemical entity mentions in text. RESULTS: We used the BioCreative IV CHEMDNER task data to train and evaluate a machine-learning based entity recognition system. Using a combination of two conditional random field models, a selected set of features, and a post-processing stage, we achieved F-measure results of 87.48% in the chemical entity mention recognition task and 87.75% in the chemical document indexing task. CONCLUSIONS: We present a machine learning-based solution for automatic recognition of chemical and drug names in scientific documents. The proposed approach applies a rich feature set, including linguistic, orthographic, morphological, dictionary matching and local context features. Post-processing modules are also integrated, performing parentheses correction, abbreviation resolution and filtering erroneous mentions using an exclusion list derived from the training data. The developed methods were implemented as a document annotation tool and web service, freely available at http://bioinformatics.ua.pt/becas-chemicals/.
BioMed Central 2015-01-19 /pmc/articles/PMC4331697/ /pubmed/25810778 http://dx.doi.org/10.1186/1758-2946-7-S1-S7 Text en Copyright © 2015 Campos et al.; licensee Springer. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Campos, David Matos, Sérgio Oliveira, José L A document processing pipeline for annotating chemical entities in scientific documents |
title | A document processing pipeline for annotating chemical entities in scientific documents |
title_full | A document processing pipeline for annotating chemical entities in scientific documents |
title_fullStr | A document processing pipeline for annotating chemical entities in scientific documents |
title_full_unstemmed | A document processing pipeline for annotating chemical entities in scientific documents |
title_short | A document processing pipeline for annotating chemical entities in scientific documents |
title_sort | document processing pipeline for annotating chemical entities in scientific documents |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331697/ https://www.ncbi.nlm.nih.gov/pubmed/25810778 http://dx.doi.org/10.1186/1758-2946-7-S1-S7 |
work_keys_str_mv | AT camposdavid adocumentprocessingpipelineforannotatingchemicalentitiesinscientificdocuments AT matossergio adocumentprocessingpipelineforannotatingchemicalentitiesinscientificdocuments AT oliveirajosel adocumentprocessingpipelineforannotatingchemicalentitiesinscientificdocuments AT camposdavid documentprocessingpipelineforannotatingchemicalentitiesinscientificdocuments AT matossergio documentprocessingpipelineforannotatingchemicalentitiesinscientificdocuments AT oliveirajosel documentprocessingpipelineforannotatingchemicalentitiesinscientificdocuments |
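
The abstract mentions a post-processing stage that handles parentheses and filters erroneous mentions against an exclusion list. The sketch below is only an illustration of that general idea, not the authors' implementation: the function names `parentheses_balanced` and `filter_mentions` are hypothetical, and where the abstract speaks of correcting parentheses, this simplified version merely discards mentions whose brackets do not balance.

```python
def parentheses_balanced(mention: str) -> bool:
    """Return True if every bracket in the mention is opened and closed in order."""
    pairs = {")": "(", "]": "[", "}": "{"}
    openers = set(pairs.values())
    stack = []
    for ch in mention:
        if ch in openers:
            stack.append(ch)
        elif ch in pairs:
            # A closing bracket must match the most recent unclosed opener.
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack


def filter_mentions(mentions, exclusion_list):
    """Keep mentions that have balanced brackets and are not on the exclusion list.

    Assumption: the exclusion list is compared case-insensitively; the source
    does not say how the comparison is done.
    """
    excluded = {term.lower() for term in exclusion_list}
    return [
        m for m in mentions
        if parentheses_balanced(m) and m.lower() not in excluded
    ]


candidates = ["aspirin", "water", "2-(acetyloxy)benzoic acid", "(R"]
kept = filter_mentions(candidates, exclusion_list=["Water"])
```

Here `kept` retains "aspirin" and "2-(acetyloxy)benzoic acid"; "water" is dropped by the exclusion list and "(R" by the bracket check.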