
Chemical entity extraction using CRF and an ensemble of extractors

BACKGROUND: As we are witnessing a great interest in identifying and extracting chemical entities in academic articles, many approaches have been proposed to solve this problem. In this work we describe a probabilistic framework that allows for the output of multiple information extraction systems t...


Bibliographic Details
Main Authors: Khabsa, Madian, Giles, C Lee
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331688/
https://www.ncbi.nlm.nih.gov/pubmed/25810769
http://dx.doi.org/10.1186/1758-2946-7-S1-S12
_version_ 1782357759024431104
author Khabsa, Madian
Giles, C Lee
author_facet Khabsa, Madian
Giles, C Lee
author_sort Khabsa, Madian
collection PubMed
description BACKGROUND: As we are witnessing a great interest in identifying and extracting chemical entities in academic articles, many approaches have been proposed to solve this problem. In this work we describe a probabilistic framework that allows for the output of multiple information extraction systems to be combined in a systematic way. The identified entities are assigned a probability score that reflects the extractors' confidence, without the need for each individual extractor to generate a probability score. We quantitatively compared the performance of multiple chemical tokenizers to measure the effect of tokenization on extraction accuracy. Later, a single Conditional Random Fields (CRF) extractor that utilizes the best-performing tokenizer is built using a unique collection of features such as word embeddings and Soundex codes, which, to the best of our knowledge, has not been explored in this context before. RESULTS: The ensemble of multiple extractors outperforms each extractor's individual performance during the CHEMDNER challenge. When the runs were optimized to favor recall, the ensemble approach achieved the second-highest recall on unseen entities. As for the single CRF model with novel features, the extractor achieves an F1 score of 83.3% on the test set, without any post-processing or abbreviation matching. CONCLUSIONS: Ensemble information extraction is effective when multiple standalone extractors are to be used, and produces higher performance than individual off-the-shelf extractors. The novel features introduced in the single CRF model are sufficient to achieve a very competitive F1 score using a simple standalone extractor.
format Online
Article
Text
id pubmed-4331688
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-43316882015-03-25 Chemical entity extraction using CRF and an ensemble of extractors Khabsa, Madian Giles, C Lee J Cheminform Research BACKGROUND: As we are witnessing a great interest in identifying and extracting chemical entities in academic articles, many approaches have been proposed to solve this problem. In this work we describe a probabilistic framework that allows for the output of multiple information extraction systems to be combined in a systematic way. The identified entities are assigned a probability score that reflects the extractors' confidence, without the need for each individual extractor to generate a probability score. We quantitatively compared the performance of multiple chemical tokenizers to measure the effect of tokenization on extraction accuracy. Later, a single Conditional Random Fields (CRF) extractor that utilizes the best-performing tokenizer is built using a unique collection of features such as word embeddings and Soundex codes, which, to the best of our knowledge, has not been explored in this context before. RESULTS: The ensemble of multiple extractors outperforms each extractor's individual performance during the CHEMDNER challenge. When the runs were optimized to favor recall, the ensemble approach achieved the second-highest recall on unseen entities. As for the single CRF model with novel features, the extractor achieves an F1 score of 83.3% on the test set, without any post-processing or abbreviation matching. CONCLUSIONS: Ensemble information extraction is effective when multiple standalone extractors are to be used, and produces higher performance than individual off-the-shelf extractors. The novel features introduced in the single CRF model are sufficient to achieve a very competitive F1 score using a simple standalone extractor. BioMed Central 2015-01-19 /pmc/articles/PMC4331688/ /pubmed/25810769 http://dx.doi.org/10.1186/1758-2946-7-S1-S12 Text en Copyright © 2015 Khabsa and Giles; licensee Springer.
http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Khabsa, Madian
Giles, C Lee
Chemical entity extraction using CRF and an ensemble of extractors
title Chemical entity extraction using CRF and an ensemble of extractors
title_full Chemical entity extraction using CRF and an ensemble of extractors
title_fullStr Chemical entity extraction using CRF and an ensemble of extractors
title_full_unstemmed Chemical entity extraction using CRF and an ensemble of extractors
title_short Chemical entity extraction using CRF and an ensemble of extractors
title_sort chemical entity extraction using crf and an ensemble of extractors
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331688/
https://www.ncbi.nlm.nih.gov/pubmed/25810769
http://dx.doi.org/10.1186/1758-2946-7-S1-S12
work_keys_str_mv AT khabsamadian chemicalentityextractionusingcrfandanensembleofextractors
AT gilesclee chemicalentityextractionusingcrfandanensembleofextractors