An open-source framework for large-scale, flexible evaluation of biomedical text mining systems

BACKGROUND: Improved evaluation methodologies have been identified as a necessary prerequisite to the improvement of text mining theory and practice. This paper presents a publicly available framework that facilitates thorough, structured, and large-scale evaluations of text mining technologies. The...

Bibliographic Details
Main authors: Baumgartner, William A; Cohen, K Bretonnel; Hunter, Lawrence
Format: Text
Language: English
Published: BioMed Central 2008
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2276192/
https://www.ncbi.nlm.nih.gov/pubmed/18230184
http://dx.doi.org/10.1186/1747-5333-3-1
Collection: PubMed
Abstract:
BACKGROUND: Improved evaluation methodologies have been identified as a necessary prerequisite to the improvement of text mining theory and practice. This paper presents a publicly available framework that facilitates thorough, structured, and large-scale evaluations of text mining technologies. The extensibility of this framework and its ability to uncover system-wide characteristics by analyzing component parts as well as its usefulness for facilitating third-party application integration are demonstrated through examples in the biomedical domain.
RESULTS: Our evaluation framework was assembled using the Unstructured Information Management Architecture. It was used to analyze a set of gene mention identification systems involving 225 combinations of system, evaluation corpus, and correctness measure. Interactions between all three were found to affect the relative rankings of the systems. A second experiment evaluated gene normalization system performance using as input 4,097 combinations of gene mention systems and gene mention system-combining strategies. Gene mention system recall is shown to affect gene normalization system performance much more than does gene mention system precision, and high gene normalization performance is shown to be achievable with remarkably low levels of gene mention system precision.
CONCLUSION: The software presented in this paper demonstrates the potential for novel discovery resulting from the structured evaluation of biomedical language processing systems, as well as the usefulness of such an evaluation framework for promoting collaboration between developers of biomedical language processing technologies. The code base is available as part of the BioNLP UIMA Component Repository on SourceForge.net.
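The abstract describes scoring every combination of system, evaluation corpus, and correctness measure. The following is a minimal Python sketch of that idea only, not the authors' UIMA-based framework. The two toy taggers, the two one-sentence corpora, the span-matching functions, and all names (`score`, `evaluate_grid`, `dictionary_tagger`, `pattern_tagger`) are hypothetical and chosen purely for illustration.

```python
import re
from itertools import product

# Correctness measures (hypothetical): how a predicted span is compared to a gold span.
def strict_match(pred, gold):
    """Spans match only if both character offsets agree."""
    return pred == gold

def overlap_match(pred, gold):
    """Spans match if they share at least one character."""
    return pred[0] < gold[1] and gold[0] < pred[1]

# Toy gene mention systems (hypothetical): text -> list of (start, end) spans.
def dictionary_tagger(text):
    return [(m.start(), m.end()) for m in re.finditer(r"\b(BRCA1|TP53)\b", text)]

def pattern_tagger(text):
    return [(m.start(), m.end()) for m in re.finditer(r"\b[A-Za-z]+[0-9]+\b", text)]

def score(system, corpus, match):
    """Return (precision, recall, F1) for one system on one corpus."""
    matched_pred = total_pred = matched_gold = total_gold = 0
    for text, gold_spans in corpus:
        pred_spans = system(text)
        total_pred += len(pred_spans)
        total_gold += len(gold_spans)
        # A prediction is correct if it matches any gold span in the same document.
        matched_pred += sum(any(match(p, g) for g in gold_spans) for p in pred_spans)
        # A gold span is found if any prediction matches it.
        matched_gold += sum(any(match(p, g) for p in pred_spans) for g in gold_spans)
    precision = matched_pred / total_pred if total_pred else 0.0
    recall = matched_gold / total_gold if total_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def evaluate_grid(systems, corpora, measures):
    """Score every (system, corpus, measure) combination."""
    return {
        (s, c, m): score(systems[s], corpora[c], measures[m])
        for s, c, m in product(systems, corpora, measures)
    }

# Toy corpora (hypothetical): each document is (text, gold spans).
corpora = {
    "corpus_a": [("BRCA1 interacts with TP53 in DNA repair.", [(0, 5), (21, 25)])],
    # corpus_b annotates the longer mention "p53 protein" rather than "p53".
    "corpus_b": [("The p53 protein binds MDM2.", [(4, 15), (22, 26)])],
}
systems = {"dictionary": dictionary_tagger, "pattern": pattern_tagger}
measures = {"strict": strict_match, "overlap": overlap_match}

results = evaluate_grid(systems, corpora, measures)
for key in sorted(results):
    print(key, results[key])
```

In this toy setup the pattern tagger scores lower on corpus_b under strict matching than under overlap matching, because that corpus annotates a longer span. This is one small way a system's score can depend jointly on corpus and correctness measure.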
Record ID: pubmed-2276192
Institution: National Center for Biotechnology Information
Record format: MEDLINE/PubMed
Journal: J Biomed Discov Collab
Category: Software
Record: pubmed-2276192, 2008-03-28
Published: BioMed Central, 2008-01-29
Copyright © 2008 Baumgartner et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.