An open-source framework for large-scale, flexible evaluation of biomedical text mining systems
BACKGROUND: Improved evaluation methodologies have been identified as a necessary prerequisite to the improvement of text mining theory and practice. This paper presents a publicly available framework that facilitates thorough, structured, and large-scale evaluations of text mining technologies. The...
Main authors: | Baumgartner, William A; Cohen, K Bretonnel; Hunter, Lawrence |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2008 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2276192/ https://www.ncbi.nlm.nih.gov/pubmed/18230184 http://dx.doi.org/10.1186/1747-5333-3-1 |
_version_ | 1782151975047004160 |
---|---|
author | Baumgartner, William A; Cohen, K Bretonnel; Hunter, Lawrence |
author_facet | Baumgartner, William A; Cohen, K Bretonnel; Hunter, Lawrence |
author_sort | Baumgartner, William A |
collection | PubMed |
description | BACKGROUND: Improved evaluation methodologies have been identified as a necessary prerequisite to the improvement of text mining theory and practice. This paper presents a publicly available framework that facilitates thorough, structured, and large-scale evaluations of text mining technologies. The extensibility of this framework and its ability to uncover system-wide characteristics by analyzing component parts as well as its usefulness for facilitating third-party application integration are demonstrated through examples in the biomedical domain. RESULTS: Our evaluation framework was assembled using the Unstructured Information Management Architecture. It was used to analyze a set of gene mention identification systems involving 225 combinations of system, evaluation corpus, and correctness measure. Interactions between all three were found to affect the relative rankings of the systems. A second experiment evaluated gene normalization system performance using as input 4,097 combinations of gene mention systems and gene mention system-combining strategies. Gene mention system recall is shown to affect gene normalization system performance much more than does gene mention system precision, and high gene normalization performance is shown to be achievable with remarkably low levels of gene mention system precision. CONCLUSION: The software presented in this paper demonstrates the potential for novel discovery resulting from the structured evaluation of biomedical language processing systems, as well as the usefulness of such an evaluation framework for promoting collaboration between developers of biomedical language processing technologies. The code base is available as part of the BioNLP UIMA Component Repository on SourceForge.net. |
format | Text |
id | pubmed-2276192 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2008 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-22761922008-03-28 An open-source framework for large-scale, flexible evaluation of biomedical text mining systems Baumgartner, William A Cohen, K Bretonnel Hunter, Lawrence J Biomed Discov Collab Software BACKGROUND: Improved evaluation methodologies have been identified as a necessary prerequisite to the improvement of text mining theory and practice. This paper presents a publicly available framework that facilitates thorough, structured, and large-scale evaluations of text mining technologies. The extensibility of this framework and its ability to uncover system-wide characteristics by analyzing component parts as well as its usefulness for facilitating third-party application integration are demonstrated through examples in the biomedical domain. RESULTS: Our evaluation framework was assembled using the Unstructured Information Management Architecture. It was used to analyze a set of gene mention identification systems involving 225 combinations of system, evaluation corpus, and correctness measure. Interactions between all three were found to affect the relative rankings of the systems. A second experiment evaluated gene normalization system performance using as input 4,097 combinations of gene mention systems and gene mention system-combining strategies. Gene mention system recall is shown to affect gene normalization system performance much more than does gene mention system precision, and high gene normalization performance is shown to be achievable with remarkably low levels of gene mention system precision. CONCLUSION: The software presented in this paper demonstrates the potential for novel discovery resulting from the structured evaluation of biomedical language processing systems, as well as the usefulness of such an evaluation framework for promoting collaboration between developers of biomedical language processing technologies. The code base is available as part of the BioNLP UIMA Component Repository on SourceForge.net. 
BioMed Central 2008-01-29 /pmc/articles/PMC2276192/ /pubmed/18230184 http://dx.doi.org/10.1186/1747-5333-3-1 Text en Copyright © 2008 Baumgartner et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Software Baumgartner, William A Cohen, K Bretonnel Hunter, Lawrence An open-source framework for large-scale, flexible evaluation of biomedical text mining systems |
title | An open-source framework for large-scale, flexible evaluation of biomedical text mining systems |
title_sort | open-source framework for large-scale, flexible evaluation of biomedical text mining systems |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2276192/ https://www.ncbi.nlm.nih.gov/pubmed/18230184 http://dx.doi.org/10.1186/1747-5333-3-1 |