Benchmarking natural-language parsers for biological applications using dependency graphs

BACKGROUND: Interest is growing in the application of syntactic parsers to natural language processing problems in biology, but assessing their performance is difficult because differences in linguistic convention can falsely appear to be errors. We present a method for evaluating their accuracy usi...

Bibliographic Details
Main Authors: Clegg, Andrew B, Shepherd, Adrian J
Format: Text
Language: English
Published: BioMed Central 2007
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1797812/
https://www.ncbi.nlm.nih.gov/pubmed/17254351
http://dx.doi.org/10.1186/1471-2105-8-24
_version_ 1782132316311650304
author Clegg, Andrew B
Shepherd, Adrian J
author_facet Clegg, Andrew B
Shepherd, Adrian J
author_sort Clegg, Andrew B
collection PubMed
description BACKGROUND: Interest is growing in the application of syntactic parsers to natural language processing problems in biology, but assessing their performance is difficult because differences in linguistic convention can falsely appear to be errors. We present a method for evaluating their accuracy using an intermediate representation based on dependency graphs, in which the semantic relationships important in most information extraction tasks are closer to the surface. We also demonstrate how this method can be easily tailored to various application-driven criteria. RESULTS: Using the GENIA corpus as a gold standard, we tested four open-source parsers which have been used in bioinformatics projects. We first present overall performance measures, and test the two leading tools, the Charniak-Lease and Bikel parsers, on subtasks tailored to reflect the requirements of a system for extracting gene expression relationships. These two tools clearly outperform the other parsers in the evaluation, and achieve accuracy levels comparable to or exceeding native dependency parsers on similar tasks in previous biological evaluations. CONCLUSION: Evaluating using dependency graphs allows parsers to be tested easily on criteria chosen according to the semantics of particular biological applications, drawing attention to important mistakes and soaking up many insignificant differences that would otherwise be reported as errors. Generating high-accuracy dependency graphs from the output of phrase-structure parsers also provides access to the more detailed syntax trees that are used in several natural-language processing techniques.
format Text
id pubmed-1797812
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-17978122007-02-16 Benchmarking natural-language parsers for biological applications using dependency graphs Clegg, Andrew B Shepherd, Adrian J BMC Bioinformatics Research Article BACKGROUND: Interest is growing in the application of syntactic parsers to natural language processing problems in biology, but assessing their performance is difficult because differences in linguistic convention can falsely appear to be errors. We present a method for evaluating their accuracy using an intermediate representation based on dependency graphs, in which the semantic relationships important in most information extraction tasks are closer to the surface. We also demonstrate how this method can be easily tailored to various application-driven criteria. RESULTS: Using the GENIA corpus as a gold standard, we tested four open-source parsers which have been used in bioinformatics projects. We first present overall performance measures, and test the two leading tools, the Charniak-Lease and Bikel parsers, on subtasks tailored to reflect the requirements of a system for extracting gene expression relationships. These two tools clearly outperform the other parsers in the evaluation, and achieve accuracy levels comparable to or exceeding native dependency parsers on similar tasks in previous biological evaluations. CONCLUSION: Evaluating using dependency graphs allows parsers to be tested easily on criteria chosen according to the semantics of particular biological applications, drawing attention to important mistakes and soaking up many insignificant differences that would otherwise be reported as errors. Generating high-accuracy dependency graphs from the output of phrase-structure parsers also provides access to the more detailed syntax trees that are used in several natural-language processing techniques. 
BioMed Central 2007-01-25 /pmc/articles/PMC1797812/ /pubmed/17254351 http://dx.doi.org/10.1186/1471-2105-8-24 Text en Copyright © 2007 Clegg and Shepherd; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Clegg, Andrew B
Shepherd, Adrian J
Benchmarking natural-language parsers for biological applications using dependency graphs
title Benchmarking natural-language parsers for biological applications using dependency graphs
title_full Benchmarking natural-language parsers for biological applications using dependency graphs
title_fullStr Benchmarking natural-language parsers for biological applications using dependency graphs
title_full_unstemmed Benchmarking natural-language parsers for biological applications using dependency graphs
title_short Benchmarking natural-language parsers for biological applications using dependency graphs
title_sort benchmarking natural-language parsers for biological applications using dependency graphs
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1797812/
https://www.ncbi.nlm.nih.gov/pubmed/17254351
http://dx.doi.org/10.1186/1471-2105-8-24
work_keys_str_mv AT cleggandrewb benchmarkingnaturallanguageparsersforbiologicalapplicationsusingdependencygraphs
AT shepherdadrianj benchmarkingnaturallanguageparsersforbiologicalapplicationsusingdependencygraphs