
Multi-stage gene normalization for full-text articles with context-based species filtering for dynamic dictionary entry selection

BACKGROUND: Gene normalization (GN) is the task of identifying the unique database IDs of genes and proteins in literature. The best-known public competition of GN systems is the GN task of the BioCreative challenge, which has been held four times since 2003. The last two BioCreatives, II.5 & II...


Bibliographic Details
Main authors: Tsai, Richard Tzong-Han, Lai, Po-Ting
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3269942/
https://www.ncbi.nlm.nih.gov/pubmed/22151087
http://dx.doi.org/10.1186/1471-2105-12-S8-S7
author Tsai, Richard Tzong-Han
Lai, Po-Ting
author_facet Tsai, Richard Tzong-Han
Lai, Po-Ting
author_sort Tsai, Richard Tzong-Han
collection PubMed
description BACKGROUND: Gene normalization (GN) is the task of identifying the unique database IDs of genes and proteins in literature. The best-known public competition of GN systems is the GN task of the BioCreative challenge, which has been held four times since 2003. The last two BioCreatives, II.5 and III, differed from earlier tasks in two significant ways: first, they provided full-length articles in addition to abstracts; second, they included multiple species without providing species ID information. Full papers introduce more complex targets for GN processing, while the inclusion of multiple species vastly increases the potential size of the dictionaries needed for GN. BioCreative III GN uses Threshold Average Precision at a median of k errors per query (TAP-k), a new measure closely related to the well-known average precision, but also reflecting the reliability of the score provided by each GN system.
RESULTS: To use full-paper text, we employed a multi-stage GN algorithm and a ranking method that exploit information in different sections and parts of a paper. To handle the inclusion of multiple unknown species, we developed two context-based dynamic strategies, section-wide and article-wide context, to select dictionary entries related to the species that appear in the paper. Our originally submitted BioCreative III system uses a static dictionary containing only the most common species entries. It already exceeds the BioCreative III average team performance by at least 24% in every evaluation. However, using our proposed dynamic dictionary strategies, we were able to further improve TAP-5, TAP-10, and TAP-20 by 16.47%, 13.57%, and 6.01%, respectively, on the Gold 50 test set. Our best dynamic strategy outperforms the best BioCreative III systems in TAP-10 on the Silver 50 test set and in TAP-5 on the Silver 507 set.
CONCLUSIONS: Our experimental results demonstrate the superiority of our proposed dynamic dictionary selection strategies over our original static strategy and most BioCreative III participant systems. The section-wide dynamic strategy is preferred because it achieves TAP-k scores very similar to those of the article-wide dynamic strategy while being more efficient.
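The two dynamic dictionary strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cue-word table, the `DictEntry` type, the placeholder gene IDs, and the function names are all assumptions made for illustration, and species detection is reduced to simple keyword matching. The sketch only shows the difference in scope: section-wide selection keeps entries for species seen in the same section, while article-wide selection keeps entries for species seen anywhere in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictEntry:
    gene_id: str   # placeholder ID, not a real database identifier
    name: str      # gene name as it might appear in text
    species: str   # species the dictionary entry belongs to


# Hypothetical cue words mapping a token to a species name.
SPECIES_CUES = {
    "human": "Homo sapiens",
    "patient": "Homo sapiens",
    "mouse": "Mus musculus",
    "mice": "Mus musculus",
    "yeast": "Saccharomyces cerevisiae",
}


def detect_species(text):
    """Return the set of species whose cue words occur in the text."""
    tokens = {tok.strip(".,;:()").lower() for tok in text.split()}
    return {SPECIES_CUES[t] for t in tokens if t in SPECIES_CUES}


def select_entries(dictionary, species):
    """Keep only dictionary entries belonging to the given species."""
    return [e for e in dictionary if e.species in species]


def section_wide(dictionary, sections):
    """One filtered dictionary per section, based on that section's species."""
    return {
        name: select_entries(dictionary, detect_species(text))
        for name, text in sections.items()
    }


def article_wide(dictionary, sections):
    """One filtered dictionary for the article, based on all species seen."""
    species = set()
    for text in sections.values():
        species |= detect_species(text)
    return select_entries(dictionary, species)


# Toy data, invented for this example.
DICTIONARY = [
    DictEntry("G1", "TP53", "Homo sapiens"),
    DictEntry("G2", "Trp53", "Mus musculus"),
    DictEntry("G3", "CDC28", "Saccharomyces cerevisiae"),
]
SECTIONS = {
    "introduction": "TP53 is frequently mutated in human tumours.",
    "methods": "Knockout mice were generated for Trp53.",
}

per_section = section_wide(DICTIONARY, SECTIONS)
whole_article = article_wide(DICTIONARY, SECTIONS)
```

In this toy setting the section-wide dictionaries are smaller than the article-wide one, which is one plausible reading of why the abstract calls the section-wide strategy more efficient; the source does not state the reason.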
format Online
Article
Text
id pubmed-3269942
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3269942 2012-02-02 Multi-stage gene normalization for full-text articles with context-based species filtering for dynamic dictionary entry selection Tsai, Richard Tzong-Han Lai, Po-Ting BMC Bioinformatics Research BioMed Central 2011-10-03 /pmc/articles/PMC3269942/ /pubmed/22151087 http://dx.doi.org/10.1186/1471-2105-12-S8-S7 Text en Copyright ©2011 Tsai and Lai; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research
Tsai, Richard Tzong-Han
Lai, Po-Ting
Multi-stage gene normalization for full-text articles with context-based species filtering for dynamic dictionary entry selection
title Multi-stage gene normalization for full-text articles with context-based species filtering for dynamic dictionary entry selection
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3269942/
https://www.ncbi.nlm.nih.gov/pubmed/22151087
http://dx.doi.org/10.1186/1471-2105-12-S8-S7