Disambiguating the species of biomedical named entities using natural language parsers

Motivation: Text mining technologies have been shown to reduce the laborious work involved in organizing the vast amount of information hidden in the literature. One challenge in text mining is linking ambiguous word forms to unambiguous biological concepts. This article reports on a comprehensive s...

Bibliographic Details
Main Authors: Wang, Xinglong; Tsujii, Jun'ichi; Ananiadou, Sophia
Format: Text
Language: English
Published: Oxford University Press 2010
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2828111/
https://www.ncbi.nlm.nih.gov/pubmed/20053840
http://dx.doi.org/10.1093/bioinformatics/btq002
author Wang, Xinglong
Tsujii, Jun'ichi
Ananiadou, Sophia
collection PubMed
description Motivation: Text mining technologies have been shown to reduce the laborious work involved in organizing the vast amount of information hidden in the literature. One challenge in text mining is linking ambiguous word forms to unambiguous biological concepts. This article reports on a comprehensive study on resolving the ambiguity in mentions of biomedical named entities with respect to model organisms and presents an array of approaches, with focus on methods utilizing natural language parsers. Results: We build a corpus for organism disambiguation where every occurrence of protein/gene entity is manually tagged with a species ID, and evaluate a number of methods on it. Promising results are obtained by training a machine learning model on syntactic parse trees, which is then used to decide whether an entity belongs to the model organism denoted by a neighbouring species-indicating word (e.g. yeast). The parser-based approaches are also compared with a supervised classification method and results indicate that the former are a more favorable choice when domain portability is of concern. The best overall performance is obtained by combining the strengths of syntactic features and supervised classification. Availability: The corpus and demo are available at http://www.nactem.ac.uk/deca_details/start.cgi, and the software is freely available as U-Compare components (Kano et al., 2009): NaCTeM Species Word Detector and NaCTeM Species Disambiguator. U-Compare is available at http://u-compare.org/ Contact: xinglong.wang@manchester.ac.uk
format Text
id pubmed-2828111
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-28281112010-02-25 Disambiguating the species of biomedical named entities using natural language parsers Wang, Xinglong Tsujii, Jun'ichi Ananiadou, Sophia Bioinformatics Original Papers Motivation: Text mining technologies have been shown to reduce the laborious work involved in organizing the vast amount of information hidden in the literature. One challenge in text mining is linking ambiguous word forms to unambiguous biological concepts. This article reports on a comprehensive study on resolving the ambiguity in mentions of biomedical named entities with respect to model organisms and presents an array of approaches, with focus on methods utilizing natural language parsers. Results: We build a corpus for organism disambiguation where every occurrence of protein/gene entity is manually tagged with a species ID, and evaluate a number of methods on it. Promising results are obtained by training a machine learning model on syntactic parse trees, which is then used to decide whether an entity belongs to the model organism denoted by a neighbouring species-indicating word (e.g. yeast). The parser-based approaches are also compared with a supervised classification method and results indicate that the former are a more favorable choice when domain portability is of concern. The best overall performance is obtained by combining the strengths of syntactic features and supervised classification. Availability: The corpus and demo are available at http://www.nactem.ac.uk/deca_details/start.cgi, and the software is freely available as U-Compare components (Kano et al., 2009): NaCTeM Species Word Detector and NaCTeM Species Disambiguator. U-Compare is available at http://u-compare.org/ Contact: xinglong.wang@manchester.ac.uk Oxford University Press 2010-03-01 2010-01-06 /pmc/articles/PMC2828111/ /pubmed/20053840 http://dx.doi.org/10.1093/bioinformatics/btq002 Text en © The Author(s) 2010. Published by Oxford University Press.
http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Disambiguating the species of biomedical named entities using natural language parsers
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2828111/
https://www.ncbi.nlm.nih.gov/pubmed/20053840
http://dx.doi.org/10.1093/bioinformatics/btq002