The BioLexicon: a large-scale terminological resource for biomedical text mining

BACKGROUND: Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them to search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow...


Bibliographic Details
Main Authors: Thompson, Paul, McNaught, John, Montemagni, Simonetta, Calzolari, Nicoletta, del Gratta, Riccardo, Lee, Vivian, Marchi, Simone, Monachini, Monica, Pezik, Piotr, Quochi, Valeria, Rupp, CJ, Sasaki, Yutaka, Venturi, Giulia, Rebholz-Schuhmann, Dietrich, Ananiadou, Sophia
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3228855/
https://www.ncbi.nlm.nih.gov/pubmed/21992002
http://dx.doi.org/10.1186/1471-2105-12-397
author Thompson, Paul
McNaught, John
Montemagni, Simonetta
Calzolari, Nicoletta
del Gratta, Riccardo
Lee, Vivian
Marchi, Simone
Monachini, Monica
Pezik, Piotr
Quochi, Valeria
Rupp, CJ
Sasaki, Yutaka
Venturi, Giulia
Rebholz-Schuhmann, Dietrich
Ananiadou, Sophia
author_sort Thompson, Paul
collection PubMed
description BACKGROUND: Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them to search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about words used in the biomedical literature. Existing databases and ontologies often have a specific focus and are oriented towards human use. Consequently, biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Additionally, such resources typically do not provide information about how terms relate to each other in texts to describe events.
RESULTS: This article provides an overview of the design, construction and evaluation of a large-scale lexical and conceptual resource for the biomedical domain, the BioLexicon. The resource can be exploited by text mining tools at several levels, e.g., part-of-speech tagging, recognition of biomedical entities, and the extraction of events in which they are involved. As such, the BioLexicon must account for real usage of words in biomedical texts. In particular, the BioLexicon gathers together different types of terms from several existing data resources into a single, unified repository, and augments them with new term variants automatically extracted from biomedical literature. Extraction of events is facilitated through the inclusion of biologically pertinent verbs (around which events are typically organized) together with information about typical patterns of grammatical and semantic behaviour, which are acquired from domain-specific texts. In order to foster interoperability, the BioLexicon is modelled using the Lexical Markup Framework, an ISO standard.
CONCLUSIONS: The BioLexicon contains over 2.2 M lexical entries and over 1.8 M terminological variants, as well as over 3.3 M semantic relations, including over 2 M synonymy relations. Its exploitation can benefit both application developers and users. We demonstrate some such benefits by describing integration of the resource into a number of different tools, and evaluating improvements in performance that this can bring.
format Online
Article
Text
id pubmed-3228855
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3228855 2011-12-12 The BioLexicon: a large-scale terminological resource for biomedical text mining BMC Bioinformatics Research Article BioMed Central 2011-10-12 /pmc/articles/PMC3228855/ /pubmed/21992002 http://dx.doi.org/10.1186/1471-2105-12-397 Text en Copyright ©2011 Thompson et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title The BioLexicon: a large-scale terminological resource for biomedical text mining
topic Research Article