Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature

We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementa...

Bibliographic Details
Main Authors: Müller, Hans-Michael; Kenny, Eimear E; Sternberg, Paul W
Format: Text
Language: English
Published: Public Library of Science 2004
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC517822/
https://www.ncbi.nlm.nih.gov/pubmed/15383839
http://dx.doi.org/10.1371/journal.pbio.0020309
_version_ 1782121790811668480
author Müller, Hans-Michael
Kenny, Eimear E
Sternberg, Paul W
author_facet Müller, Hans-Michael
Kenny, Eimear E
Sternberg, Paul W
author_sort Müller, Hans-Michael
collection PubMed
description We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. Textpresso is a useful curation tool, as well as search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org.
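The category-based sentence search that the description outlines can be illustrated with a minimal sketch. This is not Textpresso's implementation: the two-category lexicon, the gene names abc-1 and xyz-2, the example sentences, and the function names split_sentences, tag_sentence, and search are all hypothetical, and the sentence splitter is deliberately naive. The sketch only shows the idea of marking up sentences with ontology categories and then querying for a combination of categories and keywords within one sentence.

```python
import re

# Hypothetical mini-lexicon (category -> terms). Illustrative only; the real
# lexicon is described as having 14,500 entries across 33 categories.
LEXICON = {
    "gene": {"abc-1", "xyz-2"},  # made-up gene names
    "regulation": {"regulates", "represses", "activates"},
}


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_sentence(sentence):
    """Return {category: [matched terms]} for lexicon terms found as tokens."""
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence.lower())
    tags = {}
    for category, terms in LEXICON.items():
        hits = [tok for tok in tokens if tok in terms]
        if hits:
            tags[category] = hits
    return tags


def search(sentences, min_counts=None, keywords=()):
    """Return sentences that contain at least min_counts[c] terms of every
    category c and that also contain every keyword (case-insensitive)."""
    min_counts = min_counts or {}
    results = []
    for sentence in sentences:
        tags = tag_sentence(sentence)
        lowered = sentence.lower()
        has_categories = all(
            len(tags.get(cat, [])) >= n for cat, n in min_counts.items()
        )
        has_keywords = all(kw.lower() in lowered for kw in keywords)
        if has_categories and has_keywords:
            results.append(sentence)
    return results


# Made-up three-sentence corpus.
corpus = (
    "abc-1 activates xyz-2 in the tissue. "
    "The animals were grown on plates. "
    "xyz-2 is a receptor."
)
sentences = split_sentences(corpus)

# Query in the spirit of "two genes and an interaction term in one sentence".
hits = search(sentences, {"gene": 2, "regulation": 1})
```

Requiring two gene terms plus one regulation term in the same sentence keeps only the first sentence of the made-up corpus; a query for one gene term alone would also return the third.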
format Text
id pubmed-517822
institution National Center for Biotechnology Information
language English
publishDate 2004
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-517822 2004-09-21 Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature Müller, Hans-Michael Kenny, Eimear E Sternberg, Paul W PLoS Biol Research Article Public Library of Science 2004-11 2004-09-21 /pmc/articles/PMC517822/ /pubmed/15383839 http://dx.doi.org/10.1371/journal.pbio.0020309 Text en Copyright: © 2004 Müller et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC517822/
https://www.ncbi.nlm.nih.gov/pubmed/15383839
http://dx.doi.org/10.1371/journal.pbio.0020309