
The BioPrompt-box: an ontology-based clustering tool for searching in biological databases

BACKGROUND: High-throughput molecular biology provides new data at an incredible rate, so that the increase in the size of biological databanks is enormous and very rapid. This scenario generates severe problems not only at indexing time, where suitable algorithmic techniques for data indexing and r...


Bibliographic Details
Main Authors: Corsi, Claudio; Ferragina, Paolo; Marangoni, Roberto
Format: Text
Language: English
Published: BioMed Central 2007
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1885860/
https://www.ncbi.nlm.nih.gov/pubmed/17430575
http://dx.doi.org/10.1186/1471-2105-8-S1-S8
author Corsi, Claudio
Ferragina, Paolo
Marangoni, Roberto
collection PubMed
description BACKGROUND: High-throughput molecular biology produces new data at such a rate that biological databanks are growing very rapidly. This growth creates severe problems not only at indexing time, where suitable algorithmic techniques for data indexing and retrieval are required, but also at query time, since a user query may return so many results that browsing and "understanding" them becomes humanly impractical. The problem is well known to the Web community, where a new generation of Web search engines, such as Vivisimo, is being developed. These tools organize the results of a user query on the fly into a hierarchy of labeled folders that eases browsing and knowledge extraction. We investigate this approach on biological data and propose the so-called BioPrompt-box software system, which deploys ontology-driven clustering strategies to make biologists' searches more efficient and effective. RESULTS: The BioPrompt-box (Bpb) defines a document as a biological sequence plus its associated meta-data taken from the underlying databank, such as references to ontologies or to external databanks, and plain text such as researchers' comments and the titles, abstracts, or even bodies of papers. Bpb offers several tools to customize the search and the clustering process over its indexed documents. The user can search for a set of keywords within a specific field of the document schema, or can execute Blast to find documents related to homologous sequences. In both cases the search returns a set of documents (hits) that constitute the answer to the user query. Since the number of hits may be large, Bpb clusters them into groups of homogeneous content, organized as a hierarchy of labeled clusters. The user can choose among several ontology-based hierarchical clustering strategies, each offering a different "view" of the returned hits.
Bpb computes these views by exploiting the meta-data present within the retrieved documents, such as the references to Gene Ontology, the taxonomy lineage, the organism, and the keywords. The approach is flexible enough to leave room for future additions of other meta-information. The ultimate goal of the clustering process is to give the user several different readings of the (possibly numerous) query results and to reveal hidden correlations among them, thus improving their browsing and understanding. CONCLUSION: Bpb is a search engine that makes it easy to perform complex queries over the indexed databanks (currently only UNIPROT is considered). The ontology-based clustering approach is efficient and effective, and could thus be applied successfully to larger databanks, like GenBank or EMBL.
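The abstract describes clustering hits into a hierarchy of labeled folders by using meta-data such as the taxonomy lineage. The following is a minimal sketch of that general idea only, not the authors' implementation: the record layout (`id`, `lineage`), the function names, and the sample hits are all hypothetical, and the record does not describe Bpb's actual algorithms.

```python
from collections import defaultdict


def cluster_by_lineage(hits):
    """Group hits into a nested folder tree keyed by taxonomy lineage.

    Each hit is assumed (hypothetically) to be a dict with an "id" and a
    "lineage" list ordered from the most general taxon to the most specific.
    """
    def new_node():
        return {"hits": [], "children": defaultdict(new_node)}

    root = new_node()
    for hit in hits:
        node = root
        for taxon in hit["lineage"]:
            node = node["children"][taxon]  # creates the folder on first use
        node["hits"].append(hit["id"])
    return root


def count_hits(node):
    """Count the hits stored in a folder and in all of its subfolders."""
    return len(node["hits"]) + sum(
        count_hits(child) for child in node["children"].values()
    )


def render(node, depth=0):
    """Return the folder labels as indented lines, with hit counts."""
    lines = []
    for label in sorted(node["children"]):
        child = node["children"][label]
        lines.append("  " * depth + f"{label} ({count_hits(child)})")
        lines.extend(render(child, depth + 1))
    return lines


if __name__ == "__main__":
    # Made-up hits for illustration only.
    sample_hits = [
        {"id": "hit-1", "lineage": ["Eukaryota", "Metazoa", "Homo sapiens"]},
        {"id": "hit-2", "lineage": ["Eukaryota", "Metazoa", "Mus musculus"]},
        {"id": "hit-3", "lineage": ["Bacteria", "Proteobacteria", "Escherichia coli"]},
    ]
    print("\n".join(render(cluster_by_lineage(sample_hits))))
```

The same tree-building step could be keyed on a different meta-data field (for example, a path of Gene Ontology terms) to obtain a different "view" of the same hits, under the same assumptions.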
format Text
id pubmed-1885860
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1885860 2007-06-05 The BioPrompt-box: an ontology-based clustering tool for searching in biological databases Corsi, Claudio Ferragina, Paolo Marangoni, Roberto BMC Bioinformatics Research BioMed Central 2007-03-08 /pmc/articles/PMC1885860/ /pubmed/17430575 http://dx.doi.org/10.1186/1471-2105-8-S1-S8 Text en Copyright © 2007 Corsi et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title The BioPrompt-box: an ontology-based clustering tool for searching in biological databases
topic Research