Retrieval with gene queries

BACKGROUND: Accuracy of document retrieval from MEDLINE for gene queries is crucially important for many applications in bioinformatics. We explore five information retrieval-based methods to rank documents retrieved by PubMed gene queries for the human genome. The aim is to rank relevant documents...

Bibliographic Details
Main authors: Sehgal, Aditya K; Srinivasan, Padmini
Format: Text
Language: English
Published: BioMed Central 2006
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1482725/
https://www.ncbi.nlm.nih.gov/pubmed/16630348
http://dx.doi.org/10.1186/1471-2105-7-220
_version_ 1782128297465872384
author Sehgal, Aditya K
Srinivasan, Padmini
collection PubMed
description BACKGROUND: Accuracy of document retrieval from MEDLINE for gene queries is crucially important for many applications in bioinformatics. We explore five information retrieval-based methods to rank documents retrieved by PubMed gene queries for the human genome. The aim is to rank relevant documents higher in the retrieved list. We address the special challenges faced due to ambiguity in gene nomenclature: gene terms that refer to multiple genes, gene terms that are also English words, and gene terms that have other biological meanings. RESULTS: Our two baseline ranking strategies are quite similar in performance. Two of our three LocusLink-based strategies offer significant improvements. These methods work very well even when there is ambiguity in the gene terms. Our best ranking strategy offers significant improvements on three different kinds of ambiguities over our two baseline strategies (improvements range from 15.9% to 17.7% and 11.7% to 13.3% depending on the baseline). For most genes the best ranking query is one that is built from the LocusLink (now Entrez Gene) summary and product information along with the gene names and aliases. For others, the gene names and aliases suffice. We also present an approach that successfully predicts, for a given gene, which of these two ranking queries is more appropriate. CONCLUSION: We explore the effect of different post-retrieval strategies on the ranking of documents returned by PubMed for human gene queries. We have successfully applied some of these strategies to improve the ranking of relevant documents in the retrieved sets. This holds true even when various kinds of ambiguity are encountered. We feel that it would be very useful to apply strategies like ours on PubMed search results as these are not ordered by relevance in any way. This is especially so for queries that retrieve a large number of documents.
format Text
id pubmed-1482725
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1482725 2006-06-24 Retrieval with gene queries Sehgal, Aditya K Srinivasan, Padmini BMC Bioinformatics Research Article BioMed Central 2006-04-21 /pmc/articles/PMC1482725/ /pubmed/16630348 http://dx.doi.org/10.1186/1471-2105-7-220 Text en Copyright © 2006 Sehgal and Srinivasan; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Retrieval with gene queries
topic Research Article