
A LDA-based approach to promoting ranking diversity for genomics information retrieval


Bibliographic Details
Main authors: Chen, Yan; Yin, Xiaoshi; Li, Zhoujun; Hu, Xiaohua; Huang, Jimmy Xiangji
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3394425/
https://www.ncbi.nlm.nih.gov/pubmed/22759611
http://dx.doi.org/10.1186/1471-2164-13-S3-S2
author Chen, Yan
Yin, Xiaoshi
Li, Zhoujun
Hu, Xiaohua
Huang, Jimmy Xiangji
collection PubMed
description BACKGROUND: The biomedical domain holds an immense amount of data, and the number of genomics and biomedical publications is growing rapidly. This wealth of information has led to increasing interest in, and need for, information retrieval (IR) techniques for accessing the scientific literature in genomics and related biomedical disciplines. In many cases, the information a biologist's query seeks is a list of entities of a certain type that covers different aspects related to the question, such as cells, genes, diseases, proteins, or mutations. Hence, it is important for a biomedical IR system to provide relevant and diverse answers that fulfill biologists' information needs. However, traditional IR models consider only the relevance between the retrieved documents and the user query, and do not take redundancy among the retrieved documents into account. This leads to high redundancy and low diversity in the retrieved ranked lists. RESULTS: In this paper, we propose an approach that employs a generative topic model, Latent Dirichlet Allocation (LDA), to promote ranking diversity for biomedical information retrieval. Unlike other approaches or models, which consider aspects at the word level, our approach assumes that aspects should be identified by the topics of the retrieved documents. We use the LDA model to discover the topic distribution of the retrieved passages and the word distribution of each topic dimension, and then re-rank the retrieval results by the topic-distribution similarity between passages within a sliding window of size N. We evaluate our approach on the TREC 2007 Genomics collection with two distinct IR baseline runs, and it achieves an 8% improvement over the highest Aspect MAP reported in the TREC 2007 Genomics track.
CONCLUSIONS: The proposed method is the first study to adopt a topic model for genomics information retrieval, and it demonstrates effectiveness both in promoting ranking diversity and in improving the relevance of ranked lists in genomics search. Moreover, we propose a distance measure, a modified Euclidean distance, that quantifies how much a passage can increase topical diversity by considering both the topical importance and the topical coefficient given by LDA.
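The re-ranking idea in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formula of the modified Euclidean distance or the exact window procedure, so the weighted distance, the greedy selection rule, the function names (`weighted_distance`, `rerank`), and the toy topic distributions below are all assumptions made for illustration. In practice the topic distributions would come from a fitted LDA model.

```python
import math

def weighted_distance(p, q, weights):
    """Euclidean distance between two topic distributions with a per-topic
    weight. This is an assumed stand-in for the record's 'modified Euclidean
    distance'; the abstract does not state the actual formula."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(p, q, weights)))

def rerank(passages, topic_dists, weights, window=3):
    """Greedy re-ranking (hypothetical procedure): at each step, look at the
    next `window` passages in baseline order and pick the one whose topic
    distribution is farthest from every passage already selected."""
    remaining = list(passages)
    ranked = []
    while remaining:
        candidates = remaining[:window]
        if not ranked:
            best = candidates[0]  # keep the top baseline result in first place
        else:
            def novelty(pid):
                # Distance to the closest already-ranked passage.
                return min(
                    weighted_distance(topic_dists[pid], topic_dists[r], weights)
                    for r in ranked
                )
            best = max(candidates, key=novelty)
        ranked.append(best)
        remaining.remove(best)
    return ranked

# Toy data (invented for illustration): passage ids in baseline order and
# their distributions over three topics.
baseline = ["p1", "p2", "p3", "p4"]
topic_dists = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.8, 0.2, 0.0],  # nearly the same topics as p1
    "p3": [0.1, 0.8, 0.1],
    "p4": [0.0, 0.1, 0.9],
}
weights = [1.0, 1.0, 1.0]  # assumed uniform topic importance

print(rerank(baseline, topic_dists, weights, window=3))
# -> ['p1', 'p4', 'p3', 'p2']
```

In this toy run the near-duplicate passage `p2` drops to the bottom, while a window of size 1 leaves the baseline order unchanged.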
format Online
Article
Text
id pubmed-3394425
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3394425 2012-07-16 BMC Genomics Proceedings BioMed Central 2012-06-11 /pmc/articles/PMC3394425/ /pubmed/22759611 http://dx.doi.org/10.1186/1471-2164-13-S3-S2 Text en Copyright © 2012 Chen et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title A LDA-based approach to promoting ranking diversity for genomics information retrieval
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3394425/
https://www.ncbi.nlm.nih.gov/pubmed/22759611
http://dx.doi.org/10.1186/1471-2164-13-S3-S2