
Using cited references to improve the retrieval of related biomedical documents

BACKGROUND: A popular query from scientists reading a biomedical abstract is to search for topic-related documents in bibliographic databases. Such a query is challenging because the amount of information attached to a single abstract is little, whereas classification-based retrieval algorithms are...

Full description

Bibliographic Details
Main Authors: Ortuño, Francisco M, Rojas, Ignacio, Andrade-Navarro, Miguel A, Fontaine, Jean-Fred
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3618341/
https://www.ncbi.nlm.nih.gov/pubmed/23537461
http://dx.doi.org/10.1186/1471-2105-14-113
_version_ 1782265404843884544
author Ortuño, Francisco M
Rojas, Ignacio
Andrade-Navarro, Miguel A
Fontaine, Jean-Fred
author_facet Ortuño, Francisco M
Rojas, Ignacio
Andrade-Navarro, Miguel A
Fontaine, Jean-Fred
author_sort Ortuño, Francisco M
collection PubMed
description BACKGROUND: A popular query from scientists reading a biomedical abstract is to search for topic-related documents in bibliographic databases. Such a query is challenging because the amount of information attached to a single abstract is little, whereas classification-based retrieval algorithms are optimally trained with large sets of relevant documents. As a solution to this problem, we propose a query expansion method that extends the information related to a manuscript using its cited references. RESULTS: Data on cited references and text sections in 249,108 full-text biomedical articles was extracted from the Open Access subset of the PubMed Central® database (PMC-OA). Of the five standard sections of a scientific article, the Introduction and Discussion sections contained most of the citations (mean = 10.2 and 9.9 citations, respectively). A large proportion of articles (98.4%) and their cited references (79.5%) were indexed in the PubMed® database. Using the MedlineRanker abstract classification tool, cited references allowed accurate retrieval of the citing document in a test set of 10,000 documents and also of documents related to six biomedical topics defined by particular MeSH® terms from the entire PMC-OA (p-value<0.01). Classification performance was sensitive to the topic and also to the text sections from which the references were selected. Classifiers trained on the baseline (i.e., only text from the query document and not from the references) were outperformed in almost all the cases. Best performance was often obtained when using all cited references, though using the references from Introduction and Discussion sections led to similarly good results. This query expansion method performed significantly better than pseudo relevance feedback in 4 out of 6 topics. CONCLUSIONS: The retrieval of documents related to a single document can be significantly improved by using the references cited by this document (p-value<0.01). Using references from Introduction and Discussion performs almost as well as using all references, which might be useful for methods that require reduced datasets due to computational limitations. Cited references from particular sections might not be appropriate for all topics. Our method could be a better alternative to pseudo relevance feedback though it is limited by full text availability.
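The description above outlines the proposed query-expansion approach: the text of a query document is extended with the abstracts of its cited references (for example, those cited in the Introduction and Discussion) before documents are ranked or classified. The sketch below is a minimal, hypothetical illustration of that idea; it does not reproduce the authors' MedlineRanker-based pipeline, and it uses TF-IDF cosine similarity as a stand-in ranker with toy strings in place of PubMed abstracts.

```python
# Hypothetical sketch of query expansion with cited references.
# Not the authors' implementation: MedlineRanker (a word-based abstract
# classifier) is replaced here by TF-IDF cosine similarity, and the toy
# strings stand in for real PubMed abstracts.

from typing import Dict, List, Tuple
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def expand_query(query_abstract: str, cited_abstracts: List[str]) -> str:
    """Extend the query abstract with the abstracts of its cited references."""
    return " ".join([query_abstract] + cited_abstracts)


def rank_candidates(expanded_query: str,
                    candidates: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank candidate documents (pmid -> abstract) by similarity to the expanded query."""
    pmids = list(candidates)
    corpus = [expanded_query] + [candidates[p] for p in pmids]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    scores = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
    return sorted(zip(pmids, scores), key=lambda x: x[1], reverse=True)


if __name__ == "__main__":
    # Toy data standing in for a query abstract, its cited references,
    # and a pool of candidate documents to be retrieved.
    query = "Retrieval of topic-related biomedical documents from a single abstract."
    cited = [
        "Query expansion improves document retrieval in bibliographic databases.",
        "Text classification of biomedical abstracts using word features.",
    ]
    pool = {
        "11111": "A method for ranking biomedical abstracts by topical relevance.",
        "22222": "Crystal structure of a membrane transport protein.",
    }
    for pmid, score in rank_candidates(expand_query(query, cited), pool):
        print(pmid, round(score, 3))
```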
format Online
Article
Text
id pubmed-3618341
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3618341 2013-04-09 Using cited references to improve the retrieval of related biomedical documents Ortuño, Francisco M Rojas, Ignacio Andrade-Navarro, Miguel A Fontaine, Jean-Fred BMC Bioinformatics Methodology Article BACKGROUND: A popular query from scientists reading a biomedical abstract is to search for topic-related documents in bibliographic databases. Such a query is challenging because the amount of information attached to a single abstract is little, whereas classification-based retrieval algorithms are optimally trained with large sets of relevant documents. As a solution to this problem, we propose a query expansion method that extends the information related to a manuscript using its cited references. RESULTS: Data on cited references and text sections in 249,108 full-text biomedical articles was extracted from the Open Access subset of the PubMed Central® database (PMC-OA). Of the five standard sections of a scientific article, the Introduction and Discussion sections contained most of the citations (mean = 10.2 and 9.9 citations, respectively). A large proportion of articles (98.4%) and their cited references (79.5%) were indexed in the PubMed® database. Using the MedlineRanker abstract classification tool, cited references allowed accurate retrieval of the citing document in a test set of 10,000 documents and also of documents related to six biomedical topics defined by particular MeSH® terms from the entire PMC-OA (p-value<0.01). Classification performance was sensitive to the topic and also to the text sections from which the references were selected. Classifiers trained on the baseline (i.e., only text from the query document and not from the references) were outperformed in almost all the cases. Best performance was often obtained when using all cited references, though using the references from Introduction and Discussion sections led to similarly good results. This query expansion method performed significantly better than pseudo relevance feedback in 4 out of 6 topics. CONCLUSIONS: The retrieval of documents related to a single document can be significantly improved by using the references cited by this document (p-value<0.01). Using references from Introduction and Discussion performs almost as well as using all references, which might be useful for methods that require reduced datasets due to computational limitations. Cited references from particular sections might not be appropriate for all topics. Our method could be a better alternative to pseudo relevance feedback though it is limited by full text availability. BioMed Central 2013-03-27 /pmc/articles/PMC3618341/ /pubmed/23537461 http://dx.doi.org/10.1186/1471-2105-14-113 Text en Copyright © 2013 Ortuño et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methodology Article
Ortuño, Francisco M
Rojas, Ignacio
Andrade-Navarro, Miguel A
Fontaine, Jean-Fred
Using cited references to improve the retrieval of related biomedical documents
title Using cited references to improve the retrieval of related biomedical documents
title_full Using cited references to improve the retrieval of related biomedical documents
title_fullStr Using cited references to improve the retrieval of related biomedical documents
title_full_unstemmed Using cited references to improve the retrieval of related biomedical documents
title_short Using cited references to improve the retrieval of related biomedical documents
title_sort using cited references to improve the retrieval of related biomedical documents
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3618341/
https://www.ncbi.nlm.nih.gov/pubmed/23537461
http://dx.doi.org/10.1186/1471-2105-14-113
work_keys_str_mv AT ortunofranciscom usingcitedreferencestoimprovetheretrievalofrelatedbiomedicaldocuments
AT rojasignacio usingcitedreferencestoimprovetheretrievalofrelatedbiomedicaldocuments
AT andradenavarromiguela usingcitedreferencestoimprovetheretrievalofrelatedbiomedicaldocuments
AT fontainejeanfred usingcitedreferencestoimprovetheretrievalofrelatedbiomedicaldocuments