
Discovering themes in biomedical literature using a projection-based algorithm

BACKGROUND: The need to organize any large document collection in a manner that facilitates human comprehension has become crucial with the increasing volume of information available. Two common approaches to provide a broad overview of the information space are document clustering and topic modelin...


Bibliographic Details
Main authors:	Yeganova, Lana; Kim, Sun; Balasanov, Grigory; Wilbur, W. John
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6048865/ https://www.ncbi.nlm.nih.gov/pubmed/30012087 http://dx.doi.org/10.1186/s12859-018-2240-0

_version_	1783340178712035328
author	Yeganova, Lana Kim, Sun Balasanov, Grigory Wilbur, W. John
author_facet	Yeganova, Lana Kim, Sun Balasanov, Grigory Wilbur, W. John
author_sort	Yeganova, Lana
collection	PubMed
description	BACKGROUND: The need to organize any large document collection in a manner that facilitates human comprehension has become crucial with the increasing volume of information available. Two common approaches to provide a broad overview of the information space are document clustering and topic modeling. Clustering aims to group documents or terms into meaningful clusters. Topic modeling, on the other hand, focuses on finding coherent keywords for describing topics appearing in a set of documents. In addition, there have been efforts for clustering documents and finding keywords simultaneously. RESULTS: We present an algorithm to analyze document collections that is based on a notion of a theme, defined as a dual representation based on a set of documents and key terms. In this work, a novel vector space mechanism is proposed for computing themes. Starting with a single document, the theme algorithm treats terms and documents as explicit components, and iteratively uses each representation to refine the other until the theme is detected. The method heavily relies on an optimization routine that we refer to as the projection algorithm which, under specific conditions, is guaranteed to converge to the first singular vector of a data matrix. We apply our algorithm to a collection of about sixty thousand PubMed® documents examining the subject of Single Nucleotide Polymorphism, evaluate the results and show the effectiveness and scalability of the proposed method. CONCLUSIONS: This study presents a contribution on theoretical and algorithmic levels, as well as demonstrates the feasibility of the method for large scale applications. The evaluation of our system on benchmark datasets demonstrates that our method compares favorably with the current state-of-the-art methods in computing clusters of documents with coherent topic terms.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2240-0) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-6048865
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-60488652018-07-19 Discovering themes in biomedical literature using a projection-based algorithm Yeganova, Lana Kim, Sun Balasanov, Grigory Wilbur, W. John BMC Bioinformatics Research Article BACKGROUND: The need to organize any large document collection in a manner that facilitates human comprehension has become crucial with the increasing volume of information available. Two common approaches to provide a broad overview of the information space are document clustering and topic modeling. Clustering aims to group documents or terms into meaningful clusters. Topic modeling, on the other hand, focuses on finding coherent keywords for describing topics appearing in a set of documents. In addition, there have been efforts for clustering documents and finding keywords simultaneously. RESULTS: We present an algorithm to analyze document collections that is based on a notion of a theme, defined as a dual representation based on a set of documents and key terms. In this work, a novel vector space mechanism is proposed for computing themes. Starting with a single document, the theme algorithm treats terms and documents as explicit components, and iteratively uses each representation to refine the other until the theme is detected. The method heavily relies on an optimization routine that we refer to as the projection algorithm which, under specific conditions, is guaranteed to converge to the first singular vector of a data matrix. We apply our algorithm to a collection of about sixty thousand PubMed® documents examining the subject of Single Nucleotide Polymorphism, evaluate the results and show the effectiveness and scalability of the proposed method. CONCLUSIONS: This study presents a contribution on theoretical and algorithmic levels, as well as demonstrates the feasibility of the method for large scale applications. 
The evaluation of our system on benchmark datasets demonstrates that our method compares favorably with the current state-of-the-art methods in computing clusters of documents with coherent topic terms. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2240-0) contains supplementary material, which is available to authorized users. BioMed Central 2018-07-16 /pmc/articles/PMC6048865/ /pubmed/30012087 http://dx.doi.org/10.1186/s12859-018-2240-0 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Article Yeganova, Lana Kim, Sun Balasanov, Grigory Wilbur, W. John Discovering themes in biomedical literature using a projection-based algorithm
title	Discovering themes in biomedical literature using a projection-based algorithm
title_full	Discovering themes in biomedical literature using a projection-based algorithm
title_fullStr	Discovering themes in biomedical literature using a projection-based algorithm
title_full_unstemmed	Discovering themes in biomedical literature using a projection-based algorithm
title_short	Discovering themes in biomedical literature using a projection-based algorithm
title_sort	discovering themes in biomedical literature using a projection-based algorithm
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6048865/ https://www.ncbi.nlm.nih.gov/pubmed/30012087 http://dx.doi.org/10.1186/s12859-018-2240-0
work_keys_str_mv	AT yeganovalana discoveringthemesinbiomedicalliteratureusingaprojectionbasedalgorithm AT kimsun discoveringthemesinbiomedicalliteratureusingaprojectionbasedalgorithm AT balasanovgrigory discoveringthemesinbiomedicalliteratureusingaprojectionbasedalgorithm AT wilburwjohn discoveringthemesinbiomedicalliteratureusingaprojectionbasedalgorithm
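
Illustrative note: the description field above states that the projection algorithm, under specific conditions, converges to the first singular vector of a data matrix. This record does not give the algorithm itself, so the sketch below is NOT the authors' projection algorithm. It is the standard power iteration, shown only to make concrete what "first singular vector" means for a small document-by-term matrix. The function names, the toy matrix, and the parameters (iters, tol, seed) are all assumptions made for illustration.

```python
import math
import random


def matvec(A, x):
    """Return A x for a row-major list-of-lists matrix A."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Return A^T y without forming the transpose."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def norm(x):
    return math.sqrt(sum(c * c for c in x))


def first_singular_vector(A, iters=1000, tol=1e-12, seed=0):
    """Power iteration on A^T A (a textbook method, not the paper's routine).

    Returns (sigma, v): an estimate of the largest singular value of A and
    the matching unit-length right singular vector (one weight per column).
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in A[0]]  # random start vector
    n0 = norm(v)
    v = [c / n0 for c in v]
    for _ in range(iters):
        w = rmatvec(A, matvec(A, v))  # one multiplication by A^T A
        nw = norm(w)
        if nw == 0.0:  # A is the zero matrix; nothing to find
            break
        w = [c / nw for c in w]
        delta = norm([a - b for a, b in zip(w, v)])
        v = w
        if delta < tol:  # the direction has stopped changing
            break
    sigma = norm(matvec(A, v))
    return sigma, v


if __name__ == "__main__":
    # Hypothetical toy matrix: rows are documents, columns are terms.
    docs_by_terms = [
        [2.0, 1.0, 0.0],
        [1.0, 2.0, 0.0],
        [0.0, 0.0, 1.0],
    ]
    sigma, v = first_singular_vector(docs_by_terms)
    print(sigma, v)
```

In this toy case the first two terms co-occur in the first two documents, so the leading vector puts its weight on those two terms and almost none on the third. The overall sign of a singular vector is arbitrary, which is why only magnitudes should be compared.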
