
Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

BACKGROUND: We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis...


Bibliographic Details
Main authors: Boyack, Kevin W., Newman, David, Duhon, Russell J., Klavans, Richard, Patek, Michael, Biberstine, Joseph R., Schijvenaars, Bob, Skupin, André, Ma, Nianli, Börner, Katy
Format: Text
Language: English
Published: Public Library of Science 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3060097/
https://www.ncbi.nlm.nih.gov/pubmed/21437291
http://dx.doi.org/10.1371/journal.pone.0018029
author Boyack, Kevin W.
Newman, David
Duhon, Russell J.
Klavans, Richard
Patek, Michael
Biberstine, Joseph R.
Schijvenaars, Bob
Skupin, André
Ma, Nianli
Börner, Katy
collection PubMed
description BACKGROUND: We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents. METHODOLOGY: We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches combined five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models – BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE.
CONCLUSIONS: PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
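The first technique named in the abstract, tf-idf cosine similarity followed by top-n filtering, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`tfidf_vectors`, `cosine`, `top_n_similarities`), the plain `tf * log(N / df)` weighting, and the four toy documents are all assumptions made for illustration, and the all-pairs loop would not scale to 2.15 million records.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document.

    Assumption: raw term count times log(N / document frequency).
    The record does not state which tf-idf variant the study used.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_similarities(vectors, n):
    """For each document, keep only its n most similar other documents.

    Pairs with zero similarity are dropped. Brute force over all pairs,
    so this is suitable for toy data only.
    """
    result = {}
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[i] = [(j, s) for s, j in scored[:n] if s > 0.0]
    return result


# Hypothetical toy "titles", not taken from the MEDLINE corpus.
docs = [
    "gene expression in breast cancer".split(),
    "breast cancer gene mutation".split(),
    "cardiac surgery patient outcomes".split(),
    "patient outcomes after cardiac arrest".split(),
]
vectors = tfidf_vectors(docs)
neighbors = top_n_similarities(vectors, n=1)
for doc_id, pairs in neighbors.items():
    print(doc_id, pairs)
```

With `n=1`, each toy document keeps a single neighbor, which is the kind of sparse similarity list that a later clustering step would take as input.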
format Text
id pubmed-3060097
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3060097 2011-03-23 PLoS One Research Article Public Library of Science 2011-03-17 /pmc/articles/PMC3060097/ /pubmed/21437291 http://dx.doi.org/10.1371/journal.pone.0018029 Text en Boyack et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3060097/
https://www.ncbi.nlm.nih.gov/pubmed/21437291
http://dx.doi.org/10.1371/journal.pone.0018029