
Using citation networks to evaluate the impact of text length on keyword extraction

The identification of key concepts within unstructured data is of paramount importance in practical applications. Despite the abundance of proposed methods for extracting primary topics, only a few works have investigated the influence of text length on the performance of keyword extraction (KE) methods....


Bibliographic Details
Main Authors: Tohalino, Jorge A. V., Silva, Thiago C., Amancio, Diego R.
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10681196/
https://www.ncbi.nlm.nih.gov/pubmed/38011182
http://dx.doi.org/10.1371/journal.pone.0294500
_version_ 1785150767141748736
author Tohalino, Jorge A. V.
Silva, Thiago C.
Amancio, Diego R.
collection PubMed
description The identification of key concepts within unstructured data is of paramount importance in practical applications. Despite the abundance of proposed methods for extracting primary topics, only a few works have investigated the influence of text length on the performance of keyword extraction (KE) methods. Specifically, many studies rely on abstracts and titles for content extraction from papers, leaving it uncertain whether leveraging the complete content of papers can yield consistent results. Hence, in this study, we employ a network-based approach to evaluate the concordance between keywords extracted from abstracts and those extracted from entire papers. Community detection methods are used to identify interconnected papers in citation networks. Subsequently, paper clusters are formed to identify salient terms within each cluster, employing a methodology akin to the term frequency-inverse document frequency (tf-idf) approach. Once each cluster has been assigned its distinctive set of key terms, these terms serve as representative keywords at the paper level: the top-ranked words at the cluster level that also appear in the abstract are chosen as keywords for the paper. Our findings indicate that the various community detection methods used in KE yield similar levels of accuracy. Notably, text clustering approaches outperform all citation-based methods, although all approaches yield relatively low accuracy values. We also identified a lack of concordance between keywords extracted from the abstracts and those extracted from the corresponding full-text source. Considering that citations and text clustering yield distinct outcomes, combining them in hybrid approaches could offer improved performance.
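The cluster-level keyword step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a paper-to-cluster assignment (`cluster_of`) has already been produced by some community detection method, treats each cluster as a single document for a plain tf-idf weight, and skips stop-word removal. All function names, parameters, and sample texts are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (no stop-word removal)."""
    return re.findall(r"[a-z]+", text.lower())


def rank_cluster_terms(texts, cluster_of):
    """Rank terms per cluster with a tf-idf-like weight.

    Each cluster is treated as one document: tf is the term's share of the
    cluster's tokens, idf is log(number of clusters / clusters containing the term).
    This weighting is an assumption, not the paper's exact formula.
    """
    term_counts = defaultdict(Counter)
    for paper, text in texts.items():
        term_counts[cluster_of[paper]].update(tokenize(text))

    n_clusters = len(term_counts)
    clusters_with_term = Counter()
    for counts in term_counts.values():
        clusters_with_term.update(counts.keys())

    ranked = {}
    for cluster, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_clusters / clusters_with_term[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically for determinism.
        ranked[cluster] = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked


def paper_keywords(abstract, ranked_terms, top_k=5):
    """Keep the top-ranked cluster terms that also occur in the paper's abstract."""
    in_abstract = set(tokenize(abstract))
    return [term for term in ranked_terms if term in in_abstract][:top_k]


# Hypothetical toy data: four abstracts already grouped into two clusters.
texts = {
    "p1": "community detection in citation networks",
    "p2": "modularity and community structure of networks",
    "p3": "keyword extraction from text with word embeddings",
    "p4": "text clustering for keyword extraction",
}
cluster_of = {"p1": 0, "p2": 0, "p3": 1, "p4": 1}

ranked = rank_cluster_terms(texts, cluster_of)
print(paper_keywords(texts["p1"], ranked[cluster_of["p1"]], top_k=2))
```

Passing the full text of each paper instead of its abstract to `rank_cluster_terms` would give the second keyword set whose agreement with the abstract-based set the study evaluates.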
format Online
Article
Text
id pubmed-10681196
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10681196 2023-11-27 Using citation networks to evaluate the impact of text length on keyword extraction Tohalino, Jorge A. V. Silva, Thiago C. Amancio, Diego R. PLoS One Research Article
Public Library of Science 2023-11-27 /pmc/articles/PMC10681196/ /pubmed/38011182 http://dx.doi.org/10.1371/journal.pone.0294500 Text en © 2023 Tohalino et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Using citation networks to evaluate the impact of text length on keyword extraction
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10681196/
https://www.ncbi.nlm.nih.gov/pubmed/38011182
http://dx.doi.org/10.1371/journal.pone.0294500