Cargando…

Evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding

A textual data processing task that involves the automatic extraction of relevant and salient keyphrases from a document that expresses all the important concepts of the document is called keyphrase extraction. Due to technological advancements, the amount of textual information on the Internet is r...

Descripción completa

Detalles Bibliográficos
Autores principales:	Sarwar, Talha Bin, Noor, Noorhuzaimi Mohd, Saef Ullah Miah, M.
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	PeerJ Inc. 2022
Materias:	Computational Linguistics
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9299267/ https://www.ncbi.nlm.nih.gov/pubmed/35875631 http://dx.doi.org/10.7717/peerj-cs.1024

_version_	1784750927753773056
author	Sarwar, Talha Bin Noor, Noorhuzaimi Mohd Saef Ullah Miah, M.
author_facet	Sarwar, Talha Bin Noor, Noorhuzaimi Mohd Saef Ullah Miah, M.
author_sort	Sarwar, Talha Bin
collection	PubMed
description	A textual data processing task that involves the automatic extraction of relevant and salient keyphrases from a document that expresses all the important concepts of the document is called keyphrase extraction. Due to technological advancements, the amount of textual information on the Internet is rapidly increasing as a lot of textual information is processed online in various domains such as offices, news portals, or for research purposes. Given the exponential increase of news articles on the Internet, manually searching for similar news articles by reading the entire news content that matches the user’s interests has become a time-consuming and tedious task. Therefore, automatically finding similar news articles can be a significant task in text processing. In this context, keyphrase extraction algorithms can extract information from news articles. However, selecting the most appropriate algorithm is also a problem. Therefore, this study analyzes various supervised and unsupervised keyphrase extraction algorithms, namely KEA, KP-Miner, YAKE, MultipartiteRank, TopicRank, and TeKET, which are used to extract keyphrases from news articles. The extracted keyphrases are used to compute lexical and semantic similarity to find similar news articles. The lexical similarity is calculated using the Cosine and Jaccard similarity techniques. In addition, semantic similarity is calculated using a word embedding technique called Word2Vec in combination with the Cosine similarity measure. The experimental results show that the KP-Miner keyphrase extraction algorithm, together with the Cosine similarity calculation using Word2Vec (Cosine-Word2Vec), outperforms the other combinations of keyphrase extraction algorithms and similarity calculation techniques to find similar news articles. The similar articles identified using KPMiner and the Cosine similarity measure with Word2Vec appear to be relevant to a particular news article and thus show satisfactory performance with a Normalized Discounted Cumulative Gain (NDCG) value of 0.97. This study proposes a method for finding similar news articles that can be used in conjunction with other methods already in use.
format	Online Article Text
id	pubmed-9299267
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-92992672022-07-21 Evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding Sarwar, Talha Bin Noor, Noorhuzaimi Mohd Saef Ullah Miah, M. PeerJ Comput Sci Computational Linguistics A textual data processing task that involves the automatic extraction of relevant and salient keyphrases from a document that expresses all the important concepts of the document is called keyphrase extraction. Due to technological advancements, the amount of textual information on the Internet is rapidly increasing as a lot of textual information is processed online in various domains such as offices, news portals, or for research purposes. Given the exponential increase of news articles on the Internet, manually searching for similar news articles by reading the entire news content that matches the user’s interests has become a time-consuming and tedious task. Therefore, automatically finding similar news articles can be a significant task in text processing. In this context, keyphrase extraction algorithms can extract information from news articles. However, selecting the most appropriate algorithm is also a problem. Therefore, this study analyzes various supervised and unsupervised keyphrase extraction algorithms, namely KEA, KP-Miner, YAKE, MultipartiteRank, TopicRank, and TeKET, which are used to extract keyphrases from news articles. The extracted keyphrases are used to compute lexical and semantic similarity to find similar news articles. The lexical similarity is calculated using the Cosine and Jaccard similarity techniques. In addition, semantic similarity is calculated using a word embedding technique called Word2Vec in combination with the Cosine similarity measure. The experimental results show that the KP-Miner keyphrase extraction algorithm, together with the Cosine similarity calculation using Word2Vec (Cosine-Word2Vec), outperforms the other combinations of keyphrase extraction algorithms and similarity calculation techniques to find similar news articles. The similar articles identified using KPMiner and the Cosine similarity measure with Word2Vec appear to be relevant to a particular news article and thus show satisfactory performance with a Normalized Discounted Cumulative Gain (NDCG) value of 0.97. This study proposes a method for finding similar news articles that can be used in conjunction with other methods already in use. PeerJ Inc. 2022-07-07 /pmc/articles/PMC9299267/ /pubmed/35875631 http://dx.doi.org/10.7717/peerj-cs.1024 Text en © 2022 Sarwar et al. https://creativecommons.org/licenses/by/4.0/This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
spellingShingle	Computational Linguistics Sarwar, Talha Bin Noor, Noorhuzaimi Mohd Saef Ullah Miah, M. Evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding
title	Evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding
title_full	Evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding
title_fullStr	Evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding
title_full_unstemmed	Evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding
title_short	Evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding
title_sort	evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding
topic	Computational Linguistics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9299267/ https://www.ncbi.nlm.nih.gov/pubmed/35875631 http://dx.doi.org/10.7717/peerj-cs.1024
work_keys_str_mv	AT sarwartalhabin evaluatingkeyphraseextractionalgorithmsforfindingsimilarnewsarticlesusinglexicalsimilaritycalculationandsemanticrelatednessmeasurementbywordembedding AT noornoorhuzaimimohd evaluatingkeyphraseextractionalgorithmsforfindingsimilarnewsarticlesusinglexicalsimilaritycalculationandsemanticrelatednessmeasurementbywordembedding AT saefullahmiahm evaluatingkeyphraseextractionalgorithmsforfindingsimilarnewsarticlesusinglexicalsimilaritycalculationandsemanticrelatednessmeasurementbywordembedding

Evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding

Ejemplares similares