
Effect of stemming on text similarity for Arabic language at sentence level

Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at sentence level. Several Arabic light and heavy stemmers as well as lemmati...


Bibliographic Details
Main authors: Alhawarat, Mohammad O., Abdeljaber, Hikmat, Hilal, Anwer
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8156998/
https://www.ncbi.nlm.nih.gov/pubmed/34084932
http://dx.doi.org/10.7717/peerj-cs.530
author Alhawarat, Mohammad O.
Abdeljaber, Hikmat
Hilal, Anwer
collection PubMed
description Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at sentence level. Several Arabic light and heavy stemmers as well as lemmatization algorithms are used in this study, for a total of 10 algorithms. Standard training and testing data sets are taken from the SemEval-2017 international workshop, Task 1, Track 1 Arabic (ar–ar). Different features are selected to study the effect of stemming on text similarity based on different similarity measures. Traditional machine learning algorithms are used, such as Support Vector Machines (SVM), Stochastic Gradient Descent (SGD) and Naïve Bayes (NB). Compared to the original text, using the stemmed and lemmatized documents in experiments achieves improved Pearson correlation results. The best results were attained with the Arabic Light Stemmer (ARLSTem) and Farasa light stemmers, the Farasa and Qalsadi lemmatizers, and the Tashaphyne heavy stemmer. The best improvement was about 7.34% in Pearson correlation. In general, stemming considerably improves the performance of sentence-level text similarity for the Arabic language. However, two stemmers make results worse than those for the original text: the Khoja heavy stemmer and the AlKhalil light stemmer.
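The evaluation idea in the description (score sentence pairs, then compare the scores with gold ratings by Pearson correlation) can be illustrated with a minimal sketch. It is not the authors' pipeline: `toy_light_stem` is a hypothetical stand-in that only strips a leading definite article, Jaccard token overlap is just one possible similarity measure, and none of the ten stemmers or the SVM, SGD and NB models from the study are used.

```python
import math


def toy_light_stem(token):
    """Hypothetical toy stemmer (an assumption, not ARLSTem or Farasa):
    strip a leading Arabic definite article 'ال' from longer tokens."""
    if token.startswith("ال") and len(token) > 3:
        return token[2:]
    return token


def jaccard(a_tokens, b_tokens):
    """Token-set overlap between two sentences, in [0, 1]."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


# Two illustrative (made-up) sentences that differ only in the article.
sent_a = ["الكتاب", "جديد"]
sent_b = ["كتاب", "جديد"]

raw_similarity = jaccard(sent_a, sent_b)
stemmed_similarity = jaccard(
    [toy_light_stem(t) for t in sent_a],
    [toy_light_stem(t) for t in sent_b],
)
```

In this toy pair the stemmed token sets coincide, so the overlap score rises after stemming. On a real test set, `pearson` would be applied to the list of predicted scores and the list of gold ratings, once for the original text and once for each stemmed variant.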
format Online
Article
Text
id pubmed-8156998
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-8156998 2021-06-02 Effect of stemming on text similarity for Arabic language at sentence level. Alhawarat, Mohammad O.; Abdeljaber, Hikmat; Hilal, Anwer. PeerJ Comput Sci. Artificial Intelligence. PeerJ Inc. 2021-05-14 /pmc/articles/PMC8156998/ /pubmed/34084932 http://dx.doi.org/10.7717/peerj-cs.530 Text en © 2021 Alhawarat et al.
https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title Effect of stemming on text similarity for Arabic language at sentence level
topic Artificial Intelligence
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8156998/
https://www.ncbi.nlm.nih.gov/pubmed/34084932
http://dx.doi.org/10.7717/peerj-cs.530