Effect of stemming on text similarity for Arabic language at sentence level
Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at the sentence level. Several Arabic light and heavy stemmers as well as lemmati...
Main authors: | Alhawarat, Mohammad O.; Abdeljaber, Hikmat; Hilal, Anwer |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc. 2021 |
Subjects: | Artificial Intelligence |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8156998/ https://www.ncbi.nlm.nih.gov/pubmed/34084932 http://dx.doi.org/10.7717/peerj-cs.530 |
_version_ | 1783699580567683072 |
---|---|
author | Alhawarat, Mohammad O.; Abdeljaber, Hikmat; Hilal, Anwer |
author_facet | Alhawarat, Mohammad O.; Abdeljaber, Hikmat; Hilal, Anwer |
author_sort | Alhawarat, Mohammad O. |
collection | PubMed |
description | Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at the sentence level. Several Arabic light and heavy stemmers as well as lemmatization algorithms are used in this study, for a total of 10 algorithms. Standard training and testing data sets are taken from the SemEval-2017 international workshop, Task 1, Track 1 Arabic (ar–ar). Different features are selected to study the effect of stemming on text similarity based on different similarity measures. Traditional machine learning algorithms are used, such as Support Vector Machines (SVM), Stochastic Gradient Descent (SGD) and Naïve Bayesian (NB). Compared to the original text, using the stemmed and lemmatized documents in experiments achieves improved Pearson correlation results. The best results were attained when using the Arabic Light Stemmer (ARLSTem) and Farasa light stemmers, the Farasa and Qalsadi lemmatizers and the Tashaphyne heavy stemmer. The best improvement was about 7.34% in Pearson correlation. In general, stemming considerably improves the performance of sentence-level text similarity for the Arabic language. However, some stemmers make results worse than those for the original text, namely the Khoja heavy stemmer and the AlKhalil light stemmer. |
format | Online Article Text |
id | pubmed-8156998 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-81569982021-06-02 Effect of stemming on text similarity for Arabic language at sentence level Alhawarat, Mohammad O. Abdeljaber, Hikmat Hilal, Anwer PeerJ Comput Sci Artificial Intelligence Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at the sentence level. Several Arabic light and heavy stemmers as well as lemmatization algorithms are used in this study, for a total of 10 algorithms. Standard training and testing data sets are taken from the SemEval-2017 international workshop, Task 1, Track 1 Arabic (ar–ar). Different features are selected to study the effect of stemming on text similarity based on different similarity measures. Traditional machine learning algorithms are used, such as Support Vector Machines (SVM), Stochastic Gradient Descent (SGD) and Naïve Bayesian (NB). Compared to the original text, using the stemmed and lemmatized documents in experiments achieves improved Pearson correlation results. The best results were attained when using the Arabic Light Stemmer (ARLSTem) and Farasa light stemmers, the Farasa and Qalsadi lemmatizers and the Tashaphyne heavy stemmer. The best improvement was about 7.34% in Pearson correlation. In general, stemming considerably improves the performance of sentence-level text similarity for the Arabic language. However, some stemmers make results worse than those for the original text, namely the Khoja heavy stemmer and the AlKhalil light stemmer. PeerJ Inc. 2021-05-14 /pmc/articles/PMC8156998/ /pubmed/34084932 http://dx.doi.org/10.7717/peerj-cs.530 Text en © 2021 Alhawarat et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited. |
spellingShingle | Artificial Intelligence Alhawarat, Mohammad O. Abdeljaber, Hikmat Hilal, Anwer Effect of stemming on text similarity for Arabic language at sentence level |
title | Effect of stemming on text similarity for Arabic language at sentence level |
topic | Artificial Intelligence |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8156998/ https://www.ncbi.nlm.nih.gov/pubmed/34084932 http://dx.doi.org/10.7717/peerj-cs.530 |
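
The abstract's central claim, that stemming can raise sentence-level similarity scores and that results are judged by Pearson correlation, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_light_stem` is a hypothetical stand-in that only strips the definite article and is not ARLSTem, Farasa, or any other stemmer evaluated in the study; Jaccard overlap is just one possible similarity measure, since the record does not say which measures the authors used; and the two example sentences are invented.

```python
import math


def toy_light_stem(token):
    """Hypothetical light stemmer: strips only the definite article 'ال'.

    This is NOT one of the stemmers evaluated in the study.
    """
    if token.startswith("ال") and len(token) > 3:
        return token[2:]
    return token


def jaccard(tokens_a, tokens_b):
    """Jaccard overlap between two token lists (an assumed similarity measure)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def pearson(xs, ys):
    """Pearson correlation between predicted and gold similarity scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented sentence pair: same content words, with and without the article.
sentence_1 = "الولد يقرأ الكتاب".split()
sentence_2 = "ولد يقرأ كتاب".split()

raw_similarity = jaccard(sentence_1, sentence_2)
stemmed_similarity = jaccard(
    [toy_light_stem(t) for t in sentence_1],
    [toy_light_stem(t) for t in sentence_2],
)

print("raw:", raw_similarity)
print("stemmed:", stemmed_similarity)
```

In this toy pair only one surface token matches before stemming, while all three match afterwards. In an evaluation like the one described, `pearson` would be applied to a system's predicted scores and the gold scores over a whole test set rather than to a single pair.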