
Merging of Numerical Intervals in Entropy-Based Discretization


Bibliographic Details
Main Authors: Grzymala-Busse, Jerzy W., Mroczek, Teresa
Format: Online Article Text
Language: English
Published: MDPI 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7512462/
https://www.ncbi.nlm.nih.gov/pubmed/33266604
http://dx.doi.org/10.3390/e20110880
author Grzymala-Busse, Jerzy W.
Mroczek, Teresa
collection PubMed
description As previous research indicates, a multiple-scanning methodology for discretization of numerical datasets, based on entropy, is very competitive. Discretization is a process of converting numerical values of the data records into discrete values associated with numerical intervals defined over the domains of the data records. In multiple-scanning discretization, the last step is the merging of neighboring intervals in discretized datasets as a kind of postprocessing. Our objective is to check how the error rate, measured by tenfold cross-validation within the C4.5 system, is affected by such merging. We conducted experiments on 17 numerical datasets, using the same setup of multiple scanning, with three different options for merging: no merging at all, merging based on the smallest entropy, and merging based on the biggest entropy. As a result of the Friedman rank sum test (5% significance level), we concluded that the differences between all three approaches are statistically insignificant. There is no universally best approach. Then, we repeated all experiments 30 times, recording averages and standard deviations. The test of the difference between averages shows that, for a comparison of no merging with merging based on the smallest entropy, there are statistically highly significant differences (with a 1% significance level). In some cases the smaller error rate is associated with no merging; in others it is associated with merging based on the smallest entropy. A comparison of no merging with merging based on the biggest entropy showed similar results. So, our final conclusion was that there are highly significant differences between no merging and merging, depending on the dataset. The best approach should be chosen by trying all three approaches.
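This record does not reproduce the paper's merging procedure, so the following is only a rough sketch of the general idea of entropy-guided merging of neighboring intervals. It assumes that merging two neighbors means dropping the cut point between them, and that each candidate merge is scored by the conditional entropy of the class given the resulting intervals. The function names `conditional_entropy` and `merge_candidates` and the toy data are hypothetical and are not taken from the article.

```python
import math
from collections import Counter


def conditional_entropy(labels, intervals):
    """H(class | interval): entropy of the labels inside each interval,
    weighted by the fraction of records that fall into that interval."""
    n = len(labels)
    groups = {}
    for interval, label in zip(intervals, labels):
        groups.setdefault(interval, []).append(label)
    total = 0.0
    for members in groups.values():
        size = len(members)
        counts = Counter(members)
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        total += (size / n) * h
    return total


def merge_candidates(values, labels, cuts):
    """For every cut point, compute the conditional entropy obtained when
    that cut is removed, i.e. when its two neighboring intervals are merged.
    (Assumed scoring rule for illustration only.)"""
    def assign(cut_list):
        # Interval index = number of cut points at or below the value.
        return [sum(v >= c for c in cut_list) for v in values]

    scores = {}
    for cut in cuts:
        remaining = [c for c in cuts if c != cut]
        scores[cut] = conditional_entropy(labels, assign(remaining))
    return scores


# Toy attribute with two cut points; the class changes between 3.0 and 4.0.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["a", "a", "a", "b", "b", "b"]
cuts = [2.5, 3.5]

scores = merge_candidates(values, labels, cuts)
best_cut_to_drop = min(scores, key=scores.get)  # merge with the smallest entropy
```

In this toy case, dropping the cut at 2.5 leaves both intervals pure, while dropping the cut at 3.5 mixes the classes, so a smallest-entropy rule would merge across 2.5. A biggest-entropy variant would use `max` instead of `min`.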
format Online
Article
Text
id pubmed-7512462
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7512462 2020-11-09 Merging of Numerical Intervals in Entropy-Based Discretization Grzymala-Busse, Jerzy W. Mroczek, Teresa Entropy (Basel) Article
MDPI 2018-11-16 /pmc/articles/PMC7512462/ /pubmed/33266604 http://dx.doi.org/10.3390/e20110880 Text en © 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title Merging of Numerical Intervals in Entropy-Based Discretization
topic Article