
A Maximum-Entropy Method to Estimate Discrete Distributions from Samples Ensuring Nonzero Probabilities

When constructing discrete (binned) distributions from samples of a data set, applications exist where it is desirable to assure that all bins of the sample distribution have nonzero probability. For example, if the sample distribution is part of a predictive model for which we require returning a r...


Bibliographic Details
Main Authors:	Darscheid, Paul, Guthke, Anneli, Ehret, Uwe
Format:	Online Article Text
Language:	English
Published:	MDPI 2018
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7513126/ https://www.ncbi.nlm.nih.gov/pubmed/33265690 http://dx.doi.org/10.3390/e20080601

_version_	1783586316354584576
author	Darscheid, Paul Guthke, Anneli Ehret, Uwe
author_facet	Darscheid, Paul Guthke, Anneli Ehret, Uwe
author_sort	Darscheid, Paul
collection	PubMed
description	When constructing discrete (binned) distributions from samples of a data set, there are applications in which it is desirable to ensure that all bins of the sample distribution have nonzero probability. This is the case, for example, if the sample distribution is part of a predictive model that must return a response for the entire codomain, or if Kullback–Leibler divergence is used to measure the (dis-)agreement between the sample distribution and the original distribution of the variable, since the divergence is infinite when a bin has zero probability. Several sample-based distribution estimators ensure nonzero bin probability, such as adding one counter to each zero-probability bin of the sample histogram, adding a small probability to the sample pdf, smoothing methods such as kernel-density smoothing, or Bayesian approaches based on the Dirichlet and multinomial distributions. Here, we suggest and test an approach based on the Clopper–Pearson method, which makes use of the binomial distribution. Based on the sample distribution, confidence intervals for the bin-occupation probability are calculated. The mean of each confidence interval is a strictly positive estimator of the true bin-occupation probability and is convergent with increasing sample size. For small samples, it converges towards a uniform distribution, i.e., the method effectively applies a maximum entropy approach. We apply this nonzero method and four alternative sample-based distribution estimators to a range of typical distributions (uniform, Dirac, normal, multimodal, and irregular) and measure the effect with Kullback–Leibler divergence. While the performance of each method strongly depends on the distribution type it is applied to, on average, and especially for small sample sizes, the nonzero method, the simple “add one counter” method, and the Bayesian Dirichlet-multinomial model show very similar behavior and perform best. We conclude that, when estimating distributions without an a priori idea of their shape, applying one of these methods is favorable.
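The estimator described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. It assumes a 95% confidence level (`alpha=0.05`), takes the midpoint of each Clopper–Pearson interval as the "mean" of the interval, and renormalizes the midpoints so they sum to one. It also assumes the divergence is measured from the true distribution to the estimate, in bits. All function names are hypothetical, and the interval bounds are found by bisection on the binomial CDF so that only the standard library is needed.

```python
from math import comb, log2


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(g, target, iters=60):
    """Bisection on [0, 1] for g(p) = target, where g is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion."""
    # Lower bound: P(X >= k | p) = alpha / 2 (defined as 0 when k == 0).
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2 (defined as 1 when k == n).
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


def nonzero_estimate(counts, alpha=0.05):
    """Interval midpoints per bin, renormalized (assumption) to sum to 1."""
    n = sum(counts)
    mids = [sum(clopper_pearson(k, n, alpha)) / 2 for k in counts]
    total = sum(mids)
    return [m / total for m in mids]


def kl_divergence(p, q):
    """D_KL(p || q) in bits; infinite if q has a zero where p does not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return float("inf")
        total += pi * log2(pi / qi)
    return total


# Hypothetical example: 4 samples in 4 bins, true distribution uniform.
counts = [3, 0, 1, 0]
truth = [0.25, 0.25, 0.25, 0.25]
raw = [k / sum(counts) for k in counts]   # plain histogram, has zero bins
est = nonzero_estimate(counts)            # every bin strictly positive

print(kl_divergence(truth, raw))  # infinite because of the empty bins
print(kl_divergence(truth, est))  # finite
```

With no samples at all, every interval in this sketch is [0, 1], so the estimate is uniform, which matches the maximum-entropy behavior the abstract describes for small samples.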
format	Online Article Text
id	pubmed-7513126
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-7513126 2020-11-09 A Maximum-Entropy Method to Estimate Discrete Distributions from Samples Ensuring Nonzero Probabilities Darscheid, Paul Guthke, Anneli Ehret, Uwe Entropy (Basel) Article MDPI 2018-08-13 /pmc/articles/PMC7513126/ /pubmed/33265690 http://dx.doi.org/10.3390/e20080601 Text en © 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle	Article Darscheid, Paul Guthke, Anneli Ehret, Uwe A Maximum-Entropy Method to Estimate Discrete Distributions from Samples Ensuring Nonzero Probabilities
title	A Maximum-Entropy Method to Estimate Discrete Distributions from Samples Ensuring Nonzero Probabilities
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7513126/ https://www.ncbi.nlm.nih.gov/pubmed/33265690 http://dx.doi.org/10.3390/e20080601
