
High throughput nonparametric probability density estimation

In high throughput applications, such as those found in bioinformatics and finance, it is important to determine accurate probability distribution functions despite only minimal information about data characteristics, and without using human subjectivity. Such an automated process for univariate dat...


Bibliographic Details
Main authors:	Farmer, Jenny; Jacobs, Donald
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2018
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5947915/ https://www.ncbi.nlm.nih.gov/pubmed/29750803 http://dx.doi.org/10.1371/journal.pone.0196937

_version_	1783322461716086784
author	Farmer, Jenny Jacobs, Donald
author_facet	Farmer, Jenny Jacobs, Donald
author_sort	Farmer, Jenny
collection	PubMed
description	In high throughput applications, such as those found in bioinformatics and finance, it is important to determine accurate probability distribution functions despite only minimal information about data characteristics, and without using human subjectivity. Such an automated process for univariate data is implemented to achieve this goal by merging the maximum entropy method with single order statistics and maximum likelihood. The only required properties of the random variables are that they are continuous and that they are, or can be approximated as, independent and identically distributed. A quasi-log-likelihood function based on single order statistics for sampled uniform random data is used to empirically construct a sample size invariant universal scoring function. Then a probability density estimate is determined by iteratively improving trial cumulative distribution functions, where better estimates are quantified by the scoring function that identifies atypical fluctuations. This criterion resists under and over fitting data as an alternative to employing the Bayesian or Akaike information criterion. Multiple estimates for the probability density reflect uncertainties due to statistical fluctuations in random samples. Scaled quantile residual plots are also introduced as an effective diagnostic to visualize the quality of the estimated probability densities. Benchmark tests show that estimates for the probability density function (PDF) converge to the true PDF as sample size increases on particularly difficult test probability densities that include cases with discontinuities, multi-resolution scales, heavy tails, and singularities. These results indicate the method has general applicability for high throughput statistical inference.
format	Online Article Text
id	pubmed-5947915
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-5947915 2018-05-25 High throughput nonparametric probability density estimation Farmer, Jenny Jacobs, Donald PLoS One Research Article
Public Library of Science 2018-05-11 /pmc/articles/PMC5947915/ /pubmed/29750803 http://dx.doi.org/10.1371/journal.pone.0196937 Text en © 2018 Farmer, Jacobs http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Farmer, Jenny Jacobs, Donald High throughput nonparametric probability density estimation
title	High throughput nonparametric probability density estimation
title_sort	high throughput nonparametric probability density estimation
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5947915/ https://www.ncbi.nlm.nih.gov/pubmed/29750803 http://dx.doi.org/10.1371/journal.pone.0196937
work_keys_str_mv	AT farmerjenny highthroughputnonparametricprobabilitydensityestimation AT jacobsdonald highthroughputnonparametricprobabilitydensityestimation
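
The description above says that trial cumulative distribution functions are scored with a quasi-log-likelihood built from single order statistics of uniform random data, and that scaled quantile residuals serve as a diagnostic. The following is a minimal Python sketch of that idea, not the authors' implementation. It relies on the standard result that the k-th order statistic of N uniform samples follows a Beta(k, N-k+1) distribution. The function names, the averaging over N, the sqrt(N+2) scaling, and the exponential test data are assumptions chosen for illustration; the record itself does not specify them.

```python
import math
import random


def quasi_log_likelihood(u_sorted):
    """Mean log density of each sorted value u_(k) under Beta(k, n-k+1).

    The order statistics are treated as if independent, hence "quasi".
    Averaging over n is an assumption made for this sketch.
    """
    n = len(u_sorted)
    eps = 1e-12  # keep the logarithms finite at the endpoints
    total = 0.0
    for k, u in enumerate(u_sorted, start=1):
        u = min(max(u, eps), 1.0 - eps)
        a, b = k, n - k + 1
        log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
        total += log_norm + (a - 1) * math.log(u) + (b - 1) * math.log1p(-u)
    return total / n


def scaled_quantile_residuals(u_sorted):
    """Deviation of u_(k) from its mean k/(n+1), scaled by sqrt(n+2).

    The scaling (an assumption here) makes the spread roughly
    independent of sample size, since Var[u_(k)] = mu(1-mu)/(n+2).
    """
    n = len(u_sorted)
    scale = math.sqrt(n + 2)
    return [scale * (u - k / (n + 1)) for k, u in enumerate(u_sorted, start=1)]


# Illustrative data: exponential samples with rate 1 (hypothetical example).
rng = random.Random(0)
data = sorted(rng.expovariate(1.0) for _ in range(500))

# Map the sorted data through two trial CDFs: the true one and a wrong one.
u_true = [1.0 - math.exp(-x) for x in data]
u_wrong = [1.0 - math.exp(-x / 3.0) for x in data]

print("score, true CDF :", quasi_log_likelihood(u_true))
print("score, wrong CDF:", quasi_log_likelihood(u_wrong))
print("max |SQR|, true :", max(abs(r) for r in scaled_quantile_residuals(u_true)))
print("max |SQR|, wrong:", max(abs(r) for r in scaled_quantile_residuals(u_wrong)))
```

When the trial CDF matches the data, the transformed values behave like sorted uniform samples, so the score is higher and the residuals stay small. A mismatched trial CDF produces a much lower score and large systematic residuals.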
