
Bayesian approach for sample size determination, illustrated with Soil Health Card data of Andhra Pradesh (India)

A crucial decision in designing a spatial sample for soil survey is the number of sampling locations required to answer, with sufficient accuracy and precision, the questions posed by decision makers at different levels of geographic aggregation. In the Indian Soil Health Card (SHC) scheme, many tho...


Bibliographic Details
Main authors: Brus, D.J., Kempen, B., Rossiter, D., Balwinder-Singh, McDonald, A.J.
Format: Online Article Text
Language: English
Published: Elsevier Scientific Pub. Co 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8607328/
https://www.ncbi.nlm.nih.gov/pubmed/34980929
http://dx.doi.org/10.1016/j.geoderma.2021.115396
author Brus, D.J.
Kempen, B.
Rossiter, D.
Balwinder-Singh
McDonald, A.J.
collection PubMed
description A crucial decision in designing a spatial sample for soil survey is the number of sampling locations required to answer, with sufficient accuracy and precision, the questions posed by decision makers at different levels of geographic aggregation. In the Indian Soil Health Card (SHC) scheme, many thousands of locations are sampled per district. In this paper the SHC data are used to estimate the mean of a soil property within a defined study area, e.g., a district, or the areal fraction of the study area where some condition is satisfied, e.g., exceedance of a critical level. The central question is whether this large sample size is needed for this aim. The sample size required for a given maximum length of a confidence interval can be computed with formulas from classical sampling theory, using a prior estimate of the variance of the property of interest within the study area. Similarly, for the areal fraction a prior estimate of this fraction is required. In practice we are uncertain about these prior estimates, and our uncertainty is not accounted for in classical sample size determination (SSD). This deficiency can be overcome with a Bayesian approach, in which the prior estimate of the variance or areal fraction is replaced by a prior distribution. Once new data from the sample are available, this prior distribution is updated to a posterior distribution using Bayes’ rule. The apparent problem with a Bayesian approach prior to a sampling campaign is that the data are not yet available. This dilemma can be solved by computing, for a given sample size, the predictive distribution of the data, given a prior distribution on the population and design parameter. Thus we do not have a single vector with data values, but a finite or infinite set of possible data vectors. As a consequence, we have as many posterior distribution functions as we have data vectors. This leads to a probability distribution of lengths or coverages of Bayesian credible intervals, from which various criteria for SSD can be derived. Besides the fully Bayesian approach, a mixed Bayesian-likelihood approach for SSD is available. This is of interest when, after the data have been collected, we prefer to estimate the mean from these data only, using the frequentist approach, ignoring the prior distribution. The fully Bayesian and mixed Bayesian-likelihood approaches are illustrated for estimating the mean of log-transformed Zn and the areal fraction with Zn-deficiency, defined as Zn concentration <0.9 mg kg⁻¹, in the thirteen districts of Andhra Pradesh state. The SHC data from 2015–2017 are used to derive prior distributions. For all districts the Bayesian and mixed Bayesian-likelihood sample sizes are much smaller than the current sample sizes. The hyperparameters of the prior distributions have a strong effect on the sample sizes. We discuss methods to deal with this. Even at the mandal (sub-district) level the sample size can almost always be reduced substantially. Clearly SHC over-sampled, and here we show how to reduce the effort while still providing information required for decision-making. R scripts for SSD are provided as supplementary material.
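Illustrative sketch (not part of the catalog record and not the authors' R scripts): the description contrasts classical SSD, which plugs in a single prior estimate, with approaches that replace that estimate by a prior distribution. The Python below is a minimal sketch under stated assumptions: simple random sampling, normal-approximation (Wald) intervals, and an average-length criterion chosen here for illustration. All function names and the Beta(2, 8) prior are hypothetical and are not derived from the SHC data.

```python
import math
import random
from statistics import NormalDist


def z_value(conf=0.95):
    """Two-sided standard normal quantile for the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - conf) / 2)


def classical_n_mean(sd_prior, max_length, conf=0.95):
    """Classical SSD for a mean: smallest n with interval length 2*z*sd/sqrt(n) <= max_length."""
    return math.ceil((2 * z_value(conf) * sd_prior / max_length) ** 2)


def classical_n_fraction(p_prior, max_length, conf=0.95):
    """Classical SSD for an areal fraction, using a single prior estimate p_prior."""
    return math.ceil((2 * z_value(conf)) ** 2 * p_prior * (1 - p_prior) / max_length ** 2)


def average_wald_length(n, a, b, conf=0.95, draws=400, seed=1):
    """Monte Carlo average length of the Wald interval over the prior predictive.

    The fraction p is drawn from a Beta(a, b) prior (an assumption of this sketch),
    data are simulated for sample size n, and the interval is computed from the
    simulated data only, ignoring the prior (mixed Bayesian-likelihood flavour).
    """
    rng = random.Random(seed)
    z = z_value(conf)
    total = 0.0
    for _ in range(draws):
        p = rng.betavariate(a, b)                     # one plausible population fraction
        k = sum(rng.random() < p for _ in range(n))   # simulated number of "deficient" points
        p_hat = k / n
        total += 2 * z * math.sqrt(p_hat * (1 - p_hat) / n)
    return total / draws


def mixed_n_fraction(a, b, max_length, conf=0.95, n_max=1000):
    """Smallest n (by bisection) whose average interval length is <= max_length.

    Bisection assumes the Monte Carlo average decreases with n, which holds
    only approximately because of simulation noise.
    """
    if average_wald_length(n_max, a, b, conf) > max_length:
        raise ValueError("n_max is too small for the requested interval length")
    lo, hi = 2, n_max
    while lo < hi:
        mid = (lo + hi) // 2
        if average_wald_length(mid, a, b, conf) <= max_length:
            hi = mid
        else:
            lo = mid + 1
    return lo


if __name__ == "__main__":
    print(classical_n_mean(sd_prior=1.0, max_length=0.2))      # 385
    print(classical_n_fraction(p_prior=0.5, max_length=0.2))   # 97
    print(mixed_n_fraction(a=2, b=8, max_length=0.2))          # depends on the simulation
```

The classical functions need one number as input, so any error in that number carries straight through to the sample size. The simulation-based function averages over many plausible values of the fraction, which is the idea the description refers to; a fully Bayesian version would use credible intervals from the posterior instead of Wald intervals.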
format Online
Article
Text
id pubmed-8607328
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Elsevier Scientific Pub. Co
record_format MEDLINE/PubMed
spelling pubmed-8607328 2022-01-01 Bayesian approach for sample size determination, illustrated with Soil Health Card data of Andhra Pradesh (India) Brus, D.J. Kempen, B. Rossiter, D. Balwinder-Singh McDonald, A.J. Geoderma Article Elsevier Scientific Pub. Co 2022-01-01 /pmc/articles/PMC8607328/ /pubmed/34980929 http://dx.doi.org/10.1016/j.geoderma.2021.115396 Text en © 2021 The Author(s) https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Bayesian approach for sample size determination, illustrated with Soil Health Card data of Andhra Pradesh (India)
topic Article