HACSim: an R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves
Assessing levels of standing genetic variation within species requires a robust sampling for the purpose of accurate specimen identification using molecular techniques such as DNA barcoding; however, statistical estimators for what constitutes a robust sample are currently lacking. Moreover, such es...
Main Authors: | Phillips, Jarrett D., French, Steven H., Hanner, Robert H., Gillis, Daniel J. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc. 2020 |
Subjects: | Bioinformatics |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7924493/ https://www.ncbi.nlm.nih.gov/pubmed/33816897 http://dx.doi.org/10.7717/peerj-cs.243 |
_version_ | 1783659102126211072 |
---|---|
author | Phillips, Jarrett D. French, Steven H. Hanner, Robert H. Gillis, Daniel J. |
author_facet | Phillips, Jarrett D. French, Steven H. Hanner, Robert H. Gillis, Daniel J. |
author_sort | Phillips, Jarrett D. |
collection | PubMed |
description | Assessing levels of standing genetic variation within species requires a robust sampling for the purpose of accurate specimen identification using molecular techniques such as DNA barcoding; however, statistical estimators for what constitutes a robust sample are currently lacking. Moreover, such estimates are needed because most species are currently represented by only one or a few sequences in existing databases, which can safely be assumed to be undersampled. Unfortunately, sample sizes of 5–10 specimens per species typically seen in DNA barcoding studies are often insufficient to adequately capture within-species genetic diversity. Here, we introduce a novel iterative extrapolation simulation algorithm of haplotype accumulation curves, called HACSim (Haplotype Accumulation Curve Simulator) that can be employed to calculate likely sample sizes needed to observe the full range of DNA barcode haplotype variation that exists for a species. Using uniform haplotype and non-uniform haplotype frequency distributions, the notion of sampling sufficiency (the sample size at which sampling accuracy is maximized and above which no new sampling information is likely to be gained) can be gleaned. HACSim can be employed in two primary ways to estimate specimen sample sizes: (1) to simulate haplotype sampling in hypothetical species, and (2) to simulate haplotype sampling in real species mined from public reference sequence databases like the Barcode of Life Data Systems (BOLD) or GenBank for any genomic marker of interest. While our algorithm is globally convergent, runtime is heavily dependent on initial sample sizes and skewness of the corresponding haplotype frequency distribution. |
format | Online Article Text |
id | pubmed-7924493 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-79244932021-04-02 HACSim: an R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves Phillips, Jarrett D. French, Steven H. Hanner, Robert H. Gillis, Daniel J. PeerJ Comput Sci Bioinformatics Assessing levels of standing genetic variation within species requires a robust sampling for the purpose of accurate specimen identification using molecular techniques such as DNA barcoding; however, statistical estimators for what constitutes a robust sample are currently lacking. Moreover, such estimates are needed because most species are currently represented by only one or a few sequences in existing databases, which can safely be assumed to be undersampled. Unfortunately, sample sizes of 5–10 specimens per species typically seen in DNA barcoding studies are often insufficient to adequately capture within-species genetic diversity. Here, we introduce a novel iterative extrapolation simulation algorithm of haplotype accumulation curves, called HACSim (Haplotype Accumulation Curve Simulator) that can be employed to calculate likely sample sizes needed to observe the full range of DNA barcode haplotype variation that exists for a species. Using uniform haplotype and non-uniform haplotype frequency distributions, the notion of sampling sufficiency (the sample size at which sampling accuracy is maximized and above which no new sampling information is likely to be gained) can be gleaned. HACSim can be employed in two primary ways to estimate specimen sample sizes: (1) to simulate haplotype sampling in hypothetical species, and (2) to simulate haplotype sampling in real species mined from public reference sequence databases like the Barcode of Life Data Systems (BOLD) or GenBank for any genomic marker of interest. While our algorithm is globally convergent, runtime is heavily dependent on initial sample sizes and skewness of the corresponding haplotype frequency distribution. PeerJ Inc. 
2020-01-06 /pmc/articles/PMC7924493/ /pubmed/33816897 http://dx.doi.org/10.7717/peerj-cs.243 Text en ©2020 Phillips et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited. |
spellingShingle | Bioinformatics Phillips, Jarrett D. French, Steven H. Hanner, Robert H. Gillis, Daniel J. HACSim: an R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves |
title | HACSim: an R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves |
title_full | HACSim: an R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves |
title_fullStr | HACSim: an R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves |
title_full_unstemmed | HACSim: an R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves |
title_short | HACSim: an R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves |
title_sort | hacsim: an r package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves |
topic | Bioinformatics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7924493/ https://www.ncbi.nlm.nih.gov/pubmed/33816897 http://dx.doi.org/10.7717/peerj-cs.243 |