
HACSim: an R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves

Assessing levels of standing genetic variation within species requires robust sampling for the purpose of accurate specimen identification using molecular techniques such as DNA barcoding; however, statistical estimators for what constitutes a robust sample are currently lacking. Moreover, such es...


Bibliographic Details
Main Authors: Phillips, Jarrett D., French, Steven H., Hanner, Robert H., Gillis, Daniel J.
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2020
Subjects: Bioinformatics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7924493/
https://www.ncbi.nlm.nih.gov/pubmed/33816897
http://dx.doi.org/10.7717/peerj-cs.243
author Phillips, Jarrett D.
French, Steven H.
Hanner, Robert H.
Gillis, Daniel J.
collection PubMed
description Assessing levels of standing genetic variation within species requires robust sampling for the purpose of accurate specimen identification using molecular techniques such as DNA barcoding; however, statistical estimators for what constitutes a robust sample are currently lacking. Moreover, such estimates are needed because most species are currently represented by only one or a few sequences in existing databases and can safely be assumed to be undersampled. Unfortunately, the sample sizes of 5–10 specimens per species typically seen in DNA barcoding studies are often insufficient to capture within-species genetic diversity adequately. Here, we introduce a novel iterative extrapolation simulation algorithm of haplotype accumulation curves, called HACSim (Haplotype Accumulation Curve Simulator), that can be employed to calculate likely sample sizes needed to observe the full range of DNA barcode haplotype variation that exists for a species. Using uniform and non-uniform haplotype frequency distributions, the notion of sampling sufficiency (the sample size at which sampling accuracy is maximized and above which no new sampling information is likely to be gained) can be gleaned. HACSim can be employed in two primary ways to estimate specimen sample sizes: (1) to simulate haplotype sampling in hypothetical species, and (2) to simulate haplotype sampling in real species mined from public reference sequence databases like the Barcode of Life Data Systems (BOLD) or GenBank for any genomic marker of interest. While our algorithm is globally convergent, runtime is heavily dependent on initial sample sizes and skewness of the corresponding haplotype frequency distribution.
format Online
Article
Text
id pubmed-7924493
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-7924493 2021-04-02 PeerJ Comput Sci Bioinformatics PeerJ Inc. 2020-01-06 /pmc/articles/PMC7924493/ /pubmed/33816897 http://dx.doi.org/10.7717/peerj-cs.243 Text en ©2020 Phillips et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title HACSim: an R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7924493/
https://www.ncbi.nlm.nih.gov/pubmed/33816897
http://dx.doi.org/10.7717/peerj-cs.243
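
The description in this record outlines an iterative extrapolation over simulated haplotype accumulation curves. The following is a minimal Python sketch of that general idea, not the HACSim package's own code (which is written in R). The function names, the sufficiency level `p`, sampling with replacement from the frequency distribution, and the update rule (scale the sample size by the ratio of total to observed haplotypes) are all assumptions made here for illustration.

```python
import math
import random


def mean_haplotypes_observed(n, probs, perms, rng):
    """Average number of distinct haplotypes seen across `perms` simulated
    samples of n specimens drawn from the frequency distribution `probs`."""
    labels = list(range(len(probs)))
    total = 0
    for _ in range(perms):
        draws = rng.choices(labels, weights=probs, k=n)
        total += len(set(draws))
    return total / perms


def estimate_sample_size(n_start, probs, p=0.95, perms=1000, max_iter=100, seed=1):
    """Grow the sample size until the mean fraction of haplotypes observed
    reaches p. Hypothetical helper; the update rule is an assumption."""
    rng = random.Random(seed)
    h_star = len(probs)  # total number of haplotypes assumed to exist
    n = n_start
    for _ in range(max_iter):
        h = mean_haplotypes_observed(n, probs, perms, rng)
        if h / h_star >= p:
            return n
        # Extrapolate: scale n by the ratio of total to observed haplotypes,
        # always moving forward by at least one specimen.
        n = max(n + 1, math.ceil(n * h_star / h))
    return n


if __name__ == "__main__":
    uniform = [0.1] * 10                # 10 equally frequent haplotypes
    skewed = [0.6, 0.2, 0.1, 0.05, 0.05]  # one dominant haplotype
    print(estimate_sample_size(5, uniform))
    print(estimate_sample_size(5, skewed))
```

In this sketch, a more skewed frequency distribution needs more iterations before the rare haplotypes are reliably sampled, which is consistent with the description's remark that runtime depends on the initial sample size and the skewness of the distribution.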