Fast and accurate estimation of multidimensional site frequency spectra from low-coverage high-throughput sequencing data
BACKGROUND: The site frequency spectrum summarizes the distribution of allele frequencies throughout the genome, and it is widely used as a summary statistic to infer demographic parameters and to detect signals of natural selection. The use of high-throughput low-coverage DNA sequencing data can le...
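Main authors: | Mas-Sandoval, Alex; Pope, Nathaniel S; Nielsen, Knud Nor; Altinkaya, Isin; Fumagalli, Matteo; Korneliussen, Thorfinn Sand |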
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9112775/ https://www.ncbi.nlm.nih.gov/pubmed/35579549 http://dx.doi.org/10.1093/gigascience/giac032 |
_version_ | 1784709470364893184 |
---|---|
author | Mas-Sandoval, Alex; Pope, Nathaniel S; Nielsen, Knud Nor; Altinkaya, Isin; Fumagalli, Matteo; Korneliussen, Thorfinn Sand |
author_facet | Mas-Sandoval, Alex; Pope, Nathaniel S; Nielsen, Knud Nor; Altinkaya, Isin; Fumagalli, Matteo; Korneliussen, Thorfinn Sand |
author_sort | Mas-Sandoval, Alex |
collection | PubMed |
description | BACKGROUND: The site frequency spectrum summarizes the distribution of allele frequencies throughout the genome, and it is widely used as a summary statistic to infer demographic parameters and to detect signals of natural selection. The use of high-throughput low-coverage DNA sequencing data can lead to biased estimates of the site frequency spectrum due to high levels of uncertainty in genotyping. RESULTS: Here we design and implement a method to efficiently and accurately estimate the multidimensional joint site frequency spectrum for large numbers of haploid or diploid individuals across an arbitrary number of populations, using low-coverage sequencing data. The method maximizes a likelihood function that represents the probability of the sequencing data observed given a multidimensional site frequency spectrum using genotype likelihoods. Notably, it uses an advanced binning heuristic paired with an accelerated expectation-maximization algorithm for a fast and memory-efficient computation, and can generate both unfolded and folded spectra and bootstrapped replicates for haploid and diploid genomes. On the basis of extensive simulations, we show that the new method requires remarkably less storage and is faster than previous implementations whilst retaining the same accuracy. When applied to low-coverage sequencing data from the fungal pathogen Neonectria neomacrospora, results recapitulate the patterns of population differentiation generated using the original high-coverage data. CONCLUSION: The new implementation allows for accurate estimation of population genetic parameters from arbitrarily large, low-coverage datasets, thus facilitating cost-effective sequencing experiments in model and non-model organisms. |
format | Online Article Text |
id | pubmed-9112775 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-9112775 2022-05-18 Fast and accurate estimation of multidimensional site frequency spectra from low-coverage high-throughput sequencing data Mas-Sandoval, Alex Pope, Nathaniel S Nielsen, Knud Nor Altinkaya, Isin Fumagalli, Matteo Korneliussen, Thorfinn Sand Gigascience Technical Note BACKGROUND: The site frequency spectrum summarizes the distribution of allele frequencies throughout the genome, and it is widely used as a summary statistic to infer demographic parameters and to detect signals of natural selection. The use of high-throughput low-coverage DNA sequencing data can lead to biased estimates of the site frequency spectrum due to high levels of uncertainty in genotyping. RESULTS: Here we design and implement a method to efficiently and accurately estimate the multidimensional joint site frequency spectrum for large numbers of haploid or diploid individuals across an arbitrary number of populations, using low-coverage sequencing data. The method maximizes a likelihood function that represents the probability of the sequencing data observed given a multidimensional site frequency spectrum using genotype likelihoods. Notably, it uses an advanced binning heuristic paired with an accelerated expectation-maximization algorithm for a fast and memory-efficient computation, and can generate both unfolded and folded spectra and bootstrapped replicates for haploid and diploid genomes. On the basis of extensive simulations, we show that the new method requires remarkably less storage and is faster than previous implementations whilst retaining the same accuracy. When applied to low-coverage sequencing data from the fungal pathogen Neonectria neomacrospora, results recapitulate the patterns of population differentiation generated using the original high-coverage data. CONCLUSION: The new implementation allows for accurate estimation of population genetic parameters from arbitrarily large, low-coverage datasets, thus facilitating cost-effective sequencing experiments in model and non-model organisms. Oxford University Press 2022-05-17 /pmc/articles/PMC9112775/ /pubmed/35579549 http://dx.doi.org/10.1093/gigascience/giac032 Text en © The Author(s) 2022. Published by Oxford University Press GigaScience. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Technical Note Mas-Sandoval, Alex Pope, Nathaniel S Nielsen, Knud Nor Altinkaya, Isin Fumagalli, Matteo Korneliussen, Thorfinn Sand Fast and accurate estimation of multidimensional site frequency spectra from low-coverage high-throughput sequencing data |
title | Fast and accurate estimation of multidimensional site frequency spectra from low-coverage high-throughput sequencing data |
topic | Technical Note |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9112775/ https://www.ncbi.nlm.nih.gov/pubmed/35579549 http://dx.doi.org/10.1093/gigascience/giac032 |
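
The description above says the method maximizes the likelihood of the sequencing data given a site frequency spectrum, using an expectation-maximization algorithm. As a rough illustration only, the following sketch shows a plain (non-accelerated, non-binned) EM for a one-dimensional spectrum. It is not the authors' implementation: the function name `em_sfs`, the input layout `sal` (one row per site, one column per allele-count category, each entry being the likelihood of that site's data given that category), and the stopping rule are all assumptions made for this example.

```python
import numpy as np

def em_sfs(sal, n_iter=200, tol=1e-8):
    """Plain EM estimate of a 1D site frequency spectrum (illustrative sketch).

    sal[s, j] is assumed to hold P(data at site s | allele count category j).
    Returns a probability vector over the categories.
    """
    sal = np.asarray(sal, dtype=float)
    n_sites, n_cat = sal.shape
    sfs = np.full(n_cat, 1.0 / n_cat)   # start from a flat spectrum
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability of each category at each site
        joint = sal * sfs                # broadcasts sfs over the rows
        site_lik = joint.sum(axis=1)
        post = joint / site_lik[:, None]
        # Log-likelihood of the spectrum used in this E-step
        ll = np.log(site_lik).sum()
        # M-step: the new spectrum is the average posterior across sites
        sfs = post.mean(axis=0)
        if ll - prev_ll < tol:           # stop once the likelihood stalls
            break
        prev_ll = ll
    return sfs

# Toy input: 4 sites, 3 categories, with uncertain (hypothetical) likelihoods
toy = np.array([
    [0.90, 0.08, 0.02],
    [0.85, 0.10, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
])
print(em_sfs(toy))
```

Each iteration reweights the per-site likelihoods by the current spectrum and averages the resulting posteriors. The binning heuristic and the acceleration scheme mentioned in the description are not represented here.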