Fast and accurate estimation of multidimensional site frequency spectra from low-coverage high-throughput sequencing data

Bibliographic Details
Main Authors: Mas-Sandoval, Alex; Pope, Nathaniel S; Nielsen, Knud Nor; Altinkaya, Isin; Fumagalli, Matteo; Korneliussen, Thorfinn Sand
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9112775/
https://www.ncbi.nlm.nih.gov/pubmed/35579549
http://dx.doi.org/10.1093/gigascience/giac032
author Mas-Sandoval, Alex
Pope, Nathaniel S
Nielsen, Knud Nor
Altinkaya, Isin
Fumagalli, Matteo
Korneliussen, Thorfinn Sand
collection PubMed
description BACKGROUND: The site frequency spectrum summarizes the distribution of allele frequencies throughout the genome, and it is widely used as a summary statistic to infer demographic parameters and to detect signals of natural selection. The use of high-throughput low-coverage DNA sequencing data can lead to biased estimates of the site frequency spectrum due to high levels of uncertainty in genotyping. RESULTS: Here we design and implement a method to efficiently and accurately estimate the multidimensional joint site frequency spectrum for large numbers of haploid or diploid individuals across an arbitrary number of populations, using low-coverage sequencing data. The method maximizes a likelihood function that represents the probability of the sequencing data observed given a multidimensional site frequency spectrum using genotype likelihoods. Notably, it uses an advanced binning heuristic paired with an accelerated expectation-maximization algorithm for a fast and memory-efficient computation, and can generate both unfolded and folded spectra and bootstrapped replicates for haploid and diploid genomes. On the basis of extensive simulations, we show that the new method requires remarkably less storage and is faster than previous implementations whilst retaining the same accuracy. When applied to low-coverage sequencing data from the fungal pathogen Neonectria neomacrospora, results recapitulate the patterns of population differentiation generated using the original high-coverage data. CONCLUSION: The new implementation allows for accurate estimation of population genetic parameters from arbitrarily large, low-coverage datasets, thus facilitating cost-effective sequencing experiments in model and non-model organisms.
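The description above states that the method maximizes a likelihood over the site frequency spectrum using an expectation-maximization algorithm. As a rough illustration of that idea only, the Python sketch below runs a plain EM for a one-dimensional spectrum. It assumes the per-site likelihood of each possible derived-allele count has already been computed from genotype likelihoods, and it leaves out the binning heuristic, the acceleration step, folding, bootstrapping, and multiple populations. The names `em_sfs` and `site_likelihoods` are hypothetical and are not taken from the authors' software.

```python
import math


def em_sfs(site_likelihoods, n_iter=100, tol=1e-8):
    """Plain EM estimate of a 1D site frequency spectrum (illustrative sketch).

    site_likelihoods[s][j] is assumed to hold P(data at site s | derived
    allele count j); each row must have at least one positive entry.
    Returns the estimated spectrum (proportions summing to 1) and the
    log-likelihood evaluated at the start of the last iteration.
    """
    n_sites = len(site_likelihoods)
    n_bins = len(site_likelihoods[0])
    sfs = [1.0 / n_bins] * n_bins  # uniform starting point
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(n_iter):
        expected = [0.0] * n_bins
        ll = 0.0
        # E-step: posterior over allele counts at every site under current sfs
        for lik in site_likelihoods:
            joint = [p * l for p, l in zip(sfs, lik)]
            total = sum(joint)
            ll += math.log(total)
            for j, value in enumerate(joint):
                expected[j] += value / total
        # M-step: the new spectrum is the average posterior across sites
        sfs = [e / n_sites for e in expected]
        if ll - prev_ll < tol:  # stop once the likelihood stops improving
            break
        prev_ll = ll
    return sfs, ll


if __name__ == "__main__":
    # Toy input: four sites, three allele-count classes, noisy likelihoods.
    toy = [
        [0.90, 0.08, 0.02],
        [0.70, 0.25, 0.05],
        [0.10, 0.80, 0.10],
        [0.05, 0.15, 0.80],
    ]
    estimate, loglik = em_sfs(toy)
    print(estimate, loglik)
```

Each iteration touches every site once, so the cost grows with the number of sites times the number of allele-count classes; that scaling is the kind of burden that a binning heuristic and an accelerated EM, as named in the description, would be meant to reduce.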
format Online
Article
Text
id pubmed-9112775
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9112775 2022-05-18 Gigascience Technical Note Oxford University Press 2022-05-17 /pmc/articles/PMC9112775/ /pubmed/35579549 http://dx.doi.org/10.1093/gigascience/giac032 Text en © The Author(s) 2022. Published by Oxford University Press GigaScience. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Fast and accurate estimation of multidimensional site frequency spectra from low-coverage high-throughput sequencing data
topic Technical Note