
Optimized sample selection for cost-efficient long-read population sequencing

An increasingly important scenario in population genetics is when a large cohort has been genotyped using a low-resolution approach (e.g., microarrays, exome capture, short-read WGS), from which a few individuals are resequenced using a more comprehensive approach, especially long-read sequencing. T...


Bibliographic Details
Main authors:	Ranallo-Benavidez, T. Rhyker; Lemmon, Zachary; Soyk, Sebastian; Aganezov, Sergey; Salerno, William J.; McCoy, Rajiv C.; Lippman, Zachary B.; Schatz, Michael C.; Sedlazeck, Fritz J.
Format:	Online Article Text
Language:	English
Published:	Cold Spring Harbor Laboratory Press 2021
Subjects:	Method
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8092009/ https://www.ncbi.nlm.nih.gov/pubmed/33811084 http://dx.doi.org/10.1101/gr.264879.120

author	Ranallo-Benavidez, T. Rhyker; Lemmon, Zachary; Soyk, Sebastian; Aganezov, Sergey; Salerno, William J.; McCoy, Rajiv C.; Lippman, Zachary B.; Schatz, Michael C.; Sedlazeck, Fritz J.
author_sort	Ranallo-Benavidez, T. Rhyker
collection	PubMed
description	An increasingly important scenario in population genetics is when a large cohort has been genotyped using a low-resolution approach (e.g., microarrays, exome capture, short-read WGS), from which a few individuals are resequenced using a more comprehensive approach, especially long-read sequencing. The subset of individuals selected should ensure that the captured genetic diversity is fully representative and includes variants across all subpopulations. For example, human variation has historically focused on individuals with European ancestry, but this represents a small fraction of the overall diversity. Addressing this, SVCollector identifies the optimal subset of individuals for resequencing by analyzing population-level VCF files from low-resolution genotyping studies. It then computes a ranked list of samples that maximizes the total number of variants present within a subset of a given size. To solve this optimization problem, SVCollector implements a fast, greedy heuristic and an exact algorithm using integer linear programming. We apply SVCollector on simulated data, 2504 human genomes from the 1000 Genomes Project, and 3024 genomes from the 3000 Rice Genomes Project and show the rankings it computes are more representative than alternative naive strategies. When selecting an optimal subset of 100 samples in these cohorts, SVCollector identifies individuals from every subpopulation, whereas naive methods yield an unbalanced selection. Finally, we show the number of variants present in cohorts selected using this approach follows a power-law distribution that is naturally related to the population genetic concept of the allele frequency spectrum, allowing us to estimate the diversity present with increasing numbers of samples.
format	Online Article Text
id	pubmed-8092009
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Cold Spring Harbor Laboratory Press
record_format	MEDLINE/PubMed
spelling	pubmed-8092009 2021-11-01 Genome Res Method Cold Spring Harbor Laboratory Press 2021-05 /pmc/articles/PMC8092009/ /pubmed/33811084 http://dx.doi.org/10.1101/gr.264879.120 Text en © 2021 Ranallo-Benavidez et al.; Published by Cold Spring Harbor Laboratory Press. This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at https://creativecommons.org/licenses/by-nc/4.0/.
title	Optimized sample selection for cost-efficient long-read population sequencing
topic	Method
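
The description in this record mentions a fast, greedy heuristic for ranking samples so that a subset of a given size contains as many variants as possible. As an illustration only, that idea can be sketched as a generic greedy maximum-coverage procedure. This is not SVCollector's code: the function name `greedy_rank`, the dictionary input, and the alphabetical tie-breaking are all assumptions made for this sketch, and a real tool would read genotypes from a population-level VCF rather than from in-memory sets.

```python
from typing import Dict, List, Set, Tuple


def greedy_rank(sample_variants: Dict[str, Set[str]], k: int) -> List[Tuple[str, int]]:
    """Rank up to k samples by how many not-yet-covered variants each adds.

    Hypothetical helper for illustration; not part of SVCollector.
    """
    # Work on copies so the caller's sets are left untouched.
    remaining = {sample: set(variants) for sample, variants in sample_variants.items()}
    ranking: List[Tuple[str, int]] = []
    for _ in range(min(k, len(remaining))):
        # Choose the sample carrying the most variants not covered so far.
        # Iterating in sorted order makes ties resolve alphabetically (an assumption).
        best = max(sorted(remaining), key=lambda s: len(remaining[s]))
        gained = remaining.pop(best)
        ranking.append((best, len(gained)))
        # Variants now covered no longer count toward any other sample.
        for variants in remaining.values():
            variants -= gained
    return ranking


# Toy cohort with made-up sample and variant identifiers.
toy = {
    "A": {"v1", "v2", "v3"},
    "B": {"v3", "v4"},
    "C": {"v1", "v2"},
    "D": {"v5"},
}
print(greedy_rank(toy, 3))  # [('A', 3), ('B', 1), ('D', 1)]
```

In the toy cohort, sample C is never chosen because A already carries all of its variants, which is the behavior that keeps a greedy ranking from repeatedly selecting near-identical individuals. A greedy choice of this kind is not guaranteed to be optimal, which is consistent with the description also mentioning an exact integer linear programming algorithm.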
