
Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe th...


Bibliographic Details
Main Authors:	Ralph, Peter; Thornton, Kevin; Kelleher, Jerome
Format:	Online Article Text
Language:	English
Published:	Genetics Society of America 2020
Subjects:	Investigations
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7337078/ https://www.ncbi.nlm.nih.gov/pubmed/32357960 http://dx.doi.org/10.1534/genetics.120.303253

author	Ralph, Peter; Thornton, Kevin; Kelleher, Jerome
author_sort	Ralph, Peter
collection	PubMed
description	As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates sample weights within the genealogical tree at each position on the genome, which are then combined using a summary function; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite sites model of mutation; and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently defined statistics of genome sequence (making the statistics’ relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding branch statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project data set, and discuss ways in which deviations may encode interesting biological signals.
format	Online Article Text
id	pubmed-7337078
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Genetics Society of America
record_format	MEDLINE/PubMed
spelling	pubmed-7337078 2020-07-16 Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes Ralph, Peter; Thornton, Kevin; Kelleher, Jerome Genetics Investigations Genetics Society of America 2020-07 2020-05-01 /pmc/articles/PMC7337078/ /pubmed/32357960 http://dx.doi.org/10.1534/genetics.120.303253 Text en Copyright © 2020 Ralph et al. Available freely online through the author-supported open access option. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes
topic	Investigations
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7337078/ https://www.ncbi.nlm.nih.gov/pubmed/32357960 http://dx.doi.org/10.1534/genetics.120.303253
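
The description above says that sample weights are accumulated within the genealogical tree and then combined using a summary function, with results reported by site or by branch. The following is a minimal sketch of that idea for nucleotide diversity on one small tree. It is not the authors' implementation and does not use any real library's API: the function names, the toy tree, and the mutation placement are all hypothetical, and a single tree stands in for a full tree sequence. It assumes node IDs are ordered so that every child has a smaller ID than its parent.

```python
from math import comb


def subtree_weights(parent, sample_weight):
    """Total sample weight at or below each node.

    parent[u] is the parent of node u, or -1 for the root.
    Assumption: every child has a smaller ID than its parent, so a single
    pass in increasing ID order sees all of a node's children first.
    """
    total = list(sample_weight)
    for u, p in enumerate(parent):
        if p != -1:
            total[p] += total[u]
    return total


def diversity_summary(x, n):
    """Summary function for diversity: the fraction of sample pairs that are
    split by a branch (or mutation) with x of the n samples below it."""
    return x * (n - x) / comb(n, 2)


def site_stat(parent, sample_weight, mutation_nodes, n):
    """Site mode: sum the summary function over the nodes carrying mutations."""
    below = subtree_weights(parent, sample_weight)
    return sum(diversity_summary(below[u], n) for u in mutation_nodes)


def branch_stat(parent, time, sample_weight, n):
    """Branch mode: sum the summary function over branches, weighted by length."""
    below = subtree_weights(parent, sample_weight)
    return sum(
        (time[p] - time[u]) * diversity_summary(below[u], n)
        for u, p in enumerate(parent)
        if p != -1
    )


# Hypothetical toy tree: samples 0-3, internal nodes 4, 5, and root 6.
parent = [4, 4, 5, 6, 5, 6, -1]
time = [0, 0, 0, 0, 1, 2, 3]
weights = [1, 1, 1, 1, 0, 0, 0]  # weight 1 on each sample, 0 elsewhere
n = 4
mutation_nodes = [4, 2, 5]  # one mutation above each of these nodes (made up)

site_value = site_stat(parent, weights, mutation_nodes, n)
branch_value = branch_stat(parent, time, weights, n)
print(site_value, branch_value)
```

With unit weights and this summary function, the site value equals the mean number of pairwise differences among the four samples, and the branch value equals the mean pairwise path length in the tree. Choosing other weights or another summary function would give a different statistic from the same accumulation step.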


Similar Items