
Information compression exploits patterns of genome composition to discriminate populations and highlight regions of evolutionary interest

BACKGROUND: Genomic information allows population relatedness to be inferred and selected genes to be identified. Single nucleotide polymorphism microarray (SNP-chip) data, a proxy for genome composition, contains patterns in allele order and proportion. These patterns can be quantified by compressi...


Bibliographic Details
Main Authors: Hudson, Nicholas J; Porto-Neto, Laercio R; Kijas, James; McWilliam, Sean; Taft, Ryan J; Reverter, Antonio
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4015654/
https://www.ncbi.nlm.nih.gov/pubmed/24606587
http://dx.doi.org/10.1186/1471-2105-15-66
author Hudson, Nicholas J
Porto-Neto, Laercio R
Kijas, James
McWilliam, Sean
Taft, Ryan J
Reverter, Antonio
collection PubMed
description BACKGROUND: Genomic information allows population relatedness to be inferred and selected genes to be identified. Single nucleotide polymorphism microarray (SNP-chip) data, a proxy for genome composition, contains patterns in allele order and proportion. These patterns can be quantified by compression efficiency (CE). In principle, the composition of an entire genome can be represented by a CE number quantifying allele representation and order. RESULTS: We applied a compression algorithm (DEFLATE) to genome-wide high-density SNP data from 4,155 human, 1,800 cattle, 1,222 sheep, 81 dogs and 49 mice samples. All human ethnic groups can be clustered by CE and the clusters recover phylogeography based on traditional fixation index (F(ST)) analyses. CE analysis of other mammals results in segregation by breed or species, and is sensitive to admixture and past effective population size. This clustering is a consequence of individual patterns such as runs of homozygosity. Intriguingly, a related approach can also be used to identify genomic loci that show population-specific CE segregation. A high resolution CE ‘sliding window’ scan across the human genome, organised at the population level, revealed genes known to be under evolutionary pressure. These include SLC24A5 (European and Gujarati Indian skin pigmentation), HERC2 (European eye color), LCT (European and Maasai milk digestion) and EDAR (Asian hair thickness). We also identified a set of previously unidentified loci with high population-specific CE scores including the chromatin remodeler SCMH1 in Africans and EDA2R in Asians. Closer inspection reveals that these prioritised genomic regions do not correspond to simple runs of homozygosity but rather compositionally complex regions that are shared by many individuals of a given population. Unlike F(ST), CE analyses do not require ab initio population comparisons and are amenable to the hemizygous X chromosome. 
CONCLUSIONS: We conclude with a discussion of the implications of CE for a complex systems science view of genome evolution. CE allows one to clearly visualise the evolution of individual genomes and populations through a formal, mathematically-rigorous information space. Overall, CE makes a set of biological predictions, some of which are unique and await functional validation.
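The abstract above describes compressing SNP data with DEFLATE and summarising the result as a compression efficiency (CE) number, both genome-wide and in a sliding window. The sketch below illustrates that idea only and is not the authors' implementation. The abstract gives no CE formula, so the ratio used here (1 minus compressed size over original size) is an assumption, as are the one-character-per-SNP genotype coding and the function names `compression_efficiency` and `sliding_window_ce`.

```python
import random
import zlib


def compression_efficiency(genotypes: str) -> float:
    """Return 1 - compressed_size / original_size for a genotype string.

    Assumption: this ratio is one plausible definition of CE; the abstract
    does not state the exact formula.
    """
    raw = genotypes.encode("ascii")
    if not raw:
        raise ValueError("genotype string must not be empty")
    compressed = zlib.compress(raw, 9)  # zlib wraps a DEFLATE stream
    return 1.0 - len(compressed) / len(raw)


def sliding_window_ce(genotypes: str, window: int, step: int) -> list[float]:
    """CE of each window of `window` SNPs, advancing `step` SNPs at a time."""
    return [
        compression_efficiency(genotypes[i:i + window])
        for i in range(0, len(genotypes) - window + 1, step)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical coding: '0' and '2' are homozygous calls, '1' heterozygous.
    mixed = "".join(rng.choice("012") for _ in range(5000))
    homozygous_run = "0" * 5000
    print("mixed region CE:     ", compression_efficiency(mixed))
    print("homozygous region CE:", compression_efficiency(homozygous_run))
    scan = sliding_window_ce(mixed + homozygous_run, window=1000, step=500)
    print("windows scanned:     ", len(scan))
```

Under these assumptions a long run of identical genotypes compresses far better than a mixed region, so it receives a higher CE. This is the same kind of signal the abstract attributes to runs of homozygosity.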
format Online
Article
Text
id pubmed-4015654
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4015654 2014-05-23 Information compression exploits patterns of genome composition to discriminate populations and highlight regions of evolutionary interest Hudson, Nicholas J Porto-Neto, Laercio R Kijas, James McWilliam, Sean Taft, Ryan J Reverter, Antonio BMC Bioinformatics Research Article BioMed Central 2014-03-07 /pmc/articles/PMC4015654/ /pubmed/24606587 http://dx.doi.org/10.1186/1471-2105-15-66 Text en Copyright © 2014 Hudson et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Information compression exploits patterns of genome composition to discriminate populations and highlight regions of evolutionary interest
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4015654/
https://www.ncbi.nlm.nih.gov/pubmed/24606587
http://dx.doi.org/10.1186/1471-2105-15-66