
Whole genome association mapping by incompatibilities and local perfect phylogenies

BACKGROUND: With current technology, vast amounts of data can be produced cheaply and efficiently in association studies. To prevent data analysis from becoming the bottleneck of such studies, fast and efficient analysis methods that scale to these data set sizes must be developed. RESULTS: We present a f...


Bibliographic Details
Main Authors: Mailund, Thomas, Besenbacher, Søren, Schierup, Mikkel H
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1624851/
https://www.ncbi.nlm.nih.gov/pubmed/17042942
http://dx.doi.org/10.1186/1471-2105-7-454
author Mailund, Thomas
Besenbacher, Søren
Schierup, Mikkel H
collection PubMed
description BACKGROUND: With current technology, vast amounts of data can be produced cheaply and efficiently in association studies. To prevent data analysis from becoming the bottleneck of such studies, fast and efficient analysis methods that scale to these data set sizes must be developed. RESULTS: We present a fast method for accurate localisation of disease-causing variants in high-density case-control association mapping experiments with large numbers of cases and controls. The method searches for significant clustering of case chromosomes in the "perfect" phylogenetic tree defined by the largest region around each marker that is compatible with a single phylogenetic tree. This perfect phylogenetic tree is treated as a decision tree for determining disease status, and scored by its accuracy as a decision tree. The rationale is that the perfect phylogeny near a disease-affecting mutation should provide more information about the affected/unaffected classification than random trees. If regions of compatibility contain few markers, e.g. because of large marker spacing, the algorithm can allow the inclusion of incompatibility markers in order to enlarge the regions prior to estimating their phylogeny. Haplotype data and phased genotype data can be analysed. The power and efficiency of the method are investigated on 1) simulated genotype data under different models of disease determination, 2) artificial data sets created from the HapMap resource, and 3) data sets used for testing other methods, in order to compare with these. Our method has the same accuracy as single marker association (SMA) in the simplest case of a single disease-causing mutation and a constant recombination rate.
However, in more complex scenarios of mutation heterogeneity and more complex haplotype structure, such as found in the HapMap data, our method outperforms SMA as well as other fast data mining approaches such as HapMiner and Haplotype Pattern Mining (HPM), despite being significantly faster. For unphased genotype data, an initial step of estimating the phase only slightly decreases the power of the method. The method was also found to accurately localise the known susceptibility variant in an empirical data set (the ΔF508 mutation for cystic fibrosis), and to find significant signals for association between the CYP2D6 gene and poor drug metabolism, although for this data set the highest association score is about 60 kb from the CYP2D6 gene. CONCLUSION: Our method has been implemented in the Blossoc (BLOck aSSOCiation) software. Using Blossoc, genome-wide chip-based surveys of 3 million SNPs in 1000 cases and 1000 controls can be analysed in less than two CPU hours.
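The abstract's notion of a region "compatible with a single phylogenetic tree" can be illustrated with the standard four-gamete test for pairs of biallelic markers. The following is only a minimal sketch under stated assumptions, not the Blossoc implementation: the function names `compatible` and `compatible_region`, the column-wise 0/1 encoding of haplotypes, and the greedy right-then-left extension order are all hypothetical choices made for illustration.

```python
def compatible(site_a, site_b):
    """Four-gamete test for two biallelic sites.

    Each site is a sequence of 0/1 alleles, one entry per haplotype.
    The pair fits a single perfect phylogeny unless all four
    gametes (0,0), (0,1), (1,0), (1,1) are observed.
    """
    return len(set(zip(site_a, site_b))) < 4


def compatible_region(sites, centre):
    """Greedily grow an interval [lo, hi] of site indices around `centre`.

    A neighbouring site is added only if it is pairwise compatible with
    every site already in the interval. The extension order (right, then
    left, repeated) is an assumption of this sketch.
    """
    lo = hi = centre
    grew = True
    while grew:
        grew = False
        # Try to extend one site to the right.
        if hi + 1 < len(sites) and all(
            compatible(sites[hi + 1], sites[k]) for k in range(lo, hi + 1)
        ):
            hi += 1
            grew = True
        # Try to extend one site to the left.
        if lo - 1 >= 0 and all(
            compatible(sites[lo - 1], sites[k]) for k in range(lo, hi + 1)
        ):
            lo -= 1
            grew = True
    return lo, hi


# Toy data: four haplotypes typed at four markers (one tuple per marker).
sites = [
    (0, 0, 1, 1),
    (0, 1, 1, 1),
    (0, 1, 0, 1),
    (1, 1, 0, 0),
]
region = compatible_region(sites, 1)
```

In this toy example the first and third markers show all four gametes, so the region grown around the second marker cannot contain both. Scoring the tree built from such a region as a decision tree for case/control status, as the abstract describes, is not covered by this sketch.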
format Text
id pubmed-1624851
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1624851 2006-10-26 Whole genome association mapping by incompatibilities and local perfect phylogenies Mailund, Thomas Besenbacher, Søren Schierup, Mikkel H BMC Bioinformatics Methodology Article BioMed Central 2006-10-16 /pmc/articles/PMC1624851/ /pubmed/17042942 http://dx.doi.org/10.1186/1471-2105-7-454 Text en Copyright © 2006 Mailund et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Whole genome association mapping by incompatibilities and local perfect phylogenies
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1624851/
https://www.ncbi.nlm.nih.gov/pubmed/17042942
http://dx.doi.org/10.1186/1471-2105-7-454