Cargando…
Whole genome association mapping by incompatibilities and local perfect phylogenies
BACKGROUND: With current technology, vast amounts of data can be cheaply and efficiently produced in association studies, and to prevent data analysis to become the bottleneck of studies, fast and efficient analysis methods that scale to such data set sizes must be developed. RESULTS: We present a f...
Autores principales: | , , |
---|---|
Formato: | Texto |
Lenguaje: | English |
Publicado: |
BioMed Central
2006
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1624851/ https://www.ncbi.nlm.nih.gov/pubmed/17042942 http://dx.doi.org/10.1186/1471-2105-7-454 |
_version_ | 1782130575459483648 |
---|---|
author | Mailund, Thomas Besenbacher, Søren Schierup, Mikkel H |
author_facet | Mailund, Thomas Besenbacher, Søren Schierup, Mikkel H |
author_sort | Mailund, Thomas |
collection | PubMed |
description | BACKGROUND: With current technology, vast amounts of data can be cheaply and efficiently produced in association studies, and to prevent data analysis to become the bottleneck of studies, fast and efficient analysis methods that scale to such data set sizes must be developed. RESULTS: We present a fast method for accurate localisation of disease causing variants in high density case-control association mapping experiments with large numbers of cases and controls. The method searches for significant clustering of case chromosomes in the "perfect" phylogenetic tree defined by the largest region around each marker that is compatible with a single phylogenetic tree. This perfect phylogenetic tree is treated as a decision tree for determining disease status, and scored by its accuracy as a decision tree. The rationale for this is that the perfect phylogeny near a disease affecting mutation should provide more information about the affected/unaffected classification than random trees. If regions of compatibility contain few markers, due to e.g. large marker spacing, the algorithm can allow the inclusion of incompatibility markers in order to enlarge the regions prior to estimating their phylogeny. Haplotype data and phased genotype data can be analysed. The power and efficiency of the method is investigated on 1) simulated genotype data under different models of disease determination 2) artificial data sets created from the HapMap ressource, and 3) data sets used for testing of other methods in order to compare with these. Our method has the same accuracy as single marker association (SMA) in the simplest case of a single disease causing mutation and a constant recombination rate. However, when it comes to more complex scenarios of mutation heterogeneity and more complex haplotype structure such as found in the HapMap data our method outperforms SMA as well as other fast, data mining approaches such as HapMiner and Haplotype Pattern Mining (HPM) despite being significantly faster. For unphased genotype data, an initial step of estimating the phase only slightly decreases the power of the method. The method was also found to accurately localise the known susceptibility variants in an empirical data set – the ΔF508 mutation for cystic fibrosis – where the susceptibility variant is already known – and to find significant signals for association between the CYP2D6 gene and poor drug metabolism, although for this dataset the highest association score is about 60 kb from the CYP2D6 gene. CONCLUSION: Our method has been implemented in the Blossoc (BLOck aSSOCiation) software. Using Blossoc, genome wide chip-based surveys of 3 million SNPs in 1000 cases and 1000 controls can be analysed in less than two CPU hours. |
format | Text |
id | pubmed-1624851 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2006 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-16248512006-10-26 Whole genome association mapping by incompatibilities and local perfect phylogenies Mailund, Thomas Besenbacher, Søren Schierup, Mikkel H BMC Bioinformatics Methodology Article BACKGROUND: With current technology, vast amounts of data can be cheaply and efficiently produced in association studies, and to prevent data analysis to become the bottleneck of studies, fast and efficient analysis methods that scale to such data set sizes must be developed. RESULTS: We present a fast method for accurate localisation of disease causing variants in high density case-control association mapping experiments with large numbers of cases and controls. The method searches for significant clustering of case chromosomes in the "perfect" phylogenetic tree defined by the largest region around each marker that is compatible with a single phylogenetic tree. This perfect phylogenetic tree is treated as a decision tree for determining disease status, and scored by its accuracy as a decision tree. The rationale for this is that the perfect phylogeny near a disease affecting mutation should provide more information about the affected/unaffected classification than random trees. If regions of compatibility contain few markers, due to e.g. large marker spacing, the algorithm can allow the inclusion of incompatibility markers in order to enlarge the regions prior to estimating their phylogeny. Haplotype data and phased genotype data can be analysed. The power and efficiency of the method is investigated on 1) simulated genotype data under different models of disease determination 2) artificial data sets created from the HapMap ressource, and 3) data sets used for testing of other methods in order to compare with these. Our method has the same accuracy as single marker association (SMA) in the simplest case of a single disease causing mutation and a constant recombination rate. However, when it comes to more complex scenarios of mutation heterogeneity and more complex haplotype structure such as found in the HapMap data our method outperforms SMA as well as other fast, data mining approaches such as HapMiner and Haplotype Pattern Mining (HPM) despite being significantly faster. For unphased genotype data, an initial step of estimating the phase only slightly decreases the power of the method. The method was also found to accurately localise the known susceptibility variants in an empirical data set – the ΔF508 mutation for cystic fibrosis – where the susceptibility variant is already known – and to find significant signals for association between the CYP2D6 gene and poor drug metabolism, although for this dataset the highest association score is about 60 kb from the CYP2D6 gene. CONCLUSION: Our method has been implemented in the Blossoc (BLOck aSSOCiation) software. Using Blossoc, genome wide chip-based surveys of 3 million SNPs in 1000 cases and 1000 controls can be analysed in less than two CPU hours. BioMed Central 2006-10-16 /pmc/articles/PMC1624851/ /pubmed/17042942 http://dx.doi.org/10.1186/1471-2105-7-454 Text en Copyright © 2006 Mailund et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methodology Article Mailund, Thomas Besenbacher, Søren Schierup, Mikkel H Whole genome association mapping by incompatibilities and local perfect phylogenies |
title | Whole genome association mapping by incompatibilities and local perfect phylogenies |
title_full | Whole genome association mapping by incompatibilities and local perfect phylogenies |
title_fullStr | Whole genome association mapping by incompatibilities and local perfect phylogenies |
title_full_unstemmed | Whole genome association mapping by incompatibilities and local perfect phylogenies |
title_short | Whole genome association mapping by incompatibilities and local perfect phylogenies |
title_sort | whole genome association mapping by incompatibilities and local perfect phylogenies |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1624851/ https://www.ncbi.nlm.nih.gov/pubmed/17042942 http://dx.doi.org/10.1186/1471-2105-7-454 |
work_keys_str_mv | AT mailundthomas wholegenomeassociationmappingbyincompatibilitiesandlocalperfectphylogenies AT besenbachersøren wholegenomeassociationmappingbyincompatibilitiesandlocalperfectphylogenies AT schierupmikkelh wholegenomeassociationmappingbyincompatibilitiesandlocalperfectphylogenies |