
Efficient haplotype block recognition of very long and dense genetic sequences

BACKGROUND: New sequencing technologies make it possible to scan very long and dense genetic sequences, yielding datasets of genetic markers that are an order of magnitude larger than previously available. Such genetic sequences are characterized by common alleles interspersed with multiple rarer alleles...


Bibliographic Details
Main Authors: Taliun, Daniel, Gamper, Johann, Pattaro, Cristian
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3898000/
https://www.ncbi.nlm.nih.gov/pubmed/24423111
http://dx.doi.org/10.1186/1471-2105-15-10
_version_ 1782300340079558656
author Taliun, Daniel
Gamper, Johann
Pattaro, Cristian
author_facet Taliun, Daniel
Gamper, Johann
Pattaro, Cristian
author_sort Taliun, Daniel
collection PubMed
description BACKGROUND: New sequencing technologies make it possible to scan very long and dense genetic sequences, yielding datasets of genetic markers that are an order of magnitude larger than previously available. Such genetic sequences are characterized by common alleles interspersed with multiple rarer alleles. This situation has renewed interest in identifying haplotypes that carry rare risk alleles. However, large-scale explorations of the linkage disequilibrium (LD) pattern to identify haplotype blocks are not easy to perform, because traditional algorithms have at least Θ(n²) time and memory complexity. RESULTS: We derived three incremental optimizations of the widely used haplotype block recognition algorithm proposed by Gabriel et al. in 2002. Our most efficient solution, called MIG++, has only Θ(n) memory complexity and, on a genome-wide scale, omits >80% of the calculations, which makes it an order of magnitude faster than the original algorithm. Unlike existing software, MIG++ analyzes the LD between SNPs at any distance, avoiding restrictions on the maximal block length. The haplotype block partition of the entire HapMap II CEPH dataset was obtained in 457 hours. Replacing the standard likelihood-based D′ variance estimator with an approximate estimator further improved the runtime. While producing a coarser partition, the approximate method made it possible to obtain the full-genome haplotype block partition of the entire 1000 Genomes Project CEPH dataset in 44 hours, with no restrictions on allele frequency or long-range correlations. These experiments showed that LD-based haplotype blocks can span more than one million base pairs in both the HapMap II and 1000 Genomes datasets. An application to the North American Rheumatoid Arthritis Consortium (NARAC) dataset shows how MIG++ can support genome-wide haplotype association studies.
CONCLUSIONS: MIG++ makes it possible to perform LD-based haplotype block recognition on genetic sequences of any length and density. In the era of next-generation sequencing, this can help identify haplotypes that carry rare variants of interest. The low computational requirements open the possibility of including the haplotype block structure in genome-wide association scans, downstream analyses, and visual interfaces for online genome browsers.
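To make the quadratic cost mentioned in the abstract concrete, the following is a minimal sketch of the textbook pairwise D′ computation between biallelic SNPs. It is not the authors' MIG++ code and does not compute the confidence intervals or any of the optimizations described in the paper. The function names `d_prime` and `pairwise_d_prime` and the 0/1 encoding of phased haplotypes are assumptions made for illustration only.

```python
from itertools import combinations


def d_prime(hap_a, hap_b):
    """Normalized LD coefficient D' between two biallelic SNPs.

    hap_a and hap_b are equal-length sequences of 0/1 alleles,
    one entry per phased haplotype (illustrative encoding).
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n  # frequency of allele 1 at the first SNP
    p_b = sum(hap_b) / n  # frequency of allele 1 at the second SNP
    # frequency of haplotypes carrying allele 1 at both SNPs
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0  # at least one SNP is monomorphic; D' is undefined, report 0
    return abs(d) / d_max


def pairwise_d_prime(snps):
    """Naive all-pairs scan: n SNPs require n * (n - 1) / 2 evaluations."""
    return {
        (i, j): d_prime(snps[i], snps[j])
        for i, j in combinations(range(len(snps)), 2)
    }
```

Storing one value per SNP pair, as this dictionary does, is what drives both time and memory to grow quadratically with the number of SNPs; the abstract reports that MIG++ avoids this by keeping memory at Θ(n) and omitting most of the pairwise calculations.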
format Online
Article
Text
id pubmed-3898000
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3898000 2014-02-05 Efficient haplotype block recognition of very long and dense genetic sequences Taliun, Daniel Gamper, Johann Pattaro, Cristian BMC Bioinformatics Methodology Article BioMed Central 2014-01-14 /pmc/articles/PMC3898000/ /pubmed/24423111 http://dx.doi.org/10.1186/1471-2105-15-10 Text en Copyright © 2014 Taliun et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methodology Article
Taliun, Daniel
Gamper, Johann
Pattaro, Cristian
Efficient haplotype block recognition of very long and dense genetic sequences
title Efficient haplotype block recognition of very long and dense genetic sequences
title_full Efficient haplotype block recognition of very long and dense genetic sequences
title_fullStr Efficient haplotype block recognition of very long and dense genetic sequences
title_full_unstemmed Efficient haplotype block recognition of very long and dense genetic sequences
title_short Efficient haplotype block recognition of very long and dense genetic sequences
title_sort efficient haplotype block recognition of very long and dense genetic sequences
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3898000/
https://www.ncbi.nlm.nih.gov/pubmed/24423111
http://dx.doi.org/10.1186/1471-2105-15-10
work_keys_str_mv AT taliundaniel efficienthaplotypeblockrecognitionofverylonganddensegeneticsequences
AT gamperjohann efficienthaplotypeblockrecognitionofverylonganddensegeneticsequences
AT pattarocristian efficienthaplotypeblockrecognitionofverylonganddensegeneticsequences