Fast and Robust Identity-by-Descent Inference with the Templated Positional Burrows–Wheeler Transform

Estimating the genomic location and length of identical-by-descent (IBD) segments among individuals is a crucial step in many genetic analyses. However, the exponential growth in the size of biobank and direct-to-consumer genetic data sets makes accurate IBD inference a significant computational cha...

Bibliographic Details
Main Authors: Freyman, William A; McManus, Kimberly F; Shringarpure, Suyash S; Jewett, Ethan M; Bryc, Katarzyna; Auton, Adam
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8097300/
https://www.ncbi.nlm.nih.gov/pubmed/33355662
http://dx.doi.org/10.1093/molbev/msaa328
author Freyman, William A
McManus, Kimberly F
Shringarpure, Suyash S
Jewett, Ethan M
Bryc, Katarzyna
Auton, Adam
collection PubMed
description Estimating the genomic location and length of identical-by-descent (IBD) segments among individuals is a crucial step in many genetic analyses. However, the exponential growth in the size of biobank and direct-to-consumer genetic data sets makes accurate IBD inference a significant computational challenge. Here we present the templated positional Burrows–Wheeler transform (TPBWT) to make fast IBD estimates robust to genotype and phasing errors. Using haplotype data simulated over pedigrees with realistic genotyping and phasing errors, we show that the TPBWT outperforms other state-of-the-art IBD inference algorithms in terms of speed and accuracy. For each phase-aware method, we explore the false positive and false negative rates of inferring IBD by segment length and characterize the types of error commonly found. Our results highlight the fragility of most phased IBD inference methods; the accuracy of IBD estimates can be highly sensitive to the quality of haplotype phasing. Additionally, we compare the performance of the TPBWT against a widely used phase-free IBD inference approach that is robust to phasing errors. We introduce both in-sample and out-of-sample TPBWT-based IBD inference algorithms and demonstrate their computational efficiency on massive-scale data sets with millions of samples. Furthermore, we describe the binary file format for TPBWT-compressed haplotypes that results in fast and efficient out-of-sample IBD computes against very large cohort panels. Finally, we demonstrate the utility of the TPBWT in a brief empirical analysis, exploring geographic patterns of haplotype sharing within Mexico. Hierarchical clustering of IBD shared across regions within Mexico reveals geographically structured haplotype sharing and a strong signal of isolation by distance. Our software implementation of the TPBWT is freely available for noncommercial use in the code repository (https://github.com/23andMe/phasedibd, last accessed January 11, 2021).
format Online
Article
Text
id pubmed-8097300
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8097300 2021-05-10 Fast and Robust Identity-by-Descent Inference with the Templated Positional Burrows–Wheeler Transform. Freyman, William A; McManus, Kimberly F; Shringarpure, Suyash S; Jewett, Ethan M; Bryc, Katarzyna; Auton, Adam. Mol Biol Evol. Methods. Oxford University Press 2020-12-23 /pmc/articles/PMC8097300/ /pubmed/33355662 http://dx.doi.org/10.1093/molbev/msaa328 Text en © The Author(s) 2020. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Fast and Robust Identity-by-Descent Inference with the Templated Positional Burrows–Wheeler Transform
topic Methods
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8097300/
https://www.ncbi.nlm.nih.gov/pubmed/33355662
http://dx.doi.org/10.1093/molbev/msaa328