A Random Forests Framework for Modeling Haplotypes as Mosaics of Reference Haplotypes
Many genomic data analyses such as phasing, genotype imputation, or local ancestry inference share a common core task: matching pairs of haplotypes at any position along the chromosome, thereby inferring a target haplotype as a succession of pieces from reference haplotypes, commonly called a mosaic...
Main authors: | Faux, Pierre; Geurts, Pierre; Druet, Tom |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2019 |
Subjects: | Genetics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6610336/ https://www.ncbi.nlm.nih.gov/pubmed/31316542 http://dx.doi.org/10.3389/fgene.2019.00562 |
_version_ | 1783432489030647808 |
---|---|
author | Faux, Pierre Geurts, Pierre Druet, Tom |
author_facet | Faux, Pierre Geurts, Pierre Druet, Tom |
author_sort | Faux, Pierre |
collection | PubMed |
description | Many genomic data analyses such as phasing, genotype imputation, or local ancestry inference share a common core task: matching pairs of haplotypes at any position along the chromosome, thereby inferring a target haplotype as a succession of pieces from reference haplotypes, commonly called a mosaic of reference haplotypes. For that purpose, these analyses combine information provided by linkage disequilibrium, linkage and/or genealogy through a set of heuristic rules or, most often, by a hidden Markov model. Here, we develop an extremely randomized trees framework to address the issue of local haplotype matching. In our approach, a supervised classifier using extra-trees (a particular type of random forests) learns how to identify the best local matches between haplotypes using a collection of observed examples. For each example, various features related to the different sources of information are observed, such as the length of a segment shared between haplotypes, or estimates of relationships between individuals, gametes, and haplotypes. The random forests framework was fed with 30 relevant features for local haplotype matching. Repeated cross-validations allowed ranking these features in regard to their importance for local haplotype matching. The distance to the edge of a segment shared by both haplotypes being matched was found to be the most important feature. Similarity comparisons between predicted and true whole-genome sequence haplotypes showed that the random forests framework was more efficient than a hidden Markov model in reconstructing a target haplotype as a mosaic of reference haplotypes. To further evaluate its efficiency, the random forests framework was applied to imputation of whole-genome sequence from 50k genotypes and it yielded average reliabilities similar or slightly better than IMPUTE2. Through this exploratory study, we lay the foundations of a new framework to automatically learn local haplotype matching and we show that extra-trees are a promising approach for such purposes. The use of this new technique also reveals some useful lessons on the relevant features for the purpose of haplotype matching. We also discuss potential improvements for routine implementation. |
format | Online Article Text |
id | pubmed-6610336 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-6610336 2019-07-17 A Random Forests Framework for Modeling Haplotypes as Mosaics of Reference Haplotypes Faux, Pierre Geurts, Pierre Druet, Tom Front Genet Genetics Many genomic data analyses such as phasing, genotype imputation, or local ancestry inference share a common core task: matching pairs of haplotypes at any position along the chromosome, thereby inferring a target haplotype as a succession of pieces from reference haplotypes, commonly called a mosaic of reference haplotypes. For that purpose, these analyses combine information provided by linkage disequilibrium, linkage and/or genealogy through a set of heuristic rules or, most often, by a hidden Markov model. Here, we develop an extremely randomized trees framework to address the issue of local haplotype matching. In our approach, a supervised classifier using extra-trees (a particular type of random forests) learns how to identify the best local matches between haplotypes using a collection of observed examples. For each example, various features related to the different sources of information are observed, such as the length of a segment shared between haplotypes, or estimates of relationships between individuals, gametes, and haplotypes. The random forests framework was fed with 30 relevant features for local haplotype matching. Repeated cross-validations allowed ranking these features in regard to their importance for local haplotype matching. The distance to the edge of a segment shared by both haplotypes being matched was found to be the most important feature. Similarity comparisons between predicted and true whole-genome sequence haplotypes showed that the random forests framework was more efficient than a hidden Markov model in reconstructing a target haplotype as a mosaic of reference haplotypes. To further evaluate its efficiency, the random forests framework was applied to imputation of whole-genome sequence from 50k genotypes and it yielded average reliabilities similar or slightly better than IMPUTE2. Through this exploratory study, we lay the foundations of a new framework to automatically learn local haplotype matching and we show that extra-trees are a promising approach for such purposes. The use of this new technique also reveals some useful lessons on the relevant features for the purpose of haplotype matching. We also discuss potential improvements for routine implementation. Frontiers Media S.A. 2019-06-27 /pmc/articles/PMC6610336/ /pubmed/31316542 http://dx.doi.org/10.3389/fgene.2019.00562 Text en Copyright © 2019 Faux, Geurts and Druet. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Genetics Faux, Pierre Geurts, Pierre Druet, Tom A Random Forests Framework for Modeling Haplotypes as Mosaics of Reference Haplotypes |
title | A Random Forests Framework for Modeling Haplotypes as Mosaics of Reference Haplotypes |
title_full | A Random Forests Framework for Modeling Haplotypes as Mosaics of Reference Haplotypes |
title_fullStr | A Random Forests Framework for Modeling Haplotypes as Mosaics of Reference Haplotypes |
title_full_unstemmed | A Random Forests Framework for Modeling Haplotypes as Mosaics of Reference Haplotypes |
title_short | A Random Forests Framework for Modeling Haplotypes as Mosaics of Reference Haplotypes |
title_sort | random forests framework for modeling haplotypes as mosaics of reference haplotypes |
topic | Genetics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6610336/ https://www.ncbi.nlm.nih.gov/pubmed/31316542 http://dx.doi.org/10.3389/fgene.2019.00562 |
work_keys_str_mv | AT fauxpierre arandomforestsframeworkformodelinghaplotypesasmosaicsofreferencehaplotypes AT geurtspierre arandomforestsframeworkformodelinghaplotypesasmosaicsofreferencehaplotypes AT druettom arandomforestsframeworkformodelinghaplotypesasmosaicsofreferencehaplotypes AT fauxpierre randomforestsframeworkformodelinghaplotypesasmosaicsofreferencehaplotypes AT geurtspierre randomforestsframeworkformodelinghaplotypesasmosaicsofreferencehaplotypes AT druettom randomforestsframeworkformodelinghaplotypesasmosaicsofreferencehaplotypes |
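
Illustrative note: the description in this record speaks of a target haplotype as a "mosaic of reference haplotypes" and names the distance to the edge of a shared segment as the most important feature. The sketch below is only a toy illustration of those two ideas, not the authors' implementation. It assumes haplotypes are lists of 0/1 alleles of equal length and that mosaic segments are contiguous, ordered, and cover every marker; the names `build_mosaic`, `distance_to_segment_edge`, and `similarity` are hypothetical.

```python
from typing import List, Tuple

# Assumption: a haplotype is a list of 0/1 alleles, one entry per marker.
Haplotype = List[int]
# Assumption: a mosaic segment is (first marker, last marker inclusive, reference index).
Segment = Tuple[int, int, int]


def build_mosaic(segments: List[Segment], references: List[Haplotype]) -> Haplotype:
    """Build a target haplotype by copying each segment from its reference."""
    target: Haplotype = []
    for start, end, ref in segments:
        target.extend(references[ref][start:end + 1])
    return target


def distance_to_segment_edge(a: Haplotype, b: Haplotype, pos: int) -> int:
    """Number of markers from `pos` to the nearest edge of the run of
    identical alleles shared by `a` and `b` around `pos`.
    Returns -1 when the two haplotypes differ at `pos`."""
    if a[pos] != b[pos]:
        return -1
    left = pos
    while left > 0 and a[left - 1] == b[left - 1]:
        left -= 1
    right = pos
    while right < len(a) - 1 and a[right + 1] == b[right + 1]:
        right += 1
    return min(pos - left, right - pos)


def similarity(a: Haplotype, b: Haplotype) -> float:
    """Fraction of markers at which two equal-length haplotypes agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy data (made up for illustration only).
references = [
    [0, 0, 1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1, 1, 0, 1],
]
segments = [(0, 3, 0), (4, 7, 1)]  # markers 0-3 from reference 0, 4-7 from reference 1
mosaic = build_mosaic(segments, references)
true_target = [0, 0, 1, 1, 1, 1, 0, 0]

print(mosaic)
print(similarity(mosaic, true_target))
print(distance_to_segment_edge(mosaic, references[1], 5))
```

In a learned framework such as the one the description outlines, a value like the one returned by `distance_to_segment_edge` would be one input feature among many for a classifier; here it is computed on its own purely to make the notion concrete.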