High-Accuracy HLA Type Inference from Whole-Genome Sequencing Data Using Population Reference Graphs

Genetic variation at the Human Leucocyte Antigen (HLA) genes is associated with many autoimmune and infectious disease phenotypes, is an important element of the immunological distinction between self and non-self, and shapes immune epitope repertoires. Determining the allelic state of the HLA genes...

Bibliographic Details
Main Authors: Dilthey, Alexander T., Gourraud, Pierre-Antoine, Mentzer, Alexander J., Cereb, Nezih, Iqbal, Zamin, McVean, Gil
Format: Online Article Text
Language: English
Published: Public Library of Science 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5085092/
https://www.ncbi.nlm.nih.gov/pubmed/27792722
http://dx.doi.org/10.1371/journal.pcbi.1005151
_version_ 1782463500663128064
author Dilthey, Alexander T.
Gourraud, Pierre-Antoine
Mentzer, Alexander J.
Cereb, Nezih
Iqbal, Zamin
McVean, Gil
collection PubMed
description Genetic variation at the Human Leucocyte Antigen (HLA) genes is associated with many autoimmune and infectious disease phenotypes, is an important element of the immunological distinction between self and non-self, and shapes immune epitope repertoires. Determining the allelic state of the HLA genes (HLA typing) as a by-product of standard whole-genome sequencing data would therefore be highly desirable and enable the immunogenetic characterization of samples in currently ongoing population sequencing projects. Extensive hyperpolymorphism and sequence similarity between the HLA genes, however, pose problems for accurate read mapping and make HLA type inference from whole-genome sequencing data a challenging problem. We describe how to address these challenges in a Population Reference Graph (PRG) framework. First, we construct a PRG for 46 (mostly HLA) genes and pseudogenes, their genomic context and their characterized sequence variants, integrating a database of over 10,000 known allele sequences. Second, we present a sequence-to-PRG paired-end read mapping algorithm that enables accurate read mapping for the HLA genes. Third, we infer the most likely pair of underlying alleles at G group resolution from the IMGT/HLA database at each locus, employing a simple likelihood framework. We show that HLA*PRG, our algorithm, outperforms existing methods by a wide margin. We evaluate HLA*PRG on six classical class I and class II HLA genes (HLA-A, -B, -C, -DQA1, -DQB1, -DRB1) and on a set of 14 samples (3 samples with 2 x 100bp, 11 samples with 2 x 250bp Illumina HiSeq data). Of 158 alleles tested, we correctly infer 157 alleles (99.4%). We also identify and re-type two erroneous alleles in the original validation data. We conclude that HLA*PRG for the first time achieves accuracies comparable to gold-standard reference methods from standard whole-genome sequencing data, though high computational demands (currently ~30–250 CPU hours per sample) remain a significant challenge to practical application.
format Online
Article
Text
id pubmed-5085092
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-50850922016-11-04 High-Accuracy HLA Type Inference from Whole-Genome Sequencing Data Using Population Reference Graphs Dilthey, Alexander T. Gourraud, Pierre-Antoine Mentzer, Alexander J. Cereb, Nezih Iqbal, Zamin McVean, Gil PLoS Comput Biol Research Article Genetic variation at the Human Leucocyte Antigen (HLA) genes is associated with many autoimmune and infectious disease phenotypes, is an important element of the immunological distinction between self and non-self, and shapes immune epitope repertoires. Determining the allelic state of the HLA genes (HLA typing) as a by-product of standard whole-genome sequencing data would therefore be highly desirable and enable the immunogenetic characterization of samples in currently ongoing population sequencing projects. Extensive hyperpolymorphism and sequence similarity between the HLA genes, however, pose problems for accurate read mapping and make HLA type inference from whole-genome sequencing data a challenging problem. We describe how to address these challenges in a Population Reference Graph (PRG) framework. First, we construct a PRG for 46 (mostly HLA) genes and pseudogenes, their genomic context and their characterized sequence variants, integrating a database of over 10,000 known allele sequences. Second, we present a sequence-to-PRG paired-end read mapping algorithm that enables accurate read mapping for the HLA genes. Third, we infer the most likely pair of underlying alleles at G group resolution from the IMGT/HLA database at each locus, employing a simple likelihood framework. We show that HLA*PRG, our algorithm, outperforms existing methods by a wide margin. We evaluate HLA*PRG on six classical class I and class II HLA genes (HLA-A, -B, -C, -DQA1, -DQB1, -DRB1) and on a set of 14 samples (3 samples with 2 x 100bp, 11 samples with 2 x 250bp Illumina HiSeq data). Of 158 alleles tested, we correctly infer 157 alleles (99.4%). 
We also identify and re-type two erroneous alleles in the original validation data. We conclude that HLA*PRG for the first time achieves accuracies comparable to gold-standard reference methods from standard whole-genome sequencing data, though high computational demands (currently ~30–250 CPU hours per sample) remain a significant challenge to practical application. Public Library of Science 2016-10-28 /pmc/articles/PMC5085092/ /pubmed/27792722 http://dx.doi.org/10.1371/journal.pcbi.1005151 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 (https://creativecommons.org/publicdomain/zero/1.0/) public domain dedication.
title High-Accuracy HLA Type Inference from Whole-Genome Sequencing Data Using Population Reference Graphs
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5085092/
https://www.ncbi.nlm.nih.gov/pubmed/27792722
http://dx.doi.org/10.1371/journal.pcbi.1005151