
Alternate-locus aware variant calling in whole genome sequencing

BACKGROUND: The last two human genome assemblies have extended the previous linear golden-path paradigm of the human genome to a graph-like model to better represent regions with a high degree of structural variability. The new model offers opportunities to improve the technical validity of variant...


Bibliographic Details
Main authors: Jäger, Marten; Schubach, Max; Zemojtel, Tomasz; Reinert, Knut; Church, Deanna M.; Robinson, Peter N.
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5155401/
https://www.ncbi.nlm.nih.gov/pubmed/27964746
http://dx.doi.org/10.1186/s13073-016-0383-z
author Jäger, Marten
Schubach, Max
Zemojtel, Tomasz
Reinert, Knut
Church, Deanna M.
Robinson, Peter N.
collection PubMed
description BACKGROUND: The last two human genome assemblies have extended the previous linear golden-path paradigm of the human genome to a graph-like model to better represent regions with a high degree of structural variability. The new model offers opportunities to improve the technical validity of variant calling in whole-genome sequencing (WGS).
METHODS: We developed an algorithm that analyzes the patterns of variant calls in the 178 structurally variable regions of the GRCh38 genome assembly, and infers whether a given sample is most likely to contain sequences from the primary assembly, an alternate locus, or their heterozygous combination at each of these 178 regions. We investigate 121 in-house WGS datasets that have been aligned to the GRCh37 and GRCh38 assemblies.
RESULTS: We show that stretches of sequences that are largely but not entirely identical between the primary assembly and an alternate locus can result in multiple variant calls against regions of the primary assembly. In WGS analysis, this results in characteristic and recognizable patterns of variant calls at positions that we term alignable scaffold-discrepant positions (ASDPs). In 121 in-house genomes, on average 51.8 ± 3.8 of the 178 regions were found to correspond best to an alternate locus rather than the primary assembly sequence, and filtering these genomes with our algorithm led to the identification of 7863 variant calls per genome that colocalized with ASDPs. Additionally, we found that 437 of 791 genome-wide association study hits located within one of the regions corresponded to ASDPs.
CONCLUSIONS: Our algorithm uses the information contained in the 178 structurally variable regions of the GRCh38 genome assembly to avoid spurious variant calls in cases where samples contain an alternate locus rather than the corresponding segment of the primary assembly. These results suggest the great potential of fully incorporating the resources of graph-like genome assemblies into variant calling, but also underscore the importance of developing computational resources that will allow a full reconstruction of the genotype in personal genomes. Our algorithm is freely available at https://github.com/charite/asdpex.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13073-016-0383-z) contains supplementary material, which is available to authorized users.
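The idea summarized in the abstract (compare a sample's variant calls in one region against the known ASDPs, decide which scaffold the sample most likely carries, then set aside the calls explained by that scaffold) might be illustrated roughly as follows. This is a toy sketch only and is not the asdpex implementation: the names Variant, classify_region and filter_calls, the 0.5 match threshold, and the majority rule on homozygous genotypes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    pos: int        # 1-based position on the primary assembly
    alt: str        # alternate allele that was called
    genotype: str   # "hom" or "het" (simplified, hypothetical encoding)


def classify_region(asdps, calls, min_fraction=0.5):
    """Guess which scaffold a sample carries in one region.

    asdps: dict mapping position -> allele expected if the sample carries
           the alternate locus (the ASDPs of this region).
    calls: list of Variant called against the primary assembly.
    The threshold and the majority rule are illustrative assumptions.
    """
    if not asdps:
        return "primary"
    # Calls that coincide with an ASDP in both position and allele.
    matched = [v for v in calls if asdps.get(v.pos) == v.alt]
    if len(matched) / len(asdps) < min_fraction:
        return "primary"
    hom = sum(1 for v in matched if v.genotype == "hom")
    # Mostly homozygous ASDP calls suggest the alternate locus on both
    # haplotypes; otherwise assume a primary/alternate combination.
    return "alternate" if 2 * hom >= len(matched) else "heterozygous"


def filter_calls(asdps, calls, region_state):
    """Drop calls explained by the alternate locus, keep the rest."""
    if region_state == "primary":
        return list(calls)
    return [v for v in calls if asdps.get(v.pos) != v.alt]


# Hypothetical region with four ASDPs and four calls in one sample.
asdps = {100: "A", 150: "G", 200: "T", 250: "C"}
calls = [
    Variant(100, "A", "hom"),
    Variant(150, "G", "hom"),
    Variant(200, "T", "hom"),
    Variant(300, "G", "het"),  # not an ASDP, so it should survive filtering
]
state = classify_region(asdps, calls)
kept = filter_calls(asdps, calls, state)
```

In this made-up example three of the four ASDPs are matched by homozygous calls, so the region is classified as carrying the alternate locus and only the call at position 300 is retained. A real implementation would also need to handle alignment of the alternate scaffold to the primary assembly, indels, and genotype likelihoods, none of which this sketch attempts.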
format Online
Article
Text
id pubmed-5155401
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5155401 2016-12-20 Alternate-locus aware variant calling in whole genome sequencing Jäger, Marten Schubach, Max Zemojtel, Tomasz Reinert, Knut Church, Deanna M. Robinson, Peter N. Genome Med Research BioMed Central 2016-12-13 /pmc/articles/PMC5155401/ /pubmed/27964746 http://dx.doi.org/10.1186/s13073-016-0383-z Text en © The Author(s) 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Alternate-locus aware variant calling in whole genome sequencing
topic Research