Alternate-locus aware variant calling in whole genome sequencing
BACKGROUND: The last two human genome assemblies have extended the previous linear golden-path paradigm of the human genome to a graph-like model to better represent regions with a high degree of structural variability. The new model offers opportunities to improve the technical validity of variant...
Main authors: | Jäger, Marten; Schubach, Max; Zemojtel, Tomasz; Reinert, Knut; Church, Deanna M.; Robinson, Peter N. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2016 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5155401/ https://www.ncbi.nlm.nih.gov/pubmed/27964746 http://dx.doi.org/10.1186/s13073-016-0383-z |
_version_ | 1782474998090301440 |
---|---|
author | Jäger, Marten Schubach, Max Zemojtel, Tomasz Reinert, Knut Church, Deanna M. Robinson, Peter N. |
author_facet | Jäger, Marten Schubach, Max Zemojtel, Tomasz Reinert, Knut Church, Deanna M. Robinson, Peter N. |
author_sort | Jäger, Marten |
collection | PubMed |
description | BACKGROUND: The last two human genome assemblies have extended the previous linear golden-path paradigm of the human genome to a graph-like model to better represent regions with a high degree of structural variability. The new model offers opportunities to improve the technical validity of variant calling in whole-genome sequencing (WGS). METHODS: We developed an algorithm that analyzes the patterns of variant calls in the 178 structurally variable regions of the GRCh38 genome assembly, and infers whether a given sample is most likely to contain sequences from the primary assembly, an alternate locus, or their heterozygous combination at each of these 178 regions. We investigate 121 in-house WGS datasets that have been aligned to the GRCh37 and GRCh38 assemblies. RESULTS: We show that stretches of sequences that are largely but not entirely identical between the primary assembly and an alternate locus can result in multiple variant calls against regions of the primary assembly. In WGS analysis, this results in characteristic and recognizable patterns of variant calls at positions that we term alignable scaffold-discrepant positions (ASDPs). In 121 in-house genomes, on average 51.8±3.8 of the 178 regions were found to correspond best to an alternate locus rather than the primary assembly sequence, and filtering these genomes with our algorithm led to the identification of 7863 variant calls per genome that colocalized with ASDPs. Additionally, we found that 437 of 791 genome-wide association study hits located within one of the regions corresponded to ASDPs. CONCLUSIONS: Our algorithm uses the information contained in the 178 structurally variable regions of the GRCh38 genome assembly to avoid spurious variant calls in cases where samples contain an alternate locus rather than the corresponding segment of the primary assembly. These results suggest the great potential of fully incorporating the resources of graph-like genome assemblies into variant calling, but also underscore the importance of developing computational resources that will allow a full reconstruction of the genotype in personal genomes. Our algorithm is freely available at https://github.com/charite/asdpex. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13073-016-0383-z) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-5155401 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5155401 2016-12-20 Alternate-locus aware variant calling in whole genome sequencing Jäger, Marten Schubach, Max Zemojtel, Tomasz Reinert, Knut Church, Deanna M. Robinson, Peter N. Genome Med Research BACKGROUND: The last two human genome assemblies have extended the previous linear golden-path paradigm of the human genome to a graph-like model to better represent regions with a high degree of structural variability. The new model offers opportunities to improve the technical validity of variant calling in whole-genome sequencing (WGS). METHODS: We developed an algorithm that analyzes the patterns of variant calls in the 178 structurally variable regions of the GRCh38 genome assembly, and infers whether a given sample is most likely to contain sequences from the primary assembly, an alternate locus, or their heterozygous combination at each of these 178 regions. We investigate 121 in-house WGS datasets that have been aligned to the GRCh37 and GRCh38 assemblies. RESULTS: We show that stretches of sequences that are largely but not entirely identical between the primary assembly and an alternate locus can result in multiple variant calls against regions of the primary assembly. In WGS analysis, this results in characteristic and recognizable patterns of variant calls at positions that we term alignable scaffold-discrepant positions (ASDPs). In 121 in-house genomes, on average 51.8±3.8 of the 178 regions were found to correspond best to an alternate locus rather than the primary assembly sequence, and filtering these genomes with our algorithm led to the identification of 7863 variant calls per genome that colocalized with ASDPs. Additionally, we found that 437 of 791 genome-wide association study hits located within one of the regions corresponded to ASDPs. CONCLUSIONS: Our algorithm uses the information contained in the 178 structurally variable regions of the GRCh38 genome assembly to avoid spurious variant calls in cases where samples contain an alternate locus rather than the corresponding segment of the primary assembly. These results suggest the great potential of fully incorporating the resources of graph-like genome assemblies into variant calling, but also underscore the importance of developing computational resources that will allow a full reconstruction of the genotype in personal genomes. Our algorithm is freely available at https://github.com/charite/asdpex. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13073-016-0383-z) contains supplementary material, which is available to authorized users. BioMed Central 2016-12-13 /pmc/articles/PMC5155401/ /pubmed/27964746 http://dx.doi.org/10.1186/s13073-016-0383-z Text en © The Author(s) 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Jäger, Marten Schubach, Max Zemojtel, Tomasz Reinert, Knut Church, Deanna M. Robinson, Peter N. Alternate-locus aware variant calling in whole genome sequencing |
title | Alternate-locus aware variant calling in whole genome sequencing |
title_full | Alternate-locus aware variant calling in whole genome sequencing |
title_fullStr | Alternate-locus aware variant calling in whole genome sequencing |
title_full_unstemmed | Alternate-locus aware variant calling in whole genome sequencing |
title_short | Alternate-locus aware variant calling in whole genome sequencing |
title_sort | alternate-locus aware variant calling in whole genome sequencing |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5155401/ https://www.ncbi.nlm.nih.gov/pubmed/27964746 http://dx.doi.org/10.1186/s13073-016-0383-z |
work_keys_str_mv | AT jagermarten alternatelocusawarevariantcallinginwholegenomesequencing AT schubachmax alternatelocusawarevariantcallinginwholegenomesequencing AT zemojteltomasz alternatelocusawarevariantcallinginwholegenomesequencing AT reinertknut alternatelocusawarevariantcallinginwholegenomesequencing AT churchdeannam alternatelocusawarevariantcallinginwholegenomesequencing AT robinsonpetern alternatelocusawarevariantcallinginwholegenomesequencing |
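
The abstract above describes inferring, for each structurally variable region, whether a sample carries the primary assembly sequence, an alternate locus, or a heterozygous combination, based on the pattern of variant calls at ASDPs. The following Python sketch illustrates that general idea only. It is not the asdpex implementation: the `VariantCall` type, the function names, the `min_fraction` threshold, and the majority rule on homozygous calls are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

Position = Tuple[str, int]  # (chromosome, 1-based position); hypothetical type


@dataclass(frozen=True)
class VariantCall:
    chrom: str
    pos: int
    genotype: str  # "het" or "hom"; a simplification assumed for this sketch


def classify_region(asdps: Set[Position],
                    calls: Iterable[VariantCall],
                    min_fraction: float = 0.5) -> str:
    """Guess which sequence a sample carries in one region.

    Assumed heuristic (not taken from the paper): if fewer than
    `min_fraction` of the region's ASDPs carry a variant call, report
    "primary"; otherwise report "alternate" when at least half of the
    ASDP calls are homozygous, else "heterozygous".
    """
    hits = [c for c in calls if (c.chrom, c.pos) in asdps]
    hit_positions = {(c.chrom, c.pos) for c in hits}
    if not asdps or len(hit_positions) / len(asdps) < min_fraction:
        return "primary"
    hom = sum(1 for c in hits if c.genotype == "hom")
    return "alternate" if 2 * hom >= len(hits) else "heterozygous"


def split_asdp_calls(asdps: Set[Position],
                     calls: Iterable[VariantCall]
                     ) -> Tuple[List[VariantCall], List[VariantCall]]:
    """Separate calls into (kept, colocalized_with_asdp)."""
    kept: List[VariantCall] = []
    flagged: List[VariantCall] = []
    for c in calls:
        (flagged if (c.chrom, c.pos) in asdps else kept).append(c)
    return kept, flagged


# Toy data, invented for illustration only.
ASDPS = {("chr1", 100), ("chr1", 200), ("chr1", 300), ("chr1", 400)}
SAMPLE = [
    VariantCall("chr1", 100, "hom"),
    VariantCall("chr1", 150, "het"),  # not an ASDP
    VariantCall("chr1", 200, "hom"),
    VariantCall("chr1", 300, "hom"),
]
```

With this toy data, `classify_region(ASDPS, SAMPLE)` reports the region as matching the alternate locus, and `split_asdp_calls(ASDPS, SAMPLE)` keeps only the call at position 150. A real analysis would read ASDP positions and genotypes from the assembly alignments and VCF files rather than from hand-written sets.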