
Towards pan-genome read alignment to improve variation calling

BACKGROUND: A typical human genome differs from the reference genome at 4-5 million sites. This diversity is increasingly catalogued in repositories such as ExAC/gnomAD, consisting of >15,000 whole genomes and >126,000 exome sequences from different individuals. Despite this enormous diversity,...


Bibliographic Details
Main authors: Valenzuela, Daniel; Norri, Tuukka; Välimäki, Niko; Pitkänen, Esa; Mäkinen, Veli
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5954285/
https://www.ncbi.nlm.nih.gov/pubmed/29764365
http://dx.doi.org/10.1186/s12864-018-4465-8
author Valenzuela, Daniel
Norri, Tuukka
Välimäki, Niko
Pitkänen, Esa
Mäkinen, Veli
collection PubMed
description BACKGROUND: A typical human genome differs from the reference genome at 4-5 million sites. This diversity is increasingly catalogued in repositories such as ExAC/gnomAD, consisting of >15,000 whole genomes and >126,000 exome sequences from different individuals. Despite this enormous diversity, resequencing data workflows are still based on a single human reference genome. Identification and genotyping of genetic variants are typically carried out on short-read data aligned to a single reference, disregarding the underlying variation. RESULTS: We propose a new unified framework for variant calling with short-read data utilizing a representation of human genetic variation – a pan-genomic reference. We provide a modular pipeline that can be seamlessly incorporated into existing sequencing data analysis workflows. Our tool is open source and available online: https://gitlab.com/dvalenzu/PanVC. CONCLUSIONS: Our experiments show that by replacing a standard human reference with a pan-genomic one we achieve an improvement in single-nucleotide variant calling accuracy and in short indel calling accuracy over the widely adopted Genome Analysis Toolkit (GATK) in difficult genomic regions.
format Online
Article
Text
id pubmed-5954285
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5954285 2018-05-21 BMC Genomics Research BioMed Central 2018-05-09 /pmc/articles/PMC5954285/ /pubmed/29764365 http://dx.doi.org/10.1186/s12864-018-4465-8 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Towards pan-genome read alignment to improve variation calling
topic Research