Alignment of 1000 Genomes Project reads to reference assembly GRCh38

The 1000 Genomes Project produced more than 100 trillion basepairs of short read sequence from more than 2600 samples in 26 populations over a period of five years. In its final phase, the project released over 85 million genotyped and phased variants on human reference genome assembly GRCh37. An up...

Bibliographic Details
Main authors: Zheng-Bradley, Xiangqun; Streeter, Ian; Fairley, Susan; Richardson, David; Clarke, Laura; Flicek, Paul
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5522380/
https://www.ncbi.nlm.nih.gov/pubmed/28531267
http://dx.doi.org/10.1093/gigascience/gix038
author Zheng-Bradley, Xiangqun
Streeter, Ian
Fairley, Susan
Richardson, David
Clarke, Laura
Flicek, Paul
collection PubMed
description The 1000 Genomes Project produced more than 100 trillion basepairs of short read sequence from more than 2600 samples in 26 populations over a period of five years. In its final phase, the project released over 85 million genotyped and phased variants on human reference genome assembly GRCh37. An updated reference assembly, GRCh38, was released in late 2013, but there was insufficient time for the final phase of the project analysis to change to the new assembly. Although it is possible to lift the coordinates of the 1000 Genomes Project variants to the new assembly, this is a potentially error-prone process as coordinate remapping is most appropriate only for non-repetitive regions of the genome and those that did not see significant change between the two assemblies. It will also miss variants in any region that was newly added to GRCh38. Thus, to produce the highest quality variants and genotypes on GRCh38, the best strategy is to realign the reads and recall the variants based on the new alignment. As the first step of variant calling for the 1000 Genomes Project data, we have finished remapping all of the 1000 Genomes sequence reads to GRCh38 with alternative scaffold–aware BWA-MEM. The resulting alignments are available as CRAM, a reference-based sequence compression format. The data have been released on our FTP site and are also available from European Nucleotide Archive to facilitate researchers discovering variants on the primary sequences and alternative contigs of GRCh38.
format Online
Article
Text
id pubmed-5522380
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5522380 2017-07-26 Alignment of 1000 Genomes Project reads to reference assembly GRCh38 Zheng-Bradley, Xiangqun Streeter, Ian Fairley, Susan Richardson, David Clarke, Laura Flicek, Paul Gigascience Data Note Oxford University Press 2017-05-20 /pmc/articles/PMC5522380/ /pubmed/28531267 http://dx.doi.org/10.1093/gigascience/gix038 Text en © The Authors 2017. Published by Oxford University Press.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Alignment of 1000 Genomes Project reads to reference assembly GRCh38
topic Data Note
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5522380/
https://www.ncbi.nlm.nih.gov/pubmed/28531267
http://dx.doi.org/10.1093/gigascience/gix038