De Novo Assembly of Two Swedish Genomes Reveals Missing Segments from the Human GRCh38 Reference and Improves Variant Calling of Population-Scale Sequencing Data

The current human reference sequence (GRCh38) is a foundation for large-scale sequencing projects. However, recent studies have suggested that GRCh38 may be incomplete and give a suboptimal representation of specific population groups. Here, we performed a de novo assembly of two Swedish genomes tha...

Bibliographic Details
Main authors: Ameur, Adam; Che, Huiwen; Martin, Marcel; Bunikis, Ignas; Dahlberg, Johan; Höijer, Ida; Häggqvist, Susana; Vezzi, Francesco; Nordlund, Jessica; Olason, Pall; Feuk, Lars; Gyllensten, Ulf
Format: Online Article Text
Language: English
Published: MDPI 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6210158/
https://www.ncbi.nlm.nih.gov/pubmed/30304863
http://dx.doi.org/10.3390/genes9100486
_version_ 1783367050054336512
author Ameur, Adam
Che, Huiwen
Martin, Marcel
Bunikis, Ignas
Dahlberg, Johan
Höijer, Ida
Häggqvist, Susana
Vezzi, Francesco
Nordlund, Jessica
Olason, Pall
Feuk, Lars
Gyllensten, Ulf
collection PubMed
description The current human reference sequence (GRCh38) is a foundation for large-scale sequencing projects. However, recent studies have suggested that GRCh38 may be incomplete and give a suboptimal representation of specific population groups. Here, we performed a de novo assembly of two Swedish genomes that revealed over 10 Mb of sequences absent from the human GRCh38 reference in each individual. Around 6 Mb of these novel sequences (NS) are shared with a Chinese personal genome. The NS are highly repetitive, have an elevated GC-content, and are primarily located in centromeric or telomeric regions. Up to 1 Mb of NS can be assigned to chromosome Y, and large segments are also missing from GRCh38 at chromosomes 14, 17, and 21. Inclusion of NS into the GRCh38 reference radically improves the alignment and variant calling from short-read whole-genome sequencing data at several genomic loci. A re-analysis of a Swedish population-scale sequencing project yields > 75,000 putative novel single nucleotide variants (SNVs) and removes > 10,000 false positive SNV calls per individual, some of which are located in protein coding regions. Our results highlight that the GRCh38 reference is not yet complete and demonstrate that personal genome assemblies from local populations can improve the analysis of short-read whole-genome sequencing data.
format Online
Article
Text
id pubmed-6210158
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-6210158 2018-11-02 Genes (Basel) Article MDPI 2018-10-09 /pmc/articles/PMC6210158/ /pubmed/30304863 http://dx.doi.org/10.3390/genes9100486 Text en © 2018 by the authors. Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title De Novo Assembly of Two Swedish Genomes Reveals Missing Segments from the Human GRCh38 Reference and Improves Variant Calling of Population-Scale Sequencing Data
topic Article