NovoGraph: Human genome graph construction from multiple long-read de novo assemblies

Genome graphs are emerging as an important novel approach to the analysis of high-throughput human sequencing data. By explicitly representing genetic variants and alternative haplotypes in a mappable data structure, they can enable the improved analysis of structurally variable and hyperpolymorphic...

Bibliographic Details
Main authors: Biederstedt, Evan; Oliver, Jeffrey C.; Hansen, Nancy F.; Jajoo, Aarti; Dunn, Nathan; Olson, Andrew; Busby, Ben; Dilthey, Alexander T.
Format: Online Article Text
Language: English
Published: F1000 Research Limited 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6305223/
https://www.ncbi.nlm.nih.gov/pubmed/30613392
http://dx.doi.org/10.12688/f1000research.15895.2
_version_ 1783382517435334656
author Biederstedt, Evan
Oliver, Jeffrey C.
Hansen, Nancy F.
Jajoo, Aarti
Dunn, Nathan
Olson, Andrew
Busby, Ben
Dilthey, Alexander T.
author_facet Biederstedt, Evan
Oliver, Jeffrey C.
Hansen, Nancy F.
Jajoo, Aarti
Dunn, Nathan
Olson, Andrew
Busby, Ben
Dilthey, Alexander T.
author_sort Biederstedt, Evan
collection PubMed
description Genome graphs are emerging as an important novel approach to the analysis of high-throughput human sequencing data. By explicitly representing genetic variants and alternative haplotypes in a mappable data structure, they can enable the improved analysis of structurally variable and hyperpolymorphic regions of the genome. In most existing approaches, graphs are constructed from variant call sets derived from short-read sequencing. As long-read sequencing becomes more cost-effective and enables de novo assembly for increasing numbers of whole genomes, a method for the direct construction of a genome graph from sets of assembled human genomes would be desirable. Such assembly-based genome graphs would encompass the wide spectrum of genetic variation accessible to long-read-based de novo assembly, including large structural variants and divergent haplotypes. Here we present NovoGraph, a method for the construction of a human genome graph directly from a set of de novo assemblies. NovoGraph constructs a genome-wide multiple sequence alignment of all input contigs and creates a graph by merging the input sequences at positions that are both homologous and sequence-identical. NovoGraph outputs resulting graphs in VCF format that can be loaded into third-party genome graph toolkits. To demonstrate NovoGraph, we construct a genome graph with 23,478,835 variant sites and 30,582,795 variant alleles from de novo assemblies of seven ethnically diverse human genomes (AK1, CHM1, CHM13, HG003, HG004, HX1, NA19240). Initial evaluations show that mapping against the constructed graph reduces the average mismatch rate of reads from sample NA12878 by approximately 0.2%, albeit at a slightly increased rate of reads that remain unmapped.
format Online
Article
Text
id pubmed-6305223
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher F1000 Research Limited
record_format MEDLINE/PubMed
spelling pubmed-6305223 2019-01-03 NovoGraph: Human genome graph construction from multiple long-read de novo assemblies Biederstedt, Evan Oliver, Jeffrey C. Hansen, Nancy F. Jajoo, Aarti Dunn, Nathan Olson, Andrew Busby, Ben Dilthey, Alexander T. F1000Res Software Tool Article F1000 Research Limited 2018-12-10 /pmc/articles/PMC6305223/ /pubmed/30613392 http://dx.doi.org/10.12688/f1000research.15895.2 Text en Copyright: © 2018 Biederstedt E et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The author(s) is/are employees of the US Government and therefore domestic copyright protection in USA does not apply to this work. The work may be protected under the copyright laws of other jurisdictions when used in those jurisdictions.
spellingShingle Software Tool Article
Biederstedt, Evan
Oliver, Jeffrey C.
Hansen, Nancy F.
Jajoo, Aarti
Dunn, Nathan
Olson, Andrew
Busby, Ben
Dilthey, Alexander T.
NovoGraph: Human genome graph construction from multiple long-read de novo assemblies
title NovoGraph: Human genome graph construction from multiple long-read de novo assemblies
title_sort novograph: human genome graph construction from multiple long-read de novo assemblies
topic Software Tool Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6305223/
https://www.ncbi.nlm.nih.gov/pubmed/30613392
http://dx.doi.org/10.12688/f1000research.15895.2
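
The abstract describes building a graph by merging aligned input sequences at positions that are both homologous and sequence-identical. The toy sketch below illustrates that idea on a tiny multiple sequence alignment; it is not NovoGraph's implementation, and the names `build_graph`, `variant_columns`, and the example alignment are hypothetical. It assumes that "homologous" means "same alignment column" and that `-` marks a gap.

```python
from collections import defaultdict

GAP = "-"  # assumed gap symbol in the toy alignment


def build_graph(msa):
    """Merge aligned sequences column by column.

    A node is (column, base). Rows that carry the same base in the same
    column share one node. Edges follow each row's path and skip gaps.
    """
    if not msa or len({len(row) for row in msa}) != 1:
        raise ValueError("MSA must be non-empty with equal-length rows")
    nodes = set()
    edges = defaultdict(set)
    for row in msa:
        prev = None
        for col, base in enumerate(row):
            if base == GAP:
                continue  # a gap contributes no node to this row's path
            node = (col, base)
            nodes.add(node)
            if prev is not None:
                edges[prev].add(node)
            prev = node
    return nodes, dict(edges)


def variant_columns(msa):
    """Columns where the rows disagree (a gap counts as an allele)."""
    return [col for col, chars in enumerate(zip(*msa)) if len(set(chars)) > 1]


# Hypothetical three-row alignment: one substitution and one insertion.
msa = [
    "ACGT-A",
    "ACCT-A",
    "ACGTTA",
]
nodes, edges = build_graph(msa)
sites = variant_columns(msa)
```

In this toy input the rows share nodes wherever they agree and branch only at the substitution and the insertion. A real pipeline would work on a genome-wide alignment of contigs and, as the abstract notes, emit the result in VCF format for third-party graph toolkits.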