
Gaps and complex structurally variant loci in phased genome assemblies

There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still generates more than 140 gaps. We perform a detailed analysis of gaps, asse...


Bibliographic Details
Main Authors: Porubsky, David, Vollger, Mitchell R., Harvey, William T., Rozanski, Allison N., Ebert, Peter, Hickey, Glenn, Hasenfeld, Patrick, Sanders, Ashley D., Stober, Catherine, Korbel, Jan O., Paten, Benedict, Marschall, Tobias, Eichler, Evan E.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10234299/
https://www.ncbi.nlm.nih.gov/pubmed/37164484
http://dx.doi.org/10.1101/gr.277334.122
collection PubMed
description There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still generates more than 140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 182 haploid assemblies obtained from a diversity panel of 77 unique human samples. Although trio-based approaches using HiFi are the current gold standard, chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. Importantly, the majority of assembly gaps cluster near the largest and most identical repeats (including segmental duplications [35.4%], satellite DNA [22.3%], or regions enriched in GA/AT-rich DNA [27.4%]). Consequently, 1513 protein-coding genes overlap assembly gaps in at least one haplotype, and 231 are recurrently disrupted or missing from five or more haplotypes. Furthermore, we estimate that 6–7 Mbp of DNA are misorientated per haplotype irrespective of whether trio-free or trio-based approaches are used. Of these misorientations, 81% correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large segmental duplications. We also identify large-scale alignment discontinuities consistent with 11.9 Mbp of deletions and 161.4 Mbp of insertions per haploid genome. Although 99% of this variation corresponds to satellite DNA, we identify 230 regions of euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Such variable and incompletely assembled regions are important targets for future algorithmic development and pangenome representation.
id pubmed-10234299
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-10234299, 2023-06-02. Genome Res (Research), 2023-04. Text en. © 2023 Porubsky et al.; Published by Cold Spring Harbor Laboratory Press. This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.
topic Research