
Long-read sequence and assembly of segmental duplications

Bibliographic Details
Main Authors: Vollger, Mitchell R., Dishuck, Philip C., Sorensen, Melanie, Welch, AnneMarie E., Dang, Vy, Dougherty, Max L., Graves-Lindsay, Tina A., Wilson, Richard K., Chaisson, Mark J. P., Eichler, Evan E.
Format: Online Article Text
Language: English
Published: 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6382464/
https://www.ncbi.nlm.nih.gov/pubmed/30559433
http://dx.doi.org/10.1038/s41592-018-0236-3
Description
Summary: We developed a computational method based on polyploid phasing of long sequence reads to resolve collapsed regions of segmental duplications within genome assemblies. The approach, Segmental Duplication Assembler (SDA), constructs graphs in which paralogous sequence variants define the nodes and long-read sequences provide attraction and repulsion edges, allowing us to partition and assemble the long reads that correspond to distinct paralogs. We apply it to single-molecule, real-time sequence data from three human genomes and recover 33–79 Mbp of duplications, where approximately half of the loci are diverged (<99.8% sequence identity) when compared to the reference genome. We show that the corresponding sequence is highly accurate (>99.9%) and that the diverged sequence corresponds to copy number variable paralogs that are absent from the human reference. Our method can be applied to other complex genomes to resolve the last gene-rich gaps, improve duplicate gene annotation, and better understand copy number variant genetic diversity at the base-pair level.
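
The graph idea in the summary can be illustrated with a toy sketch. This is not the SDA implementation: the function names, the `min_support` threshold, the greedy pivot clustering, and the input data are all assumptions made for illustration, and the sketch further assumes that every read spans every paralogous sequence variant (PSV) site. PSVs are nodes, a pair of PSVs gets an attraction edge when reads tend to carry both, and a repulsion edge when reads tend to carry only one of the two. Reads are then assigned to the PSV group they overlap most.

```python
from itertools import combinations

def build_edges(reads, min_support=2):
    """reads maps a read id to the set of PSV ids that read carries.
    Returns the sorted PSV list and a dict (psv_a, psv_b) -> +1 or -1.
    Assumption: every read spans every PSV site."""
    psvs = sorted(set().union(*reads.values()))
    edges = {}
    for a, b in combinations(psvs, 2):
        together = sum(1 for s in reads.values() if a in s and b in s)
        apart = sum(1 for s in reads.values() if (a in s) != (b in s))
        if together >= min_support and together >= apart:
            edges[(a, b)] = +1   # attraction: PSVs co-occur on reads
        elif apart >= min_support:
            edges[(a, b)] = -1   # repulsion: reads carry one but not the other
    return psvs, edges

def pivot_cluster(psvs, edges):
    """Greedy pivot clustering (a simple stand-in, chosen for brevity):
    each pivot collects the unassigned PSVs it attracts."""
    unassigned = list(psvs)
    clusters = []
    while unassigned:
        pivot = unassigned.pop(0)
        group = {pivot}
        for p in list(unassigned):
            key = (pivot, p) if (pivot, p) in edges else (p, pivot)
            if edges.get(key) == 1:
                group.add(p)
                unassigned.remove(p)
        clusters.append(group)
    return clusters

def assign_reads(reads, clusters):
    """Assign each read to the cluster sharing the most PSVs with it,
    or None if it shares none."""
    assignment = {}
    for rid, s in reads.items():
        overlaps = [len(s & c) for c in clusters]
        best = max(range(len(clusters)), key=lambda i: overlaps[i])
        assignment[rid] = best if overlaps[best] > 0 else None
    return assignment

# Hypothetical input: five reads drawn from two collapsed paralogs.
reads = {
    "r1": {"p1", "p2"},
    "r2": {"p1", "p2"},
    "r3": {"p3", "p4"},
    "r4": {"p3", "p4"},
    "r5": {"p1", "p2"},
}
psvs, edges = build_edges(reads)
clusters = pivot_cluster(psvs, edges)
assignment = assign_reads(reads, clusters)
print(clusters)
print(assignment)
```

In this toy input the reads fall into two groups, one per paralog, and each group could then be assembled separately. Real long-read data would also need handling for sequencing errors and for reads that cover only some of the PSV sites, which this sketch ignores.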