
Efficient detection and assembly of non-reference DNA sequences with synthetic long reads

Recent pan-genome studies have revealed an abundance of DNA sequences in human genomes that are not present in the reference genome. A lion’s share of these non-reference sequences (NRSs) cannot be reliably assembled or placed on the reference genome. Improvements in long-read and synthetic long-rea...

Bibliographic Details
Main authors: Meleshko, Dmitry; Yang, Rui; Marks, Patrick; Williams, Stephen; Hajirasouliha, Iman
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects: Methods Online
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9561269/
https://www.ncbi.nlm.nih.gov/pubmed/35924489
http://dx.doi.org/10.1093/nar/gkac653
collection PubMed
description Recent pan-genome studies have revealed an abundance of DNA sequences in human genomes that are not present in the reference genome. A lion’s share of these non-reference sequences (NRSs) cannot be reliably assembled or placed on the reference genome. Improvements in long-read and synthetic long-read (aka linked-read) technologies have great potential for the characterization of NRSs. While synthetic long reads require less input DNA than long-read datasets, they are algorithmically more challenging to use. Except for computationally expensive whole-genome assembly methods, there is no synthetic long-read method for NRS detection. We propose a novel integrated alignment-based and local assembly-based algorithm, Novel-X, that uses the barcode information encoded in synthetic long reads to improve the detection of such events without a whole-genome de novo assembly. Our evaluations demonstrate that Novel-X finds many non-reference sequences that cannot be found by state-of-the-art short-read methods. We applied Novel-X to a diverse set of 68 samples from the Polaris HiSeq 4000 PGx cohort. Novel-X discovered 16 691 NRS insertions of size > 300 bp (total length 18.2 Mb). Many of them are population specific or may have a functional impact.
id pubmed-9561269
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-9561269 2022-10-18 Nucleic Acids Res Methods Online Oxford University Press 2022-08-04 /pmc/articles/PMC9561269/ /pubmed/35924489 http://dx.doi.org/10.1093/nar/gkac653 Text en © The Author(s) 2022. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Methods Online