Strand-seq enables reliable separation of long reads by chromosome via expectation maximization

MOTIVATION: Current sequencing technologies are able to produce reads orders of magnitude longer than ever possible before. Such long reads have sparked a new interest in de novo genome assembly, which removes reference biases inherent to re-sequencing approaches and allows for a direct characteriza...

Bibliographic Details
Main Authors: Ghareghani, Maryam; Porubskỳ, David; Sanders, Ashley D; Meiers, Sascha; Eichler, Evan E; Korbel, Jan O; Marschall, Tobias
Format: Online Article Text
Language: English
Published: Oxford University Press 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6022540/
https://www.ncbi.nlm.nih.gov/pubmed/29949971
http://dx.doi.org/10.1093/bioinformatics/bty290
collection PubMed
description MOTIVATION: Current sequencing technologies are able to produce reads orders of magnitude longer than ever possible before. Such long reads have sparked a new interest in de novo genome assembly, which removes reference biases inherent to re-sequencing approaches and allows for a direct characterization of complex genomic variants. However, even with the latest algorithmic advances, assembling a mammalian genome from long error-prone reads incurs a significant computational burden and does not preclude occasional misassemblies. Both problems could potentially be mitigated if assembly could commence for each chromosome separately. RESULTS: We show how single-cell template strand sequencing (Strand-seq) data can be leveraged for this purpose. We introduce a novel latent variable model and a corresponding Expectation Maximization algorithm, termed SaaRclust, and demonstrate its ability to reliably cluster long reads by chromosome. For each long read, this approach produces a posterior probability distribution over all chromosomes of origin and read directionalities. In this way, it makes it possible to assess the amount of uncertainty inherent to sparse Strand-seq data at the level of individual reads. Among the reads that our algorithm confidently assigns to a chromosome, we observed more than 99% correct assignments on a subset of Pacific Biosciences reads with 30.1× coverage. To our knowledge, SaaRclust is the first approach for the in silico separation of long reads by chromosome prior to assembly. AVAILABILITY AND IMPLEMENTATION: https://github.com/daewoooo/SaaRclust
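The abstract describes soft clustering of long reads with Expectation Maximization, yielding a posterior distribution per read. The sketch below is only a toy illustration of that general idea and is not the SaaRclust model: it assumes a plain mixture of binomials in which each cluster has one Watson-strand probability per single cell, and it ignores read directionality. The function name `em_cluster`, the `counts` layout, and the parameters `init_p` and `pseudo` are hypothetical and were chosen for this example.

```python
import math
import random


def em_cluster(counts, n_clusters, n_iter=50, init_p=None, seed=0, pseudo=1.0):
    """Soft-cluster reads with EM on a toy mixture of binomials.

    counts[i][j] = (w, c): Watson / Crick short-read counts that single
    cell j contributes to long read i (hypothetical input layout).
    Returns (post, pi, p), where post[i][k] = P(cluster k | read i).
    """
    n_reads, n_cells = len(counts), len(counts[0])
    rng = random.Random(seed)
    # p[k][j]: probability that a count from cell j is Watson in cluster k.
    # Assumption: any user-supplied init_p lies strictly inside (0, 1).
    if init_p is None:
        init_p = [[rng.uniform(0.3, 0.7) for _ in range(n_cells)]
                  for _ in range(n_clusters)]
    p = [row[:] for row in init_p]
    pi = [1.0 / n_clusters] * n_clusters  # cluster priors
    post = []
    for _ in range(n_iter):
        # E-step: posterior over clusters for every read, in log space.
        # The binomial coefficient is the same for all clusters and cancels.
        post = []
        for read in counts:
            logs = []
            for k in range(n_clusters):
                ll = math.log(pi[k])
                for j, (w, c) in enumerate(read):
                    ll += w * math.log(p[k][j]) + c * math.log(1.0 - p[k][j])
                logs.append(ll)
            top = max(logs)
            weights = [math.exp(v - top) for v in logs]
            total = sum(weights)
            post.append([v / total for v in weights])
        # M-step: re-estimate priors and per-cell Watson probabilities.
        # The pseudocount keeps every parameter strictly inside (0, 1).
        for k in range(n_clusters):
            mass = sum(r[k] for r in post)
            pi[k] = (mass + pseudo) / (n_reads + n_clusters * pseudo)
            for j in range(n_cells):
                num = sum(r[k] * read[j][0] for r, read in zip(post, counts))
                den = sum(r[k] * (read[j][0] + read[j][1])
                          for r, read in zip(post, counts))
                p[k][j] = (num + pseudo) / (den + 2.0 * pseudo)
    return post, pi, p
```

Calling `em_cluster(counts, n_clusters=2)` on reads whose strand patterns differ across cells returns one posterior row per read. Reads whose largest posterior stays low could be left unassigned, in the spirit of the "confidently assigns" wording above, although the threshold for that would be a separate choice.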
format Online
Article
Text
id pubmed-6022540
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6022540 2018-07-10 Strand-seq enables reliable separation of long reads by chromosome via expectation maximization
Bioinformatics
Ismb 2018–Intelligent Systems for Molecular Biology Proceedings
Oxford University Press 2018-07-01 2018-06-27 /pmc/articles/PMC6022540/ /pubmed/29949971 http://dx.doi.org/10.1093/bioinformatics/bty290 Text en
© The Author(s) 2018. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Strand-seq enables reliable separation of long reads by chromosome via expectation maximization
topic Ismb 2018–Intelligent Systems for Molecular Biology Proceedings