Informed kmer selection for de novo transcriptome assembly

Motivation: De novo transcriptome assembly is an integral part of many RNA-seq workflows. Common applications include sequencing of non-model organisms, cancer or meta transcriptomes. Most de novo transcriptome assemblers use the de Bruijn graph (DBG) as the underlying data structure. The quality o...

Bibliographic Details
Main Authors: Durai, Dilip A., Schulz, Marcel H.
Format: Online Article Text
Language: English
Published: Oxford University Press 2016
Subjects: Hitseq Papers
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4892416/
https://www.ncbi.nlm.nih.gov/pubmed/27153653
http://dx.doi.org/10.1093/bioinformatics/btw217
author Durai, Dilip A.
Schulz, Marcel H.
collection PubMed
description Motivation: De novo transcriptome assembly is an integral part of many RNA-seq workflows. Common applications include sequencing of non-model organisms, cancer or meta transcriptomes. Most de novo transcriptome assemblers use the de Bruijn graph (DBG) as the underlying data structure. The quality of the assemblies produced by such assemblers is highly influenced by the exact word length k. As such, no single kmer value leads to optimal results. Instead, DBGs over different kmer values are built and the assemblies are merged to improve sensitivity. However, no studies have thoroughly investigated the problem of automatically learning at which kmer value to stop the assembly. Instead, a suboptimal selection of kmer values is often used in practice. Results: Here we investigate the contribution of a single kmer value in a multi-kmer based assembly approach. We find that a comparative clustering of related assemblies can be used to estimate the importance of an additional kmer assembly. Using a model-fit-based algorithm, we predict the kmer value at which no further assemblies are necessary. Our approach is tested with different de novo assemblers for datasets with different coverage values and read lengths. Further, we suggest a simple post-processing step that significantly improves the quality of multi-kmer assemblies. Conclusion: We provide an automatic method for limiting the number of kmer values without a significant loss in assembly quality but with savings in assembly time. This is a step forward to making multi-kmer methods more reliable and easier to use. Availability and Implementation: A general implementation of our approach can be found under: https://github.com/SchulzLab/KREATION. Supplementary information: Supplementary data are available at Bioinformatics online. Contact: mschulz@mmci.uni-saarland.de
format Online
Article
Text
id pubmed-4892416
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4892416 2016-06-07 Informed kmer selection for de novo transcriptome assembly Durai, Dilip A. Schulz, Marcel H. Bioinformatics Hitseq Papers Oxford University Press 2016-06-01 2016-04-28 /pmc/articles/PMC4892416/ /pubmed/27153653 http://dx.doi.org/10.1093/bioinformatics/btw217 Text en © The Author 2016. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Informed kmer selection for de novo transcriptome assembly
topic Hitseq Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4892416/
https://www.ncbi.nlm.nih.gov/pubmed/27153653
http://dx.doi.org/10.1093/bioinformatics/btw217