
CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure

With the exponential growth of sequence information stored over the last decade, including that of de novo assembled contigs from RNA-Seq experiments, quantification of chimeric sequences has become essential when assembling read data. In transcriptomics, de novo assembled chimeras can closely resem...


Bibliographic Details
Main Authors:	Linheiro, Raquel; Archer, John
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2021
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8651127/ https://www.ncbi.nlm.nih.gov/pubmed/34813594 http://dx.doi.org/10.1371/journal.pcbi.1009631

_version_	1784611345441751040
author	Linheiro, Raquel Archer, John
author_facet	Linheiro, Raquel Archer, John
author_sort	Linheiro, Raquel
collection	PubMed
description	With the exponential growth of sequence information stored over the last decade, including that of de novo assembled contigs from RNA-Seq experiments, quantification of chimeric sequences has become essential when assembling read data. In transcriptomics, de novo assembled chimeras can closely resemble underlying transcripts, but patterns such as those seen between co-evolving sites, or mapped read counts, become obscured. We have created a de Bruijn graph-based de novo assembler for RNA-Seq data that utilizes a classification system to describe the complexity of the underlying graphs from which contigs are created. Each contig is labelled with one of three levels, indicating whether or not ambiguous paths exist. A by-product of this is information on the range of complexity of the underlying gene families present. As a demonstration of CStone's ability to assemble high-quality contigs, and to label them in this manner, both simulated and real data were used. For simulated data, ten million read pairs were generated from cDNA libraries representing four species: Drosophila melanogaster, Panthera pardus, Rattus norvegicus and Serinus canaria. These were assembled using CStone, Trinity and rnaSPAdes, the latter two being high-quality, well-established de novo assemblers. For real data, two RNA-Seq datasets, each consisting of ≈30 million read pairs and representing two adult D. melanogaster whole-body samples, were used. The contigs that CStone produced were comparable in quality to those of Trinity and rnaSPAdes in terms of length, sequence identity of aligned regions and the range of cDNA transcripts represented, whilst providing additional information on chimerism. Here we describe the details of CStone's assembly and classification process, and propose that similar classification systems can be incorporated into other de novo assembly tools. Within a related side study, we explore the effects that chimeras within reference sets have on the identification of differentially expressed genes. CStone is available at: https://sourceforge.net/projects/cstone/.
format	Online Article Text
id	pubmed-8651127
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-8651127 2021-12-08 CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure Linheiro, Raquel Archer, John PLoS Comput Biol Research Article Public Library of Science 2021-11-23 /pmc/articles/PMC8651127/ /pubmed/34813594 http://dx.doi.org/10.1371/journal.pcbi.1009631 Text en © 2021 Linheiro, Archer https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Linheiro, Raquel Archer, John CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure
title	CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure
title_full	CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure
title_fullStr	CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure
title_full_unstemmed	CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure
title_short	CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure
title_sort	cstone: a de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8651127/ https://www.ncbi.nlm.nih.gov/pubmed/34813594 http://dx.doi.org/10.1371/journal.pcbi.1009631
work_keys_str_mv	AT linheiroraquel cstoneadenovotranscriptomeassemblerforshortreaddatathatidentifiesnonchimericcontigsbasedonunderlyinggraphstructure AT archerjohn cstoneadenovotranscriptomeassemblerforshortreaddatathatidentifiesnonchimericcontigsbasedonunderlyinggraphstructure

