
Inferring bona fide transfrags in RNA-Seq derived-transcriptome assemblies of non-model organisms

BACKGROUND: De novo transcriptome assembly of short transcribed fragments (transfrags) produced from sequencing-by-synthesis technologies often results in redundant datasets with differing levels of unassembled, partially assembled or mis-assembled transcripts. Post-assembly processing intended to r...


Bibliographic Details
Main Authors:	Mbandi, Stanley Kimbung; Hesse, Uljana; van Heusden, Peter; Christoffels, Alan
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4344733/ https://www.ncbi.nlm.nih.gov/pubmed/25880035 http://dx.doi.org/10.1186/s12859-015-0492-5

author	Mbandi, Stanley Kimbung Hesse, Uljana van Heusden, Peter Christoffels, Alan
collection	PubMed
description	BACKGROUND: De novo transcriptome assembly of short transcribed fragments (transfrags) produced from sequencing-by-synthesis technologies often results in redundant datasets with differing levels of unassembled, partially assembled or mis-assembled transcripts. Post-assembly processing intended to reduce redundancy typically involves reassembly or clustering of assembled sequences. However, these approaches are mostly based on common word heuristics and often create clusters of biologically unrelated sequences, resulting in loss of unique transfrags annotations and propagation of mis-assemblies. RESULTS: Here, we propose a structured framework that consists of a few steps in pipeline architecture for Inferring Functionally Relevant Assembly-derived Transcripts (IFRAT). IFRAT combines 1) removal of identical subsequences, 2) error tolerant CDS prediction, 3) identification of coding potential, and 4) complements BLAST with a multiple domain architecture annotation that reduces non-specific domain annotation. We demonstrate that independent of the assembler, IFRAT selects bona fide transfrags (with CDS and coding potential) from the transcriptome assembly of a model organism without relying on post-assembly clustering or reassembly. The robustness of IFRAT is inferred on RNA-Seq data of Neurospora crassa assembled using de Bruijn graph-based assemblers, in single (Trinity and Oases-25) and multiple (Oases-Merge and additive or pooled) k-mer modes. Single k-mer assemblies contained fewer transfrags compared to the multiple k-mer assemblies. However, Trinity identified a comparable number of predicted coding sequence and gene loci to Oases pooled assembly. IFRAT selects bona fide transfrags representing over 94% of cumulative BLAST-derived functional annotations of the unfiltered assemblies. 
Between 4% and 6% are lost when orphan transfrags are excluded, and this represents only a tiny fraction of the annotation derived from functional transference by sequence similarity. The median length of bona fide transfrags ranged from 1.5 kb (Trinity) to 2 kb (Oases), which is consistent with the average coding sequence length in fungi. The fraction of transfrags that could be associated with gene ontology terms ranged from 33% to 50%, which is also high for domain-based annotation. We showed that unselected transfrags were mostly truncated and represented sequences from intronic and untranslated (5′ and 3′) regions and non-coding gene loci. CONCLUSIONS: IFRAT simplifies post-assembly processing, providing a reference transcriptome enriched with functionally relevant assembly-derived transcripts for non-model organisms. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0492-5) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4344733
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4344733 2015-03-01 Inferring bona fide transfrags in RNA-Seq derived-transcriptome assemblies of non-model organisms Mbandi, Stanley Kimbung Hesse, Uljana van Heusden, Peter Christoffels, Alan BMC Bioinformatics Methodology Article BioMed Central 2015-02-21 /pmc/articles/PMC4344733/ /pubmed/25880035 http://dx.doi.org/10.1186/s12859-015-0492-5 Text en © Mbandi et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Inferring bona fide transfrags in RNA-Seq derived-transcriptome assemblies of non-model organisms
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4344733/ https://www.ncbi.nlm.nih.gov/pubmed/25880035 http://dx.doi.org/10.1186/s12859-015-0492-5

Inferring bona fide transfrags in RNA-Seq derived-transcriptome assemblies of non-model organisms
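
The description in this record names "removal of identical subsequences" as the first IFRAT step. The following is a minimal Python sketch of that general idea, not the authors' implementation: the function name `remove_contained`, the in-memory dictionary input, and the choice to compare the forward strand only (no reverse complements) are all assumptions made for illustration.

```python
def remove_contained(transfrags):
    """Drop every sequence that occurs verbatim inside a longer kept sequence.

    `transfrags` maps an identifier to a nucleotide string (hypothetical
    input format). Exact duplicates collapse to the identifier that sorts
    first. Returned sequences are upper-cased.
    """
    # Visit the longest sequences first, so a containing sequence is always
    # kept before any sequence it contains is examined.
    ordered = sorted(transfrags.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    kept = {}
    for name, seq in ordered:
        seq = seq.upper()
        # Forward-strand exact match only; reverse complements are ignored
        # in this sketch (an assumption, not a statement about IFRAT).
        if any(seq in longer for longer in kept.values()):
            continue
        kept[name] = seq
    return kept


assembly = {
    "t1": "ATGGCCAAGTTTGA",
    "t2": "GCCAAG",          # contained in t1
    "t3": "ATGGCCAAGTTTGA",  # exact duplicate of t1
    "t4": "TTTTCCCC",        # unrelated sequence
}
non_redundant = remove_contained(assembly)
```

The pairwise substring scan is quadratic in the number of sequences, which is fine for a toy example but would need an index-based approach for a real assembly.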
