EasyCluster2: an improved tool for clustering and assembling long transcriptome reads

BACKGROUND: Expressed sequences (e.g. ESTs) are a strong source of evidence to improve gene structures and predict reliable alternative splicing events. When a genome assembly is available, ESTs are suitable to generate gene-oriented clusters through the well-established EasyCluster software. Nowada...


Bibliographic Details
Main authors: Bevilacqua, Vitoantonio, Pietroleonardo, Nicola, Giannino, Ely Ignazio, Stroppa, Fabio, Simone, Domenico, Pesole, Graziano, Picardi, Ernesto
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4271567/
https://www.ncbi.nlm.nih.gov/pubmed/25474441
http://dx.doi.org/10.1186/1471-2105-15-S15-S7
author Bevilacqua, Vitoantonio
Pietroleonardo, Nicola
Giannino, Ely Ignazio
Stroppa, Fabio
Simone, Domenico
Pesole, Graziano
Picardi, Ernesto
collection PubMed
description BACKGROUND: Expressed sequences (e.g. ESTs) are a strong source of evidence to improve gene structures and predict reliable alternative splicing events. When a genome assembly is available, ESTs are suitable to generate gene-oriented clusters through the well-established EasyCluster software. Nowadays, EST-like sequences can be massively produced using Next Generation Sequencing (NGS) technologies. In order to handle genome-scale transcriptome data, we present here EasyCluster2, a reimplementation of EasyCluster able to speed up the creation of gene-oriented clusters and facilitate downstream analyses such as the assembly of full-length transcripts and the detection of splicing isoforms.
RESULTS: EasyCluster2 has been developed to facilitate the genome-based clustering of EST-like sequences generated through the NGS 454 technology. Reads mapped onto the reference genome can be uploaded using the standard GFF3 file format. Alignment parsing is initially performed to produce a first collection of pseudo-clusters by grouping reads according to the overlap of their genomic coordinates on the same strand. EasyCluster2 then refines read grouping by including in each cluster only reads sharing at least one splice site, and optionally performs a Smith-Waterman alignment in the region surrounding splice sites in order to correct for potential alignment errors. In addition, EasyCluster2 can include unspliced reads, which generally account for >50% of 454 datasets, and collapses overlapping clusters. Finally, EasyCluster2 can assemble full-length transcripts using a Directed-Acyclic-Graph-based strategy, simplifying the identification of alternative splicing isoforms, aided by an implementation of the widely used AStalavista methodology. Accuracy and performance have been tested on real as well as simulated datasets.
CONCLUSIONS: EasyCluster2 represents a unique tool to cluster and assemble transcriptome reads produced with 454 technology, as well as ESTs and full-length transcripts. The clustering procedure is enhanced by the use of genome annotations and unspliced reads. Overall, EasyCluster2 detects splicing isoforms effectively, since it can refine exon-exon junctions and explore alternative splicing without known reference transcripts. Results in GFF3 format can be browsed in the UCSC Genome Browser. Therefore, EasyCluster2 is a powerful tool to generate reliable clusters for gene expression studies, and it makes the analysis accessible to researchers not skilled in bioinformatics.
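The two clustering steps named in the abstract above (pseudo-clusters from overlapping genomic coordinates on the same strand, then refinement by shared splice sites) can be illustrated with a minimal Python sketch. This is not the EasyCluster2 code: the `Read` type, the function names, and the choice to link reads transitively through shared splice sites are assumptions made for illustration, and unspliced reads are simply left out of the refined groups rather than handled as the tool's optional step would.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical aligned read; exons are sorted (start, end) pairs, 1-based inclusive."""
    name: str
    chrom: str
    strand: str
    exons: tuple


def span(read):
    """Genomic start and end of the whole alignment."""
    return read.exons[0][0], read.exons[-1][1]


def splice_sites(read):
    """Positions flanking each intron (end of one exon, start of the next)."""
    sites = set()
    for (_, exon_end), (next_start, _) in zip(read.exons, read.exons[1:]):
        sites.add(exon_end)
        sites.add(next_start)
    return sites


def pseudo_clusters(reads):
    """Group reads whose spans overlap on the same chromosome and strand."""
    by_locus = defaultdict(list)
    for r in reads:
        by_locus[(r.chrom, r.strand)].append(r)
    clusters = []
    for key in sorted(by_locus):
        group = sorted(by_locus[key], key=span)
        current, current_end = [group[0]], span(group[0])[1]
        for r in group[1:]:
            start, end = span(r)
            if start <= current_end:      # overlaps the running interval
                current.append(r)
                current_end = max(current_end, end)
            else:                         # gap: close the cluster, start a new one
                clusters.append(current)
                current, current_end = [r], end
        clusters.append(current)
    return clusters


def refine_by_splice_site(cluster):
    """Split a pseudo-cluster into groups of spliced reads linked by a shared splice site.

    Assumption: linking is transitive (union-find); unspliced reads are skipped.
    """
    parent = {r.name: r.name for r in cluster if len(r.exons) > 1}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}
    for r in cluster:
        if r.name not in parent:
            continue
        for site in splice_sites(r):
            if site in first_seen:
                parent[find(r.name)] = find(first_seen[site])
            else:
                first_seen[site] = r.name

    groups = defaultdict(list)
    for r in cluster:
        if r.name in parent:
            groups[find(r.name)].append(r.name)
    return sorted(sorted(names) for names in groups.values())
```

In this sketch a read that overlaps a cluster only by coordinates, without sharing any intron boundary with its members, ends up in its own group after refinement, which is the distinction the abstract draws between pseudo-clusters and refined clusters.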
format Online
Article
Text
id pubmed-4271567
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4271567 2015-01-02 EasyCluster2: an improved tool for clustering and assembling long transcriptome reads BMC Bioinformatics Proceedings BioMed Central 2014-12-03 /pmc/articles/PMC4271567/ /pubmed/25474441 http://dx.doi.org/10.1186/1471-2105-15-S15-S7 Text en Copyright © 2014 Bevilacqua et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title EasyCluster2: an improved tool for clustering and assembling long transcriptome reads
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4271567/
https://www.ncbi.nlm.nih.gov/pubmed/25474441
http://dx.doi.org/10.1186/1471-2105-15-S15-S7