EasyCluster2: an improved tool for clustering and assembling long transcriptome reads
BACKGROUND: Expressed sequences (e.g. ESTs) are a strong source of evidence to improve gene structures and predict reliable alternative splicing events. When a genome assembly is available, ESTs are suitable to generate gene-oriented clusters through the well-established EasyCluster software. Nowada...
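Main authors: | Bevilacqua, Vitoantonio; Pietroleonardo, Nicola; Giannino, Ely Ignazio; Stroppa, Fabio; Simone, Domenico; Pesole, Graziano; Picardi, Ernesto |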
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | Proceedings |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4271567/ https://www.ncbi.nlm.nih.gov/pubmed/25474441 http://dx.doi.org/10.1186/1471-2105-15-S15-S7 |
_version_ | 1782349629125296128 |
---|---|
author | Bevilacqua, Vitoantonio; Pietroleonardo, Nicola; Giannino, Ely Ignazio; Stroppa, Fabio; Simone, Domenico; Pesole, Graziano; Picardi, Ernesto |
author_facet | Bevilacqua, Vitoantonio; Pietroleonardo, Nicola; Giannino, Ely Ignazio; Stroppa, Fabio; Simone, Domenico; Pesole, Graziano; Picardi, Ernesto |
author_sort | Bevilacqua, Vitoantonio |
collection | PubMed |
description | BACKGROUND: Expressed sequences (e.g. ESTs) are a strong source of evidence to improve gene structures and predict reliable alternative splicing events. When a genome assembly is available, ESTs are suitable to generate gene-oriented clusters through the well-established EasyCluster software. Nowadays, EST-like sequences can be massively produced using Next Generation Sequencing (NGS) technologies. In order to handle genome-scale transcriptome data, we present here EasyCluster2, a reimplementation of EasyCluster able to speed up the creation of gene-oriented clusters and facilitate downstream analyses as the assembly of full-length transcripts and the detection of splicing isoforms. RESULTS: EasyCluster2 has been developed to facilitate the genome-based clustering of EST-like sequences generated through the NGS 454 technology. Reads mapped onto the reference genome can be uploaded using the standard GFF3 file format. Alignment parsing is initially performed to produce a first collection of pseudo-clusters by grouping reads according to the overlap of their genomic coordinates on the same strand. EasyCluster2 then refines read grouping by including in each cluster only reads sharing at least one splice site and optionally performs a Smith-Waterman alignment in the region surrounding splice sites in order to correct for potential alignment errors. In addition, EasyCluster2 can include unspliced reads, which generally account for >50% of 454 datasets, and collapses overlapping clusters. Finally, EasyCluster2 can assemble full-length transcripts using a Directed-Acyclic-Graph-based strategy, simplifying the identification of alternative splicing isoforms, thanks also to the implementation of the widespread AStalavista methodology. Accuracy and performances have been tested on real as well as simulated datasets. CONCLUSIONS: EasyCluster2 represents a unique tool to cluster and assemble transcriptome reads produced with 454 technology, as well as ESTs and full-length transcripts. The clustering procedure is enhanced with the employment of genome annotations and unspliced reads. Overall, EasyCluster2 is able to perform an effective detection of splicing isoforms, since it can refine exon-exon junctions and explore alternative splicing without known reference transcripts. Results in GFF3 format can be browsed in the UCSC Genome Browser. Therefore, EasyCluster2 is a powerful tool to generate reliable clusters for gene expression studies, facilitating the analysis also to researchers not skilled in bioinformatics. |
format | Online Article Text |
id | pubmed-4271567 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4271567 2015-01-02 EasyCluster2: an improved tool for clustering and assembling long transcriptome reads Bevilacqua, Vitoantonio Pietroleonardo, Nicola Giannino, Ely Ignazio Stroppa, Fabio Simone, Domenico Pesole, Graziano Picardi, Ernesto BMC Bioinformatics Proceedings BioMed Central 2014-12-03 /pmc/articles/PMC4271567/ /pubmed/25474441 http://dx.doi.org/10.1186/1471-2105-15-S15-S7 Text en Copyright © 2014 Bevilacqua et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Proceedings Bevilacqua, Vitoantonio Pietroleonardo, Nicola Giannino, Ely Ignazio Stroppa, Fabio Simone, Domenico Pesole, Graziano Picardi, Ernesto EasyCluster2: an improved tool for clustering and assembling long transcriptome reads |
title | EasyCluster2: an improved tool for clustering and assembling long transcriptome reads |
title_full | EasyCluster2: an improved tool for clustering and assembling long transcriptome reads |
title_fullStr | EasyCluster2: an improved tool for clustering and assembling long transcriptome reads |
title_full_unstemmed | EasyCluster2: an improved tool for clustering and assembling long transcriptome reads |
title_short | EasyCluster2: an improved tool for clustering and assembling long transcriptome reads |
title_sort | easycluster2: an improved tool for clustering and assembling long transcriptome reads |
topic | Proceedings |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4271567/ https://www.ncbi.nlm.nih.gov/pubmed/25474441 http://dx.doi.org/10.1186/1471-2105-15-S15-S7 |
work_keys_str_mv | AT bevilacquavitoantonio easycluster2animprovedtoolforclusteringandassemblinglongtranscriptomereads AT pietroleonardonicola easycluster2animprovedtoolforclusteringandassemblinglongtranscriptomereads AT gianninoelyignazio easycluster2animprovedtoolforclusteringandassemblinglongtranscriptomereads AT stroppafabio easycluster2animprovedtoolforclusteringandassemblinglongtranscriptomereads AT simonedomenico easycluster2animprovedtoolforclusteringandassemblinglongtranscriptomereads AT pesolegraziano easycluster2animprovedtoolforclusteringandassemblinglongtranscriptomereads AT picardiernesto easycluster2animprovedtoolforclusteringandassemblinglongtranscriptomereads |
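
The first two clustering steps named in the description (pseudo-clusters from overlapping genomic coordinates on the same strand, then refinement by shared splice sites) can be illustrated with a minimal Python sketch. This is not EasyCluster2's code: the `Read` class, the function names `pseudo_clusters` and `refine_by_splice_sites`, and the example reads are all hypothetical, and the sketch assumes that "sharing at least one splice site" is applied transitively, which the abstract does not specify. Unspliced reads, the optional Smith-Waterman correction, and transcript assembly are left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive as in GFF3


@dataclass(frozen=True)
class Read:
    """Hypothetical container for one aligned read; exons sorted by start."""
    name: str
    chrom: str
    strand: str
    exons: Tuple[Exon, ...]

    @property
    def start(self) -> int:
        return self.exons[0][0]

    @property
    def end(self) -> int:
        return self.exons[-1][1]

    def splice_sites(self) -> Set[Tuple[str, int]]:
        """Intron boundaries implied by consecutive exons."""
        sites = set()
        for (_, left_end), (right_start, _) in zip(self.exons, self.exons[1:]):
            sites.add(("exon_end", left_end))
            sites.add(("exon_start", right_start))
        return sites


def pseudo_clusters(reads: List[Read]) -> List[List[Read]]:
    """Group reads whose genomic spans overlap on the same chromosome and strand."""
    by_locus: Dict[Tuple[str, str], List[Read]] = defaultdict(list)
    for read in reads:
        by_locus[(read.chrom, read.strand)].append(read)

    clusters = []
    for group in by_locus.values():
        group.sort(key=lambda r: r.start)
        current, current_end = [group[0]], group[0].end
        for read in group[1:]:
            if read.start <= current_end:      # overlaps the running span
                current.append(read)
                current_end = max(current_end, read.end)
            else:                              # gap: close the cluster
                clusters.append(current)
                current, current_end = [read], read.end
        clusters.append(current)
    return clusters


def refine_by_splice_sites(cluster: List[Read]) -> List[List[Read]]:
    """Split a pseudo-cluster into groups of spliced reads linked by shared splice sites.

    Assumption: linkage is transitive (union-find); unspliced reads are skipped.
    """
    spliced = [r for r in cluster if len(r.exons) > 1]
    parent = list(range(len(spliced)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]      # path halving
            i = parent[i]
        return i

    first_seen: Dict[Tuple[str, int], int] = {}
    for i, read in enumerate(spliced):
        for site in read.splice_sites():
            if site in first_seen:
                parent[find(i)] = find(first_seen[site])
            else:
                first_seen[site] = i

    groups: Dict[int, List[Read]] = defaultdict(list)
    for i, read in enumerate(spliced):
        groups[find(i)].append(read)
    return list(groups.values())


# Made-up example reads, for illustration only.
EXAMPLE_READS = [
    Read("r1", "chr1", "+", ((100, 200), (300, 400))),
    Read("r2", "chr1", "+", ((150, 200), (300, 450))),
    Read("r3", "chr1", "+", ((350, 420), (500, 600))),
    Read("r4", "chr1", "-", ((100, 200), (300, 400))),
    Read("r5", "chr1", "+", ((1000, 1100),)),
]
```

In this made-up data, `r3` overlaps `r1` and `r2` by coordinates, so it joins their pseudo-cluster, but it shares no intron boundary with them, so the refinement step separates it again. That is the kind of case the splice-site criterion is meant to catch.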