Genome-guided transcript assembly from integrative analysis of RNA sequence data

The identification of full length transcripts entirely from short-read RNA sequencing data (RNA-seq) remains a challenge in genome annotation pipelines. Here we describe an automated pipeline for genome annotation that integrates RNA-seq and gene-boundary data sets, which we call generalized RNA int...

Bibliographic Details
Main authors: Boley, Nathan; Stoiber, Marcus H.; Booth, Benjamin W.; Wan, Kenneth H.; Hoskins, Roger A.; Bickel, Peter J.; Celniker, Susan E.; Brown, James B.
Format: Online Article Text
Language: English
Published: 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4037530/
https://www.ncbi.nlm.nih.gov/pubmed/24633242
http://dx.doi.org/10.1038/nbt.2850
_version_ 1782318245852741632
author Boley, Nathan
Stoiber, Marcus H.
Booth, Benjamin W.
Wan, Kenneth H.
Hoskins, Roger A.
Bickel, Peter J.
Celniker, Susan E.
Brown, James B.
author_sort Boley, Nathan
collection PubMed
description The identification of full length transcripts entirely from short-read RNA sequencing data (RNA-seq) remains a challenge in genome annotation pipelines. Here we describe an automated pipeline for genome annotation that integrates RNA-seq and gene-boundary data sets, which we call generalized RNA integration tool, or GRIT. By applying GRIT to Drosophila melanogaster short-read RNA-seq, cap analysis of gene expression (CAGE) and poly(A)-site-seq data collected for the modENCODE project, we recover the vast majority of previously annotated transcripts and double the total number of transcripts cataloged. We find that 20% of protein coding genes encode multiple protein-localization signals, and that, in 20 day old adult fly heads, genes with multiple poly-adenylation sites are more common than genes with alternate splicing or alternate promoters. When compared to the most widely used transcript assembly tools, GRIT recovers a larger fraction of annotated transcripts at higher precision. GRIT will enable the automated generation of high-quality genome annotations without necessitating extensive manual annotation.
format Online
Article
Text
id pubmed-4037530
institution National Center for Biotechnology Information
language English
publishDate 2014
record_format MEDLINE/PubMed
spelling pubmed-40375302014-10-01 Genome-guided transcript assembly from integrative analysis of RNA sequence data Boley, Nathan Stoiber, Marcus H. Booth, Benjamin W. Wan, Kenneth H. Hoskins, Roger A. Bickel, Peter J. Celniker, Susan E. Brown, James B. Nat Biotechnol Article The identification of full length transcripts entirely from short-read RNA sequencing data (RNA-seq) remains a challenge in genome annotation pipelines. Here we describe an automated pipeline for genome annotation that integrates RNA-seq and gene-boundary data sets, which we call generalized RNA integration tool, or GRIT. By applying GRIT to Drosophila melanogaster short-read RNA-seq, cap analysis of gene expression (CAGE) and poly(A)-site-seq data collected for the modENCODE project, we recover the vast majority of previously annotated transcripts and double the total number of transcripts cataloged. We find that 20% of protein coding genes encode multiple protein-localization signals, and that, in 20 day old adult fly heads, genes with multiple poly-adenylation sites are more common than genes with alternate splicing or alternate promoters. When compared to the most widely used transcript assembly tools, GRIT recovers a larger fraction of annotated transcripts at higher precision. GRIT will enable the automated generation of high-quality genome annotations without necessitating extensive manual annotation. 2014-03-16 2014-04 /pmc/articles/PMC4037530/ /pubmed/24633242 http://dx.doi.org/10.1038/nbt.2850 Text en http://www.nature.com/authors/editorial_policies/license.html#terms Users may view, print, copy, and download text and data-mine the content in such documents, for the purposes of academic research, subject always to the full Conditions of use:http://www.nature.com/authors/editorial_policies/license.html#terms
title Genome-guided transcript assembly from integrative analysis of RNA sequence data
topic Article