A Scalable and Accurate Targeted Gene Assembly Tool (SAT-Assembler) for Next-Generation Sequencing Data

Gene assembly, which recovers gene segments from short reads, is an important step in functional analysis of next-generation sequencing data. Lacking quality reference genomes, de novo assembly is commonly used for RNA-Seq data of non-model organisms and metagenomic data. However, heterogeneous sequ...

Bibliographic Details
Main authors: Zhang, Yuan, Sun, Yanni, Cole, James R.
Format: Online Article Text
Language: English
Published: Public Library of Science 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4133164/
https://www.ncbi.nlm.nih.gov/pubmed/25122209
http://dx.doi.org/10.1371/journal.pcbi.1003737
author Zhang, Yuan
Sun, Yanni
Cole, James R.
collection PubMed
description Gene assembly, which recovers gene segments from short reads, is an important step in functional analysis of next-generation sequencing data. Lacking quality reference genomes, de novo assembly is commonly used for RNA-Seq data of non-model organisms and metagenomic data. However, heterogeneous sequence coverage caused by heterogeneous expression or species abundance, similarity between isoforms or homologous genes, and large data size all pose challenges to de novo assembly. As a result, existing assembly tools tend to output fragmented contigs or chimeric contigs, or have high memory footprint. In this work, we introduce a targeted gene assembly program SAT-Assembler, which aims to recover gene families of particular interest to biologists. It addresses the above challenges by conducting family-specific homology search, homology-guided overlap graph construction, and careful graph traversal. It can be applied to both RNA-Seq and metagenomic data. Our experimental results on an Arabidopsis RNA-Seq data set and two metagenomic data sets show that SAT-Assembler has smaller memory usage, comparable or better gene coverage, and lower chimera rate for assembling a set of genes from one or multiple pathways compared with other assembly tools. Moreover, the family-specific design and rapid homology search allow SAT-Assembler to be naturally compatible with parallel computing platforms. The source code of SAT-Assembler is available at https://sourceforge.net/projects/sat-assembler/. The data sets and experimental settings can be found in supplementary material.
format Online
Article
Text
id pubmed-4133164
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-4133164 2014-08-19 PLoS Comput Biol Research Article
Public Library of Science 2014-08-14 /pmc/articles/PMC4133164/ /pubmed/25122209 http://dx.doi.org/10.1371/journal.pcbi.1003737 Text en © 2014 Zhang et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title A Scalable and Accurate Targeted Gene Assembly Tool (SAT-Assembler) for Next-Generation Sequencing Data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4133164/
https://www.ncbi.nlm.nih.gov/pubmed/25122209
http://dx.doi.org/10.1371/journal.pcbi.1003737