
GRASP2: fast and memory-efficient gene-centric assembly and homolog search for metagenomic sequencing data

BACKGROUND: A crucial task in metagenomic analysis is to annotate the function and taxonomy of the sequencing reads generated from a microbiome sample. In general, the reads can either be assembled into contigs and searched against reference databases, or individually searched without assembly. The...


Bibliographic Details
Main Authors:	Zhong, Cuncong; Yang, Youngik; Yooseph, Shibu
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6551247/ https://www.ncbi.nlm.nih.gov/pubmed/31167633 http://dx.doi.org/10.1186/s12859-019-2818-1

author	Zhong, Cuncong; Yang, Youngik; Yooseph, Shibu
author_sort	Zhong, Cuncong
collection	PubMed
description	BACKGROUND: A crucial task in metagenomic analysis is to annotate the function and taxonomy of the sequencing reads generated from a microbiome sample. In general, the reads can either be assembled into contigs and searched against reference databases, or searched individually without assembly. The first approach may suffer from fragmentary and incomplete assembly, while the second is hampered by the reduced functional signal contained in short reads. To tackle these issues, we previously developed GRASP (Guided Reference-based Assembly of Short Peptides), which accepts a reference protein sequence as input and aims to assemble its homologs from a database containing fragmentary protein sequences. In addition to being a gene-centric assembly tool, GRASP also serves as a homolog search tool when the assembled protein sequences are used as templates to recruit reads. GRASP has a significantly improved recall rate (60–80% vs. 30–40%) compared to other homolog search tools such as BLAST. However, GRASP is both time- and space-consuming. Subsequently, we developed GRASPx, which is 30X faster than GRASP. Here, we present a completely redesigned algorithm, GRASP2, for this computational problem. RESULTS: GRASP2 utilizes the Burrows-Wheeler Transformation (BWT) and FM-index to perform assembly graph generation, and reduces the search space by employing a fast ungapped alignment strategy as a filter. GRASP2 also explicitly generates candidate paths prior to alignment, which effectively uncouples the iterative access of the assembly graph and alignment matrix. This strategy makes the execution of the program more efficient under current computer architecture, and contributes to GRASP2's speedup. GRASP2 is 8-fold faster than GRASPx (and 250-fold faster than GRASP) and uses 8-fold less memory while maintaining the original high recall rate of GRASP. GRASP2 reaches ~80% recall rate compared to ~40% for BLAST, both at a high precision level (>95%). While achieving this performance, GRASP2 is only ~3X slower than BLASTP. CONCLUSION: GRASP2 is a high-performance gene-centric assembly and homolog search tool with significant speedup compared to its predecessors, which makes it a useful tool for metagenomics data analysis. GRASP2 is implemented in C++ and is freely available from http://www.sourceforge.net/projects/grasp2.
format	Online Article Text
id	pubmed-6551247
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6551247 2019-06-07 GRASP2: fast and memory-efficient gene-centric assembly and homolog search for metagenomic sequencing data Zhong, Cuncong; Yang, Youngik; Yooseph, Shibu BMC Bioinformatics Research BioMed Central 2019-06-06 /pmc/articles/PMC6551247/ /pubmed/31167633 http://dx.doi.org/10.1186/s12859-019-2818-1 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	GRASP2: fast and memory-efficient gene-centric assembly and homolog search for metagenomic sequencing data
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6551247/ https://www.ncbi.nlm.nih.gov/pubmed/31167633 http://dx.doi.org/10.1186/s12859-019-2818-1
work_keys_str_mv	AT zhongcuncong grasp2fastandmemoryefficientgenecentricassemblyandhomologsearchformetagenomicsequencingdata AT yangyoungik grasp2fastandmemoryefficientgenecentricassemblyandhomologsearchformetagenomicsequencingdata AT yoosephshibu grasp2fastandmemoryefficientgenecentricassemblyandhomologsearchformetagenomicsequencingdata
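
The description field above states that GRASP2 uses the Burrows-Wheeler Transformation (BWT) and FM-index to build its assembly graph. The record gives no implementation details, so the following is only a minimal Python sketch of the general FM-index backward-search technique on a toy peptide string, not GRASP2's actual code (which is written in C++). All function names (`bwt_and_suffix_array`, `build_fm_index`, `backward_search`, `find_occurrences`) and the example sequence are hypothetical, and the naive suffix sorting is chosen for readability rather than speed.

```python
from collections import Counter


def bwt_and_suffix_array(text):
    """Return (BWT string, suffix array) of text + '$' using naive suffix sorting."""
    s = text + "$"  # '$' sorts before all uppercase residue letters
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)  # for i == 0 this wraps around to '$'
    return bwt, sa


def build_fm_index(bwt):
    """Build the C table (count of smaller characters) and Occ prefix counts."""
    counts = Counter(bwt)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for key in occ:
            occ[key][i + 1] = occ[key][i] + (1 if key == ch else 0)
    return c_table, occ


def backward_search(pattern, c_table, occ, n):
    """Return the half-open suffix-array interval [lo, hi) matching pattern."""
    lo, hi = 0, n
    for ch in reversed(pattern):  # extend the match one character to the left
        if ch not in c_table:
            return 0, 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def find_occurrences(text, pattern):
    """Return sorted start positions of pattern in text via the FM-index."""
    bwt, sa = bwt_and_suffix_array(text)
    c_table, occ = build_fm_index(bwt)
    lo, hi = backward_search(pattern, c_table, occ, len(bwt))
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    # Toy peptide fragment; "MKV" starts at positions 0 and 7.
    print(find_occurrences("MKVLAAGMKVL", "MKV"))
```

Each step of the backward search narrows the suffix-array interval using only the C and Occ tables, so a query costs time proportional to the pattern length rather than the database size. That property is the usual reason for choosing an FM-index for read-overlap lookups of this kind.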
