MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression

BACKGROUND: Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to res...

Bibliographic Details
Main Authors:	Kim, Minji; Zhang, Xiejia; Ligo, Jonathan G.; Farnoud, Farzad; Veeravalli, Venugopal V.; Milenkovic, Olgica
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2016
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4759986/ https://www.ncbi.nlm.nih.gov/pubmed/26895947 http://dx.doi.org/10.1186/s12859-016-0932-x

author	Kim, Minji; Zhang, Xiejia; Ligo, Jonathan G.; Farnoud, Farzad; Veeravalli, Venugopal V.; Milenkovic, Olgica
author_sort	Kim, Minji
collection	PubMed
description	BACKGROUND: Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to result in large file sizes, typically ranging from 1 to 10 GB. This leads to challenges in analyzing, transferring and storing metagenomic data. In order to overcome these data processing issues, we introduce MetaCRAM, the first de novo, parallelized software suite specialized for FASTA and FASTQ format metagenomic read processing and lossless compression. RESULTS: MetaCRAM integrates algorithms for taxonomy identification and assembly, and introduces parallel execution methods; furthermore, it enables genome reference selection and CRAM-based compression. MetaCRAM also uses novel reference-based compression methods designed through extensive studies of integer compression techniques and through fitting of empirical distributions of metagenomic read-reference positions. MetaCRAM is a lossless method compatible with standard CRAM formats, and it allows for fast selection of relevant files in the compressed domain via maintenance of taxonomy information. The performance of MetaCRAM as a stand-alone compression platform was evaluated on various metagenomic samples from the NCBI Sequence Read Archive, suggesting 2- to 4-fold compression ratio improvements compared to gzip. On average, the compressed file sizes were 2–13 percent of the original raw metagenomic file sizes. CONCLUSIONS: We described the first architecture for reference-based, lossless compression of metagenomic data. The compression scheme proposed offers significantly improved compression ratios as compared to off-the-shelf methods such as zip programs. Furthermore, it enables running different components in parallel and it provides the user with taxonomic and assembly information generated during execution of the compression pipeline. AVAILABILITY: The MetaCRAM software is freely available at http://web.engr.illinois.edu/~mkim158/metacram.html. The website also contains a README file and other relevant instructions for running the code. Note that to run the code one needs a minimum of 16 GB of RAM. In addition, a virtual box is set up on a 4 GB RAM machine for users to run a simple demonstration. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-0932-x) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4759986
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4759986 2016-02-20 MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression Kim, Minji; Zhang, Xiejia; Ligo, Jonathan G.; Farnoud, Farzad; Veeravalli, Venugopal V.; Milenkovic, Olgica BMC Bioinformatics Methodology Article BioMed Central 2016-02-19 /pmc/articles/PMC4759986/ /pubmed/26895947 http://dx.doi.org/10.1186/s12859-016-0932-x Text en © Kim et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4759986/ https://www.ncbi.nlm.nih.gov/pubmed/26895947 http://dx.doi.org/10.1186/s12859-016-0932-x
