Fragment assignment in the cloud with eXpress-D

BACKGROUND: Probabilistic assignment of ambiguously mapped fragments produced by high-throughput sequencing experiments has been demonstrated to greatly improve accuracy in the analysis of RNA-Seq and ChIP-Seq, and is an essential step in many other sequence census experiments. A maximum likelihood...

Bibliographic Details
Main Authors:	Roberts, Adam; Feng, Harvey; Pachter, Lior
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3881492/ https://www.ncbi.nlm.nih.gov/pubmed/24314033 http://dx.doi.org/10.1186/1471-2105-14-358

_version_	1782298223306604544
author	Roberts, Adam Feng, Harvey Pachter, Lior
author_facet	Roberts, Adam Feng, Harvey Pachter, Lior
author_sort	Roberts, Adam
collection	PubMed
description	BACKGROUND: Probabilistic assignment of ambiguously mapped fragments produced by high-throughput sequencing experiments has been demonstrated to greatly improve accuracy in the analysis of RNA-Seq and ChIP-Seq, and is an essential step in many other sequence census experiments. A maximum likelihood method using the expectation-maximization (EM) algorithm for optimization is commonly used to solve this problem. However, batch EM-based approaches do not scale well with the size of sequencing datasets, which have been increasing dramatically over the past few years. Thus, current approaches to fragment assignment rely on heuristics or approximations for tractability. RESULTS: We present an implementation of a distributed EM solution to the fragment assignment problem using Spark, a data analytics framework that can scale by leveraging compute clusters within datacenters ("the cloud"). We demonstrate that our implementation easily scales to billions of sequenced fragments, while providing the exact maximum likelihood assignment of ambiguous fragments. The method is shown to be more accurate than the most widely used tools available, and it can be run in a constant amount of time when cluster resources are scaled linearly with the amount of input data. CONCLUSIONS: The cloud offers one solution for the difficulties faced in the analysis of massive high-throughput sequencing data, which continue to grow rapidly. Researchers in bioinformatics must follow developments in distributed systems, such as new frameworks like Spark, for ways to port existing methods to the cloud and help them scale to the datasets of the future. Our software, eXpress-D, is freely available at: http://github.com/adarob/express-d.
format	Online Article Text
id	pubmed-3881492
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3881492 2014-01-07 Fragment assignment in the cloud with eXpress-D Roberts, Adam Feng, Harvey Pachter, Lior BMC Bioinformatics Methodology Article BioMed Central 2013-12-07 /pmc/articles/PMC3881492/ /pubmed/24314033 http://dx.doi.org/10.1186/1471-2105-14-358 Text en Copyright © 2013 Roberts et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Methodology Article Roberts, Adam Feng, Harvey Pachter, Lior Fragment assignment in the cloud with eXpress-D
title	Fragment assignment in the cloud with eXpress-D
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3881492/ https://www.ncbi.nlm.nih.gov/pubmed/24314033 http://dx.doi.org/10.1186/1471-2105-14-358
