CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping

Bibliographic Details
Main Authors: Nguyen, Tung, Shi, Weisong, Ruden, Douglas
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3127959/
https://www.ncbi.nlm.nih.gov/pubmed/21645377
http://dx.doi.org/10.1186/1756-0500-4-171
author Nguyen, Tung
Shi, Weisong
Ruden, Douglas
collection PubMed
description BACKGROUND: Research in genetics has advanced rapidly in recent years with the aid of next-generation sequencing (NGS). However, massively parallel NGS produces enormous amounts of data, which leads to storage, compatibility, scalability, and performance issues. The Cloud Computing and MapReduce framework, which uses hundreds or thousands of shared computers to map sequencing reads quickly and efficiently to reference genome sequences, appears to be a promising solution to these issues. Consequently, it has recently been adopted by many organizations, and the initial results are encouraging. However, because these are only the first steps in this direction, the software developed so far lacks primary functions, such as bisulfite and paired-end mapping, that are available in on-site software such as RMAP or BS Seeker. In addition, existing MapReduce-based applications were not designed to process the long reads produced by the most recent second-generation and third-generation NGS instruments and are therefore inefficient on such reads. Last, most of these tools were developed on Linux with a command-line interface, which makes them difficult to use for the majority of biologists, who are not trained in programming. RESULTS: To encourage the use of Cloud technologies in genomics and to prepare for advances in second- and third-generation DNA sequencing, we have built a Hadoop MapReduce-based application, CloudAligner, which achieves higher performance, covers most primary features, is more accurate, and has a user-friendly interface. It was also designed to handle long sequences. The performance gain of CloudAligner over Cloud-based counterparts (35 to 80%) comes mainly from omitting the reduce phase. Compared with local approaches, the performance gain of CloudAligner comes from partitioning the huge reference genome, as well as the reads, and processing the partitions in parallel. The source code of CloudAligner is available at http://cloudaligner.sourceforge.net/ and its web version is at http://mine.cs.wayne.edu:8080/CloudAligner/. CONCLUSIONS: Our results show that CloudAligner is faster than CloudBurst, provides more accurate results than RMAP, and supports various input and output formats. In addition, its web-based interface makes it easier to use than its counterparts.
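The two ideas named in the description, partitioning both the reference and the reads and emitting final results directly from the map phase, can be illustrated with a small sketch. This is not CloudAligner's code and does not use Hadoop: every function name, the `chunk_size` and `batch_size` parameters, and the brute-force mismatch scan are assumptions made for illustration, and the scan stands in for whatever alignment algorithm the tool actually uses.

```python
from typing import List, Tuple

Read = Tuple[str, str]       # (read id, bases)
Hit = Tuple[str, int, int]   # (read id, 0-based reference position, mismatches)


def count_mismatches(a: str, b: str, limit: int) -> int:
    """Count mismatches between equal-length strings, stopping once limit is exceeded."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                break
    return mismatches


def split_reference(reference: str, chunk_size: int, overlap: int) -> List[Tuple[int, str]]:
    """Cut the reference into (offset, chunk) pieces that overlap by `overlap` bases."""
    return [
        (start, reference[start:start + chunk_size + overlap])
        for start in range(0, len(reference), chunk_size)
    ]


def map_task(offset: int, chunk: str, chunk_size: int,
             reads: List[Read], max_mismatches: int) -> List[Hit]:
    """One map task: align a batch of reads against one reference chunk."""
    hits = []
    for read_id, bases in reads:
        # Only start positions owned by this chunk are tried, so the
        # overlapping tail never produces a hit that the next chunk also emits.
        last_start = min(chunk_size, len(chunk) - len(bases) + 1)
        for pos in range(last_start):
            window = chunk[pos:pos + len(bases)]
            mismatches = count_mismatches(bases, window, max_mismatches)
            if mismatches <= max_mismatches:
                hits.append((read_id, offset + pos, mismatches))
    return hits


def run_map_only(reference: str, reads: List[Read], chunk_size: int = 8,
                 batch_size: int = 2, max_mismatches: int = 1) -> List[Hit]:
    """Run every (read batch, reference chunk) map task and concatenate the outputs.

    Each task's output is already final, so no shuffle or reduce step is needed.
    The tasks are independent and could run on separate workers; a plain loop
    is used here to keep the sketch self-contained.
    """
    overlap = max(len(bases) for _, bases in reads) - 1
    chunks = split_reference(reference, chunk_size, overlap)
    hits: List[Hit] = []
    for i in range(0, len(reads), batch_size):
        batch = reads[i:i + batch_size]
        for offset, chunk in chunks:
            hits.extend(map_task(offset, chunk, chunk_size, batch, max_mismatches))
    return hits
```

Because each task only reports start positions inside its own chunk, the result does not depend on how the reference or the reads are partitioned. That independence is what makes a separate merge step unnecessary in this sketch.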
format Online
Article
Text
id pubmed-3127959
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3127959 2011-07-01 CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping Nguyen, Tung Shi, Weisong Ruden, Douglas BMC Res Notes Technical Note BioMed Central 2011-06-06 /pmc/articles/PMC3127959/ /pubmed/21645377 http://dx.doi.org/10.1186/1756-0500-4-171 Text en Copyright ©2011 Nguyen et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping
topic Technical Note