
CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment

High-throughput sequencing can produce hundreds of thousands of 16S rRNA sequence reads corresponding to different organisms present in the environmental samples. Typically, analysis of microbial diversity in bioinformatics starts from pre-processing followed by clustering 16S rRNA reads into relati...


Bibliographic Details
Main Authors:	Oh, Jeongsu, Choi, Chi-Hwan, Park, Min-Kyu, Kim, Byung Kwon, Hwang, Kyuin, Lee, Sang-Heon, Hong, Soon Gyu, Nasir, Arshan, Cho, Wan-Sup, Kim, Kyung Mo
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2016
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4783016/ https://www.ncbi.nlm.nih.gov/pubmed/26954507 http://dx.doi.org/10.1371/journal.pone.0151064

author	Oh, Jeongsu Choi, Chi-Hwan Park, Min-Kyu Kim, Byung Kwon Hwang, Kyuin Lee, Sang-Heon Hong, Soon Gyu Nasir, Arshan Cho, Wan-Sup Kim, Kyung Mo
collection	PubMed
description	High-throughput sequencing can produce hundreds of thousands of 16S rRNA sequence reads corresponding to the different organisms present in environmental samples. Typically, analysis of microbial diversity starts with pre-processing, followed by clustering the 16S rRNA reads into a smaller number of operational taxonomic units (OTUs). OTUs are reliable indicators of microbial diversity and greatly reduce downstream analysis time. However, existing hierarchical clustering algorithms, which are generally more accurate than greedy heuristic algorithms, struggle with large sequence datasets. To keep pace with the rapid rise in sequencing data, we present CLUSTOM-CLOUD, the first distributed sequence clustering program based on In-Memory Data Grid (IMDG) technology, a distributed data structure that stores all data in the main memory of multiple computing nodes. IMDG technology allows CLUSTOM-CLOUD to handle larger datasets and to scale better computationally than its ancestor, CLUSTOM, while maintaining high accuracy. The clustering speed of CLUSTOM-CLOUD was evaluated on published 16S rRNA human microbiome sequence datasets using a small laboratory cluster (10 nodes) and the Amazon EC2 cloud-computing environment. In the laboratory environment, it required only ~3 hours to process a dataset of 200 K reads, regardless of the complexity of the human microbiome data. On Amazon EC2, one million reads were processed in approximately 20, 14, and 11 hours using 20, 30, and 40 nodes, respectively. The running time evaluation indicates that CLUSTOM-CLOUD can handle much larger sequence datasets than CLUSTOM and is a scalable distributed processing system. A comparative accuracy test using 16S rRNA pyrosequences of a mock community shows that CLUSTOM-CLOUD achieves higher accuracy than DOTUR, mothur, ESPRIT-Tree, UCLUST and Swarm. CLUSTOM-CLOUD is written in Java and is freely available at http://clustomcloud.kopri.re.kr.
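The scalability claim in the description can be checked with simple arithmetic on the reported Amazon EC2 timings. The sketch below is only an illustration: the function name `relative_scaling` and the choice of the 20-node run as the baseline are assumptions made here, not part of CLUSTOM-CLOUD, and the input times are the approximate figures quoted in the abstract.

```python
# Approximate wall-clock hours for one million reads on Amazon EC2,
# keyed by node count, as quoted in the abstract.
reported_hours = {20: 20.0, 30: 14.0, 40: 11.0}


def relative_scaling(timings, baseline_nodes):
    """Return {nodes: (speedup, efficiency)} relative to the baseline run.

    speedup    = baseline time / time at this node count
    efficiency = speedup / (nodes / baseline nodes), i.e. the fraction of
                 ideal linear scaling that was achieved.
    Hypothetical helper for illustration only.
    """
    base_time = timings[baseline_nodes]
    result = {}
    for nodes, hours in sorted(timings.items()):
        speedup = base_time / hours
        ideal = nodes / baseline_nodes
        result[nodes] = (speedup, speedup / ideal)
    return result


scaling = relative_scaling(reported_hours, baseline_nodes=20)
for nodes, (speedup, efficiency) in scaling.items():
    print(f"{nodes} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Taking the rounded timings at face value, doubling the node count from 20 to 40 gives a speedup of about 1.82x, or roughly 91% of ideal linear scaling. Because the abstract reports the times only approximately, these percentages are indicative rather than exact.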
format	Online Article Text
id	pubmed-4783016
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-4783016 2016-03-23 PLoS One Research Article Public Library of Science 2016-03-08 /pmc/articles/PMC4783016/ /pubmed/26954507 http://dx.doi.org/10.1371/journal.pone.0151064 Text en © 2016 Oh et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4783016/ https://www.ncbi.nlm.nih.gov/pubmed/26954507 http://dx.doi.org/10.1371/journal.pone.0151064
