PVT: An Efficient Computational Procedure to Speed up Next-generation Sequence Analysis
BACKGROUND: High-throughput Next-Generation Sequencing (NGS) techniques are advancing genomics and molecular biology research. This technology generates very large volumes of data, which poses a major challenge to scientists seeking an efficient, cost- and time-effective way to analyse such data....
Main Authors: | Maji, Ranjan Kumar; Sarkar, Arijita; Khatua, Sunirmal; Dasgupta, Subhasis; Ghosh, Zhumur |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | Methodology Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4063226/ https://www.ncbi.nlm.nih.gov/pubmed/24894600 http://dx.doi.org/10.1186/1471-2105-15-167 |
_version_ | 1782321769686761472 |
---|---|
author | Maji, Ranjan Kumar Sarkar, Arijita Khatua, Sunirmal Dasgupta, Subhasis Ghosh, Zhumur |
author_facet | Maji, Ranjan Kumar Sarkar, Arijita Khatua, Sunirmal Dasgupta, Subhasis Ghosh, Zhumur |
author_sort | Maji, Ranjan Kumar |
collection | PubMed |
description | BACKGROUND: High-throughput Next-Generation Sequencing (NGS) techniques are advancing genomics and molecular biology research. This technology generates very large volumes of data, which poses a major challenge to scientists seeking an efficient, cost- and time-effective way to analyse such data. Further, across the different types of NGS data, certain challenging analysis steps are common. Spliced alignment is one such fundamental step, and it is extremely computationally intensive as well as time-consuming. Even the most widely used spliced alignment tools have serious problems. TopHat is one such widely used tool: although it supports multithreading, it does not use computational resources efficiently in terms of CPU utilization and memory. Here we introduce PVT (Pipelined Version of TopHat), which takes a modular approach by breaking TopHat’s serial execution into a pipeline of multiple stages, thereby increasing the degree of parallelization and computational resource utilization. We thus address these shortcomings in TopHat so that large NGS data can be analysed efficiently. RESULTS: We analysed the SRA datasets SRX026839 and SRX026838, consisting of single-end reads, and the SRA dataset SRR1027730, consisting of paired-end reads. We used TopHat v2.0.8 to analyse these datasets and recorded the CPU usage, memory footprint and execution time during spliced alignment. With this baseline information, we designed PVT, a pipelined version of TopHat that removes the redundant computational steps during ‘spliced alignment’ and breaks the job into a pipeline of multiple stages (each comprising different step(s)) to improve its resource utilization, thus reducing the execution time. CONCLUSIONS: PVT provides an improvement over TopHat for spliced alignment in NGS data analysis. PVT reduced the execution time to ~23% for the single-end read dataset. Further, PVT designed for paired-end reads showed an improved performance of ~41% over TopHat (for the chosen data) with respect to execution time. Moreover, we propose PVT-Cloud, which implements the PVT pipeline in a cloud computing system. |
format | Online Article Text |
id | pubmed-4063226 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4063226 2014-06-30 PVT: An Efficient Computational Procedure to Speed up Next-generation Sequence Analysis Maji, Ranjan Kumar Sarkar, Arijita Khatua, Sunirmal Dasgupta, Subhasis Ghosh, Zhumur BMC Bioinformatics Methodology Article BioMed Central 2014-06-04 /pmc/articles/PMC4063226/ /pubmed/24894600 http://dx.doi.org/10.1186/1471-2105-15-167 Text en Copyright © 2014 Maji et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Methodology Article Maji, Ranjan Kumar Sarkar, Arijita Khatua, Sunirmal Dasgupta, Subhasis Ghosh, Zhumur PVT: An Efficient Computational Procedure to Speed up Next-generation Sequence Analysis |
title | PVT: An Efficient Computational Procedure to Speed up Next-generation Sequence Analysis |
title_full | PVT: An Efficient Computational Procedure to Speed up Next-generation Sequence Analysis |
title_fullStr | PVT: An Efficient Computational Procedure to Speed up Next-generation Sequence Analysis |
title_full_unstemmed | PVT: An Efficient Computational Procedure to Speed up Next-generation Sequence Analysis |
title_short | PVT: An Efficient Computational Procedure to Speed up Next-generation Sequence Analysis |
title_sort | pvt: an efficient computational procedure to speed up next-generation sequence analysis |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4063226/ https://www.ncbi.nlm.nih.gov/pubmed/24894600 http://dx.doi.org/10.1186/1471-2105-15-167 |