Data handling strategies for high throughput pyrosequencers

BACKGROUND: New high throughput pyrosequencers such as the 454 Life Sciences GS 20 are capable of massively parallelizing DNA sequencing providing an unprecedented rate of output data as well as potentially reducing costs. However, these new pyrosequencers bear a different error profile and provide...

Bibliographic Details
Main authors: Trombetti, Gabriele A, Bonnal, Raoul JP, Rizzi, Ermanno, De Bellis, Gianluca, Milanesi, Luciano
Format: Text
Language: English
Published: BioMed Central 2007
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1885852/
https://www.ncbi.nlm.nih.gov/pubmed/17430567
http://dx.doi.org/10.1186/1471-2105-8-S1-S22
author Trombetti, Gabriele A
Bonnal, Raoul JP
Rizzi, Ermanno
De Bellis, Gianluca
Milanesi, Luciano
collection PubMed
description BACKGROUND: New high throughput pyrosequencers such as the 454 Life Sciences GS 20 are capable of massively parallelizing DNA sequencing, providing an unprecedented rate of output data as well as potentially reducing costs. However, these new pyrosequencers have a different error profile and provide shorter reads than a more traditional Sanger sequencer. These facts pose new challenges regarding how the data are handled and analyzed. In addition, the steep increase in sequencer throughput calls for substantial computing power at low cost. RESULTS: To address these challenges, we created an automated multi-step computation pipeline integrated with a database storage system. This allowed us to store, handle, index and search (1) the output data from the GS 20 sequencer; (2) analysis projects, possibly several for each dataset; (3) final results of analysis computations; and (4) intermediate results of computations (these allow manual comparisons and hence further searches by the biologists). Repeatability of computations was also a requirement. To obtain the needed computing power, we ported the pipeline to the European Grid: a large community of clusters, load balanced as a whole. To better achieve this Grid port we created Vnas: an innovative Grid job submission, virtual sandbox manager and job callback framework. After some runs of the pipeline aimed at tuning the parameters and thresholds for optimal results, we successfully analyzed 273 sequenced amplicons from a cancerous human sample and correctly found point mutations confirmed by either Sanger resequencing or NCBI dbSNP. The sequencing was performed with our 454 Life Sciences GS 20 pyrosequencer. CONCLUSION: We handled the steep increase in throughput from the new pyrosequencer by building an automated computation pipeline associated with database storage, and by leveraging the computing power of the European Grid. The Grid platform offers a very cost-effective choice for uneven workloads, typical in many scientific research fields, provided its peculiarities can be accepted (these are discussed). The infrastructure described was used to analyze human amplicons for mutations. More analyses will be performed in the future.
format Text
id pubmed-1885852
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1885852 2007-06-05 Data handling strategies for high throughput pyrosequencers Trombetti, Gabriele A Bonnal, Raoul JP Rizzi, Ermanno De Bellis, Gianluca Milanesi, Luciano BMC Bioinformatics Research BioMed Central 2007-03-08 /pmc/articles/PMC1885852/ /pubmed/17430567 http://dx.doi.org/10.1186/1471-2105-8-S1-S22 Text en Copyright © 2007 Trombetti et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Data handling strategies for high throughput pyrosequencers
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1885852/
https://www.ncbi.nlm.nih.gov/pubmed/17430567
http://dx.doi.org/10.1186/1471-2105-8-S1-S22