
Sequence database versioning for command line and Galaxy bioinformatics servers

Motivation: There are various reasons for rerunning bioinformatics tools and pipelines on sequencing data, including reproducing a past result, validation of a new tool or workflow using a known dataset, or tracking the impact of database changes. For identical results to be achieved, regularly upda...


Bibliographic Details
Main authors:	Dooley, Damion M.; Petkau, Aaron J.; Van Domselaar, Gary; Hsiao, William W.L.
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2016
Subjects:	Applications Notes
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4824126/ https://www.ncbi.nlm.nih.gov/pubmed/26656932 http://dx.doi.org/10.1093/bioinformatics/btv724

author	Dooley, Damion M. Petkau, Aaron J. Van Domselaar, Gary Hsiao, William W.L.
collection	PubMed
description	Motivation: There are various reasons for rerunning bioinformatics tools and pipelines on sequencing data, including reproducing a past result, validation of a new tool or workflow using a known dataset, or tracking the impact of database changes. For identical results to be achieved, regularly updated reference sequence databases must be versioned and archived. Database administrators have tried to fill the requirements by supplying users with one-off versions of databases, but these are time consuming to set up and are inconsistent across resources. Disk storage and data backup performance has also discouraged maintaining multiple versions of databases since databases such as NCBI nr can consume 50 Gb or more disk space per version, with growth rates that parallel Moore's law. Results: Our end-to-end solution combines our own Kipper software package—a simple key-value large file versioning system—with BioMAJ (software for downloading sequence databases), and Galaxy (a web-based bioinformatics data processing platform). Available versions of databases can be recalled and used by command-line and Galaxy users. The Kipper data store format makes publishing curated FASTA databases convenient since in most cases it can store a range of versions into a file marginally larger than the size of the latest version. Availability and implementation: Kipper v1.0.0 and the Galaxy Versioned Data tool are written in Python and released as free and open source software available at https://github.com/Public-Health-Bioinformatics/kipper and https://github.com/Public-Health-Bioinformatics/versioned_data, respectively; detailed setup instructions can be found at https://github.com/Public-Health-Bioinformatics/versioned_data/blob/master/doc/setup.md Contact: Damion.Dooley@Bccdc.Ca or William.Hsiao@Bccdc.Ca Supplementary information: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-4824126
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-4824126 2016-04-08 Sequence database versioning for command line and Galaxy bioinformatics servers Dooley, Damion M. Petkau, Aaron J. Van Domselaar, Gary Hsiao, William W.L. Bioinformatics Applications Notes Oxford University Press 2016-04-15 2015-12-12 /pmc/articles/PMC4824126/ /pubmed/26656932 http://dx.doi.org/10.1093/bioinformatics/btv724 Text en © The Author 2015. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title	Sequence database versioning for command line and Galaxy bioinformatics servers
topic	Applications Notes
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4824126/ https://www.ncbi.nlm.nih.gov/pubmed/26656932 http://dx.doi.org/10.1093/bioinformatics/btv724
