
DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data

Bibliographic Details
Main Authors: Nettling, Martin; Thieme, Nils; Both, Andreas; Grosse, Ivo
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects: Software
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3915617/
https://www.ncbi.nlm.nih.gov/pubmed/24495746
http://dx.doi.org/10.1186/1471-2105-15-38
_version_ 1782302605695778816
author Nettling, Martin
Thieme, Nils
Both, Andreas
Grosse, Ivo
author_facet Nettling, Martin
Thieme, Nils
Both, Andreas
Grosse, Ivo
author_sort Nettling, Martin
collection PubMed
description BACKGROUND: New technologies for analyzing biological samples, such as next-generation sequencing, are producing a growing amount of data together with quality scores. Moreover, software tools (e.g., for mapping sequence reads, calculating transcription factor binding probabilities, estimating epigenetic modification enriched regions, or determining single nucleotide polymorphisms) increase this amount of position-specific DNA-related data even further. Hence, requesting data becomes challenging and expensive and is often implemented using specialized hardware. In addition, retrieving specific data as fast as possible is becoming increasingly important in many fields of science. The general problem of handling big data sets has been addressed by developing specialized databases such as HBase, HyperTable or Cassandra. However, these database solutions also require specialized or distributed hardware, leading to expensive investments. To the best of our knowledge, there is no database capable of (i) storing billions of position-specific DNA-related records, (ii) performing fast and resource-saving requests, and (iii) running on a single standard computer.
RESULTS: Here, we present DRUMS (Disk Repository with Update Management and Select option), which satisfies demands (i)-(iii). It tackles the weaknesses of traditional databases while handling position-specific DNA-related data efficiently. DRUMS is capable of storing up to billions of records. Moreover, it focuses on optimizing related single lookups as range requests, which are needed constantly for computations in bioinformatics. To validate the power of DRUMS, we compare it to the widely used MySQL database. The test setting considers two biological data sets. We use standard desktop hardware as the test environment.
CONCLUSIONS: DRUMS outperforms MySQL in writing and reading records by a factor of 2 to 10,000. Furthermore, it can work with significantly larger data sets. Our work focuses on mid-sized data sets of up to several billion records without requiring cluster technology. Storing position-specific data is a general problem, and the concept we present here is a generalized approach. Hence, it can be easily applied to other fields of bioinformatics.
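The single lookups and range requests mentioned in the abstract can be illustrated with a small sketch. This is not the DRUMS implementation or its API: the class `PositionStore`, its methods and the sample records are hypothetical, and the store is kept in memory rather than on disk. It only shows the general idea of keeping position-specific records sorted by (chromosome, position) so that both a single lookup and a range request reduce to a binary search.

```python
from bisect import bisect_left, bisect_right


class PositionStore:
    """Toy in-memory store of (chromosome, position) -> value records.

    Hypothetical illustration only; not the DRUMS data structure.
    """

    def __init__(self):
        self._keys = []    # sorted list of (chromosome, position) tuples
        self._values = []  # values aligned with self._keys

    def update(self, records):
        """Insert or overwrite a batch of ((chromosome, position), value) pairs."""
        merged = dict(zip(self._keys, self._values))
        merged.update(records)
        self._keys = sorted(merged)
        self._values = [merged[k] for k in self._keys]

    def select(self, chromosome, position):
        """Single lookup: return the value at one position, or None if absent."""
        key = (chromosome, position)
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def select_range(self, chromosome, start, end):
        """Range request: return all records with start <= position <= end."""
        lo = bisect_left(self._keys, (chromosome, start))
        hi = bisect_right(self._keys, (chromosome, end))
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))


store = PositionStore()
store.update([(("chr1", 150), 0.9), (("chr1", 100), 0.2), (("chr2", 100), 0.5)])
store.update([(("chr1", 100), 0.3)])  # overwrites the earlier value at chr1:100
print(store.select("chr1", 100))
print(store.select_range("chr1", 100, 200))
```

Because the keys are sorted, neighbouring positions sit next to each other, which is why a batch of related single lookups can be served as one contiguous range read in a layout of this kind.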
format Online
Article
Text
id pubmed-3915617
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3915617 2014-02-20 DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data Nettling, Martin Thieme, Nils Both, Andreas Grosse, Ivo BMC Bioinformatics Software BioMed Central 2014-02-04 /pmc/articles/PMC3915617/ /pubmed/24495746 http://dx.doi.org/10.1186/1471-2105-15-38 Text en Copyright © 2014 Nettling et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Software
Nettling, Martin
Thieme, Nils
Both, Andreas
Grosse, Ivo
DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data
title DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data
title_full DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data
title_fullStr DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data
title_full_unstemmed DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data
title_short DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data
title_sort drums: disk repository with update management and select option for high throughput sequencing data
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3915617/
https://www.ncbi.nlm.nih.gov/pubmed/24495746
http://dx.doi.org/10.1186/1471-2105-15-38
work_keys_str_mv AT nettlingmartin drumsdiskrepositorywithupdatemanagementandselectoptionforhighthroughputsequencingdata
AT thiemenils drumsdiskrepositorywithupdatemanagementandselectoptionforhighthroughputsequencingdata
AT bothandreas drumsdiskrepositorywithupdatemanagementandselectoptionforhighthroughputsequencingdata
AT grosseivo drumsdiskrepositorywithupdatemanagementandselectoptionforhighthroughputsequencingdata