DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data
BACKGROUND: New technologies for analyzing biological samples, like next generation sequencing, are producing a growing amount of data together with quality scores. Moreover, software tools (e.g., for mapping sequence reads), calculating transcription factor binding probabilities, estimating epigene...
Main authors: | Nettling, Martin; Thieme, Nils; Both, Andreas; Grosse, Ivo |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3915617/ https://www.ncbi.nlm.nih.gov/pubmed/24495746 http://dx.doi.org/10.1186/1471-2105-15-38 |
_version_ | 1782302605695778816 |
---|---|
author | Nettling, Martin Thieme, Nils Both, Andreas Grosse, Ivo |
author_facet | Nettling, Martin Thieme, Nils Both, Andreas Grosse, Ivo |
author_sort | Nettling, Martin |
collection | PubMed |
description | BACKGROUND: New technologies for analyzing biological samples, like next-generation sequencing, are producing a growing amount of data together with quality scores. Moreover, software tools (e.g., for mapping sequence reads, calculating transcription factor binding probabilities, estimating epigenetic modification enriched regions, or determining single nucleotide polymorphisms) increase this amount of position-specific DNA-related data even further. Hence, requesting data becomes challenging and expensive and is often implemented using specialized hardware. In addition, picking specific data as fast as possible becomes increasingly important in many fields of science. The general problem of handling big data sets was addressed by developing specialized databases like HBase, HyperTable or Cassandra. However, these database solutions also require specialized or distributed hardware, leading to expensive investments. To the best of our knowledge, there is no database capable of (i) storing billions of position-specific DNA-related records, (ii) performing fast and resource-saving requests, and (iii) running on a single standard computer. RESULTS: Here, we present DRUMS (Disk Repository with Update Management and Select option), satisfying demands (i)-(iii). It tackles the weaknesses of traditional databases while handling position-specific DNA-related data in an efficient manner. DRUMS is capable of storing up to billions of records. Moreover, it focuses on optimizing related single lookups as range requests, which are needed constantly for computations in bioinformatics. To validate the power of DRUMS, we compare it to the widely used MySQL database. The test setting considers two biological data sets. We use standard desktop hardware as the test environment. CONCLUSIONS: DRUMS outperforms MySQL in writing and reading records by a factor of two up to a factor of 10000. Furthermore, it can work with significantly larger data sets. Our work focuses on mid-sized data sets up to several billion records without requiring cluster technology. Storing position-specific data is a general problem and the concept we present here is a generalized approach. Hence, it can be easily applied to other fields of bioinformatics. |
format | Online Article Text |
id | pubmed-3915617 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3915617 2014-02-20 DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data Nettling, Martin Thieme, Nils Both, Andreas Grosse, Ivo BMC Bioinformatics Software BACKGROUND: New technologies for analyzing biological samples, like next-generation sequencing, are producing a growing amount of data together with quality scores. Moreover, software tools (e.g., for mapping sequence reads, calculating transcription factor binding probabilities, estimating epigenetic modification enriched regions, or determining single nucleotide polymorphisms) increase this amount of position-specific DNA-related data even further. Hence, requesting data becomes challenging and expensive and is often implemented using specialized hardware. In addition, picking specific data as fast as possible becomes increasingly important in many fields of science. The general problem of handling big data sets was addressed by developing specialized databases like HBase, HyperTable or Cassandra. However, these database solutions also require specialized or distributed hardware, leading to expensive investments. To the best of our knowledge, there is no database capable of (i) storing billions of position-specific DNA-related records, (ii) performing fast and resource-saving requests, and (iii) running on a single standard computer. RESULTS: Here, we present DRUMS (Disk Repository with Update Management and Select option), satisfying demands (i)-(iii). It tackles the weaknesses of traditional databases while handling position-specific DNA-related data in an efficient manner. DRUMS is capable of storing up to billions of records. Moreover, it focuses on optimizing related single lookups as range requests, which are needed constantly for computations in bioinformatics. To validate the power of DRUMS, we compare it to the widely used MySQL database. The test setting considers two biological data sets. We use standard desktop hardware as the test environment. CONCLUSIONS: DRUMS outperforms MySQL in writing and reading records by a factor of two up to a factor of 10000. Furthermore, it can work with significantly larger data sets. Our work focuses on mid-sized data sets up to several billion records without requiring cluster technology. Storing position-specific data is a general problem and the concept we present here is a generalized approach. Hence, it can be easily applied to other fields of bioinformatics. BioMed Central 2014-02-04 /pmc/articles/PMC3915617/ /pubmed/24495746 http://dx.doi.org/10.1186/1471-2105-15-38 Text en Copyright © 2014 Nettling et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Software Nettling, Martin Thieme, Nils Both, Andreas Grosse, Ivo DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data |
title | DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data |
title_full | DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data |
title_fullStr | DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data |
title_full_unstemmed | DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data |
title_short | DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data |
title_sort | drums: disk repository with update management and select option for high throughput sequencing data |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3915617/ https://www.ncbi.nlm.nih.gov/pubmed/24495746 http://dx.doi.org/10.1186/1471-2105-15-38 |
work_keys_str_mv | AT nettlingmartin drumsdiskrepositorywithupdatemanagementandselectoptionforhighthroughputsequencingdata AT thiemenils drumsdiskrepositorywithupdatemanagementandselectoptionforhighthroughputsequencingdata AT bothandreas drumsdiskrepositorywithupdatemanagementandselectoptionforhighthroughputsequencingdata AT grosseivo drumsdiskrepositorywithupdatemanagementandselectoptionforhighthroughputsequencingdata |