PyBDA: a command line tool for automated analysis of big biological data sets

BACKGROUND: Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to lack of accessible tools that scale to hundreds of millions of data points. RESULTS: We developed a novel machine learning command line tool called PyBDA fo...

Bibliographic Details
Main authors: Dirmeier, Simon, Emmenlauer, Mario, Dehio, Christoph, Beerenwinkel, Niko
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6849186/
https://www.ncbi.nlm.nih.gov/pubmed/31718539
http://dx.doi.org/10.1186/s12859-019-3087-8
author Dirmeier, Simon
Emmenlauer, Mario
Dehio, Christoph
Beerenwinkel, Niko
collection PubMed
description BACKGROUND: Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to a lack of accessible tools that scale to hundreds of millions of data points. RESULTS: We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake to automatically schedule jobs to a high-performance computing cluster. We demonstrate the utility of the software by analyzing image-based RNA interference data of 150 million single cells. CONCLUSION: PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used entirely through simple command line calls, making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io.
format Online
Article
Text
id pubmed-6849186
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6849186 2019-11-15 PyBDA: a command line tool for automated analysis of big biological data sets Dirmeier, Simon Emmenlauer, Mario Dehio, Christoph Beerenwinkel, Niko BMC Bioinformatics Software BACKGROUND: Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to a lack of accessible tools that scale to hundreds of millions of data points. RESULTS: We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake to automatically schedule jobs to a high-performance computing cluster. We demonstrate the utility of the software by analyzing image-based RNA interference data of 150 million single cells. CONCLUSION: PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used entirely through simple command line calls, making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io. BioMed Central 2019-11-12 /pmc/articles/PMC6849186/ /pubmed/31718539 http://dx.doi.org/10.1186/s12859-019-3087-8 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title PyBDA: a command line tool for automated analysis of big biological data sets
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6849186/
https://www.ncbi.nlm.nih.gov/pubmed/31718539
http://dx.doi.org/10.1186/s12859-019-3087-8