
Scalable transcriptomics analysis with Dask: applications in data science and machine learning

BACKGROUND: Gene expression studies are an important tool in biological and biomedical research. The signal carried in expression profiles helps derive signatures for the prediction, diagnosis and prognosis of different diseases. Data science and specifically machine learning have many applications...


Bibliographic Details
Main authors: Moreno, Marta, Vilaça, Ricardo, Ferreira, Pedro G.
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9710082/
https://www.ncbi.nlm.nih.gov/pubmed/36451115
http://dx.doi.org/10.1186/s12859-022-05065-3
author Moreno, Marta
Vilaça, Ricardo
Ferreira, Pedro G.
collection PubMed
description BACKGROUND: Gene expression studies are an important tool in biological and biomedical research. The signal carried in expression profiles helps derive signatures for the prediction, diagnosis and prognosis of different diseases. Data science and specifically machine learning have many applications in gene expression analysis. However, as the dimensionality of genomics datasets grows, scalable solutions become necessary. METHODS: In this paper we review the main steps and bottlenecks in machine learning pipelines, as well as the main concepts behind scalable data science including those of concurrent and parallel programming. We discuss the benefits of the Dask framework and how it can be integrated with the Python scientific environment to perform data analysis in computational biology and bioinformatics. RESULTS: This review illustrates the role of Dask for boosting data science applications in different case studies. Detailed documentation and code on these procedures is made available at https://github.com/martaccmoreno/gexp-ml-dask. CONCLUSION: By showing when and how Dask can be used in transcriptomics analysis, this review will serve as an entry point to help genomic data scientists develop more scalable data analysis procedures.
format Online
Article
Text
id pubmed-9710082
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-9710082 2022-12-01 Scalable transcriptomics analysis with Dask: applications in data science and machine learning Moreno, Marta Vilaça, Ricardo Ferreira, Pedro G. BMC Bioinformatics Research Article BACKGROUND: Gene expression studies are an important tool in biological and biomedical research. The signal carried in expression profiles helps derive signatures for the prediction, diagnosis and prognosis of different diseases. Data science and specifically machine learning have many applications in gene expression analysis. However, as the dimensionality of genomics datasets grows, scalable solutions become necessary. METHODS: In this paper we review the main steps and bottlenecks in machine learning pipelines, as well as the main concepts behind scalable data science including those of concurrent and parallel programming. We discuss the benefits of the Dask framework and how it can be integrated with the Python scientific environment to perform data analysis in computational biology and bioinformatics. RESULTS: This review illustrates the role of Dask for boosting data science applications in different case studies. Detailed documentation and code on these procedures is made available at https://github.com/martaccmoreno/gexp-ml-dask. CONCLUSION: By showing when and how Dask can be used in transcriptomics analysis, this review will serve as an entry point to help genomic data scientists develop more scalable data analysis procedures.
BioMed Central 2022-11-30 /pmc/articles/PMC9710082/ /pubmed/36451115 http://dx.doi.org/10.1186/s12859-022-05065-3 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Scalable transcriptomics analysis with Dask: applications in data science and machine learning
topic Research Article