Scalable transcriptomics analysis with Dask: applications in data science and machine learning
BACKGROUND: Gene expression studies are an important tool in biological and biomedical research. The signal carried in expression profiles helps derive signatures for the prediction, diagnosis and prognosis of different diseases. Data science and specifically machine learning have many applications...
Main Authors: | Moreno, Marta; Vilaça, Ricardo; Ferreira, Pedro G. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9710082/ https://www.ncbi.nlm.nih.gov/pubmed/36451115 http://dx.doi.org/10.1186/s12859-022-05065-3 |
_version_ | 1784841293001654272 |
---|---|
author | Moreno, Marta Vilaça, Ricardo Ferreira, Pedro G. |
author_facet | Moreno, Marta Vilaça, Ricardo Ferreira, Pedro G. |
author_sort | Moreno, Marta |
collection | PubMed |
description | BACKGROUND: Gene expression studies are an important tool in biological and biomedical research. The signal carried in expression profiles helps derive signatures for the prediction, diagnosis and prognosis of different diseases. Data science and specifically machine learning have many applications in gene expression analysis. However, as the dimensionality of genomics datasets grows, scalable solutions become necessary. METHODS: In this paper we review the main steps and bottlenecks in machine learning pipelines, as well as the main concepts behind scalable data science including those of concurrent and parallel programming. We discuss the benefits of the Dask framework and how it can be integrated with the Python scientific environment to perform data analysis in computational biology and bioinformatics. RESULTS: This review illustrates the role of Dask for boosting data science applications in different case studies. Detailed documentation and code on these procedures is made available at https://github.com/martaccmoreno/gexp-ml-dask. CONCLUSION: By showing when and how Dask can be used in transcriptomics analysis, this review will serve as an entry point to help genomic data scientists develop more scalable data analysis procedures. |
format | Online Article Text |
id | pubmed-9710082 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-9710082 2022-12-01 Scalable transcriptomics analysis with Dask: applications in data science and machine learning Moreno, Marta Vilaça, Ricardo Ferreira, Pedro G. BMC Bioinformatics Research Article BACKGROUND: Gene expression studies are an important tool in biological and biomedical research. The signal carried in expression profiles helps derive signatures for the prediction, diagnosis and prognosis of different diseases. Data science and specifically machine learning have many applications in gene expression analysis. However, as the dimensionality of genomics datasets grows, scalable solutions become necessary. METHODS: In this paper we review the main steps and bottlenecks in machine learning pipelines, as well as the main concepts behind scalable data science including those of concurrent and parallel programming. We discuss the benefits of the Dask framework and how it can be integrated with the Python scientific environment to perform data analysis in computational biology and bioinformatics. RESULTS: This review illustrates the role of Dask for boosting data science applications in different case studies. Detailed documentation and code on these procedures is made available at https://github.com/martaccmoreno/gexp-ml-dask. CONCLUSION: By showing when and how Dask can be used in transcriptomics analysis, this review will serve as an entry point to help genomic data scientists develop more scalable data analysis procedures. 
BioMed Central 2022-11-30 /pmc/articles/PMC9710082/ /pubmed/36451115 http://dx.doi.org/10.1186/s12859-022-05065-3 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Article Moreno, Marta Vilaça, Ricardo Ferreira, Pedro G. Scalable transcriptomics analysis with Dask: applications in data science and machine learning |
title | Scalable transcriptomics analysis with Dask: applications in data science and machine learning |
title_full | Scalable transcriptomics analysis with Dask: applications in data science and machine learning |
title_fullStr | Scalable transcriptomics analysis with Dask: applications in data science and machine learning |
title_full_unstemmed | Scalable transcriptomics analysis with Dask: applications in data science and machine learning |
title_short | Scalable transcriptomics analysis with Dask: applications in data science and machine learning |
title_sort | scalable transcriptomics analysis with dask: applications in data science and machine learning |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9710082/ https://www.ncbi.nlm.nih.gov/pubmed/36451115 http://dx.doi.org/10.1186/s12859-022-05065-3 |
work_keys_str_mv | AT morenomarta scalabletranscriptomicsanalysiswithdaskapplicationsindatascienceandmachinelearning AT vilacaricardo scalabletranscriptomicsanalysiswithdaskapplicationsindatascienceandmachinelearning AT ferreirapedrog scalabletranscriptomicsanalysiswithdaskapplicationsindatascienceandmachinelearning |