Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity

The use of machine learning techniques for the construction of predictive models of disease outcomes (based on omics and other types of molecular data) has gained enormous relevance in the last few years in the biomedical field. Nonetheless, the virtuosity of omics studies and machine learning tools...

Bibliographic Details
Main authors: Torres-Martos, Álvaro; Bustos-Aibar, Mireia; Ramírez-Mena, Alberto; Cámara-Sánchez, Sofía; Anguita-Ruiz, Augusto; Alcalá, Rafael; Aguilera, Concepción M.; Alcalá-Fdez, Jesús
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9956296/
https://www.ncbi.nlm.nih.gov/pubmed/36833178
http://dx.doi.org/10.3390/genes14020248
author Torres-Martos, Álvaro
Bustos-Aibar, Mireia
Ramírez-Mena, Alberto
Cámara-Sánchez, Sofía
Anguita-Ruiz, Augusto
Alcalá, Rafael
Aguilera, Concepción M.
Alcalá-Fdez, Jesús
collection PubMed
description The use of machine learning techniques for the construction of predictive models of disease outcomes (based on omics and other types of molecular data) has gained enormous relevance in the last few years in the biomedical field. Nonetheless, the virtuosity of omics studies and machine learning tools is subject to the proper application of algorithms as well as the appropriate pre-processing and management of input omics and molecular data. Currently, many of the available approaches that use machine learning on omics data for predictive purposes make mistakes in several of the following key steps: experimental design, feature selection, data pre-processing, and algorithm selection. For this reason, we propose the current work as a guideline on how to confront the main challenges inherent to multi-omics human data. As such, a series of best practices and recommendations are also presented for each of the steps defined. In particular, the main particularities of each omics data layer, the most suitable preprocessing approaches for each source, and a compilation of best practices and tips for the study of disease development prediction using machine learning are described. Using examples of real data, we show how to address the key problems mentioned in multi-omics research (e.g., biological heterogeneity, technical noise, high dimensionality, presence of missing values, and class imbalance). Finally, we define the proposals for model improvement based on the results found, which serve as the bases for future work.
format Online
Article
Text
id pubmed-9956296
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-99562962023-02-25 Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity Torres-Martos, Álvaro Bustos-Aibar, Mireia Ramírez-Mena, Alberto Cámara-Sánchez, Sofía Anguita-Ruiz, Augusto Alcalá, Rafael Aguilera, Concepción M. Alcalá-Fdez, Jesús Genes (Basel) Article The use of machine learning techniques for the construction of predictive models of disease outcomes (based on omics and other types of molecular data) has gained enormous relevance in the last few years in the biomedical field. Nonetheless, the virtuosity of omics studies and machine learning tools are subject to the proper application of algorithms as well as the appropriate pre-processing and management of input omics and molecular data. Currently, many of the available approaches that use machine learning on omics data for predictive purposes make mistakes in several of the following key steps: experimental design, feature selection, data pre-processing, and algorithm selection. For this reason, we propose the current work as a guideline on how to confront the main challenges inherent to multi-omics human data. As such, a series of best practices and recommendations are also presented for each of the steps defined. In particular, the main particularities of each omics data layer, the most suitable preprocessing approaches for each source, and a compilation of best practices and tips for the study of disease development prediction using machine learning are described. Using examples of real data, we show how to address the key problems mentioned in multi-omics research (e.g., biological heterogeneity, technical noise, high dimensionality, presence of missing values, and class imbalance). Finally, we define the proposals for model improvement based on the results found, which serve as the bases for future work. MDPI 2023-01-18 /pmc/articles/PMC9956296/ /pubmed/36833178 http://dx.doi.org/10.3390/genes14020248 Text en © 2023 by the authors. 
https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9956296/
https://www.ncbi.nlm.nih.gov/pubmed/36833178
http://dx.doi.org/10.3390/genes14020248