
Bioprocess data mining using regularized regression and random forests

BACKGROUND: In bioprocess development, the needs of data analysis include (1) getting an overview of existing data sets, (2) identifying primary control parameters, (3) determining a useful control direction, and (4) planning future experiments. In particular, the integration of multiple data sets caus...


Bibliographic Details
Main Authors: Hassan, Syeda Sakira, Farhan, Muhammad, Mangayil, Rahul, Huttunen, Heikki, Aho, Tommi
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3750505/
https://www.ncbi.nlm.nih.gov/pubmed/24268049
http://dx.doi.org/10.1186/1752-0509-7-S1-S5
collection PubMed
description BACKGROUND: In bioprocess development, the needs of data analysis include (1) getting an overview of existing data sets, (2) identifying primary control parameters, (3) determining a useful control direction, and (4) planning future experiments. In particular, when multiple data sets are integrated, these needs cannot be properly addressed by regression models that assume a linear input-output relationship or a unimodal response function. Regularized regression and random forests, on the other hand, have several properties that may prove important in this context. They are capable of, e.g., handling a small number of samples relative to the number of variables, selecting features, and visualizing response surfaces so that the prediction results can be presented in an illustrative way. RESULTS: In this work, the applicability of regularized regression (Lasso) and random forests (RF) in bioprocess data mining was examined, and their performance was benchmarked against multiple linear regression. As an example, we used data from a culture media optimization study for microbial hydrogen production. All three methods were capable of providing a significant model when the five variables of the culture media optimization were included linearly in the model. However, multiple linear regression failed when the products and squares of the variables were also included. In this case, the modeling was still successful with Lasso (correlation between the observed and predicted yield was 0.69) and RF (0.91). CONCLUSION: We found that both regularized regression and random forests were able to produce feasible models, and the latter was efficient in capturing the non-linearity in the data. In this kind of bioprocess data mining task, both methods outperform multiple linear regression.
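The abstract mentions extending the five culture media variables with their products and squares. The record does not say how the authors built these terms, so the following is only a minimal sketch under that assumption; the function name `expand_quadratic` and the sample values are hypothetical and do not come from the paper.

```python
from itertools import combinations_with_replacement


def expand_quadratic(row):
    """Return the linear terms followed by all squares and pairwise products.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    linear = list(row)
    # combinations_with_replacement yields (a, a) pairs (squares)
    # as well as (a, b) pairs with a before b (cross products).
    second_order = [a * b for a, b in combinations_with_replacement(row, 2)]
    return linear + second_order


# Illustrative values for five media components (made up, not study data).
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
features = expand_quadratic(sample)
print(len(features))
```

Under this construction, five inputs become 5 linear terms, 5 squares, and 10 pairwise products. Ordinary least squares needs at least as many samples as coefficients to have a unique solution, which may be one reason a larger feature set is harder for multiple linear regression than for Lasso or random forests; the abstract itself does not state the sample count.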
format Online
Article
Text
id pubmed-3750505
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3750505 2013-08-27 Bioprocess data mining using regularized regression and random forests Hassan, Syeda Sakira Farhan, Muhammad Mangayil, Rahul Huttunen, Heikki Aho, Tommi BMC Syst Biol Research BioMed Central 2013-08-12 /pmc/articles/PMC3750505/ /pubmed/24268049 http://dx.doi.org/10.1186/1752-0509-7-S1-S5 Text en Copyright © 2013 Hassan et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Research