Bioprocess data mining using regularized regression and random forests
BACKGROUND: In bioprocess development, the needs of data analysis include (1) getting an overview of existing data sets, (2) identifying primary control parameters, (3) determining a useful control direction, and (4) planning future experiments. In particular, the integration of multiple data sets caus...
Main Authors: | Hassan, Syeda Sakira; Farhan, Muhammad; Mangayil, Rahul; Huttunen, Heikki; Aho, Tommi |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2013 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3750505/ https://www.ncbi.nlm.nih.gov/pubmed/24268049 http://dx.doi.org/10.1186/1752-0509-7-S1-S5 |
_version_ | 1782281428459847680 |
---|---|
author | Hassan, Syeda Sakira Farhan, Muhammad Mangayil, Rahul Huttunen, Heikki Aho, Tommi |
author_facet | Hassan, Syeda Sakira Farhan, Muhammad Mangayil, Rahul Huttunen, Heikki Aho, Tommi |
author_sort | Hassan, Syeda Sakira |
collection | PubMed |
description | BACKGROUND: In bioprocess development, the needs of data analysis include (1) getting an overview of existing data sets, (2) identifying primary control parameters, (3) determining a useful control direction, and (4) planning future experiments. In particular, when multiple data sets are integrated, these needs cannot be properly addressed by regression models that assume a linear input-output relationship or unimodality of the response function. Regularized regression and random forests, on the other hand, have several properties that may prove important in this context. For example, they can handle a small number of samples relative to the number of variables, perform feature selection, and visualize response surfaces to present the prediction results in an illustrative way. RESULTS: In this work, the applicability of regularized regression (Lasso) and random forests (RF) in bioprocess data mining was examined, and their performance was benchmarked against multiple linear regression. As an example, we used data from a culture media optimization study for microbial hydrogen production. All three methods were capable of providing a significant model when the five variables of the culture media optimization were included linearly in the modeling. However, multiple linear regression failed when the products and squares of the variables were also included. In this case, the modeling was still successful with Lasso (correlation between the observed and predicted yield was 0.69) and RF (0.91). CONCLUSION: We found that both regularized regression and random forests were able to produce feasible models, and the latter was efficient in capturing the non-linearity in the data. In this kind of bioprocess data mining task, both methods outperform multiple linear regression. |
format | Online Article Text |
id | pubmed-3750505 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3750505 2013-08-27 Bioprocess data mining using regularized regression and random forests Hassan, Syeda Sakira Farhan, Muhammad Mangayil, Rahul Huttunen, Heikki Aho, Tommi BMC Syst Biol Research BACKGROUND: In bioprocess development, the needs of data analysis include (1) getting an overview of existing data sets, (2) identifying primary control parameters, (3) determining a useful control direction, and (4) planning future experiments. In particular, when multiple data sets are integrated, these needs cannot be properly addressed by regression models that assume a linear input-output relationship or unimodality of the response function. Regularized regression and random forests, on the other hand, have several properties that may prove important in this context. For example, they can handle a small number of samples relative to the number of variables, perform feature selection, and visualize response surfaces to present the prediction results in an illustrative way. RESULTS: In this work, the applicability of regularized regression (Lasso) and random forests (RF) in bioprocess data mining was examined, and their performance was benchmarked against multiple linear regression. As an example, we used data from a culture media optimization study for microbial hydrogen production. All three methods were capable of providing a significant model when the five variables of the culture media optimization were included linearly in the modeling. However, multiple linear regression failed when the products and squares of the variables were also included. In this case, the modeling was still successful with Lasso (correlation between the observed and predicted yield was 0.69) and RF (0.91). CONCLUSION: We found that both regularized regression and random forests were able to produce feasible models, and the latter was efficient in capturing the non-linearity in the data. In this kind of bioprocess data mining task, both methods outperform multiple linear regression. BioMed Central 2013-08-12 /pmc/articles/PMC3750505/ /pubmed/24268049 http://dx.doi.org/10.1186/1752-0509-7-S1-S5 Text en Copyright © 2013 Hassan et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Hassan, Syeda Sakira Farhan, Muhammad Mangayil, Rahul Huttunen, Heikki Aho, Tommi Bioprocess data mining using regularized regression and random forests |
title | Bioprocess data mining using regularized regression and random forests |
title_full | Bioprocess data mining using regularized regression and random forests |
title_fullStr | Bioprocess data mining using regularized regression and random forests |
title_full_unstemmed | Bioprocess data mining using regularized regression and random forests |
title_short | Bioprocess data mining using regularized regression and random forests |
title_sort | bioprocess data mining using regularized regression and random forests |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3750505/ https://www.ncbi.nlm.nih.gov/pubmed/24268049 http://dx.doi.org/10.1186/1752-0509-7-S1-S5 |
work_keys_str_mv | AT hassansyedasakira bioprocessdataminingusingregularizedregressionandrandomforests AT farhanmuhammad bioprocessdataminingusingregularizedregressionandrandomforests AT mangayilrahul bioprocessdataminingusingregularizedregressionandrandomforests AT huttunenheikki bioprocessdataminingusingregularizedregressionandrandomforests AT ahotommi bioprocessdataminingusingregularizedregressionandrandomforests |
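
The description field mentions two steps that a short sketch may make more concrete: extending the five culture-media variables with their pairwise products and squares, and scoring a model by the correlation between observed and predicted yield. The record does not say how the authors implemented either step, so everything below is an assumption for illustration: the function names `expand_quadratic` and `pearson` are hypothetical, the correlation is assumed to be Pearson's, and the sample values are made up.

```python
from itertools import combinations
from math import sqrt


def expand_quadratic(row):
    """Return the linear terms, pairwise products, and squares of one sample."""
    linear = list(row)
    products = [a * b for a, b in combinations(row, 2)]
    squares = [a * a for a in row]
    return linear + products + squares


def pearson(observed, predicted):
    """Pearson correlation between two equal-length, non-constant sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(var_o * var_p)


# Hypothetical sample with five medium components (made-up values).
sample = [1.0, 2.0, 0.5, 3.0, 4.0]
features = expand_quadratic(sample)
print(len(features))  # 5 linear + 10 products + 5 squares = 20
```

Under these assumptions, five input variables grow to 20 model terms, which is the setting in which the description reports that multiple linear regression failed while Lasso and RF still produced usable models.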