A heuristic approach to handling missing data in biologics manufacturing databases

The biologics sector has amassed a wealth of data in the past three decades, in line with the bioprocess development and manufacturing guidelines, and analysis of these data with precision is expected to reveal behavioural patterns in cell populations that can be used for making predictions on how f...


Bibliographic Details
Main Authors:	Mante, Jeanet, Gangadharan, Nishanthi, Sewell, David J., Turner, Richard, Field, Ray, Oliver, Stephen G., Slater, Nigel, Dikicioglu, Duygu
Format:	Online Article Text
Language:	English
Published:	Springer Berlin Heidelberg 2019
Subjects:	Rapid Communication
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6430751/ https://www.ncbi.nlm.nih.gov/pubmed/30617419 http://dx.doi.org/10.1007/s00449-018-02059-5

_version_	1783405808837459968
author	Mante, Jeanet Gangadharan, Nishanthi Sewell, David J. Turner, Richard Field, Ray Oliver, Stephen G. Slater, Nigel Dikicioglu, Duygu
author_sort	Mante, Jeanet
collection	PubMed
description	The biologics sector has amassed a wealth of data over the past three decades, in line with bioprocess development and manufacturing guidelines, and precise analysis of these data is expected to reveal behavioural patterns in cell populations that can be used to predict how future culture processes might behave. Historical bioprocessing data are likely to comprise experiments that were conducted with different cell lines, to produce different products, and possibly years apart; this causes inter-batch variability, and human- and instrument-associated technical oversights leave data points missing. These unavoidable complications necessitate a pre-processing step prior to data mining. This study investigated the efficiency of mean imputation and multivariate regression for filling in missing information in historical bio-manufacturing datasets, and evaluated their performance with symbolic regression models and Bayesian non-parametric models in subsequent data processing. Mean substitution was shown to be a simple and efficient imputation method for relatively smooth, non-dynamical datasets, whereas regression imputation was effective in dynamical datasets with less than 30% missing data, maintaining the existing standard deviation and shape of the distribution. The nature of the missing information, whether Missing Completely At Random, Missing At Random or Missing Not At Random, emerged as the key feature for selecting the imputation method.
format	Online Article Text
id	pubmed-6430751
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Springer Berlin Heidelberg
record_format	MEDLINE/PubMed
spelling	pubmed-6430751 2019-04-05 A heuristic approach to handling missing data in biologics manufacturing databases Mante, Jeanet Gangadharan, Nishanthi Sewell, David J. Turner, Richard Field, Ray Oliver, Stephen G. Slater, Nigel Dikicioglu, Duygu Bioprocess Biosyst Eng Rapid Communication
Springer Berlin Heidelberg 2019-01-08 2019 /pmc/articles/PMC6430751/ /pubmed/30617419 http://dx.doi.org/10.1007/s00449-018-02059-5 Text en © The Author(s) 2019 OpenAccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
title	A heuristic approach to handling missing data in biologics manufacturing databases
topic	Rapid Communication
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6430751/ https://www.ncbi.nlm.nih.gov/pubmed/30617419 http://dx.doi.org/10.1007/s00449-018-02059-5
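
The abstract in this record contrasts mean imputation with regression imputation. The two ideas can be sketched as below. This is a minimal illustration and not the authors' implementation: the function names, the `hours` and `titre` example values, and the use of a single predictor (the study refers to multivariate regression) are all assumptions made here for brevity.

```python
import math
from statistics import mean

NAN = float("nan")


def mean_impute(values):
    """Replace each missing value (NaN) with the mean of the observed values."""
    observed = [v for v in values if not math.isnan(v)]
    fill = mean(observed)
    return [fill if math.isnan(v) else v for v in values]


def regression_impute(x, y):
    """Fill missing y values from a least-squares line fitted on observed (x, y) pairs.

    Simplification (assumption): one fully observed predictor x, whereas the
    record describes multivariate regression.
    """
    pairs = [(a, b) for a, b in zip(x, y) if not math.isnan(b)]
    x_bar = mean(a for a, _ in pairs)
    y_bar = mean(b for _, b in pairs)
    sxx = sum((a - x_bar) ** 2 for a, _ in pairs)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in pairs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [intercept + slope * a if math.isnan(b) else b for a, b in zip(x, y)]


# Hypothetical example: a measurement that rises steadily over time,
# with two readings missing.
hours = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
titre = [1.0, NAN, 5.0, 7.0, NAN, 11.0]

by_mean = mean_impute(titre)
by_regression = regression_impute(hours, titre)
```

On a trending series like this one, mean imputation places every gap at the same central value and so flattens the trend, while the regression fill follows it. This is consistent with the abstract's distinction between smooth, non-dynamical datasets and dynamical ones.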
