The identification of informative genes from multiple datasets with increasing complexity

BACKGROUND: In microarray data analysis, factors such as data quality, biological variation, and the increasingly multi-layered nature of more complex biological systems complicate the modelling of regulatory networks that can represent and capture the interactions among genes. We believe that the...

Bibliographic Details
Main authors:	Anvar, S Yahya; 't Hoen, Peter AC; Tucker, Allan
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Research article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2822754/ https://www.ncbi.nlm.nih.gov/pubmed/20078860 http://dx.doi.org/10.1186/1471-2105-11-32

author	Anvar, S Yahya 't Hoen, Peter AC Tucker, Allan
author_sort	Anvar, S Yahya
collection	PubMed
description	BACKGROUND: In microarray data analysis, factors such as data quality, biological variation, and the increasingly multi-layered nature of more complex biological systems complicate the modelling of regulatory networks that can represent and capture the interactions among genes. We believe that the use of multiple datasets derived from related biological systems leads to more robust models. Therefore, we developed a novel framework for modelling regulatory networks that involves training and evaluation on independent datasets. Our approach includes the following steps: (1) ordering the datasets based on their level of noise and informativeness; (2) selecting a Bayesian classifier with an appropriate level of complexity by evaluating predictive performance on independent datasets; (3) comparing the different gene selections and the influence of increasing the model complexity; (4) functional analysis of the informative genes. RESULTS: In this paper, we identify the most appropriate model complexity using cross-validation and independent test set validation for predicting gene expression in three published datasets related to myogenesis and muscle differentiation. Furthermore, we demonstrate that models trained on simpler datasets can be used to identify interactions among genes and select the most informative genes. We also show that these models can explain the myogenesis-related genes (genes of interest) significantly better than others (P < 0.004) since the improvement in their rankings is much more pronounced. Finally, after further evaluating our results on synthetic datasets, we show that our approach outperforms a concordance method by Lai et al. in identifying informative genes from multiple datasets with increasing complexity whilst additionally modelling the interaction between genes.
CONCLUSIONS: We show that Bayesian networks derived from simpler controlled systems have better performance than those trained on datasets from more complex biological systems. Further, we show that genes from the pool of differentially expressed genes that are highly predictive and consistent across independent datasets are more likely to be fundamentally involved in the biological process under study. We conclude that networks trained on simpler controlled systems, such as in vitro experiments, can be used to model and capture interactions among genes in more complex datasets, such as in vivo experiments, where these interactions would otherwise be concealed by a multitude of other ongoing events.
format	Text
id	pubmed-2822754
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2822754 2010-02-17 The identification of informative genes from multiple datasets with increasing complexity Anvar, S Yahya 't Hoen, Peter AC Tucker, Allan BMC Bioinformatics Research article BioMed Central 2010-01-15 /pmc/articles/PMC2822754/ /pubmed/20078860 http://dx.doi.org/10.1186/1471-2105-11-32 Text en Copyright ©2010 Anvar et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	The identification of informative genes from multiple datasets with increasing complexity
topic	Research article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2822754/ https://www.ncbi.nlm.nih.gov/pubmed/20078860 http://dx.doi.org/10.1186/1471-2105-11-32
