Mining gene expression data by interpreting principal components

BACKGROUND: There are many methods for analyzing microarray data that group together genes having similar patterns of expression over all conditions tested. However, in many instances the biologically important goal is to identify relatively small sets of genes that share coherent expression across...

Bibliographic Details
Main Authors:	Roden, Joseph C; King, Brandon W; Trout, Diane; Mortazavi, Ali; Wold, Barbara J; Hart, Christopher E
Format:	Text
Language:	English
Published:	BioMed Central 2006
Subjects:	Software
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1501050/ https://www.ncbi.nlm.nih.gov/pubmed/16600052 http://dx.doi.org/10.1186/1471-2105-7-194

author	Roden, Joseph C; King, Brandon W; Trout, Diane; Mortazavi, Ali; Wold, Barbara J; Hart, Christopher E
collection	PubMed
description	BACKGROUND: There are many methods for analyzing microarray data that group together genes having similar patterns of expression over all conditions tested. However, in many instances the biologically important goal is to identify relatively small sets of genes that share coherent expression across only some conditions, rather than all or most conditions as required in traditional clustering; e.g. genes that are highly up-regulated and/or down-regulated similarly across only a subset of conditions. Equally important is the need to learn which conditions are the decisive ones in forming such gene sets of interest, and how they relate to diverse conditional covariates, such as disease diagnosis or prognosis. RESULTS: We present a method for automatically identifying such candidate sets of biologically relevant genes using a combination of principal components analysis and information theoretic metrics. To enable easy use of our methods, we have developed a data analysis package that facilitates visualization and subsequent data mining of the independent sources of significant variation present in gene microarray expression datasets (or in any other similarly structured high-dimensional dataset). We applied these tools to two public datasets, and highlight sets of genes most affected by specific subsets of conditions (e.g. tissues, treatments, samples, etc.). Statistically significant associations for highlighted gene sets were shown via global analysis for Gene Ontology term enrichment. Together with covariate associations, the tool provides a basis for building testable hypotheses about the biological or experimental causes of observed variation. CONCLUSION: We provide an unsupervised data mining technique for diverse microarray expression datasets that is distinct from major methods now in routine use. In test uses, this method, based on publicly available gene annotations, appears to identify numerous sets of biologically relevant genes. 
It has proven especially valuable in instances where there are many diverse conditions (10's to hundreds of different tissues or cell types), a situation in which many clustering and ordering algorithms become problematic. This approach also shows promise in other topic domains such as multi-spectral imaging datasets.
format	Text
id	pubmed-1501050
institution	National Center for Biotechnology Information
language	English
publishDate	2006
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-15010502006-07-13 Mining gene expression data by interpreting principal components Roden, Joseph C King, Brandon W Trout, Diane Mortazavi, Ali Wold, Barbara J Hart, Christopher E BMC Bioinformatics Software BACKGROUND: There are many methods for analyzing microarray data that group together genes having similar patterns of expression over all conditions tested. However, in many instances the biologically important goal is to identify relatively small sets of genes that share coherent expression across only some conditions, rather than all or most conditions as required in traditional clustering; e.g. genes that are highly up-regulated and/or down-regulated similarly across only a subset of conditions. Equally important is the need to learn which conditions are the decisive ones in forming such gene sets of interest, and how they relate to diverse conditional covariates, such as disease diagnosis or prognosis. RESULTS: We present a method for automatically identifying such candidate sets of biologically relevant genes using a combination of principal components analysis and information theoretic metrics. To enable easy use of our methods, we have developed a data analysis package that facilitates visualization and subsequent data mining of the independent sources of significant variation present in gene microarray expression datasets (or in any other similarly structured high-dimensional dataset). We applied these tools to two public datasets, and highlight sets of genes most affected by specific subsets of conditions (e.g. tissues, treatments, samples, etc.). Statistically significant associations for highlighted gene sets were shown via global analysis for Gene Ontology term enrichment. Together with covariate associations, the tool provides a basis for building testable hypotheses about the biological or experimental causes of observed variation. 
CONCLUSION: We provide an unsupervised data mining technique for diverse microarray expression datasets that is distinct from major methods now in routine use. In test uses, this method, based on publicly available gene annotations, appears to identify numerous sets of biologically relevant genes. It has proven especially valuable in instances where there are many diverse conditions (10's to hundreds of different tissues or cell types), a situation in which many clustering and ordering algorithms become problematic. This approach also shows promise in other topic domains such as multi-spectral imaging datasets. BioMed Central 2006-04-07 /pmc/articles/PMC1501050/ /pubmed/16600052 http://dx.doi.org/10.1186/1471-2105-7-194 Text en Copyright © 2006 Roden et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Mining gene expression data by interpreting principal components
topic	Software
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1501050/ https://www.ncbi.nlm.nih.gov/pubmed/16600052 http://dx.doi.org/10.1186/1471-2105-7-194
