
Simultaneous clustering of gene expression data with clinical chemistry and pathological evaluations reveals phenotypic prototypes

BACKGROUND: Commonly employed clustering methods for analysis of gene expression data do not directly incorporate phenotypic data about the samples. Furthermore, clustering of samples with known phenotypes is typically performed in an informal fashion. The inability of clustering algorithms to incor...


Bibliographic Details
Main authors: Bushel, Pierre R; Wolfinger, Russell D; Gibson, Greg
Format: Text
Language: English
Published: BioMed Central 2007
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1839893/
https://www.ncbi.nlm.nih.gov/pubmed/17408499
http://dx.doi.org/10.1186/1752-0509-1-15
author Bushel, Pierre R
Wolfinger, Russell D
Gibson, Greg
collection PubMed
description BACKGROUND: Commonly employed clustering methods for analysis of gene expression data do not directly incorporate phenotypic data about the samples. Furthermore, clustering of samples with known phenotypes is typically performed in an informal fashion. The inability of clustering algorithms to incorporate biological data in the grouping process can limit proper interpretation of the data and its underlying biology. RESULTS: We present a more formal approach, the modk-prototypes algorithm, for clustering biological samples by simultaneously considering microarray gene expression data and classes of known phenotypic variables such as clinical chemistry evaluations and histopathologic observations. The strategy involves constructing an objective function that measures the dissimilarity of the samples, using the sum of the squared Euclidean distances for numeric microarray and clinical chemistry data and simple matching for categorical histopathology values. Separate weighting terms are used for microarray, clinical chemistry and histopathology measurements to control the influence of each data domain on the clustering of the samples. The dynamic validity index for numeric data was modified with a category utility measure for determining the number of clusters in the data sets. A cluster's prototype, formed from the mean of the values for numeric features and the mode of the categorical values of all the samples in the group, is representative of the phenotype of the cluster members. The approach is shown to work well with a simulated mixed data set and two real data examples containing numeric and categorical data types: one from a heart disease study and another from acetaminophen (an analgesic) exposure in rat liver that causes centrilobular necrosis.
CONCLUSION: The modk-prototypes algorithm partitioned the simulated data into clusters with samples in their respective class group, and the heart disease samples into two groups (sick and buff, denoting samples having pain type representative of angina and non-angina, respectively) with an accuracy of 79%. This is on par with, or better than, the assignment accuracy of the heart disease samples by several well-known and successful clustering algorithms. Following modk-prototypes clustering of the acetaminophen-exposed samples, informative genes from the cluster prototypes were identified that are descriptive of, and phenotypically anchored to, levels of necrosis of the centrilobular region of the rat liver. The biological processes cell growth and/or maintenance, amine metabolism, and stress response were shown to distinguish between no and moderate levels of acetaminophen-induced centrilobular necrosis. The use of well-known and traditional measurements directly in the clustering provides some guarantee that the resulting clusters will be meaningfully interpretable.
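The dissimilarity measure and cluster prototype described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary layout of a sample (`expr`, `chem`, `histo`), the weight parameter names, and the default weights of 1.0 are all assumptions made for illustration, and the abstract does not specify details such as data normalization or how the weights are chosen.

```python
from collections import Counter


def mixed_dissimilarity(sample, prototype, w_expr=1.0, w_chem=1.0, w_histo=1.0):
    """Weighted dissimilarity between a sample and a cluster prototype.

    Both arguments are dicts (hypothetical layout) with:
      'expr'  - numeric microarray values
      'chem'  - numeric clinical chemistry values
      'histo' - categorical histopathology labels
    """
    def squared_euclidean(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Simple matching: count the categorical features that disagree.
    mismatches = sum(
        1 for x, y in zip(sample["histo"], prototype["histo"]) if x != y
    )
    return (
        w_expr * squared_euclidean(sample["expr"], prototype["expr"])
        + w_chem * squared_euclidean(sample["chem"], prototype["chem"])
        + w_histo * mismatches
    )


def cluster_prototype(samples):
    """Prototype of a cluster: mean of numeric features, mode of categorical ones."""
    n = len(samples)

    def column_means(key):
        return [sum(col) / n for col in zip(*(s[key] for s in samples))]

    def column_modes(key):
        return [
            Counter(col).most_common(1)[0][0]
            for col in zip(*(s[key] for s in samples))
        ]

    return {
        "expr": column_means("expr"),
        "chem": column_means("chem"),
        "histo": column_modes("histo"),
    }


# Made-up toy cluster of three samples (values are illustrative only).
samples = [
    {"expr": [1.0, 2.0], "chem": [10.0], "histo": ["none"]},
    {"expr": [3.0, 4.0], "chem": [20.0], "histo": ["none"]},
    {"expr": [2.0, 3.0], "chem": [30.0], "histo": ["moderate"]},
]
prototype = cluster_prototype(samples)
distance = mixed_dissimilarity(samples[2], prototype)
```

In a k-prototypes style loop, each sample would be assigned to the cluster whose prototype gives the smallest dissimilarity, and the prototypes would then be recomputed; raising or lowering one of the three weights changes how strongly that data domain drives the assignment.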
format Text
id pubmed-1839893
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1839893 2007-04-02 BMC Syst Biol Research Article BioMed Central 2007-02-23 /pmc/articles/PMC1839893/ /pubmed/17408499 http://dx.doi.org/10.1186/1752-0509-1-15 Text en Copyright © 2007 Bushel et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Simultaneous clustering of gene expression data with clinical chemistry and pathological evaluations reveals phenotypic prototypes
topic Research Article