Ensemble attribute profile clustering: discovering and characterizing groups of genes with similar patterns of biological features

BACKGROUND: Ensemble attribute profile clustering is a novel, text-based strategy for analyzing a user-defined list of genes and/or proteins. The strategy exploits annotation data present in gene-centered corpora and utilizes ideas from statistical information retrieval to discover and characterize...

Bibliographic Details
Main authors: Semeiks, JR; Rizki, A; Bissell, MJ; Mian, IS
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1435935/
https://www.ncbi.nlm.nih.gov/pubmed/16542449
http://dx.doi.org/10.1186/1471-2105-7-147
author Semeiks, JR
Rizki, A
Bissell, MJ
Mian, IS
collection PubMed
description BACKGROUND: Ensemble attribute profile clustering is a novel, text-based strategy for analyzing a user-defined list of genes and/or proteins. The strategy exploits annotation data present in gene-centered corpora and utilizes ideas from statistical information retrieval to discover and characterize properties shared by subsets of the list. The practical utility of this method is demonstrated by employing it in a retrospective study of two non-overlapping sets of genes defined by a published investigation as markers for normal human breast luminal epithelial cells and myoepithelial cells. RESULTS: Each genetic locus was characterized using a finite set of biological properties and represented as a vector of features indicating attributes associated with the locus (a gene attribute profile). In this study, the vector space models for a pre-defined list of genes were constructed from the Gene Ontology (GO) terms and the Conserved Domain Database (CDD) protein domain terms assigned to the loci by the gene-centered corpus LocusLink. This data set of GO- and CDD-based gene attribute profiles, vectors of binary random variables, was used to estimate multiple finite mixture models, and each ensuing model was used to partition the profiles into clusters. The resultant partitionings were combined using a unanimous voting scheme to produce consensus clusters, sets of profiles that co-occurred consistently in the same cluster. Attributes that were important in defining the genes assigned to a consensus cluster were identified. The clusters and their attributes were inspected to ascertain the GO and CDD terms most associated with subsets of genes and, in conjunction with external knowledge such as chromosomal location, used to gain functional insights into human breast biology. The 52 luminal epithelial cell markers and 89 myoepithelial cell markers are disjoint sets of genes.
Ensemble attribute profile clustering-based analysis indicated that both lists contained groups of genes with the functional properties of membrane receptor biology/signal transduction and nucleic acid binding/transcription. A subset of the luminal markers was associated with metabolic and oxidoreductase activities, whereas a subset of myoepithelial markers was associated with protein hydrolase activity. CONCLUSION: Given a set of genes and/or proteins associated with a phenomenon, process or system of interest, ensemble attribute profile clustering provides a simple method for collating and synthesizing the annotation data pertaining to them that are present in text-based, gene-centered corpora. The results provide information about properties common and unique to subsets of the list and hence insights into the biology of the problem under investigation.
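The two bookkeeping steps named in the description, encoding each gene as a binary attribute profile and combining several partitionings by unanimous voting, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, gene labels, term labels, and cluster labels are all hypothetical, and the partitions are supplied by hand rather than estimated from finite mixture models. The sketch assumes that "unanimous voting" means two genes share a consensus cluster only when every partitioning places them together.

```python
from collections import defaultdict


def attribute_profiles(annotations):
    """Encode {gene: set of terms} as binary vectors over a sorted vocabulary.

    Hypothetical helper: each position is 1 if the term is assigned
    to the gene and 0 otherwise.
    """
    vocab = sorted(set().union(*annotations.values()))
    profiles = {
        gene: [1 if term in terms else 0 for term in vocab]
        for gene, terms in annotations.items()
    }
    return vocab, profiles


def consensus_clusters(partitions):
    """Combine partitions by unanimous voting (assumed interpretation).

    Each partition maps gene -> cluster label. Two genes land in the same
    consensus cluster only if their labels agree in every partition, i.e.
    their tuples of labels across partitions are identical.
    """
    groups = defaultdict(list)
    for gene in partitions[0]:
        signature = tuple(p[gene] for p in partitions)
        groups[signature].append(gene)
    return sorted(sorted(members) for members in groups.values())


# Made-up annotations; the term names are placeholders, not real GO/CDD IDs.
annotations = {
    "g1": {"term_receptor", "term_signal"},
    "g2": {"term_receptor"},
    "g3": {"term_dna_binding"},
}
vocab, profiles = attribute_profiles(annotations)

# Three hand-written partitionings standing in for three fitted models.
partitions = [
    {"g1": 0, "g2": 0, "g3": 1, "g4": 1},
    {"g1": 1, "g2": 1, "g3": 0, "g4": 2},
    {"g1": 0, "g2": 0, "g3": 0, "g4": 1},
]
consensus = consensus_clusters(partitions)
print(consensus)
```

Here g1 and g2 are grouped by every partitioning and so form one consensus cluster, whereas g3 and g4 are separated by at least one partitioning and end up as singletons. Cluster labels need not match across partitionings, because only agreement within each partitioning matters.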
format Text
id pubmed-1435935
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1435935 2006-04-14 Ensemble attribute profile clustering: discovering and characterizing groups of genes with similar patterns of biological features Semeiks, JR Rizki, A Bissell, MJ Mian, IS BMC Bioinformatics Methodology Article BioMed Central 2006-03-16 /pmc/articles/PMC1435935/ /pubmed/16542449 http://dx.doi.org/10.1186/1471-2105-7-147 Text en Copyright © 2006 Semeiks et al; licensee BioMed Central Ltd.
title Ensemble attribute profile clustering: discovering and characterizing groups of genes with similar patterns of biological features
topic Methodology Article