Dissecting trait heterogeneity: a comparison of three clustering methods applied to genotypic data

Bibliographic Details
Main Authors:	Thornton-Wells, Tricia A; Moore, Jason H; Haines, Jonathan L
Format:	Text
Language:	English
Published:	BioMed Central 2006
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1525209/ https://www.ncbi.nlm.nih.gov/pubmed/16611359 http://dx.doi.org/10.1186/1471-2105-7-204

author	Thornton-Wells, Tricia A; Moore, Jason H; Haines, Jonathan L
author_facet	Thornton-Wells, Tricia A; Moore, Jason H; Haines, Jonathan L
author_sort	Thornton-Wells, Tricia A
collection	PubMed
description	BACKGROUND: Trait heterogeneity, which exists when a trait has been defined with insufficient specificity such that it is actually two or more distinct traits, has been implicated as a confounding factor in traditional statistical genetics of complex human disease. In the absence of detailed phenotypic data collected consistently in combination with genetic data, unsupervised computational methodologies offer the potential for discovering underlying trait heterogeneity. The performance of three such methods appropriate for categorical data (Bayesian Classification, Hypergraph-Based Clustering, and Fuzzy k-Modes Clustering) was compared. Also tested was the ability of these methods to detect trait heterogeneity in the presence of locus heterogeneity and/or gene-gene interaction, two other complicating factors in discovering genetic models of complex human disease. To determine the efficacy of applying the Bayesian Classification method to real data, the reliability of its internal clustering metrics in identifying good clusterings was evaluated using permutation testing. RESULTS: Bayesian Classification outperformed the other two methods, except that Fuzzy k-Modes Clustering performed best on the most complex genetic model. Bayesian Classification achieved excellent recovery for 75% of the datasets simulated under the simplest genetic model, while it achieved moderate recovery for 56% of datasets with a sample size of 500 or more (across all simulated models) and for 86% of datasets with 10 or fewer nonfunctional loci (across all simulated models). Neither Hypergraph-Based Clustering nor Fuzzy k-Modes Clustering achieved good or excellent cluster recovery for a majority of datasets, even under a restricted set of conditions.
When the average log of class strength was used as the internal clustering metric, the false positive rate was controlled very well, at three percent or less for all three significance levels (0.01, 0.05, 0.10), and the false negative rate was acceptably low (18 percent) for the least stringent significance level of 0.10. CONCLUSION: Bayesian Classification shows promise as an unsupervised computational method for dissecting trait heterogeneity in genotypic data. Its control of false positive and false negative rates lends confidence to the validity of its results. Further investigation of how different parameter settings may improve the performance of Bayesian Classification, especially under more complex genetic models, is ongoing.
format	Text
id	pubmed-1525209
institution	National Center for Biotechnology Information
language	English
publishDate	2006
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1525209 2006-08-02 Dissecting trait heterogeneity: a comparison of three clustering methods applied to genotypic data Thornton-Wells, Tricia A; Moore, Jason H; Haines, Jonathan L BMC Bioinformatics Research Article BioMed Central 2006-04-12 /pmc/articles/PMC1525209/ /pubmed/16611359 http://dx.doi.org/10.1186/1471-2105-7-204 Text en Copyright © 2006 Thornton-Wells et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Thornton-Wells, Tricia A Moore, Jason H Haines, Jonathan L Dissecting trait heterogeneity: a comparison of three clustering methods applied to genotypic data
title	Dissecting trait heterogeneity: a comparison of three clustering methods applied to genotypic data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1525209/ https://www.ncbi.nlm.nih.gov/pubmed/16611359 http://dx.doi.org/10.1186/1471-2105-7-204
