A novel pathway-based distance score enhances assessment of disease heterogeneity in gene expression

BACKGROUND: Distance based unsupervised clustering of gene expression data is commonly used to identify heterogeneity in biologic samples. However, high noise levels in gene expression data and relatively high correlation between genes are often encountered, so traditional distances such as Euclidea...

Bibliographic Details
Main Authors: Yan, Xiting, Liang, Anqi, Gomez, Jose, Cohn, Lauren, Zhao, Hongyu, Chupp, Geoffrey L.
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5480187/
https://www.ncbi.nlm.nih.gov/pubmed/28637421
http://dx.doi.org/10.1186/s12859-017-1727-4
_version_ 1783245256333983744
author Yan, Xiting
Liang, Anqi
Gomez, Jose
Cohn, Lauren
Zhao, Hongyu
Chupp, Geoffrey L.
author_sort Yan, Xiting
collection PubMed
description BACKGROUND: Distance-based unsupervised clustering of gene expression data is commonly used to identify heterogeneity in biological samples. However, gene expression data are often noisy and genes are often highly correlated, so traditional distances such as Euclidean distance may not effectively discriminate the biological differences between samples. An alternative way to examine disease phenotypes is to use pre-defined biological pathways. These pathways have been shown to be perturbed in different ways in different subjects who have similar clinical features. We hypothesize that differences in the expression of genes in a given pathway are more predictive of biological differences between samples than standard approaches and that, if integrated into clustering analysis, they will enhance the robustness and accuracy of the clustering method. To examine this hypothesis, we developed a novel computational method that assesses the biological differences between samples from gene expression data by assuming that ontologically defined biological pathways behave similarly in biologically similar samples. RESULTS: Pre-defined biological pathways were downloaded, and the genes in each pathway were used to cluster samples with a Gaussian mixture model. The clustering results across pathways were then summarized to calculate the pathway-based distance score between samples. This method was applied to both simulated and real data sets and compared to the traditional Euclidean distance and to another pathway-based clustering method, Pathifier. The results show that the pathway-based distance score performs significantly better than the Euclidean distance, especially when the heterogeneity is low and genes in the same pathway are correlated. Compared to Pathifier, our approach achieves higher accuracy and robustness for small pathways. When the pathways are large, downsampling them into smaller pathways allows our approach to achieve comparable performance. CONCLUSIONS: We have developed a novel distance score that represents the biological differences between samples using gene expression data and pre-defined biological pathway information. Applying this distance score yields more accurate, robust, and biologically meaningful clustering results than traditional methods in both simulated and real data. Its performance is also comparable to or better than that of Pathifier. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1727-4) contains supplementary material, which is available to authorized users.
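The summarization step described in RESULTS can be illustrated with a minimal sketch. The abstract does not state how the per-pathway clusterings are combined, so the rule used here (the fraction of pathways in which two samples fall into different clusters) is an assumption, not the authors' published formula. The function name `pathway_distance`, the pathway names, and the labels are hypothetical. The labels are taken as given; in practice they might come from fitting a Gaussian mixture model to each pathway's genes, for example with scikit-learn's `GaussianMixture.fit_predict`.

```python
from itertools import combinations


def pathway_distance(labels_by_pathway):
    """Fraction of pathways in which two samples land in different clusters.

    ASSUMPTION: this disagreement-fraction rule is an illustrative stand-in
    for the summarization step; the abstract does not give the exact formula.

    labels_by_pathway: dict mapping pathway name -> list of cluster labels,
    one label per sample, all lists the same length.
    Returns a symmetric n x n matrix (list of lists) with a zero diagonal.
    """
    label_lists = list(labels_by_pathway.values())
    if not label_lists:
        raise ValueError("need at least one pathway")
    n = len(label_lists[0])
    if any(len(labels) != n for labels in label_lists):
        raise ValueError("every pathway must label the same samples")

    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # Count the pathways whose clustering separates samples i and j.
        disagreements = sum(labels[i] != labels[j] for labels in label_lists)
        dist[i][j] = dist[j][i] = disagreements / len(label_lists)
    return dist


# Hypothetical cluster labels for 4 samples across 3 made-up pathways.
example_labels = {
    "pathway_A": [0, 0, 1, 1],
    "pathway_B": [0, 0, 0, 1],
    "pathway_C": [1, 1, 0, 0],
}
example_dist = pathway_distance(example_labels)
```

Under this assumed rule, samples that are co-clustered in every pathway get distance 0 and samples separated in every pathway get distance 1. The resulting matrix can be passed to any distance-based clustering routine in place of a Euclidean distance matrix.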
format Online
Article
Text
id pubmed-5480187
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5480187 2017-06-23 A novel pathway-based distance score enhances assessment of disease heterogeneity in gene expression Yan, Xiting Liang, Anqi Gomez, Jose Cohn, Lauren Zhao, Hongyu Chupp, Geoffrey L. BMC Bioinformatics Methodology Article BioMed Central 2017-06-20 /pmc/articles/PMC5480187/ /pubmed/28637421 http://dx.doi.org/10.1186/s12859-017-1727-4 Text en © The Author(s). 2017 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title A novel pathway-based distance score enhances assessment of disease heterogeneity in gene expression
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5480187/
https://www.ncbi.nlm.nih.gov/pubmed/28637421
http://dx.doi.org/10.1186/s12859-017-1727-4