Literature aided determination of data quality and statistical significance threshold for gene expression studies

BACKGROUND: Gene expression data are noisy due to technical and biological variability. Consequently, analysis of gene expression data is complex. Different statistical methods produce distinct sets of genes. In addition, the selection of an expression p-value (EPv) threshold is somewhat arbitrary. In this...

Bibliographic Details
Main Authors: Xu, Lijing, Cheng, Cheng, George, E Olusegun, Homayouni, Ramin
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3535704/
https://www.ncbi.nlm.nih.gov/pubmed/23282414
http://dx.doi.org/10.1186/1471-2164-13-S8-S23
author Xu, Lijing
Cheng, Cheng
George, E Olusegun
Homayouni, Ramin
collection PubMed
description BACKGROUND: Gene expression data are noisy due to technical and biological variability. Consequently, analysis of gene expression data is complex. Different statistical methods produce distinct sets of genes. In addition, the selection of an expression p-value (EPv) threshold is somewhat arbitrary. In this study, we aimed to develop novel literature-based approaches to integrate functional information into the analysis of gene expression data.
METHODS: Functional relationships between genes were derived by Latent Semantic Indexing (LSI) of Medline abstracts and used to calculate the functional cohesion of gene sets. In this study, literature cohesion was applied in two ways. First, the Literature-Based Functional Significance (LBFS) method was developed to calculate a p-value for the cohesion of differentially expressed genes (DEGs) in order to objectively evaluate the overall biological significance of gene expression experiments. Second, the Literature Aided Statistical Significance Threshold (LASST) method was developed to determine the appropriate expression p-value threshold for a given experiment.
RESULTS: We tested our methods on three different publicly available datasets. LBFS analysis demonstrated that only two experiments were significantly cohesive. For each experiment, we also compared the LBFS values of DEGs generated by four different statistical methods. We found that some statistical tests produced more functionally cohesive gene sets than others. However, no statistical test was consistently better for all experiments. This reemphasizes that a statistical test must be carefully selected for each expression study. Moreover, LASST analysis demonstrated that the expression p-value thresholds for some experiments were considerably lower (p < 0.02 and 0.01), suggesting that the arbitrary p-values and false discovery rate thresholds that are commonly used in expression studies may not be biologically sound.
CONCLUSIONS: We have developed robust and objective literature-based methods to evaluate the biological support for gene expression experiments and to determine the appropriate statistical significance threshold. These methods will help investigators extract biologically meaningful insights from high-throughput gene expression experiments more efficiently.
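The idea behind the METHODS paragraph (score how functionally cohesive a gene set is, then attach a p-value to that score) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes each gene is already represented by a vector in a reduced LSI space, it takes the mean pairwise cosine similarity as the cohesion score, and it substitutes a generic permutation test for the LBFS p-value. The helper `scan_thresholds`, which recomputes that p-value at several expression p-value cutoffs, is only a loose, hypothetical stand-in for a LASST-style sweep. All names, vectors, and p-values below are made up for illustration.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cohesion(genes, vectors):
    """Mean pairwise cosine similarity of a gene set (assumed cohesion score)."""
    pairs = list(combinations(genes, 2))
    if not pairs:
        return 0.0
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def cohesion_p_value(genes, vectors, n_perm=1000, seed=0):
    """Permutation p-value: fraction of random gene sets of the same size
    that are at least as cohesive as the observed set."""
    rng = random.Random(seed)
    observed = cohesion(genes, vectors)
    universe = sorted(vectors)
    hits = sum(
        cohesion(rng.sample(universe, len(genes)), vectors) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


def scan_thresholds(expression_p, vectors, thresholds, **kwargs):
    """Cohesion p-value of the DEG set obtained at each expression cutoff."""
    results = {}
    for t in thresholds:
        degs = [g for g, p in expression_p.items() if p < t]
        if len(degs) >= 2:  # cohesion needs at least one gene pair
            results[t] = cohesion_p_value(degs, vectors, **kwargs)
    return results


# Toy, made-up LSI vectors: g1-g3 point in similar directions, the rest do not.
vectors = {
    "g1": [1.0, 0.1, 0.0],
    "g2": [0.9, 0.2, 0.0],
    "g3": [1.0, 0.0, 0.1],
    "g4": [0.0, 1.0, 0.0],
    "g5": [0.0, 0.0, 1.0],
    "g6": [-1.0, 0.0, 0.0],
    "g7": [0.0, -1.0, 0.0],
    "g8": [0.0, 0.0, -1.0],
}
# Toy, made-up expression p-values for the same genes.
expression_p = {
    "g1": 0.001, "g2": 0.004, "g3": 0.008, "g4": 0.03,
    "g5": 0.2, "g6": 0.4, "g7": 0.6, "g8": 0.9,
}

tight_p = cohesion_p_value(["g1", "g2", "g3"], vectors, n_perm=200)
sweep = scan_thresholds(expression_p, vectors, [0.01, 0.05], n_perm=200)
print(tight_p, sweep)
```

In this sketch a small cohesion p-value means that randomly drawn gene sets of the same size are rarely as cohesive as the observed set. The add-one correction keeps the estimate strictly positive when no random set reaches the observed score.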
format Online
Article
Text
id pubmed-3535704
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3535704 2013-01-04 BMC Genomics Research BioMed Central 2012-12-17 /pmc/articles/PMC3535704/ /pubmed/23282414 http://dx.doi.org/10.1186/1471-2164-13-S8-S23 Text en Copyright ©2012 Xu et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Literature aided determination of data quality and statistical significance threshold for gene expression studies
topic Research