
The misuse of distributional assumptions in functional class scoring gene-set and pathway analysis

Gene-set analysis (GSA) is a standard procedure for exploring potential biological functions of a group of genes. The development of its methodology has been an active research topic in recent decades. Many GSA methods, when newly proposed, rely on simulation studies to evaluate their performance wi...

Bibliographic Details
Main authors: Ho, Chi-Hsuan; Huang, Yu-Jyun; Lai, Ying-Ju; Mukherjee, Rajarshi; Hsiao, Chuhsing Kate
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8728032/
https://www.ncbi.nlm.nih.gov/pubmed/34791175
http://dx.doi.org/10.1093/g3journal/jkab365
author Ho, Chi-Hsuan
Huang, Yu-Jyun
Lai, Ying-Ju
Mukherjee, Rajarshi
Hsiao, Chuhsing Kate
collection PubMed
description Gene-set analysis (GSA) is a standard procedure for exploring potential biological functions of a group of genes. The development of its methodology has been an active research topic in recent decades. Many GSA methods, when newly proposed, rely on simulation studies to evaluate their performance with an implicit assumption that the multivariate expression values are normally distributed. This assumption is commonly adopted in GSAs, particularly those in the group of functional class scoring (FCS) methods. The validity of the normality assumption, however, has been disputed in several studies, yet no systematic analysis has been carried out to assess the effect of this distributional assumption. Our goal in this study is not to propose a new GSA method but to first examine whether the multi-dimensional gene expression data in gene sets follow a multivariate normal (MVN) distribution. Six statistical methods in three categories of MVN tests were considered and applied to a total of 24 RNA data sets. These RNA values were collected from cancer patients as well as normal subjects, and the values were derived from microarray experiments, RNA sequencing, and single-cell RNA sequencing. Our first finding is that the MVN assumption is not always satisfied: it does not hold in many of the applications tested here. In the second part of this research, we evaluated the influence of non-normality on the statistical power of current FCS methods, both parametric and nonparametric. Specifically, the scenario of mixture distributions representing more than one population for the RNA values was considered. This second investigation demonstrates that non-normality of the RNA values causes a loss in the statistical power of these GSA tests, especially when subtypes exist. Among the FCS GSA tools and scenarios examined in this research, the N-statistics outperform the others.
Based on the results from these two investigations, we conclude that the assumption of MVN should be used with caution when evaluating new GSA tools, since this assumption cannot be guaranteed and violation may lead to spurious results, loss of power, and incorrect comparison between methods. If a newly proposed GSA tool is to be evaluated, we recommend the incorporation of a wide range of multivariate non-normal distributions or sampling from large databases if available.
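The contrast described in the abstract, between MVN data and a mixture representing more than one population, can be illustrated with a small simulation. The sketch below is not the authors' procedure: it uses Mardia's multivariate skewness and kurtosis as one classical MVN diagnostic (this record does not say which six tests the study used), and the function names, sample size, dimension, mixing fraction, and shift are all hypothetical choices made for illustration.

```python
import numpy as np


def mardia_statistics(x):
    """Return Mardia's multivariate skewness and kurtosis for an (n, p) sample."""
    x = np.asarray(x, dtype=float)
    n, _ = x.shape
    centered = x - x.mean(axis=0)
    # Maximum-likelihood covariance (divide by n), as in Mardia's definition.
    cov = centered.T @ centered / n
    # d[i, j] = (x_i - mean)^T cov^{-1} (x_j - mean)
    d = centered @ np.linalg.solve(cov, centered.T)
    skewness = np.sum(d ** 3) / n ** 2
    kurtosis = np.mean(np.diag(d) ** 2)
    return skewness, kurtosis


def simulate(n, p, mix, shift, rng):
    """Draw n rows from N(0, I); a random fraction `mix` of rows (a hypothetical
    subtype) is shifted by `shift` in every coordinate."""
    x = rng.standard_normal((n, p))
    subtype = rng.random(n) < mix
    x[subtype] += shift
    return x


rng = np.random.default_rng(0)
n, p = 1000, 5  # illustrative sizes, not taken from the study

normal_sample = simulate(n, p, mix=0.0, shift=0.0, rng=rng)
mixture_sample = simulate(n, p, mix=0.2, shift=3.0, rng=rng)

skew_normal, kurt_normal = mardia_statistics(normal_sample)
skew_mix, kurt_mix = mardia_statistics(mixture_sample)

# Under MVN, skewness should be close to 0 and kurtosis close to p * (p + 2).
print("MVN sample:     skewness=%.3f kurtosis=%.3f" % (skew_normal, kurt_normal))
print("Mixture sample: skewness=%.3f kurtosis=%.3f" % (skew_mix, kurt_mix))
print("Reference kurtosis under MVN:", p * (p + 2))
```

A skewness far from 0 in the mixture sample is the kind of departure an MVN test would flag. Turning these statistics into p-values requires their reference distributions, which this sketch leaves out.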
format Online
Article
Text
id pubmed-8728032
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8728032 2022-01-05 G3 (Bethesda) Investigation Oxford University Press 2021-10-25 /pmc/articles/PMC8728032/ /pubmed/34791175 http://dx.doi.org/10.1093/g3journal/jkab365 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of Genetics Society of America. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title The misuse of distributional assumptions in functional class scoring gene-set and pathway analysis
topic Investigation