The impact of sample imbalance on identifying differentially expressed genes
BACKGROUND: Recently several statistical methods have been proposed to identify genes with differential expression between two conditions. However, very few studies consider the problem of sample imbalance and there is no study to investigate the impact of sample imbalance on identifying differentia...
Main authors: | Yang, Kun; Li, Jianzhong; Gao, Hong |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2006 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1780111/ https://www.ncbi.nlm.nih.gov/pubmed/17217526 http://dx.doi.org/10.1186/1471-2105-7-S4-S8 |
_version_ | 1782131846494027776 |
---|---|
author | Yang, Kun Li, Jianzhong Gao, Hong |
author_facet | Yang, Kun Li, Jianzhong Gao, Hong |
author_sort | Yang, Kun |
collection | PubMed |
description | BACKGROUND: Recently, several statistical methods have been proposed to identify genes that are differentially expressed between two conditions. However, very few studies consider the problem of sample imbalance, and no study has investigated its impact on identifying differentially expressed genes. In addition, it is not clear which method is most suitable for unbalanced data. RESULTS: Based on random sampling, two evaluation models are proposed to investigate the impact of sample imbalance on identifying differentially expressed genes. Using the proposed evaluation models, the performance of six well-known methods is compared on unbalanced data. The experimental results indicate that sample imbalance has a great influence on the selection of differentially expressed genes. Furthermore, the methods perform very differently on unbalanced data. Among the six methods, the Welch t-test appears to perform best when the group with the larger variance also has the larger sample size, while the regularized t-test and SAM outperform the others on unbalanced data in the remaining cases. CONCLUSION: The two proposed evaluation models are effective, and sample imbalance should be taken into account in microarray experiment design and gene expression data analysis. The results and the two proposed evaluation models can help in selecting a suitable method for processing unbalanced data. |
format | Text |
id | pubmed-1780111 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2006 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-1780111 2007-01-24 The impact of sample imbalance on identifying differentially expressed genes Yang, Kun Li, Jianzhong Gao, Hong BMC Bioinformatics Research BioMed Central 2006-12-12 /pmc/articles/PMC1780111/ /pubmed/17217526 http://dx.doi.org/10.1186/1471-2105-7-S4-S8 Text en Copyright © 2006 Yang et al; licensee BioMed Central Ltd http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Yang, Kun Li, Jianzhong Gao, Hong The impact of sample imbalance on identifying differentially expressed genes |
title | The impact of sample imbalance on identifying differentially expressed genes |
title_full | The impact of sample imbalance on identifying differentially expressed genes |
title_fullStr | The impact of sample imbalance on identifying differentially expressed genes |
title_full_unstemmed | The impact of sample imbalance on identifying differentially expressed genes |
title_short | The impact of sample imbalance on identifying differentially expressed genes |
title_sort | impact of sample imbalance on identifying differentially expressed genes |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1780111/ https://www.ncbi.nlm.nih.gov/pubmed/17217526 http://dx.doi.org/10.1186/1471-2105-7-S4-S8 |
work_keys_str_mv | AT yangkun theimpactofsampleimbalanceonidentifyingdifferentiallyexpressedgenes AT lijianzhong theimpactofsampleimbalanceonidentifyingdifferentiallyexpressedgenes AT gaohong theimpactofsampleimbalanceonidentifyingdifferentiallyexpressedgenes AT yangkun impactofsampleimbalanceonidentifyingdifferentiallyexpressedgenes AT lijianzhong impactofsampleimbalanceonidentifyingdifferentiallyexpressedgenes AT gaohong impactofsampleimbalanceonidentifyingdifferentiallyexpressedgenes |
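
The description field above compares per-gene tests on groups of unequal size. As a rough illustration only, and not the authors' evaluation models, the following Python sketch computes the Welch t statistic next to the pooled-variance (Student) t statistic for one gene. The function names `welch_t` and `pooled_t` and the expression values are hypothetical, chosen for this example.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    # Squared standard error of each group mean (sample variance / size).
    se2_a, se2_b = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


def pooled_t(a, b):
    """Student t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical expression values for one gene: the larger group (5 samples)
# is also the group with the larger variance.
small_group = [1.0, 2.0, 3.0]
large_group = [2.0, 4.0, 6.0, 8.0, 10.0]

print("Welch :", welch_t(small_group, large_group))
print("Pooled:", pooled_t(small_group, large_group))
```

When both groups have the same number of samples the two statistics coincide, so any difference between them arises only on unbalanced data, which is the setting the record describes.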