MSPJ: Discovering potential biomarkers in small gene expression datasets via ensemble learning

In transcriptomics, differentially expressed genes (DEGs) provide fine-grained phenotypic resolution for comparisons between groups and insights into molecular mechanisms underlying the pathogenesis of complex diseases or phenotypes. The robust detection of DEGs from large datasets is well-establish...

Bibliographic Details
Main Authors: Yin, HuaChun, Tao, JingXin, Peng, Yuyang, Xiong, Ying, Li, Bo, Li, Song, Yang, Hui
Format: Online Article Text
Language: English
Published: Research Network of Computational and Structural Biotechnology 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9304602/
https://www.ncbi.nlm.nih.gov/pubmed/35891786
http://dx.doi.org/10.1016/j.csbj.2022.07.022
_version_ 1784752124819668992
author Yin, HuaChun
Tao, JingXin
Peng, Yuyang
Xiong, Ying
Li, Bo
Li, Song
Yang, Hui
collection PubMed
description In transcriptomics, differentially expressed genes (DEGs) provide fine-grained phenotypic resolution for comparisons between groups and insights into the molecular mechanisms underlying the pathogenesis of complex diseases or phenotypes. The robust detection of DEGs from large datasets is well established. However, owing to various limitations (e.g., the low availability of samples for some diseases or limited research funding), small sample sizes are frequently used in experiments. Therefore, methods to screen reliable and stable features are urgently needed for analyses with limited sample sizes. In this study, MSPJ, a new machine learning approach for identifying DEGs, was proposed to mitigate the reduced power and improve the stability of DEG identification in small gene expression datasets. This ensemble learning-based method combines three algorithms: an improved multiple random sampling with meta-analysis, SVM-RFE (support vector machines-recursive feature elimination), and a permutation test. MSPJ was compared with ten classical methods on 94 simulated datasets and in large-scale benchmarking with 165 real datasets. The results showed that, among these methods, MSPJ had the best performance in most small gene expression datasets, especially those with sample sizes below 30. In summary, the MSPJ method enables effective feature selection for robust DEG identification in small transcriptome datasets and is expected to expand research on the molecular mechanisms underlying complex diseases or phenotypes.
format Online
Article
Text
id pubmed-9304602
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Research Network of Computational and Structural Biotechnology
record_format MEDLINE/PubMed
spelling pubmed-9304602 2022-07-25 MSPJ: Discovering potential biomarkers in small gene expression datasets via ensemble learning Yin, HuaChun Tao, JingXin Peng, Yuyang Xiong, Ying Li, Bo Li, Song Yang, Hui Comput Struct Biotechnol J Research Article Research Network of Computational and Structural Biotechnology 2022-07-14 /pmc/articles/PMC9304602/ /pubmed/35891786 http://dx.doi.org/10.1016/j.csbj.2022.07.022 Text en © 2022 Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology. https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title MSPJ: Discovering potential biomarkers in small gene expression datasets via ensemble learning
topic Research Article
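
The description above names a permutation test as one of the three components of MSPJ. The record does not say how the authors implement it, so the following is only a minimal, generic sketch of a two-sided permutation test on the difference in mean expression of a single gene between two groups. The function name `permutation_test`, its parameters, and the example expression values are all hypothetical and are not taken from the MSPJ software.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means.

    Generic illustration only; this is NOT the MSPJ implementation.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # Add-one correction: the observed labeling counts as one permutation,
    # so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical expression values for one gene in two small groups.
case = [5.1, 5.3, 5.0, 5.2, 5.4]
control = [1.0, 1.2, 0.9, 1.1, 1.3]
p_value = permutation_test(case, control)
print(f"permutation p-value: {p_value:.4f}")
```

Because the test only relabels the observed samples, it makes no distributional assumption, which is one reason permutation approaches are often considered for small sample sizes. With very few samples the number of distinct relabelings is limited, so the smallest attainable p-value is limited as well.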