
Random forest-integrated analysis in AD and LATE brain transcriptome-wide data to identify disease-specific gene expression


Bibliographic Details
Main Authors: Wu, Xinxing, Peng, Chong, Nelson, Peter T., Cheng, Qiang
Format: Online Article Text
Language: English
Published: Public Library of Science 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8423259/
https://www.ncbi.nlm.nih.gov/pubmed/34492068
http://dx.doi.org/10.1371/journal.pone.0256648
author Wu, Xinxing
Peng, Chong
Nelson, Peter T.
Cheng, Qiang
collection PubMed
description Alzheimer’s disease (AD) is a complex neurodegenerative disorder that affects thinking, memory, and behavior. Limbic-predominant age-related TDP-43 encephalopathy (LATE) is a recently identified, common neurodegenerative disease that mimics the clinical symptoms of AD. The development of drugs to prevent or treat these neurodegenerative diseases has been slow, partly because the genes associated with these diseases are incompletely understood. A notable hindrance from a data analysis perspective is that the clinical samples for patients and controls are usually highly imbalanced, which makes it challenging to apply most existing machine learning algorithms directly to such datasets. Meeting this data analysis challenge is critical, as more specific identification of disease-associated genes may enable new insights into underlying disease-driving mechanisms, help find biomarkers and, in turn, improve prospects for effective treatment strategies. To detect disease-associated genes based on imbalanced transcriptome-wide data, we proposed an integrated multiple random forests (IMRF) algorithm. IMRF is effective in differentiating putative genes associated with subjects having LATE and/or AD from controls based on transcriptome-wide data, thereby enabling effective discrimination between these samples. Various forms of validation all testify to the effectiveness of IMRF in identifying genes with altered expression in LATE and/or AD: cross-domain verification of our method on other datasets, improved and competitive classification performance using the identified genes, effective classification of testing data with a classifier that is completely independent of decision trees and random forests, and relationships with prior AD and LATE studies on genes linked to neurodegeneration. We conclude that IMRF, as an effective feature selection algorithm for imbalanced data, shows promise for facilitating the development of new gene biomarkers as well as targets for effective strategies of disease prevention and treatment.
format Online
Article
Text
id pubmed-8423259
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-8423259 2021-09-08 PLoS One Research Article Public Library of Science 2021-09-07 /pmc/articles/PMC8423259/ /pubmed/34492068 http://dx.doi.org/10.1371/journal.pone.0256648 Text en © 2021 Wu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Random forest-integrated analysis in AD and LATE brain transcriptome-wide data to identify disease-specific gene expression
topic Research Article