Cargando…

Considerations for feature selection using gene pairs and applications in large-scale dataset integration, novel oncogene discovery, and interpretable cancer screening

BACKGROUND: Advancements in transcriptomic profiling have led to the emergence of new challenges regarding data integration and interpretability. Variability between measurement platforms makes it difficult to compare between cohorts, and large numbers of gene features have encouraged the use black...

Descripción completa

Detalles Bibliográficos
Autores principales:	Moody, Laura, Chen, Hong, Pan, Yuan-Xiang
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	BioMed Central 2020
Materias:	Research
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7579924/ https://www.ncbi.nlm.nih.gov/pubmed/33087122 http://dx.doi.org/10.1186/s12920-020-00778-x

_version_	1783598692467474432
author	Moody, Laura Chen, Hong Pan, Yuan-Xiang
author_facet	Moody, Laura Chen, Hong Pan, Yuan-Xiang
author_sort	Moody, Laura
collection	PubMed
description	BACKGROUND: Advancements in transcriptomic profiling have led to the emergence of new challenges regarding data integration and interpretability. Variability between measurement platforms makes it difficult to compare between cohorts, and large numbers of gene features have encouraged the use black box methods that are not easily translated into biologically and clinically meaningful findings. We propose that gene rankings and algorithms that rely on relative expression within gene pairs can address such obstacles. METHODS: We implemented an innovative process to evaluate the performance of five feature selection methods on simulated gene-pair data. Along with TSP, we consider other methods that retain more information in their score calculations, including the magnitude of gene expression change as well as within-class variation. Tree-based rule extraction was also applied to serum microRNA (miRNA) pairs in order to devise a noninvasive screening tool for pancreatic and ovarian cancer. RESULTS: Gene pair data were simulated using different types of signal and noise. Pairs were filtered using feature selection approaches, including top-scoring pairs (TSP), absolute differences between gene ranks, and Fisher scores. Methods that retain more information, such as the magnitude of expression change and within-class variance, yielded higher classification accuracy using a random forest model. We then demonstrate two powerful applications of gene pairs by first performing large-scale integration of 52 breast cancer datasets consisting of 10,350 patients. Not only did we confirm known oncogenes, but we also propose novel tumorigenic genes, such as BSDC1 and U2AF1, that could distinguish between tumor subtypes. Finally, circulating miRNA pairs were filtered and salient rules were extracted to build simplified tree ensemble learners (STELs) for four types of cancer. These accessible clinical frameworks detected pancreatic and ovarian cancer with 84.8 and 93.6% accuracy, respectively. CONCLUSION: Rank-based gene pair classification benefits from careful feature selection methods that preserve maximal information. Gene pairs enable dataset integration for greater statistical power and discovery of robust biomarkers as well as facilitate construction of user-friendly clinical screening tools.
format	Online Article Text
id	pubmed-7579924
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-75799242020-10-22 Considerations for feature selection using gene pairs and applications in large-scale dataset integration, novel oncogene discovery, and interpretable cancer screening Moody, Laura Chen, Hong Pan, Yuan-Xiang BMC Med Genomics Research BACKGROUND: Advancements in transcriptomic profiling have led to the emergence of new challenges regarding data integration and interpretability. Variability between measurement platforms makes it difficult to compare between cohorts, and large numbers of gene features have encouraged the use black box methods that are not easily translated into biologically and clinically meaningful findings. We propose that gene rankings and algorithms that rely on relative expression within gene pairs can address such obstacles. METHODS: We implemented an innovative process to evaluate the performance of five feature selection methods on simulated gene-pair data. Along with TSP, we consider other methods that retain more information in their score calculations, including the magnitude of gene expression change as well as within-class variation. Tree-based rule extraction was also applied to serum microRNA (miRNA) pairs in order to devise a noninvasive screening tool for pancreatic and ovarian cancer. RESULTS: Gene pair data were simulated using different types of signal and noise. Pairs were filtered using feature selection approaches, including top-scoring pairs (TSP), absolute differences between gene ranks, and Fisher scores. Methods that retain more information, such as the magnitude of expression change and within-class variance, yielded higher classification accuracy using a random forest model. We then demonstrate two powerful applications of gene pairs by first performing large-scale integration of 52 breast cancer datasets consisting of 10,350 patients. Not only did we confirm known oncogenes, but we also propose novel tumorigenic genes, such as BSDC1 and U2AF1, that could distinguish between tumor subtypes. Finally, circulating miRNA pairs were filtered and salient rules were extracted to build simplified tree ensemble learners (STELs) for four types of cancer. These accessible clinical frameworks detected pancreatic and ovarian cancer with 84.8 and 93.6% accuracy, respectively. CONCLUSION: Rank-based gene pair classification benefits from careful feature selection methods that preserve maximal information. Gene pairs enable dataset integration for greater statistical power and discovery of robust biomarkers as well as facilitate construction of user-friendly clinical screening tools. BioMed Central 2020-10-22 /pmc/articles/PMC7579924/ /pubmed/33087122 http://dx.doi.org/10.1186/s12920-020-00778-x Text en © The Author(s) 2020 Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle	Research Moody, Laura Chen, Hong Pan, Yuan-Xiang Considerations for feature selection using gene pairs and applications in large-scale dataset integration, novel oncogene discovery, and interpretable cancer screening
title	Considerations for feature selection using gene pairs and applications in large-scale dataset integration, novel oncogene discovery, and interpretable cancer screening
title_full	Considerations for feature selection using gene pairs and applications in large-scale dataset integration, novel oncogene discovery, and interpretable cancer screening
title_fullStr	Considerations for feature selection using gene pairs and applications in large-scale dataset integration, novel oncogene discovery, and interpretable cancer screening
title_full_unstemmed	Considerations for feature selection using gene pairs and applications in large-scale dataset integration, novel oncogene discovery, and interpretable cancer screening
title_short	Considerations for feature selection using gene pairs and applications in large-scale dataset integration, novel oncogene discovery, and interpretable cancer screening
title_sort	considerations for feature selection using gene pairs and applications in large-scale dataset integration, novel oncogene discovery, and interpretable cancer screening
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7579924/ https://www.ncbi.nlm.nih.gov/pubmed/33087122 http://dx.doi.org/10.1186/s12920-020-00778-x
work_keys_str_mv	AT moodylaura considerationsforfeatureselectionusinggenepairsandapplicationsinlargescaledatasetintegrationnoveloncogenediscoveryandinterpretablecancerscreening AT chenhong considerationsforfeatureselectionusinggenepairsandapplicationsinlargescaledatasetintegrationnoveloncogenediscoveryandinterpretablecancerscreening AT panyuanxiang considerationsforfeatureselectionusinggenepairsandapplicationsinlargescaledatasetintegrationnoveloncogenediscoveryandinterpretablecancerscreening

Considerations for feature selection using gene pairs and applications in large-scale dataset integration, novel oncogene discovery, and interpretable cancer screening

Ejemplares similares