Sparse Regression in Cancer Genomics: Comparing Variable Selection and Predictions in Real World Data

BACKGROUND: Evaluation of gene interaction models in cancer genomics is challenging, as the true distribution is uncertain. Previous analyses have benchmarked models using synthetic data or databases of experimentally verified interactions – approaches which are susceptible to misrepresentation and...

Bibliographic Details
Main authors:	O’Shea, Robert J; Tsoka, Sophia; Cook, Gary JR; Goh, Vicky
Format:	Online Article Text
Language:	English
Published:	SAGE Publications 2021
Subjects:	Original Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8640984/ https://www.ncbi.nlm.nih.gov/pubmed/34866896 http://dx.doi.org/10.1177/11769351211056298

_version_	1784609419153113088
author	O’Shea, Robert J; Tsoka, Sophia; Cook, Gary JR; Goh, Vicky
author_sort	O’Shea, Robert J
collection	PubMed
description	BACKGROUND: Evaluation of gene interaction models in cancer genomics is challenging, as the true distribution is uncertain. Previous analyses have benchmarked models using synthetic data or databases of experimentally verified interactions – approaches which are susceptible to misrepresentation and incompleteness, respectively. The objectives of this analysis are to (1) provide a real-world data-driven approach for comparing performance of genomic model inference algorithms, (2) compare the performance of LASSO, elastic net, best-subset selection, [Formula: see text] penalisation and [Formula: see text] penalisation in real genomic data and (3) compare algorithmic preselection according to performance in our benchmark datasets to algorithmic selection by internal cross-validation. METHODS: Five large [Formula: see text] genomic datasets were extracted from Gene Expression Omnibus. ‘Gold-standard’ regression models were trained on subspaces of these datasets ([Formula: see text], [Formula: see text]). Penalised regression models were trained on small samples from these subspaces ([Formula: see text]) and validated against the gold-standard models. Variable selection performance and out-of-sample prediction were assessed. Penalty ‘preselection’ according to test performance in the other 4 datasets was compared to selection by internal cross-validation error minimisation. RESULTS: [Formula: see text]-penalisation achieved the highest cosine similarity between estimated coefficients and those of gold-standard models. [Formula: see text]-penalised models explained the greatest proportion of variance in test responses, though performance was unreliable in low signal:noise conditions. [Formula: see text] also attained the highest overall median variable selection F1 score. Penalty preselection significantly outperformed selection by internal cross-validation in each of 3 examined metrics. CONCLUSIONS: This analysis explores a novel approach for comparisons of model selection approaches in real genomic data from 5 cancers. Our benchmarking datasets have been made publicly available for use in future research. Our findings support the use of [Formula: see text] penalisation for structural selection and [Formula: see text] penalisation for coefficient recovery in genomic data. Evaluation of learning algorithms according to observed test performance in external genomic datasets yields valuable insights into actual test performance, providing a data-driven complement to internal cross-validation in genomic regression tasks.
format	Online Article Text
id	pubmed-8640984
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	SAGE Publications
record_format	MEDLINE/PubMed
spelling	pubmed-8640984 2021-12-04 Sparse Regression in Cancer Genomics: Comparing Variable Selection and Predictions in Real World Data. O’Shea, Robert J; Tsoka, Sophia; Cook, Gary JR; Goh, Vicky. Cancer Inform. Original Research. SAGE Publications 2021-11-27. /pmc/articles/PMC8640984/ /pubmed/34866896 http://dx.doi.org/10.1177/11769351211056298 Text en. © The Author(s) 2021. This article is distributed under the terms of the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/), which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).
title	Sparse Regression in Cancer Genomics: Comparing Variable Selection and Predictions in Real World Data
topic	Original Research

Similar items