Learning sparse log-ratios for high-throughput sequencing data

MOTIVATION: The automatic discovery of sparse biomarkers that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput sequencing (HTS) data, and compositional data (CoDa) more generally, an important class of biomarkers are the log-ratios bet...

Bibliographic Details
Main Authors:	Gordon-Rodriguez, Elliott; Quinn, Thomas P; Cunningham, John P
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2021
Subjects:	Original Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8696089/ https://www.ncbi.nlm.nih.gov/pubmed/34498030 http://dx.doi.org/10.1093/bioinformatics/btab645

author	Gordon-Rodriguez, Elliott; Quinn, Thomas P; Cunningham, John P
author_facet	Gordon-Rodriguez, Elliott; Quinn, Thomas P; Cunningham, John P
author_sort	Gordon-Rodriguez, Elliott
collection	PubMed
description	MOTIVATION: The automatic discovery of sparse biomarkers that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput sequencing (HTS) data, and compositional data (CoDa) more generally, an important class of biomarkers are the log-ratios between the input variables. However, identifying predictive log-ratio biomarkers from HTS data is a combinatorial optimization problem, which is computationally challenging. Existing methods are slow to run and scale poorly with the dimension of the input, which has limited their application to low- and moderate-dimensional metagenomic datasets. RESULTS: Building on recent advances from the field of deep learning, we present CoDaCoRe, a novel learning algorithm that identifies sparse, interpretable and predictive log-ratio biomarkers. Our algorithm exploits a continuous relaxation to approximate the underlying combinatorial optimization problem. This relaxation can then be optimized efficiently using the modern ML toolbox, in particular, gradient descent. As a result, CoDaCoRe runs several orders of magnitude faster than competing methods, all while achieving state-of-the-art performance in terms of predictive accuracy and sparsity. We verify the outperformance of CoDaCoRe across a wide range of microbiome, metabolite and microRNA benchmark datasets, as well as a particularly high-dimensional dataset that is outright computationally intractable for existing sparse log-ratio selection methods. AVAILABILITY AND IMPLEMENTATION: The CoDaCoRe package is available at https://github.com/egr95/R-codacore. Code and instructions for reproducing our results are available at https://github.com/cunningham-lab/codacore. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-8696089
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-8696089 2022-01-04 Learning sparse log-ratios for high-throughput sequencing data. Gordon-Rodriguez, Elliott; Quinn, Thomas P; Cunningham, John P. Bioinformatics. Original Papers. Oxford University Press 2021-09-08 /pmc/articles/PMC8696089/ /pubmed/34498030 http://dx.doi.org/10.1093/bioinformatics/btab645 Text en © The Author(s) 2021. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Learning sparse log-ratios for high-throughput sequencing data
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8696089/ https://www.ncbi.nlm.nih.gov/pubmed/34498030 http://dx.doi.org/10.1093/bioinformatics/btab645
