
beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types

Biological experiments involving genomics or other high-throughput assays typically yield a data matrix that can be explored and analyzed using the R programming language with packages from the Bioconductor project. Improvements in the throughput of these assays have resulted in an explosion of data...


Bibliographic Details
Main Authors: Lun, Aaron T. L., Pagès, Hervé, Smith, Mike L.
Format: Online Article Text
Language: English
Published: Public Library of Science 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5953501/
https://www.ncbi.nlm.nih.gov/pubmed/29723188
http://dx.doi.org/10.1371/journal.pcbi.1006135
author Lun, Aaron T. L.
Pagès, Hervé
Smith, Mike L.
collection PubMed
description Biological experiments involving genomics or other high-throughput assays typically yield a data matrix that can be explored and analyzed using the R programming language with packages from the Bioconductor project. Improvements in the throughput of these assays have resulted in an explosion of data even from routine experiments, which poses a challenge to the existing computational infrastructure for statistical data analysis. For example, single-cell RNA sequencing (scRNA-seq) experiments frequently generate large matrices containing expression values for each gene in each cell, requiring sparse or file-backed representations for memory-efficient manipulation in R. These alternative representations are not easily compatible with high-performance C++ code used for computationally intensive tasks in existing R/Bioconductor packages. Here, we describe a C++ interface named beachmat, which enables agnostic data access from various matrix representations. This allows package developers to write efficient C++ code that is interoperable with dense, sparse and file-backed matrices, amongst others. We evaluated the performance of beachmat for accessing data from each matrix representation using both simulated and real scRNA-seq data, and defined a clear memory/speed trade-off to motivate the choice of an appropriate representation. We also demonstrate how beachmat can be incorporated into the code of other packages to drive analyses of a very large scRNA-seq data set.
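The idea of representation-agnostic access described above can be illustrated with a minimal C++ sketch. This is not the beachmat API: the class names `NumericMatrix`, `DenseMatrix` and `CscMatrix`, the `get_col` method and the `column_sums` function are all hypothetical and chosen only for illustration. The sketch assumes a dense column-major layout and a compressed sparse column layout, and shows how one algorithm can run unchanged on both.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical interface (NOT the beachmat API): every representation
// exposes the same column-access call, so algorithms need not know
// how the values are stored.
class NumericMatrix {
public:
    virtual ~NumericMatrix() = default;
    virtual std::size_t nrow() const = 0;
    virtual std::size_t ncol() const = 0;
    // Copies column `c` into `out`, which must hold nrow() values.
    virtual void get_col(std::size_t c, double* out) const = 0;
};

// Dense column-major storage, laid out like an ordinary R matrix.
class DenseMatrix : public NumericMatrix {
public:
    DenseMatrix(std::size_t nr, std::size_t nc, std::vector<double> values)
        : nr_(nr), nc_(nc), x_(std::move(values)) {
        assert(x_.size() == nr_ * nc_);
    }
    std::size_t nrow() const override { return nr_; }
    std::size_t ncol() const override { return nc_; }
    void get_col(std::size_t c, double* out) const override {
        assert(c < nc_);
        const double* start = x_.data() + c * nr_;
        std::copy(start, start + nr_, out);
    }
private:
    std::size_t nr_, nc_;
    std::vector<double> x_;
};

// Compressed sparse column storage: `p` holds column offsets (ncol + 1
// entries), `i` holds row indices and `x` holds the non-zero values.
class CscMatrix : public NumericMatrix {
public:
    CscMatrix(std::size_t nr, std::size_t nc, std::vector<std::size_t> p,
              std::vector<std::size_t> i, std::vector<double> x)
        : nr_(nr), nc_(nc), p_(std::move(p)), i_(std::move(i)), x_(std::move(x)) {
        assert(p_.size() == nc_ + 1);
        assert(i_.size() == x_.size());
    }
    std::size_t nrow() const override { return nr_; }
    std::size_t ncol() const override { return nc_; }
    void get_col(std::size_t c, double* out) const override {
        assert(c < nc_);
        std::fill(out, out + nr_, 0.0);  // entries not stored are zero
        for (std::size_t k = p_[c]; k < p_[c + 1]; ++k) {
            out[i_[k]] = x_[k];
        }
    }
private:
    std::size_t nr_, nc_;
    std::vector<std::size_t> p_, i_;
    std::vector<double> x_;
};

// An algorithm written once against the interface; it works for any
// representation that implements get_col().
std::vector<double> column_sums(const NumericMatrix& mat) {
    std::vector<double> buffer(mat.nrow());
    std::vector<double> sums(mat.ncol(), 0.0);
    for (std::size_t c = 0; c < mat.ncol(); ++c) {
        mat.get_col(c, buffer.data());
        for (double v : buffer) {
            sums[c] += v;
        }
    }
    return sums;
}
```

Calling `column_sums` on a `DenseMatrix` and on a `CscMatrix` holding the same values gives the same result. The sparse version stores only the non-zero entries but expands each column into a dense buffer on access, a small-scale instance of the memory/speed trade-off the description mentions.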
format Online
Article
Text
id pubmed-5953501
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-5953501 2018-05-25 beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types Lun, Aaron T. L. Pagès, Hervé Smith, Mike L. PLoS Comput Biol Research Article
Public Library of Science 2018-05-03 /pmc/articles/PMC5953501/ /pubmed/29723188 http://dx.doi.org/10.1371/journal.pcbi.1006135 Text en © 2018 Lun et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types
topic Research Article