beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types
Biological experiments involving genomics or other high-throughput assays typically yield a data matrix that can be explored and analyzed using the R programming language with packages from the Bioconductor project. Improvements in the throughput of these assays have resulted in an explosion of data...
Main authors: | Lun, Aaron T. L.; Pagès, Hervé; Smith, Mike L. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2018 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5953501/ https://www.ncbi.nlm.nih.gov/pubmed/29723188 http://dx.doi.org/10.1371/journal.pcbi.1006135 |
_version_ | 1783323368666169344 |
---|---|
author | Lun, Aaron T. L. Pagès, Hervé Smith, Mike L. |
author_facet | Lun, Aaron T. L. Pagès, Hervé Smith, Mike L. |
author_sort | Lun, Aaron T. L. |
collection | PubMed |
description | Biological experiments involving genomics or other high-throughput assays typically yield a data matrix that can be explored and analyzed using the R programming language with packages from the Bioconductor project. Improvements in the throughput of these assays have resulted in an explosion of data even from routine experiments, which poses a challenge to the existing computational infrastructure for statistical data analysis. For example, single-cell RNA sequencing (scRNA-seq) experiments frequently generate large matrices containing expression values for each gene in each cell, requiring sparse or file-backed representations for memory-efficient manipulation in R. These alternative representations are not easily compatible with high-performance C++ code used for computationally intensive tasks in existing R/Bioconductor packages. Here, we describe a C++ interface named beachmat, which enables agnostic data access from various matrix representations. This allows package developers to write efficient C++ code that is interoperable with dense, sparse and file-backed matrices, amongst others. We evaluated the performance of beachmat for accessing data from each matrix representation using both simulated and real scRNA-seq data, and defined a clear memory/speed trade-off to motivate the choice of an appropriate representation. We also demonstrate how beachmat can be incorporated into the code of other packages to drive analyses of a very large scRNA-seq data set. |
format | Online Article Text |
id | pubmed-5953501 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-59535012018-05-25 beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types Lun, Aaron T. L. Pagès, Hervé Smith, Mike L. PLoS Comput Biol Research Article Biological experiments involving genomics or other high-throughput assays typically yield a data matrix that can be explored and analyzed using the R programming language with packages from the Bioconductor project. Improvements in the throughput of these assays have resulted in an explosion of data even from routine experiments, which poses a challenge to the existing computational infrastructure for statistical data analysis. For example, single-cell RNA sequencing (scRNA-seq) experiments frequently generate large matrices containing expression values for each gene in each cell, requiring sparse or file-backed representations for memory-efficient manipulation in R. These alternative representations are not easily compatible with high-performance C++ code used for computationally intensive tasks in existing R/Bioconductor packages. Here, we describe a C++ interface named beachmat, which enables agnostic data access from various matrix representations. This allows package developers to write efficient C++ code that is interoperable with dense, sparse and file-backed matrices, amongst others. We evaluated the performance of beachmat for accessing data from each matrix representation using both simulated and real scRNA-seq data, and defined a clear memory/speed trade-off to motivate the choice of an appropriate representation. We also demonstrate how beachmat can be incorporated into the code of other packages to drive analyses of a very large scRNA-seq data set. 
Public Library of Science 2018-05-03 /pmc/articles/PMC5953501/ /pubmed/29723188 http://dx.doi.org/10.1371/journal.pcbi.1006135 Text en © 2018 Lun et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Lun, Aaron T. L. Pagès, Hervé Smith, Mike L. beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types |
title | beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types |
title_full | beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types |
title_fullStr | beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types |
title_full_unstemmed | beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types |
title_short | beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types |
title_sort | beachmat: a bioconductor c++ api for accessing high-throughput biological data from a variety of r matrix types |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5953501/ https://www.ncbi.nlm.nih.gov/pubmed/29723188 http://dx.doi.org/10.1371/journal.pcbi.1006135 |
work_keys_str_mv | AT lunaarontl beachmatabioconductorcapiforaccessinghighthroughputbiologicaldatafromavarietyofrmatrixtypes AT pagesherve beachmatabioconductorcapiforaccessinghighthroughputbiologicaldatafromavarietyofrmatrixtypes AT smithmikel beachmatabioconductorcapiforaccessinghighthroughputbiologicaldatafromavarietyofrmatrixtypes |
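
The description above says the interface "enables agnostic data access from various matrix representations", so that the same C++ code works with dense and sparse matrices. The sketch below illustrates that general idea only. It is not beachmat's actual API: the class names `NumericMatrix`, `DenseMatrix`, `SparseMatrix` and the function `column_sums` are hypothetical, invented for this illustration, and the storage layouts (column-major dense, compressed sparse column) are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical abstract interface (NOT the beachmat API): callers read one
// column at a time and never see how the values are stored.
class NumericMatrix {
public:
    virtual ~NumericMatrix() = default;
    virtual std::size_t nrow() const = 0;
    virtual std::size_t ncol() const = 0;
    // Copy column c into out, resizing out to nrow() entries.
    virtual void get_col(std::size_t c, std::vector<double>& out) const = 0;
};

// Dense backend: all values stored contiguously in column-major order.
class DenseMatrix : public NumericMatrix {
    std::size_t nr_, nc_;
    std::vector<double> values_;
public:
    DenseMatrix(std::size_t nr, std::size_t nc, std::vector<double> values)
        : nr_(nr), nc_(nc), values_(std::move(values)) {
        assert(values_.size() == nr_ * nc_);
    }
    std::size_t nrow() const override { return nr_; }
    std::size_t ncol() const override { return nc_; }
    void get_col(std::size_t c, std::vector<double>& out) const override {
        assert(c < nc_);
        auto first = values_.begin() + static_cast<std::ptrdiff_t>(c * nr_);
        out.assign(first, first + static_cast<std::ptrdiff_t>(nr_));
    }
};

// Sparse backend: compressed sparse column storage (assumed layout).
// col_ptr_ has ncol()+1 entries; the non-zero values of column c sit at
// positions col_ptr_[c] .. col_ptr_[c+1]-1 of row_idx_ and values_.
class SparseMatrix : public NumericMatrix {
    std::size_t nr_, nc_;
    std::vector<std::size_t> col_ptr_, row_idx_;
    std::vector<double> values_;
public:
    SparseMatrix(std::size_t nr, std::size_t nc,
                 std::vector<std::size_t> col_ptr,
                 std::vector<std::size_t> row_idx,
                 std::vector<double> values)
        : nr_(nr), nc_(nc), col_ptr_(std::move(col_ptr)),
          row_idx_(std::move(row_idx)), values_(std::move(values)) {
        assert(col_ptr_.size() == nc_ + 1);
        assert(row_idx_.size() == values_.size());
    }
    std::size_t nrow() const override { return nr_; }
    std::size_t ncol() const override { return nc_; }
    void get_col(std::size_t c, std::vector<double>& out) const override {
        assert(c < nc_);
        out.assign(nr_, 0.0);  // entries that are not stored are zero
        for (std::size_t k = col_ptr_[c]; k < col_ptr_[c + 1]; ++k) {
            out[row_idx_[k]] = values_[k];
        }
    }
};

// Representation-agnostic algorithm: written once against the interface,
// it runs unchanged on any backend.
std::vector<double> column_sums(const NumericMatrix& m) {
    std::vector<double> sums(m.ncol(), 0.0);
    std::vector<double> buffer;
    for (std::size_t c = 0; c < m.ncol(); ++c) {
        m.get_col(c, buffer);
        sums[c] = std::accumulate(buffer.begin(), buffer.end(), 0.0);
    }
    return sums;
}
```

In this sketch, `column_sums` gives the same result for a `DenseMatrix` and a `SparseMatrix` holding the same values, while the sparse backend stores only the non-zero entries. This is one simple instance of the memory/speed trade-off the description mentions: the sparse form saves memory but has to expand each column on access.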