
Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr

MOTIVATION: Genome-wide datasets produced for association studies have dramatically increased in size over the past few years, with modern datasets commonly including millions of variants measured in dozens of thousands of individuals. This increase in data size is a major challenge severely slowing...


Bibliographic Details
Main authors:	Privé, Florian, Aschard, Hugues, Ziyatdinov, Andrey, Blum, Michael G B
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2018
Subjects:	Original Papers
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6084588/ https://www.ncbi.nlm.nih.gov/pubmed/29617937 http://dx.doi.org/10.1093/bioinformatics/bty185

_version_	1783346198686466048
author	Privé, Florian Aschard, Hugues Ziyatdinov, Andrey Blum, Michael G B
author_sort	Privé, Florian
collection	PubMed
description	MOTIVATION: Genome-wide datasets produced for association studies have dramatically increased in size over the past few years, with modern datasets commonly including millions of variants measured in dozens of thousands of individuals. This increase in data size is a major challenge severely slowing down genomic analyses, leading to some software becoming obsolete and researchers having limited access to diverse analysis tools. RESULTS: Here we present two R packages, bigstatsr and bigsnpr, allowing for the analysis of large scale genomic data to be performed within R. To address large data size, the packages use memory-mapping for accessing data matrices stored on disk instead of in RAM. To perform data pre-processing and data analysis, the packages integrate most of the tools that are commonly used, either through transparent system calls to existing software, or through updated or improved implementation of existing methods. In particular, the packages implement fast and accurate computations of principal component analysis and association studies, functions to remove single nucleotide polymorphisms in linkage disequilibrium and algorithms to learn polygenic risk scores on millions of single nucleotide polymorphisms. We illustrate applications of the two R packages by analyzing a case–control genomic dataset for celiac disease, performing an association study and computing polygenic risk scores. Finally, we demonstrate the scalability of the R packages by analyzing a simulated genome-wide dataset including 500 000 individuals and 1 million markers on a single desktop computer. AVAILABILITY AND IMPLEMENTATION: https://privefl.github.io/bigstatsr/ and https://privefl.github.io/bigsnpr/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-6084588
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-6084588 2018-08-14 Bioinformatics Original Papers Oxford University Press 2018-08-15 2018-03-30 /pmc/articles/PMC6084588/ /pubmed/29617937 http://dx.doi.org/10.1093/bioinformatics/bty185 Text en © The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title	Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr
topic	Original Papers
