VariantSpark: population scale clustering of genotype information

BACKGROUND: Genomic information is increasingly used in medical practice giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. The widely used Hadoop MapReduce architecture and associated machine learning library, Mahout, prov...


Bibliographic Details
Main Authors:	O’Brien, Aidan R., Saunders, Neil F. W., Guo, Yi, Buske, Fabian A., Scott, Rodney J., Bauer, Denis C.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Software
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4676146/ https://www.ncbi.nlm.nih.gov/pubmed/26651996 http://dx.doi.org/10.1186/s12864-015-2269-7

_version_	1782405121413480448
author	O’Brien, Aidan R. Saunders, Neil F. W. Guo, Yi Buske, Fabian A. Scott, Rodney J. Bauer, Denis C.
author_facet	O’Brien, Aidan R. Saunders, Neil F. W. Guo, Yi Buske, Fabian A. Scott, Rodney J. Bauer, Denis C.
author_sort	O’Brien, Aidan R.
collection	PubMed
description	BACKGROUND: Genomic information is increasingly used in medical practice giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. The widely used Hadoop MapReduce architecture and associated machine learning library, Mahout, provide the means for tackling computationally challenging tasks. However, many genomic analyses do not fit the Map-Reduce paradigm. We therefore utilise the recently developed Spark engine, along with its associated machine learning library, MLlib, which offers more flexibility in the parallelisation of population-scale bioinformatics tasks. The resulting tool, VariantSpark provides an interface from MLlib to the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results. RESULTS: To demonstrate the capabilities of VariantSpark, we clustered more than 3,000 individuals with 80 Million variants each to determine the population structure in the dataset. VariantSpark is 80 % faster than the Spark-based genome clustering approach, adam, the comparable implementation using Hadoop/Mahout, as well as Admixture, a commonly used tool for determining individual ancestries. It is over 90 % faster than traditional implementations using R and Python. CONCLUSION: The benefits of speed, resource consumption and scalability enables VariantSpark to open up the usage of advanced, efficient machine learning algorithms to genomic data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-015-2269-7) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4676146
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-46761462015-12-12 VariantSpark: population scale clustering of genotype information O’Brien, Aidan R. Saunders, Neil F. W. Guo, Yi Buske, Fabian A. Scott, Rodney J. Bauer, Denis C. BMC Genomics Software BACKGROUND: Genomic information is increasingly used in medical practice giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. The widely used Hadoop MapReduce architecture and associated machine learning library, Mahout, provide the means for tackling computationally challenging tasks. However, many genomic analyses do not fit the Map-Reduce paradigm. We therefore utilise the recently developed Spark engine, along with its associated machine learning library, MLlib, which offers more flexibility in the parallelisation of population-scale bioinformatics tasks. The resulting tool, VariantSpark provides an interface from MLlib to the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results. RESULTS: To demonstrate the capabilities of VariantSpark, we clustered more than 3,000 individuals with 80 Million variants each to determine the population structure in the dataset. VariantSpark is 80 % faster than the Spark-based genome clustering approach, adam, the comparable implementation using Hadoop/Mahout, as well as Admixture, a commonly used tool for determining individual ancestries. It is over 90 % faster than traditional implementations using R and Python. CONCLUSION: The benefits of speed, resource consumption and scalability enables VariantSpark to open up the usage of advanced, efficient machine learning algorithms to genomic data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-015-2269-7) contains supplementary material, which is available to authorized users. 
BioMed Central 2015-12-10 /pmc/articles/PMC4676146/ /pubmed/26651996 http://dx.doi.org/10.1186/s12864-015-2269-7 Text en © O’Brien et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Software O’Brien, Aidan R. Saunders, Neil F. W. Guo, Yi Buske, Fabian A. Scott, Rodney J. Bauer, Denis C. VariantSpark: population scale clustering of genotype information
title	VariantSpark: population scale clustering of genotype information
title_full	VariantSpark: population scale clustering of genotype information
title_fullStr	VariantSpark: population scale clustering of genotype information
title_full_unstemmed	VariantSpark: population scale clustering of genotype information
title_short	VariantSpark: population scale clustering of genotype information
title_sort	variantspark: population scale clustering of genotype information
topic	Software
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4676146/ https://www.ncbi.nlm.nih.gov/pubmed/26651996 http://dx.doi.org/10.1186/s12864-015-2269-7
work_keys_str_mv	AT obrienaidanr variantsparkpopulationscaleclusteringofgenotypeinformation AT saundersneilfw variantsparkpopulationscaleclusteringofgenotypeinformation AT guoyi variantsparkpopulationscaleclusteringofgenotypeinformation AT buskefabiana variantsparkpopulationscaleclusteringofgenotypeinformation AT scottrodneyj variantsparkpopulationscaleclusteringofgenotypeinformation AT bauerdenisc variantsparkpopulationscaleclusteringofgenotypeinformation
