Nephele: genotyping via complete composition vectors and MapReduce

BACKGROUND: Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evol...

Bibliographic Details
Main Authors:	Colosimo, Marc E; Peterson, Matthew W; Mardis, Scott; Hirschman, Lynette
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2011
Subjects:	Software Review
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3182884/ https://www.ncbi.nlm.nih.gov/pubmed/21851626 http://dx.doi.org/10.1186/1751-0473-6-13

_version_	1782212938179805184
author	Colosimo, Marc E; Peterson, Matthew W; Mardis, Scott; Hirschman, Lynette
author_sort	Colosimo, Marc E
collection	PubMed
description	BACKGROUND: Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism, mainly due to the presence of insertions/deletions/rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata, such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for the rapidly growing sequence datasets being generated by next-generation DNA sequencing machines, because their computational complexity increases quickly with the number of sequences.
RESULTS: Nephele is a suite of tools that uses the complete composition vector algorithm to represent each sequence in the dataset as a vector derived from its constituent k-mers, bypassing the need for multiple sequence alignment, and affinity propagation clustering to group the sequences into genotypes based on a distance measure over the vectors. Our methods produce results that correlate well with expert-defined clades or genotypes, at a fraction of the computational cost of traditional phylogenetic methods run on traditional hardware. Nephele can use the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes. We were able to generate a neighbour-joined tree of over 10,000 16S samples in less than 2 hours.
CONCLUSIONS: We conclude that using Nephele can substantially decrease the processing time required for generating genotype trees of tens to hundreds of organisms at genome-scale sequence coverage.
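The vector-and-distance step named in the description can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Nephele's code: it uses plain overlapping k-mer counts and cosine distance, whereas the complete composition vector approach is usually described as also correcting each k-mer count against an expected background frequency. The clustering and MapReduce stages are left out, and the function names and the default `k=3` are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_vector(seq, k):
    """Count every overlapping k-mer in seq (a plain frequency vector,
    without the background correction a full composition vector applies)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse k-mer count vectors."""
    # Counter returns 0 for missing keys, so absent k-mers contribute nothing.
    dot = sum(count * b[kmer] for kmer, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty vector is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def distance_matrix(seqs, k=3):
    """Pairwise distances for a {name: sequence} mapping, keyed by name pair."""
    vectors = {name: kmer_vector(s, k) for name, s in seqs.items()}
    names = sorted(vectors)
    return {
        (x, y): cosine_distance(vectors[x], vectors[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }
```

Because no alignment is needed, each sequence's vector can be computed independently, which is what makes this kind of step a natural fit for a map phase; the pairwise distances would then feed a clustering or tree-building stage.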
format	Online Article Text
id	pubmed-3182884
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3182884 2011-09-30 Nephele: genotyping via complete composition vectors and MapReduce Colosimo, Marc E; Peterson, Matthew W; Mardis, Scott; Hirschman, Lynette Source Code Biol Med Software Review BioMed Central 2011-08-18 /pmc/articles/PMC3182884/ /pubmed/21851626 http://dx.doi.org/10.1186/1751-0473-6-13 Text en Copyright ©2011 Colosimo et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Nephele: genotyping via complete composition vectors and MapReduce
topic	Software Review
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3182884/ https://www.ncbi.nlm.nih.gov/pubmed/21851626 http://dx.doi.org/10.1186/1751-0473-6-13

Similar Items