
h5vc: scalable nucleotide tallies with HDF5

Summary: As applications of genome sequencing, including exomes and whole genomes, are expanding, there is a need for analysis tools that are scalable to large sets of samples and/or ultra-deep coverage. Many current tool chains are based on the widely used file formats BAM and VCF or VCF-derivative...


Bibliographic Details
Main Authors: Pyl, Paul Theodor, Gehring, Julian, Fischer, Bernd, Huber, Wolfgang
Format: Online Article Text
Language: English
Published: Oxford University Press 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4016699/
https://www.ncbi.nlm.nih.gov/pubmed/24451629
http://dx.doi.org/10.1093/bioinformatics/btu026
author Pyl, Paul Theodor
Gehring, Julian
Fischer, Bernd
Huber, Wolfgang
collection PubMed
description Summary: As applications of genome sequencing, including exomes and whole genomes, are expanding, there is a need for analysis tools that are scalable to large sets of samples and/or ultra-deep coverage. Many current tool chains are based on the widely used file formats BAM and VCF or VCF-derivatives. However, for some desirable analyses, data management with these formats creates substantial implementation overhead, and much time is spent parsing files and collating data. We observe that a tally data structure, i.e. the table of counts of nucleotides × samples × strands × genomic positions, provides a reasonable intermediate level of abstraction for many genomics analyses, including single nucleotide variant (SNV) and InDel calling, copy-number estimation and mutation spectrum analysis. Here we present h5vc, a data structure and associated software for managing tallies. The software contains functionality for creating tallies from BAM files, flexible and scalable data visualization, data quality assessment, computing statistics relevant to variant calling and other applications. Through the simplicity of its API, we envision making low-level analysis of large sets of genome sequencing data accessible to a wider range of researchers. Availability and implementation: The package h5vc for the statistical environment R is available through the Bioconductor project. The HDF5 system is used as the core of our implementation. Contact: pyl@embl.de or whuber@embl.de Supplementary information: Supplementary data are available at Bioinformatics online.
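The tally described in the summary above (counts of nucleotides × samples × strands × genomic positions) can be illustrated with a small sketch. This is not the h5vc API and not its HDF5 layout: the names `make_tally`, `add_read`, `coverage` and `allele_fraction` are hypothetical, reads are assumed to be ungapped, and an in-memory nested list stands in for the on-disk storage that the package handles through HDF5.

```python
# Hypothetical illustration of a nucleotide tally; not the h5vc implementation.
# Index order: tally[base][sample][strand][position]
BASES = "ACGT"
STRANDS = "+-"


def make_tally(n_samples, n_positions):
    """Return a zero-filled tally for the given number of samples and positions."""
    return [[[[0] * n_positions for _ in STRANDS]
             for _ in range(n_samples)]
            for _ in BASES]


def add_read(tally, sample, strand, start, sequence):
    """Count every base of an ungapped read aligned at 0-based position `start`."""
    s = STRANDS.index(strand)
    n_positions = len(tally[0][0][0])
    for offset, base in enumerate(sequence):
        pos = start + offset
        # Skip ambiguous bases (e.g. N) and positions outside the tallied region.
        if base in BASES and 0 <= pos < n_positions:
            tally[BASES.index(base)][sample][s][pos] += 1


def coverage(tally, sample, pos):
    """Total number of counted bases at `pos` for one sample, both strands."""
    return sum(tally[b][sample][s][pos]
               for b in range(len(BASES))
               for s in range(len(STRANDS)))


def allele_fraction(tally, sample, pos, base):
    """Fraction of counted bases at `pos` equal to `base`, or None at zero depth."""
    depth = coverage(tally, sample, pos)
    if depth == 0:
        return None
    b = BASES.index(base)
    return sum(tally[b][sample][s][pos] for s in range(len(STRANDS))) / depth


# Two samples, five positions; the reads below are made up for the example.
tally = make_tally(n_samples=2, n_positions=5)
add_read(tally, sample=0, strand="+", start=0, sequence="ACGTA")
add_read(tally, sample=0, strand="-", start=1, sequence="CGTA")
add_read(tally, sample=1, strand="+", start=1, sequence="CTT")
```

Once the counts are in one table, per-position quantities such as coverage or a sample-specific allele fraction reduce to sums over the nucleotide and strand axes, which is the kind of intermediate abstraction the summary argues for.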
format Online
Article
Text
id pubmed-4016699
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4016699 2014-05-12 h5vc: scalable nucleotide tallies with HDF5 Pyl, Paul Theodor Gehring, Julian Fischer, Bernd Huber, Wolfgang Bioinformatics Applications Notes Oxford University Press 2014-05-15 2014-01-21 /pmc/articles/PMC4016699/ /pubmed/24451629 http://dx.doi.org/10.1093/bioinformatics/btu026 Text en © The Author 2014. Published by Oxford University Press.
http://creativecommons.org/licenses/by/3.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Applications Notes
Pyl, Paul Theodor
Gehring, Julian
Fischer, Bernd
Huber, Wolfgang
h5vc: scalable nucleotide tallies with HDF5
title h5vc: scalable nucleotide tallies with HDF5
topic Applications Notes
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4016699/
https://www.ncbi.nlm.nih.gov/pubmed/24451629
http://dx.doi.org/10.1093/bioinformatics/btu026