Sparse Project VCF: efficient encoding of population genotype matrices

SUMMARY: Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, suffers rapid size growth as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduct...

Bibliographic Details
Main Authors: Lin, Michael F; Bai, Xiaodong; Salerno, William J; Reid, Jeffrey G
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects: Applications Notes
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8016461/
https://www.ncbi.nlm.nih.gov/pubmed/33300997
http://dx.doi.org/10.1093/bioinformatics/btaa1004
author Lin, Michael F
Bai, Xiaodong
Salerno, William J
Reid, Jeffrey G
collection PubMed
description SUMMARY: Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, suffers rapid size growth as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering >10× size reduction for modern studies with practically minimal information loss. spVCF interoperates with VCF efficiently, including tabix-based random access. We demonstrate its effectiveness with the DiscovEHR and UK Biobank whole-exome sequencing cohorts. AVAILABILITY AND IMPLEMENTATION: Apache-licensed reference implementation: github.com/mlin/spVCF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-8016461
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8016461
2021-04-07
Sparse Project VCF: efficient encoding of population genotype matrices
Lin, Michael F
Bai, Xiaodong
Salerno, William J
Reid, Jeffrey G
Bioinformatics
Applications Notes
SUMMARY: Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, suffers rapid size growth as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering >10× size reduction for modern studies with practically minimal information loss. spVCF interoperates with VCF efficiently, including tabix-based random access. We demonstrate its effectiveness with the DiscovEHR and UK Biobank whole-exome sequencing cohorts. AVAILABILITY AND IMPLEMENTATION: Apache-licensed reference implementation: github.com/mlin/spVCF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Oxford University Press
2020-12-10
/pmc/articles/PMC8016461/
/pubmed/33300997
http://dx.doi.org/10.1093/bioinformatics/btaa1004
Text
en
© The Author(s) 2020. Published by Oxford University Press.
http://creativecommons.org/licenses/by/4.0/
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Sparse Project VCF: efficient encoding of population genotype matrices
topic Applications Notes
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8016461/
https://www.ncbi.nlm.nih.gov/pubmed/33300997
http://dx.doi.org/10.1093/bioinformatics/btaa1004