Sparse Project VCF: efficient encoding of population genotype matrices
SUMMARY: Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, suffers rapid size growth as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduct...
Main authors: | Lin, Michael F; Bai, Xiaodong; Salerno, William J; Reid, Jeffrey G |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2020 |
Subjects: | Applications Notes |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8016461/ https://www.ncbi.nlm.nih.gov/pubmed/33300997 http://dx.doi.org/10.1093/bioinformatics/btaa1004 |
_version_ | 1783673864848408576 |
---|---|
author | Lin, Michael F; Bai, Xiaodong; Salerno, William J; Reid, Jeffrey G |
collection | PubMed |
description | SUMMARY: Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, suffers rapid size growth as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering >10× size reduction for modern studies with practically minimal information loss. spVCF interoperates with VCF efficiently, including tabix-based random access. We demonstrate its effectiveness with the DiscovEHR and UK Biobank whole-exome sequencing cohorts. AVAILABILITY AND IMPLEMENTATION: Apache-licensed reference implementation: github.com/mlin/spVCF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-8016461 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-8016461 · 2021-04-07 · Sparse Project VCF: efficient encoding of population genotype matrices · Lin, Michael F; Bai, Xiaodong; Salerno, William J; Reid, Jeffrey G · Bioinformatics · Applications Notes · SUMMARY: Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, suffers rapid size growth as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering >10× size reduction for modern studies with practically minimal information loss. spVCF interoperates with VCF efficiently, including tabix-based random access. We demonstrate its effectiveness with the DiscovEHR and UK Biobank whole-exome sequencing cohorts. AVAILABILITY AND IMPLEMENTATION: Apache-licensed reference implementation: github.com/mlin/spVCF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. · Oxford University Press · 2020-12-10 · /pmc/articles/PMC8016461/ · /pubmed/33300997 · http://dx.doi.org/10.1093/bioinformatics/btaa1004 · Text · en · © The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Sparse Project VCF: efficient encoding of population genotype matrices |
topic | Applications Notes |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8016461/ https://www.ncbi.nlm.nih.gov/pubmed/33300997 http://dx.doi.org/10.1093/bioinformatics/btaa1004 |