A negative storage model for precise but compact storage of genetic variation data

Falling sequencing costs and large initiatives are resulting in increasing amounts of data available for investigator use. However, there are informatics challenges in being able to access genomic data. Performance and storage are well-appreciated issues, but precision is critical for meaningful ana...

Bibliographic Details
Main Authors: Gonzalez-Calderon, Guillermo, Liu, Ruizheng, Carvajal, Rodrigo, Teer, Jamie K
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7157186/
https://www.ncbi.nlm.nih.gov/pubmed/32293013
http://dx.doi.org/10.1093/database/baz158
author Gonzalez-Calderon, Guillermo
Liu, Ruizheng
Carvajal, Rodrigo
Teer, Jamie K
collection PubMed
description Falling sequencing costs and large initiatives are resulting in increasing amounts of data available for investigator use. However, there are informatics challenges in being able to access genomic data. Performance and storage are well-appreciated issues, but precision is critical for meaningful analysis and interpretation of genomic data. There is an inherent accuracy vs. performance trade-off with existing solutions. The most common approach (Variant-only Storage Model, VOSM) stores only variant data. Systems must therefore assume that everything not variant is reference, sacrificing precision and potentially accuracy. A more complete model (Full Storage Model, FSM) would store the state of every base (variant, reference and missing) in the genome thereby sacrificing performance. A compressed variation of the FSM can store the state of contiguous regions of the genome as blocks (Block Storage Model, BLSM), much like the file-based gVCF model. We propose a novel approach by which this state is encoded such that both performance and accuracy are maintained. The Negative Storage Model (NSM) can store and retrieve precise genomic state from different sequencing sources, including clinical and whole exome sequencing panels. Reduced storage requirements are achieved by storing only the variant and missing states and inferring the reference state. We evaluate the performance characteristics of FSM, BLSM and NSM and demonstrate dramatic improvements in storage and performance using the NSM approach.
format Online
Article
Text
id pubmed-7157186
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7157186 2020-04-20 Database (Oxford) Technical Report Oxford University Press 2020-04-15 /pmc/articles/PMC7157186/ /pubmed/32293013 http://dx.doi.org/10.1093/database/baz158 Text en © The Author(s) 2020. Published by Oxford University Press.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title A negative storage model for precise but compact storage of genetic variation data
topic Technical Report
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7157186/
https://www.ncbi.nlm.nih.gov/pubmed/32293013
http://dx.doi.org/10.1093/database/baz158
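
The description field above says the Negative Storage Model stores only the variant and missing states and infers the reference state. The following is a minimal Python sketch of that idea, not the authors' implementation: the class name `NegativeStore`, its methods, the in-memory dictionaries, and the 1-based inclusive, non-overlapping interval convention are all assumptions made for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class NegativeStore:
    """Hypothetical per-sample store that keeps only variant and missing states."""

    # (chrom, pos) -> alternate allele
    variants: dict = field(default_factory=dict)
    # chrom -> sorted list of (start, end); 1-based, inclusive, assumed non-overlapping
    missing: dict = field(default_factory=dict)

    def add_variant(self, chrom, pos, alt):
        self.variants[(chrom, pos)] = alt

    def add_missing(self, chrom, start, end):
        intervals = self.missing.setdefault(chrom, [])
        intervals.append((start, end))
        intervals.sort()

    def state(self, chrom, pos):
        # Stored explicitly: variant calls.
        if (chrom, pos) in self.variants:
            return "variant"
        # Stored explicitly: regions with no usable data.
        intervals = self.missing.get(chrom, [])
        i = bisect_right(intervals, (pos, float("inf"))) - 1
        if i >= 0 and intervals[i][0] <= pos <= intervals[i][1]:
            return "missing"
        # Never stored: anything that is neither variant nor missing is reference.
        return "reference"


store = NegativeStore()
store.add_variant("chr1", 100, "T")
store.add_missing("chr1", 200, 300)
for pos in (100, 150, 250):
    print(pos, store.state("chr1", pos))
```

Because missing regions are recorded explicitly, a position with no stored entry can be reported as reference rather than assumed to be reference, which is the distinction the description draws against the Variant-only Storage Model.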