Partitioned Interleaved Bloom filters using Optane DC Persistent Memory

The recent improvements of full genome sequencing technologies, commonly subsumed under the term NGS (Next Generation Sequencing), have tremendously increased the sequencing throughput. Within 10 years it rose from 21 billion base pairs collected over months to about 400 billion bas...

Bibliographic Details
Main author: Seiler, Enrico
Language: eng
Published: 2019
Subjects:
Online access: http://cds.cern.ch/record/2691435
_version_ 1780963853954187264
author Seiler, Enrico
author_facet Seiler, Enrico
author_sort Seiler, Enrico
collection CERN
description The recent improvements in full-genome sequencing technologies, commonly subsumed under the term NGS (Next Generation Sequencing), have tremendously increased sequencing throughput. Within 10 years it rose from 21 billion base pairs collected over months to about 400 billion base pairs per day (the current throughput of Illumina's HiSeq 4000). The cost of producing one million base pairs has also fallen from 140,000 dollars to a few cents. As a result of this dramatic development, the number of new data submissions to genomic databases, generated by various biotechnological protocols (ChIP-Seq, RNA-Seq, etc.), has grown dramatically and is expected to continue to grow faster than the cost of storage devices falls and their capacity rises. The main task in analyzing NGS data is to search for sequencing reads, short sequence patterns (e.g. exon/intron boundary read-through patterns), or expression profiles in large collections of sequences (i.e. a database). Searching the entirety of such databases is usually only possible by searching the metadata or a set of results initially obtained from the experiment. Searching (approximately) for a specific genomic sequence in all the data has not been possible in reasonable computational time. In this work we describe results for our new data structure, called a binning directory, which can distribute approximate search queries and is based on an extension of our recently introduced Interleaved Bloom Filters (IBF) called x-partitioned IBF (x-PIBF). The results presented here make use of Intel's Optane DC Persistent Memory architecture and achieve significant speedups compared to a disk-based solution.
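The abstract names the Interleaved Bloom Filter without showing its layout, so the following is a minimal sketch of the general idea only, not the author's implementation. It assumes one Bloom filter per bin, all of the same size, stored so that the bits of every bin for a given hash position sit next to each other; a single lookup then answers "which bins may contain this k-mer?" for all bins at once. The class name `InterleavedBloomFilter`, the helper `kmers`, the use of Python integers as bit rows, and the salted BLAKE2b hashing are all illustrative assumptions. The x-partitioning and the Optane DC Persistent Memory placement described in the record are not modeled.

```python
import hashlib


class InterleavedBloomFilter:
    """Toy interleaved Bloom filter: one Bloom filter per bin, stored interleaved."""

    def __init__(self, num_bins, bits_per_bin, num_hashes=2):
        self.num_bins = num_bins
        self.bits_per_bin = bits_per_bin
        self.num_hashes = num_hashes
        # rows[p] is an integer bitmask; bit b holds position p of bin b's filter.
        self.rows = [0] * bits_per_bin

    def _positions(self, kmer):
        # Hypothetical hashing scheme: salted BLAKE2b digests reduced modulo the filter size.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.bits_per_bin

    def insert(self, kmer, bin_index):
        # Set the bit of this bin in every row the k-mer hashes to.
        for pos in self._positions(kmer):
            self.rows[pos] |= 1 << bin_index

    def query(self, kmer):
        # AND the hashed rows: a bin survives only if all of its positions are set.
        mask = (1 << self.num_bins) - 1
        for pos in self._positions(kmer):
            mask &= self.rows[pos]
        return [b for b in range(self.num_bins) if (mask >> b) & 1]


def kmers(sequence, k):
    """Return all overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    ibf = InterleavedBloomFilter(num_bins=4, bits_per_bin=1024)
    for kmer in kmers("ACGTACGTGG", 4):
        ibf.insert(kmer, bin_index=0)
    for kmer in kmers("TTTTGGGGCC", 4):
        ibf.insert(kmer, bin_index=2)
    # Bin 0 is always reported for "ACGT"; other bins can appear as false positives.
    print(ibf.query("ACGT"))
```

Like any Bloom filter, this sketch never misses a bin that actually holds the k-mer, but it may report extra bins. Counting, per bin, how many k-mers of a read are reported is one common way to turn such lookups into an approximate search.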
id cern-2691435
institution Organización Europea para la Investigación Nuclear
language eng
publishDate 2019
record_format invenio
spelling cern-2691435 2022-11-02T22:24:40Z http://cds.cern.ch/record/2691435 eng Seiler, Enrico Partitioned Interleaved Bloom filters using Optane DC Persistent Memory IXPUG 2019 Annual Conference at CERN other events or meetings oai:cds.cern.ch:2691435 2019
spellingShingle other events or meetings
Seiler, Enrico
Partitioned Interleaved Bloom filters using Optane DC Persistent Memory
title Partitioned Interleaved Bloom filters using Optane DC Persistent Memory
title_full Partitioned Interleaved Bloom filters using Optane DC Persistent Memory
title_fullStr Partitioned Interleaved Bloom filters using Optane DC Persistent Memory
title_full_unstemmed Partitioned Interleaved Bloom filters using Optane DC Persistent Memory
title_short Partitioned Interleaved Bloom filters using Optane DC Persistent Memory
title_sort partitioned interleaved bloom filters using optane dc persistent memory
topic other events or meetings
url http://cds.cern.ch/record/2691435
work_keys_str_mv AT seilerenrico partitionedinterleavedbloomfiltersusingoptanedcpersistentmemory
AT seilerenrico ixpug2019annualconferenceatcern