Fast probabilistic file fingerprinting for big data

BACKGROUND: Biological data acquisition is raising new challenges, both in data analysis and handling. Not only is it proving hard to analyze the data at the rate it is generated today, but simply reading and transferring data files can be prohibitively slow due to their size. This primarily concern...

Bibliographic Details
Main Authors:	Tretyakov, Konstantin, Laur, Sven, Smant, Geert, Vilo, Jaak, Prins, Pjotr
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3582436/ https://www.ncbi.nlm.nih.gov/pubmed/23445565 http://dx.doi.org/10.1186/1471-2164-14-S2-S8

_version_	1782260561881333760
author	Tretyakov, Konstantin Laur, Sven Smant, Geert Vilo, Jaak Prins, Pjotr
collection	PubMed
description	BACKGROUND: Biological data acquisition is raising new challenges, both in data analysis and handling. Not only is it proving hard to analyze the data at the rate at which it is generated today, but simply reading and transferring data files can be prohibitively slow due to their size. This primarily concerns logistics within and between data centers, but is also important for workstation users in the analysis phase. Common usage patterns, such as comparing and transferring files, are proving computationally expensive and are tying down shared resources. RESULTS: We present an efficient method for calculating file uniqueness for large scientific data files that takes less computational effort than existing techniques. This method, called Probabilistic Fast File Fingerprinting (PFFF), exploits the variation present in biological data and computes file fingerprints by sampling randomly from the file instead of reading it in full. Consequently, it has a flat performance characteristic, correlated with data variation rather than file size. We demonstrate that probabilistic fingerprinting can be as reliable as existing hashing techniques, with provably negligible risk of collisions. We measure the performance of the algorithm on a number of data storage and access technologies, identifying its strengths as well as its limitations. CONCLUSIONS: Probabilistic fingerprinting may significantly reduce the use of computational resources when comparing very large files. Utilisation of probabilistic fingerprinting techniques can increase the speed of common file-related workflows, both in the data center and for workbench analysis. The implementation of the algorithm is available as open-source software named pfff, both as a command-line tool and as a C library. The tool can be downloaded from http://biit.cs.ut.ee/pfff.
format	Online Article Text
id	pubmed-3582436
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3582436 2013-03-05 Fast probabilistic file fingerprinting for big data Tretyakov, Konstantin Laur, Sven Smant, Geert Vilo, Jaak Prins, Pjotr BMC Genomics Research BioMed Central 2013-02-15 /pmc/articles/PMC3582436/ /pubmed/23445565 http://dx.doi.org/10.1186/1471-2164-14-S2-S8 Text en Copyright ©2013 Tretyakov et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Fast probabilistic file fingerprinting for big data
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3582436/ https://www.ncbi.nlm.nih.gov/pubmed/23445565 http://dx.doi.org/10.1186/1471-2164-14-S2-S8
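
The abstract describes computing a fingerprint by sampling randomly from a file instead of reading it in full. The sketch below illustrates that general idea only; it is not the pfff algorithm or its API. The function name `sampled_fingerprint`, the parameters `samples`, `block` and `seed`, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import os
import random


def sampled_fingerprint(path, samples=1024, block=16, seed=0):
    """Hash the file size plus `samples` blocks read at pseudo-random offsets.

    Hypothetical helper for illustration; not the pfff implementation.
    """
    size = os.path.getsize(path)
    digest = hashlib.sha256()
    # Feed the size into the hash so it contributes to the fingerprint.
    digest.update(size.to_bytes(8, "big"))

    if size <= samples * block:
        # Small file: sampling saves nothing, so hash the whole content.
        with open(path, "rb") as f:
            digest.update(f.read())
        return digest.hexdigest()

    # A fixed seed makes every file of the same size use the same offsets,
    # which is what makes two fingerprints comparable.
    rng = random.Random(seed)
    offsets = sorted(rng.randrange(size - block + 1) for _ in range(samples))

    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            digest.update(f.read(block))
    return digest.hexdigest()
```

The amount of data read is bounded by `samples * block` regardless of file size. In this sketch, two files of equal size that differ only in bytes outside the sampled blocks would receive the same fingerprint, so its reliability depends on how much the data varies.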
