
Engineering the CernVM-Filesystem as a High Bandwidth Distributed Filesystem for Auxiliary Physics Data

A common use pattern in the computing models of particle physics experiments is running many distributed applications that read from a shared set of data files. We refer to this data as auxiliary data, to distinguish it from (a) event data from the detector (which tends to be different for every job...


Bibliographic Details
Main authors: Dykstra, D; Bockelman, B; Blomer, J; Herner, K; Levshina, T; Slyz, M
Language: eng
Published: 2015
Subjects: Computing and Computers
Online access: https://dx.doi.org/10.1088/1742-6596/664/4/042012
http://cds.cern.ch/record/2134558
author Dykstra, D
Bockelman, B
Blomer, J
Herner, K
Levshina, T
Slyz, M
collection CERN
description A common use pattern in the computing models of particle physics experiments is running many distributed applications that read from a shared set of data files. We refer to this data as auxiliary data, to distinguish it from (a) event data from the detector (which tends to be different for every job), and (b) conditions data about the detector (which tends to be the same for each job in a batch of jobs). Conditions data also tends to be relatively small per job, whereas both event data and auxiliary data are larger per job. Unlike event data, auxiliary data comes from a limited working set of shared files. Since there is spatial locality in the auxiliary data access, the use case appears to be identical to that of the CernVM-Filesystem (CVMFS). However, we show that distributing auxiliary data through CVMFS causes the existing CVMFS infrastructure to perform poorly. We utilize a CVMFS client feature called 'alien cache' to cache data on existing local high-bandwidth data servers that were engineered for storing event data. This cache is shared between the worker nodes at a site and replaces caching CVMFS files on both the worker node local disks and on the site's local squids. We have tested this alien cache with the dCache NFSv4.1 interface, Lustre, and the Hadoop Distributed File System (HDFS) FUSE interface, and measured performance. In addition, we use high-bandwidth data servers at central sites to perform the CVMFS Stratum 1 function instead of the low-bandwidth web servers deployed for the CVMFS software distribution function. We have tested this using the dCache HTTP interface. As a result, we have a design for an end-to-end high-bandwidth distributed caching read-only filesystem, using existing client software already widely deployed to grid worker nodes and existing file servers already widely installed at grid sites. Files are published in a central place and are soon available on demand throughout the grid and cached locally on the site with a convenient POSIX interface. This paper discusses the details of the architecture and reports performance measurements.
id oai-inspirehep.net-1413844
institution European Organization for Nuclear Research (CERN)
language eng
publishDate 2015
record_format invenio
spelling oai-inspirehep.net-1413844; 2022-08-10T13:00:52Z; doi:10.1088/1742-6596/664/4/042012; http://cds.cern.ch/record/2134558; eng; Computing and Computers; FERMILAB-CONF-15-211-CD; oai:inspirehep.net:1413844; 2015
title Engineering the CernVM-Filesystem as a High Bandwidth Distributed Filesystem for Auxiliary Physics Data
topic Computing and Computers
url https://dx.doi.org/10.1088/1742-6596/664/4/042012
http://cds.cern.ch/record/2134558