Controlled overflowing of data-intensive jobs from oversubscribed sites

The CMS analysis computing model has always relied on jobs running near the data, with data allocation between CMS compute centers organized at management level, based on expected needs of the CMS community. While this model provided high CPU utilization during job run times, there were times when...

Bibliographic Details
Main authors: Sfiligoi, Igor; Wuerthwein, Frank Karl; Bockelman, Brian Paul; Bradley, Daniel Charles; Tadel, Matevz; Bloom, Kenneth Arthur; Letts, James; Mrak Tadel, Alja
Language: eng
Published: 2012
Subjects: Detectors and Experimental Techniques
Online access: https://dx.doi.org/10.1088/1742-6596/396/3/032102
http://cds.cern.ch/record/1458476
_version_ 1780925160457502720
author Sfiligoi, Igor
Wuerthwein, Frank Karl
Bockelman, Brian Paul
Bradley, Daniel Charles
Tadel, Matevz
Bloom, Kenneth Arthur
Letts, James
Mrak Tadel, Alja
author_sort Sfiligoi, Igor
collection CERN
description The CMS analysis computing model has always relied on jobs running near the data, with data allocation between CMS compute centers organized at management level, based on expected needs of the CMS community. While this model provided high CPU utilization during job run times, there were times when a large fraction of CPUs at certain sites sat idle due to lack of demand, all while terabytes of data were never accessed. To improve the utilization of both CPU and disks, CMS is moving toward controlled overflowing of jobs from sites that have data but are oversubscribed to others with spare CPU and network capacity, with those jobs accessing the data through real-time Xrootd streaming over WAN. The major limiting factor for remote data access is the ability of the source storage system to serve such data, so the number of jobs accessing it must be carefully controlled. The CMS approach is to implement the overflowing by means of glideinWMS, a Condor-based pilot system, providing the WMS with the known storage limits and letting it schedule jobs within those limits. This paper presents the detailed architecture of the overflow-enabled glideinWMS system, together with operational experience from the past 6 months.
id cern-1458476
institution Organización Europea para la Investigación Nuclear
language eng
publishDate 2012
record_format invenio
spelling cern-1458476 | 2019-09-30T06:29:59Z | doi:10.1088/1742-6596/396/3/032102 | http://cds.cern.ch/record/1458476 | eng | Sfiligoi, Igor | Wuerthwein, Frank Karl | Bockelman, Brian Paul | Bradley, Daniel Charles | Tadel, Matevz | Bloom, Kenneth Arthur | Letts, James | Mrak Tadel, Alja | Controlled overflowing of data-intensive jobs from oversubscribed sites | Detectors and Experimental Techniques | The CMS analysis computing model has always relied on jobs running near the data, with data allocation between CMS compute centers organized at management level, based on expected needs of the CMS community. While this model provided high CPU utilization during job run times, there were times when a large fraction of CPUs at certain sites sat idle due to lack of demand, all while terabytes of data were never accessed. To improve the utilization of both CPU and disks, CMS is moving toward controlled overflowing of jobs from sites that have data but are oversubscribed to others with spare CPU and network capacity, with those jobs accessing the data through real-time Xrootd streaming over WAN. The major limiting factor for remote data access is the ability of the source storage system to serve such data, so the number of jobs accessing it must be carefully controlled. The CMS approach is to implement the overflowing by means of glideinWMS, a Condor-based pilot system, providing the WMS with the known storage limits and letting it schedule jobs within those limits. This paper presents the detailed architecture of the overflow-enabled glideinWMS system, together with operational experience from the past 6 months. | CMS-CR-2012-069 | oai:cds.cern.ch:1458476 | 2012-05-10
title Controlled overflowing of data-intensive jobs from oversubscribed sites
topic Detectors and Experimental Techniques
url https://dx.doi.org/10.1088/1742-6596/396/3/032102
http://cds.cern.ch/record/1458476