Controlled overflowing of data-intensive jobs from oversubscribed sites
The CMS analysis computing model has always relied on jobs running near the data, with data allocation between CMS compute centers organized at the management level, based on the expected needs of the CMS community. While this model provided high CPU utilization during job run times, there were times when...
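Main authors: | Sfiligoi, Igor; Wuerthwein, Frank Karl; Bockelman, Brian Paul; Bradley, Daniel Charles; Tadel, Matevz; Bloom, Kenneth Arthur; Letts, James; Mrak Tadel, Alja |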
---|---|
Language: | eng |
Published: | 2012 |
Subjects: | Detectors and Experimental Techniques |
Online access: | https://dx.doi.org/10.1088/1742-6596/396/3/032102 http://cds.cern.ch/record/1458476 |
_version_ | 1780925160457502720 |
---|---|
author | Sfiligoi, Igor; Wuerthwein, Frank Karl; Bockelman, Brian Paul; Bradley, Daniel Charles; Tadel, Matevz; Bloom, Kenneth Arthur; Letts, James; Mrak Tadel, Alja |
author_facet | Sfiligoi, Igor; Wuerthwein, Frank Karl; Bockelman, Brian Paul; Bradley, Daniel Charles; Tadel, Matevz; Bloom, Kenneth Arthur; Letts, James; Mrak Tadel, Alja |
author_sort | Sfiligoi, Igor |
collection | CERN |
description | The CMS analysis computing model has always relied on jobs running near the data, with data allocation between CMS compute centers organized at the management level, based on the expected needs of the CMS community. While this model provided high CPU utilization during job run times, there were times when a large fraction of CPUs at certain sites sat idle due to lack of demand, all while terabytes of data were never accessed. To improve the utilization of both CPU and disks, CMS is moving toward controlled overflowing of jobs from sites that have data but are oversubscribed to others with spare CPU and network capacity, with those jobs accessing the data through real-time Xrootd streaming over the WAN. The major limiting factor for remote data access is the ability of the source storage system to serve such data, so the number of jobs accessing it must be carefully controlled. The CMS approach is to implement the overflowing by means of glideinWMS, a Condor-based pilot system, and to provide the WMS with the known storage limits, letting it schedule jobs within those limits. This paper presents the detailed architecture of the overflow-enabled glideinWMS system, together with operational experience from the past 6 months. |
id | cern-1458476 |
institution | Organización Europea para la Investigación Nuclear |
language | eng |
publishDate | 2012 |
record_format | invenio |
spelling | cern-1458476; 2019-09-30T06:29:59Z; doi:10.1088/1742-6596/396/3/032102; http://cds.cern.ch/record/1458476; eng; Sfiligoi, Igor; Wuerthwein, Frank Karl; Bockelman, Brian Paul; Bradley, Daniel Charles; Tadel, Matevz; Bloom, Kenneth Arthur; Letts, James; Mrak Tadel, Alja; Controlled overflowing of data-intensive jobs from oversubscribed sites; Detectors and Experimental Techniques; The CMS analysis computing model has always relied on jobs running near the data, with data allocation between CMS compute centers organized at the management level, based on the expected needs of the CMS community. While this model provided high CPU utilization during job run times, there were times when a large fraction of CPUs at certain sites sat idle due to lack of demand, all while terabytes of data were never accessed. To improve the utilization of both CPU and disks, CMS is moving toward controlled overflowing of jobs from sites that have data but are oversubscribed to others with spare CPU and network capacity, with those jobs accessing the data through real-time Xrootd streaming over the WAN. The major limiting factor for remote data access is the ability of the source storage system to serve such data, so the number of jobs accessing it must be carefully controlled. The CMS approach is to implement the overflowing by means of glideinWMS, a Condor-based pilot system, and to provide the WMS with the known storage limits, letting it schedule jobs within those limits. This paper presents the detailed architecture of the overflow-enabled glideinWMS system, together with operational experience from the past 6 months.; CMS-CR-2012-069; oai:cds.cern.ch:1458476; 2012-05-10 |
spellingShingle | Detectors and Experimental Techniques; Sfiligoi, Igor; Wuerthwein, Frank Karl; Bockelman, Brian Paul; Bradley, Daniel Charles; Tadel, Matevz; Bloom, Kenneth Arthur; Letts, James; Mrak Tadel, Alja; Controlled overflowing of data-intensive jobs from oversubscribed sites |
title | Controlled overflowing of data-intensive jobs from oversubscribed sites |
title_full | Controlled overflowing of data-intensive jobs from oversubscribed sites |
title_fullStr | Controlled overflowing of data-intensive jobs from oversubscribed sites |
title_full_unstemmed | Controlled overflowing of data-intensive jobs from oversubscribed sites |
title_short | Controlled overflowing of data-intensive jobs from oversubscribed sites |
title_sort | controlled overflowing of data-intensive jobs from oversubscribed sites |
topic | Detectors and Experimental Techniques |
url | https://dx.doi.org/10.1088/1742-6596/396/3/032102 http://cds.cern.ch/record/1458476 |
work_keys_str_mv | AT sfiligoiigor controlledoverflowingofdataintensivejobsfromoversubscribedsites AT wuerthweinfrankkarl controlledoverflowingofdataintensivejobsfromoversubscribedsites AT bockelmanbrianpaul controlledoverflowingofdataintensivejobsfromoversubscribedsites AT bradleydanielcharles controlledoverflowingofdataintensivejobsfromoversubscribedsites AT tadelmatevz controlledoverflowingofdataintensivejobsfromoversubscribedsites AT bloomkennetharthur controlledoverflowingofdataintensivejobsfromoversubscribedsites AT lettsjames controlledoverflowingofdataintensivejobsfromoversubscribedsites AT mraktadelalja controlledoverflowingofdataintensivejobsfromoversubscribedsites |
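
The scheduling idea summarized in the description (run jobs next to the data when possible, and overflow them to sites with spare CPU only while the source storage can serve additional remote readers) can be illustrated with a minimal sketch. This is not glideinWMS or Condor code: the `Site` class, the `place_job` function, the field names, and the site names `T2_A` and `T2_B` are all hypothetical, and the single per-site reader limit is an assumed simplification of the "known storage limits" mentioned in the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Site:
    """Hypothetical view of one compute site; every field is illustrative."""
    name: str
    free_cpus: int            # idle CPU slots at this site
    max_remote_readers: int   # assumed cap on remote jobs its storage can serve
    remote_readers: int = 0   # jobs elsewhere currently streaming from this site


def place_job(data_site: Site, other_sites: List[Site]) -> Optional[str]:
    """Return the site name where the job runs, or None to leave it queued."""
    # Preferred case: run the job next to the data.
    if data_site.free_cpus > 0:
        data_site.free_cpus -= 1
        return data_site.name

    # The data site is oversubscribed. Overflow only if its storage can
    # still serve one more remote reader within the configured limit.
    if data_site.remote_readers >= data_site.max_remote_readers:
        return None

    # Pick the first site with a spare CPU and count the new remote reader.
    for site in other_sites:
        if site.free_cpus > 0:
            site.free_cpus -= 1
            data_site.remote_readers += 1
            return site.name
    return None


# Usage: one local slot, a remote-reader limit of 2, five jobs to place.
source = Site("T2_A", free_cpus=1, max_remote_readers=2)
spare = Site("T2_B", free_cpus=5, max_remote_readers=2)
placements = [place_job(source, [spare]) for _ in range(5)]
print(placements)
```

In this toy run the first job lands at the data site, the next two overflow to the site with spare CPUs, and the remaining two stay queued because the assumed storage limit has been reached, even though the second site still has idle slots. The sketch never decrements the reader count when a job finishes; a real system would have to track that as well.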