Current practical experience with the distributed cloud data services
We are currently witnessing a data explosion and exponential data growth. I will talk about real-world experience with storage and services for very large data sets. We are storing petabytes, and in the near future exabytes, of data and hundreds or thousands of millions of data sets. The one problem is ve...
Main author: | JAROSLAV KREMENEK, George |
---|---|
Language: | eng |
Published: | 2014 |
Subjects: | HEP Computing |
Online access: | http://cds.cern.ch/record/1970466 |
_version_ | 1780944827266891776 |
---|---|
author | JAROSLAV KREMENEK, George |
author_facet | JAROSLAV KREMENEK, George |
author_sort | JAROSLAV KREMENEK, George |
collection | CERN |
description | We are currently witnessing a data explosion and exponential data growth. I will talk about real-world experience with storage and services for very large data sets. We are storing petabytes, and in the near future exabytes, of data and hundreds or thousands of millions of data sets. One problem is the very large number of data objects. File systems were not designed to manage thousands of millions of data items effectively, and inode space is often limited. Storing such large data sets on rotating storage is costly: electricity bills for spinning and cooling disks can be prohibitive. We therefore prefer magnetic tape for the lowest tier of our HSM solution. At OPESOL we employ LTFS to overcome the limitations of older tape technology, so we can provide full POSIX I/O capabilities even for data stored on magnetic tape.
Managing masses of data is a key issue: it must ensure both the availability of the data and continuity of access indefinitely. Scientific collaborations are usually geographically dispersed, which requires the ability to share, distribute and manage data efficiently and securely. The media, hardware and software of the storage systems used can differ greatly from one processing center to another. Added to this heterogeneity are the continuing evolution of storage media (which forces physical migration) and technological developments in software (which may involve changes in naming or in data access protocols).
Such environments can take advantage of middleware for managing and distributing data in a heterogeneous environment, including storage virtualization, that is, hiding the complexity and diversity of the underlying storage systems while federating data access.
Virtual distributed hierarchical storage systems, data grids and data clouds require using and re-using existing underlying storage systems. Creating a completely new vertically integrated system is out of the question. iRODS-based solutions can take advantage of existing HSM systems such as IBM’s HPSS and TSM, SGI DMF and Oracle (Sun) SAM-QFS, or emerging cloud storage systems such as Amazon S3, Google, Microsoft Azure and others. A good middleware-based distributed cloud service for very large data sets should work with all the main existing systems of this kind and be extensible enough to support the main future ones.
iRODS (Integrated Rule-Oriented Data System) has been under development for over 20 years, mainly by the DICE group, which is located at both the University of California at San Diego and the University of North Carolina at Chapel Hill. iRODS provides a rich palette of management tools (metadata extraction, data integrity and more). iRODS can interface with a virtually unlimited range of existing and even future storage technologies (mass storage systems, distributed file systems, relational databases, Amazon S3, Hadoop and more). iRODS is vendor-agnostic and users have access to all the source code. Migrating from one storage resource to another (new) one takes a single iRODS command, regardless of the data size or the number of data objects.
But what makes iRODS particularly attractive is its rules engine, which has no equivalent among its competitors. The rules engine allows complex data management tasks to be carried out. These policies are managed remotely on the server side: for example, when data is stored in iRODS, background tasks such as replication across multiple sites, data integrity checks and post-processing (metadata extraction, ...) can be triggered automatically on the server side without any specific action on the client side. The data management policy is thus virtualized. This virtualization ensures that the rules set by users are strictly applied, regardless of where the data is located or which application accesses iRODS.
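The server-side behaviour described above can be illustrated with a toy model. The following Python sketch is not iRODS code and uses neither the iRODS rule language nor any iRODS API; the class `ToyDataGrid`, the event name `post_put`, the site names and the path are all hypothetical, chosen only to show how a single client call can trigger replication and an integrity checksum without further client action.

```python
import hashlib
from collections import defaultdict


class ToyDataGrid:
    """Toy in-memory model of a rule-driven data grid (NOT the iRODS API)."""

    def __init__(self, sites):
        # One dictionary per site, mapping logical path -> stored bytes.
        self.sites = {name: {} for name in sites}
        # Catalog of per-object metadata, e.g. checksums.
        self.catalog = defaultdict(dict)
        # Rules registered per event name, run in registration order.
        self.rules = defaultdict(list)

    def on(self, event):
        """Return a decorator that registers a server-side rule for an event."""
        def register(func):
            self.rules[event].append(func)
            return func
        return register

    def put(self, site, path, data):
        """Client-facing call: store the data, then fire the post-put rules."""
        self.sites[site][path] = data
        for rule in self.rules["post_put"]:
            rule(self, site, path)


# Hypothetical two-site grid; the site names are illustrative only.
grid = ToyDataGrid(["paris", "lyon"])


@grid.on("post_put")
def record_checksum(g, site, path):
    # Integrity policy: record a SHA-256 checksum in the catalog.
    g.catalog[path]["sha256"] = hashlib.sha256(g.sites[site][path]).hexdigest()


@grid.on("post_put")
def replicate_everywhere(g, site, path):
    # Replication policy: copy the new object to every other site.
    for other in g.sites:
        if other != site:
            g.sites[other][path] = g.sites[site][path]


# The client only issues a single put; the rules run on the "server" side.
grid.put("paris", "/zone/books/0001.tiff", b"scanned page")
```

After the single `put` call, both sites hold a copy of the object and the catalog holds its checksum, although the client requested neither explicitly.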
iRODS-like systems can deliver a full vertical data storage stack, including complex tape system management, using the existing standard LTFS technology. OPESOL (Open Solutions Inc.) delivers such a system for free. This solution uses LTFS (http://en.wikipedia.org/wiki/Linear_Tape_File_System), which is supported by all modern tape drives and tape libraries.
I will talk today about several sites that have chosen a data grid system based on iRODS (Integrated Rule-Oriented Data System). iRODS provides a rule-based approach to system management, which makes data replication much easier and provides extra data protection. Unlike the metadata provided by traditional file systems, the iRODS metadata system is comprehensive and user-extensible, allowing users to define their own application-level metadata. Users can then query the metadata to find and track data.
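The idea of user-defined, queryable metadata can be sketched as follows. iRODS attaches metadata to objects as attribute-value-unit triples; the Python below is only a toy in-memory model of that idea, not the iRODS catalog or any iRODS client API. The class `ToyMetadataCatalog`, the paths and the attribute names are hypothetical.

```python
from collections import defaultdict


class ToyMetadataCatalog:
    """Toy store of (attribute, value, unit) triples per logical path."""

    def __init__(self):
        # Logical path -> set of (attribute, value, unit) triples.
        self._avus = defaultdict(set)

    def add(self, path, attribute, value, unit=""):
        """Attach one user-defined metadata triple to an object."""
        self._avus[path].add((attribute, value, unit))

    def find(self, attribute, value):
        """Return the sorted paths carrying the given attribute/value pair."""
        return sorted(
            path
            for path, avus in self._avus.items()
            if any(a == attribute and v == value for a, v, _ in avus)
        )


catalog = ToyMetadataCatalog()
# Hypothetical objects tagged with application-level metadata.
catalog.add("/zone/babar/run42.root", "experiment", "BaBar")
catalog.add("/zone/babar/run42.root", "size", "12", "GB")
catalog.add("/zone/virgo/seg7.gwf", "experiment", "Virgo")

# Locate data by what it is, not by where it is stored.
matches = catalog.find("experiment", "BaBar")
```

Here `matches` contains only the path tagged with the `experiment` value `BaBar`, which shows how data can be found through metadata rather than through its directory location.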
iRODS is used in production at the Institut national de physique nucléaire et de physique des particules (IN2P3).
The Computing Center of IN2P3 (CC-IN2P3) has offered an iRODS service to IN2P3 since 2008. This service is open to all who wish to use it. Currently, 34 groups in the fields of particle physics (BaBar, dChooz ...), nuclear physics (Indra, Fazia ...), astroparticle physics and astrophysics (AMS, Antares, Auger, Virgo, OHSA ...), humanities and social sciences (Huma-Num) and biology use the CC-IN2P3 iRODS service to manage and disseminate their data. CC-IN2P3 also hosts the central catalog of the newly created iRODS service of France Grille, and supports administrators in France in the use of this grid technology.
The iRODS service has its own disk servers and is interfaced with our HPSS mass storage system (storage on magnetic tape). It currently manages over 8 petabytes of data, making it the largest such service by volume identified internationally.
The service is federated with other iRODS services, for example the one at SLAC. In the same way, it is also quite possible to federate storage servers available in laboratories with the CC-IN2P3 iRODS service.
The BnF (Bibliothèque nationale de France)
The BnF uses iRODS together with open (closed) SAM-QFS to store hundreds of millions of books for long-term data preservation. The BnF uses iRODS to provide a distributed private data cloud in which multiple replicas of data sets are kept at the primary BnF site in Paris and at a secondary site about 40 km from Paris. To implement its digital preservation policies, the BnF created the SPAR (Distributed Archiving and Preservation) system, which was launched in May 2010 and is continually updated with new collections and features. The BnF employs SPAR and Gallica for the web interface to the distributed private data cloud in iRODS.
The NKP and NDK site and project (Czech National Library, Czech National Digital Library)
I have helped to implement iRODS together with Fedora Commons and other tools at NKP in Prague, Czech Republic, as the basis for the EU-funded digital library project. The system is now in full production. It uses IBM's GPFS and TSM as the base layer for its HSM.
The system stores over 300 million data objects. Its data arrives continuously from the scanning of paper books, electronic input of born-digital documents, constant web archiving of the “.cz” domain, and all Czech TV and radio broadcasts, among other sources. |
id | cern-1970466 |
institution | European Organization for Nuclear Research |
language | eng |
publishDate | 2014 |
record_format | invenio |
spelling | cern-1970466 2022-11-02T22:12:16Z http://cds.cern.ch/record/1970466 eng JAROSLAV KREMENEK, George Current practical experience with the distributed cloud data services Workshop on Cloud Services for File Synchronisation and Sharing HEP Computing oai:cds.cern.ch:1970466 2014 |
spellingShingle | HEP Computing JAROSLAV KREMENEK, George Current practical experience with the distributed cloud data services |
title | Current practical experience with the distributed cloud data services |
title_full | Current practical experience with the distributed cloud data services |
title_fullStr | Current practical experience with the distributed cloud data services |
title_full_unstemmed | Current practical experience with the distributed cloud data services |
title_short | Current practical experience with the distributed cloud data services |
title_sort | current practical experience with the distributed cloud data services |
topic | HEP Computing |
url | http://cds.cern.ch/record/1970466 |
work_keys_str_mv | AT jaroslavkremenekgeorge currentpracticalexperiencewiththedistributedclouddataservices AT jaroslavkremenekgeorge workshoponcloudservicesforfilesynchronisationandsharing |