
Data Integration against Multiple Evolving Autonomous Schemata


Bibliographic Details
Main author: Koch, Christoph
Language: eng
Published: 2011
Subjects:
Online access: http://cds.cern.ch/record/1387966
Description
Summary: Research in the area of data integration has resulted in approaches such as federated and multidatabases, mediation, data warehousing, global information systems, and the model management/schema matching approach. Architecturally, approaches can be categorized into those that integrate against a single global schema and those that do not, while on the level of inter-schema constraints, most work can be classified either as so-called global-as-view or as local-as-view integration. These approaches differ widely in their strengths and weaknesses. Federated databases have been found applicable in environments in which several autonomous information systems coexist, each with their individual schemata, and need to share data. However, this approach does not provide sufficient support for dealing with change of schemata and requirements. Other approaches to data integration which are centered around a single "global" integration schema, on the other hand, cannot handle design autonomy of information systems. Under evolution, this type of autonomy eventually leads to schemata between which neither the global-as-view nor the local-as-view approaches to source integration can be used to express the inter-schema semantics. In this thesis, this issue is addressed with a novel approach to data integration which combines techniques from model management, mediation, and local-as-view integration. It allows for the design of inter-schema mappings that are more robust when change occurs. The work has been motivated by the requirements of large scientific collaborations in high-energy physics, as encountered by the author during his stay at CERN. The approach presented here is based on two foundations. The first is query rewriting with very expressive symmetric inter-schema constraints, called conjunctive inclusion dependencies (cind's). These are containment relationships between conjunctive queries.
We address a very general form of the source integration problem, in which several schemata may coexist, each of them containing a number of purely logical as well as a number of source entities. For the source entities, the information system that belongs to the schema holds data, while the logical entities are meant to allow schema entities from other information systems to be integrated against. The query rewriting problem now aims at rewriting a query over (possibly) both source and logical schema entities of one schema into source entities only, which may be part of any of the schemata known. Under the classical logical semantics, and given a conjunctive input query, we address the problem of finding maximally contained positive rewritings under a set of cind's. Such rewritten queries can then be optimized and efficiently answered using classical distributed database techniques. For the purpose of data integration and the sake of computability, we require the dependency graph of a set of cind's to be acyclic with respect to inclusion direction. Regarding the query rewriting problem, we first present semantics and main theoretical properties. Subsequently, algorithms and optimizations based on techniques from database theory are presented, which have been implemented in a research prototype. Finally, experimental results based on this prototype are presented, which demonstrate the practical feasibility of our approach. Reasoning is done exclusively over schemata and queries, and is independent from data volumes, which renders it highly scalable. Apart from that, this flavor of query rewriting has another important strength. The expressiveness of the constraints allows for much freedom and flexibility for modeling the peculiarities of a mapping problem. For instance, both global-as-view and local-as-view integration are special cases of the query rewriting problem addressed in this thesis.
As will be shown, this flexibility allows to design mappings that are robust with respect to change, as principles such as the decoupling of inter-schema dependencies can be implemented. It is furthermore clear that query rewriting with cind's also permits to deal with concept mismatch in a very wide sense, as each pair of corresponding concepts in two schemata can be modeled as conjunctive queries. The second foundation is model management based on cind's as inter-schema constraints. Under the model management approach to data integration, schemata and mappings are treated as first-class citizens in a repository, on which model management operations can be applied. This thesis proposes definitions of schemata and mappings, as well as an array of powerful operations, which are well suited for designing and maintaining mappings between information systems when change is an issue. To complete this work, we propose a methodology for dealing with evolving schemata as well as changing integration requirements. The combination of the contributions of this thesis brings a practical improvement of openness and flexibility to the federated database and model management approaches to data integration, and a first practical integration architecture to large, complex, and evolving computing environments such as those encountered in large scientific collaborations.
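
Illustrative note (not part of the thesis abstract): the summary describes cind's as containment relationships between conjunctive queries. The sketch below shows one way such a constraint might be written, and how global-as-view and local-as-view mappings could fall out as special cases. The notation and the relation symbols $s$ and $g$ are assumptions made for illustration; the thesis may use a different formalization.

```latex
% General shape of a conjunctive inclusion dependency (assumed notation):
% \phi and \psi are conjunctions of relational atoms, possibly over
% different schemata; \bar{x} are the shared (distinguished) variables.
\[
  \{\, \bar{x} \mid \exists \bar{y}\; \phi(\bar{x}, \bar{y}) \,\}
  \;\subseteq\;
  \{\, \bar{x} \mid \exists \bar{z}\; \psi(\bar{x}, \bar{z}) \,\}
\]

% Local-as-view as a special case: the contained side is a single
% source atom s (hypothetical), described by a query over logical entities.
\[
  \{\, \bar{x} \mid s(\bar{x}) \,\}
  \;\subseteq\;
  \{\, \bar{x} \mid \exists \bar{z}\; \psi(\bar{x}, \bar{z}) \,\}
\]

% Global-as-view as a special case: the containing side is a single
% logical atom g (hypothetical), populated by a query over source entities.
\[
  \{\, \bar{x} \mid \exists \bar{y}\; \phi(\bar{x}, \bar{y}) \,\}
  \;\subseteq\;
  \{\, \bar{x} \mid g(\bar{x}) \,\}
\]
```

In this reading, a general cind lets both sides be multi-atom conjunctive queries, which is what would allow a pair of corresponding concepts in two schemata to be related even when neither matches a single relation.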