Management of Scientific Images: an approach to the extraction, annotation and retrieval of figures in the field of High Energy Physics

Bibliographic Details
Main author: Praczyk, Piotr Adam
Language: eng
Published: 2013
Subjects:
Online access: http://cds.cern.ch/record/1624334
Description
Summary: The information environment of the first decade of the 21st century is unprecedented. The physical barriers limiting access to knowledge are disappearing as traditional methods of accessing information are replaced or enhanced by computer systems. Digital systems can manage much larger sets of documents, confronting information users with a deluge of documents related to their topic of interest. This new situation created an incentive for the rapid development of data mining techniques and for the creation of more efficient search engines capable of limiting the search results to a small subset of the most relevant ones. However, most current search engines operate on textual descriptions of documents. Those descriptions can either be extracted from the content of the document or be obtained from external sources. Retrieval based on the non-textual content of documents is a subject of ongoing research. In particular, the retrieval of images and the unlocking of the information they carry attract considerable attention from the scientific community. Digital libraries hold a special position amongst the systems that provide access to knowledge. They serve as repositories of documents that share some common characteristics (e.g. belonging to the same area of knowledge or produced in the same institution) and, as such, contain documents selected as interesting for a particular group of users. In addition, they provide retrieval facilities on top of the managed collections. Typically, scholarly publications are the smallest units of information managed in scientific digital libraries. However, other types of artifacts are produced and used in the scientific process, such as figures and datasets. Figures play a particularly important role in the process of scholarly publishing.
Representing data graphically reveals patterns in large datasets and makes complicated ideas easier to understand. Existing digital library systems provide access to figures only as part of the files used to serialise the entire publication. The objective of this thesis is to propose a set of methods and techniques for transforming figures into first-class products of the scientific publication process, allowing researchers to get the maximum benefit from searching and reviewing the literature. The proposed methods and techniques are oriented towards the acquisition, semantic annotation and search of figures contained in scholarly publications. Leveraging the completeness of the field and the existing community, we illustrated the described theory with examples from High-Energy Physics (HEP). Wherever more focused consideration was required, we concentrated on the type of figure that appears most frequently in the corpus of HEP publications: plots. The described prototypes capable of processing figures have been partially integrated with the Invenio digital library software and with INSPIRE, one of the world's largest digital libraries for High-Energy Physics, created by a collaboration of the main laboratories and research centres in this domain (CERN, SLAC, DESY and Fermilab).