Automated methods of textual content analysis and description of text structures

Universal Semantic Language (USL) is a semi-formalized approach for the description of knowledge (a knowledge representation tool). The idea of USL was introduced by Vladimir Smetacek in the system called SEMAN which was used for keyword extraction tasks in the former Information centre of the Czech...

Bibliographic Details
Main author: Chýla, Roman
Language: eng
Published: 2012
Subjects:
Online access: http://cds.cern.ch/record/1450189
_version_ 1780924911005466624
author Chýla, Roman
author_facet Chýla, Roman
author_sort Chýla, Roman
collection CERN
description Universal Semantic Language (USL) is a semi-formalized approach for the description of knowledge (a knowledge representation tool). The idea of USL was introduced by Vladimir Smetacek in a system called SEMAN, which was used for keyword extraction tasks in the former Information centre of the Czechoslovak Republic. However, the system was lost when the centre was dissolved in the early 1990s. This thesis reintroduces the idea of USL in the new context of quantitative content analysis. First, we introduce the historical background and the problems of semantics and knowledge representation, semes, semantic fields, semantic primes and universals. The basic methodology of content analysis studies is illustrated using three content analysis tools, and we describe the architecture of a new system. The application was built specifically for USL discovery, but it can also work in the context of classical content analysis. It contains Natural Language Processing (NLP) components and employs a collocation discovery algorithm adapted to search for co-occurrences between semantic annotations. The software is evaluated by comparing its pattern matching mechanism against an established existing extractor. The semantic translation mechanism is evaluated in the task of automated document classification, with special attention to the problem of semantic ambiguity and correct translation. Finally, we evaluate the ability of the system to discover statistically significant semantic relationships from textual corpora.
id cern-1450189
institution European Organization for Nuclear Research
language eng
publishDate 2012
record_format invenio
spelling cern-1450189 2019-09-30T06:29:59Z http://cds.cern.ch/record/1450189 eng Chýla, Roman Automated methods of textual content analysis and description of text structures Computing and Computers CERN-THESIS-2011-239 oai:cds.cern.ch:1450189 2012-05-22T08:31:21Z
spellingShingle Computing and Computers
Chýla, Roman
Automated methods of textual content analysis and description of text structures
title Automated methods of textual content analysis and description of text structures
title_full Automated methods of textual content analysis and description of text structures
title_fullStr Automated methods of textual content analysis and description of text structures
title_full_unstemmed Automated methods of textual content analysis and description of text structures
title_short Automated methods of textual content analysis and description of text structures
title_sort automated methods of textual content analysis and description of text structures
topic Computing and Computers
url http://cds.cern.ch/record/1450189
work_keys_str_mv AT chylaroman automatedmethodsoftextualcontentanalysisanddescriptionoftextstructures