Human-competitive automatic topic indexing

Topic indexing is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document’s topics helps people judge its relevance quickly. However, assigning...

Bibliographic Details
Main author: Medelyan, Olena
Language: eng
Published: U. 2009
Subjects: Computing and Computers
Online access: http://cds.cern.ch/record/1198029
_version_ 1780917285888720896
author Medelyan, Olena
author_facet Medelyan, Olena
author_sort Medelyan, Olena
collection CERN
description Topic indexing is the task of identifying the main topics covered by a document. These topics are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document’s topics helps people judge its relevance quickly. However, assigning topics manually is labor-intensive. This thesis shows how to generate them automatically in a way that competes with human performance. Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated to Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples. This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers and by amateurs. We claim that the algorithm is “human-competitive” because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data and applies across different domains and languages.
id cern-1198029
institution Organización Europea para la Investigación Nuclear
language eng
publishDate 2009
publisher U.
record_format invenio
spelling cern-1198029 2019-09-30T06:29:59Z http://cds.cern.ch/record/1198029 eng Medelyan, Olena Human-competitive automatic topic indexing Computing and Computers Topic indexing is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document’s topics helps people judge its relevance quickly. However, assigning topics manually is labor intensive. This thesis shows how to generate them automatically in a way that competes with human performance. Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated to Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples. This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers, and by amateurs. We claim that the algorithm is “human-competitive” because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data and applies across different domains and languages. U. CERN-THESIS-2009-271 oai:cds.cern.ch:1198029 2009
spellingShingle Computing and Computers
Medelyan, Olena
Human-competitive automatic topic indexing
title Human-competitive automatic topic indexing
title_full Human-competitive automatic topic indexing
title_fullStr Human-competitive automatic topic indexing
title_full_unstemmed Human-competitive automatic topic indexing
title_short Human-competitive automatic topic indexing
title_sort human-competitive automatic topic indexing
topic Computing and Computers
url http://cds.cern.ch/record/1198029
work_keys_str_mv AT medelyanolena humancompetitiveautomatictopicindexing
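
The abstract describes a two-stage algorithm: select candidate topics, then rank them by significance. The sketch below is only a rough illustration of that shape, not the thesis's implementation. The vocabulary, the two features (frequency and position of first occurrence), the fixed weights and all function names are assumptions made for this example; the thesis combines its features with a learned model, which the hand-set weights here merely stand in for.

```python
import re

# Hypothetical controlled vocabulary (an assumption; not taken from the thesis).
VOCABULARY = {"topic indexing", "machine learning", "controlled vocabulary"}


def select_candidates(text, vocabulary, max_len=3):
    """Stage 1: collect word n-grams that match a vocabulary term."""
    words = re.findall(r"[a-z]+", text.lower())
    candidates = {}
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in vocabulary:
                # Record the first position once, then count occurrences.
                stats = candidates.setdefault(phrase, {"count": 0, "first": i})
                stats["count"] += 1
    return candidates, len(words)


def rank_candidates(candidates, total_words, weights=(1.0, 0.5)):
    """Stage 2: score each candidate and sort by descending score.

    The fixed weights are a stand-in for a model trained on
    human-indexed examples.
    """
    w_freq, w_pos = weights
    scored = []
    for phrase, stats in candidates.items():
        frequency = stats["count"] / total_words
        earliness = 1.0 - stats["first"] / total_words
        scored.append((w_freq * frequency + w_pos * earliness, phrase))
    return [phrase for _, phrase in sorted(scored, reverse=True)]


def index_topics(text, vocabulary=VOCABULARY, top_k=5):
    """Run both stages and return at most top_k topics."""
    candidates, total_words = select_candidates(text, vocabulary)
    if total_words == 0:
        return []
    return rank_candidates(candidates, total_words)[:top_k]


if __name__ == "__main__":
    sample = "Topic indexing uses machine learning. Topic indexing helps libraries."
    print(index_topics(sample))
```

Restricting candidates to a fixed vocabulary corresponds loosely to the term-assignment setting mentioned in the abstract; free tagging or Wikipedia-based keyphrase extraction would need a different candidate source.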