Automatic Metadata Extraction - The High Energy Physics Use Case

Automatic metadata extraction (AME) of scientific papers has been described as one of the hardest problems in document engineering. Heterogeneous content, varying style, and unpredictable placement of article components render the problem inherently indeterministic. Conditional random fields (CRF),...

Bibliographic Details
Main author: Boyd, Joseph
Language: eng
Published: 2015
Subjects: Computing and Computers
Online access: http://cds.cern.ch/record/2039361
_version_ 1780947727430975488
author Boyd, Joseph
author_facet Boyd, Joseph
author_sort Boyd, Joseph
collection CERN
description Automatic metadata extraction (AME) of scientific papers has been described as one of the hardest problems in document engineering. Heterogeneous content, varying style, and unpredictable placement of article components render the problem inherently indeterministic. Conditional random fields (CRF), a machine learning technique, can be used to classify document metadata amidst this uncertainty, annotating document contents with semantic labels. High energy physics (HEP) papers, such as those written at CERN, have unique content and structural characteristics, with scientific collaborations of thousands of authors altering article layouts dramatically. The distinctive qualities of these papers necessitate the creation of specialised datasets and model features. In this work we build an unprecedented training set of HEP papers and propose and evaluate a set of innovative features for CRF models. We build upon state-of-the-art AME software, GROBID, a tool coordinating a hierarchy of CRF models in a full document cascade. Through our extensions and our own robust experimentation pipeline, we cross-validate 66 experiment variations to find new improvements in feature engineering. We succeed in enhancing the two most crucial CRF models within the cascade, reducing error by up to 25% for key classifications.
id cern-2039361
institution Organización Europea para la Investigación Nuclear
language eng
publishDate 2015
record_format invenio
spelling cern-2039361 | 2019-09-30T06:29:59Z | http://cds.cern.ch/record/2039361 | eng | Boyd, Joseph | Automatic Metadata Extraction - The High Energy Physics Use Case | Computing and Computers | CERN-THESIS-2015-105 | oai:cds.cern.ch:2039361 | 2015-07-30T16:51:15Z
spellingShingle Computing and Computers | Boyd, Joseph | Automatic Metadata Extraction - The High Energy Physics Use Case
title Automatic Metadata Extraction - The High Energy Physics Use Case
title_full Automatic Metadata Extraction - The High Energy Physics Use Case
title_fullStr Automatic Metadata Extraction - The High Energy Physics Use Case
title_full_unstemmed Automatic Metadata Extraction - The High Energy Physics Use Case
title_short Automatic Metadata Extraction - The High Energy Physics Use Case
title_sort automatic metadata extraction - the high energy physics use case
topic Computing and Computers
url http://cds.cern.ch/record/2039361
work_keys_str_mv AT boydjoseph automaticmetadataextractionthehighenergyphysicsusecase