
Automatic Metadata Extraction - The High Energy Physics Use Case

Automatic metadata extraction (AME) of scientific papers has been described as one of the hardest problems in document engineering. Heterogeneous content, varying style, and unpredictable placement of article components render the problem inherently indeterministic. Conditional random fields (CRF),...

Bibliographic Details
Main author:	Boyd, Joseph
Language:	eng
Published:	2015
Subjects:	Computing and Computers
Online access:	http://cds.cern.ch/record/2039361

_version_	1780947727430975488
author	Boyd, Joseph
author_facet	Boyd, Joseph
author_sort	Boyd, Joseph
collection	CERN
description	Automatic metadata extraction (AME) of scientific papers has been described as one of the hardest problems in document engineering. Heterogeneous content, varying style, and unpredictable placement of article components render the problem inherently indeterministic. Conditional random fields (CRF), a machine learning technique, can be used to classify document metadata amidst this uncertainty, annotating document contents with semantic labels. High energy physics (HEP) papers, such as those written at CERN, have unique content and structural characteristics, with scientific collaborations of thousands of authors altering article layouts dramatically. The distinctive qualities of these papers necessitate the creation of specialised datasets and model features. In this work we build an unprecedented training set of HEP papers and propose and evaluate a set of innovative features for CRF models. We build upon state-of-the-art AME software, GROBID, a tool coordinating a hierarchy of CRF models in a full document cascade. Through our extensions and our own robust experimentation pipeline, we cross-validate 66 experiment variations to find new improvements in feature engineering. We succeed in enhancing the two most crucial CRF models within the cascade, reducing error by up to 25% for key classifications.
id	cern-2039361
institution	European Organization for Nuclear Research
language	eng
publishDate	2015
record_format	invenio
spelling	cern-2039361 2019-09-30T06:29:59Z http://cds.cern.ch/record/2039361 eng Boyd, Joseph Automatic Metadata Extraction - The High Energy Physics Use Case Computing and Computers Automatic metadata extraction (AME) of scientific papers has been described as one of the hardest problems in document engineering. Heterogeneous content, varying style, and unpredictable placement of article components render the problem inherently indeterministic. Conditional random fields (CRF), a machine learning technique, can be used to classify document metadata amidst this uncertainty, annotating document contents with semantic labels. High energy physics (HEP) papers, such as those written at CERN, have unique content and structural characteristics, with scientific collaborations of thousands of authors altering article layouts dramatically. The distinctive qualities of these papers necessitate the creation of specialised datasets and model features. In this work we build an unprecedented training set of HEP papers and propose and evaluate a set of innovative features for CRF models. We build upon state-of-the-art AME software, GROBID, a tool coordinating a hierarchy of CRF models in a full document cascade. Through our extensions and our own robust experimentation pipeline, we cross-validate 66 experiment variations to find new improvements in feature engineering. We succeed in enhancing the two most crucial CRF models within the cascade, reducing error by up to 25% for key classifications. CERN-THESIS-2015-105 oai:cds.cern.ch:2039361 2015-07-30T16:51:15Z
spellingShingle	Computing and Computers Boyd, Joseph Automatic Metadata Extraction - The High Energy Physics Use Case
title	Automatic Metadata Extraction - The High Energy Physics Use Case
title_full	Automatic Metadata Extraction - The High Energy Physics Use Case
title_fullStr	Automatic Metadata Extraction - The High Energy Physics Use Case
title_full_unstemmed	Automatic Metadata Extraction - The High Energy Physics Use Case
title_short	Automatic Metadata Extraction - The High Energy Physics Use Case
title_sort	automatic metadata extraction - the high energy physics use case
topic	Computing and Computers
url	http://cds.cern.ch/record/2039361
work_keys_str_mv	AT boydjoseph automaticmetadataextractionthehighenergyphysicsusecase
