Cargando…
Automatic Metadata Extraction - The High Energy Physics Use Case
Automatic metadata extraction (AME) of scientific papers has been described as one of the hardest problems in document engineering. Heterogeneous content, varying style, and unpredictable placement of article components render the problem inherently indeterministic. Conditional random fields (CRF),...
Autor principal: | |
---|---|
Lenguaje: | eng |
Publicado: |
2015
|
Materias: | |
Acceso en línea: | http://cds.cern.ch/record/2039361 |
Sumario: | Automatic metadata extraction (AME) of scientific papers has been described as one of the hardest problems in document engineering. Heterogeneous content, varying style, and unpredictable placement of article components render the problem inherently indeterministic. Conditional random fields (CRF), a machine learning technique, can be used to classify document metadata amidst this uncertainty, annotating document contents with semantic labels. High energy physics (HEP) papers, such as those written at CERN, have unique content and structural characteristics, with scientific collaborations of thousands of authors altering article layouts dramatically. The distinctive qualities of these papers necessitate the creation of specialised datasets and model features. In this work we build an unprecedented training set of HEP papers and propose and evaluate a set of innovative features for CRF models. We build upon state-of-the-art AME software, GROBID, a tool coordinating a hierarchy of CRF models in a full document cascade. Through our extensions and our own robust experimentation pipeline, we cross-validate 66 experiment variations to find new improvements in feature engineering. We succeed in enhancing the two most crucial CRF models within the cascade, reducing error by up to 25% for key classifications. |
---|