
Bibliographic Entity Automatic Recognition and Disambiguation

This master's thesis reports on an applied machine learning research internship carried out at the digital library of the European Organization for Nuclear Research (CERN). The way an author’s name may vary in its representation across scientific publications creates ambiguity when it comes to uniquely identifying...


Bibliographic Details
Main Author: Al-Natsheh, Hussein
Language: eng
Published: 2015
Subjects: Computing and Computers
Online Access: http://cds.cern.ch/record/2036112
_version_ 1780947637158019072
author Al-Natsheh, Hussein
author_facet Al-Natsheh, Hussein
author_sort Al-Natsheh, Hussein
collection CERN
description This master's thesis reports on an applied machine learning research internship carried out at the digital library of the European Organization for Nuclear Research (CERN). An author’s name can be written in different ways across scientific publications, which creates ambiguity when it comes to uniquely identifying an author. Moreover, in the database of any scientific digital library, the same full-name variant can be used by more than one author, even among authors from the same research affiliation. In this work, we built a machine-learning-based author name disambiguation solution. The approach consists of learning a distance function from ground-truth data, blocking publications with broadly similar author names, and clustering the publications within each block using a semi-supervised strategy. The main contributions of this work are twofold. The first is an improvement of the distance model that takes into account the (estimated) ethnicity of the author’s full name. Indeed, names from different ethnicities, for example Asian versus Arabic names, should be processed differently. This added feature led to better clustering evaluation results and also accounted for a high share in the feature importance analysis. The second main contribution was choosing a thresholding strategy to form a flat clustering from the agglomerative hierarchical clustering. Six different strategies were evaluated for estimating the number of clusters in each block. The strategy that gave the best evaluation results uses a blocking function that groups signatures sharing a common last name and first-name initial, then applies the semi-supervised clustering to the blocks that contain samples from the ground truth. Each block without any labeled sample forms a single cluster. A smaller contribution was also made to the distance model, including feature engineering and pair sampling.
Overall, the model accuracy is 98%, compared to 94% when disambiguating only on the common normalized last name and first-name initial. My work helped raise the accuracy from 97% to slightly more than 98%, which is equivalent to reducing the error rate by about 35%. During the project, I also contributed to an open source project that will eventually be deployed in the high-energy physics digital library of CERN (http://inspirehep.net). Many factors led to such an accurate disambiguation model. A key factor was having ground-truth data, which allowed us to design a very good semi-supervised clustering. Another factor was learning an accurate distance model with appropriate feature engineering, in which we managed to incorporate external knowledge of name ethnicity.
id cern-2036112
institution European Organization for Nuclear Research
language eng
publishDate 2015
record_format invenio
spelling cern-2036112 2019-09-30T06:29:59Z http://cds.cern.ch/record/2036112 eng Al-Natsheh, Hussein Bibliographic Entity Automatic Recognition and Disambiguation Computing and Computers CERN-THESIS-2015-098 oai:cds.cern.ch:2036112 2015-07-21T07:30:23Z
spellingShingle Computing and Computers
Al-Natsheh, Hussein
Bibliographic Entity Automatic Recognition and Disambiguation
title Bibliographic Entity Automatic Recognition and Disambiguation
title_full Bibliographic Entity Automatic Recognition and Disambiguation
title_fullStr Bibliographic Entity Automatic Recognition and Disambiguation
title_full_unstemmed Bibliographic Entity Automatic Recognition and Disambiguation
title_short Bibliographic Entity Automatic Recognition and Disambiguation
title_sort bibliographic entity automatic recognition and disambiguation
topic Computing and Computers
url http://cds.cern.ch/record/2036112
work_keys_str_mv AT alnatshehhussein bibliographicentityautomaticrecognitionanddisambiguation
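
The description above outlines a pipeline: block signatures by normalized last name and first-name initial, then cut an agglomerative hierarchical clustering at a threshold within each block. The following is a minimal sketch of that idea, not the thesis implementation. The hand-written co-author distance stands in for the learned distance model, the fixed threshold stands in for the semi-supervised strategy, and every name in it (block_key, toy_distance, cluster_block, disambiguate, the sample signatures) is a hypothetical chosen for illustration.

```python
import unicodedata
from collections import defaultdict
from itertools import combinations


def _normalize(text):
    """Lowercase, strip accents, and keep letters only."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    return "".join(ch for ch in ascii_text.lower() if ch.isalpha())


def block_key(full_name):
    """Blocking key: normalized last name plus first-name initial.

    Assumes names are written as "Last, First" (an assumption of this sketch).
    """
    last, _, first = full_name.partition(",")
    return f"{_normalize(last)} {_normalize(first)[:1]}".strip()


def toy_distance(sig_a, sig_b):
    """Stand-in distance: 1 - Jaccard similarity of the co-author sets.

    The thesis learns its distance from ground-truth data; this is a placeholder.
    """
    union = sig_a["coauthors"] | sig_b["coauthors"]
    if not union:
        return 1.0
    return 1.0 - len(sig_a["coauthors"] & sig_b["coauthors"]) / len(union)


def cluster_block(signatures, distance, threshold):
    """Average-linkage agglomerative clustering, stopped at a distance threshold."""
    clusters = [[i] for i in range(len(signatures))]

    def linkage(c1, c2):
        total = sum(distance(signatures[i], signatures[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        (a, b), best = min(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break  # remaining clusters are too far apart: this is the flat cut
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


def disambiguate(signatures, distance, threshold):
    """Block the signatures, cluster each block, and return clusters of ids."""
    blocks = defaultdict(list)
    for sig in signatures:
        blocks[block_key(sig["name"])].append(sig)
    result = []
    for members in blocks.values():
        for cluster in cluster_block(members, distance, threshold):
            result.append(sorted(members[i]["id"] for i in cluster))
    return sorted(result)


# Hypothetical sample data, invented for this sketch.
example_signatures = [
    {"id": 1, "name": "Smith, John", "coauthors": {"Doe, A.", "Li, W."}},
    {"id": 2, "name": "Smith, J.", "coauthors": {"Doe, A.", "Li, W.", "Kim, S."}},
    {"id": 3, "name": "Smith, Jane", "coauthors": {"Garcia, M."}},
    {"id": 4, "name": "Müller, Hans", "coauthors": {"Doe, A."}},
]

print(disambiguate(example_signatures, toy_distance, threshold=0.5))
# prints [[1, 2], [3], [4]]
```

In this toy run, the three "Smith, J" signatures share one block, and only the two with overlapping co-authors are merged. Clustering inside blocks keeps the pairwise comparisons local to names that could plausibly refer to the same person, which is presumably why the described approach blocks before clustering.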