
Bibliographic Entity Automatic Recognition and Disambiguation

This master's thesis reports on an applied machine learning research internship at the digital library of the European Organization for Nuclear Research (CERN). The way an author’s name may vary in its representation across scientific publications creates ambiguity when it comes to uniquely identifying...


Bibliographic Details
Main author:	Al-Natsheh, Hussein
Language:	eng
Published:	2015
Subjects:	Computing and Computers
Online access:	http://cds.cern.ch/record/2036112

_version_	1780947637158019072
author	Al-Natsheh, Hussein
author_facet	Al-Natsheh, Hussein
author_sort	Al-Natsheh, Hussein
collection	CERN
description	This master's thesis reports on an applied machine learning research internship at the digital library of the European Organization for Nuclear Research (CERN). The way an author’s name may vary in its representation across scientific publications creates ambiguity when it comes to uniquely identifying an author. In the database of any scientific digital library, the same full-name variation can be used by more than one author. This may occur even between authors from the same research affiliation. In this work, we built a machine-learning-based author name disambiguation solution. The approach consists of learning a distance function from ground-truth data, blocking publications with broadly similar author names, and clustering the publications within each block using a semi-supervised strategy. The main contributions of this work are twofold. The first is an improved distance model that takes into account the (estimated) ethnicity of the author’s full name. Indeed, names from different ethnicities, for example Asian versus Arabic names, should be processed differently. This added feature led to a better clustering evaluation. It also accounted for a high share of the contribution in the feature importance analysis. The second main contribution was to decide on a thresholding strategy to form a flat clustering from the agglomerative hierarchical clustering. Six different strategies were evaluated to estimate the number of clusters in each block. The strategy that gave the best evaluation results uses a blocking function that groups signatures with a common last name and first-name initial, then applies the semi-supervised clustering to the blocks that contain samples from the ground truth. Each block that has no labeled sample forms a single cluster. A smaller contribution was also made to the distance model, including feature engineering and pair sampling. Overall, the model accuracy is 98%, compared to 94% if we only disambiguate on the common normalized last name and first-name initial. My work helped raise the accuracy from 97% to slightly more than 98%. This is equivalent to reducing the error rate by about 35%. During the project, I also contributed to an open source project which will eventually be deployed in the high-energy physics digital library of CERN (http://inspirehep.net). Many factors led to achieving such an accurate disambiguation model. A key factor was having ground-truth data, which allowed us to design a very good semi-supervised clustering. Another factor was learning an accurate distance model with appropriate feature engineering, in which we managed to incorporate external knowledge of the name ethnicity.
id	cern-2036112
institution	European Organization for Nuclear Research
language	eng
publishDate	2015
record_format	invenio
spelling	cern-2036112 2019-09-30T06:29:59Z http://cds.cern.ch/record/2036112 eng Al-Natsheh, Hussein Bibliographic Entity Automatic Recognition and Disambiguation Computing and Computers CERN-THESIS-2015-098 oai:cds.cern.ch:2036112 2015-07-21T07:30:23Z
spellingShingle	Computing and Computers Al-Natsheh, Hussein Bibliographic Entity Automatic Recognition and Disambiguation
title	Bibliographic Entity Automatic Recognition and Disambiguation
title_full	Bibliographic Entity Automatic Recognition and Disambiguation
title_fullStr	Bibliographic Entity Automatic Recognition and Disambiguation
title_full_unstemmed	Bibliographic Entity Automatic Recognition and Disambiguation
title_short	Bibliographic Entity Automatic Recognition and Disambiguation
title_sort	bibliographic entity automatic recognition and disambiguation
topic	Computing and Computers
url	http://cds.cern.ch/record/2036112
work_keys_str_mv	AT alnatshehhussein bibliographicentityautomaticrecognitionanddisambiguation
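
The abstract describes a blocking function that groups signatures sharing a normalized last name and first-name initial. The following is only a minimal sketch of that idea, not the thesis implementation: the function names `block_key` and `group_into_blocks`, the "Last, First" input format, and the normalization rules (accent stripping, lowercasing, dropping non-letters) are all assumptions made for illustration.

```python
import unicodedata
from collections import defaultdict


def block_key(full_name):
    """Return a blocking key 'lastname i' for a name written as 'Last, First'.

    Hypothetical normalization: strip accents, lowercase, and keep only
    letters in the last name; keep the first letter of the first name.
    """
    # Decompose accented characters and drop the combining marks.
    ascii_name = (
        unicodedata.normalize("NFKD", full_name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    last, _, first = ascii_name.partition(",")
    last = "".join(ch for ch in last.lower() if ch.isalpha())
    first = first.strip().lower()
    initial = first[0] if first else ""
    return f"{last} {initial}".strip()


def group_into_blocks(signatures):
    """Group (signature_id, author_name) pairs by their blocking key."""
    blocks = defaultdict(list)
    for signature_id, author_name in signatures:
        blocks[block_key(author_name)].append(signature_id)
    return dict(blocks)


if __name__ == "__main__":
    sample = [
        (1, "Al-Natsheh, Hussein"),
        (2, "Al-Natsheh, H."),
        (3, "Smith, John"),
    ]
    print(group_into_blocks(sample))
```

Under these assumptions, signatures 1 and 2 fall into the same block, and clustering would then run inside each block rather than over the whole collection.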

