Sumario: | This master’s thesis reports on an applied machine learning research internship carried out at the digital library of the European Organization for Nuclear Research (CERN). An author’s name can be written in several ways across scientific publications, which makes it difficult to identify an author uniquely. Conversely, in the database of any scientific digital library, the same full-name variant can be used by more than one author, sometimes even by authors from the same research affiliation. In this work, we built a machine-learning-based author name disambiguation solution. The approach consists of learning a distance function from ground-truth data, blocking publications with broadly similar author names, and clustering the publications within each block using a semi-supervised strategy. The main contributions of this work are twofold. The first is an improvement of the distance model that takes into account the (estimated) ethnicity of the author’s full name, since names from different ethnicities, for example Asian versus Arabic names, should be processed differently. This added feature led to better clustering evaluation results and accounted for a high share in the feature importance analysis. The second is the choice of a thresholding strategy for forming a flat clustering from the agglomerative hierarchical clustering. Six different strategies for estimating the number of clusters in each block were evaluated. The strategy that gave the best evaluation results uses a blocking function that groups signatures sharing a last name and first-name initial, then applies the semi-supervised clustering to the blocks that contain samples from the ground truth; each block without any labeled sample forms a single cluster. A smaller contribution was also made to the distance model, including feature engineering and pair sampling.
Overall, the model reaches an accuracy of 98%, compared to 94% when disambiguating only on the normalized last name and first-name initial. My work helped raise the accuracy from 97% to slightly more than 98%, which is equivalent to reducing the error rate by about 35% (from 3% to just under 2%). During the project, I also contributed to an open source project that will eventually be deployed in the high-energy physics digital library of CERN (http://inspirehep.net). Several factors made such an accurate disambiguation model possible. A key factor was the availability of ground-truth data, which allowed us to design a very good semi-supervised clustering. Another was learning an accurate distance model with appropriate feature engineering, in which we managed to incorporate external knowledge about name ethnicity.
|
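The blocking step described in the summary, grouping signatures that share a normalized last name and first-name initial, can be sketched as follows. This is a minimal illustration and not the implementation used in the thesis: the function names `block_key` and `block_signatures`, the assumed "Last, First" name format, and the accent-stripping normalization are all assumptions made for the example.

```python
import unicodedata
from collections import defaultdict


def block_key(full_name):
    """Build a 'lastname initial' key from a name written as 'Last, First'.

    Hypothetical helper: the 'Last, First' format and the normalization
    rules below are assumptions, not taken from the thesis.
    """
    # Strip accents and lowercase so spelling variants share the same key.
    ascii_name = (
        unicodedata.normalize("NFKD", full_name)
        .encode("ascii", "ignore")
        .decode("ascii")
        .lower()
    )
    last, _, first = ascii_name.partition(",")
    # Keep only letters in the last name (drops spaces, hyphens, dots).
    last = "".join(ch for ch in last if ch.isalpha())
    first = first.strip()
    initial = first[0] if first else ""
    return f"{last} {initial}".strip()


def block_signatures(signatures):
    """Group (signature_id, full_name) pairs by their block key."""
    blocks = defaultdict(list)
    for signature_id, full_name in signatures:
        blocks[block_key(full_name)].append(signature_id)
    return dict(blocks)


if __name__ == "__main__":
    example = [(1, "Müller, Hans"), (2, "Muller, H."), (3, "Smith, John")]
    print(block_signatures(example))
```

Under these assumptions, "Müller, Hans" and "Muller, H." land in the same block, while "Smith, John" forms a block of its own. Clustering would then run independently inside each block, and treating each block as a single cluster corresponds to the baseline mentioned in the summary.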