Why was this cited? Explainable machine learning applied to COVID-19 research literature

Multiple studies have investigated bibliometric factors predictive of the citation count a research article will receive. In this article, we go beyond bibliometric data by using a range of machine learning techniques to find patterns predictive of citation count using both article content and avail...

Bibliographic Details
Main Authors:	Beranová, Lucie, Joachimiak, Marcin P., Kliegr, Tomáš, Rabby, Gollam, Sklenák, Vilém
Format:	Online Article Text
Language:	English
Published:	Springer International Publishing 2022
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8993675/ https://www.ncbi.nlm.nih.gov/pubmed/35431364 http://dx.doi.org/10.1007/s11192-022-04314-9

_version_	1784683949843283968
author	Beranová, Lucie Joachimiak, Marcin P. Kliegr, Tomáš Rabby, Gollam Sklenák, Vilém
author_facet	Beranová, Lucie Joachimiak, Marcin P. Kliegr, Tomáš Rabby, Gollam Sklenák, Vilém
author_sort	Beranová, Lucie
collection	PubMed
description	Multiple studies have investigated bibliometric factors predictive of the citation count a research article will receive. In this article, we go beyond bibliometric data by using a range of machine learning techniques to find patterns predictive of citation count using both article content and available metadata. As the input collection, we use the CORD-19 corpus containing research articles, mostly from biology and medicine, applicable to the COVID-19 crisis. Our study employs a combination of state-of-the-art machine learning techniques for text understanding, including the embeddings-based language model BERT and several systems for detection and semantic expansion of entities: ConceptNet, PubTator and ScispaCy. To interpret the resulting models, we use several explanation algorithms: random forest feature importance, LIME, and Shapley values. We compare the performance and comprehensibility of models obtained by “black-box” machine learning algorithms (neural networks and random forests) with models built with rule learning (CORELS, CBA), which are intrinsically explainable. Multiple rules were discovered that referred to biomedical entities of potential interest. Of the rules with the highest lift measure, several pointed to dipeptidyl peptidase 4 (DPP4), a known MERS-CoV receptor and a critical determinant of camel-to-human transmission of the camel coronavirus (MERS-CoV). Some other interesting patterns related to the type of animal investigated were found. Articles referring to bats and camels tend to draw citations, while articles referring to most other animal species related to coronavirus receive few citations. Bat coronavirus is the only other virus from a non-human species in the betaB clade along with the SARS-CoV and SARS-CoV-2 viruses. MERS-CoV is in a sister betaC clade, also close to human SARS coronaviruses. Thus both species linked to high citation counts harbor coronaviruses that are more phylogenetically similar to human SARS viruses. On the other hand, feline (FIPV, FCOV) and canine coronaviruses (CCOV) are in the alpha coronavirus clade and more distant from the betaB clade with human SARS viruses. Other results include detection of apparent citation bias favouring authors with Western-sounding names. TF-IDF weights and a binary word incidence matrix performed equally well, with the latter resulting in better interpretability. The best predictive performance was obtained with a “black-box” method, a neural network. The rule-based models led to the most insights, especially when coupled with text representation using semantic entity detection methods. Follow-up work should focus on the analysis of citation patterns in the context of phylogenetic trees, as well as on patterns referring to DPP4, which is currently considered a SARS-CoV-2 therapeutic target.
format	Online Article Text
id	pubmed-8993675
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-8993675 2022-04-11 Why was this cited? Explainable machine learning applied to COVID-19 research literature Beranová, Lucie Joachimiak, Marcin P. Kliegr, Tomáš Rabby, Gollam Sklenák, Vilém Scientometrics Article Springer International Publishing 2022-04-09 2022 /pmc/articles/PMC8993675/ /pubmed/35431364 http://dx.doi.org/10.1007/s11192-022-04314-9 Text en © Akadémiai Kiadó, Budapest, Hungary 2022 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
spellingShingle	Article Beranová, Lucie Joachimiak, Marcin P. Kliegr, Tomáš Rabby, Gollam Sklenák, Vilém Why was this cited? Explainable machine learning applied to COVID-19 research literature
title	Why was this cited? Explainable machine learning applied to COVID-19 research literature
title_full	Why was this cited? Explainable machine learning applied to COVID-19 research literature
title_fullStr	Why was this cited? Explainable machine learning applied to COVID-19 research literature
title_full_unstemmed	Why was this cited? Explainable machine learning applied to COVID-19 research literature
title_short	Why was this cited? Explainable machine learning applied to COVID-19 research literature
title_sort	why was this cited? explainable machine learning applied to covid-19 research literature
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8993675/ https://www.ncbi.nlm.nih.gov/pubmed/35431364 http://dx.doi.org/10.1007/s11192-022-04314-9
work_keys_str_mv	AT beranovalucie whywasthiscitedexplainablemachinelearningappliedtocovid19researchliterature AT joachimiakmarcinp whywasthiscitedexplainablemachinelearningappliedtocovid19researchliterature AT kliegrtomas whywasthiscitedexplainablemachinelearningappliedtocovid19researchliterature AT rabbygollam whywasthiscitedexplainablemachinelearningappliedtocovid19researchliterature AT sklenakvilem whywasthiscitedexplainablemachinelearningappliedtocovid19researchliterature
