Why was this cited? Explainable machine learning applied to COVID-19 research literature
Multiple studies have investigated bibliometric factors predictive of the citation count a research article will receive. In this article, we go beyond bibliometric data by using a range of machine learning techniques to find patterns predictive of citation count using both article content and avail...
Main Authors: | Beranová, Lucie; Joachimiak, Marcin P.; Kliegr, Tomáš; Rabby, Gollam; Sklenák, Vilém |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8993675/ https://www.ncbi.nlm.nih.gov/pubmed/35431364 http://dx.doi.org/10.1007/s11192-022-04314-9 |
_version_ | 1784683949843283968 |
---|---|
author | Beranová, Lucie; Joachimiak, Marcin P.; Kliegr, Tomáš; Rabby, Gollam; Sklenák, Vilém |
collection | PubMed |
description | Multiple studies have investigated bibliometric factors predictive of the citation count a research article will receive. In this article, we go beyond bibliometric data by using a range of machine learning techniques to find patterns predictive of citation count using both article content and available metadata. As the input collection, we use the CORD-19 corpus, which contains research articles (mostly from biology and medicine) applicable to the COVID-19 crisis. Our study employs a combination of state-of-the-art machine learning techniques for text understanding, including the embeddings-based language model BERT and several systems for detection and semantic expansion of entities: ConceptNet, Pubtator, and ScispaCy. To interpret the resulting models, we use several explanation algorithms: random forest feature importance, LIME, and Shapley values. We compare the performance and comprehensibility of models obtained by “black-box” machine learning algorithms (neural networks and random forests) with models built with rule learning (CORELS, CBA), which are intrinsically explainable. Multiple rules were discovered that referred to biomedical entities of potential interest. Of the rules with the highest lift measure, several pointed to dipeptidyl peptidase 4 (DPP4), a known MERS-CoV receptor and a critical determinant of camel-to-human transmission of the camel coronavirus (MERS-CoV). Some other interesting patterns related to the type of animal investigated were found. Articles referring to bats and camels tend to draw citations, while articles referring to most other animal species related to coronavirus receive few citations. Bat coronavirus is the only other virus from a non-human species in the betaB clade along with the SARS-CoV and SARS-CoV-2 viruses. MERS-CoV is in a sister betaC clade, also close to human SARS coronaviruses. Thus both species linked to high citation counts harbor coronaviruses that are more phylogenetically similar to human SARS viruses. On the other hand, feline (FIPV, FCOV) and canine coronaviruses (CCOV) are in the alpha coronavirus clade and more distant from the betaB clade with human SARS viruses. Other results include detection of apparent citation bias favouring authors with Western-sounding names. Equal performance of TF-IDF weights and binary word incidence matrix was observed, with the latter resulting in better interpretability. The best predictive performance was obtained with a “black-box” method, a neural network. The rule-based models led to the most insights, especially when coupled with text representation using semantic entity detection methods. Follow-up work should focus on the analysis of citation patterns in the context of phylogenetic trees, as well as on patterns referring to DPP4, which is currently considered a SARS-CoV-2 therapeutic target. |
format | Online Article Text |
id | pubmed-8993675 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-8993675 2022-04-11 Scientometrics Article Springer International Publishing 2022-04-09 2022 /pmc/articles/PMC8993675/ /pubmed/35431364 http://dx.doi.org/10.1007/s11192-022-04314-9 Text en © Akadémiai Kiadó, Budapest, Hungary 2022 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
title | Why was this cited? Explainable machine learning applied to COVID-19 research literature |
topic | Article |
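
The description in this record mentions rules ranked by their lift measure and a binary word incidence representation of articles. As a rough illustration only, the sketch below computes lift for a rule of the form "feature present → highly cited" over articles encoded as sets of present features. The function name `rule_lift`, the feature names, and the six toy articles are all hypothetical and invented for this example; they are not data or code from the study, which used rule learners such as CBA and CORELS.

```python
from typing import FrozenSet, Iterable, List, Set


def rule_lift(
    transactions: List[Set[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> float:
    """Lift of the rule antecedent -> consequent.

    lift = confidence(antecedent -> consequent) / support(consequent)
    A value above 1 means the antecedent makes the consequent more
    likely than its base rate; a value below 1 means less likely.
    """
    n = len(transactions)
    if n == 0:
        raise ValueError("no transactions given")

    # Count transactions containing the antecedent, the consequent, and both.
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_consequent = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

    if n_antecedent == 0 or n_consequent == 0:
        raise ValueError("antecedent or consequent never occurs")

    confidence = n_both / n_antecedent
    consequent_support = n_consequent / n
    return confidence / consequent_support


# Hypothetical toy corpus: each article is the set of binary features that
# are present in it (word incidence), plus a made-up "highly_cited" label.
articles: List[Set[str]] = [
    {"dpp4", "camel", "highly_cited"},
    {"dpp4", "highly_cited"},
    {"feline"},
    {"canine"},
    {"bat", "highly_cited"},
    {"feline", "canine"},
]

dpp4_lift = rule_lift(articles, frozenset({"dpp4"}), frozenset({"highly_cited"}))
feline_lift = rule_lift(articles, frozenset({"feline"}), frozenset({"highly_cited"}))
print(dpp4_lift, feline_lift)
```

In this invented corpus, half of the articles are highly cited, and every article mentioning `dpp4` is, so that rule has lift above 1, while the `feline` rule has lift 0 because no such article carries the label. Real corpora would also need minimum support thresholds so that rules backed by only a handful of articles are not over-interpreted.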