
ScienceQA: a novel resource for question answering on scholarly articles

Machine Reading Comprehension (MRC) of a document is a challenging problem that requires discourse-level understanding. Information extraction from scholarly articles nowadays is a critical use case for researchers to understand the underlying research quickly and move forward, especially in this ag...


Bibliographic Details
Main Authors: Saikh, Tanik; Ghosal, Tirthankar; Mittal, Amish; Ekbal, Asif; Bhattacharyya, Pushpak
Format: Online Article Text
Language: English
Published: Springer Berlin Heidelberg 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9297303/
https://www.ncbi.nlm.nih.gov/pubmed/35873651
http://dx.doi.org/10.1007/s00799-022-00329-y
_version_ 1784750448025010176
author Saikh, Tanik
Ghosal, Tirthankar
Mittal, Amish
Ekbal, Asif
Bhattacharyya, Pushpak
author_sort Saikh, Tanik
collection PubMed
description Machine Reading Comprehension (MRC) of a document is a challenging problem that requires discourse-level understanding. Information extraction from scholarly articles nowadays is a critical use case for researchers to understand the underlying research quickly and move forward, especially in this age of infodemic. MRC on research articles can also provide helpful information to the reviewers and editors. However, the main bottleneck in building such models is the availability of human-annotated data. In this paper, firstly, we introduce a dataset to facilitate question answering (QA) on scientific articles. We prepare the dataset in a semi-automated fashion having more than 100k human-annotated context–question–answer triples. Secondly, we implement one baseline QA model based on Bidirectional Encoder Representations from Transformers (BERT). Additionally, we implement two models: the first one is based on Science BERT (SciBERT), and the second is the combination of SciBERT and Bi-Directional Attention Flow (Bi-DAF). The best model (i.e., SciBERT) obtains an F1 score of 75.46%. Our dataset is novel, and our work opens up a new avenue for scholarly document processing research by providing a benchmark QA dataset and standard baseline. We make our dataset and codes available here at https://github.com/TanikSaikh/Scientific-Question-Answering.
format Online
Article
Text
id pubmed-9297303
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Springer Berlin Heidelberg
record_format MEDLINE/PubMed
spelling pubmed-9297303 2022-07-20 ScienceQA: a novel resource for question answering on scholarly articles Saikh, Tanik Ghosal, Tirthankar Mittal, Amish Ekbal, Asif Bhattacharyya, Pushpak Int J Digit Libr Article
Springer Berlin Heidelberg 2022-07-20 2022 /pmc/articles/PMC9297303/ /pubmed/35873651 http://dx.doi.org/10.1007/s00799-022-00329-y Text en © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2022 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title ScienceQA: a novel resource for question answering on scholarly articles
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9297303/
https://www.ncbi.nlm.nih.gov/pubmed/35873651
http://dx.doi.org/10.1007/s00799-022-00329-y
work_keys_str_mv AT saikhtanik scienceqaanovelresourceforquestionansweringonscholarlyarticles
AT ghosaltirthankar scienceqaanovelresourceforquestionansweringonscholarlyarticles
AT mittalamish scienceqaanovelresourceforquestionansweringonscholarlyarticles
AT ekbalasif scienceqaanovelresourceforquestionansweringonscholarlyarticles
AT bhattacharyyapushpak scienceqaanovelresourceforquestionansweringonscholarlyarticles
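
The description above reports an F1 score of 75.46% for the best model, but this record does not say how that score is computed. As a rough illustration only, the sketch below assumes the token-overlap F1 commonly used for extractive question answering: precision and recall are measured over the words shared by a predicted answer and a reference answer. The function name `token_f1` and the whitespace tokenization are assumptions made for this example and are not taken from the authors' code.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: lowercase, whitespace-split tokens. The evaluation used
    in the article may normalize text differently.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either answer is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical answers, not drawn from the dataset.
    print(token_f1("the bert model", "bert model"))
```

Under this assumption, a dataset-level score would be the mean of `token_f1` over all question-answer pairs, expressed as a percentage.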