ScienceQA: a novel resource for question answering on scholarly articles
Machine Reading Comprehension (MRC) of a document is a challenging problem that requires discourse-level understanding. Information extraction from scholarly articles nowadays is a critical use case for researchers to understand the underlying research quickly and move forward, especially in this ag...
Main authors: | Saikh, Tanik; Ghosal, Tirthankar; Mittal, Amish; Ekbal, Asif; Bhattacharyya, Pushpak |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer Berlin Heidelberg 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9297303/ https://www.ncbi.nlm.nih.gov/pubmed/35873651 http://dx.doi.org/10.1007/s00799-022-00329-y |
_version_ | 1784750448025010176 |
---|---|
author | Saikh, Tanik Ghosal, Tirthankar Mittal, Amish Ekbal, Asif Bhattacharyya, Pushpak |
author_facet | Saikh, Tanik Ghosal, Tirthankar Mittal, Amish Ekbal, Asif Bhattacharyya, Pushpak |
author_sort | Saikh, Tanik |
collection | PubMed |
description | Machine Reading Comprehension (MRC) of a document is a challenging problem that requires discourse-level understanding. Information extraction from scholarly articles nowadays is a critical use case for researchers to understand the underlying research quickly and move forward, especially in this age of infodemic. MRC on research articles can also provide helpful information to the reviewers and editors. However, the main bottleneck in building such models is the availability of human-annotated data. In this paper, firstly, we introduce a dataset to facilitate question answering (QA) on scientific articles. We prepare the dataset in a semi-automated fashion having more than 100k human-annotated context–question–answer triples. Secondly, we implement one baseline QA model based on Bidirectional Encoder Representations from Transformers (BERT). Additionally, we implement two models: the first one is based on Science BERT (SciBERT), and the second is the combination of SciBERT and Bi-Directional Attention Flow (Bi-DAF). The best model (i.e., SciBERT) obtains an F1 score of 75.46%. Our dataset is novel, and our work opens up a new avenue for scholarly document processing research by providing a benchmark QA dataset and standard baseline. We make our dataset and codes available here at https://github.com/TanikSaikh/Scientific-Question-Answering. |
format | Online Article Text |
id | pubmed-9297303 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Springer Berlin Heidelberg |
record_format | MEDLINE/PubMed |
spelling | pubmed-92973032022-07-20 ScienceQA: a novel resource for question answering on scholarly articles Saikh, Tanik Ghosal, Tirthankar Mittal, Amish Ekbal, Asif Bhattacharyya, Pushpak Int J Digit Libr Article Machine Reading Comprehension (MRC) of a document is a challenging problem that requires discourse-level understanding. Information extraction from scholarly articles nowadays is a critical use case for researchers to understand the underlying research quickly and move forward, especially in this age of infodemic. MRC on research articles can also provide helpful information to the reviewers and editors. However, the main bottleneck in building such models is the availability of human-annotated data. In this paper, firstly, we introduce a dataset to facilitate question answering (QA) on scientific articles. We prepare the dataset in a semi-automated fashion having more than 100k human-annotated context–question–answer triples. Secondly, we implement one baseline QA model based on Bidirectional Encoder Representations from Transformers (BERT). Additionally, we implement two models: the first one is based on Science BERT (SciBERT), and the second is the combination of SciBERT and Bi-Directional Attention Flow (Bi-DAF). The best model (i.e., SciBERT) obtains an F1 score of 75.46%. Our dataset is novel, and our work opens up a new avenue for scholarly document processing research by providing a benchmark QA dataset and standard baseline. We make our dataset and codes available here at https://github.com/TanikSaikh/Scientific-Question-Answering. 
Springer Berlin Heidelberg 2022-07-20 2022 /pmc/articles/PMC9297303/ /pubmed/35873651 http://dx.doi.org/10.1007/s00799-022-00329-y Text en © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2022 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
spellingShingle | Article Saikh, Tanik Ghosal, Tirthankar Mittal, Amish Ekbal, Asif Bhattacharyya, Pushpak ScienceQA: a novel resource for question answering on scholarly articles |
title | ScienceQA: a novel resource for question answering on scholarly articles |
title_full | ScienceQA: a novel resource for question answering on scholarly articles |
title_fullStr | ScienceQA: a novel resource for question answering on scholarly articles |
title_full_unstemmed | ScienceQA: a novel resource for question answering on scholarly articles |
title_short | ScienceQA: a novel resource for question answering on scholarly articles |
title_sort | scienceqa: a novel resource for question answering on scholarly articles |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9297303/ https://www.ncbi.nlm.nih.gov/pubmed/35873651 http://dx.doi.org/10.1007/s00799-022-00329-y |
work_keys_str_mv | AT saikhtanik scienceqaanovelresourceforquestionansweringonscholarlyarticles AT ghosaltirthankar scienceqaanovelresourceforquestionansweringonscholarlyarticles AT mittalamish scienceqaanovelresourceforquestionansweringonscholarlyarticles AT ekbalasif scienceqaanovelresourceforquestionansweringonscholarlyarticles AT bhattacharyyapushpak scienceqaanovelresourceforquestionansweringonscholarlyarticles |