Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study

Bibliographic Details
Main authors: Vaghela, Uddhav; Rabinowicz, Simon; Bratsos, Paris; Martin, Guy; Fritzilas, Epameinondas; Markar, Sheraz; Purkayastha, Sanjay; Stringer, Karl; Singh, Harshdeep; Llewellyn, Charlie; Dutta, Debabrata; Clarke, Jonathan M; Howard, Matthew; Serban, Ovidiu; Kinross, James
Format: Online Article Text
Language: English
Published: JMIR Publications 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8104004/
https://www.ncbi.nlm.nih.gov/pubmed/33835932
http://dx.doi.org/10.2196/25714
_version_ 1783689405246996480
author Vaghela, Uddhav
Rabinowicz, Simon
Bratsos, Paris
Martin, Guy
Fritzilas, Epameinondas
Markar, Sheraz
Purkayastha, Sanjay
Stringer, Karl
Singh, Harshdeep
Llewellyn, Charlie
Dutta, Debabrata
Clarke, Jonathan M
Howard, Matthew
Serban, Ovidiu
Kinross, James
author_facet Vaghela, Uddhav
Rabinowicz, Simon
Bratsos, Paris
Martin, Guy
Fritzilas, Epameinondas
Markar, Sheraz
Purkayastha, Sanjay
Stringer, Karl
Singh, Harshdeep
Llewellyn, Charlie
Dutta, Debabrata
Clarke, Jonathan M
Howard, Matthew
Serban, Ovidiu
Kinross, James
author_sort Vaghela, Uddhav
collection PubMed
description BACKGROUND: The scale and quality of the global scientific response to the COVID-19 pandemic have unquestionably saved lives. However, the COVID-19 pandemic has also triggered an unprecedented “infodemic”; the velocity and volume of data production have overwhelmed many key stakeholders such as clinicians and policy makers, as they have been unable to process structured and unstructured data for evidence-based decision making. Solutions that aim to alleviate this data synthesis–related challenge are unable to capture heterogeneous web data in real time for the production of concomitant answers and are not based on the high-quality information in responses to a free-text query. OBJECTIVE: The main objective of this project is to build a generic, real-time, continuously updating curation platform that can support the data synthesis and analysis of a scientific literature framework. Our secondary objective is to validate this platform and the curation methodology for COVID-19–related medical literature by expanding the COVID-19 Open Research Dataset via the addition of new, unstructured data. METHODS: To create an infrastructure that addresses our objectives, the PanSurg Collaborative at Imperial College London has developed a unique data pipeline based on a web crawler extraction methodology. This data pipeline uses a novel curation methodology that adopts a human-in-the-loop approach for the characterization of quality, relevance, and key evidence across a range of scientific literature sources. RESULTS: REDASA (Realtime Data Synthesis and Analysis) is now one of the world’s largest and most up-to-date sources of COVID-19–related evidence; it consists of 104,000 documents. By capturing curators’ critical appraisal methodologies through the discrete labeling and rating of information, REDASA rapidly developed a foundational, pooled, data science data set of over 1400 articles in under 2 weeks. 
These articles provide COVID-19–related information and represent around 10% of all papers about COVID-19. CONCLUSIONS: This data set can act as ground truth for the future implementation of a live, automated systematic review. The three benefits of REDASA’s design are as follows: (1) it adopts a user-friendly, human-in-the-loop methodology by embedding an efficient, user-friendly curation platform into a natural language processing search engine; (2) it provides a curated data set in the JavaScript Object Notation format for experienced academic reviewers’ critical appraisal choices and decision-making methodologies; and (3) due to the wide scope and depth of its web crawling method, REDASA has already captured one of the world’s largest COVID-19–related data corpora for searches and curation.
format Online
Article
Text
id pubmed-8104004
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-8104004 2021-05-12 Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study Vaghela, Uddhav Rabinowicz, Simon Bratsos, Paris Martin, Guy Fritzilas, Epameinondas Markar, Sheraz Purkayastha, Sanjay Stringer, Karl Singh, Harshdeep Llewellyn, Charlie Dutta, Debabrata Clarke, Jonathan M Howard, Matthew Serban, Ovidiu Kinross, James J Med Internet Res Original Paper BACKGROUND: The scale and quality of the global scientific response to the COVID-19 pandemic have unquestionably saved lives. However, the COVID-19 pandemic has also triggered an unprecedented “infodemic”; the velocity and volume of data production have overwhelmed many key stakeholders such as clinicians and policy makers, as they have been unable to process structured and unstructured data for evidence-based decision making. Solutions that aim to alleviate this data synthesis–related challenge are unable to capture heterogeneous web data in real time for the production of concomitant answers and are not based on the high-quality information in responses to a free-text query. OBJECTIVE: The main objective of this project is to build a generic, real-time, continuously updating curation platform that can support the data synthesis and analysis of a scientific literature framework. Our secondary objective is to validate this platform and the curation methodology for COVID-19–related medical literature by expanding the COVID-19 Open Research Dataset via the addition of new, unstructured data. METHODS: To create an infrastructure that addresses our objectives, the PanSurg Collaborative at Imperial College London has developed a unique data pipeline based on a web crawler extraction methodology. 
This data pipeline uses a novel curation methodology that adopts a human-in-the-loop approach for the characterization of quality, relevance, and key evidence across a range of scientific literature sources. RESULTS: REDASA (Realtime Data Synthesis and Analysis) is now one of the world’s largest and most up-to-date sources of COVID-19–related evidence; it consists of 104,000 documents. By capturing curators’ critical appraisal methodologies through the discrete labeling and rating of information, REDASA rapidly developed a foundational, pooled, data science data set of over 1400 articles in under 2 weeks. These articles provide COVID-19–related information and represent around 10% of all papers about COVID-19. CONCLUSIONS: This data set can act as ground truth for the future implementation of a live, automated systematic review. The three benefits of REDASA’s design are as follows: (1) it adopts a user-friendly, human-in-the-loop methodology by embedding an efficient, user-friendly curation platform into a natural language processing search engine; (2) it provides a curated data set in the JavaScript Object Notation format for experienced academic reviewers’ critical appraisal choices and decision-making methodologies; and (3) due to the wide scope and depth of its web crawling method, REDASA has already captured one of the world’s largest COVID-19–related data corpora for searches and curation. JMIR Publications 2021-05-06 /pmc/articles/PMC8104004/ /pubmed/33835932 http://dx.doi.org/10.2196/25714 Text en ©Uddhav Vaghela, Simon Rabinowicz, Paris Bratsos, Guy Martin, Epameinondas Fritzilas, Sheraz Markar, Sanjay Purkayastha, Karl Stringer, Harshdeep Singh, Charlie Llewellyn, Debabrata Dutta, Jonathan M Clarke, Matthew Howard, PanSurg REDASA Curators, Ovidiu Serban, James Kinross. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 06.05.2021. 
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
spellingShingle Original Paper
Vaghela, Uddhav
Rabinowicz, Simon
Bratsos, Paris
Martin, Guy
Fritzilas, Epameinondas
Markar, Sheraz
Purkayastha, Sanjay
Stringer, Karl
Singh, Harshdeep
Llewellyn, Charlie
Dutta, Debabrata
Clarke, Jonathan M
Howard, Matthew
Serban, Ovidiu
Kinross, James
Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study
title Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study
title_full Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study
title_fullStr Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study
title_full_unstemmed Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study
title_short Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study
title_sort using a secure, continually updating, web source processing pipeline to support the real-time data synthesis and analysis of scientific literature: development and validation study
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8104004/
https://www.ncbi.nlm.nih.gov/pubmed/33835932
http://dx.doi.org/10.2196/25714
work_keys_str_mv AT vaghelauddhav usingasecurecontinuallyupdatingwebsourceprocessingpipelinetosupporttherealtimedatasynthesisandanalysisofscientificliteraturedevelopmentandvalidationstudy
AT rabinowiczsimon usingasecurecontinuallyupdatingwebsourceprocessingpipelinetosupporttherealtimedatasynthesisandanalysisofscientificliteraturedevelopmentandvalidationstudy
AT bratsosparis usingasecurecontinuallyupdatingwebsourceprocessingpipelinetosupporttherealtimedatasynthesisandanalysisofscientificliteraturedevelopmentandvalidationstudy
AT martinguy usingasecurecontinuallyupdatingwebsourceprocessingpipelinetosupporttherealtimedatasynthesisandanalysisofscientificliteraturedevelopmentandvalidationstudy
AT fritzilasepameinondas usingasecurecontinuallyupdatingwebsourceprocessingpipelinetosupporttherealtimedatasynthesisandanalysisofscientificliteraturedevelopmentandvalidationstudy
AT markarsheraz usingasecurecontinuallyupdatingwebsourceprocessingpipelinetosupporttherealtimedatasynthesisandanalysisofscientificliteraturedevelopmentandvalidationstudy
AT purkayasthasanjay usingasecurecontinuallyupdatingwebsourceprocessingpipelinetosupporttherealtimedatasynthesisandanalysisofscientificliteraturedevelopmentandvalidationstudy
AT stringerkarl usingasecurecontinuallyupdatingwebsourceprocessingpipelinetosupporttherealtimedatasynthesisandanalysisofscientificliteraturedevelopmentandvalidationstudy
AT singhharshdeep usingasecurecontinuallyupdatingwebsourceprocessingpipelinetosupporttherealtimedatasynthesisandanalysisofscientificliteraturedevelopmentandvalidationstudy
AT llewellyncharlie usingasecurecontinuallyupdatingwebsourceprocessingpipelinetosupporttherealtimedatasynthesisandanalysisofscientificliteraturedevelopmentandvalidationstudy
AT duttadebabrata usingasecurecontinuallyupdatingwebsourceprocessingpipelinetosupporttherealtimedatasynthesisandanalysisofscientificliteraturedevelopmentandvalidationstudy
AT clarkejonathanm usingasecurecontinuallyupdatingwebsourceprocessingpipelinetosupporttherealtimedatasynthesisandanalysisofscientificliteraturedevelopmentandvalidationstudy
AT howardmatthew usingasecurecontinuallyupdatingwebsourceprocessingpipelinetosupporttherealtimedatasynthesisandanalysisofscientificliteraturedevelopmentandvalidationstudy
AT usingasecurecontinuallyupdatingwebsourceprocessingpipelinetosupporttherealtimedatasynthesisandanalysisofscientificliteraturedevelopmentandvalidationstudy
AT serbanovidiu usingasecurecontinuallyupdatingwebsourceprocessingpipelinetosupporttherealtimedatasynthesisandanalysisofscientificliteraturedevelopmentandvalidationstudy
AT kinrossjames usingasecurecontinuallyupdatingwebsourceprocessingpipelinetosupporttherealtimedatasynthesisandanalysisofscientificliteraturedevelopmentandvalidationstudy