
Models and Processes to Extract Drug-like Molecules From Natural Language Text

Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be...


Bibliographic Details
Main Authors: Hong, Zhi; Pauloski, J. Gregory; Ward, Logan; Chard, Kyle; Blaiszik, Ben; Foster, Ian
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8435623/
https://www.ncbi.nlm.nih.gov/pubmed/34527701
http://dx.doi.org/10.3389/fmolb.2021.636077
_version_ 1783751834031095808
author Hong, Zhi
Pauloski, J. Gregory
Ward, Logan
Chard, Kyle
Blaiszik, Ben
Foster, Ian
author_facet Hong, Zhi
Pauloski, J. Gregory
Ward, Logan
Chard, Kyle
Blaiszik, Ben
Foster, Ian
author_sort Hong, Zhi
collection PubMed
description Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be drug-like in the context of viral research. However, this literature is too large for human review and features unusual vocabularies for which existing named entity recognition (NER) models are ineffective. We report here on a project that leverages both human and artificial intelligence to detect references to such molecules in free text. We present 1) an iterative model-in-the-loop method that makes judicious use of scarce human expertise in generating training data for an NER model, and 2) the application and evaluation of this method to the problem of identifying drug-like molecules in the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198,875 papers. We show that by repeatedly presenting human labelers only with samples for which an evolving NER model is uncertain, our human-machine hybrid pipeline requires only modest amounts of non-expert human labeling time (tens of hours to label 1,778 samples) to generate an NER model with an F-1 score of 80.5%, on par with that of non-expert humans, and, when applied to CORD-19, identifies 10,912 putative drug-like molecules. This enriched the computational screening team’s targets by 3,591 molecules, of which 18 ranked in the top 0.1% of all 6.6 million molecules screened for docking against the 3CLPro protein.
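The selection step mentioned in the abstract (showing human labelers only the samples for which the evolving NER model is uncertain) can be illustrated with a minimal sketch. The abstract does not say how uncertainty is measured, so the least-confidence score used here is an assumption, and the function names `sample_uncertainty` and `select_for_labeling` are hypothetical, not taken from the paper.

```python
from typing import List, Sequence

# One sample = one list of per-token probability distributions over NER labels.
TokenProbs = Sequence[Sequence[float]]


def sample_uncertainty(token_probs: TokenProbs) -> float:
    """Least-confidence score (an assumed measure, not the paper's).

    Returns 1 minus the smallest top-label probability across the tokens,
    so a single poorly predicted token makes the whole sample uncertain.
    """
    if not token_probs:
        return 0.0
    return 1.0 - min(max(dist) for dist in token_probs)


def select_for_labeling(pool: List[TokenProbs], k: int) -> List[int]:
    """Return indices of the k most uncertain samples in the unlabeled pool."""
    ranked = sorted(
        range(len(pool)),
        key=lambda i: sample_uncertainty(pool[i]),
        reverse=True,
    )
    return ranked[:k]
```

In an iterative loop of this kind, the selected samples would be labeled by humans, added to the training set, and the model retrained before the next round of selection.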
format Online
Article
Text
id pubmed-8435623
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8435623 2021-09-14 Models and Processes to Extract Drug-like Molecules From Natural Language Text Hong, Zhi Pauloski, J. Gregory Ward, Logan Chard, Kyle Blaiszik, Ben Foster, Ian Front Mol Biosci Molecular Biosciences Frontiers Media S.A. 2021-08-30 /pmc/articles/PMC8435623/ /pubmed/34527701 http://dx.doi.org/10.3389/fmolb.2021.636077 Text en Copyright © 2021 Hong, Pauloski, Ward, Chard, Blaiszik and Foster. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Molecular Biosciences
Hong, Zhi
Pauloski, J. Gregory
Ward, Logan
Chard, Kyle
Blaiszik, Ben
Foster, Ian
Models and Processes to Extract Drug-like Molecules From Natural Language Text
title Models and Processes to Extract Drug-like Molecules From Natural Language Text
title_full Models and Processes to Extract Drug-like Molecules From Natural Language Text
title_fullStr Models and Processes to Extract Drug-like Molecules From Natural Language Text
title_full_unstemmed Models and Processes to Extract Drug-like Molecules From Natural Language Text
title_short Models and Processes to Extract Drug-like Molecules From Natural Language Text
title_sort models and processes to extract drug-like molecules from natural language text
topic Molecular Biosciences
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8435623/
https://www.ncbi.nlm.nih.gov/pubmed/34527701
http://dx.doi.org/10.3389/fmolb.2021.636077
work_keys_str_mv AT hongzhi modelsandprocessestoextractdruglikemoleculesfromnaturallanguagetext
AT pauloskijgregory modelsandprocessestoextractdruglikemoleculesfromnaturallanguagetext
AT wardlogan modelsandprocessestoextractdruglikemoleculesfromnaturallanguagetext
AT chardkyle modelsandprocessestoextractdruglikemoleculesfromnaturallanguagetext
AT blaiszikben modelsandprocessestoextractdruglikemoleculesfromnaturallanguagetext
AT fosterian modelsandprocessestoextractdruglikemoleculesfromnaturallanguagetext