Models and Processes to Extract Drug-like Molecules From Natural Language Text
Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be...
Main Authors: | Hong, Zhi; Pauloski, J. Gregory; Ward, Logan; Chard, Kyle; Blaiszik, Ben; Foster, Ian |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2021 |
Subjects: | Molecular Biosciences |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8435623/ https://www.ncbi.nlm.nih.gov/pubmed/34527701 http://dx.doi.org/10.3389/fmolb.2021.636077 |
_version_ | 1783751834031095808 |
---|---|
author | Hong, Zhi Pauloski, J. Gregory Ward, Logan Chard, Kyle Blaiszik, Ben Foster, Ian |
author_facet | Hong, Zhi Pauloski, J. Gregory Ward, Logan Chard, Kyle Blaiszik, Ben Foster, Ian |
author_sort | Hong, Zhi |
collection | PubMed |
description | Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be drug-like in the context of viral research. However, this literature is too large for human review and features unusual vocabularies for which existing named entity recognition (NER) models are ineffective. We report here on a project that leverages both human and artificial intelligence to detect references to such molecules in free text. We present 1) an iterative model-in-the-loop method that makes judicious use of scarce human expertise in generating training data for an NER model, and 2) the application and evaluation of this method to the problem of identifying drug-like molecules in the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198,875 papers. We show that by repeatedly presenting human labelers only with samples for which an evolving NER model is uncertain, our human-machine hybrid pipeline requires only modest amounts of non-expert human labeling time (tens of hours to label 1,778 samples) to generate an NER model with an F-1 score of 80.5% (on par with that of non-expert humans) and, when applied to CORD-19, identifies 10,912 putative drug-like molecules. This enriched the computational screening team’s targets by 3,591 molecules, of which 18 ranked in the top 0.1% of all 6.6 million molecules screened for docking against the 3CLPro protein. |
format | Online Article Text |
id | pubmed-8435623 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-84356232021-09-14 Models and Processes to Extract Drug-like Molecules From Natural Language Text Hong, Zhi Pauloski, J. Gregory Ward, Logan Chard, Kyle Blaiszik, Ben Foster, Ian Front Mol Biosci Molecular Biosciences Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be drug-like in the context of viral research. However, this literature is too large for human review and features unusual vocabularies for which existing named entity recognition (NER) models are ineffective. We report here on a project that leverages both human and artificial intelligence to detect references to such molecules in free text. We present 1) an iterative model-in-the-loop method that makes judicious use of scarce human expertise in generating training data for an NER model, and 2) the application and evaluation of this method to the problem of identifying drug-like molecules in the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198,875 papers. We show that by repeatedly presenting human labelers only with samples for which an evolving NER model is uncertain, our human-machine hybrid pipeline requires only modest amounts of non-expert human labeling time (tens of hours to label 1,778 samples) to generate an NER model with an F-1 score of 80.5% (on par with that of non-expert humans) and, when applied to CORD-19, identifies 10,912 putative drug-like molecules. This enriched the computational screening team’s targets by 3,591 molecules, of which 18 ranked in the top 0.1% of all 6.6 million molecules screened for docking against the 3CLPro protein. Frontiers Media S.A. 
2021-08-30 /pmc/articles/PMC8435623/ /pubmed/34527701 http://dx.doi.org/10.3389/fmolb.2021.636077 Text en Copyright © 2021 Hong, Pauloski, Ward, Chard, Blaiszik and Foster. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Molecular Biosciences Hong, Zhi Pauloski, J. Gregory Ward, Logan Chard, Kyle Blaiszik, Ben Foster, Ian Models and Processes to Extract Drug-like Molecules From Natural Language Text |
title | Models and Processes to Extract Drug-like Molecules From Natural Language Text |
title_full | Models and Processes to Extract Drug-like Molecules From Natural Language Text |
title_fullStr | Models and Processes to Extract Drug-like Molecules From Natural Language Text |
title_full_unstemmed | Models and Processes to Extract Drug-like Molecules From Natural Language Text |
title_short | Models and Processes to Extract Drug-like Molecules From Natural Language Text |
title_sort | models and processes to extract drug-like molecules from natural language text |
topic | Molecular Biosciences |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8435623/ https://www.ncbi.nlm.nih.gov/pubmed/34527701 http://dx.doi.org/10.3389/fmolb.2021.636077 |
work_keys_str_mv | AT hongzhi modelsandprocessestoextractdruglikemoleculesfromnaturallanguagetext AT pauloskijgregory modelsandprocessestoextractdruglikemoleculesfromnaturallanguagetext AT wardlogan modelsandprocessestoextractdruglikemoleculesfromnaturallanguagetext AT chardkyle modelsandprocessestoextractdruglikemoleculesfromnaturallanguagetext AT blaiszikben modelsandprocessestoextractdruglikemoleculesfromnaturallanguagetext AT fosterian modelsandprocessestoextractdruglikemoleculesfromnaturallanguagetext |
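
The abstract describes repeatedly showing human labelers only those samples on which the evolving NER model is uncertain. The selection step of such a loop might look roughly like the sketch below. This record does not say how the authors measure uncertainty, so the least-confidence score used here is an assumption, and every name (`sentence_uncertainty`, `select_for_labeling`, `toy_scorer`) is hypothetical rather than taken from the paper.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a scorer maps a sentence to one probability per
# whitespace token that the token belongs to a drug-like molecule mention.
TokenScorer = Callable[[str], Sequence[float]]


def sentence_uncertainty(token_probs: Sequence[float]) -> float:
    """Score a sentence by its least confident token.

    Assumed measure (not from the paper): 1.0 when some token has
    probability 0.5, 0.0 when every token is at exactly 0 or 1.
    """
    if not token_probs:
        return 0.0
    return max(1.0 - 2.0 * abs(p - 0.5) for p in token_probs)


def select_for_labeling(
    sentences: Sequence[str], scorer: TokenScorer, batch_size: int
) -> List[str]:
    """Return the batch_size sentences the current model is least sure about."""
    scored = [(sentence_uncertainty(scorer(s)), s) for s in sentences]
    # Most uncertain first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:batch_size]]


def toy_scorer(sentence: str) -> List[float]:
    """Stand-in for a trained NER model, used only for illustration."""
    known = {"remdesivir": 0.95, "compound": 0.5}
    return [known.get(tok.lower(), 0.05) for tok in sentence.split()]


if __name__ == "__main__":
    pool = [
        "Remdesivir inhibits replication",
        "The compound was tested",
        "Patients recovered quickly",
    ]
    print(select_for_labeling(pool, toy_scorer, batch_size=1))
```

In a full loop, the selected sentences would be labeled by a person, added to the training set, and the model retrained before the next round of selection.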