Models and Processes to Extract Drug-like Molecules From Natural Language Text

Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be...

Bibliographic Details
Main Authors:	Hong, Zhi; Pauloski, J. Gregory; Ward, Logan; Chard, Kyle; Blaiszik, Ben; Foster, Ian
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2021
Subjects:	Molecular Biosciences
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8435623/ https://www.ncbi.nlm.nih.gov/pubmed/34527701 http://dx.doi.org/10.3389/fmolb.2021.636077

author	Hong, Zhi; Pauloski, J. Gregory; Ward, Logan; Chard, Kyle; Blaiszik, Ben; Foster, Ian
collection	PubMed
description	Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be drug-like in the context of viral research. However, this literature is too large for human review and features unusual vocabularies for which existing named entity recognition (NER) models are ineffective. We report here on a project that leverages both human and artificial intelligence to detect references to such molecules in free text. We present 1) an iterative model-in-the-loop method that makes judicious use of scarce human expertise in generating training data for an NER model, and 2) the application and evaluation of this method to the problem of identifying drug-like molecules in the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198,875 papers. We show that by repeatedly presenting human labelers only with samples for which an evolving NER model is uncertain, our human-machine hybrid pipeline requires only modest amounts of non-expert human labeling time (tens of hours to label 1,778 samples) to generate an NER model with an F-1 score of 80.5%, on par with that of non-expert humans, and, when applied to CORD-19, identifies 10,912 putative drug-like molecules. This enriched the computational screening team’s targets by 3,591 molecules, of which 18 ranked in the top 0.1% of all 6.6 million molecules screened for docking against the 3CLPro protein.
format	Online Article Text
id	pubmed-8435623
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-8435623 2021-09-14 Models and Processes to Extract Drug-like Molecules From Natural Language Text Hong, Zhi; Pauloski, J. Gregory; Ward, Logan; Chard, Kyle; Blaiszik, Ben; Foster, Ian Front Mol Biosci Molecular Biosciences Frontiers Media S.A. 2021-08-30 /pmc/articles/PMC8435623/ /pubmed/34527701 http://dx.doi.org/10.3389/fmolb.2021.636077 Text en Copyright © 2021 Hong, Pauloski, Ward, Chard, Blaiszik and Foster. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Models and Processes to Extract Drug-like Molecules From Natural Language Text
topic	Molecular Biosciences
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8435623/ https://www.ncbi.nlm.nih.gov/pubmed/34527701 http://dx.doi.org/10.3389/fmolb.2021.636077
