
Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach

MOTIVATION: Application of chemical named entity recognition (CNER) algorithms allows retrieval of information from texts about chemical compound identifiers and creates associations with physical–chemical properties and biological activities. Scientific texts represent low-formalized sources of inf...


Bibliographic Details
Main authors: Tarasova, O. A., Rudik, A. V., Biziukova, N. Yu., Filimonov, D. A., Poroikov, V. V.
Format: Online Article Text
Language: English
Published: Springer International Publishing 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9375066/
https://www.ncbi.nlm.nih.gov/pubmed/35964150
http://dx.doi.org/10.1186/s13321-022-00633-4
author Tarasova, O. A.
Rudik, A. V.
Biziukova, N. Yu.
Filimonov, D. A.
Poroikov, V. V.
collection PubMed
description MOTIVATION: Application of chemical named entity recognition (CNER) algorithms allows retrieval of information from texts about chemical compound identifiers and creates associations with physical–chemical properties and biological activities. Scientific texts represent low-formalized sources of information. Most methods aimed at CNER are based on machine learning approaches, including conditional random fields and deep neural networks. In general, most machine learning approaches require either vector or sparse word representation of texts. Chemical named entities (CNEs) constitute only a small fraction of the whole text, and the datasets used for training are highly imbalanced. METHODS AND RESULTS: We propose a new method for extracting CNEs from texts based on the naïve Bayes classifier combined with specially developed filters. In contrast to earlier CNER methods, our approach represents the data as a set of fragments of text (FoTs) and then prepares a set of multi-n-grams (sequences from one to n symbols) for each FoT. Our approach may provide the recognition of novel CNEs. For the CHEMDNER corpus, sensitivity (recall) was 0.95, precision was 0.74, specificity was 0.88, and balanced accuracy was 0.92, based on five-fold cross-validation. We applied the developed algorithm to the extracted CNEs of potential Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease (Mpro) inhibitors. A set of CNEs corresponding to the chemical substances evaluated in the biochemical assays used for the discovery of Mpro inhibitors was retrieved. Manual analysis of the appropriate texts showed that CNEs of potential SARS-CoV-2 Mpro inhibitors were successfully identified by our method.
CONCLUSION: The obtained results show that the proposed method can be used for filtering out words that are not related to CNEs; therefore, it can be successfully applied to the extraction of CNEs for the purposes of cheminformatics and medicinal chemistry. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13321-022-00633-4.
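The multi-n-gram representation and the balanced accuracy figure mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "symbols" means single characters, and the function names `multi_ngrams` and `balanced_accuracy` are hypothetical names introduced here for illustration only.

```python
def multi_ngrams(fragment: str, n: int) -> list[str]:
    """Return every contiguous character sequence of length 1..n in `fragment`.

    Assumption: a "symbol" is one character; the paper may tokenize differently.
    """
    grams = []
    for size in range(1, n + 1):
        for start in range(len(fragment) - size + 1):
            grams.append(fragment[start:start + size])
    return grams


def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Balanced accuracy is the mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


if __name__ == "__main__":
    # Multi-n-grams of a short fragment of text, with n = 2.
    print(multi_ngrams("NaCl", 2))
    # Mean of the sensitivity and specificity reported in the abstract.
    print(round(balanced_accuracy(0.95, 0.88), 3))
```

With the rounded values from the abstract (0.95 and 0.88) the mean is 0.915, which is in line with the reported balanced accuracy of 0.92, bearing in mind that the inputs are themselves rounded.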
format Online
Article
Text
id pubmed-9375066
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-9375066 2022-08-14 Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach Tarasova, O. A. Rudik, A. V. Biziukova, N. Yu. Filimonov, D. A. Poroikov, V. V. J Cheminform Research MOTIVATION: Application of chemical named entity recognition (CNER) algorithms allows retrieval of information from texts about chemical compound identifiers and creates associations with physical–chemical properties and biological activities. Scientific texts represent low-formalized sources of information. Most methods aimed at CNER are based on machine learning approaches, including conditional random fields and deep neural networks. In general, most machine learning approaches require either vector or sparse word representation of texts. Chemical named entities (CNEs) constitute only a small fraction of the whole text, and the datasets used for training are highly imbalanced. METHODS AND RESULTS: We propose a new method for extracting CNEs from texts based on the naïve Bayes classifier combined with specially developed filters. In contrast to earlier CNER methods, our approach represents the data as a set of fragments of text (FoTs) and then prepares a set of multi-n-grams (sequences from one to n symbols) for each FoT. Our approach may provide the recognition of novel CNEs. For the CHEMDNER corpus, sensitivity (recall) was 0.95, precision was 0.74, specificity was 0.88, and balanced accuracy was 0.92, based on five-fold cross-validation. We applied the developed algorithm to the extracted CNEs of potential Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease (Mpro) inhibitors. A set of CNEs corresponding to the chemical substances evaluated in the biochemical assays used for the discovery of Mpro inhibitors was retrieved.
Manual analysis of the appropriate texts showed that CNEs of potential SARS-CoV-2 Mpro inhibitors were successfully identified by our method. CONCLUSION: The obtained results show that the proposed method can be used for filtering out words that are not related to CNEs; therefore, it can be successfully applied to the extraction of CNEs for the purposes of cheminformatics and medicinal chemistry. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13321-022-00633-4. Springer International Publishing 2022-08-13 /pmc/articles/PMC9375066/ /pubmed/35964150 http://dx.doi.org/10.1186/s13321-022-00633-4 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach
topic Research