
A document classifier for medicinal chemistry publications trained on the ChEMBL corpus

BACKGROUND: The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches in order to assist in the triage process, both for individual scientists and also for larger-scale data extraction and curation into public databases. Here,...


Bibliographic Details
Main Authors: Papadatos, George; van Westen, Gerard JP; Croset, Samuel; Santos, Rita; Trubian, Simone; Overington, John P
Format: Online Article Text
Language: English
Published: Springer International Publishing 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4158272/
https://www.ncbi.nlm.nih.gov/pubmed/25221627
http://dx.doi.org/10.1186/s13321-014-0040-8
_version_ 1782334019753476096
author Papadatos, George
van Westen, Gerard JP
Croset, Samuel
Santos, Rita
Trubian, Simone
Overington, John P
author_facet Papadatos, George
van Westen, Gerard JP
Croset, Samuel
Santos, Rita
Trubian, Simone
Overington, John P
author_sort Papadatos, George
collection PubMed
description BACKGROUND: The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches to assist in the triage process, both for individual scientists and for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier that can distinguish between publications that are 'ChEMBL-like' (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology, makes the ChEMBL corpus a unique resource for text mining. RESULTS: The method has been implemented as a data protocol/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9). Both workflows and models are freely available at: ftp://ftp.ebi.ac.uk/pub/databases/chembl/text-mining. These can be readily modified to include additional keyword constraints to further focus searches. CONCLUSIONS: Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13321-014-0040-8) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4158272
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-4158272 2014-09-10 A document classifier for medicinal chemistry publications trained on the ChEMBL corpus Papadatos, George van Westen, Gerard JP Croset, Samuel Santos, Rita Trubian, Simone Overington, John P J Cheminform Software BACKGROUND: The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches to assist in the triage process, both for individual scientists and for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier that can distinguish between publications that are 'ChEMBL-like' (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology, makes the ChEMBL corpus a unique resource for text mining. RESULTS: The method has been implemented as a data protocol/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9). Both workflows and models are freely available at: ftp://ftp.ebi.ac.uk/pub/databases/chembl/text-mining. These can be readily modified to include additional keyword constraints to further focus searches. CONCLUSIONS: Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13321-014-0040-8) contains supplementary material, which is available to authorized users.
Springer International Publishing 2014-08-12 /pmc/articles/PMC4158272/ /pubmed/25221627 http://dx.doi.org/10.1186/s13321-014-0040-8 Text en © Papadatos et al.; licensee Chemistry Central Ltd. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Software
Papadatos, George
van Westen, Gerard JP
Croset, Samuel
Santos, Rita
Trubian, Simone
Overington, John P
A document classifier for medicinal chemistry publications trained on the ChEMBL corpus
title A document classifier for medicinal chemistry publications trained on the ChEMBL corpus
title_full A document classifier for medicinal chemistry publications trained on the ChEMBL corpus
title_fullStr A document classifier for medicinal chemistry publications trained on the ChEMBL corpus
title_full_unstemmed A document classifier for medicinal chemistry publications trained on the ChEMBL corpus
title_short A document classifier for medicinal chemistry publications trained on the ChEMBL corpus
title_sort document classifier for medicinal chemistry publications trained on the chembl corpus
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4158272/
https://www.ncbi.nlm.nih.gov/pubmed/25221627
http://dx.doi.org/10.1186/s13321-014-0040-8
work_keys_str_mv AT papadatosgeorge adocumentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus
AT vanwestengerardjp adocumentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus
AT crosetsamuel adocumentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus
AT santosrita adocumentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus
AT trubiansimone adocumentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus
AT overingtonjohnp adocumentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus
AT papadatosgeorge documentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus
AT vanwestengerardjp documentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus
AT crosetsamuel documentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus
AT santosrita documentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus
AT trubiansimone documentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus
AT overingtonjohnp documentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus