A document classifier for medicinal chemistry publications trained on the ChEMBL corpus
BACKGROUND: The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches in order to assist in the triage process, both for individual scientists and also for larger-scale data extraction and curation into public databases. Here,...
Main authors: | Papadatos, George; van Westen, Gerard JP; Croset, Samuel; Santos, Rita; Trubian, Simone; Overington, John P |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing, 2014 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4158272/ https://www.ncbi.nlm.nih.gov/pubmed/25221627 http://dx.doi.org/10.1186/s13321-014-0040-8 |
_version_ | 1782334019753476096 |
---|---|
author | Papadatos, George; van Westen, Gerard JP; Croset, Samuel; Santos, Rita; Trubian, Simone; Overington, John P |
author_facet | Papadatos, George; van Westen, Gerard JP; Croset, Samuel; Santos, Rita; Trubian, Simone; Overington, John P |
author_sort | Papadatos, George |
collection | PubMed |
description | BACKGROUND: The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches in order to assist in the triage process, both for individual scientists and also for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier, which is able to successfully distinguish between publications that are 'ChEMBL-like' (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology, makes the ChEMBL corpus a unique resource for text mining. RESULTS: The method has been implemented as a data protocol/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9). Both workflows and models are freely available at: ftp://ftp.ebi.ac.uk/pub/databases/chembl/text-mining. These can be readily modified to include additional keyword constraints to further focus searches. CONCLUSIONS: Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13321-014-0040-8) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4158272 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-4158272 2014-09-10 A document classifier for medicinal chemistry publications trained on the ChEMBL corpus Papadatos, George van Westen, Gerard JP Croset, Samuel Santos, Rita Trubian, Simone Overington, John P J Cheminform Software BACKGROUND: The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches in order to assist in the triage process, both for individual scientists and also for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier, which is able to successfully distinguish between publications that are 'ChEMBL-like' (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology make the ChEMBL corpus a unique resource for text mining. RESULTS: The method has been implemented as a data protocol/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9) respectively. Both workflows and models are freely available at: ftp://ftp.ebi.ac.uk/pub/databases/chembl/text-mining. These can be readily modified to include additional keyword constraints to further focus searches. CONCLUSIONS: Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13321-014-0040-8) contains supplementary material, which is available to authorized users.
Springer International Publishing 2014-08-12 /pmc/articles/PMC4158272/ /pubmed/25221627 http://dx.doi.org/10.1186/s13321-014-0040-8 Text en © Papadatos et al.; licensee Chemistry Central Ltd. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Software Papadatos, George van Westen, Gerard JP Croset, Samuel Santos, Rita Trubian, Simone Overington, John P A document classifier for medicinal chemistry publications trained on the ChEMBL corpus |
title | A document classifier for medicinal chemistry publications trained on the ChEMBL corpus |
title_full | A document classifier for medicinal chemistry publications trained on the ChEMBL corpus |
title_fullStr | A document classifier for medicinal chemistry publications trained on the ChEMBL corpus |
title_full_unstemmed | A document classifier for medicinal chemistry publications trained on the ChEMBL corpus |
title_short | A document classifier for medicinal chemistry publications trained on the ChEMBL corpus |
title_sort | document classifier for medicinal chemistry publications trained on the chembl corpus |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4158272/ https://www.ncbi.nlm.nih.gov/pubmed/25221627 http://dx.doi.org/10.1186/s13321-014-0040-8 |
work_keys_str_mv | AT papadatosgeorge adocumentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus AT vanwestengerardjp adocumentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus AT crosetsamuel adocumentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus AT santosrita adocumentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus AT trubiansimone adocumentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus AT overingtonjohnp adocumentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus AT papadatosgeorge documentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus AT vanwestengerardjp documentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus AT crosetsamuel documentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus AT santosrita documentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus AT trubiansimone documentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus AT overingtonjohnp documentclassifierformedicinalchemistrypublicationstrainedonthechemblcorpus |