
jCompoundMapper: An open source Java library and command-line tool for chemical fingerprints

BACKGROUND: The decomposition of a chemical graph is a convenient approach to encode information of the corresponding organic compound. While several commercial toolkits exist to encode molecules as so-called fingerprints, only a few open source implementations are available. The aim of this work is...


Bibliographic Details
Main authors: Hinselmann, Georg; Rosenbaum, Lars; Jahn, Andreas; Fechner, Nikolas; Zell, Andreas
Format: Text
Language: English
Published: BioMed Central 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3033338/
https://www.ncbi.nlm.nih.gov/pubmed/21219648
http://dx.doi.org/10.1186/1758-2946-3-3
author Hinselmann, Georg
Rosenbaum, Lars
Jahn, Andreas
Fechner, Nikolas
Zell, Andreas
collection PubMed
description BACKGROUND: The decomposition of a chemical graph is a convenient approach to encode information of the corresponding organic compound. While several commercial toolkits exist to encode molecules as so-called fingerprints, only a few open source implementations are available. The aim of this work is to introduce a library for exactly defined molecular decompositions, with a strong focus on the application of these features in machine learning and data mining. It provides several options such as search depth, distance cut-offs, and atom and pharmacophore typing. Furthermore, it provides the functionality to combine, to compare, or to export the fingerprints into several formats.
RESULTS: We provide a Java 1.6 library for the decomposition of chemical graphs based on the open source Chemistry Development Kit toolkit. We reimplemented popular fingerprinting algorithms such as depth-first search fingerprints, extended connectivity fingerprints, autocorrelation fingerprints (e.g. CATS2D), radial fingerprints (e.g. Molprint2D), geometrical Molprint, atom pairs, and pharmacophore fingerprints. We also implemented custom fingerprints such as the all-shortest path fingerprint that only includes the subset of shortest paths from the full set of paths of the depth-first search fingerprint. As an application of jCompoundMapper, we provide a command-line executable binary. We measured the conversion speed and number of features for each encoding and described the composition of the features in detail. The quality of the encodings was tested using the default parametrizations in combination with a support vector machine on the Sutherland QSAR data sets. Additionally, we benchmarked the fingerprint encodings on the large-scale Ames toxicity benchmark using a large-scale linear support vector machine. The results were promising and could often compete with literature results. On the large Ames benchmark, for example, we obtained an AUC ROC performance of 0.87 with a reimplementation of the extended connectivity fingerprint. This result is comparable to the performance achieved by a non-linear support vector machine using state-of-the-art descriptors. On the Sutherland QSAR data sets, the best fingerprint encodings showed a comparable or better performance on 5 of the 8 benchmarks when compared against the results of the best descriptors published in the paper of Sutherland et al.
CONCLUSIONS: jCompoundMapper is a library for chemical graph fingerprints with several tweaking possibilities and exporting options for open source data mining toolkits. The quality of the data mining results, the conversion speed, the LGPL software license, the command-line interface, and the exporters should be useful for many applications in cheminformatics like benchmarks against literature methods, comparison of data mining algorithms, similarity searching, and similarity-based data mining.
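The relationship the abstract describes between the depth-first search fingerprint and the all-shortest path fingerprint can be illustrated with a minimal, self-contained Java sketch. It is not jCompoundMapper's or CDK's API: the class `PathFingerprintSketch`, its method names, the plain adjacency-list graph, and the use of dash-joined element symbols as features are all assumptions made for illustration. Real implementations typically apply richer atom typing and hash features into fixed-length vectors. The sketch enumerates all simple paths up to a search depth, optionally keeps only those whose length equals the shortest distance between their end atoms, and compares two feature sets with the Tanimoto coefficient.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch only; names and data layout are hypothetical, not the library's API. */
class PathFingerprintSketch {

    /**
     * Collects path features by depth-first search from every atom.
     * If shortestOnly is true, a path is kept only when its bond count equals
     * the shortest-path distance between its two end atoms.
     */
    static Set<String> dfsPaths(String[] atoms, int[][] adj, int maxBonds, boolean shortestOnly) {
        int[][] dist = allPairsBfs(adj);
        Set<String> features = new TreeSet<>();
        for (int start = 0; start < atoms.length; start++) {
            extend(start, atoms, adj, maxBonds, shortestOnly, dist,
                    new boolean[atoms.length], new ArrayList<>(), features);
        }
        return features;
    }

    private static void extend(int atom, String[] atoms, int[][] adj, int maxBonds,
                               boolean shortestOnly, int[][] dist, boolean[] visited,
                               List<Integer> path, Set<String> features) {
        visited[atom] = true;
        path.add(atom);
        int bonds = path.size() - 1;
        if (!shortestOnly || dist[path.get(0)][atom] == bonds) {
            features.add(canonical(path, atoms));
        }
        if (bonds < maxBonds) {
            for (int next : adj[atom]) {
                if (!visited[next]) {
                    extend(next, atoms, adj, maxBonds, shortestOnly, dist, visited, path, features);
                }
            }
        }
        // Backtrack so that other branches may reuse this atom.
        path.remove(path.size() - 1);
        visited[atom] = false;
    }

    /** Unweighted all-pairs shortest distances via one breadth-first search per atom. */
    static int[][] allPairsBfs(int[][] adj) {
        int n = adj.length;
        int[][] dist = new int[n][n];
        for (int s = 0; s < n; s++) {
            Arrays.fill(dist[s], -1);
            dist[s][s] = 0;
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(s);
            while (!queue.isEmpty()) {
                int u = queue.poll();
                for (int v : adj[u]) {
                    if (dist[s][v] < 0) {
                        dist[s][v] = dist[s][u] + 1;
                        queue.add(v);
                    }
                }
            }
        }
        return dist;
    }

    /** A path and its reverse map to the same feature: the lexicographically smaller label string. */
    static String canonical(List<Integer> path, String[] atoms) {
        StringBuilder fwd = new StringBuilder();
        StringBuilder rev = new StringBuilder();
        for (int i = 0; i < path.size(); i++) {
            if (i > 0) {
                fwd.append('-');
                rev.append('-');
            }
            fwd.append(atoms[path.get(i)]);
            rev.append(atoms[path.get(path.size() - 1 - i)]);
        }
        String f = fwd.toString();
        String r = rev.toString();
        return f.compareTo(r) <= 0 ? f : r;
    }

    /** Tanimoto coefficient of two feature sets: |A and B| / |A or B|. */
    static double tanimoto(Set<String> a, Set<String> b) {
        Set<String> common = new TreeSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return union == 0 ? 1.0 : (double) common.size() / union;
    }

    public static void main(String[] args) {
        // Toy three-membered ring C-C-O (heavy atoms only), search depth of 2 bonds.
        String[] atoms = {"C", "C", "O"};
        int[][] adj = {{1, 2}, {0, 2}, {0, 1}};
        Set<String> dfs = dfsPaths(atoms, adj, 2, false);
        Set<String> shortest = dfsPaths(atoms, adj, 2, true);
        System.out.println("DFS paths:      " + dfs);
        System.out.println("Shortest paths: " + shortest);
        System.out.println("Tanimoto:       " + tanimoto(dfs, shortest));
    }
}
```

In the ring example every pair of atoms is directly bonded, so the two-bond paths found by the depth-first search are not shortest paths and drop out of the second feature set. This is the sense in which the shortest-path features form a subset of the depth-first search features.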
format Text
id pubmed-3033338
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3033338 2011-02-25 jCompoundMapper: An open source Java library and command-line tool for chemical fingerprints Hinselmann, Georg Rosenbaum, Lars Jahn, Andreas Fechner, Nikolas Zell, Andreas J Cheminform Research Article BioMed Central 2011-01-10 /pmc/articles/PMC3033338/ /pubmed/21219648 http://dx.doi.org/10.1186/1758-2946-3-3 Text en Copyright ©2011 Hinselmann et al; licensee Chemistry Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title jCompoundMapper: An open source Java library and command-line tool for chemical fingerprints
topic Research Article