Bloom filters for molecules
Ultra-large chemical libraries are reaching 10s to 100s of billions of molecules. A challenge for these libraries is to efficiently check if a proposed molecule is present. Here we propose and study Bloom filters for testing if a molecule is present in a set using either string or fingerprint repres...
Main authors: | Medina, Jorge; White, Andrew D. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing, 2023 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10571468/ https://www.ncbi.nlm.nih.gov/pubmed/37828615 http://dx.doi.org/10.1186/s13321-023-00765-1 |
_version_ | 1785120008743944192 |
---|---|
author | Medina, Jorge; White, Andrew D. |
author_facet | Medina, Jorge; White, Andrew D. |
author_sort | Medina, Jorge |
collection | PubMed |
description | Ultra-large chemical libraries are reaching 10s to 100s of billions of molecules. A challenge for these libraries is to efficiently check if a proposed molecule is present. Here we propose and study Bloom filters for testing if a molecule is present in a set using either string or fingerprint representations. Bloom filters are small enough to hold billions of molecules in just a few GB of memory and can check membership in less than a millisecond. We found that string representations can have a false positive rate below 1% and require significantly less storage than fingerprints. Bloom filters over canonical SMILES with the simple FNV (Fowler-Noll-Vo) hash function provide fast and accurate membership tests with small memory requirements. We provide a general implementation and specific filters for detecting if a molecule is purchasable, patented, or a natural product according to existing databases at https://github.com/whitead/molbloom. |
format | Online Article Text |
id | pubmed-10571468 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-10571468 2023-10-14 Bloom filters for molecules Medina, Jorge; White, Andrew D. J Cheminform Software Springer International Publishing 2023-10-12 /pmc/articles/PMC10571468/ /pubmed/37828615 http://dx.doi.org/10.1186/s13321-023-00765-1 Text en © The Author(s) 2023. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Software Medina, Jorge White, Andrew D. Bloom filters for molecules |
title | Bloom filters for molecules |
title_full | Bloom filters for molecules |
title_fullStr | Bloom filters for molecules |
title_full_unstemmed | Bloom filters for molecules |
title_short | Bloom filters for molecules |
title_sort | bloom filters for molecules |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10571468/ https://www.ncbi.nlm.nih.gov/pubmed/37828615 http://dx.doi.org/10.1186/s13321-023-00765-1 |
work_keys_str_mv | AT medinajorge bloomfiltersformolecules AT whiteandrewd bloomfiltersformolecules |
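
The description in this record says that a Bloom filter over canonical SMILES strings, hashed with FNV, can test membership in a molecule set using little memory. The block below is a minimal sketch of that idea in Python, not the molbloom implementation. Everything in it is an assumption made for illustration: the `BloomFilter` class and its parameter names are hypothetical, the 64-bit FNV-1a variant is one member of the FNV family, seeding FNV by XORing into the offset basis is an illustrative choice, and the double-hashing scheme for deriving several bit positions is a common Bloom filter technique that the record does not mention. The input strings are assumed to be canonicalized already; no cheminformatics toolkit is called.

```python
import math

FNV_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3              # standard 64-bit FNV prime
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes, seed: int = 0) -> int:
    """64-bit FNV-1a hash. The seed is XORed into the offset basis
    (an illustrative way to get a second hash, not part of FNV itself)."""
    h = (FNV_OFFSET_BASIS ^ seed) & MASK64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h


class BloomFilter:
    """Hypothetical minimal Bloom filter for strings such as canonical SMILES."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        bits = -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
        self.num_bits = max(8, math.ceil(bits))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: position_i = (h1 + i * h2) mod m.
        data = item.encode("utf-8")
        h1 = fnv1a_64(data)
        h2 = fnv1a_64(data, seed=0x9E3779B97F4A7C15) | 1  # forced odd, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


# Usage: the SMILES strings below are assumed to be canonical already.
library = BloomFilter(expected_items=1000, false_positive_rate=0.01)
for smiles in ["CCO", "c1ccccc1", "CC(=O)O"]:
    library.add(smiles)

print("CCO" in library)  # True: a Bloom filter never gives a false negative
print(library.num_bits, library.num_hashes)
```

With the sizing formula used here, a 1% false positive rate costs about 9.6 bits per molecule, or roughly 1.2 GB per billion molecules, which is in line with the "few GB" figure in the description. A query for a molecule that was never added usually returns `False`, but it can return `True` at about the configured false positive rate.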