
Bloom filters for molecules

Ultra-large chemical libraries are reaching 10s to 100s of billions of molecules. A challenge for these libraries is to efficiently check if a proposed molecule is present. Here we propose and study Bloom filters for testing if a molecule is present in a set using either string or fingerprint repres...
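The membership test summarized above can be illustrated with a minimal sketch. It assumes the input strings are already canonical SMILES, uses 64-bit FNV-1a (the full description names FNV as the hash) and derives the k bit positions by double hashing. The class name, the seeding of the second hash, and the sizing defaults are illustrative assumptions, not the molbloom API or the authors' implementation.

```python
import math

FNV_OFFSET_64 = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV_PRIME_64 = 0x100000001B3        # standard 64-bit FNV prime
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes, seed: int = FNV_OFFSET_64) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = seed
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_64) & MASK_64
    return h


class BloomFilter:
    """Hypothetical Bloom filter over strings (e.g. canonical SMILES)."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Textbook sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        n, p = expected_items, false_positive_rate
        self.num_bits = max(8, math.ceil(-n * math.log(p) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        data = item.encode("utf-8")
        h1 = fnv1a_64(data)
        # Assumption: a second hash is obtained by re-seeding FNV-1a with h1;
        # it is forced odd so the stride is never zero.
        h2 = fnv1a_64(data, seed=h1) | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


if __name__ == "__main__":
    library = BloomFilter(expected_items=1000, false_positive_rate=0.01)
    for smiles in ["CCO", "c1ccccc1", "CC(=O)O"]:  # assumed already canonical
        library.add(smiles)
    print("CCO" in library)
    print(library.num_bits, "bits,", library.num_hashes, "hash positions per item")
```

A Bloom filter never reports a stored item as absent, so the only error mode is a false positive, whose rate is set by the bits-per-item budget. Canonicalizing SMILES before hashing matters because two different strings for the same molecule would otherwise map to different bit positions.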


Bibliographic Details
Main authors: Medina, Jorge; White, Andrew D.
Format: Online Article Text
Language: English
Published: Springer International Publishing 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10571468/
https://www.ncbi.nlm.nih.gov/pubmed/37828615
http://dx.doi.org/10.1186/s13321-023-00765-1
author Medina, Jorge
White, Andrew D.
author_facet Medina, Jorge
White, Andrew D.
author_sort Medina, Jorge
collection PubMed
description Ultra-large chemical libraries are reaching tens to hundreds of billions of molecules. A challenge for these libraries is to efficiently check whether a proposed molecule is present. Here we propose and study Bloom filters for testing whether a molecule is present in a set, using either string or fingerprint representations. Bloom filters are small enough to hold billions of molecules in just a few GB of memory and can check membership in under a millisecond. We found that string representations can achieve a false positive rate below 1% and require significantly less storage than fingerprints. Bloom filters over canonical SMILES, using the simple FNV (Fowler-Noll-Vo) hash function, provide fast and accurate membership tests with small memory requirements. We provide a general implementation and specific filters for detecting whether a molecule is purchasable, patented, or a natural product according to existing databases at https://github.com/whitead/molbloom.
format Online
Article
Text
id pubmed-10571468
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-10571468 2023-10-14 Bloom filters for molecules Medina, Jorge White, Andrew D. J Cheminform Software Springer International Publishing 2023-10-12 /pmc/articles/PMC10571468/ /pubmed/37828615 http://dx.doi.org/10.1186/s13321-023-00765-1 Text en © The Author(s) 2023
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.
If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Software
Medina, Jorge
White, Andrew D.
Bloom filters for molecules
title Bloom filters for molecules
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10571468/
https://www.ncbi.nlm.nih.gov/pubmed/37828615
http://dx.doi.org/10.1186/s13321-023-00765-1
work_keys_str_mv AT medinajorge bloomfiltersformolecules
AT whiteandrewd bloomfiltersformolecules