Similarity Downselection: Finding the n Most Dissimilar Molecular Conformers for Reference-Free Metabolomics


Bibliographic Details
Main Authors: Nielson, Felicity F., Kay, Bill, Young, Stephen J., Colby, Sean M., Renslow, Ryan S., Metz, Thomas O.
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9864474/
https://www.ncbi.nlm.nih.gov/pubmed/36677030
http://dx.doi.org/10.3390/metabo13010105
author Nielson, Felicity F.
Kay, Bill
Young, Stephen J.
Colby, Sean M.
Renslow, Ryan S.
Metz, Thomas O.
collection PubMed
description Computational methods for creating in silico libraries of molecular descriptors (e.g., collision cross sections) are becoming increasingly prevalent because few authentic reference materials are available for traditional library building. These so-called “reference-free metabolomics” methods require sampling sets of molecular conformers to produce high-accuracy property predictions. Because the subsequent calculations for each conformer are computationally costly, there is a need to sample the most relevant subset and avoid repeating calculations on nearly identical conformers. The goal of this study is to introduce a heuristic method for finding the most dissimilar conformers in a larger population, in order to speed up reference-free calculation methods while maintaining high property prediction accuracy. Finding the set of the n items most dissimilar from each other in a larger population becomes increasingly difficult and computationally expensive as either n or the population size grows. Because a pairwise relationship exists between each item and every other item in the population, finding the set of the n most dissimilar items is different from simply sorting an array of numbers. For instance, one or more items in the most dissimilar set of size n = 4 might not appear in the most dissimilar set of size n = 5. An exact solution would have to search all possible combinations of size n in the population exhaustively. We present open-source software called similarity downselection (SDS), written in Python and freely available on GitHub. SDS implements a heuristic algorithm for quickly finding the approximate set(s) of the n most dissimilar items. We benchmark SDS against a Monte Carlo method, which attempts to find the exact solution through repeated random sampling. We show that, for finding the set of the n most dissimilar conformers, SDS is not only orders of magnitude faster but also more accurate than running Monte Carlo for 1,000,000 iterations, each searching for set sizes n = 3–7 out of a population of 50,000. We also benchmark SDS against the exact solution for small example populations, showing that SDS produces a solution close to the exact solution in these instances. Using theoretical approaches, we also demonstrate the constraints of the greedy algorithm and its efficacy as a ratio to the exact solution.
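The search problem in the description can be illustrated with a small sketch. The code below is not the SDS implementation. It assumes that "most dissimilar" means the subset with the largest sum of pairwise distances; the function names, the toy 2-D points, and the farthest-pair seeding of the greedy heuristic are hypothetical choices made only for illustration. It contrasts an exhaustive search, a generic greedy heuristic, and a random-sampling (Monte Carlo) baseline. The exhaustive search is feasible only for small populations: choosing n = 3 items out of 50,000 already gives roughly 2.08 × 10^13 candidate subsets.

```python
import itertools
import math
import random


def total_dissimilarity(dist, subset):
    """Sum of pairwise distances among the items in `subset` (assumed objective)."""
    return sum(dist[i][j] for i, j in itertools.combinations(subset, 2))


def exact_most_dissimilar(dist, n):
    """Exhaustive search over every subset of size n; only viable for small populations."""
    return max(itertools.combinations(range(len(dist)), n),
               key=lambda s: total_dissimilarity(dist, s))


def greedy_most_dissimilar(dist, n):
    """Generic greedy heuristic (an assumption, not necessarily what SDS does):
    seed with the farthest pair, then repeatedly add the item whose summed
    distance to the already chosen items is largest."""
    size = len(dist)
    if not 2 <= n <= size:
        raise ValueError("n must be between 2 and the population size")
    chosen = list(max(itertools.combinations(range(size), 2),
                      key=lambda pair: dist[pair[0]][pair[1]]))
    while len(chosen) < n:
        candidates = (k for k in range(size) if k not in chosen)
        chosen.append(max(candidates,
                          key=lambda k: sum(dist[k][c] for c in chosen)))
    return tuple(sorted(chosen))


def monte_carlo_most_dissimilar(dist, n, iterations, seed=0):
    """Random-sampling baseline: keep the best of `iterations` random subsets."""
    rng = random.Random(seed)
    best, best_score = None, -math.inf
    for _ in range(iterations):
        subset = tuple(sorted(rng.sample(range(len(dist)), n)))
        score = total_dissimilarity(dist, subset)
        if score > best_score:
            best, best_score = subset, score
    return best


# Toy population: six hypothetical 2-D points standing in for conformers.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (2, 3)]
dist = [[math.dist(p, q) for q in points] for p in points]

exact_set = exact_most_dissimilar(dist, 3)
greedy_set = greedy_most_dissimilar(dist, 3)
mc_set = monte_carlo_most_dissimilar(dist, 3, iterations=50)

for name, subset in [("exact", exact_set), ("greedy", greedy_set), ("monte carlo", mc_set)]:
    print(name, subset, round(total_dissimilarity(dist, subset), 3))
```

By construction, neither the greedy nor the Monte Carlo result can score higher than the exhaustive one. How close they come depends on the data and on the objective, which this sketch does not attempt to characterize.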
format Online
Article
Text
id pubmed-9864474
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9864474 2023-01-22 Similarity Downselection: Finding the n Most Dissimilar Molecular Conformers for Reference-Free Metabolomics Nielson, Felicity F. Kay, Bill Young, Stephen J. Colby, Sean M. Renslow, Ryan S. Metz, Thomas O. Metabolites Article MDPI 2023-01-09 /pmc/articles/PMC9864474/ /pubmed/36677030 http://dx.doi.org/10.3390/metabo13010105 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Similarity Downselection: Finding the n Most Dissimilar Molecular Conformers for Reference-Free Metabolomics
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9864474/
https://www.ncbi.nlm.nih.gov/pubmed/36677030
http://dx.doi.org/10.3390/metabo13010105