Combatting over-specialization bias in growing chemical databases
BACKGROUND: Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers’ expe...
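Main authors: | Dost, Katharina; Pullar-Strecker, Zac; Brydon, Liam; Zhang, Kunyang; Hafner, Jasmin; Riddle, Patricia J.; Wicker, Jörg S. |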
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10197453/ https://www.ncbi.nlm.nih.gov/pubmed/37208694 http://dx.doi.org/10.1186/s13321-023-00716-w |
_version_ | 1785044555485151232 |
---|---|
author | Dost, Katharina Pullar-Strecker, Zac Brydon, Liam Zhang, Kunyang Hafner, Jasmin Riddle, Patricia J. Wicker, Jörg S. |
author_facet | Dost, Katharina Pullar-Strecker, Zac Brydon, Liam Zhang, Kunyang Hafner, Jasmin Riddle, Patricia J. Wicker, Jörg S. |
author_sort | Dost, Katharina |
collection | PubMed |
description | BACKGROUND: Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers’ experience and depend on the collection of past results. In either case, models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. Therefore, consistent use of these predictive models shapes the dataset and causes a continuous specialization that shrinks the applicability domain of all models trained on this dataset in the future and increasingly harms model-based exploration of the space. PROPOSED SOLUTION: In this paper, we propose cancels (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. cancels does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain. RESULTS: An extensive set of experiments on the use case of biodegradation pathway prediction not only reveals that the bias spiral can indeed be observed but also that cancels produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial, as it can not only intervene in the continuous specialization process but also significantly improve a predictor’s performance while reducing the number of required experiments. Overall, we believe that cancels can support researchers in their experimentation process to not only better understand their data and potential flaws, but also to grow the dataset in a sustainable way. All code is available at github.com/KatDost/Cancels. |
format | Online Article Text |
id | pubmed-10197453 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-10197453 2023-05-20 Combatting over-specialization bias in growing chemical databases Dost, Katharina Pullar-Strecker, Zac Brydon, Liam Zhang, Kunyang Hafner, Jasmin Riddle, Patricia J. Wicker, Jörg S. J Cheminform Research BACKGROUND: Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers’ experience and depend on the collection of past results. In either case, models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. Therefore, consistent use of these predictive models shapes the dataset and causes a continuous specialization that shrinks the applicability domain of all models trained on this dataset in the future and increasingly harms model-based exploration of the space. PROPOSED SOLUTION: In this paper, we propose cancels (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. cancels does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain. RESULTS: An extensive set of experiments on the use case of biodegradation pathway prediction not only reveals that the bias spiral can indeed be observed but also that cancels produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial, as it can not only intervene in the continuous specialization process but also significantly improve a predictor’s performance while reducing the number of required experiments. Overall, we believe that cancels can support researchers in their experimentation process to not only better understand their data and potential flaws, but also to grow the dataset in a sustainable way. All code is available at github.com/KatDost/Cancels. Springer International Publishing 2023-05-19 /pmc/articles/PMC10197453/ /pubmed/37208694 http://dx.doi.org/10.1186/s13321-023-00716-w Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Dost, Katharina Pullar-Strecker, Zac Brydon, Liam Zhang, Kunyang Hafner, Jasmin Riddle, Patricia J. Wicker, Jörg S. Combatting over-specialization bias in growing chemical databases |
title | Combatting over-specialization bias in growing chemical databases |
title_full | Combatting over-specialization bias in growing chemical databases |
title_fullStr | Combatting over-specialization bias in growing chemical databases |
title_full_unstemmed | Combatting over-specialization bias in growing chemical databases |
title_short | Combatting over-specialization bias in growing chemical databases |
title_sort | combatting over-specialization bias in growing chemical databases |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10197453/ https://www.ncbi.nlm.nih.gov/pubmed/37208694 http://dx.doi.org/10.1186/s13321-023-00716-w |
work_keys_str_mv | AT dostkatharina combattingoverspecializationbiasingrowingchemicaldatabases AT pullarstreckerzac combattingoverspecializationbiasingrowingchemicaldatabases AT brydonliam combattingoverspecializationbiasingrowingchemicaldatabases AT zhangkunyang combattingoverspecializationbiasingrowingchemicaldatabases AT hafnerjasmin combattingoverspecializationbiasingrowingchemicaldatabases AT riddlepatriciaj combattingoverspecializationbiasingrowingchemicaldatabases AT wickerjorgs combattingoverspecializationbiasingrowingchemicaldatabases |