A benchmark dataset for machine learning in ecotoxicology
The use of machine learning for predicting ecotoxicological outcomes is promising, but underutilized. The curation of data with informative features requires both expertise in machine learning as well as a strong biological and ecotoxicological background, which we consider a barrier of entry for th...
Main authors: | Schür, Christoph; Gasser, Lilian; Perez-Cruz, Fernando; Schirmer, Kristin; Baity-Jesi, Marco |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2023 |
Subjects: | Data Descriptor |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10584858/ https://www.ncbi.nlm.nih.gov/pubmed/37853023 http://dx.doi.org/10.1038/s41597-023-02612-2 |
_version_ | 1785122827124342784 |
---|---|
author | Schür, Christoph Gasser, Lilian Perez-Cruz, Fernando Schirmer, Kristin Baity-Jesi, Marco |
author_facet | Schür, Christoph Gasser, Lilian Perez-Cruz, Fernando Schirmer, Kristin Baity-Jesi, Marco |
author_sort | Schür, Christoph |
collection | PubMed |
description | The use of machine learning for predicting ecotoxicological outcomes is promising, but underutilized. The curation of data with informative features requires both expertise in machine learning as well as a strong biological and ecotoxicological background, which we consider a barrier of entry for this kind of research. Additionally, model performances can only be compared across studies when the same dataset, cleaning, and splittings were used. Therefore, we provide ADORE, an extensive and well-described dataset on acute aquatic toxicity in three relevant taxonomic groups (fish, crustaceans, and algae). The core dataset describes ecotoxicological experiments and is expanded with phylogenetic and species-specific data on the species as well as chemical properties and molecular representations. Apart from challenging other researchers to try and achieve the best model performances across the whole dataset, we propose specific relevant challenges on subsets of the data and include datasets and splittings corresponding to each of these challenges as well as in-depth characterization and discussion of train-test splitting approaches. |
format | Online Article Text |
id | pubmed-10584858 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-10584858 2023-10-20 A benchmark dataset for machine learning in ecotoxicology Schür, Christoph Gasser, Lilian Perez-Cruz, Fernando Schirmer, Kristin Baity-Jesi, Marco Sci Data Data Descriptor The use of machine learning for predicting ecotoxicological outcomes is promising, but underutilized. The curation of data with informative features requires both expertise in machine learning as well as a strong biological and ecotoxicological background, which we consider a barrier of entry for this kind of research. Additionally, model performances can only be compared across studies when the same dataset, cleaning, and splittings were used. Therefore, we provide ADORE, an extensive and well-described dataset on acute aquatic toxicity in three relevant taxonomic groups (fish, crustaceans, and algae). The core dataset describes ecotoxicological experiments and is expanded with phylogenetic and species-specific data on the species as well as chemical properties and molecular representations. Apart from challenging other researchers to try and achieve the best model performances across the whole dataset, we propose specific relevant challenges on subsets of the data and include datasets and splittings corresponding to each of these challenges as well as in-depth characterization and discussion of train-test splitting approaches. Nature Publishing Group UK 2023-10-18 /pmc/articles/PMC10584858/ /pubmed/37853023 http://dx.doi.org/10.1038/s41597-023-02612-2 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Data Descriptor Schür, Christoph Gasser, Lilian Perez-Cruz, Fernando Schirmer, Kristin Baity-Jesi, Marco A benchmark dataset for machine learning in ecotoxicology |
title | A benchmark dataset for machine learning in ecotoxicology |
title_full | A benchmark dataset for machine learning in ecotoxicology |
title_fullStr | A benchmark dataset for machine learning in ecotoxicology |
title_full_unstemmed | A benchmark dataset for machine learning in ecotoxicology |
title_short | A benchmark dataset for machine learning in ecotoxicology |
title_sort | benchmark dataset for machine learning in ecotoxicology |
topic | Data Descriptor |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10584858/ https://www.ncbi.nlm.nih.gov/pubmed/37853023 http://dx.doi.org/10.1038/s41597-023-02612-2 |