
A benchmark dataset for machine learning in ecotoxicology

Bibliographic Details
Main authors: Schür, Christoph, Gasser, Lilian, Perez-Cruz, Fernando, Schirmer, Kristin, Baity-Jesi, Marco
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10584858/
https://www.ncbi.nlm.nih.gov/pubmed/37853023
http://dx.doi.org/10.1038/s41597-023-02612-2
author Schür, Christoph
Gasser, Lilian
Perez-Cruz, Fernando
Schirmer, Kristin
Baity-Jesi, Marco
collection PubMed
description The use of machine learning for predicting ecotoxicological outcomes is promising, but underutilized. The curation of data with informative features requires both expertise in machine learning and a strong biological and ecotoxicological background, which we consider a barrier to entry for this kind of research. Additionally, model performances can only be compared across studies when the same dataset, cleaning, and splittings are used. Therefore, we provide ADORE, an extensive and well-described dataset on acute aquatic toxicity in three relevant taxonomic groups (fish, crustaceans, and algae). The core dataset describes ecotoxicological experiments and is expanded with phylogenetic and species-specific data on the species as well as chemical properties and molecular representations. Apart from challenging other researchers to try and achieve the best model performances across the whole dataset, we propose specific relevant challenges on subsets of the data and include datasets and splittings corresponding to each of these challenges as well as in-depth characterization and discussion of train-test splitting approaches.
format Online
Article
Text
id pubmed-10584858
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-10584858 2023-10-20 A benchmark dataset for machine learning in ecotoxicology Schür, Christoph Gasser, Lilian Perez-Cruz, Fernando Schirmer, Kristin Baity-Jesi, Marco Sci Data Data Descriptor Nature Publishing Group UK 2023-10-18 /pmc/articles/PMC10584858/ /pubmed/37853023 http://dx.doi.org/10.1038/s41597-023-02612-2 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title A benchmark dataset for machine learning in ecotoxicology
topic Data Descriptor