Benchmark datasets incorporating diverse tasks, sample sizes, material systems, and data heterogeneity for materials informatics
Materials discovery via machine learning has become an increasingly popular method due to its ability to rapidly predict materials properties in a time-efficient and low-cost manner. However, one limitation in this field is the lack of benchmark datasets, particularly those that encompass the size,...
Main Authors: | Henderson, Ashley N., Kauwe, Steven K., Sparks, Taylor D. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2021 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8319566/ https://www.ncbi.nlm.nih.gov/pubmed/34345637 http://dx.doi.org/10.1016/j.dib.2021.107262 |
_version_ | 1783730475446042624 |
---|---|
author | Henderson, Ashley N. Kauwe, Steven K. Sparks, Taylor D. |
author_facet | Henderson, Ashley N. Kauwe, Steven K. Sparks, Taylor D. |
author_sort | Henderson, Ashley N. |
collection | PubMed |
description | Materials discovery via machine learning has become increasingly popular because it can predict materials properties rapidly and at low cost. However, one limitation in this field is the lack of benchmark datasets, particularly those that encompass the sizes, tasks, material systems, and data modalities present in the materials informatics literature. This makes it difficult to identify optimal machine learning model choices, including algorithm, model architecture, data splitting, and data featurization, for a given task. Here, we attempt to address this lack of benchmark datasets by assembling a unique repository of 50 different datasets for materials properties. The repository contains both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. Data were extracted from 16 publications. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits. For datasets with more than 100 values, train-val-test splits were created with either a 5-fold or 10-fold cross-validation method, depending on which method the source paper used. Datasets with fewer than 100 values had train-test splits created using the Leave-One-Out cross-validation method. These benchmark data can serve as a basis for a more diverse benchmark dataset in the future to further improve their effectiveness in the comparison of machine learning models. |
format | Online Article Text |
id | pubmed-8319566 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-83195662021-08-02 Benchmark datasets incorporating diverse tasks, sample sizes, material systems, and data heterogeneity for materials informatics Henderson, Ashley N. Kauwe, Steven K. Sparks, Taylor D. Data Brief Data Article Materials discovery via machine learning has become an increasingly popular method due to its ability to rapidly predict materials properties in a time-efficient and low-cost manner. However, one limitation in this field is the lack of benchmark datasets, particularly those that encompass the size, tasks, material systems, and data modalities present in the materials informatics literature. This makes it difficult to identify optimal machine learning model choices including algorithm, model architecture, data splitting, and data featurization for a given task. Here, we attempt to address this lack of benchmark datasets by assembling a unique repository of 50 different datasets for materials properties. The data contains both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. Data were extracted from 16 publications. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits. For datasets with more than 100 values, train-val-test splits were created, either with a 5-fold or 10-fold cross-validation method, depending on what each respective paper did in their studies. Datasets with less than 100 values had train-test splits created using the Leave-One-Out cross-validation method. These benchmark data can serve as a basis for a more diverse benchmark dataset in the future to further improve their effectiveness in the comparison of machine learning models. Elsevier 2021-07-13 /pmc/articles/PMC8319566/ /pubmed/34345637 http://dx.doi.org/10.1016/j.dib.2021.107262 Text en © 2021 The Authors. Published by Elsevier Inc. 
https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). |
spellingShingle | Data Article Henderson, Ashley N. Kauwe, Steven K. Sparks, Taylor D. Benchmark datasets incorporating diverse tasks, sample sizes, material systems, and data heterogeneity for materials informatics |
title | Benchmark datasets incorporating diverse tasks, sample sizes, material systems, and data heterogeneity for materials informatics |
title_full | Benchmark datasets incorporating diverse tasks, sample sizes, material systems, and data heterogeneity for materials informatics |
title_fullStr | Benchmark datasets incorporating diverse tasks, sample sizes, material systems, and data heterogeneity for materials informatics |
title_full_unstemmed | Benchmark datasets incorporating diverse tasks, sample sizes, material systems, and data heterogeneity for materials informatics |
title_short | Benchmark datasets incorporating diverse tasks, sample sizes, material systems, and data heterogeneity for materials informatics |
title_sort | benchmark datasets incorporating diverse tasks, sample sizes, material systems, and data heterogeneity for materials informatics |
topic | Data Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8319566/ https://www.ncbi.nlm.nih.gov/pubmed/34345637 http://dx.doi.org/10.1016/j.dib.2021.107262 |
work_keys_str_mv | AT hendersonashleyn benchmarkdatasetsincorporatingdiversetaskssamplesizesmaterialsystemsanddataheterogeneityformaterialsinformatics AT kauwestevenk benchmarkdatasetsincorporatingdiversetaskssamplesizesmaterialsystemsanddataheterogeneityformaterialsinformatics AT sparkstaylord benchmarkdatasetsincorporatingdiversetaskssamplesizesmaterialsystemsanddataheterogeneityformaterialsinformatics |
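
The splitting rule stated in the description (5-fold or 10-fold cross-validation for datasets with more than 100 values, Leave-One-Out for datasets with fewer than 100) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `k_fold_indices` and `make_splits`, the `threshold` and `seed` parameters, and the round-robin fold assignment are all assumptions made for illustration. The description does not say how a dataset with exactly 100 values is handled (the sketch treats it as k-fold), nor how the validation portion is carved out of each fold, so the sketch returns only a training part and a held-out part per fold.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Return a list of (train, held_out) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment, so fold sizes differ by at most one sample.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, held_out))
    return splits


def make_splits(n_samples, n_folds=5, threshold=100, seed=0):
    """Pick the split scheme from the dataset size (hypothetical helper).

    Below `threshold` samples: Leave-One-Out, i.e. k-fold with k = n_samples.
    Otherwise: 5-fold or 10-fold, chosen by the caller.
    Assumption: a dataset with exactly `threshold` samples uses k-fold.
    """
    if n_samples < 2:
        raise ValueError("need at least two samples to split")
    if n_samples < threshold:
        return k_fold_indices(n_samples, n_samples, seed)
    if n_folds not in (5, 10):
        raise ValueError("the description mentions only 5-fold or 10-fold")
    return k_fold_indices(n_samples, n_folds, seed)


if __name__ == "__main__":
    small = make_splits(12)                 # smallest size given in the description
    large = make_splits(6354, n_folds=10)   # largest size given in the description
    print(len(small), len(large))           # prints "12 10"
```

Treating Leave-One-Out as k-fold with one sample per fold keeps both cases in a single code path. Libraries such as scikit-learn provide equivalent splitters (`KFold`, `LeaveOneOut`) that would normally be preferred in practice.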