Benchmark datasets incorporating diverse tasks, sample sizes, material systems, and data heterogeneity for materials informatics

Materials discovery via machine learning has become increasingly popular because it can predict materials properties quickly and at low cost. However, one limitation in this field is the lack of benchmark datasets, particularly those that encompass the size,...

Bibliographic Details
Main Authors: Henderson, Ashley N., Kauwe, Steven K., Sparks, Taylor D.
Format: Online Article Text
Language: English
Published: Elsevier 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8319566/
https://www.ncbi.nlm.nih.gov/pubmed/34345637
http://dx.doi.org/10.1016/j.dib.2021.107262
collection PubMed
description Materials discovery via machine learning has become increasingly popular because it can predict materials properties quickly and at low cost. However, one limitation in this field is the lack of benchmark datasets, particularly those that encompass the size, tasks, material systems, and data modalities present in the materials informatics literature. This makes it difficult to identify the optimal machine learning choices for a given task, including algorithm, model architecture, data splitting, and data featurization. Here, we attempt to address this lack of benchmark datasets by assembling a unique repository of 50 different datasets for materials properties. The repository contains both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. Data were extracted from 16 publications. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits. For datasets with more than 100 values, train-val-test splits were created with either a 5-fold or a 10-fold cross-validation method, depending on the method used in the respective source paper. Datasets with fewer than 100 values had train-test splits created using the Leave-One-Out cross-validation method. These benchmark data can serve as a basis for a more diverse benchmark dataset in the future, further improving their effectiveness in the comparison of machine learning models.
format Online
Article
Text
id pubmed-8319566
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-8319566 2021-08-02 Benchmark datasets incorporating diverse tasks, sample sizes, material systems, and data heterogeneity for materials informatics Henderson, Ashley N. Kauwe, Steven K. Sparks, Taylor D. Data Brief Data Article Elsevier 2021-07-13 /pmc/articles/PMC8319566/ /pubmed/34345637 http://dx.doi.org/10.1016/j.dib.2021.107262 Text en © 2021 The Authors. Published by Elsevier Inc.
https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
topic Data Article