
Benchmark datasets incorporating diverse tasks, sample sizes, material systems, and data heterogeneity for materials informatics

Materials discovery via machine learning has become an increasingly popular method due to its ability to rapidly predict materials properties in a time-efficient and low-cost manner. However, one limitation in this field is the lack of benchmark datasets, particularly those that encompass the size,...


Bibliographic Details
Main authors:	Henderson, Ashley N., Kauwe, Steven K., Sparks, Taylor D.
Format:	Online Article Text
Language:	English
Published:	Elsevier 2021
Subjects:	Data Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8319566/ https://www.ncbi.nlm.nih.gov/pubmed/34345637 http://dx.doi.org/10.1016/j.dib.2021.107262

author	Henderson, Ashley N. Kauwe, Steven K. Sparks, Taylor D.
author_sort	Henderson, Ashley N.
collection	PubMed
description	Materials discovery via machine learning has become an increasingly popular method due to its ability to rapidly predict materials properties in a time-efficient and low-cost manner. However, one limitation in this field is the lack of benchmark datasets, particularly those that encompass the size, tasks, material systems, and data modalities present in the materials informatics literature. This makes it difficult to identify optimal machine learning model choices including algorithm, model architecture, data splitting, and data featurization for a given task. Here, we attempt to address this lack of benchmark datasets by assembling a unique repository of 50 different datasets for materials properties. The data contain both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. Data were extracted from 16 publications. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits. For datasets with more than 100 values, train-val-test splits were created with either a 5-fold or a 10-fold cross-validation method, depending on what each respective paper did in its study. Datasets with fewer than 100 values had train-test splits created using the Leave-One-Out cross-validation method. These benchmark data can serve as a basis for a more diverse benchmark dataset in the future to further improve their effectiveness in the comparison of machine learning models.
format	Online Article Text
id	pubmed-8319566
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Elsevier
record_format	MEDLINE/PubMed
spelling	pubmed-8319566 2021-08-02 Benchmark datasets incorporating diverse tasks, sample sizes, material systems, and data heterogeneity for materials informatics Henderson, Ashley N. Kauwe, Steven K. Sparks, Taylor D. Data Brief Data Article Elsevier 2021-07-13 /pmc/articles/PMC8319566/ /pubmed/34345637 http://dx.doi.org/10.1016/j.dib.2021.107262 Text en © 2021 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title	Benchmark datasets incorporating diverse tasks, sample sizes, material systems, and data heterogeneity for materials informatics
topic	Data Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8319566/ https://www.ncbi.nlm.nih.gov/pubmed/34345637 http://dx.doi.org/10.1016/j.dib.2021.107262
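
The splitting rule stated in the description field (k-fold cross-validation for the larger datasets, Leave-One-Out for the small ones) can be illustrated with a minimal plain-Python sketch. This is not the authors' code. The function names `k_fold_indices` and `make_splits` are hypothetical, the choice of the next fold as the validation set is an assumption (the record does not say how the validation split was carved out), and a dataset with exactly 100 values is treated here as k-fold, which the record leaves unspecified.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def make_splits(n_samples, k=5, seed=0):
    """Return a list of {"train", "val", "test"} index splits.

    Fewer than 100 samples: Leave-One-Out, train/test only (val stays empty).
    Otherwise: k-fold, where fold i is the test set and fold i+1 is the
    validation set (ASSUMPTION: the source does not specify this choice).
    """
    if n_samples < 100:
        # Leave-One-Out: every sample is held out as the test set exactly once.
        return [
            {
                "train": [j for j in range(n_samples) if j != i],
                "val": [],
                "test": [i],
            }
            for i in range(n_samples)
        ]

    folds = k_fold_indices(n_samples, k, seed)
    splits = []
    for i in range(k):
        val_fold = (i + 1) % k  # hypothetical rule for picking the validation fold
        train = [
            idx
            for j, fold in enumerate(folds)
            if j not in (i, val_fold)
            for idx in fold
        ]
        splits.append({"train": train, "val": folds[val_fold], "test": folds[i]})
    return splits


if __name__ == "__main__":
    # 12 and 6354 are the smallest and largest dataset sizes named in the record.
    for n, k in [(12, 5), (6354, 10)]:
        splits = make_splits(n, k=k)
        first = splits[0]
        print(n, len(splits), len(first["train"]), len(first["val"]), len(first["test"]))
```

Fixing the seed keeps the folds reproducible, which matters when the same splits are meant to be reused to compare different models.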
