Auto-generated database of semiconductor band gaps using ChemDataExtractor

Large-scale databases of band gap information about semiconductors that are curated from the scientific literature have significant usefulness for computational databases and general semiconductor materials research. This work presents an auto-generated database of 100,236 semiconductor band gap rec...

Bibliographic Details
Main authors: Dong, Qingyang; Cole, Jacqueline M.
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9065101/
https://www.ncbi.nlm.nih.gov/pubmed/35504897
http://dx.doi.org/10.1038/s41597-022-01294-6
_version_ 1784699510856876032
author Dong, Qingyang
Cole, Jacqueline M.
author_facet Dong, Qingyang
Cole, Jacqueline M.
author_sort Dong, Qingyang
collection PubMed
description Large-scale databases of band gap information about semiconductors that are curated from the scientific literature have significant usefulness for computational databases and general semiconductor materials research. This work presents an auto-generated database of 100,236 semiconductor band gap records, extracted from 128,776 journal articles with their associated temperature information. The database was produced using ChemDataExtractor version 2.0, a ‘chemistry-aware’ software toolkit that uses Natural Language Processing (NLP) and machine-learning methods to extract chemical data from scientific documents. The modified Snowball algorithm of ChemDataExtractor has been extended to incorporate nested models, optimized by hyperparameter analysis, and used together with the default NLP parsers to achieve optimal quality of the database. Evaluation of the database shows a weighted precision of 84% and a weighted recall of 65%. To the best of our knowledge, this is the largest open-source non-computational band gap database to date. Database records are available in CSV, JSON, and MongoDB formats, which are machine readable and can assist data mining and semiconductor materials discovery.
format Online
Article
Text
id pubmed-9065101
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-9065101 2022-05-04 Auto-generated database of semiconductor band gaps using ChemDataExtractor Dong, Qingyang Cole, Jacqueline M. Sci Data Data Descriptor Large-scale databases of band gap information about semiconductors that are curated from the scientific literature have significant usefulness for computational databases and general semiconductor materials research. This work presents an auto-generated database of 100,236 semiconductor band gap records, extracted from 128,776 journal articles with their associated temperature information. The database was produced using ChemDataExtractor version 2.0, a ‘chemistry-aware’ software toolkit that uses Natural Language Processing (NLP) and machine-learning methods to extract chemical data from scientific documents. The modified Snowball algorithm of ChemDataExtractor has been extended to incorporate nested models, optimized by hyperparameter analysis, and used together with the default NLP parsers to achieve optimal quality of the database. Evaluation of the database shows a weighted precision of 84% and a weighted recall of 65%. To the best of our knowledge, this is the largest open-source non-computational band gap database to date. Database records are available in CSV, JSON, and MongoDB formats, which are machine readable and can assist data mining and semiconductor materials discovery. Nature Publishing Group UK 2022-05-03 /pmc/articles/PMC9065101/ /pubmed/35504897 http://dx.doi.org/10.1038/s41597-022-01294-6 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
spellingShingle Data Descriptor
Dong, Qingyang
Cole, Jacqueline M.
Auto-generated database of semiconductor band gaps using ChemDataExtractor
title Auto-generated database of semiconductor band gaps using ChemDataExtractor
title_full Auto-generated database of semiconductor band gaps using ChemDataExtractor
title_fullStr Auto-generated database of semiconductor band gaps using ChemDataExtractor
title_full_unstemmed Auto-generated database of semiconductor band gaps using ChemDataExtractor
title_short Auto-generated database of semiconductor band gaps using ChemDataExtractor
title_sort auto-generated database of semiconductor band gaps using chemdataextractor
topic Data Descriptor
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9065101/
https://www.ncbi.nlm.nih.gov/pubmed/35504897
http://dx.doi.org/10.1038/s41597-022-01294-6
work_keys_str_mv AT dongqingyang autogenerateddatabaseofsemiconductorbandgapsusingchemdataextractor
AT colejacquelinem autogenerateddatabaseofsemiconductorbandgapsusingchemdataextractor