Auto-generated database of semiconductor band gaps using ChemDataExtractor
Large-scale databases of band gap information about semiconductors that are curated from the scientific literature have significant usefulness for computational databases and general semiconductor materials research. This work presents an auto-generated database of 100,236 semiconductor band gap rec...
Main authors: | Dong, Qingyang; Cole, Jacqueline M. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2022 |
Subjects: | Data Descriptor |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9065101/ https://www.ncbi.nlm.nih.gov/pubmed/35504897 http://dx.doi.org/10.1038/s41597-022-01294-6 |
_version_ | 1784699510856876032 |
---|---|
author | Dong, Qingyang Cole, Jacqueline M. |
author_facet | Dong, Qingyang Cole, Jacqueline M. |
author_sort | Dong, Qingyang |
collection | PubMed |
description | Large-scale databases of band gap information about semiconductors that are curated from the scientific literature have significant usefulness for computational databases and general semiconductor materials research. This work presents an auto-generated database of 100,236 semiconductor band gap records, extracted from 128,776 journal articles with their associated temperature information. The database was produced using ChemDataExtractor version 2.0, a ‘chemistry-aware’ software toolkit that uses Natural Language Processing (NLP) and machine-learning methods to extract chemical data from scientific documents. The modified Snowball algorithm of ChemDataExtractor has been extended to incorporate nested models, optimized by hyperparameter analysis, and used together with the default NLP parsers to achieve optimal quality of the database. Evaluation of the database shows a weighted precision of 84% and a weighted recall of 65%. To the best of our knowledge, this is the largest open-source non-computational band gap database to date. Database records are available in CSV, JSON, and MongoDB formats, which are machine readable and can assist data mining and semiconductor materials discovery. |
format | Online Article Text |
id | pubmed-9065101 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-9065101 2022-05-04 Auto-generated database of semiconductor band gaps using ChemDataExtractor Dong, Qingyang Cole, Jacqueline M. Sci Data Data Descriptor Large-scale databases of band gap information about semiconductors that are curated from the scientific literature have significant usefulness for computational databases and general semiconductor materials research. This work presents an auto-generated database of 100,236 semiconductor band gap records, extracted from 128,776 journal articles with their associated temperature information. The database was produced using ChemDataExtractor version 2.0, a ‘chemistry-aware’ software toolkit that uses Natural Language Processing (NLP) and machine-learning methods to extract chemical data from scientific documents. The modified Snowball algorithm of ChemDataExtractor has been extended to incorporate nested models, optimized by hyperparameter analysis, and used together with the default NLP parsers to achieve optimal quality of the database. Evaluation of the database shows a weighted precision of 84% and a weighted recall of 65%. To the best of our knowledge, this is the largest open-source non-computational band gap database to date. Database records are available in CSV, JSON, and MongoDB formats, which are machine readable and can assist data mining and semiconductor materials discovery. Nature Publishing Group UK 2022-05-03 /pmc/articles/PMC9065101/ /pubmed/35504897 http://dx.doi.org/10.1038/s41597-022-01294-6 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Data Descriptor Dong, Qingyang Cole, Jacqueline M. Auto-generated database of semiconductor band gaps using ChemDataExtractor |
title | Auto-generated database of semiconductor band gaps using ChemDataExtractor |
title_full | Auto-generated database of semiconductor band gaps using ChemDataExtractor |
title_fullStr | Auto-generated database of semiconductor band gaps using ChemDataExtractor |
title_full_unstemmed | Auto-generated database of semiconductor band gaps using ChemDataExtractor |
title_short | Auto-generated database of semiconductor band gaps using ChemDataExtractor |
title_sort | auto-generated database of semiconductor band gaps using chemdataextractor |
topic | Data Descriptor |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9065101/ https://www.ncbi.nlm.nih.gov/pubmed/35504897 http://dx.doi.org/10.1038/s41597-022-01294-6 |
work_keys_str_mv | AT dongqingyang autogenerateddatabaseofsemiconductorbandgapsusingchemdataextractor AT colejacquelinem autogenerateddatabaseofsemiconductorbandgapsusingchemdataextractor |