
Snowball 2.0: Generic Material Data Parser for ChemDataExtractor

The ever-growing amount of chemical data found in the scientific literature has led to the emergence of data-driven materials discovery. The first step in the pipeline, to automatically extract chemical information from plain text, has been driven by the development of software too...


Bibliographic Details
Main Authors: Dong, Qingyang; Cole, Jacqueline M.
Format: Online Article Text
Language: English
Published: American Chemical Society 2023
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10685441/
https://www.ncbi.nlm.nih.gov/pubmed/37934697
http://dx.doi.org/10.1021/acs.jcim.3c01281
_version_ 1785151631518597120
author Dong, Qingyang
Cole, Jacqueline M.
collection PubMed
description The ever-growing amount of chemical data found in the scientific literature has led to the emergence of data-driven materials discovery. The first step in the pipeline, to automatically extract chemical information from plain text, has been driven by the development of software toolkits such as ChemDataExtractor. Such data extraction processes have created a demand for parsers that efficiently enable text mining. Here, we present Snowball 2.0, a sentence parser based on a semisupervised machine-learning algorithm. It can be used to extract any chemical property without additional training. We validate its precision, recall, and F-score by training and testing a model with sentences of semiconductor band gap information curated from journal articles. Snowball 2.0 builds on two previously developed Snowball algorithms. Evaluation of Snowball 2.0 shows a 15–20% increase in recall with marginally reduced precision over the previous version which has been incorporated into ChemDataExtractor 2.0, giving Snowball 2.0 better performance in most configurations. Snowball 2.0 offers more and better parsing options for ChemDataExtractor, and it is more capable in the pipeline of automated data extraction. Snowball 2.0 also features better generalizability, performance, learning efficiencies, and user-friendliness.
format Online
Article
Text
id pubmed-10685441
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Chemical Society
record_format MEDLINE/PubMed
spelling pubmed-10685441 2023-11-30 Snowball 2.0: Generic Material Data Parser for ChemDataExtractor Dong, Qingyang Cole, Jacqueline M. J Chem Inf Model American Chemical Society 2023-11-07 /pmc/articles/PMC10685441/ /pubmed/37934697 http://dx.doi.org/10.1021/acs.jcim.3c01281 Text en © 2023 The Authors. Published by American Chemical Society. https://creativecommons.org/licenses/by/4.0/ Permits the broadest form of re-use including for commercial purposes, provided that author attribution and integrity are maintained (https://creativecommons.org/licenses/by/4.0/).
title Snowball 2.0: Generic Material Data Parser for ChemDataExtractor