Materials information extraction via automatically generated corpus

Bibliographic Details
Main authors: Yan, Rongen; Jiang, Xue; Wang, Weiren; Dang, Depeng; Su, Yanjing
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9279422/
https://www.ncbi.nlm.nih.gov/pubmed/35831367
http://dx.doi.org/10.1038/s41597-022-01492-2
author Yan, Rongen
Jiang, Xue
Wang, Weiren
Dang, Depeng
Su, Yanjing
collection PubMed
description Information Extraction (IE) in Natural Language Processing (NLP) aims to extract structured information from unstructured text to assist a computer in understanding natural language. Machine learning-based IE methods bring more intelligence and possibilities but require an extensive and accurately labeled corpus. In the materials science domain, producing reliable labels is a laborious task that requires the efforts of many professionals. To reduce manual intervention and generate a materials corpus automatically during IE, we propose a semi-supervised IE framework for materials based on an automatically generated corpus. Taking the superalloy data extraction from our previous work as an example, the proposed framework uses Snorkel to automatically label the corpus containing property values. An Ordered Neurons-Long Short-Term Memory (ON-LSTM) network is then trained on the generated corpus to obtain an information extraction model. The experimental results show that the F1-scores for the γ’ solvus temperature, density, and solidus temperature of superalloys are 83.90%, 94.02%, and 89.27%, respectively. Furthermore, we conduct similar experiments on other materials; the results show that the proposed framework is universal in the field of materials.
format Online
Article
Text
id pubmed-9279422
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-9279422 2022-07-15 Materials information extraction via automatically generated corpus Yan, Rongen Jiang, Xue Wang, Weiren Dang, Depeng Su, Yanjing Sci Data Article
Nature Publishing Group UK 2022-07-13 /pmc/articles/PMC9279422/ /pubmed/35831367 http://dx.doi.org/10.1038/s41597-022-01492-2 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title Materials information extraction via automatically generated corpus
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9279422/
https://www.ncbi.nlm.nih.gov/pubmed/35831367
http://dx.doi.org/10.1038/s41597-022-01492-2