Materials information extraction via automatically generated corpus
Information Extraction (IE) in Natural Language Processing (NLP) aims to extract structured information from unstructured text to assist a computer in understanding natural language. Machine learning-based IE methods bring more intelligence and possibilities but require an extensive and accurate lab...
Main authors: | Yan, Rongen; Jiang, Xue; Wang, Weiren; Dang, Depeng; Su, Yanjing |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9279422/ https://www.ncbi.nlm.nih.gov/pubmed/35831367 http://dx.doi.org/10.1038/s41597-022-01492-2 |
_version_ | 1784746394434666496 |
---|---|
author | Yan, Rongen Jiang, Xue Wang, Weiren Dang, Depeng Su, Yanjing |
author_sort | Yan, Rongen |
collection | PubMed |
description | Information Extraction (IE) in Natural Language Processing (NLP) aims to extract structured information from unstructured text to assist a computer in understanding natural language. Machine learning-based IE methods bring more intelligence and possibilities but require an extensive and accurate labeled corpus. In the materials science domain, giving reliable labels is a laborious task that requires the efforts of many professionals. To reduce manual intervention and automatically generate materials corpus during IE, in this work, we propose a semi-supervised IE framework for materials via automatically generated corpus. Taking the superalloy data extraction in our previous work as an example, the proposed framework using Snorkel automatically labels the corpus containing property values. Then Ordered Neurons-Long Short-Term Memory (ON-LSTM) network is adopted to train an information extraction model on the generated corpus. The experimental results show that the F1-score of γ’ solvus temperature, density and solidus temperature of superalloys are 83.90%, 94.02%, 89.27%, respectively. Furthermore, we conduct similar experiments on other materials, the experimental results show that the proposed framework is universal in the field of materials. |
format | Online Article Text |
id | pubmed-9279422 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-9279422 2022-07-15 Materials information extraction via automatically generated corpus Yan, Rongen Jiang, Xue Wang, Weiren Dang, Depeng Su, Yanjing Sci Data Article Information Extraction (IE) in Natural Language Processing (NLP) aims to extract structured information from unstructured text to assist a computer in understanding natural language. Machine learning-based IE methods bring more intelligence and possibilities but require an extensive and accurate labeled corpus. In the materials science domain, giving reliable labels is a laborious task that requires the efforts of many professionals. To reduce manual intervention and automatically generate materials corpus during IE, in this work, we propose a semi-supervised IE framework for materials via automatically generated corpus. Taking the superalloy data extraction in our previous work as an example, the proposed framework using Snorkel automatically labels the corpus containing property values. Then Ordered Neurons-Long Short-Term Memory (ON-LSTM) network is adopted to train an information extraction model on the generated corpus. The experimental results show that the F1-score of γ’ solvus temperature, density and solidus temperature of superalloys are 83.90%, 94.02%, 89.27%, respectively. Furthermore, we conduct similar experiments on other materials, the experimental results show that the proposed framework is universal in the field of materials. 
Nature Publishing Group UK 2022-07-13 /pmc/articles/PMC9279422/ /pubmed/35831367 http://dx.doi.org/10.1038/s41597-022-01492-2 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/. |
title | Materials information extraction via automatically generated corpus |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9279422/ https://www.ncbi.nlm.nih.gov/pubmed/35831367 http://dx.doi.org/10.1038/s41597-022-01492-2 |
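
The description in this record says the corpus is labeled automatically with Snorkel. As a rough illustration of the general idea behind that kind of weak supervision (several noisy labeling functions vote on each sentence and the votes are combined into one label), here is a minimal sketch in plain Python. It deliberately does not use the Snorkel API, and it combines votes by simple majority rather than with Snorkel's label model. The labeling functions, regular expressions, and example sentences are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

# Label conventions assumed for this sketch (hypothetical, not from the paper).
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_has_property_name(sentence: str) -> int:
    """Vote POSITIVE if the sentence names the target property."""
    return POSITIVE if re.search(r"solvus temperature", sentence, re.I) else ABSTAIN


def lf_has_temperature_value(sentence: str) -> int:
    """Vote POSITIVE if a number followed by a temperature unit appears."""
    return POSITIVE if re.search(r"\d+(?:\.\d+)?\s*(?:°C|K)\b", sentence) else ABSTAIN


def lf_no_digits(sentence: str) -> int:
    """Vote NEGATIVE if the sentence contains no digits at all."""
    return ABSTAIN if re.search(r"\d", sentence) else NEGATIVE


LABELING_FUNCTIONS = [lf_has_property_name, lf_has_temperature_value, lf_no_digits]


def majority_vote(sentence: str) -> int:
    """Combine labeling-function votes; abstain on ties or when nobody votes."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN
    ranked = counted.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # equal support for both labels
    return ranked[0][0]


if __name__ == "__main__":
    examples = [
        "The γ' solvus temperature of the alloy is 1150 °C.",
        "Superalloys are widely used in turbine blades.",
        "The sample was annealed for 2 h.",
    ]
    for text in examples:
        print(majority_vote(text), text)
```

Sentences that receive a non-abstain label could then serve as training data for a downstream sequence model, which is the role the description assigns to the ON-LSTM network.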