
A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing

The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from literature. We used natural language processing methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline...


Bibliographic Details
Main Authors: Shetty, Pranav, Rajan, Arunkumar Chitteth, Kuenneth, Chris, Gupta, Sonakshi, Panchumarti, Lakshmi Prerana, Holm, Lauren, Zhang, Chao, Ramprasad, Rampi
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10073792/
https://www.ncbi.nlm.nih.gov/pubmed/37033291
http://dx.doi.org/10.1038/s41524-023-01003-w
author Shetty, Pranav
Rajan, Arunkumar Chitteth
Kuenneth, Chris
Gupta, Sonakshi
Panchumarti, Lakshmi Prerana
Holm, Lauren
Zhang, Chao
Ramprasad, Rampi
collection PubMed
description The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from literature. We used natural language processing methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, using 2.4 million materials science abstracts, which outperforms other baseline models in three out of five named entity recognition datasets. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells to recover non-trivial insights. The data extracted through our pipeline is made available at polymerscholar.org which can be used to locate material property data recorded in abstracts. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with extracted material property information.
format Online
Article
Text
id pubmed-10073792
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-10073792
2023-04-05
NPJ Comput Mater
Article
Nature Publishing Group UK
2023-04-05
2023
/pmc/articles/PMC10073792/
/pubmed/37033291
http://dx.doi.org/10.1038/s41524-023-01003-w
Text
en
© The Author(s) 2023
https://creativecommons.org/licenses/by/4.0/
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing
topic Article