A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing
The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from literature. We used natural language processing methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline...
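Main authors: | Shetty, Pranav; Rajan, Arunkumar Chitteth; Kuenneth, Chris; Gupta, Sonakshi; Panchumarti, Lakshmi Prerana; Holm, Lauren; Zhang, Chao; Ramprasad, Rampi |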
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10073792/ https://www.ncbi.nlm.nih.gov/pubmed/37033291 http://dx.doi.org/10.1038/s41524-023-01003-w |
_version_ | 1785019646176395264 |
---|---|
author | Shetty, Pranav; Rajan, Arunkumar Chitteth; Kuenneth, Chris; Gupta, Sonakshi; Panchumarti, Lakshmi Prerana; Holm, Lauren; Zhang, Chao; Ramprasad, Rampi |
author_facet | Shetty, Pranav; Rajan, Arunkumar Chitteth; Kuenneth, Chris; Gupta, Sonakshi; Panchumarti, Lakshmi Prerana; Holm, Lauren; Zhang, Chao; Ramprasad, Rampi |
author_sort | Shetty, Pranav |
collection | PubMed |
description | The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from literature. We used natural language processing methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, using 2.4 million materials science abstracts, which outperforms other baseline models in three out of five named entity recognition datasets. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells to recover non-trivial insights. The data extracted through our pipeline is made available at polymerscholar.org which can be used to locate material property data recorded in abstracts. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with extracted material property information. |
format | Online Article Text |
id | pubmed-10073792 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-100737922023-04-05 A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing Shetty, Pranav Rajan, Arunkumar Chitteth Kuenneth, Chris Gupta, Sonakshi Panchumarti, Lakshmi Prerana Holm, Lauren Zhang, Chao Ramprasad, Rampi NPJ Comput Mater Article The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from literature. We used natural language processing methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, using 2.4 million materials science abstracts, which outperforms other baseline models in three out of five named entity recognition datasets. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells to recover non-trivial insights. The data extracted through our pipeline is made available at polymerscholar.org which can be used to locate material property data recorded in abstracts. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with extracted material property information. Nature Publishing Group UK 2023-04-05 2023 /pmc/articles/PMC10073792/ /pubmed/37033291 http://dx.doi.org/10.1038/s41524-023-01003-w Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. 
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . |
spellingShingle | Article Shetty, Pranav Rajan, Arunkumar Chitteth Kuenneth, Chris Gupta, Sonakshi Panchumarti, Lakshmi Prerana Holm, Lauren Zhang, Chao Ramprasad, Rampi A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing |
title | A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing |
title_full | A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing |
title_fullStr | A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing |
title_full_unstemmed | A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing |
title_short | A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing |
title_sort | general-purpose material property data extraction pipeline from large polymer corpora using natural language processing |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10073792/ https://www.ncbi.nlm.nih.gov/pubmed/37033291 http://dx.doi.org/10.1038/s41524-023-01003-w |