
Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science

Bibliographic Details
Main Authors: Trewartha, Amalie; Walker, Nicholas; Huo, Haoyan; Lee, Sanghoon; Cruse, Kevin; Dagdelen, John; Dunn, Alexander; Persson, Kristin A.; Ceder, Gerbrand; Jain, Anubhav
Format: Online Article (Text)
Language: English
Published: Patterns (N Y), Elsevier, 2022-04-08
License: © 2022 The Authors. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9024010/
https://www.ncbi.nlm.nih.gov/pubmed/35465225
http://dx.doi.org/10.1016/j.patter.2022.100488

Abstract:
A bottleneck in efficiently connecting new materials discoveries to established literature has arisen due to an increase in publications. This problem may be addressed by using named entity recognition (NER) to extract structured, summary-level data from unstructured materials science text. We compare the performance of four NER models on three materials science datasets. The four models comprise a bidirectional long short-term memory (BiLSTM) model and three transformer models (BERT, SciBERT, and MatBERT) with increasing degrees of domain-specific materials science pre-training. MatBERT improves over the other two BERT(BASE)-based models by 1%∼12%, implying that domain-specific pre-training provides measurable advantages. Despite its relative architectural simplicity, the BiLSTM model consistently outperforms BERT, perhaps owing to its domain-specific pre-trained word embeddings. Furthermore, MatBERT and SciBERT outperform the original BERT model to a greater extent in the small-data limit. MatBERT's higher-quality predictions should accelerate the extraction of structured data from the materials science literature.
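
The abstract describes fine-tuning BERT-family encoders for materials-science NER. The sketch below is not the authors' code or exact pipeline; it is a minimal, hypothetical example of token-classification fine-tuning with the Hugging Face transformers API, using bert-base-cased as a stand-in checkpoint (a MatBERT or SciBERT checkpoint path would be substituted if available). The entity labels, toy sentence, and word-level tags are invented for illustration.

# Minimal sketch (assumption: illustrative only, not the paper's code) of
# fine-tuning a BERT-family encoder for NER as token classification.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-MAT", "I-MAT"]   # illustrative IOB tags for a "material" entity
model_name = "bert-base-cased"     # swap in a domain-specific checkpoint here

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

# One toy sentence; real training would iterate over an annotated corpus.
words = ["LiFePO4", "was", "synthesized", "by", "ball", "milling", "."]
word_labels = [1, 0, 0, 0, 0, 0, 0]   # "LiFePO4" tagged B-MAT

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# Align word-level labels to sub-word tokens: label only the first piece of
# each word and mask continuation pieces and special tokens with -100.
aligned, prev = [], None
for wid in enc.word_ids(batch_index=0):
    if wid is None or wid == prev:
        aligned.append(-100)
    else:
        aligned.append(word_labels[wid])
    prev = wid
enc["labels"] = torch.tensor([aligned])

outputs = model(**enc)     # cross-entropy loss over the labeled tokens
outputs.loss.backward()    # backward pass for one illustrative training step
print(float(outputs.loss))

Masking continuation sub-words with -100 keeps the loss defined per word rather than per word piece, a common (though not universal) convention for BERT-style NER fine-tuning; a BiLSTM baseline would instead consume one embedding per whitespace-delimited token.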