
Building an annotated corpus for automatic metadata extraction from multilingual journal article references

Bibliographic references containing citation information of academic literature play an important role as a medium connecting earlier and recent studies. As references contain machine-readable metadata such as author name, title, or publication year, they have been widely used in the field of citati...


Bibliographic Details
Main Authors: Choi, Wonjun, Yoon, Hwa-Mook, Hyun, Mi-Hwan, Lee, Hye-Jin, Seol, Jae-Wook, Lee, Kangsan Dajeong, Yoon, Young Joon, Kong, Hyesoo
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9858828/
https://www.ncbi.nlm.nih.gov/pubmed/36662818
http://dx.doi.org/10.1371/journal.pone.0280637
_version_ 1784874201860014080
author Choi, Wonjun
Yoon, Hwa-Mook
Hyun, Mi-Hwan
Lee, Hye-Jin
Seol, Jae-Wook
Lee, Kangsan Dajeong
Yoon, Young Joon
Kong, Hyesoo
collection PubMed
description Bibliographic references containing citation information of academic literature play an important role as a medium connecting earlier and recent studies. As references contain machine-readable metadata such as author name, title, or publication year, they have been widely used in the field of citation information services, including search services for scholarly information and research trend analysis. Many institutions around the world manually extract and continuously accumulate reference metadata to provide various scholarly services. However, the manual collection of reference metadata every year continues to be a burden because of the associated cost and time consumption. With the accumulation of a large volume of academic literature, several tools that automatically extract reference metadata, including GROBID and CERMINE, have been released. However, these tools have some limitations. For example, they are only applicable to references written in English, the types of extractable metadata are limited for each tool, and the performance of the tools is insufficient to replace the manual extraction of reference metadata. Therefore, in this study, we focused on constructing a high-quality corpus to automatically extract metadata from multilingual journal article references. Using our constructed corpus, we trained and evaluated a BERT-based transfer-learning model. Furthermore, we compared the performance of the BERT-based model with that of the existing model, GROBID. Currently, our corpus contains 3,815,987 multilingual references, mainly in English and Korean, with labels for 13 different metadata types. According to our experiment, the BERT-based model trained using our corpus showed excellent performance in extracting metadata not only from journal references written in English but also in other languages, particularly Korean. This corpus is available at http://doi.org/10.23057/47.
format Online
Article
Text
id pubmed-9858828
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9858828 2023-01-21 Building an annotated corpus for automatic metadata extraction from multilingual journal article references Choi, Wonjun Yoon, Hwa-Mook Hyun, Mi-Hwan Lee, Hye-Jin Seol, Jae-Wook Lee, Kangsan Dajeong Yoon, Young Joon Kong, Hyesoo PLoS One Research Article Public Library of Science 2023-01-20 /pmc/articles/PMC9858828/ /pubmed/36662818 http://dx.doi.org/10.1371/journal.pone.0280637 Text en © 2023 Choi et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Building an annotated corpus for automatic metadata extraction from multilingual journal article references
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9858828/
https://www.ncbi.nlm.nih.gov/pubmed/36662818
http://dx.doi.org/10.1371/journal.pone.0280637