
A clinical specific BERT developed using a huge Japanese clinical text corpus

Generalized language models that are pre-trained with a large corpus have achieved great performance on natural language tasks. While many pre-trained transformers for English are published, few models are available for Japanese text, especially in clinical medicine. In this work, we demonstrate the...

Full description

Bibliographic Details
Main Authors: Kawazoe, Yoshimasa, Shibata, Daisaku, Shinohara, Emiko, Aramaki, Eiji, Ohe, Kazuhiko
Format: Online Article Text
Language: English
Published: Public Library of Science 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8577751/
https://www.ncbi.nlm.nih.gov/pubmed/34752490
http://dx.doi.org/10.1371/journal.pone.0259763
_version_ 1784596124118548480
author Kawazoe, Yoshimasa
Shibata, Daisaku
Shinohara, Emiko
Aramaki, Eiji
Ohe, Kazuhiko
author_facet Kawazoe, Yoshimasa
Shibata, Daisaku
Shinohara, Emiko
Aramaki, Eiji
Ohe, Kazuhiko
author_sort Kawazoe, Yoshimasa
collection PubMed
description Generalized language models that are pre-trained with a large corpus have achieved great performance on natural language tasks. While many pre-trained transformers for English are published, few models are available for Japanese text, especially in clinical medicine. In this work, we demonstrate the development of a clinical specific BERT model with a huge amount of Japanese clinical text and evaluate it on the NTCIR-13 MedWeb that has fake Twitter messages regarding medical concerns with eight labels. Approximately 120 million clinical texts stored at the University of Tokyo Hospital were used as our dataset. The BERT-base was pre-trained using the entire dataset and a vocabulary including 25,000 tokens. The pre-training was almost saturated at about 4 epochs, and the accuracies of Masked-LM and Next Sentence Prediction were 0.773 and 0.975, respectively. The developed BERT did not show significantly higher performance on the MedWeb task than the other BERT models that were pre-trained with Japanese Wikipedia text. The advantage of pre-training on clinical text may become apparent in more complex tasks on actual clinical text, and such an evaluation set needs to be developed.
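The description above evaluates the pre-trained model on the NTCIR-13 MedWeb task, a multi-label classification problem with eight labels per message. As an illustration only, the sketch below shows how such a fine-tuning setup could look with the Hugging Face transformers library; the checkpoint path, example text, and label assignment are placeholders, not artifacts released with this record, and this is not the authors' code.

```python
# Minimal sketch (assumptions: a local Japanese clinical BERT checkpoint exists
# at MODEL_PATH; MedWeb-style inputs are short tweet-like messages labeled with
# eight binary medical-concern labels).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_PATH = "path/to/japanese-clinical-bert"  # hypothetical checkpoint path
NUM_LABELS = 8                                 # NTCIR-13 MedWeb uses eight labels

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_PATH,
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # BCE-with-logits loss per label
)

texts = ["placeholder tweet-like message"]       # MedWeb-style input text
labels = torch.zeros((len(texts), NUM_LABELS))   # multi-hot target vector
labels[0, 2] = 1.0                               # e.g. one concern marked present

enc = tokenizer(texts, padding=True, truncation=True, max_length=128,
                return_tensors="pt")
out = model(**enc, labels=labels)    # out.loss is BCEWithLogitsLoss over 8 labels

out.loss.backward()                  # one fine-tuning step would follow here
probs = torch.sigmoid(out.logits)    # per-label probabilities at inference time
```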
format Online
Article
Text
id pubmed-8577751
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-8577751 2021-11-10 A clinical specific BERT developed using a huge Japanese clinical text corpus Kawazoe, Yoshimasa Shibata, Daisaku Shinohara, Emiko Aramaki, Eiji Ohe, Kazuhiko PLoS One Research Article Generalized language models that are pre-trained with a large corpus have achieved great performance on natural language tasks. While many pre-trained transformers for English are published, few models are available for Japanese text, especially in clinical medicine. In this work, we demonstrate the development of a clinical specific BERT model with a huge amount of Japanese clinical text and evaluate it on the NTCIR-13 MedWeb that has fake Twitter messages regarding medical concerns with eight labels. Approximately 120 million clinical texts stored at the University of Tokyo Hospital were used as our dataset. The BERT-base was pre-trained using the entire dataset and a vocabulary including 25,000 tokens. The pre-training was almost saturated at about 4 epochs, and the accuracies of Masked-LM and Next Sentence Prediction were 0.773 and 0.975, respectively. The developed BERT did not show significantly higher performance on the MedWeb task than the other BERT models that were pre-trained with Japanese Wikipedia text. The advantage of pre-training on clinical text may become apparent in more complex tasks on actual clinical text, and such an evaluation set needs to be developed. Public Library of Science 2021-11-09 /pmc/articles/PMC8577751/ /pubmed/34752490 http://dx.doi.org/10.1371/journal.pone.0259763 Text en © 2021 Kawazoe et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Kawazoe, Yoshimasa
Shibata, Daisaku
Shinohara, Emiko
Aramaki, Eiji
Ohe, Kazuhiko
A clinical specific BERT developed using a huge Japanese clinical text corpus
title A clinical specific BERT developed using a huge Japanese clinical text corpus
title_full A clinical specific BERT developed using a huge Japanese clinical text corpus
title_fullStr A clinical specific BERT developed using a huge Japanese clinical text corpus
title_full_unstemmed A clinical specific BERT developed using a huge Japanese clinical text corpus
title_short A clinical specific BERT developed using a huge Japanese clinical text corpus
title_sort clinical specific bert developed using a huge japanese clinical text corpus
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8577751/
https://www.ncbi.nlm.nih.gov/pubmed/34752490
http://dx.doi.org/10.1371/journal.pone.0259763
work_keys_str_mv AT kawazoeyoshimasa aclinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus
AT shibatadaisaku aclinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus
AT shinoharaemiko aclinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus
AT aramakieiji aclinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus
AT ohekazuhiko aclinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus
AT kawazoeyoshimasa clinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus
AT shibatadaisaku clinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus
AT shinoharaemiko clinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus
AT aramakieiji clinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus
AT ohekazuhiko clinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus