
A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text

BACKGROUND: Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks of Chinese text processing. They are usually preliminary steps for lots of Chinese natural language processing (NLP) tasks. There have been a large number of studies on CWS and POS tagging in vario...


Bibliographic Details
Main Authors: Xiong, Ying, Wang, Zhongmin, Jiang, Dehuan, Wang, Xiaolong, Chen, Qingcai, Xu, Hua, Yan, Jun, Tang, Buzhou
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6454584/
https://www.ncbi.nlm.nih.gov/pubmed/30961602
http://dx.doi.org/10.1186/s12911-019-0770-7
_version_ 1783409564816769024
author Xiong, Ying
Wang, Zhongmin
Jiang, Dehuan
Wang, Xiaolong
Chen, Qingcai
Xu, Hua
Yan, Jun
Tang, Buzhou
collection PubMed
description BACKGROUND: Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks of Chinese text processing. They are usually preliminary steps for many Chinese natural language processing (NLP) tasks. There have been a large number of studies on CWS and POS tagging in various domains; however, few studies have addressed CWS and POS tagging in the clinical domain, as it is not easy to determine the granularity of words. METHODS: In this paper, we investigated CWS and POS tagging for Chinese clinical text at a fine-granularity level, and manually annotated a corpus. On the corpus, we compared two state-of-the-art methods, i.e., conditional random fields (CRF) and bidirectional long short-term memory (BiLSTM) with a CRF layer. In order to validate the plausibility of the fine-grained annotation, we further investigated the effect of CWS and POS tagging on Chinese clinical named entity recognition (NER) on another independent corpus. RESULTS: When only CWS was considered, CRF achieved higher precision, recall and F-measure than BiLSTM-CRF. When both CWS and POS tagging were considered, CRF also gained an advantage over BiLSTM-CRF. CRF outperformed BiLSTM-CRF by 0.14% in F-measure on CWS and by 0.34% in F-measure on POS tagging. For NER, the CWS information brought a maximum improvement of 0.34% in F-measure, while the CWS and POS information together brought a maximum improvement of 0.74% in F-measure. CONCLUSIONS: Our proposed fine-grained CWS and POS tagging corpus is reliable and meaningful, as the output of the CWS and POS tagging systems developed on this corpus improved the performance of a Chinese clinical NER system on another independent corpus.
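The abstract reports precision, recall and F-measure for CWS without stating how they were computed. A minimal sketch of one common convention follows, assuming word-level scoring in which a predicted word counts as correct only if both of its character boundaries match the gold segmentation. The function names `word_spans` and `segmentation_prf` and the example sentence are hypothetical illustrations, not taken from the paper or its corpus.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmented sentence to its set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall and F-measure for one sentence.

    Assumption: a predicted word is correct only when its exact span
    also appears in the gold segmentation.
    """
    if "".join(gold) != "".join(pred):
        raise ValueError("gold and predicted segmentations cover different text")
    gold_spans = word_spans(gold)
    pred_spans = word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: the gold segmentation keeps "发热" (fever) as one
# word, while the prediction splits it into two single characters.
gold = ["患者", "无", "发热"]
pred = ["患者", "无", "发", "热"]
p, r, f = segmentation_prf(gold, pred)
print(f"P={p:.4f} R={r:.4f} F={f:.4f}")
```

Under this convention, over-segmenting one word lowers precision more than recall, which is why the granularity of the annotation affects the scores.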
format Online
Article
Text
id pubmed-6454584
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6454584 2019-04-17 A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text Xiong, Ying Wang, Zhongmin Jiang, Dehuan Wang, Xiaolong Chen, Qingcai Xu, Hua Yan, Jun Tang, Buzhou BMC Med Inform Decis Mak Research
BioMed Central 2019-04-09 /pmc/articles/PMC6454584/ /pubmed/30961602 http://dx.doi.org/10.1186/s12911-019-0770-7 Text en © The Author(s). 2019 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text
topic Research